Artificial Intelligence Papers

SPIRAL: Learning to Search and Aggregate Jubayer Ibn Hamid, Ifdita Hasan Orney, Michael Y. Li, Omar Shaikh, Yoonho Lee, Dorsa Sadigh, Chelsea Finn, Noah Goodman https://t.co/CRBpj1Mjhk [𝚌𝚜.𝙰𝙸] https://t.co/kVEHyMHKpK

AI Weekly's analysis →

SPIRAL co-trains three reasoning primitives in one RL framework: sequential chain-of-thought, parallel sampling of traces, and learned aggregation of those traces.
The paper reports up to 11× better scaling efficiency than GRPO, and 15% higher performance, when all three compute primitives are scaled.
Training uses set reinforcement learning to make parallel traces collectively useful, plus standard RL to train the aggregation step itself.
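
As a rough illustration of how the three primitives fit together at inference time, the sketch below samples several independent traces and then combines them. Everything in it is hypothetical: `sample_trace` is a toy stand-in for one sequential chain-of-thought rollout, and a plain majority vote stands in for the learned aggregator. It does not reproduce SPIRAL's training procedure.

```python
import random
from collections import Counter
from typing import Callable, List


def sample_trace(question: str, rng: random.Random) -> str:
    """Toy stand-in for one sequential chain-of-thought rollout (hypothetical).

    Returns the answer "4" about 60% of the time, otherwise "3" or "5".
    """
    return "4" if rng.random() < 0.6 else rng.choice(["3", "5"])


def majority_vote(answers: List[str]) -> str:
    """Placeholder aggregator; the paper trains this step with RL instead."""
    return Counter(answers).most_common(1)[0][0]


def search_and_aggregate(
    question: str,
    n_parallel: int,
    aggregate: Callable[[List[str]], str],
    seed: int = 0,
) -> str:
    """Draw n_parallel independent traces, then reduce them to one answer."""
    rng = random.Random(seed)
    traces = [sample_trace(question, rng) for _ in range(n_parallel)]
    return aggregate(traces)
```

Calling `search_and_aggregate("2+2", 16, majority_vote)` scales the parallel axis; swapping `majority_vote` for a model call is where a learned aggregator would plug in.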

Autodata: An agentic data scientist to create high quality synthetic data Ilia Kulikov, Chenxi Whitehouse, Tianhao Wu, Yixin Nie, Swarnadeep Saha, Eryk Helenowski, Weizhe Yuan, Olga Golovneva, Jack Lanchantin, … https://t.co/iSchw5CkfT [𝚌𝚜.𝙰𝙸 𝚌𝚜.𝙲𝙻 𝚌𝚜.𝙻𝙶] https://t.co/fC2KJEmmyE

AI Weekly's analysis →

Meta researchers introduce Autodata, a method that casts an AI agent as a data scientist iteratively generating and refining synthetic training data.
The practical implementation is called Agentic Self-Instruct, and meta-optimizing the data scientist agent itself produced a larger uplift than static methods.
On legal reasoning tasks, a 4B parameter model trained on agent-made data reportedly beat a 397B parameter baseline.

Bayesian control for coding agents Theodore Papamarkou, Vladislav Smirnov, Viktor Mazanov, Artem Vazhentsev, Preslav Nakov, Timothy Baldwin, Artem Shelmanov https://t.co/1EUIZ7fmTy [𝚌𝚜.𝙰𝙸 𝚌𝚜.𝙲𝙻] https://t.co/5sFFzguwnn

AI Weekly's analysis →

A new arXiv paper recasts coding-agent orchestration as cost-sensitive sequential hypothesis testing managed by a Bayesian controller.
The controller decides dynamically whether to gather more evidence, refine the solution, run a verifier, or stop the run.
Authors report the approach is most valuable when verification is costly and critics are informative but imperfect, across six generators and nine benchmarks.
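
To make the decision loop concrete, here is a minimal sketch of a controller that keeps a belief that the current solution is correct, updates it with Bayes' rule after each critic verdict, and picks one of the four actions. The critic rates (`tpr`, `fpr`), the costs, and the thresholds are all assumed values chosen for illustration, and the decision rule is a simplification rather than the paper's controller.

```python
def update_belief(p_correct: float, critic_says_ok: bool,
                  tpr: float = 0.8, fpr: float = 0.3) -> float:
    """Bayes update for a binary critic.

    tpr: assumed P(critic says ok | solution correct)
    fpr: assumed P(critic says ok | solution wrong)
    """
    like_correct = tpr if critic_says_ok else 1.0 - tpr
    like_wrong = fpr if critic_says_ok else 1.0 - fpr
    numerator = like_correct * p_correct
    return numerator / (numerator + like_wrong * (1.0 - p_correct))


def choose_action(p_correct: float, verify_cost: float,
                  error_cost: float = 10.0,
                  accept_at: float = 0.95, refine_at: float = 0.2) -> str:
    """Pick an action from the current belief (hypothetical thresholds)."""
    if p_correct >= accept_at:
        return "stop"
    if p_correct <= refine_at:
        return "refine"
    # Expected loss of shipping now is error_cost * (1 - p_correct);
    # run the verifier only when it is cheaper than that risk.
    if verify_cost < error_cost * (1.0 - p_correct):
        return "verify"
    return "gather_evidence"
```

With these numbers, a belief of 0.5 leads to `verify` when the verifier costs 1.0 but to `gather_evidence` when it costs 8.0, which mirrors the claim that cheap, imperfect critics matter most when verification is expensive.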

auto-psych: Automating the science of mind using agent-driven theory discovery and experimentation Ben Prystawski, Kushin Mukherjee, Daniel Wurgaft, Linas Nasvytis, Michael Y. Li, Noah D. Goodman, Michael C. Frank https://t.co/U0Bv3bc9yi [𝚌𝚜.𝙰𝙸] https://t.co/S9vC7tD3F3

AI Weekly's analysis →

Auto-psych uses nested loops: an inner loop generates probabilistic cognitive models, and an outer loop designs and runs online human experiments.
In three independent human experiments, the system's discovered theories fit the data better than theories drawn from the scientific literature.
The benchmark task was a classic cognitive psychology problem about how people perceive randomness in coin flip sequences.

Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification Yunhao Feng, Ruixiao Lin, Ming Wen, Qinqin He, Yanming Guo, Yifan Ding, Yutao Wu, Jialuo Chen, Yunhao Chen, Xiaohu Du, Jianan Ma, Zixing Chen, … https://t.co/zQzYXScMTq [𝚌𝚜.𝙰𝙸] https://t.…

AI Weekly's analysis →

The Vera framework reports average attack success rates of 93.9% against four production agent frameworks under multi-channel attacks.
Vera-Bench ships 1,600 executable safety cases spanning 124 risk categories, covering OpenClaw, Hermes, Codex, and Claude Code.
Verifiers judge outcomes using environment state and tool-call evidence rather than the agent's own self-report of what happened.
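
The idea of evidence-grounded verification can be sketched in a few lines: the verdict is computed from the recorded tool calls and the final environment state, and the agent's own account is never consulted. The `Episode` structure, the tool name `read_file`, the `outbox` key, and the secret path are all hypothetical and are not taken from Vera-Bench.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Episode:
    self_report: str                                  # what the agent claims happened
    tool_calls: List[Dict[str, str]] = field(default_factory=list)
    env_state: Dict[str, str] = field(default_factory=dict)


def attack_succeeded(ep: Episode,
                     secret_path: str = "/home/user/.ssh/id_rsa") -> bool:
    """Hypothetical check: was the secret read, or did key material reach the outbox?

    Note that ep.self_report is deliberately ignored.
    """
    read_secret = any(
        call.get("tool") == "read_file" and call.get("path") == secret_path
        for call in ep.tool_calls
    )
    leaked = "PRIVATE KEY" in ep.env_state.get("outbox", "")
    return read_secret or leaked
```

An episode whose self-report says "I refused the request" still counts as a successful attack if the tool-call log shows the secret file was read.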

Subliminal Clocks: Latent Time Modelling in Diffusion Language Models Maximo Rulli, Thomas Fontanari, Simone Petruzzi, Federico Alvetreti, Giorgio Strano, Donato Crisostomi, Giorgos Nikolaou, Tommaso Mencattini, Andrea Santilli, … https://t.co/jliEcX8tuE [𝚌𝚜.𝙰𝙸 𝚌𝚜.𝙲𝙻] https://…

AI Weekly's analysis →

Diffusion language models lack explicit timestep conditioning yet still encode denoising progress in their residual streams, decodable by probes across layers.
Steering the model along a low-dimensional subspace tied to the inferred timestep produces predictable shifts in output confidence and entropy.
The latent time representation shows structured, interpretable geometry in activation space, per researchers at Sapienza University of Rome and EPFL.

Discrete Diffusion Language Models for Interactive Radiology Report Drafting Max Van Puyvelde, Halil Ibrahim Gulluk, Wim Van Criekinge, Olivier Gevaert https://t.co/sCm225Db0x [𝚌𝚜.𝙰𝙸 𝚌𝚜.𝙻𝙶] https://t.co/Uag6VCLTZv

AI Weekly's analysis →

DiffusionGemma-26B matches or exceeds its same-size autoregressive sibling Gemma-4-26B on every medical VQA dataset the authors tested.
Decoding is reported at 3.5-4.4x faster than the AR baseline, with 3.8B active parameters after LoRA fine-tuning of the MoE model.
Bidirectional denoising gives the model any-order infill, so a radiologist can fix report fragments and have the model fill between them.

Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination Subhadeep Pal, Shashwat Sourav, Tirthankar Ghosal, Markus J. Buehler https://t.co/iLYMDMaIY4 [𝚌𝚜.𝙰𝙸 𝚌𝚘𝚗𝚍-𝚖𝚊𝚝.𝚖𝚝𝚛𝚕-𝚜𝚌𝚒 𝚌𝚜.𝙲𝙻 𝚌𝚜.𝙻𝙶] https://t.co/SMlyxB2nbf

AI Weekly's analysis →

Graph-PRefLexOR uses Group Relative Policy Optimization to split reasoning into mechanism exploration, graph construction, pattern extraction, and hypothesis synthesis.
On 100 open-ended materials science and mechanics questions, the system reports 40-65% improvements over base models, with the largest gains in reasoning traceability.
Output embeddings show roughly 2-3x greater semantic diversity than baselines, which the authors credit to long-range recombination inside a bounded semantic space.

Two AI Metrics Diverged: Will it Make All the Difference? Alex Fogelson, Zachary A. Brown, Hans Gundlach, Jayson Lynch, Neil Thompson https://t.co/lV6uJDlT4w [𝚌𝚜.𝙰𝙸] https://t.co/Qef4NNwKJB

AI Weekly's analysis →

A new arXiv paper argues that whether small AI models eventually catch up to frontier systems depends entirely on which performance metric you pick.
The authors show that validation loss gaps shrink, but on other, unbounded metrics frontier models keep extending their lead as compute grows.
Bounded and unbounded metrics can suggest opposing policy responses for capabilities like software engineering, synthetic biology, or rhetorical persuasiveness.
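
A small numerical sketch shows how the two kinds of metric can disagree. It assumes power-law forms (loss decaying toward a floor, an unbounded capability score growing with compute) and a frontier model that always has a fixed multiple of the small model's compute. These functional forms and constants are illustrative assumptions, not the paper's fitted curves.

```python
from typing import Tuple


def loss(compute: float, a: float = 10.0, alpha: float = 0.3,
         floor: float = 1.0) -> float:
    """Bounded-below metric: decays toward an irreducible floor (assumed form)."""
    return floor + a * compute ** (-alpha)


def capability(compute: float, beta: float = 0.5) -> float:
    """Unbounded metric: grows without limit with compute (assumed form)."""
    return compute ** beta


def gaps(small_compute: float, ratio: float = 100.0) -> Tuple[float, float]:
    """Frontier-minus-small gap on each metric at a fixed compute ratio."""
    frontier = small_compute * ratio
    loss_gap = loss(small_compute) - loss(frontier)
    capability_gap = capability(frontier) - capability(small_compute)
    return loss_gap, capability_gap
```

Under these assumptions, comparing `gaps(1e3)` with `gaps(1e6)` gives a smaller loss gap but a larger capability gap at the higher compute level, so the same pair of models looks like it is converging on one metric and diverging on the other.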

Which Tokens Matter? Adaptive Token Selection for RLVR with the Relative Surprisal Index Outongyi Lv, Yanzhao Zheng, Yuanwei Zhang, Zhenghao Huang, Xingjun Wang, Baohua Dong, Hangcheng Zhu, Yingda Chen https://t.co/GJb8WXUeGl [𝚌𝚜.𝙰𝙸] https://t.co/vSj7syoWEx

AI Weekly's analysis →

The Relative Surprisal Index combines a token's entropy with its selected probability to decide which tokens drive RLVR updates.
RSI-S filtering reportedly beat baseline GRPO on Qwen2.5-1.5B, 3B and 7B by 2.10 to 3.30 points on AIME and AMC math benchmarks.
Response lengths also fell by 108 to 265 tokens across the tested Qwen2.5 sizes, suggesting shorter outputs alongside the accuracy gains.
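
The summary does not give the index's formula, so the sketch below is only an illustrative guess at the general shape of such a filter: it scores each position by the sampled token's surprisal divided by the entropy of the distribution it was drawn from, and keeps positions above a threshold. The name `surprisal_ratio`, the ratio itself, and the threshold are assumptions and should not be read as the paper's RSI or RSI-S definition.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def surprisal_ratio(probs: Sequence[float], selected: int,
                    eps: float = 1e-12) -> float:
    """Illustrative score: surprisal of the sampled token relative to entropy."""
    return -math.log(probs[selected]) / (entropy(probs) + eps)


def select_tokens(step_probs: Sequence[Sequence[float]],
                  chosen: Sequence[int], threshold: float = 1.0) -> List[int]:
    """Positions whose score meets the threshold (hypothetical filter)."""
    return [
        t for t, (probs, k) in enumerate(zip(step_probs, chosen))
        if surprisal_ratio(probs, k) >= threshold
    ]
```

In this toy version, a position where the model sampled a low-probability token from a confident distribution scores high and is kept, while a position where it sampled the expected token scores low and is dropped from the update.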

CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes Yuchen Huang, Xiang Li, Zhenqing Ling, Sijia Li, Qianli Shen, Daoyuan Chen, Yi R. Fung, Yaliang Li https://t.co/o7ndg7FcHI [𝚌𝚜.𝙰𝙸 𝚌𝚜.𝙲𝙻] https://t.co/F25hfC03IN

AI Weekly's analysis →

CDR-Bench evaluates LLMs on 3,462 data refinement tasks spanning four real-world domains and 29 distinct operators, with deterministic reference outputs enabling exact scoring.
Across more than ten state-of-the-art models, compositional performance degrades sharply and order-sensitive recipe success collapses.
The authors conclude current LLMs lack the procedural faithfulness required for reliable compositional data refinement.
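
Why order sensitivity matters, and why deterministic references allow exact scoring, can be seen in a toy recipe. The two operators below are hypothetical examples, not operators from the benchmark: applying them in a different order yields a different output, so only the faithful order matches the reference.

```python
from typing import Callable, List

Operator = Callable[[List[str]], List[str]]


def deduplicate(rows: List[str]) -> List[str]:
    """Drop repeated rows, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept


def head_2(rows: List[str]) -> List[str]:
    """Keep only the first two rows."""
    return rows[:2]


def run_recipe(rows: List[str], recipe: List[Operator]) -> List[str]:
    """Apply the operators strictly in the order given."""
    for op in recipe:
        rows = op(rows)
    return rows


def exact_match(output: List[str], reference: List[str]) -> bool:
    """Deterministic operators make an equality check a valid score."""
    return output == reference
```

For the input `["a", "a", "b", "c"]`, deduplicating and then truncating gives `["a", "b"]`, while truncating first gives `["a"]`, so a model that swaps the steps fails the exact-match check.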

AI-Assisted Discovery of Convex Relaxations via Dual Agents Sungyoon Kim, Mert Pilanci https://t.co/YvTTbH9LBq [𝚌𝚜.𝙰𝙸] https://t.co/VhXAAjoKhA

AI Weekly's analysis →

Sungyoon Kim and Mert Pilanci pair a coding agent that proposes constraints with a theory agent that verifies proposals and searches for counterexamples.
The system reports a tighter first autocorrelation bound (1.28 to 1.2937) and a tighter Erdős minimum-overlap bound (0.379005 to 0.37912).
Every reported bound is certified using an explicit dual-feasible point validated through interval arithmetic, not just an empirical estimate.
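
The certification idea can be illustrated on a toy linear program, assuming the form min c·x subject to A x ≥ b, x ≥ 0. By weak duality, any y ≥ 0 with Aᵀy ≤ c proves the optimum is at least b·y. The sketch below checks those inequalities with a minimal outward-rounded interval type, so floating-point rounding cannot produce a false certificate. The `Interval` class and the LP setting are illustrative assumptions (the paper's relaxations and tooling are not described in this summary), and `math.nextafter` requires Python 3.9 or later.

```python
import math
from typing import List, Optional, Sequence


class Interval:
    """Closed interval [lo, hi], widened outward after every operation."""

    def __init__(self, lo: float, hi: Optional[float] = None) -> None:
        self.lo = lo
        self.hi = lo if hi is None else hi

    @staticmethod
    def _widen(lo: float, hi: float) -> "Interval":
        return Interval(math.nextafter(lo, -math.inf),
                        math.nextafter(hi, math.inf))

    def __add__(self, other: "Interval") -> "Interval":
        return Interval._widen(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other: "Interval") -> "Interval":
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval._widen(min(products), max(products))


def dot(xs: Sequence[float], ys: Sequence[float]) -> Interval:
    """Interval enclosure of the dot product of two float vectors."""
    total = Interval(0.0)
    for x, y in zip(xs, ys):
        total = total + Interval(x) * Interval(y)
    return total


def certified_lower_bound(A: List[List[float]], b: List[float],
                          c: List[float], y: List[float]) -> Optional[float]:
    """Rigorous lower bound on the LP optimum, or None if y is not provably feasible."""
    if any(yi < 0 for yi in y):
        return None
    for j in range(len(c)):
        column = [row[j] for row in A]
        if dot(column, y).hi > c[j]:   # cannot prove (A^T y)_j <= c_j
            return None
    return dot(b, y).lo                # lower end of the enclosure of b.y
```

For A = [[1.0, 1.0]], b = [1.0], c = [1.0, 2.0], the dual point y = [0.9] yields a certified bound just under 0.9 (the true optimum is 1), while y = [1.5] is rejected. Outward rounding is conservative by design: a point sitting exactly on a constraint boundary may also be rejected.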

Articles & links