arxiv.org web signal

Bayesian controller decides when coding agents verify or stop

TL;DR

  • A new arXiv paper recasts coding-agent orchestration as cost-sensitive sequential hypothesis testing managed by a Bayesian controller.
  • The controller decides dynamically whether to gather more evidence, refine the solution, run a verifier, or stop the run.
  • Authors report the approach is most valuable when verification is costly and critics are informative but imperfect, across six generators and nine benchmarks.

A new paper on arXiv, "Bayesian Control for Coding Agents" by Theodore Papamarkou, Vladislav Smirnov, Viktor Mazanov, Artem Vazhentsev, Preslav Nakov, Timothy Baldwin and Artem Shelmanov, does something I find more interesting than another agent scaffold. It treats the question of "what should this agent do next" as a decision-theoretic problem rather than a set of hand-written rules.

The setup is the familiar one: a coding agent pairs an LLM generator with tools, some cheap (diagnostics, linters) and some expensive (verifiers, tests). The usual move is to wire those together with fixed orchestration logic. The authors instead formulate orchestration as cost-sensitive sequential hypothesis testing, with a Bayesian controller that maintains a belief over whether the current solution is correct and decides at each step whether to gather more evidence, refine the candidate, run a verifier, or stop.
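To make the belief-then-decide loop concrete, here is a minimal sketch of my own, not the paper's algorithm. It assumes a single hypothetical critic whose pass rates are known, updates P(correct) with Bayes' rule, and picks one of the four actions with made-up thresholds. The names `Critic`, `update_belief`, `choose_action` and every number in it are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Critic:
    """Hypothetical noisy check (say, a linter) with assumed pass rates."""
    pass_if_correct: float    # P(pass | solution correct)
    pass_if_incorrect: float  # P(pass | solution incorrect)


def update_belief(belief: float, passed: bool, critic: Critic) -> float:
    """Apply Bayes' rule to P(solution correct) after one critic verdict."""
    if passed:
        like_correct = critic.pass_if_correct
        like_incorrect = critic.pass_if_incorrect
    else:
        like_correct = 1.0 - critic.pass_if_correct
        like_incorrect = 1.0 - critic.pass_if_incorrect
    evidence = like_correct * belief + like_incorrect * (1.0 - belief)
    return like_correct * belief / evidence


def choose_action(belief: float, cheap_checks_left: int,
                  stop_at: float = 0.95, refine_below: float = 0.30) -> str:
    """Illustrative threshold policy; the cutoffs are made up, not from the paper."""
    if belief >= stop_at:
        return "stop"
    if belief <= refine_below:
        return "refine"
    if cheap_checks_left > 0:
        return "gather_evidence"
    return "run_verifier"


# Assumed numbers: the linter passes 90% of correct and 30% of incorrect solutions.
linter = Critic(pass_if_correct=0.9, pass_if_incorrect=0.3)
belief = update_belief(0.5, passed=True, critic=linter)  # about 0.75
print(round(belief, 3), choose_action(belief, cheap_checks_left=1))
```

Starting from a 50/50 prior, one passing verdict from this imperfect critic moves the belief to roughly 0.75: enough to keep going, not enough to stop. A real controller would weigh tool costs explicitly rather than rely on fixed cutoffs, which is the part the paper claims to formalize.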

Why that framing is worth attention: the controller's belief state doubles as an interpretable correctness score, and the authors report that it outperforms token probabilities and raw tool success as an uncertainty signal. If that holds outside the paper's benchmarks, it gives product teams a much cleaner number for deciding when to escalate to a human or to a more expensive checker, instead of the log-prob heuristics most stacks use today. The reported sweet spot is when verification is costly and critics are informative but imperfect, which describes most real coding workflows.
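As a rough illustration of how a calibrated correctness score could drive escalation, the sketch below uses a textbook expected-cost rule, not anything taken from the paper. It escalates when the expected loss of shipping an unverified solution exceeds the price of the check. The function name `should_escalate` and the cost figures are hypothetical.

```python
def should_escalate(belief: float, verify_cost: float, bug_cost: float) -> bool:
    """Escalate when the expected loss of skipping the check exceeds its cost.

    belief      : P(solution correct), assumed to be well calibrated
    verify_cost : assumed cost of the expensive checker or human review
    bug_cost    : assumed cost of shipping an incorrect solution
    """
    expected_loss = (1.0 - belief) * bug_cost
    return expected_loss > verify_cost


# With a bug costing 20x the check, the break-even belief is 1 - 1/20 = 0.95.
print(should_escalate(0.90, verify_cost=1.0, bug_cost=20.0))
print(should_escalate(0.97, verify_cost=1.0, bug_cost=20.0))
```

The rule is only as good as the score feeding it: if the belief is miscalibrated, the break-even point moves with it, which is why the calibration claim matters more than the headline accuracy.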

The honest caveat is that the abstract is the abstract. The authors say the approach was tested across six generators and nine coding benchmarks, but I cannot yet confirm the benchmark names, the per-benchmark numbers, the choice of prior, or the overhead of running the controller from what is publicly readable. Calibration claims are the easy thing to oversell and the hard thing to reproduce.

What is worth watching is whether existing agent frameworks pick this up. Wrapping a Bayesian controller around an existing generator-verifier loop does not require retraining the model, which is the kind of detail that decides whether an idea like this stays in papers or shows up in your next agent release.
