marktechpost.com web signal

NVIDIA HORIZON agent hits 100% RTL pass rate via git loops

nvidia agents coding tools ai-research

TL;DR

  • HORIZON reportedly hit 100% pass rate on ChipBench, RTLLM-2.0, Verilog-Eval-v2 and CVDP, up from a 47.8% aggregate first-iteration rate.
  • The agent stages edits inside git worktrees and commits only when an executable acceptance predicate for the RTL evaluator passes.
  • NVIDIA's team says agentic hardware design is not solved, calling out reward hacking and long-turnaround feedback as open problems.

A hardware design agent hitting 100% on every RTL benchmark it was pointed at sounds like the sort of number that should be read carefully, not celebrated. NVIDIA Research's new framework, HORIZON, reportedly does exactly that across ChipBench, RTLLM-2.0, Verilog-Eval-v2, and the CVDP categories, according to MarkTechPost. The catch, which the researchers lead with themselves, is that these results reflect iterative refinement rather than single-turn generation. The aggregate first-iteration pass rate was 47.8%.

What is genuinely interesting is the substrate. HORIZON treats hardware design as repository-level code evolution: the agent stages edits inside git worktrees, invokes an evaluator that can run compilation, simulation, coverage extraction, and testbench checks, then commits only when an acceptance predicate passes. As the NVIDIA writeup puts it, git is "the substrate here, not incidental bookkeeping," and the repository history becomes the experience buffer. That is a different mental model from stuffing a larger context window with more Verilog and hoping.

The convergence pattern is worth reading closely. ChipBench went from 20% at iteration zero to 100% after five convergence iterations. RTLLM-2.0 and Verilog-Eval-v2 needed only two. But one code completion category, CID 002, required 82 iterations and burned 56 million tokens by itself, out of roughly 210 million tokens across all benchmarks. Ninety-one percent of input was cached, which is what makes the loop tractable at all.

The honest caveat is right in the paper. The team explicitly says reward hacking and long-turnaround reward stay open. A pass can mean the design "satisfies the visible harness," not the full specification. The benchmarks are described as controlled proxies for engineering problems whose real feedback loops, PPA-oriented tapeout work, take weeks. What the reporting does not give you is which base model drives HORIZON, or whether any of the 100% numbers include power or area constraints at all.

The forward-looking piece is the pattern more than the score. If a git-worktree loop with an executable evaluator gate works for RTL, it is a template other domains with hard correctness checks will pick up quickly.