Claude Code tops AI game-building benchmark at 41%
Key insights
- Claude Code Opus-4.7 led all tested models at 41.46% on 140 Godot game-generation tasks spanning 15 game families.
- Core mechanics scored 55.34% for the top model, but content depth dropped to 39.48% and art and presentation to 36.86%.
- DeepSeek-V4-Pro scored just 2.15%, revealing a massive capability spread among tested coding agents.
Why this matters
The 41.46% ceiling for the top agent on a structured, multi-component game-generation task signals that coding agents remain far from reliably delivering complex interactive software, which directly constrains near-term commercial deployment of agentic development pipelines. The paper's three-part evaluation framework (Engine Grounding, Artifact Completeness, Interactive Verification) gives AI teams a concrete diagnostic structure to pinpoint exactly where agent pipelines break down beyond single-file code generation. The score spread, from 41.46% for Claude Code Opus-4.7 down to 2.15% for DeepSeek-V4-Pro, shows that capability differences across model families are dramatically amplified on tasks requiring coordinated multi-file, multi-system artifact delivery.
Summary
AI coding agents can't reliably ship a complete game. GameCraft-Bench, a 140-task benchmark from researchers at the Chinese University of Hong Kong, Shenzhen, Tencent's Hunyuan Team, NUS, and SJTU, tests whether agents can generate fully playable Godot games from natural-language specs across 15 game families, verified through replayed gameplay and rubric-guided multimodal judging.
Claude Code Opus-4.7 topped the leaderboard at 41.46%. GPT-5.5 reached 39.49%, Kimi-K2.6 scored 30.65%, and most other models fell well below 40%.
In short, even the leading agents (Claude Code, GPT-5.5, Kimi-K2.6) can generate code fragments but not reliable, complete games.
- Core mechanics scored highest for Opus-4.7 at 55.34%, while content depth fell to 39.48% and art and presentation to just 36.86%.
- Existing benchmarks (OpenGame-Bench, GameDevBench, WebGameBench) don't satisfy all three of the paper's criteria: Engine Grounding, Artifact Completeness, and Interactive Verification.
- Agents produce recognizable local mechanics but fail to assemble them into complete, coherent interactive systems.
The ceiling isn't specific to games: it's a broader failure to reliably coordinate a multi-component software artifact from natural-language specification all the way to execution.
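To make the gaps behind these figures concrete, the short Python sketch below recomputes them from the scores quoted in this summary. The variable names are illustrative, and only the scores cited here are included; the full leaderboard is not reproduced.

```python
# Overall scores (percent) quoted in this summary; the full leaderboard
# contains entries that are not listed here.
overall = {
    "Claude Code Opus-4.7": 41.46,
    "GPT-5.5": 39.49,
    "Kimi-K2.6": 30.65,
    "MiniMax-M2.7": 10.95,
    "DeepSeek-V4-Pro": 2.15,
}

# Per-dimension scores for the top model, as quoted in this summary.
opus_dimensions = {
    "core mechanics": 55.34,
    "content depth": 39.48,
    "art and presentation": 36.86,
}

best = max(overall, key=overall.get)
worst = min(overall, key=overall.get)

# Gap between the best and worst quoted scores, in percentage points.
spread = round(overall[best] - overall[worst], 2)

# Lead of the top model over the runner-up, in percentage points.
ranked = sorted(overall.values(), reverse=True)
lead = round(ranked[0] - ranked[1], 2)

# How far each other dimension falls below core mechanics for the top model.
drop_from_core = {
    name: round(opus_dimensions["core mechanics"] - score, 2)
    for name, score in opus_dimensions.items()
    if name != "core mechanics"
}

print(f"{best} leads {worst} by {spread} points")
print(f"Lead over runner-up: {lead} points")
for name, drop in drop_from_core.items():
    print(f"{name}: {drop} points below core mechanics")
```

Run as written, this gives a 39.31-point spread between the top and bottom quoted scores, a 1.97-point lead over GPT-5.5, and drops of 15.86 and 18.48 points from core mechanics to content depth and to art and presentation.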
Potential risks and opportunities
Risks
- Game studios and indie publishers evaluating AI-assisted development pipelines may delay adoption further: the benchmark shows even the top agent falls short on content depth (39.48%) and art and presentation (36.86%), the outputs most visible to end players.
- Models at the bottom of the leaderboard, MiniMax-M2.7 at 10.95% and DeepSeek-V4-Pro at 2.15%, face a competitive disadvantage in enterprise coding-agent procurement if GameCraft-Bench scores enter vendor evaluations.
- If the multimodal judge's permissiveness meaningfully inflates scores, a stricter human-calibrated re-run could lower the 41.46% top result, weakening current-generation agents' case for any complex multi-component agentic software task beyond games.
Opportunities
- Anthropic holds a concrete benchmark reference at 41.46% with Claude Code Opus-4.7, nearly two points above GPT-5.5 at 39.49%, giving it a specific data point to cite in developer-tool sales and marketing.
- Godot's open-source ecosystem gains direct research spotlight as the benchmark's sole engine, creating demand for Godot-specific agent tooling, training corpora, and plugins from AI labs seeking to improve scores.
- AI evaluation and testing firms can package the three-criterion framework (Engine Grounding, Artifact Completeness, Interactive Verification) as a reusable template for enterprise multi-component agentic software validation beyond games.
What we don't know yet
- Whether the multimodal judge's noted permissiveness inflates all published scores meaningfully, and what a strict human-calibrated re-evaluation would show at the top tier.
- Which of the 15 game families (platformer, open-world, horror, etc.) show the largest per-family score variance across models, since aggregate scores mask genre-specific failure modes.
- Whether agents were evaluated on a single generation pass per task or allowed iterative self-correction loops, which would significantly change interpretation of the 41.46% ceiling.
Originally reported by arxiv.org
Original headline: GameCraft-Bench: arXiv Benchmark Tests Whether AI Coding Agents Can Build Playable Games End-to-End in Godot — Best Model Achieves Only 41% on 140 Tasks Across 15 Game Families