arxiv.org via Reddit

Claude Code tops AI game-building benchmark at 41%

By Alexis Dufresne. Published June 17, 2026 at 18:40 UTC. Updated June 17, 2026 at 19:00 UTC.

Key insights

Claude Code Opus-4.7 led all tested models at 41.46% on 140 Godot game-generation tasks spanning 15 game families.
Core mechanics scored 55.34% for the top model, but content depth dropped to 39.48% and art and presentation to 36.86%.
DeepSeek-V4-Pro scored just 2.15%, revealing a massive capability spread among tested coding agents.

Why this matters

The 41.46% ceiling for the top agent on a structured, multi-component game-generation task signals that coding agents remain far from reliably delivering complex interactive software, which directly constrains near-term commercial deployment of agentic development pipelines. The paper's three-part evaluation framework (Engine Grounding, Artifact Completeness, Interactive Verification) gives AI teams a concrete diagnostic structure to pinpoint exactly where agent pipelines break down beyond single-file code generation. The score spread, from 41.46% for Claude Code Opus-4.7 down to 2.15% for DeepSeek-V4-Pro, shows that capability differences across model families are dramatically amplified on tasks requiring coordinated multi-file, multi-system artifact delivery.

Summary

AI coding agents can't reliably ship a complete game. GameCraft-Bench, a 140-task benchmark from researchers at the Chinese University of Hong Kong (Shenzhen), Tencent's Hunyuan Team, NUS, and SJTU, tests whether agents can generate fully playable Godot games from natural-language specs across 15 game families, verified through replayed gameplay and rubric-guided multimodal judging. Claude Code Opus-4.7 topped the leaderboard at 41.46%. GPT-5.5 reached 39.49%, Kimi-K2.6 scored 30.65%, and most other models fell well below 40%. In short, even the leading agents (Claude Code, GPT-5.5, Kimi-K2.6) can generate code fragments but not reliable, complete games.

- Core mechanics scored highest for Opus-4.7 at 55.34%, while content depth fell to 39.48% and art and presentation to just 36.86%.
- Existing benchmarks (OpenGame-Bench, GameDevBench, WebGameBench) don't satisfy all three of the paper's criteria: Engine Grounding, Artifact Completeness, and Interactive Verification.
- Agents produce recognizable local mechanics but fail to assemble them into complete, coherent interactive systems.

The ceiling isn't specific to games: it reflects a broader failure to reliably coordinate a multi-component software artifact from natural-language specification all the way to execution.
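The gaps behind these figures are easy to check by hand. The short Python sketch below recomputes them using only the scores quoted in this article; the variable names are illustrative and nothing here comes from the paper's own tooling.

```python
# Scores (percent) exactly as quoted in this article.
overall = {
    "Claude Code Opus-4.7": 41.46,
    "GPT-5.5": 39.49,
    "Kimi-K2.6": 30.65,
    "MiniMax-M2.7": 10.95,
    "DeepSeek-V4-Pro": 2.15,
}
opus_dimensions = {
    "core mechanics": 55.34,
    "content depth": 39.48,
    "art and presentation": 36.86,
}

# Rank the models from highest to lowest overall score.
ranked = sorted(overall.items(), key=lambda item: item[1], reverse=True)
top_name, top_score = ranked[0]
second_name, second_score = ranked[1]
bottom_name, bottom_score = ranked[-1]

# Percentage-point gaps between first and second, and first and last.
lead_gap = round(top_score - second_score, 2)
spread = round(top_score - bottom_score, 2)
ratio = round(top_score / bottom_score, 1)

# How far each weaker dimension sits below core mechanics for the top model.
core = opus_dimensions["core mechanics"]
dimension_drop = {
    name: round(core - score, 2)
    for name, score in opus_dimensions.items()
    if name != "core mechanics"
}

print(f"{top_name} leads {second_name} by {lead_gap} points")
print(f"Top-to-bottom spread: {spread} points ({ratio}x)")
for name, drop in dimension_drop.items():
    print(f"{name}: {drop} points below core mechanics")
```

On these numbers the lead over GPT-5.5 is 1.97 points, the spread to DeepSeek-V4-Pro is 39.31 points, and the top model's art and presentation score sits 18.48 points below its core mechanics score.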

Potential risks and opportunities

Risks

Game studios and indie publishers evaluating AI-assisted development pipelines may delay adoption further: the benchmark shows even the top agent scores below 40% on content depth (39.48%) and art and presentation (36.86%), the outputs most visible to end players
Models at the bottom of the leaderboard, MiniMax-M2.7 at 10.95% and DeepSeek-V4-Pro at 2.15%, face competitive disadvantage in enterprise coding-agent procurement if GameCraft-Bench scores enter vendor evaluations
If the multimodal judge's permissiveness meaningfully inflates scores, a stricter human-calibrated re-run could lower the 41.46% top result, weakening current-generation agents' case for any complex multi-component agentic software task beyond games

Opportunities

Anthropic holds a concrete benchmark reference at 41.46% with Claude Code Opus-4.7, nearly two points above GPT-5.5 at 39.49%, giving it a specific data point to cite in developer-tool sales and marketing
Godot's open-source ecosystem gains direct research spotlight as the benchmark's sole engine, creating demand for Godot-specific agent tooling, training corpora, and plugins from AI labs seeking to improve scores
AI evaluation and testing firms can package the three-criterion framework (Engine Grounding, Artifact Completeness, Interactive Verification) as a reusable template for enterprise multi-component agentic software validation beyond games

What we don't know yet

Whether the multimodal judge's noted permissiveness inflates all published scores meaningfully, and what a strict human-calibrated re-evaluation would show at the top tier
Which of the 15 game families (platformer, open-world, horror, etc.) show the largest per-family score variance across models, since aggregate scores mask genre-specific failure modes
Whether agents were evaluated on a single generation pass per task or allowed iterative self-correction loops, which would significantly change interpretation of the 41.46% ceiling
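To see why the per-family question matters, consider a minimal sketch of how an aggregate can hide genre-specific failures. The scores below are invented for illustration only and are not from GameCraft-Bench; `agent_a` and `agent_b` are hypothetical agents, and the three family names are simply the examples mentioned above.

```python
from statistics import mean, pstdev

# HYPOTHETICAL per-family scores (percent), invented for illustration.
# They are NOT results from the paper.
per_family = {
    "agent_a": {"platformer": 40.0, "open-world": 40.0, "horror": 40.0},
    "agent_b": {"platformer": 70.0, "open-world": 40.0, "horror": 10.0},
}


def summarize(scores):
    """Return the aggregate, the spread, and the weakest family for one agent."""
    values = list(scores.values())
    return {
        "mean": mean(values),
        "stdev": pstdev(values),  # population standard deviation across families
        "weakest": min(scores, key=scores.get),
    }


summary = {agent: summarize(scores) for agent, scores in per_family.items()}

for agent, stats in summary.items():
    print(
        f"{agent}: mean={stats['mean']:.2f} "
        f"stdev={stats['stdev']:.2f} weakest={stats['weakest']}"
    )
```

Both hypothetical agents average the same score, yet one is uniform across families while the other collapses on a single genre. A per-family breakdown of the real leaderboard would show which pattern applies.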

Shared on Bluesky by 1 AI expert

arxiv cs.CL @arxiv-cs-cl.bsky.social: Tongxu Luo, Rongsheng Wang, Jiaxi Bi, Chenming Xu, Zhengyang Tang, Jianlong Chen, Juhao Liang, Ke Ji, Shuqi Guo, Yuhao Du, Fan Bu,… →

Originally reported by arxiv.org

Original headline: GameCraft-Bench: arXiv Benchmark Tests Whether AI Coding Agents Can Build Playable Games End-to-End in Godot — Best Model Achieves Only 41% on 140 Tasks Across 15 Game Families