r/ClaudeAI: Five-Model Cold Code Review Benchmark — Grok, Opus 4.8, Sonnet 4.6, GPT-5.5, and Gemini 3.5 Flash Scored on Bug-Seeded React App With No Context
Summary
A team built 'Budget Harbor', a client-side household budget planner in React and TypeScript, seeded it with intentional bugs from the start, and left everything uncommitted. They then asked five AI models to review the working tree cold inside Kilo Code Reviewer: Grok, Claude Opus 4.8, Claude Sonnet 4.6, GPT-5.5, and Gemini 3.5 Flash. Each model was given identical cold-start conditions, with no additional context and no external leaderboard information. Community discussion is dissecting the Opus vs. Sonnet delta and how each model handles bug-dense, uncommitted codebases.
Originally reported by reddit.com