reddit.com via Reddit

r/ClaudeAI: Five-Model Cold Code Review Benchmark — Grok, Opus 4.8, Sonnet 4.6, GPT-5.5, and Gemini 3.5 Flash Scored on Bug-Seeded React App With No Context

anthropic openai google xai coding tools benchmarks coding-tools code-review

Summary

A team built 'Budget Harbor', a client-side household budget planner in React and TypeScript, seeded it with intentional bugs from the outset, and left the entire working tree uncommitted. They then asked five AI models (Grok, Claude Opus 4.8, Claude Sonnet 4.6, GPT-5.5, and Gemini 3.5 Flash) to review that working tree cold inside Kilo Code Reviewer, each under identical conditions with no additional context and no external leaderboard information. Community discussion is dissecting the Opus vs. Sonnet delta and how each model handles a bug-dense, uncommitted codebase.