reddit.com via Reddit June 2nd 2026

r/ClaudeAI: Five-Model Cold Code Review Benchmark — Grok, Opus 4.8, Sonnet 4.6, GPT-5.5, and Gemini 3.5 Flash Scored on Bug-Seeded React App With No Context

anthropic openai google xai coding tools benchmarks coding-tools code-review

Summary

A team built 'Budget Harbor', a client-side household budget planner written in React and TypeScript, seeded it with intentional bugs from the start, and left all of the code uncommitted. They then asked five AI models to review the working tree inside Kilo Code Reviewer. Grok, Claude Opus 4.8, Claude Sonnet 4.6, GPT-5.5, and Gemini 3.5 Flash each started cold under identical conditions, with no additional context and no external leaderboard information. Community discussion is dissecting the gap between Opus and Sonnet and how each model handles a bug-dense, uncommitted codebase.

Originally reported by reddit.com

