github.com via Hacker News July 4th 2026

GPT-5.5 Codex Reasoning Tokens Cluster at 516, Report Says


TL;DR

A Codex issue reports that GPT-5.5 responses land at exactly 516 reasoning tokens far more often than other models' responses do, with echo peaks at 1034 and 1552.
Across 390,195 response records over 865 sessions, GPT-5.5 was 19.3% of responses but 82.0% of exact-516 events, and its exact-516 to ≥516 ratio was about 33.6x higher than the non-GPT-5.5 baseline.
Monthly exact-516 clustering rose from 0.11% in February 2026 to 53.30% in May, then eased to 35.84% in June; mean reasoning tokens fell from 268.1 in February to 106.9 in May.

A developer-filed bug report against OpenAI's Codex CLI has documented a suspicious statistical fingerprint: GPT-5.5 responses appear to bunch at exactly 516 reasoning tokens, with echo peaks at 1034 and 1552. The claim comes from issue #30364 on the openai/codex repo, filed on June 27, 2026 by a user posting as vguptaa45.

The methodology is aggregate rather than mechanistic. The reporter says they analyzed 390,195 response-level token records across 865 sessions from February through late June 2026. In that dataset, GPT-5.5 accounts for 19.3% of responses but 82.0% of the exact-516 events. The exact-516 to ≥516 ratio for GPT-5.5 is about 33.6x higher than the non-GPT-5.5 baseline. The clustering also grew sharply over the period before easing: 0.11% of GPT-5.5 responses in February, 53.30% by May, then 35.84% in June. Over the same window, mean reasoning tokens for the model fell from 268.1 in February to 106.9 in May.
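To make the two headline metrics concrete, here is a minimal sketch of how they could be computed from response-level records. It is not the reporter's script: the (model, reasoning_tokens) record shape, the "gpt-5.5" label, and the function name clustering_stats are all assumptions made for illustration.

```python
def clustering_stats(records, target_model="gpt-5.5", peak=516):
    """Summarize clustering at `peak` for one model versus all others.

    `records` is an iterable of (model, reasoning_tokens) pairs.
    This schema and the model label are hypothetical, not from the report.
    """
    records = list(records)
    target = [t for m, t in records if m == target_model]
    others = [t for m, t in records if m != target_model]

    def exact_over_at_least(tokens):
        # Share of responses at or above the peak that sit exactly on it.
        at_least = sum(1 for t in tokens if t >= peak)
        exact = sum(1 for t in tokens if t == peak)
        return exact / at_least if at_least else 0.0

    exact_target = sum(1 for t in target if t == peak)
    exact_all = exact_target + sum(1 for t in others if t == peak)
    target_ratio = exact_over_at_least(target)
    baseline_ratio = exact_over_at_least(others)

    return {
        # Target model's share of all responses (19.3% in the report).
        "response_share": len(target) / len(records) if records else 0.0,
        # Target model's share of exact-peak events (82.0% in the report).
        "exact_peak_share": exact_target / exact_all if exact_all else 0.0,
        "target_ratio": target_ratio,
        "baseline_ratio": baseline_ratio,
        # How many times larger the target ratio is than the baseline
        # (the report's "about 33.6x" figure is a comparison of this kind).
        "ratio_vs_baseline": (
            target_ratio / baseline_ratio if baseline_ratio else float("inf")
        ),
    }
```

Dividing the two shares gives a separate, simpler measure: 82.0% / 19.3% is roughly 4.2, meaning GPT-5.5 shows up among exact-516 events about four times as often as its share of traffic would predict. That is a different quantity from the 33.6x ratio comparison.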

Practitioners should care because these numbers describe a pattern in Codex telemetry that lines up with a widely shared feeling that Codex quality has drifted. The reporter explicitly connects it to a companion report, #29353, in which a task allegedly returned the wrong answer whenever the run terminated at exactly 516 reasoning tokens. The recommendation to OpenAI is plain: look for a reasoning-budget, routing, truncation, fallback, or scheduler behavior that would explain the fixed cutoffs.

The honest caveat is that this is a single reporter's aggregate over one user's Codex sessions, and vguptaa45 says so directly: "I am not claiming this proves hidden chain-of-thought truncation." The report does not include any OpenAI staff response, a controlled reproduction across independent users, or a mechanistic explanation for why 516, 1034 and 1552 in particular would be sticky values. Those are the things to watch for.

If the finding holds up under other people's data, the win for developers is a concrete lever to argue about. A quantized reasoning ceiling is the kind of behavior you can test for, price against, and route around, which beats the vibes-based 'Codex feels worse this month' complaint that has been circulating.
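As one way to test for this on your own logs, the sketch below flags exact token counts that occur far more often than their immediate neighbors. It is an illustrative heuristic rather than a method from the issue: the function name find_spikes and the window, factor, and min_count thresholds are assumptions you would need to tune.

```python
from collections import Counter


def find_spikes(token_counts, window=10, factor=5.0, min_count=20):
    """Return (value, count) pairs for exact values that tower over neighbors.

    `token_counts` is an iterable of per-response reasoning-token counts.
    All thresholds are hypothetical defaults, not values from the report.
    """
    counts = Counter(token_counts)
    spikes = []
    for value, count in counts.items():
        if count < min_count:
            continue  # too few events to call it a cluster
        # Counts at nearby values, excluding the candidate itself.
        neighbors = [
            counts.get(value + offset, 0)
            for offset in range(-window, window + 1)
            if offset != 0
        ]
        neighbor_mean = sum(neighbors) / len(neighbors)
        # The +1 keeps sparse neighborhoods from flagging on tiny counts.
        if count > factor * (neighbor_mean + 1):
            spikes.append((value, count))
    return sorted(spikes)
```

Run per model and per month, a check like this would surface values such as 516 if the pattern exists in other users' data. The three reported peaks are also evenly spaced (1034 - 516 = 518 and 1552 - 1034 = 518), so looking at the gaps between flagged values is a natural follow-up.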

Originally reported by github.com


Original headline: GPT-5.5 Codex Reasoning Tokens Cluster Hard at Exactly 516 — HN Front Page (246 pts) Amplifies Developer Bug Report Suggesting Hidden Reasoning Budget Behind Perceived Codex Regression