
arXiv paper: low-bit quantization inflates reasoning tokens

TL;DR

  • INT4/INT3 quantized reasoning models generate longer thought chains even when they still answer correctly, offsetting the per-token speedup that low-bit inference is supposed to deliver.
  • The paper introduces a 'CoT Token Inflation Ratio' measured across mathematical reasoning, code generation, scientific question answering, and agentic tool-use benchmarks.
  • Quantization-aware training was the most promising mitigation, while prompting and decoding-time sampling produced inconsistent results in the authors' hands.

A new arXiv preprint makes a point that anyone quietly saving money by serving reasoning models at INT4 or INT3 should probably read. The claim is not that quantized reasoning models get the wrong answer more often. It is that even when they still get the right answer, they take more tokens to get there.

The paper, 'Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models', argues that low-bit post-training quantization can introduce a hidden test-time compute cost. Quantized reasoning models reportedly generate longer chains of thought even when they still answer correctly, and those extra tokens offset the per-token speedup that dropping precision was meant to buy. The authors propose a 'CoT Token Inflation Ratio' to make the effect legible, and they measure it across mathematical reasoning, code generation, scientific question answering, and agentic tool-use benchmarks.
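The abstract does not spell out how the ratio is computed, so the following is only a minimal sketch under one plausible reading: mean reasoning tokens of the quantized model divided by mean reasoning tokens of the full-precision baseline, restricted to problems both models answer correctly. The function name, the input layout, and the sample numbers are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def cot_token_inflation_ratio(baseline, quantized):
    """Assumed definition, not the paper's confirmed formula.

    Each argument maps problem id -> (is_correct, reasoning_token_count).
    """
    # Compare lengths only where both models are correct, so the ratio
    # reflects extra tokens at equal outcome rather than failed attempts.
    shared = [
        pid for pid in baseline
        if pid in quantized and baseline[pid][0] and quantized[pid][0]
    ]
    if not shared:
        raise ValueError("no problems answered correctly by both models")
    base_tokens = mean(baseline[pid][1] for pid in shared)
    quant_tokens = mean(quantized[pid][1] for pid in shared)
    return quant_tokens / base_tokens


# Made-up illustrative data: three problems, token counts invented.
baseline = {"p1": (True, 400), "p2": (True, 600), "p3": (False, 900)}
quantized = {"p1": (True, 500), "p2": (True, 750), "p3": (True, 1200)}
ratio = cot_token_inflation_ratio(baseline, quantized)
print(f"inflation ratio on shared-correct problems: {ratio:.2f}")
```

Problem p3 is excluded here because the baseline got it wrong. A ratio above 1 under this definition would mean the quantized model spends more reasoning tokens to reach the same correct answers.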

Why this matters if you are running inference: the standard mental model for quantization is that you trade a small accuracy hit for a real throughput and cost win. The paper's finding is that the accuracy side of that trade often holds up, but the cost win is smaller than per-token throughput numbers alone suggest, because the model now produces more intermediate steps and, per the paper, some semantic repetition inside its reasoning trace.
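The arithmetic behind that point can be sketched as follows. This assumes a simple model in which response latency is token count times time per token, so a per-token speedup S combined with a token inflation ratio R yields an end-to-end speedup of S / R. The function name and the numbers are hypothetical placeholders, not figures from the paper.

```python
def effective_speedup(per_token_speedup, inflation_ratio):
    """End-to-end speedup per response under a tokens x time-per-token model.

    Both inputs are ratios relative to the full-precision baseline.
    """
    if per_token_speedup <= 0 or inflation_ratio <= 0:
        raise ValueError("both ratios must be positive")
    return per_token_speedup / inflation_ratio


# Illustrative values only: a 1.6x per-token speedup with 1.25x more tokens.
net = effective_speedup(per_token_speedup=1.6, inflation_ratio=1.25)
print(f"net speedup per response: {net:.2f}x")

# Break-even case: inflation equal to the per-token speedup cancels the gain.
print(effective_speedup(per_token_speedup=1.6, inflation_ratio=1.6))
```

Under this simplified model, any inflation ratio at or above the per-token speedup leaves no net gain per response, which is why throughput figures alone can overstate the savings.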

On mitigations, the paper is candid that the problem is not solved. Prompting and decoding-time sampling gave inconsistent results in the authors' hands, while quantization-aware training was the most promising route for reducing both degradation and inflation. The caveat is that the abstract does not give us the size of the inflation ratio, the specific model families tested, or how much of the extra length is genuine reasoning versus repetition, so treat the specifics as reported, not settled. The part worth watching is the paper's core recommendation: that reasoning-token usage should be reported alongside accuracy when evaluating quantized reasoning models. If that becomes an eval-suite default, teams picking checkpoints for production will get a much cleaner view of the actual cost curve.
