paper web signal

Input Prompt Compression Backfires on Cost and Accuracy

TL;DR

  • Input prompt compression raises net API cost by roughly 1.15x on average, reaching 1.8x in worst cases across five benchmarks.
  • Output compression works in the opposite direction, cutting realized costs 1.4-2.4x per model and up to 3x in the best case.
  • Roughly half of non-reasoning model outputs stay technically correct under input compression but diverge semantically from baseline responses.

The conventional wisdom that shorter user prompts cut API inference costs has been put to a direct test, and the results cut against a widely promoted optimization technique. A paper submitted June 23, 2026 by Morayo Danielle Adeyemi, Ryan A. Rossi, and Franck Dernoncourt evaluates the "talk short, drop grammar, save token" approach through a two-channel protocol they call CAVEWOMAN, covering eight models across five datasets at five compression levels.

The core finding turns on a distinction most optimization guides collapse. According to the paper on arxiv, compressing what the model says back is effective: output compression reduces realized costs by 1.4-2.4x per model, reaching up to 3x in the best case. Compressing what you say to the model — the user prompt — does the reverse. Input compression raises net cost by roughly 1.15x on average across the five benchmarks, with worst cases hitting 1.8x and climbing to 2.7x under stronger compression. The apparent mechanism is compensatory: models respond with longer outputs when given clipped inputs, erasing any savings from the shorter prompt and then some.

The accuracy picture compounds the problem. Under heavy input compression, correctness collapses while verbosity persists. Roughly half of non-reasoning model generations remain technically correct but diverge semantically from what the model would have produced with a normal prompt, and that divergence survived length-controlled re-scoring and multiple-comparisons correction.

The honest caveat is scope. Eight models and five datasets is a real experiment, but whether the finding holds in specialized domains like code generation or medical question-answering is not addressed. The paper notes that reasoning models appear to behave differently from the non-reasoning cohort, without detailing how — a thread worth watching as that model category grows.

For teams that have already wired input compression into production pipelines, this is worth auditing before the next billing cycle. The actionable move is to redirect compression effort to the output side, instructing the model to be concise in its responses rather than stripping user prompts to caveman syntax. That direction has the data behind it.

Shared on Bluesky by 2 AI experts