theinformation.com web signal

OpenAI engineers say they've more than halved inference costs

TL;DR

  • OpenAI engineers reportedly told colleagues this month they have a method to more than halve inference costs, according to The Information.
  • Applied to ChatGPT's logged-out tier, the change reportedly cut the Nvidia GPUs needed at one point to roughly a couple hundred.
  • The gains come from better utilization of existing servers, not new chips, and the underlying technique was not disclosed in the reporting.

OpenAI engineers reportedly told colleagues earlier this month that they had figured out a way to more than halve the cost of inference, according to The Information. When the optimization was applied to power ChatGPT for visitors without a free or paid account, it cut the number of Nvidia GPUs needed at one point to just a couple hundred, described in the reporting as a shockingly small number.

The reason this matters more than the usual cost-cutting headline is that inference is now where the money actually goes. Training a frontier model is a one-time bill, but every chat reply, every API call, and every agent step runs through inference. If a single software change can take the bill on a slice of free-tier traffic from many GPUs down to a few hundred, the unit economics of running a chatbot at planet scale shift in a way that hardware contracts cannot match on their own.

The honest caveat is that the reporting is thin on the actual mechanism. The piece refers to newly-discovered optimizations but does not specify the exact technique used, and the gains reportedly come from improving utilization of existing servers rather than deploying additional chips. That framing is consistent with smarter batching, better cache reuse, quantization, or routing simpler queries to cheaper models, but the source does not say which. Without the method, it is hard to know how broadly the win generalizes beyond the logged-out ChatGPT path it was first applied to.

What the reporting doesn't give you is whether paying API customers, enterprise tenants, or the larger reasoning models see the same gain, or whether this is a one-time efficiency win on a specific traffic pattern. If it does generalize, the downstream effect is that OpenAI gets to either drop prices, widen free access, or absorb more agentic workloads without buying more chips. That last option is the one worth watching, because it is the cheapest path to protecting margins while the rest of the industry keeps racing to build out fabs.