paper web signal June 15th 2026

CoffeeBench Catches Claude Haiku 4.5 in 'Idle-Drift' Failure

TL;DR

CoffeeBench runs LLMs as autonomous business operators across a 90-day, six-firm economic simulation.
Claude Haiku 4.5 exhibits 'idle-drift': producing coherent plans but persistently choosing inaction.
Higher-performing models communicate more actively with counterpart firms, correlating with better outcomes.

Most agentic AI benchmarks treat the environment as passive: the model responds to a state, produces an output, and the eval moves on. A new paper on arXiv takes a different approach, placing an LLM in charge of a simulated business, surrounding it with other autonomous agents, and running the experiment for 90 consecutive days. The finding that stands out is not about who wins, but about a specific failure mode the authors name "idle-drift."

CoffeeBench simulates a coffee supply chain economy with six heterogeneous firms: two farmers, two roasters, and two retailers. Each operates autonomously over the 90-day simulation, managing cash, inventory, and pricing while communicating and transacting with one another. The model under evaluation controls one of the coffee roasters; the remaining five firms are handled by fixed reference agents. The objective is to maximize cumulative net income.
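To make the setup concrete, here is a minimal Python sketch of the market layout described above. It is an illustration only: the names `Firm`, `build_market`, and `controlled_by` are assumptions rather than identifiers from the CoffeeBench code, and the choice of which roaster goes to the evaluated model is arbitrary.

```python
from dataclasses import dataclass

SIM_DAYS = 90  # simulation length as described in the article


@dataclass
class Firm:
    name: str           # hypothetical label, e.g. "roaster_1"
    role: str           # "farmer", "roaster", or "retailer"
    controlled_by: str  # "model_under_test" or "reference_agent"


def build_market() -> list[Firm]:
    """Build the six-firm market: two farmers, two roasters, two retailers."""
    firms = [
        Firm(f"{role}_{i}", role, "reference_agent")
        for role in ("farmer", "roaster", "retailer")
        for i in (1, 2)
    ]
    # Assumption: roaster_1 is the firm handed to the evaluated model.
    for firm in firms:
        if firm.name == "roaster_1":
            firm.controlled_by = "model_under_test"
    return firms
```

The point of the layout is that only one seat changes between runs, so differences in cumulative net income can be attributed to the model in that seat rather than to its counterparts.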

Across the several open-weight and proprietary LLMs tested, all outperformed a passive baseline that takes no actions, and most achieved positive net income. The behavioral differences between models, though, are the more instructive part. Higher-performing models communicated more actively with counterpart firms; lower-performing ones fell into economically passive patterns.

Claude Haiku 4.5 presented the starkest case. The authors describe it repeatedly choosing inaction despite producing coherent assessments and plans. That combination is precisely what makes idle-drift a difficult failure to catch: the reasoning looks fine, the plans are internally consistent, but nothing gets executed. For teams deploying LLMs in autonomous roles, a model that appears to understand the situation and then does nothing is harder to debug than one that simply produces wrong outputs.
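One way a team might screen its own agent logs for this pattern is sketched below in Python. This is not the paper's metric: the article does not say how the authors quantify idle-drift. The `Step` record and its `plan` and `actions` fields are a hypothetical trajectory format, and the two functions only count steps where a plan was written but no action followed.

```python
from dataclasses import dataclass


@dataclass
class Step:
    day: int
    plan: str      # free-text plan produced by the model (hypothetical field)
    actions: list  # actions actually issued; empty means nothing was executed


def idle_rate(steps):
    """Fraction of steps that have a non-empty plan but no executed action."""
    planned = [s for s in steps if s.plan.strip()]
    if not planned:
        return 0.0
    idle = sum(1 for s in planned if not s.actions)
    return idle / len(planned)


def longest_idle_streak(steps):
    """Longest run of consecutive planned-but-idle steps."""
    best = current = 0
    for s in steps:
        if s.plan.strip() and not s.actions:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best
```

A high idle rate alone could reflect sensible waiting, so the streak length is arguably the more telling number: long unbroken runs of plans with no follow-through are closer to what the article describes.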

One caveat: the paper does not identify the mechanism behind idle-drift or establish whether it generalizes beyond this simulation. The reporting also lacks a detailed ranking of how the other tested models performed relative to each other, and it does not say whether the failure is architectural or a matter of prompt sensitivity. The researchers have released their code and agent trajectories, giving teams a concrete artifact to run before committing a model to any long-horizon autonomous role.

Shared on Bluesky by 1 AI expert

hardmaru @hardmaru.bsky.social amplified

@tksiia.bsky.social

Excited to share CoffeeBench!!☕️☕️☕️ We evaluate LLM agents in a 90-day B2B coffee supply-chain economy spanning farmers, roasters, and retailers, where autonomous firms negotiate, manage inventory, set prices, handle i…

Originally reported by paper

Original headline: New 90-Day Supply Chain Sim Catches Claude Haiku 4.5 in 'Idle-Drift' Failure Mode