CDR-Bench finds LLMs collapse on order-sensitive data recipes
TL;DR
- CDR-Bench evaluates LLMs on 3,462 data refinement tasks spanning four real-world domains and 29 distinct operators, with deterministic reference outputs enabling exact scoring.
- Across more than ten state-of-the-art models, compositional performance degrades sharply and order-sensitive recipe success collapses.
- The authors conclude current LLMs lack the procedural faithfulness required for reliable compositional data refinement.
Data refinement is one of the less-glamorous LLM jobs that shows up in almost every real pipeline: take a messy text state, run it through a sequence of cleaning, filtering, and transformation steps, and get something usable at the end. A new benchmark posted on arXiv, called CDR-Bench, sets out to measure how faithfully today's models actually execute those multi-step recipes when both the composition and the order of the steps matter.
The setup is straightforward. The authors assembled 3,462 tasks spanning four real-world data refinement domains, built around 29 distinct operators, and evaluated more than ten state-of-the-art LLMs in three settings: atomic (a single operator), order-agnostic (composition where order does not matter), and order-sensitive (composition where order does). Because the reference outputs are deterministic, correctness can be checked exactly rather than judged.
The headline finding is a fall-off pattern the authors describe as performance degrading sharply in compositional settings, with order-sensitive recipe success collapsing outright. Their conclusion is that current LLMs lack the procedural faithfulness required for reliable compositional data refinement. If you have ever handed a model a five-step cleaning recipe and quietly noticed that steps three and four sometimes trade places, that is the phenomenon being measured here.
The honest caveat is that a benchmark is a proxy, and the abstract as retrieved does not spell out per-model scores, which specific operators drive the collapse, or how planning-style prompting or tool use would move the numbers. It is a diagnosis more than a fix. What is useful about it is that it gives a common yardstick for a failure mode that data engineers already suspect, and points at where the next round of work needs to land: not making models fluent about recipes, but making them faithful to the exact recipe you asked for.
Shared on Bluesky by 2 AI experts
Originally reported by arxiv.org
Read the original article →Original headline: CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes