
AgenticDataBench benchmarks data agents on real fintech tasks

TL;DR

  • AgenticDataBench spans 15 vertical domains, including five real-world B2B use cases from a fintech company.
  • The paper defines 'data science skills' as recurring operational patterns extracted from Stack Overflow solutions.
  • For domains lacking real tasks, the authors use a systematic LLM-based task generation approach to create workflows.

Data agents keep getting shipped, but beyond an aggregate pass rate there has been little to say what any of them are actually good at. A new arXiv paper from a 13-author team introduces AgenticDataBench, a benchmark that tries to push the evaluation conversation past the leaderboard.

The benchmark spans 15 vertical domains and includes five real-world B2B use cases from a fintech company. Rather than treating each task as a monolithic pass or fail, the authors decompose data science work into what they call 'skills': recurring operational patterns extracted from Stack Overflow solutions. They then use skill-aligned hierarchical clustering to strip out redundancy, so the benchmark isn't just fifty variants of the same join.
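To make the redundancy-stripping idea concrete, here is a minimal sketch of what clustering tasks by their skill sets could look like. It is not the paper's method: the task names, skill labels, Jaccard distance, and 0.5 threshold are all assumptions made for illustration, and the clustering is a single-linkage cut at one distance level rather than whatever 'skill-aligned' procedure the authors actually use.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_by_skills(task_skills, threshold=0.5):
    """Flat cut of a single-linkage hierarchy.

    Any two tasks whose skill sets are within `threshold` of each other
    end up in the same cluster (directly or through a chain of tasks).
    The threshold is a hypothetical knob, not a value from the paper.
    """
    parent = {task: task for task in task_skills}

    def find(task):
        # Union-find lookup with path halving.
        while parent[task] != task:
            parent[task] = parent[parent[task]]
            task = parent[task]
        return task

    for a, b in combinations(task_skills, 2):
        if jaccard_distance(task_skills[a], task_skills[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for task in task_skills:
        clusters.setdefault(find(task), []).append(task)
    return sorted(sorted(members) for members in clusters.values())


def pick_representatives(clusters):
    """Keep one task per cluster (here: the alphabetically first)."""
    return [members[0] for members in clusters]


# Hypothetical tasks and skill labels, invented for this example.
tasks = {
    "join_orders_customers": {"join", "filter"},
    "join_invoices_clients": {"join", "filter"},
    "monthly_revenue_rollup": {"groupby", "aggregate", "date_parse"},
    "weekly_revenue_rollup": {"groupby", "aggregate", "date_parse", "resample"},
    "dedupe_accounts": {"deduplicate", "string_normalise"},
}

clusters = cluster_by_skills(tasks)
print(pick_representatives(clusters))
```

In this toy input the two join tasks collapse into one cluster and the two revenue rollups into another, so five candidate tasks shrink to three representatives. A real pipeline would more likely lean on a library such as SciPy for the full dendrogram; the pure-Python version is only meant to show the shape of the idea.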

For domains where real task-solution pairs already exist, they select pairs that maximise skill diversity. For domains that lack real workflows, they lean on a 'systematic LLM-based task generation approach' to create realistic tasks. The evaluation then reports skill-level performance for state-of-the-art data agents, which is the actual product here: a diagnostic that says a given agent handles one kind of operation well but falls apart on another, rather than just a single number.
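Two of the steps in that paragraph are easy to sketch: picking tasks that maximise skill diversity, and rolling task outcomes up into per-skill scores. The code below is a guess at the general shape, not the authors' procedure. The greedy coverage heuristic, the function names, the task IDs, and the pass/fail results are all invented for illustration.

```python
from collections import defaultdict


def select_diverse(candidates, k):
    """Greedy max-coverage selection (an assumed heuristic).

    Repeatedly pick the task that adds the most skills not yet covered,
    breaking ties alphabetically, until k tasks are chosen or no task
    adds anything new.
    """
    covered, chosen = set(), []
    remaining = dict(candidates)
    while remaining and len(chosen) < k:
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            break  # every remaining task is redundant
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


def skill_pass_rates(task_skills, results):
    """Turn per-task pass/fail into a pass rate per skill.

    A task counts towards every skill it is tagged with, so one failed
    task lowers the score of each skill it exercises.
    """
    totals, passes = defaultdict(int), defaultdict(int)
    for task, passed in results.items():
        for skill in task_skills[task]:
            totals[skill] += 1
            passes[skill] += int(passed)
    return {skill: passes[skill] / totals[skill] for skill in sorted(totals)}


# Hypothetical candidate tasks and agent results.
candidates = {
    "t1": {"join", "filter"},
    "t2": {"join", "groupby"},
    "t3": {"pivot", "window", "filter"},
    "t4": {"join"},
}
results = {"t1": True, "t2": False, "t3": True}

print(select_diverse(candidates, k=2))
print(skill_pass_rates(candidates, results))
```

With these made-up results the agent passes two of three tasks overall, but the per-skill view shows it at 1.0 on filter, pivot and window, 0.5 on join and 0.0 on groupby. That gap between one aggregate number and a per-skill breakdown is the kind of diagnostic the paper is pitching.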

The honest caveat is that the paper's abstract doesn't say which agents were tested, how they scored, or whether the benchmark and its labels are being released publicly. It also isn't clear how the LLM-generated tasks in less-covered domains are validated against real practitioner intent, which matters given that the skill taxonomy is drawn from Stack Overflow rather than production pipelines.

If the artefacts do come out under a permissive licence, the immediate winners are teams building LLM data pipelines who want to know where their agent breaks before a customer trial, and vendors who can point to specific skill strengths instead of averages. The fintech grounding is the piece to watch: benchmarks with real B2B tasks are rare, and if this one holds up under external replication it becomes the reference for the next round of data-agent releases.