huggingface.co web signal

BeyondArena finds trees still beat tabular FMs on non-IID data

TL;DR

  • BeyondArena spans 142 curated datasets and 11 models across IID, temporal, and grouped task types, with feature types including free text and high-cardinality categoricals.
  • Tabular foundation models led only on tiny to medium-sized IID data, while tree-based and deep learning models still dominated non-IID, large, and high-dimensional datasets.
  • The authors also released Data Foundry, a Python framework and metadata schema for curating tabular datasets for predictive machine learning.

A new benchmark called BeyondArena, presented in a paper titled "Beyond IID: How General Are Tabular Foundation Models, Really?" and posted on Hugging Face, pushes back on the narrative that foundation models have come for tabular machine learning the way they came for text and images. Across 142 curated datasets and 11 models, the headline finding is that tabular foundation models excel on tiny to medium-sized IID data, but traditional tree-based and deep learning models still dominate on non-IID, large, and high-dimensional datasets.

The finding matters because tabular foundation models have been one of the more interesting threads in this corner of ML over the last year, with the implicit promise that one pretrained model could displace the gradient-boosted tree as the default for spreadsheet-shaped data. The authors, with Lennart Purucker listed first, argue the field has been measuring progress on the wrong slice. The arXiv abstract puts it bluntly: standard benchmarks are "mostly defined for tasks where tabular foundation models already excel," and the most challenging scenarios are excluded. BeyondArena instead spans IID, temporal, and grouped task types, and includes feature types like free text and high-cardinality categoricals drawn from a broad range of disciplines.
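To make the three task types concrete, here is a minimal sketch of how an IID, a temporal, and a grouped train/test split differ. It is an illustration only, not the BeyondArena protocol: the function names (`iid_split`, `temporal_split`, `grouped_split`), the 25% holdout fraction, and the single-holdout design are all assumptions made for this example.

```python
import random
from collections import defaultdict


def iid_split(n_rows, test_frac=0.25, seed=0):
    """IID: shuffle row indices and hold out a random fraction."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(round(n_rows * test_frac)))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def temporal_split(timestamps, test_frac=0.25):
    """Temporal: train on the earliest rows, test on the latest ones.

    Rows are ordered by timestamp; ties are kept in their original order,
    so rows sharing a timestamp can land on either side of the cut.
    """
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_test = max(1, int(round(len(order) * test_frac)))
    return order[:-n_test], order[-n_test:]


def grouped_split(groups, test_frac=0.25, seed=0):
    """Grouped: hold out whole groups so no group appears on both sides."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    names = sorted(by_group)
    random.Random(seed).shuffle(names)
    n_test = max(1, int(round(len(names) * test_frac)))
    test_groups = set(names[:n_test])
    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test
```

In the IID case the test rows look like the training rows by construction. In the temporal case the model must extrapolate forward in time, and in the grouped case it must generalize to entities (for example customers or sites) it has never seen, which is where a model tuned on IID benchmarks can lose its edge.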

Alongside the benchmark, the team released Data Foundry, a Python framework and metadata schema for curating tabular datasets for predictive machine learning, so that future evaluations can run on the same protocol rather than each lab inventing its own.

The honest caveat is that this is a single benchmark paper and the abstract leaves the most useful practical questions unanswered. We do not see which specific tabular foundation models were among the eleven evaluated, where exactly the crossover between small and large datasets sits, or how big the gap is on temporal versus grouped non-IID splits. Take the dominance claim as reported, not settled, until independent groups run the same protocol.

If the result holds up, the immediate beneficiaries are practitioners already shipping gradient-boosted trees in production who were quietly waiting to see whether to switch to a foundation model. The longer-term beneficiary is the model research community itself, because a benchmark that puts the hard cases back in is what pushes the next generation of tabular foundation models past the regimes they currently win.