Google Research unveils TabFM, a zero-shot model for tables
TL;DR
- TabFM predicts on previously unseen tables in a single forward pass, framing tabular classification and regression as in-context learning.
- The model was trained entirely on hundreds of millions of synthetic datasets generated by structural causal models, sidestepping the scarcity of high-quality open-source tabular data.
- Google plans to expose TabFM via a BigQuery AI.PREDICT SQL command in the coming weeks; weights are already on HuggingFace and GitHub.
The interesting move in tabular ML this year hasn't been a new gradient-boosted tree; it's the idea that you shouldn't have to train a model at all. Google Research just put out TabFM, a foundation model that takes a table it has never seen and returns classification or regression predictions in a single forward pass. No feature engineering, no hyperparameter sweep, no training run.
The trick is what it was trained on. Research scientists Weihao Kong and Abhimanyu Das write that TabFM was trained entirely on hundreds of millions of synthetic datasets, generated dynamically using structural causal models. Their justification for going synthetic is blunt: high-quality, diverse tabular datasets are 'critically scarce in the open-source space,' so rather than scraping CSVs from the web they built a generator that could produce effectively unlimited variations of structured problems. The architecture itself alternates row and column attention over the raw table, compresses each row into a single dense vector, and lets a dedicated Transformer do in-context learning over those compressed vectors.
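To make "synthetic datasets generated by structural causal models" concrete, here is a minimal sketch of the general idea, not TabFM's actual generator, whose details the post does not spell out. It assumes a random DAG over a handful of variables, a tanh mechanism with Gaussian noise at each node, and a target taken from the last node and binarized at its median. The function name `sample_scm_dataset` and every parameter are hypothetical.

```python
import math
import random


def sample_scm_dataset(n_rows=200, n_features=5, seed=0):
    """Sample one toy classification table from a random structural causal model.

    Illustrative assumptions (not from the TabFM post): each node is
    tanh(weighted sum of its parents) plus Gaussian noise, and the last
    node in topological order is used as the prediction target.
    """
    rng = random.Random(seed)
    n_nodes = n_features + 1  # the final node becomes the target

    # Random DAG: node j may only depend on earlier nodes i < j,
    # and each possible edge is kept with probability 0.5.
    weights = [
        [rng.gauss(0, 1) if i < j and rng.random() < 0.5 else 0.0
         for i in range(n_nodes)]
        for j in range(n_nodes)
    ]

    rows = []
    for _ in range(n_rows):
        values = []
        for j in range(n_nodes):
            # Propagate values through the graph in topological order.
            parent_sum = sum(weights[j][i] * values[i] for i in range(j))
            values.append(math.tanh(parent_sum) + rng.gauss(0, 0.3))
        rows.append(values)

    # Binarize the target node at (roughly) its median to get class labels.
    threshold = sorted(r[-1] for r in rows)[n_rows // 2]
    X = [r[:-1] for r in rows]
    y = [1 if r[-1] > threshold else 0 for r in rows]
    return X, y


# Each seed yields a different causal graph, hence a different "problem".
X, y = sample_scm_dataset(n_rows=100, n_features=4, seed=1)
```

Varying the seed changes the graph, the edge weights, and the noise, which is what lets a generator of this kind produce an effectively unlimited stream of distinct tables for pretraining.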
Here is why this matters if you actually ship tabular models: Google plans to expose TabFM through BigQuery in the coming weeks, so you should be able to call it with a single AI.PREDICT SQL statement. That is a very different developer experience from spinning up an AutoML pipeline. The weights are also open on HuggingFace and GitHub, so the model isn't locked behind the SQL front door either.
The honest caveats are worth reading. The post evaluates TabFM on TabArena across 38 classification and 13 regression datasets ranging from 700 to 150,000 samples, and reports competitive Elo scores, with a stronger 32-way ensemble variant that uses a non-negative least squares solver, SVD features, and Platt scaling. What the post does not give you is per-dataset head-to-head numbers against XGBoost, CatBoost, or earlier tabular foundation models, and it does not test on the messy enterprise tables (drift, rare categorical values, leakage) that decide whether tabular ML actually works in production. Take 'competitive on TabArena' as a plausible headline, not a settled result.
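For readers weighing what an Elo gap means, the sketch below converts a rating difference into an expected head-to-head score. It assumes the conventional logistic Elo formula with a 400-point scale; the exact fitting procedure behind TabArena's leaderboard may differ, and the ratings used here are made-up inputs, not figures from the post.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the standard Elo model.

    Assumes the conventional 400-point logistic scale; a benchmark's
    own Elo computation may use a different calibration.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Hypothetical ratings, for illustration only.
even = elo_expected_score(1000, 1000)    # equal ratings: 0.5
ahead = elo_expected_score(1100, 1000)   # a 100-point lead: about 0.64
```

Under that assumption, a 100-point lead corresponds to winning roughly 64% of pairwise comparisons, which is why an aggregate Elo cannot stand in for per-dataset results.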
If it holds up on real data, the shape of the work changes. Less AutoML sweeping, more SQL, with analysts prototyping churn or fraud models the same way they run aggregations today.
Originally reported by research.google
Original headline: Google Research Releases TabFM — Zero-Shot Foundation Model for Tabular Data Trained on Hundreds of Millions of Synthetic Datasets, Coming to BigQuery