huggingface.co web signal July 2nd 2026

UW-Madison's WARP recovers training-data mixtures from weights

TL;DR

WARP recovers a fine-tuned model's domain mixture using only its base and final weights, with no access to intermediate training checkpoints.
In controlled tests it reached a mean absolute error as low as 0.046 on BERT and 0.104 on GPT-2, cutting the strongest baseline's error by 35% and 30% respectively.
Interpolated pseudo-checkpoints outperformed even an oracle variant that used the real training trajectory, thanks to what the authors call a smoothness gap.

A new paper from University of Wisconsin-Madison researchers, posted on Hugging Face's papers hub, takes a swing at something the open-weights ecosystem mostly treats as impossible. Given only a base model and its fine-tuned descendant, with no training logs and no intermediate checkpoints, can you reconstruct the domain proportions of the data used in between? The authors say yes, with usable accuracy.

Their framework, WARP (Weight-Space Analysis for Recovering Training Data Portfolios), leans on model merging in a way that has nothing to do with merging's usual purpose. Instead of combining capabilities, they interpolate between the base and fine-tuned checkpoints using operators like LERP and TIES to build pseudo-checkpoints that stand in for the training path that was never released. A geometric footprint is extracted at each pseudo-checkpoint by projecting per-domain gradients onto the direction pointing to the reference model, then mapped to a mixture estimate either by a parameter-free softmax readout or a supervised MLP projector trained on synthetic mixtures.
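The pipeline described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the quadratic per-domain losses, the choice of the fine-tuned weights as the reference model, the sign convention (projecting the negative gradient), and the averaging over pseudo-checkpoints are all hypothetical. Only the LERP operator and the parameter-free softmax readout are shown; TIES and the supervised MLP projector are omitted.

```python
import numpy as np

def lerp_pseudo_checkpoints(base, tuned, num_steps=8):
    """Build pseudo-checkpoints on the straight line from base toward tuned."""
    # The endpoint is excluded so the direction to the reference is never zero.
    alphas = np.linspace(0.0, 1.0, num_steps, endpoint=False)
    return [(1.0 - a) * base + a * tuned for a in alphas]

def footprint(checkpoint, reference, domain_grads):
    """Project each domain's negative gradient onto the unit direction
    pointing from the checkpoint to the reference weights."""
    unit = reference - checkpoint
    unit = unit / np.linalg.norm(unit)
    return np.array([-(grad(checkpoint) @ unit) for grad in domain_grads])

def softmax_readout(scores, temperature=1.0):
    """Parameter-free map from footprint scores to a point on the simplex."""
    z = scores / temperature
    z = z - z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Toy setup (assumption): domain d has loss 0.5 * ||w - c_d||^2, so its
# gradient at w is (w - c_d). The "fine-tuned" weights are the
# mixture-weighted combination of the domain optima.
centers = np.eye(4)[:3]                      # three domains in a 4-d weight space
true_mixture = np.array([0.6, 0.3, 0.1])
base = np.zeros(4)
tuned = true_mixture @ centers
domain_grads = [lambda w, c=c: w - c for c in centers]

checkpoints = lerp_pseudo_checkpoints(base, tuned)
scores = np.mean([footprint(w, tuned, domain_grads) for w in checkpoints], axis=0)
estimate = softmax_readout(scores)
print(estimate)
```

In this toy the readout recovers the ordering of the domains rather than their exact proportions, since the softmax temperature is not calibrated to anything.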

The reported numbers are worth pausing on. Across forty experimental trials, WARP recovered domain mixtures with mean absolute error as low as 0.046 on BERT and 0.104 on GPT-2, averaged across four text datasets, reducing the strongest baseline's error by 35% and 30% respectively. It beat sample-level membership inference, and, more interestingly, also beat an oracle variant that had access to the real intermediate training checkpoints. The authors attribute that inversion to what they call a smoothness gap: real fine-tuning paths are noisy from stochastic mini-batches and learning-rate schedules, while the interpolated pseudo-trajectories give a cleaner monotone footprint the projector can learn from.
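For readers who want the metric spelled out, here is a small sketch of mean absolute error between a true and an estimated mixture, plus the relative error reduction against a baseline. The vectors and the baseline figure are made-up illustrations, not values from the paper, and the paper's exact averaging over domains, datasets, and trials may differ.

```python
import numpy as np

def mixture_mae(true_mix, est_mix):
    """Mean absolute error across domain proportions."""
    true_mix = np.asarray(true_mix, dtype=float)
    est_mix = np.asarray(est_mix, dtype=float)
    return float(np.mean(np.abs(true_mix - est_mix)))

def relative_reduction(baseline_error, method_error):
    """Fraction by which a method lowers the baseline's error."""
    return (baseline_error - method_error) / baseline_error

# Hypothetical four-domain mixture and estimate (illustrative only).
true_mix = [0.50, 0.30, 0.10, 0.10]
est_mix = [0.45, 0.35, 0.12, 0.08]
mae = mixture_mae(true_mix, est_mix)

# Hypothetical errors: a method at 0.065 against a baseline at 0.10.
reduction = relative_reduction(0.10, 0.065)
print(mae, reduction)
```

With these illustrative inputs the per-domain errors are 0.05, 0.05, 0.02, and 0.02, so the MAE is about 0.035, and the hypothetical method cuts the hypothetical baseline's error by about 35%.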

The honest caveat is that these results come from a tightly controlled setting. Reference models were BERT and GPT-2-Small fine-tuned on datasets including AG News, mixtures were drawn from a known simplex, and the practitioner is assumed to have access to the same data source the fine-tune came from. The paper does not claim to have recovered a frontier lab's proprietary mixture from a released instruct model, and it says nothing about how the technique scales to modern billion-parameter systems, multimodal weights, or heavily post-trained checkpoints where the base is far away in weight space. The reporting also gives no cost model for running WARP at that scale.

Even with those caveats, the direction matters. Releasing weights and withholding the data recipe has been the default open-weights posture. If a technique like this holds up beyond the small-model controlled setting, that separation starts to look less durable, and auditors, copyright plaintiffs, and competitor labs all inherit a new tool.

Originally reported by huggingface.co

Original headline: WARP Paper: Recovering a Fine-Tuned Model's Training-Data Portfolio Directly From Its Weights