Self-play driving AI aligns with humans using just 30 minutes of demos
TL;DR
- A new method trains driving AI using only 30 minutes of human demonstrations, 2,500 times fewer than comparable imitation learning approaches.
- Human demonstrations serve as a regularization signal on top of a basic goal-reaching reward, not as the primary training objective.
- Resulting policies coordinate with held-out human trajectories and finish training in 15 hours on a single consumer-grade GPU.
For autonomous vehicle research, the persistent tension has been this: self-play reinforcement learning is cheap and scalable, but agents trained through it develop driving behaviors that work among the agents themselves yet remain incompatible with real human drivers. Prior approaches tried to close that gap through reward engineering and domain randomization, which a new paper on arXiv describes as "brittle and labor-intensive."
The paper's core insight is that you do not have to choose between cheap simulation and human-compatible behavior. A small amount of human data, used not as the primary training objective but as "a regularization objective on top of a minimal safe goal-reaching reward," is enough to nudge self-play policies toward driving that meshes with human drivers. The authors put the quantity at just 30 minutes of human demonstrations -- 2,500 times fewer than comparable imitation learning approaches require.
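To make the "regularization on top of a reward" idea concrete, here is a minimal sketch of how such an objective could be assembled. It is not the paper's implementation: the summary only states that human data acts as a regularizer on a minimal safe goal-reaching reward. The reward values, the choice of a negative log-likelihood penalty, the `weight` parameter, and all function names below are assumptions made for illustration.

```python
import math


def goal_reaching_reward(reached_goal: bool, collided: bool) -> float:
    """Hypothetical minimal reward: -1 for a collision, +1 for reaching the goal."""
    if collided:
        return -1.0
    return 1.0 if reached_goal else 0.0


def human_regularizer(log_probs_of_human_actions: list[float]) -> float:
    """Mean negative log-likelihood the policy assigns to logged human actions.

    Lower values mean the policy finds the human demonstrations more plausible.
    This particular penalty is an assumption, not the paper's stated choice.
    """
    return -sum(log_probs_of_human_actions) / len(log_probs_of_human_actions)


def regularized_loss(
    rl_loss: float,
    log_probs_of_human_actions: list[float],
    weight: float = 0.1,  # assumed trade-off coefficient
) -> float:
    """Self-play RL loss plus a small human-data penalty."""
    return rl_loss + weight * human_regularizer(log_probs_of_human_actions)


if __name__ == "__main__":
    # The policy assigns probabilities 0.5 and 0.25 to two logged human actions.
    demo_log_probs = [math.log(0.5), math.log(0.25)]
    print(regularized_loss(rl_loss=2.0, log_probs_of_human_actions=demo_log_probs))
```

With `weight` set to zero the objective reduces to plain self-play; increasing it pulls the policy toward the demonstrations. Under this reading, the human data only needs to shape behavior rather than teach driving from scratch, which is one plausible explanation for why so little of it is required.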
The practical results are notable: training completes in 15 hours on a single consumer-grade GPU, and the resulting policies coordinate with held-out human trajectories. The paper's own framing captures it well -- "like the spice in a good stew, we find that a little human data goes a long way."
One caveat: coordinating with held-out trajectories in a controlled evaluation is not the same as performing reliably across the full variance of real-world traffic. The abstract does not address whether the 30-minute demonstration set requires specific curation, or how performance changes as the deployment environment diverges from the simulation. Source code and videos are available at spiced-self-play.com, which should let the research community probe those limits directly.
Shared on Bluesky by 2 AI experts
New Paper: arxiv.org/abs/2606.19370 Self-play yields capabilities but requires frustrating cost-function tuning. Surprisingly, just 30 minutes of demonstration data produces much more human-like driving policies! Led …
Originally reported by arxiv.org
Original headline: Human-like autonomy emerges from self-play and a pinch of human data