arxiv.org via Reddit

VPO Diversity Training Improves Math and Code Search

synthetic data open source rl-training test-time-compute policy-diversity

Key insights

  • VPO replaces single scalar rewards with per-objective reward vectors, directly training models to generate diverse solution candidates at inference time.
  • VPO-trained models outperform scalarized RLVR baselines on Best-of-N search, with the advantage growing on harder math and code problems.
  • VPO requires no additional training compute, making diversity-aware reinforcement learning a drop-in upgrade for existing RLVR pipelines.
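
To make the first insight concrete, the sketch below contrasts a scalarized reward with a per-objective reward vector. It is an illustration only, not VPO's training procedure, which this summary does not describe. The objective names (`correctness`, `efficiency`), the weights, and the helper functions `scalarize` and `dominates` are all hypothetical.

```python
from typing import Dict

# Hypothetical reward vector: one score per objective.
Rewards = Dict[str, float]


def scalarize(rewards: Rewards, weights: Dict[str, float]) -> float:
    """Collapse a reward vector into one number with a weighted sum."""
    return sum(weights[name] * value for name, value in rewards.items())


def dominates(a: Rewards, b: Rewards) -> bool:
    """Pareto dominance: a is no worse on every objective and better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


# Assumed weights and scores, chosen only to make the point visible.
weights = {"correctness": 0.5, "efficiency": 0.5}
solution_a = {"correctness": 1.0, "efficiency": 0.25}
solution_b = {"correctness": 0.75, "efficiency": 0.5}

scalar_a = scalarize(solution_a, weights)
scalar_b = scalarize(solution_b, weights)

# The scalar view cannot tell the two solutions apart ...
print(scalar_a, scalar_b)
# ... while the vector view shows two distinct trade-offs,
# neither of which dominates the other.
print(dominates(solution_a, solution_b), dominates(solution_b, solution_a))
```

Under these assumed numbers both solutions receive the same scalar reward, so a scalarized objective has no signal for keeping both strategies alive, whereas the vector form preserves the difference.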

Why this matters

Teams scaling inference-time compute have treated output diversity as a byproduct of temperature sampling rather than a training target, leaving potential Best-of-N yield on the table. VPO establishes that diversity can be directly optimized during training, and that doing so compounds on harder problems where scalarized baselines plateau. For practitioners building math and code reasoning systems, this raises the effective ceiling of inference-time scaling without requiring any additional training budget.

Summary

VPO (arXiv 2605.22817) beats standard RLVR (reinforcement learning with verifiable rewards) training on math reasoning and code generation benchmarks when models use Best-of-N test-time search to select the strongest candidate from multiple samples. The method replaces a single scalar reward with a vector of per-objective rewards during training, explicitly pushing models toward diverse solution strategies rather than collapsing onto a single high-reward mode. The performance gap over scalarized baselines widens on harder problems, where a varied candidate pool is most valuable for selection. In short, the authors show that diversity is a trainable property that makes inference-time search more effective at no added training cost.

  • VPO outperforms scalarized RLVR baselines on Best-of-N across math reasoning and code generation benchmarks.
  • The performance gap with baselines grows on harder problems, where diverse candidate sampling matters most.
  • No additional training compute is required, making this a low-friction upgrade to existing pipelines.

For teams already investing in inference-time scaling, this repositions output diversity from a sampling artifact to a direct training objective.
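
As a rough illustration of why a varied candidate pool helps Best-of-N selection, the sketch below runs a generic Best-of-N selector over two toy pools. The problem, the `verifier` function, and both candidate pools are invented for this example and do not come from the paper's benchmarks.

```python
from typing import Callable, List


def best_of_n(candidates: List[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score (first wins ties)."""
    return max(candidates, key=verifier)


def verifier(answer: str) -> float:
    """Toy verifier for the problem: find the positive x with x * x == 49."""
    return 1.0 if answer.strip() == "7" else 0.0


# A collapsed policy repeats one strategy; a diverse policy spreads its samples.
collapsed_pool = ["6", "6", "6", "6"]
diverse_pool = ["6", "8", "7", "5"]

picked_collapsed = best_of_n(collapsed_pool, verifier)
picked_diverse = best_of_n(diverse_pool, verifier)

print(picked_collapsed, verifier(picked_collapsed))
print(picked_diverse, verifier(picked_diverse))
```

With the same sample budget of four, selection can only succeed on the diverse pool, because the collapsed pool contains no correct candidate for the verifier to find.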

Potential risks and opportunities

Risks

  • Labs with large scalarized RLVR investments (Meta, Mistral, Qwen teams) face retraining costs if VPO becomes the competitive baseline for math and code benchmark leaderboards over the next 6-12 months.
  • If VPO's diversity gains are specific to Best-of-N evaluation setups and do not transfer to production tasks, teams optimizing on these benchmarks risk overfitting to test distributions without real-world gains.
  • Incorrect objective decomposition for vector rewards could produce miscalibrated diversity, degrading rather than improving model outputs for teams without the domain expertise to specify reward components precisely.

Opportunities

  • Open-source fine-tuning frameworks (OpenRLHF, Axolotl, LlamaFactory) that add native multi-objective vector reward support first will attract labs actively benchmarking math and code reasoning models.
  • Managed training infrastructure providers (Modal, Together AI, Lightning AI) can offer VPO-compatible RLVR pipelines as a differentiated service for teams without the infrastructure to implement vector reward training from scratch.
  • Evaluation platform providers (Scale AI, LMSYS) can build Best-of-N diversity metrics as a distinct assessment category, capitalizing on renewed practitioner interest in how training choices interact with inference-time search strategies.

What we don't know yet

  • Whether VPO's vector reward decomposition must be manually specified per domain or can be learned automatically, and how much expert effort objective design requires in practice.
  • Whether VPO generalizes beyond structured tasks like math and code to open-ended generation domains that lack clear per-objective decompositions.
  • Whether combining VPO with orthogonal inference-time scaling methods such as process reward models or tree search compounds the diversity benefit or saturates it at similar sample counts.