Evolution Fine-Tuning Teaches 2B-9B LLMs to Reuse Search Skill
TL;DR
- The Finch Collection packs 156K search trajectories across 10 domains and 371 optimization tasks, used to fine-tune open LLMs from 2B to 9B parameters.
- Across 22 held-out tasks, EFT-tuned models beat their base counterparts by 10.22% on average, showing cross-task generalization from search experience.
- Paired with test-time RL, the fine-tuned model matches state-of-the-art on two circle-packing tasks and beats its base on the Erdős minimum-overlap problem.
A quiet, technical result worth flagging from arXiv this week: a paper called Evolution Fine-Tuning argues that a small open-source language model can be taught the skill of iteratively evolving a solution, and can then carry that skill across genuinely different optimization problems rather than learning each one from scratch.
The setup, as the authors describe it, is that the last couple of years of LLM-plus-search wins on hard problems (open mathematical conjectures, GPU kernel design, scientific law discovery, combinatorial puzzles) have mostly kept the evolutionary logic in the scaffold around the model, not in the model itself. Every new task starts over, and whatever the model figured out mid-search is discarded once the attempt finishes. The authors' claim is that the interesting capability, knowing which part to mutate and how, and deciding when to backtrack, could live in the weights instead.
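To make "evolutionary logic in the scaffold" concrete, here is a minimal sketch of what such an outer loop can look like. It is a toy, not the paper's scaffold: `toy_score`, `toy_mutate`, and `evolve` are hypothetical names, the problem is a one-dimensional function, and in a real system the mutation step would be a call to the language model rather than a random nudge.

```python
import random


def toy_score(x):
    # Hypothetical objective: higher is better, with the peak at x = 3.
    return -(x - 3.0) ** 2


def toy_mutate(x, rng):
    # Stand-in for "ask the model for a modified solution".
    return x + rng.gauss(0.0, 0.5)


def evolve(score, mutate, seed, steps=200, rng=None):
    """Propose a child, score it, and keep it only if it improves on the best so far."""
    rng = rng or random.Random(0)
    best, best_score = seed, score(seed)
    history = [(seed, best_score)]  # every candidate tried, in order
    for _ in range(steps):
        child = mutate(best, rng)
        child_score = score(child)
        history.append((child, child_score))
        if child_score > best_score:
            best, best_score = child, child_score
    return best, best_score, history


best, best_score, history = evolve(toy_score, toy_mutate, seed=0.0)
```

The `history` list is the part that matters for the argument: in a scaffold-only setup it is typically thrown away when the run ends, and the next task begins with an unchanged model.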
Their proposal is Evolution Fine-Tuning (EFT), described as a mid-training paradigm that converts evolutionary search trajectories into supervision. They built a dataset they call the Finch Collection, 156K trajectories across 10 domains and 371 optimization tasks, and fine-tuned open-source LLMs from 2B to 9B parameters on it. On 22 held-out tasks, the fine-tuned models beat their base counterparts by 10.22% on average. Paired with test-time RL, the fine-tuned model reportedly matches state-of-the-art performance on two circle-packing tasks and outperforms its base counterpart on the Erdős minimum-overlap problem.
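The summary does not spell out how a trajectory becomes a training example, so the following is only a guess at the general shape of "search trajectories as supervision". The `Step` record, the prompt template, and the choice to keep only improving steps are all assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Step:
    # One hypothetical search step: a parent solution and the child derived from it.
    parent: str
    child: str
    parent_score: float
    child_score: float


def to_sft_examples(task, steps, keep_only_improving=True):
    """Turn search steps into prompt/completion pairs for supervised fine-tuning."""
    examples = []
    for s in steps:
        # Assumed filter: skip steps whose child did not beat its parent.
        if keep_only_improving and s.child_score <= s.parent_score:
            continue
        prompt = (
            f"Task: {task}\n"
            f"Current solution (score {s.parent_score:.3f}):\n{s.parent}\n"
            "Propose an improved solution."
        )
        examples.append({"prompt": prompt, "completion": s.child})
    return examples


steps = [
    Step("r = 0.10", "r = 0.12", 0.40, 0.46),
    Step("r = 0.12", "r = 0.11", 0.46, 0.43),  # regression, dropped by default
    Step("r = 0.12", "r = 0.13", 0.46, 0.49),
]
examples = to_sft_examples("toy circle packing", steps)
all_examples = to_sft_examples("toy circle packing", steps, keep_only_improving=False)
```

Whether to keep non-improving steps is a real design choice: dropping them teaches only successful edits, while keeping them (with their scores in the prompt) could in principle also teach when to backtrack.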
The honest caveats are pretty clear from the paper itself. A 10.22% average across held-out tasks is a headline number that can hide wide variance, the strongest reported result needed test-time RL on top of the fine-tune rather than the fine-tune alone, and everything the model learned is downstream of whichever search scaffolds generated the 156K trajectories in the first place. What the abstract does not tell you is how EFT scales past 9B, how expensive the practice phase actually is to reproduce, or which of the 10 training domains carried most of the transfer.
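The point about averages hiding variance is easy to show with made-up numbers. The two lists below are purely hypothetical per-task gains, not results from the paper; they are chosen only because they share a mean while describing very different outcomes.

```python
import statistics

# Hypothetical per-task gains in percent; NOT taken from the paper.
uniform_gains = [10.0, 10.0, 10.0, 10.0]
skewed_gains = [40.0, 0.0, 0.0, 0.0]

uniform_mean = statistics.mean(uniform_gains)
skewed_mean = statistics.mean(skewed_gains)

# Population standard deviation: spread of gains around the mean.
uniform_spread = statistics.pstdev(uniform_gains)
skewed_spread = statistics.pstdev(skewed_gains)

print(f"uniform: mean={uniform_mean:.2f} spread={uniform_spread:.2f}")
print(f"skewed:  mean={skewed_mean:.2f} spread={skewed_spread:.2f}")
```

Both lists report the same average gain, but in the second one a single task accounts for all of it. A per-task breakdown is what would distinguish the two cases.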
The forward-looking part is who this helps. If the pattern holds, groups running small open models, the 2B to 9B tier that fits on ordinary hardware, get a plausible route into discovery-style optimization work that has, until now, mostly belonged to labs stacking scaffolds on top of frontier hosted systems.
Originally reported by paper
Original headline: Evolution Fine-Tuning: 156K-Trajectory Finch Collection Teaches 2B–9B LLMs to Generalize Across 371 Unseen Optimization Tasks