Biology AI Study: SFT Degrades OOD, RL Recovers Generalization

TL;DR

  • SFT consistently raises in-domain accuracy but causes out-of-distribution (OOD) performance to peak early and then decline.
  • RL applied to a strong SFT checkpoint with aligned rewards partially recovers OOD generalization across genomics, RNA, and protein tasks.
  • Under fixed budgets, brief SFT followed by a larger RL allocation with asymmetric adapter capacity produced the best in-domain/OOD trade-off.

For biology AI models, the field has converged on a familiar stack: continued pre-training (CPT) to align the base model with biological language, then supervised fine-tuning (SFT) to specialize it, then reinforcement learning (RL) to sharpen reasoning. The implicit assumption is that each stage adds capability uniformly. A new arXiv paper, "How Post-Training Shapes Biological Reasoning Models," from a team including Marinka Zitnik and Sham M. Kakade tests that assumption directly, training and evaluating more than 100 biological reasoning models under controlled variations across genomics, transcriptomics, and protein domains.

The sharpest signal in the study is about SFT. Supervised fine-tuning consistently raises in-domain accuracy, but out-of-distribution performance peaks early and then declines as models fit the training distribution more tightly. The paper describes this as SFT progressively concentrating models on the training distribution -- the longer you run it, the better the model performs on tasks it has already seen and the worse it performs on genuinely novel biology.
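One way to act on this pattern is to score every SFT checkpoint on a held-out OOD set and pick the checkpoint at the OOD peak rather than the one with the best in-domain score. The following is a minimal sketch, assuming per-checkpoint accuracies are already available; the `Checkpoint` type, its field names, and the numbers are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical record of one evaluated SFT checkpoint."""
    step: int
    in_domain_acc: float
    ood_acc: float


def best_ood_checkpoint(history: Sequence[Checkpoint]) -> Checkpoint:
    """Return the checkpoint with the highest OOD accuracy (earliest on ties)."""
    if not history:
        raise ValueError("history is empty")
    # Sorting key: higher OOD accuracy first, then the smaller step on ties.
    return max(history, key=lambda c: (c.ood_acc, -c.step))


def ood_drop_after_peak(history: Sequence[Checkpoint]) -> float:
    """OOD accuracy lost between the peak checkpoint and the final checkpoint."""
    peak = best_ood_checkpoint(history)
    last = max(history, key=lambda c: c.step)
    return peak.ood_acc - last.ood_acc


# Made-up numbers, for illustration only: in-domain accuracy keeps rising
# while OOD accuracy peaks at step 200 and then falls.
history = [
    Checkpoint(step=100, in_domain_acc=0.55, ood_acc=0.48),
    Checkpoint(step=200, in_domain_acc=0.63, ood_acc=0.52),
    Checkpoint(step=300, in_domain_acc=0.70, ood_acc=0.50),
    Checkpoint(step=400, in_domain_acc=0.76, ood_acc=0.45),
]

peak = best_ood_checkpoint(history)
print(f"OOD peak at step {peak.step}; "
      f"drop by final checkpoint: {ood_drop_after_peak(history):.2f}")
```

Selecting on in-domain accuracy alone would pick the last checkpoint here, which is exactly the case the paper's finding warns about.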

RL, applied to a strong SFT checkpoint with aligned rewards, partially reverses this decline. According to the paper, it shifts models toward a better in-domain/OOD trade-off, with the largest gains appearing within the first few RL epochs. Under fixed training budgets, the best configurations combined brief SFT with a larger RL allocation and asymmetric adapter capacity: higher rank for SFT, lower for RL. The paper also finds that continued pre-training, by aligning models with biological language before either stage begins, improves downstream performance and boosts the effectiveness of both subsequent stages.
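The shape of that configuration can be written down as a small planning helper. This is only a sketch under stated assumptions: the budget is measured here in training steps, and the 25% SFT share and the ranks 64 and 16 are placeholder values, none of which are reported in this summary. `StagePlan`, `split_budget`, and `matches_reported_shape` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StagePlan:
    """Hypothetical description of one post-training stage."""
    name: str
    steps: int
    adapter_rank: int


def split_budget(total_steps: int, sft_fraction: float,
                 sft_rank: int, rl_rank: int) -> Tuple[StagePlan, StagePlan]:
    """Split a fixed step budget between SFT and RL."""
    if total_steps < 2:
        raise ValueError("need at least one step per stage")
    if not 0.0 < sft_fraction < 1.0:
        raise ValueError("sft_fraction must be strictly between 0 and 1")
    # Clamp so that both stages receive at least one step.
    sft_steps = max(1, min(total_steps - 1, round(total_steps * sft_fraction)))
    sft = StagePlan("sft", sft_steps, sft_rank)
    rl = StagePlan("rl", total_steps - sft_steps, rl_rank)
    return sft, rl


def matches_reported_shape(sft: StagePlan, rl: StagePlan) -> bool:
    """True if SFT is the shorter stage and has the higher adapter rank."""
    return sft.steps < rl.steps and sft.adapter_rank > rl.adapter_rank


# Placeholder values (assumptions, not figures from the paper).
sft_plan, rl_plan = split_budget(total_steps=10_000, sft_fraction=0.25,
                                 sft_rank=64, rl_rank=16)
print(sft_plan, rl_plan, matches_reported_shape(sft_plan, rl_plan))
```

A check like `matches_reported_shape` only encodes the qualitative pattern (brief SFT, larger RL share, higher SFT rank); the actual split and ranks would need to be tuned per task.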

The main caveat is that these findings are scoped to genomics, transcriptomics, and protein function tasks. The paper does not address whether the same SFT-then-RL dynamic holds in other scientific domains, how sensitive OOD recovery is to the choice of RL reward signal, or how the compute cost compares with conventional long-SFT pipelines.

For teams actively building biology foundation models, the practical risk is that standard in-domain benchmarks will not surface generalization loss until a model meets genuinely novel biology. The paper's core finding -- that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains -- suggests that OOD evaluation should be tracked explicitly at each stage rather than assumed to follow in-domain scores.
