DataEvolver Turns Rejected Samples Into Text-Image Training Data
TL;DR
- Central South University and HKUST propose DataEvolver, a four-agent loop that converts rejected samples into feedback for later data construction rounds.
- At 0.75M scale on PixArt-α, DataEvolver improves OCR-F1 by 85.3% on TextScenesHQ and 35.3% on LongTextBench over the strongest baseline.
- Ablations show removing the Critic or Generator degrades OCR-F1, and the improvements transfer to Show-o2 as a second downstream generator.
Most dataset construction pipelines for text-rich image generation follow the same shape: crawl candidates, apply a filter, freeze the accepted set, throw the rest away. A new paper from Central South University and HKUST, DataEvolver, argues that the rejects are actually the interesting part.
The authors reframe data construction as a closed loop with four agents. A Retriever gathers candidates. A Verifier scores them and records why the rejected ones failed, tagging issues such as OCR errors, semantic mismatches, blur, and layout corruption. A Critic then distills those rejection statistics into natural-language semantic feedback, and a Generator synthesizes new samples to fill under-covered regions. That feedback is carried forward as memory and guides the next round's retrieval queries and generation prompts. Notably, the refinement operates entirely in natural-language policy space rather than through gradient updates to the generator itself.
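The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: the `Sample` fields, the stubbed `retriever` and `generator`, the two failure tags, and the way feedback is appended to the query and prompt are all hypothetical stand-ins chosen to show the control flow.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Sample:
    target_text: str      # text the image is supposed to show (hypothetical field)
    ocr_text: str         # text an OCR model reads back (stubbed here)
    blurry: bool = False


def retriever(query, pool):
    """Stub Retriever: a real one would search for candidates using `query`."""
    return list(pool)


def verifier(sample):
    """Return None to accept, or a failure tag explaining the rejection."""
    if sample.blurry:
        return "blur"
    if sample.ocr_text != sample.target_text:
        return "ocr_error"
    return None


def critic(tags):
    """Distill rejection counts into one line of natural-language feedback."""
    if not tags:
        return "no dominant failure mode"
    tag, count = tags.most_common(1)[0]
    return f"avoid '{tag}' failures ({count} rejects last round)"


def generator(prompt, n):
    """Stub Generator: a real one would synthesize images from `prompt`."""
    return [Sample(f"synthetic {i}", f"synthetic {i}") for i in range(n)]


def evolve(pools, base_query, base_prompt, n_synth=2):
    """Run one round per candidate pool, carrying feedback between rounds."""
    feedback = ""                 # natural-language memory, empty at the start
    dataset, history = [], []
    for pool in pools:
        # The previous round's feedback revises the retrieval query.
        query = base_query if not feedback else f"{base_query}; {feedback}"
        tags = Counter()
        for sample in retriever(query, pool):
            tag = verifier(sample)
            if tag is None:
                dataset.append(sample)
            else:
                tags[tag] += 1    # rejects are counted, not silently dropped
        feedback = critic(tags)
        # The fresh feedback also steers targeted synthesis in this round.
        dataset.extend(generator(f"{base_prompt}; {feedback}", n_synth))
        history.append((query, dict(tags), feedback))
    return dataset, history
```

Note that nothing here updates model weights: the only state that changes between rounds is the `feedback` string, which is the sense in which the refinement lives in natural-language policy space.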
The headline result is a big one. At the 0.75M-sample scale on PixArt-α, DataEvolver improves OCR-F1 over the strongest baseline by 85.3% on TextScenesHQ and 35.3% on LongTextBench. The paper also reports that the improvements transfer to Show-o2 as a second downstream generator, suggesting the benefit is data-side rather than tied to a particular generator. Ablations show that removing either the Critic or the Generator consistently degrades OCR-F1, which the paper takes as evidence that feedback-based policy revision and targeted synthesis both contribute, rather than one carrying the other.
The honest caveat: OCR-F1 rewards legible text rendering, which is exactly what a pipeline built around OCR feedback signals should improve, so the metric is aligned with the training signal by construction. What the paper does not give you is throughput or wall-clock cost for running the four-agent loop, a sense of how many rounds you need before returns flatten, or how the natural-language feedback holds up on non-Latin scripts or handwritten text where OCR itself is noisy.
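To make the metric concrete, here is a minimal sketch of a word-level OCR-F1. The article does not give the paper's exact definition, so this is an assumed, commonly used formulation (the function name `ocr_f1` and the whitespace tokenization are illustrative choices): F1 between the words the image was supposed to show and the words an OCR model reads back.

```python
from collections import Counter


def ocr_f1(target_text, ocr_text):
    """Word-level F1 between the intended text and what OCR reads back.

    Assumed formulation for illustration; the paper's metric may differ.
    """
    target = Counter(target_text.lower().split())
    read = Counter(ocr_text.lower().split())
    # Multiset intersection counts each matching word at most as often
    # as it appears on both sides.
    overlap = sum((target & read).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(read.values())
    recall = overlap / sum(target.values())
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a single misrendered character costs a whole word: `ocr_f1("grand opening sale", "grand 0pening sale")` matches only two of three words. A score like this moves with rendering legibility and with the OCR model's own accuracy, which is why the caveat about noisy OCR on non-Latin or handwritten text matters.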
If the approach generalizes, the interesting downstream effect is on teams sitting on large piles of filter-rejected web scrapes. Those piles have historically been dead weight; the claim here is that they are a training signal you already paid to collect.
Originally reported by huggingface.co
Original headline: HF Paper 'DataEvolver' (CSU + HKUST): Self-Evolving Multi-Agent Pipeline for Text-Rich Image Generation Data Recycles Rejected Samples, Boosts OCR-F1 by 85% on TextScenesHQ Over Static Crawl-Filter-Freeze Baselines