USTC, Tencent Hunyuan unveil 2M-pair Goku video edit dataset
TL;DR
- Goku contains 2 million instruction-aligned video-editing pairs across 10 task classes including camera movement, subject movement, reference edits and 2-to-5 task compositions.
- Goku-Edit fine-tunes a Wan2.2-5B backbone into a dual-branch design with a Qwen3-VL 8B text encoder, RoPE-aligned cross-attention and SpatialCFG.
- On the 1,000-case Goku-Bench, the authors report up to 8% instruction-following gains over open-source baselines, though Goku-Edit still trails Runway and Luma Ray3 on perceptual metrics.
Instruction-based video editing has been stuck on appearance edits, such as swapping an object or changing a colour, while structural moves like relocating a subject or panning the camera have lived almost entirely inside closed-source commercial models. A joint team from the University of Science and Technology of China (USTC) and Tencent Hunyuan is trying to change that with Goku, a 2-million-pair dataset that spans 10 task classes, including camera movement, subject movement, reference edits, and multi-task combinations of two to five operations at once.
The dataset is built by decomposing complex edits into sub-problems and delegating each to a specialist model. Object removal runs through Minimax-Remover, style transfer piggybacks on Flux plus depth-guided VACE, subject motion is generated via Wan2.2, and camera motion uses RecamMaster covering over 20 camera motion patterns. Gemini 2.5 Pro sits over the whole pipeline as an instruction generator and quality judge, and the authors report their progressive filter throws out approximately 88% of synthesised samples before anything makes it to the training set.
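The decompose-and-delegate idea, followed by a multi-stage filter, can be sketched roughly as below. This is a minimal illustration, not the authors' pipeline: the task labels, the `route` and `progressive_filter` helpers, the score fields and the thresholds are all hypothetical, and only the tool names come from the article.

```python
from typing import Callable, Dict, List

# Sub-task -> specialist tool, as named in the article.
# The dictionary keys are hypothetical labels chosen for this sketch.
SPECIALISTS: Dict[str, str] = {
    "object_removal": "Minimax-Remover",
    "style_transfer": "Flux + depth-guided VACE",
    "subject_motion": "Wan2.2",
    "camera_motion": "RecamMaster",
}

Sample = Dict[str, float]


def route(sub_tasks: List[str]) -> List[str]:
    """Return the specialist assigned to each sub-task of a composite edit."""
    return [SPECIALISTS[task] for task in sub_tasks]


def progressive_filter(
    samples: List[Sample], stages: List[Callable[[Sample], bool]]
) -> List[Sample]:
    """Run the checks in order; a sample survives only if it passes every stage."""
    kept = list(samples)
    for stage in stages:
        kept = [s for s in kept if stage(s)]
    return kept


# A two-operation edit is split and each part goes to its own tool.
plan = route(["object_removal", "camera_motion"])

# Hypothetical judge scores and thresholds, for illustration only.
samples = [
    {"instruction_match": 0.9, "visual_quality": 0.8},
    {"instruction_match": 0.4, "visual_quality": 0.9},
    {"instruction_match": 0.95, "visual_quality": 0.3},
]
stages = [
    lambda s: s["instruction_match"] >= 0.7,
    lambda s: s["visual_quality"] >= 0.6,
]
kept = progressive_filter(samples, stages)
print(plan, len(kept))
```

Chaining stages this way means each later check only sees samples that survived the earlier ones, which is one plausible reading of "progressive" and would be consistent with a high overall rejection rate.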
The companion model, Goku-Edit, fine-tunes a pre-trained Wan2.2-5B backbone into a dual-branch design where a main video branch handles appearance and an auxiliary mask branch predicts the edit region. A frozen Qwen3-VL 8B MLLM handles text conditioning, and a new RoPE-aligned spatial cross-attention lets the two branches share coordinates despite operating at different resolutions. On the accompanying Goku-Bench of 1,000 human-verified cases and 7 editing-specific metrics, the authors report up to 8% instruction-following gains over open-source baselines like LucyEdit and Omni-Video, plus higher physical-rule and spatial-relationship scores than Runway and Luma Ray3, though the commercial models still lead on perceptual metrics such as CLIP, MS and AES.
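To make the "shared coordinates at different resolutions" idea more concrete, here is a small one-dimensional sketch of how rotary position embeddings (RoPE) could be aligned across two token grids. It is an assumption-laden illustration, not the paper's formulation: `aligned_coords` and `rope_rotate` are hypothetical helpers, the cell-centre mapping is one possible choice, and a real model would apply this per spatial axis over tensors.

```python
import math
from typing import List


def aligned_coords(n: int, ref_n: int) -> List[float]:
    """Place n token indices on the coordinate axis of a ref_n-token grid,
    using cell centres so both grids cover the same spatial extent."""
    scale = ref_n / n
    return [(i + 0.5) * scale - 0.5 for i in range(n)]


def rope_rotate(vec: List[float], pos: float, base: float = 10000.0) -> List[float]:
    """Apply 1D rotary position embedding to an even-length vector.
    The position may be fractional, which is what the alignment relies on."""
    d = len(vec)
    out: List[float] = []
    for k in range(0, d, 2):
        theta = pos * base ** (-k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# A 2-token low-resolution axis mapped onto a 4-token high-resolution axis.
low_res_positions = aligned_coords(2, 4)   # [0.5, 2.5]
high_res_positions = aligned_coords(4, 4)  # [0.0, 1.0, 2.0, 3.0]

# Attention scores under RoPE depend only on the position difference,
# so a low-res key and a high-res query at the same spatial offset
# interact the same way wherever they sit in the frame.
q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]
score_a = dot(rope_rotate(q, 1.0), rope_rotate(k, 0.5))
score_b = dot(rope_rotate(q, 3.0), rope_rotate(k, 2.5))
print(low_res_positions, high_res_positions, abs(score_a - score_b) < 1e-9)
```

The design point is that once both branches express token positions in one coordinate system, the cross-attention between them needs no extra lookup to know which low-resolution token corresponds to which high-resolution region.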
Take the specifics as reported, not as settled. Wins on a benchmark the authors built themselves deserve scrutiny, and the paper does not disclose licensing terms for the released data, full training compute, or how the 30-participant user study was recruited. What it does open up, if Goku ships under permissive terms, is a much bigger playground for open-source teams to start closing the structural-editing gap with the closed commercial models that have quietly owned this space.
Originally reported by huggingface.co
Original headline: Goku: USTC + Tencent Hunyuan Release 2M-Pair Universal Video-Editing Dataset and Goku-Edit Model, Bench Beats LucyEdit and Rivals Runway Gen-4 on Complex Multi-Task Edits