huggingface.co web signal

GEAR trains VQ tokenizer and AR image model jointly end-to-end

TL;DR

  • GEAR trains the VQ tokenizer and autoregressive generator jointly, end-to-end: a differentiable soft branch guides the tokenizer, while the next-token loss never reaches it.
  • On class-conditional ImageNet 256 at 300 epochs, gFID improves from 8.20 to 6.76 at 775M parameters versus the LlamaGen-REPA baseline.
  • The authors report up to 10x faster ImageNet gFID convergence and gains across VQVAE, LFQ and IBQ tokenizers, plus a Qwen3-1.7B text-to-image variant.

The interesting move in this paper is not another autoregressive image model chasing a benchmark. It is a training-time fix for an awkward part of the standard VQ-plus-AR recipe for image generation: the tokenizer is trained first and frozen, and the autoregressive (AR) generator is then trained on tokens that were chosen with no regard for what an AR model finds easy to predict.

The GEAR paper, from Peking University and Tencent Hunyuan, proposes training the vector-quantized tokenizer and the autoregressive generator jointly, end-to-end, guided by representation alignment. The mechanism is a dual read-out of the codebook assignment: a hard, one-hot branch trains the AR model with next-token prediction (NTP), and a differentiable soft branch carries a representation-alignment loss back to the tokenizer. Straight-through estimators, the naive way to make the whole pipeline differentiable, collapse in this setting, with gFID degrading to around 105. The authors' explanation is that next-token prediction rewards low-entropy token sequences while reconstruction wants a fully used codebook, so letting the NTP gradient reach the tokenizer collapses the codebook onto a few dominant codes.
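The dual read-out can be illustrated with a toy example. This is only a minimal sketch, not the paper's implementation: the summary above does not say how the soft assignment is computed, so the softmax over negative squared distances, the `temperature` parameter, and all function names here are assumptions made for illustration.

```python
import math

def squared_distances(z, codebook):
    """Squared Euclidean distance from one latent vector to every code."""
    return [sum((zi - ci) ** 2 for zi, ci in zip(z, code)) for code in codebook]

def hard_readout(z, codebook):
    """Hard branch: index of the nearest code (a one-hot assignment).
    argmin is not differentiable, so no gradient flows back through it."""
    d = squared_distances(z, codebook)
    return min(range(len(d)), key=d.__getitem__)

def soft_readout(z, codebook, temperature=1.0):
    """Soft branch (assumed form): softmax over negative distances, then a
    weighted average of codes. Every step is differentiable in z and codes."""
    d = squared_distances(z, codebook)
    logits = [-di / temperature for di in d]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(codebook[0])
    soft_code = [sum(w * code[j] for w, code in zip(weights, codebook))
                 for j in range(dim)]
    return weights, soft_code

# Hypothetical 3-entry codebook and one encoder output.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z = [0.9, 0.1]

token = hard_readout(z, codebook)  # discrete token the AR model would see
weights, soft_code = soft_readout(z, codebook, temperature=0.1)
print(token, [round(w, 4) for w in weights])
```

In a real training loop the discrete `token` would feed the next-token loss, while a representation-alignment loss would be computed from `soft_code` and backpropagated into the tokenizer. At low temperature the soft weights concentrate on the same code the hard branch picks, so the two read-outs stay consistent.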

The reported numbers are the headline. On class-conditional ImageNet 256 at 300 epochs, GEAR moves gFID from 20.16 to 16.96 at 111M parameters, from 12.70 to 8.66 at 343M, and from 8.20 to 6.76 at 775M, all versus the LlamaGen-REPA baseline. The paper also claims up to 10x faster ImageNet gFID convergence relative to that baseline. With classifier-free guidance, the best reported scale is 1.5, which gives a gFID of 3.388. The authors report that the approach carries over to LFQ and IBQ tokenizers, not just VQVAE, and they show a text-to-image variant using a Qwen3-1.7B text encoder.
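To put the three gFID pairs on a common footing, the relative reduction at each model size can be computed directly from the figures quoted above. The snippet is plain arithmetic on those numbers; the helper name `relative_reduction` is ours, not the paper's.

```python
# (baseline gFID, GEAR gFID) per parameter count, as quoted in the text.
results = {
    "111M": (20.16, 16.96),
    "343M": (12.70, 8.66),
    "775M": (8.20, 6.76),
}

def relative_reduction(baseline, improved):
    """Fraction by which gFID drops relative to the baseline (lower is better)."""
    return (baseline - improved) / baseline

reductions = {size: relative_reduction(b, g) for size, (b, g) in results.items()}

for size, r in reductions.items():
    print(f"{size}: {r:.1%} lower gFID")
```

This works out to roughly 15.9%, 31.8% and 17.6%, so the relative gain is largest at 343M and not monotonic in model size.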

The honest caveat is that these are gains against one specific baseline family. The paper itself concedes that discrete VQ tokenizers hit a reconstruction ceiling around rFID 1.64, which continuous VAE tokenizers beat at 0.28, and that 16x downsampling fixes the sequence length, whereas diffusion models can decouple compute from it. What the reporting doesn't give you is a head-to-head against non-LlamaGen AR systems, or a wall-clock cost for running the extra soft branch during training.
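The sequence-length point is easy to make concrete. The sketch below assumes square images and one token per spatial position after downsampling; the 512-pixel case is a hypothetical added for comparison and is not a setting taken from the paper.

```python
def token_count(image_size, downsample):
    """Tokens per image for a square image and a spatial downsampling factor,
    assuming one discrete token per latent position."""
    side = image_size // downsample
    return side * side

tokens_256 = token_count(256, 16)  # 16 x 16 latent grid
tokens_512 = token_count(512, 16)  # hypothetical higher resolution, 32 x 32 grid

print(tokens_256, tokens_512)
```

At 256 pixels with 16x downsampling this gives 256 tokens per image, and an AR generator needs one decoding step per token, so the tokenizer's downsampling factor directly sets the sampling cost.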

What is worth watching is the direction. If a jointly trained tokenizer and AR model keeps closing the gap without diffusion's multi-step sampling, teams deploying image generation get a cheaper inference story to plan around, and the frozen-tokenizer convention that has quietly sat under most VQ-based image and video models starts to look optional.