Tencent ViQ Matches Continuous Encoders with Discrete Vision Tokens
TL;DR
- ViQ reaches aggregate scores of 57.2 (Qwen2.5-1.5B backbone) and 63.9 (Qwen2.5-7B backbone) across nine multimodal benchmarks, narrowly ahead of the previous best results of 57.0 and 63.8.
- Because ViQ codes can be extracted offline, forward-pass time improves by 46% to 78% and full training iterations by roughly 20% to 40%, depending on model size and sequence length.
- ViQ stores images at 1/96 of raw file size while preserving enough detail for competitive reconstruction and understanding.
Most multimodal models today hold a quiet contradiction at their core: they process text as discrete tokens but vision as continuous floating-point vectors, then try to reconcile the two inside a language model. A new paper from Tencent HY Vision Team, Tsinghua University, Nanyang Technological University, and the Chinese Academy of Sciences proposes ViQ (Visual Quantized Representations) as a direct fix: turn images into the same kind of discrete codes that text already uses, without sacrificing the semantic fidelity or low-level detail that prior quantized encoders have struggled to preserve.
The approach works in two stages. First, a text-aligned pre-training phase gives the visual encoder semantic supervision from a pretrained language model and adapts it to process images at native resolution rather than a fixed crop size. Second, a feature discretization phase progressively compresses the continuous feature space using what the authors call proximal representation learning, applying an L∞ normalization to constrain the latent space before handing it off to a Finite Scalar Quantization step with a codebook of 64,000 entries. A 2D rotary position embedding encodes spatial layout so the quantization generalizes to arbitrary image dimensions.
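To make the quantization step concrete, here is a minimal sketch of L∞ normalization followed by Finite Scalar Quantization, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the per-dimension level counts `[8, 8, 8, 5, 5, 5]` are an assumed configuration chosen only because their product is 64,000, and the function names are hypothetical. The paper's progressive compression and 2D rotary position embedding are not modeled here.

```python
from math import floor, prod

# Assumed per-dimension level counts (not confirmed by the article);
# 8 * 8 * 8 * 5 * 5 * 5 = 64,000 possible codes.
LEVELS = [8, 8, 8, 5, 5, 5]


def linf_normalize(z, eps=1e-8):
    """Scale z so its largest absolute component is at most 1 (L-infinity norm)."""
    m = max(abs(x) for x in z)
    return [x / max(m, eps) for x in z]


def fsq_encode(z, levels=LEVELS):
    """Quantize one latent vector to a single integer code in [0, prod(levels))."""
    z = linf_normalize(z)
    index = 0
    for x, n in zip(z, levels):
        # Map [-1, 1] onto the grid {0, ..., n - 1} and round to the nearest point.
        digit = floor((x + 1) / 2 * (n - 1) + 0.5)
        digit = min(max(digit, 0), n - 1)
        # Combine the per-dimension digits as a mixed-radix number.
        index = index * n + digit
    return index


def fsq_decode(index, levels=LEVELS):
    """Recover the quantized vector (grid points in [-1, 1]) from a code."""
    digits = []
    for n in reversed(levels):
        digits.append(index % n)
        index //= n
    digits.reverse()
    return [2 * d / (n - 1) - 1 for d, n in zip(digits, levels)]


code = fsq_encode([0.3, -1.2, 0.9, 0.0, 2.4, -0.7])
reconstructed = fsq_decode(code)
```

The point of the design is that there is no learned codebook to look up: the code is determined entirely by rounding each bounded dimension, which is why constraining the latent range before quantization matters.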
The benchmark numbers are competitive. On an aggregated score across nine multimodal tasks spanning visual question answering, world knowledge, OCR, and chart recognition, ViQ with a Qwen2.5-1.5B backbone scores 57.2 versus the previous best of 57.0 among encoders under 6B parameters; with Qwen2.5-7B it reaches 63.9 versus a prior 63.8. Earlier quantized encoders like QLIP and UniTok score 29.7 and 33.0 respectively with the same 1.5B backbone, so the gap ViQ closes over its discrete predecessors is substantial.
The more immediately practical result is training speed. Because ViQ codes can be extracted offline before training begins, the vision encoder does not run during fine-tuning at all. That yields forward-time speedups ranging from 46% to 78% across Qwen2.5 model sizes from 0.5B to 7B, and full-iteration speedups of more than 20% to 40%, depending on sequence length. The codes also compress images to 1/96 of their raw file size, which is aggressive enough that the paper notes a comparable JPEG quality setting would visibly degrade images.
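A quick back-of-the-envelope check shows how a ratio like 1/96 can arise. The only figure taken from the article is the 64,000-entry codebook. The rest is assumption: that "raw" means uncompressed 24-bit RGB, that each code is stored in a whole number of bits, and that one code covers 64 pixels on average (for example an 8x8 patch). The article does not state the actual patch size or storage format.

```python
from math import ceil, log2

CODEBOOK_SIZE = 64_000  # from the article

# Bits needed to store one code index, rounded up to whole bits.
bits_per_code = ceil(log2(CODEBOOK_SIZE))


def compression_ratio(pixels_per_code, bits_per_pixel=24):
    """Raw bits covered by one code divided by the bits used to store it.

    pixels_per_code and bits_per_pixel are illustrative assumptions.
    """
    return (pixels_per_code * bits_per_pixel) / bits_per_code


ratio = compression_ratio(pixels_per_code=64)  # hypothetical 8x8 pixels per code
```

Under those assumptions each code takes 16 bits and replaces 1,536 bits of raw pixel data, a 96-fold reduction. Other combinations of patch size and codes per patch would give the same ratio, so this is a consistency check rather than a description of the paper's setup.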
The honest caveat is that discrete tokenization still costs something on the most detail-intensive tasks. On OCRBench, ViQ trails some continuous encoders with fewer parameters, and the authors attribute this to high-frequency detail loss that is a systemic property of discrete codes rather than a specific flaw in their design. The paper also tests only with LLMs up to 7B and explicitly flags integration with much larger models as an open question. Take the aggregate scores as a proof of concept for the approach, not as a settled claim about production readiness across the full range of document understanding use cases.
Originally reported by huggingface.co
Original headline: ViQ: Tencent HY Vision Team and Tsinghua Introduce Text-Aligned Visual Quantized Representations at Native Resolution for Unified Discrete Multimodal Modeling