MarkTechPost web signal

VibeThinker-3B Scores 94.3 on AIME26, Matching 671B DeepSeek V3.2

small language models, open-weights AI, reasoning benchmarks, efficiency breakthrough

TL;DR

  • Sina Weibo's 3B VibeThinker model scores 94.3 on AIME26, reportedly matching DeepSeek V3.2 at 671 billion parameters.
  • The four-stage Spectrum-to-Signal pipeline includes RL and self-distillation; CLR test-time scaling pushes AIME26 to 97.1.
  • Weights fit in roughly 6GB in BF16, run on a single consumer GPU, and are MIT-licensed on Hugging Face.

Small models hitting frontier math scores have become a recurring 2026 story, and each claim deserves its own check. According to MarkTechPost, Sina Weibo's research team released VibeThinker-3B, a 3-billion-parameter model built on Qwen2.5-Coder-3B that reportedly scores 94.3 on AIME26, matching what DeepSeek V3.2 achieves at 671 billion parameters. The model also reports 89.3 on HMMT25, 76.4 on IMO-AnswerBench, 80.2 on LiveCodeBench v6, and a 96.1% first-attempt acceptance rate on LeetCode contest problems from April and May 2026. Weights are publicly available on Hugging Face under WeiboAI/VibeThinker-3B with an MIT license.

The training approach, which the team calls the Spectrum-to-Signal Post-Training Pipeline, runs through four stages: curriculum-based supervised fine-tuning, multi-domain reinforcement learning, offline self-distillation, and instruction RL. The standout technique is a test-time scaling method called Claim-Level Reliability Assessment, or CLR, which generates multiple solution trajectories and validates individual claims across them without requiring additional parameters. Applied at inference time, CLR reportedly lifts AIME26 to 97.1 and BruMO25 to 99.2.
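The reporting does not spell out how CLR extracts or validates claims, so the following is only a toy illustration of the general idea of scoring trajectories by claim-level agreement, not the team's method. It assumes each trajectory has already been split into claim strings (a hypothetical preprocessing step) and uses cross-trajectory agreement as a stand-in for validation. The function name `select_answer` is invented for this sketch.

```python
from collections import Counter, defaultdict

def select_answer(trajectories):
    """Pick a final answer from (claims, final_answer) pairs.

    Illustrative stand-in only: a claim's reliability is the fraction of
    trajectories that assert it, and a trajectory's weight is the mean
    reliability of its claims. This is NOT the published CLR procedure.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")
    n = len(trajectories)

    # Count how many trajectories assert each claim (once per trajectory).
    support = Counter()
    for claims, _ in trajectories:
        support.update(set(claims))

    # Sum trajectory weights per final answer.
    scores = defaultdict(float)
    for claims, answer in trajectories:
        unique = set(claims)
        if unique:
            reliability = sum(support[c] / n for c in unique) / len(unique)
        else:
            reliability = 0.0
        scores[answer] += reliability

    return max(scores, key=scores.get)

# Hypothetical trajectories for "x + y = 10, x - y = 2, find x".
sample = [
    (["x+y=10", "x-y=2", "x=6"], "6"),
    (["x+y=10", "x-y=2", "x=6"], "6"),
    (["x+y=10", "x-y=4", "x=7"], "7"),  # one miscopied claim leads astray
]
print(select_answer(sample))
```

No extra parameters are involved in this kind of selection, which is consistent with the description above, but every trajectory still has to be generated, so inference cost grows with the number of samples.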

The accessibility case is concrete. The weights come to roughly 6GB in BF16 and run on a single consumer GPU, and the MIT license removes the main deployment barriers for independent researchers and small teams. For anyone currently paying per token for math or coding reasoning, a self-hostable model at this benchmark level is worth testing.
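The 6GB figure follows from simple arithmetic: BF16 stores each parameter in 2 bytes. A minimal sketch, assuming exactly 3 billion parameters (the true count may differ slightly) and counting only the raw weights, not the KV cache or activations:

```python
def weight_footprint_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

# Assumption: exactly 3e9 parameters.
bf16_gb = weight_footprint_gb(3e9)      # BF16: 2 bytes per parameter
fp32_gb = weight_footprint_gb(3e9, 4)   # FP32: 4 bytes per parameter
print(f"BF16: {bf16_gb:.1f} GB, FP32: {fp32_gb:.1f} GB")
```

Actual GPU memory use will be higher once the context and any sampled trajectories are held in memory.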

The honest caveat is that matching DeepSeek V3.2 on AIME26 is a narrow claim. The article notes that VibeThinker-3B trails larger models on knowledge-heavy benchmarks like GPQA-Diamond, and it is explicitly positioned as a specialist in verifiable reasoning rather than general knowledge. The reporting does not address benchmark contamination: if the AIME26 and HMMT25 competition problems were publicly available before the model's training cutoff, the headline scores warrant scrutiny. The CLR inference cost is also unquantified, and generating multiple trajectories per answer could erase the cost advantage over simply calling a larger model.
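To make the cost concern concrete, here is a rough break-even sketch. All numbers are hypothetical placeholders in relative units, since the reporting gives no CLR sample counts or per-token costs. It assumes cost scales linearly with generated tokens and ignores any overhead from the claim validation step itself.

```python
def break_even_trajectories(
    small_cost_per_token: float,
    large_cost_per_token: float,
    small_tokens_per_trajectory: float,
    large_tokens_per_answer: float,
) -> float:
    """Number of small-model trajectories whose total cost equals
    one answer from the large model."""
    small_cost = small_cost_per_token * small_tokens_per_trajectory
    large_cost = large_cost_per_token * large_tokens_per_answer
    return large_cost / small_cost

# Hypothetical: the large model costs 20x more per token, and both
# produce about 8,000 tokens per solution attempt.
k = break_even_trajectories(1.0, 20.0, 8000, 8000)
print(k)
```

Under these made-up inputs, sampling more than 20 trajectories per question would cost more than a single call to the larger model. The real threshold depends on figures the article does not provide.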

The clearest beneficiary is the open-source math and coding research community. A strong 3B reasoning base with a permissive license is useful as a verifier, a reward model component, or a low-cost tutoring layer. Whether the Spectrum-to-Signal pipeline is genuinely novel or a careful assembly of existing techniques is a question the field will work out over the next few months.