interconnects.ai web signal

DeepSeek V4 Flash tops May open-model rankings

deepseek open source google bytedance alibaba ai-models open-source

Key insights

  • DeepSeek-V4-Flash ranked best-in-class for local agentic coding tasks across the full May open-model cohort benchmarked by Lambert.
  • Poolside's Laguna XS.2, a 33B MoE model under Apache 2.0, is the strongest open-weight release at its size class in this cohort.
  • The open-to-closed capability gap stands at 3-7 months and has been narrowing since DeepSeek R1 launched, per CASI data.

Why this matters

Open-weight models capable of agentic coding at 33B parameters, released under permissive licenses, give engineering teams a viable path to production AI systems without API dependency or data-sharing exposure. The narrowing 3-7 month capability gap compresses the pricing window for frontier labs, which restructures competitive dynamics across every AI product category built on closed-model exclusivity. Poolside's methodological work on reward hacking in coding evaluations exposes a measurement gap that currently inflates benchmark scores industry-wide, meaning model selection decisions made today may be based on systematically overestimated performance.

Summary

Five open-weight models released in May now have a side-by-side benchmark comparison from Nathan Lambert at Interconnects. DeepSeek-V4-Flash leads for local agentic coding tasks, while Poolside's Laguna XS.2, a 33B mixture-of-experts model released under Apache 2.0, ranks as the strongest open-weight release at its size class. The open-to-closed capability gap sits at roughly 3-7 months, per Center for AI Standards and Innovation data. That gap has been narrowing since DeepSeek R1, meaning frontier labs have less time to exploit exclusive capability advantages before open alternatives catch up. Essentially: (DeepSeek, Poolside) are compressing the distance between open-weight and proprietary model performance. - Laguna XS.2's Apache 2.0 license makes it commercially deployable without restriction at enterprise scale. - Poolside separately flagged reward hacking in coding evaluations as a methodological problem, a contribution that stands independent of the model release itself. - DeepSeek-V4-Flash's edge in agentic coding positions it as the practical default for developers running models locally. If CASI's 3-7 month gap figure holds and the narrowing trend continues, open models could match current closed-model performance before the end of 2026.

Potential risks and opportunities

Risks

  • If reward hacking inflates coding benchmark scores as Poolside suggests, developers selecting models on those metrics risk deploying underperforming systems into production agentic pipelines before benchmark standards are corrected.
  • Poolside's Apache 2.0 license on Laguna XS.2 enables commercial competitors to fine-tune and redistribute without restriction, potentially commoditizing Poolside's differentiation faster than the company can recoup its research investment.
  • DeepSeek's sustained open-weight leadership increases geopolitical scrutiny risk; enterprise procurement teams at US defense contractors and regulated financial institutions may face compliance barriers to adopting DeepSeek-origin models regardless of benchmark performance.

Opportunities

  • Developers and startups building local-first agentic coding tools can now ship production systems using DeepSeek-V4-Flash without cloud API costs or data exposure, a market segment previously gated by capability gaps.
  • Poolside's methodology work on reward hacking creates a tooling and services opportunity for AI evaluation firms (Scale AI, Braintrust, Weights and Biases) to build benchmark-auditing products before the problem becomes a procurement liability.
  • Apache 2.0 releases like Laguna XS.2 give cloud providers (AWS, Azure, GCP) licensable model inventory to compete on hosted inference pricing without licensing negotiation overhead, accelerating open-model hosting as a product line.

What we don't know yet

  • Whether DeepSeek-V4-Flash's agentic coding lead holds on next-generation SWE-bench variants designed to be harder to game, with no updated benchmark results cited in the piece.
  • Poolside's reward hacking analysis is described as unpublished with no timeline given for peer review or public release, leaving its findings unverifiable by third parties.
  • CASI's 3-7 month gap figure is not decomposed by task type, leaving unclear whether it applies equally to reasoning, coding, and multimodal benchmarks or is driven by one category.