For two years, the architecture argument was framed as a binary: dense transformers versus mixture-of-experts. This week put that question to bed in three different ways at once and replaced it with a more interesting one. The new fault line is not how you compose the model; it is which silicon, which workload, and which phase of its lifecycle you are optimising for. The labs that shipped this week are no longer building one model for one chip. They are building model families for routing fabrics, and the hardware vendors are responding in kind.

The Big Story

Google splits the TPU into two chips at Cloud Next: TPU 8t for training, TPU 8i for inference · 2026-04-22 · [blog.google]
This is the first time a major accelerator vendor has formally bifurcated its roadmap, and the numbers explain why: TPU 8i ships with 288GB HBM and a Collectives Acceleration Engine designed specifically for chain-of-thought serving on large MoE models, claiming roughly 80% better perf-per-dollar than Ironwood for inference. Training and serving have had different cost curves for a while; Google is the first to admit they need different transistors. Expect Nvidia's response — and AWS's Trainium roadmap — to follow the same logic within a generation.


Also This Week

DeepSeek V4-Pro: 1.6T-param MoE with 1M context, Apache 2.0 · 2026-04-24 · [Hugging Face]
The open-weight ceiling moves again — V4-Pro reportedly approaches Claude Opus 4.6 non-thinking on agentic coding, and the smaller V4-Flash (284B) is the more interesting release for anyone actually self-hosting.

Alibaba's Qwen3.6-27B beats its own 397B MoE on SWE-bench Verified · 2026-04-22 · [qwen.ai]
A dense 27B beating a sparse 397B (77.2 vs 76.2) on coding is the loudest signal yet that data quality and post-training now dominate parameter count for narrow capability. Quantised, it fits in roughly 18GB on a single consumer GPU.
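
A quick sanity check on that footprint, as a minimal sketch: the arithmetic below assumes a plain ~4.5-bit effective weight quantisation and a hypothetical layer/head configuration for the KV cache, since the release notes quoted here don't give Qwen's exact numbers. The figures are illustrative, not official.

```python
# Back-of-envelope VRAM estimate for a dense 27B model quantised to ~4 bits.
# Illustrative only: the layer/head counts below are assumed, not Qwen3.6's
# published config, and real usage depends on the quantisation scheme and
# serving framework.

params = 27e9                      # dense parameter count
bits_per_weight = 4.5              # 4-bit weights plus scales/zero-points
weight_gb = params * bits_per_weight / 8 / 1e9

# KV cache for one 16k-token context: assumed 48 layers, 8 KV heads of
# dim 128, fp16 keys and values.
layers, kv_heads, head_dim, ctx = 48, 8, 128, 16_000
kv_gb = layers * 2 * kv_heads * head_dim * ctx * 2 / 1e9

print(f"weights ~{weight_gb:.1f} GB + KV cache ~{kv_gb:.1f} GB "
      f"= ~{weight_gb + kv_gb:.1f} GB")   # lands in the ~18 GB ballpark
```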

Google releases Gemma 4 in four sizes under Apache 2.0 · 2026-04-22 · [blog.google]
Note the lineup: E2B, E4B, 26B MoE, 31B Dense. Google is now shipping both architectures in the same family and letting deployers pick — the strategic admission embedded in TPU 8t/8i, expressed at the model layer.

Tencent open-sources Hy3 Preview (295B MoE, 21B active) and replaces DeepSeek inside Yuanbao · 2026-04-23 · [Decrypt]
The headline is the SWE-bench jump from 53% to 74.4% in one generation, but the under-reported story is Tencent dropping DeepSeek as the default behind its consumer chatbot — Chinese labs are no longer comfortable depending on each other.

Meta signs multi-billion-dollar deal for millions of AWS Graviton5 CPU cores · 2026-04-24 · [TechCrunch]
Agentic workloads are CPU-heavy in ways the GPU-centric narrative misses: tool calls, retrieval, orchestration, and policy gates all pile onto general-purpose silicon. Meta committing to Graviton at this scale is the clearest sign yet that the next infrastructure bottleneck is not FLOPs.
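
To make that concrete, here is a deliberately skeletal sketch of a single agent turn, with stubbed helpers so it runs as-is. Every function name is invented for illustration; the point is the shape: one accelerator-bound model call surrounded by CPU-bound retrieval, parsing, policy checks, and tool I/O.

```python
# One agentic turn, stubbed so it runs standalone. Only call_model() would
# touch an accelerator in a real system; everything else is CPU work that
# scales with tool calls per turn, not with model size. All names are
# hypothetical.
import json

def retrieve(query):                 # CPU: vector search, reranking
    return ["doc-1", "doc-2"]

def call_model(prompt):              # GPU/TPU: the only accelerator-bound step
    return json.dumps([{"tool": "search", "args": {"q": "graviton5 pricing"}}])

def policy_allows(call):             # CPU: policy gates, rate limits
    return call["tool"] in {"search"}

def execute_tool(call):              # CPU: HTTP calls, DB queries, sandboxes
    return f"results for {call['args']['q']}"

def run_turn(user_msg):
    context = retrieve(user_msg)                   # CPU
    prompt = f"{user_msg}\n\ncontext: {context}"   # CPU: templating
    reply = call_model(prompt)                     # accelerator
    outputs = []
    for call in json.loads(reply):                 # CPU: parsing, validation
        if policy_allows(call):
            outputs.append(execute_tool(call))     # CPU
    return outputs

print(run_turn("how big is Meta's Graviton5 commitment?"))
```

Scale that loop to billions of turns and the retrieval, policy, and tool steps are where Graviton-class cores earn their keep.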


From the Lab

SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning · [arxiv.org/abs/2604.19048]
Submitted April 21, SAMoRA routes between LoRA adapters using semantic features of the input rather than learned gating tokens, and reports gains on multi-task benchmarks at a fraction of full-MoE training cost. The interesting bit for practitioners: it suggests the cheap way to get MoE-style benefits on a dense backbone is to expert-ise the adapters, not the FFN — a recipe friendly to anyone fine-tuning Qwen3.6-27B or Gemma 4 31B Dense this weekend.
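
I haven't reproduced the paper, but the pattern it describes is easy to sketch. The toy module below is a generic reconstruction, not SAMoRA's exact method: a frozen linear layer carries a bank of LoRA adapters, and a router scores a mean-pooled embedding of the input against per-expert keys instead of using learned gating tokens. Dimensions, top-k gating, and initialisation are all my own assumptions.

```python
# Toy sketch of semantic routing over LoRA adapters on a frozen linear layer.
# Generic reconstruction of the idea, not the SAMoRA paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticLoRARouting(nn.Module):
    def __init__(self, d_model=64, n_experts=4, rank=8, top_k=2):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        for p in self.base.parameters():
            p.requires_grad_(False)                       # frozen backbone layer
        self.lora_a = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(n_experts, rank, d_model))
        self.expert_keys = nn.Parameter(torch.randn(n_experts, d_model))
        self.top_k = top_k

    def forward(self, x):                                 # x: (batch, seq, d_model)
        sem = x.mean(dim=1)                               # pooled "semantic" feature
        scores = sem @ self.expert_keys.T                 # (batch, n_experts)
        top = scores.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(scores).scatter(
            -1, top.indices, F.softmax(top.values, dim=-1))
        # Mix each expert's low-rank update (x @ A_e @ B_e) by its gate weight.
        delta = torch.einsum("bsd,edr,erk,be->bsk",
                             x, self.lora_a, self.lora_b, gates)
        return self.base(x) + delta

layer = SemanticLoRARouting()
print(layer(torch.randn(2, 16, 64)).shape)                # torch.Size([2, 16, 64])
```

Fine-tuning would update only the adapters and the expert keys, which is what keeps the cost a fraction of training a full MoE.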

State of Open Source on Hugging Face: Spring 2026 · [huggingface.co/blog]
Hugging Face's spring report puts a number on what the model releases this week confirm: mean downloaded open-model size went from 827M params in 2023 to 20.8B in 2025, almost entirely on the back of quantisation and MoE. The corollary is that "open-source AI" now means "I can run the weights at home" for only a vanishingly small slice of users; most consumption is happening on rented inference, which is exactly the workload Google's TPU 8i is built for.


The labs spent two years arguing about architecture. The next two will be spent arguing about which architecture goes on which chip — and whoever owns the routing layer between them owns the margin.

— Alexis