Open models got scary good, the inference-hardware arms race escalated, and the ecosystem quietly doubled.
This was the week open-source ML stopped being "good enough" and started being genuinely frontier-competitive. Google dropped Gemma 4 under Apache 2.0, MLCommons published the most ambitious MLPerf Inference benchmark round ever, and Hugging Face's spring census confirmed what practitioners already felt: the open ecosystem has doubled in a year, and the center of gravity is shifting east. If you deploy models for a living, every story below affects your stack.
The Big Story
Google Releases Gemma 4: Four Open Models Under Apache 2.0 · April 2, 2026 · Google Blog Google DeepMind shipped four variants -- E2B (2.3B effective), E4B (4.5B), 26B MoE (4B active), and 31B dense -- all under a fully permissive Apache 2.0 license, a first for the Gemma family. The 31B model supports 256K context, native vision and audio, fluency in 140+ languages, and scores 85.7% on GPQA Diamond and 80.0% on LiveCodeBench v6. Architecture-wise, the dense model keeps Gemma 3's hybrid sliding-window + GQA attention with added QK/V normalization and softcapping, while the MoE variants use separate expert blocks alongside standard MLP layers rather than the DeepSeek-style replacement pattern.
→ The Apache 2.0 pivot is the real story. Gemma 3's custom license was a friction point for startups building products on top of it. Combined with immediate support across vLLM, Ollama, and llama.cpp, Google is clearly betting that permissive licensing plus competitive benchmarks will win the derivative-model war -- where Alibaba's Qwen currently dominates with 113K+ forks on Hugging Face.
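The two attention tricks mentioned above, QK normalization and logit softcapping, are simple to sketch. Here is a minimal single-head illustration in NumPy -- the cap value of 50, the RMS-norm placement, and the tensor shapes are assumptions for clarity, not Gemma 4's actual configuration:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMS-normalize along the head dimension (QK-norm style).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def softcap(logits, cap=50.0):
    # Softcapping squashes attention logits into (-cap, cap) via
    # cap * tanh(logits / cap), preventing extreme scores.
    return cap * np.tanh(logits / cap)

def attention_scores(q, k, cap=50.0):
    # q, k: (seq, head_dim). Normalize queries/keys, cap the logits,
    # then softmax over keys.
    q, k = rms_norm(q), rms_norm(k)
    logits = softcap(q @ k.T / np.sqrt(q.shape[-1]), cap)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

The practical point: capped logits keep softmax inputs bounded no matter how large the raw dot products get, which stabilizes long-context training.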
Also This Week
MLPerf Inference v6.0: NVIDIA Hits 2.49M Tokens/Sec on DeepSeek-R1 · April 1 · MLCommons The largest system ever submitted -- 288 NVIDIA GPUs across 72 nodes -- achieved 2.49M tokens/sec on DeepSeek-R1 offline. Software optimizations alone delivered 2.7x throughput gains on the same hardware vs. six months ago, cutting per-token cost by 60%. Twenty-four organizations submitted; multi-node systems jumped 30%.
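The throughput-to-cost relationship is straightforward arithmetic: if the same hardware serves 2.7x the tokens, per-token cost falls to 1/2.7 ≈ 37% of baseline, a ~63% reduction, consistent with the ~60% figure. A quick check:

```python
speedup = 2.7
cost_ratio = 1 / speedup    # per-token cost relative to six months ago
reduction = 1 - cost_ratio  # fractional cost reduction
print(f"per-token cost: {cost_ratio:.0%} of baseline "
      f"({reduction:.0%} reduction)")
# → per-token cost: 37% of baseline (63% reduction)
```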
AMD MI355X Crosses 1M Tokens/Sec · April 1 · AMD Blog In single-node head-to-heads (8 GPUs), MI355X hit 92-119% of NVIDIA B300 performance depending on model and scenario, with FP4 quantization driving a 4.4x offline improvement on Llama 2 70B over prior rounds. The interactive mode result -- 104% of B300 -- signals real competition for latency-sensitive workloads.
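FP4's gains come largely from packing weights into 4 bits. A toy symmetric integer quantizer shows the memory math -- note this is a plain int4 scheme for illustration, not the floating-point FP4 encoding the benchmark actually uses:

```python
import numpy as np

def quantize_int4(w):
    # Symmetric per-tensor 4-bit quantization: map weights to
    # integers in [-7, 7] with a single scale factor.
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=1024).astype(np.float32)
q, s = quantize_int4(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"memory: {w.nbytes}B fp32 -> {len(q) // 2}B packed 4-bit; "
      f"max abs error {err:.3f}")
```

An 8x memory shrink per weight means more of the model stays resident in fast memory, which is where the offline-throughput gains come from.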
Hugging Face: State of Open Source, Spring 2026 · April 4 · Hugging Face Blog The platform hit 13M users and 2M+ models. Independent developers now account for 39% of downloads (up from 17%), while industry share fell from 70% to 37%. Robotics datasets exploded from 1,145 to 26,991 in one year, jumping from rank #44 to #1 dataset category.
PyTorch/XLA 2.7 Ships JAX Bridge and Ragged Paged Attention · April 2026 · PyTorch Blog The experimental JAX Bridge lets you call `jax.experimental.shard_alike` and other JAX functions directly inside PyTorch/XLA graphs. The new Pallas-based ragged paged attention kernel delivers up to 5x speedup over padded attention for variable-length sequences on Llama 3 8B, plus GPU CI is back with CUDA 12.6 support.
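The payoff from ragged attention is easy to quantify: padding a batch to its longest sequence wastes compute on empty slots. A toy calculation (the batch lengths here are invented for illustration):

```python
def padded_waste(seq_lens):
    # Fraction of token slots wasted when a batch is padded
    # to its longest sequence -- the work ragged attention skips.
    max_len = max(seq_lens)
    total = max_len * len(seq_lens)
    used = sum(seq_lens)
    return 1 - used / total

# A skewed batch: one long sequence forces heavy padding.
print(f"{padded_waste([128, 2048, 64, 256]):.0%} of slots are padding")
# → 70% of slots are padding
```

The more skewed the length distribution, the closer a ragged kernel gets to its headline speedup over the padded baseline.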
State of MLOps Newsletter Highlights Kubernetes GPU Scheduling · April 6 · Substack This week's roundup flagged KAI Scheduler (open-source Kubernetes GPU scheduling), Google's five strategies for efficient LLM inference, and a comparative benchmark of 10 embedding models for RAG from Zilliz.
From the Lab
"Smoothing the Landscape: Causal Structure Learning via Diffusion Denoising Objectives" · CLeaR 2026 · arXiv cs.LG Accepted to the 5th Conference on Causal Learning and Reasoning. Uses diffusion model denoising objectives to smooth the combinatorial landscape of causal graph search, making structure learning more tractable on high-dimensional observational data.
"Deconfounding Scores and Representation Learning for Causal Effect Estimation with Weak Overlap" · AISTATS 2026 · arXiv stat.ML Proposes learned deconfounding scores that maintain valid causal inference even when treatment and control groups have poor covariate overlap -- a persistent headache in observational studies. Accepted at AISTATS 2026.
Worth Reading
The 2026 open-model landscape has a new shape: permissive licenses, multimodal by default, and a derivative ecosystem that matters more than any single benchmark score. If your inference stack hasn't changed in six months, it's already behind.