The transformer's grip loosens as state space models, edge architectures, and hardware-aware kernels all advance in the same week.
Three parallel shifts defined deep learning this week. Google released Gemma 4 with both dense and MoE architectures under Apache 2.0, giving practitioners real alternatives to proprietary frontier models. Princeton published Mamba-3, an inference-first state space model that challenges transformer dominance on sequence tasks. And MLPerf Inference v6.0 results dropped, revealing that hardware optimization — not just model architecture — is where the next performance gains live.
Watch & Listen First
- Simon Willison on agentic patterns and tool use — Practical deep dive into how models actually use tools in production. Relevant to anyone building inference pipelines.
- Weights & Biases: "State Space Models in 2026" — Walkthrough of Mamba-3 and why SSMs are gaining ground against attention-based architectures for long sequences.
Key Takeaways
- Gemma 4 makes open-weight multimodal competitive. Four model sizes (2B to 31B), 256K context, Apache 2.0 — the first open family that handles text, image, and audio with frontier-level quality per parameter.
- Mamba-3 proves SSMs aren't just a curiosity. 1.8-point accuracy gain over Gated DeltaNet at 1.5B scale, with half the state size of Mamba-2 — inference efficiency is now a design-time priority, not an afterthought.
- FlashAttention-4 hits 1,605 TFLOPS on Blackwell. 71% utilization, 3.6x faster forward passes than FA2 at 32K sequence length. The bottleneck has moved from tensor cores to SFUs (special function units) and shared memory bandwidth.
- MLPerf Inference v6.0 adds five new benchmarks. Text-to-video, GPT-OSS 120B, DLRMv3, vision-language, and YOLOv11 — NVIDIA Blackwell Ultra leads throughput, AMD MI355X competitive at cluster scale.
- PyTorch 2.7 ships FlexAttention and Context Parallel. 3,262 commits from 457 contributors — prologue fusion, Intel GPU compile support on Windows, and a unified attention API replace scattered backends.
The Big Story
Google Releases Gemma 4: Four Open Models Under Apache 2.0 With Native Multimodal Support · April 2, 2026 · Google Blog
→ Gemma 4 ships in four sizes — E2B and E4B use dense architectures with Per-Layer Embeddings (PLE) for edge deployment, a 26B MoE variant activates only 4B parameters per forward pass, and the 31B dense model currently ranks #3 on Arena AI's text leaderboard. The architecture choice matters: PLE lets the smallest models run on Raspberry Pi while the MoE version matches much larger dense models at a fraction of the compute. For practitioners, this is the first open model family where you can go from prototyping on a phone to deploying a production-grade 31B model without changing frameworks or licenses.
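The MoE variant's economics come from sparse routing: each token activates only a few experts, so compute per forward pass scales with the active subset, not the total parameter count. A minimal top-k gating sketch in plain Python (the router and experts here are random toys for illustration, not Gemma 4's actual design):

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def make_linear(W):
    """A toy 'expert': a plain linear layer over a list-vector."""
    return lambda x: [sum(w * xi for w, xi in zip(row, x)) for row in W]

def moe_forward(x, experts, router_weights, k=2):
    """Score every expert with the router, run only the top-k,
    and return the gate-weighted sum of their outputs."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    gates = softmax(logits)
    topk = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in topk)  # renormalize over selected experts
    out = [0.0] * len(x)
    for i in topk:
        y = experts[i](x)
        for j in range(len(x)):
            out[j] += (gates[i] / norm) * y[j]
    return out, topk

# 8 toy experts, only 2 active per token: 3/4 of expert compute is skipped
random.seed(0)
dim, n_exp = 4, 8
experts = [make_linear([[random.gauss(0, 1) for _ in range(dim)]
                        for _ in range(dim)]) for _ in range(n_exp)]
router = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_exp)]
y, active = moe_forward([1.0, -0.5, 0.3, 0.8], experts, router, k=2)
print(len(active))  # → 2
```

Production routers add load-balancing losses and capacity limits, but the cost structure is the same: parameter count grows with the expert pool while per-token FLOPs grow only with k.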
Also This Week
Mamba-3 Published: Inference-First State Space Model Outperforms Gated DeltaNet at 1.5B Scale · April 2026 · Princeton PLI Blog
→ Complex-valued SSM dynamics and MIMO updates give Mamba-3 better state tracking with half the state size of Mamba-2 — if you're still defaulting to transformers for all sequence tasks, benchmark this.
MLPerf Inference v6.0 Results: 24 Organizations Submit, Five New Benchmarks Added · April 1, 2026 · MLCommons
→ NVIDIA Blackwell Ultra claims highest throughput and lowest token cost; AMD MI355X delivers over 1M tokens/sec at cluster scale — the hardware competition is finally real.
PyTorch 2.7 Released: FlexAttention, Context Parallel, and Prologue Fusion · April 2026 · PyTorch Blog
→ Context Parallel API enables scaled_dot_product_attention calls to automatically run with context parallelism across Flash, Efficient, and cuDNN backends — one less thing to manually configure in distributed training.
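Context parallelism works because softmax attention over a long sequence can be computed chunk by chunk and merged exactly, carrying a running score max and normalizer. A self-contained sketch of that merge for a single query in plain Python (PyTorch's actual API does this per-backend with sharded KV on separate ranks):

```python
import math
import random

def attend(q, ks, vs):
    """Naive softmax attention for one query over all keys/values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(vs[0])
    return [sum(w[i] * vs[i][j] for i in range(len(vs))) / z for j in range(dim)]

def attend_chunked(q, key_chunks, value_chunks):
    """Process each (K, V) chunk independently, keeping a running
    (max, normalizer, weighted sum) that is rescaled whenever a chunk
    raises the score max — the trick that lets chunks live on
    different devices and still give the exact answer."""
    m, z, acc = -math.inf, 0.0, None
    for ks, vs in zip(key_chunks, value_chunks):
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new) if acc is not None else 0.0
        w = [math.exp(s - m_new) for s in scores]
        z = z * scale + sum(w)
        dim = len(vs[0])
        part = [sum(w[i] * vs[i][j] for i in range(len(vs))) for j in range(dim)]
        acc = [a * scale + p for a, p in zip(acc, part)] if acc else part
        m = m_new
    return [a / z for a in acc]

random.seed(1)
dim, n = 4, 8
q = [random.gauss(0, 1) for _ in range(dim)]
ks = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
vs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
full = attend(q, ks, vs)
merged = attend_chunked(q, [ks[:4], ks[4:]], [vs[:4], vs[4:]])
assert all(abs(a - b) < 1e-9 for a, b in zip(full, merged))
```

This is the same online-softmax identity FlashAttention uses within one GPU; context parallelism just applies it across devices.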
FlashAttention-4 Achieves 1,605 TFLOPS on NVIDIA B200 With Tensor Memory Co-Design · March 2026 · Princeton AI Research Blog
→ The key insight is that Blackwell's bottleneck is the SFUs that compute softmax exponentials, not the tensor cores — FA4 uses dedicated 256KB Tensor Memory per SM to bypass shared memory bandwidth limits entirely.
Google Launches LiteRT-LM: Open-Source Edge LLM Inference Framework · April 2026 · AIToolly
→ Below 1B parameters, architecture matters more than size — LiteRT-LM makes it practical to deploy quantized models on mobile with sub-100ms latency.
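Sub-1B edge deployment leans heavily on weight quantization, which cuts memory (and bandwidth) 4x going from fp32 to int8. A minimal symmetric absmax sketch in plain Python (illustrative only; production schemes like those in LiteRT-LM use per-channel or group-wise scales and calibration):

```python
import random

def quantize_absmax(weights, bits=8):
    """Symmetric per-tensor quantization: map floats onto integers in
    [-(2^(b-1)-1), 2^(b-1)-1] with a single scale factor."""
    qmax = 2 ** (bits - 1) - 1           # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

random.seed(2)
w = [random.gauss(0, 0.1) for _ in range(1024)]
q, s = quantize_absmax(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
assert all(-127 <= qi <= 127 for qi in q)
assert max_err <= s / 2 + 1e-12  # rounding error is at most half a step
print(f"max abs error: {max_err:.5f}")
```

Techniques like GPTQ, AWQ, and SmoothQuant improve on this baseline by choosing scales (or rotating activations) to minimize the error that actually matters for model outputs, not just per-weight error.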
From the Lab
Mamba-3: Improved Sequence Modeling using State Space Principles · OpenReview
→ Three innovations stack: exponential-trapezoidal discretization increases SSM expressivity, complex-valued dynamics enable new state-tracking capabilities, and MIMO updates improve accuracy without impacting decode latency. At 1.5B parameters, the MIMO variant gains 1.8 points average downstream accuracy over Gated DeltaNet while matching Mamba-2 perplexity at half the state size. The shift from training-time to inference-time optimization as the design priority marks a philosophical turn for the SSM community.
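The SSM duality the paper builds on is easy to see in miniature: a diagonal linear recurrence can run as a cheap sequential scan at inference time or unroll into a convolution for parallel training, and both produce identical outputs. A toy real-valued, single-input single-output sketch (Mamba-3 itself uses complex-valued, input-dependent, MIMO dynamics):

```python
def ssm_scan(a, b, c, xs):
    """Sequential (inference-mode) scan of a diagonal linear SSM:
    h_t = a * h_{t-1} + b * x_t,  y_t = <c, h_t>."""
    n = len(a)
    h = [0.0] * n
    ys = []
    for x in xs:
        h = [a[i] * h[i] + b[i] * x for i in range(n)]
        ys.append(sum(c[i] * h[i] for i in range(n)))
    return ys

def ssm_conv(a, b, c, xs):
    """Same map unrolled as a convolution with kernel k_j = <c, a^j * b>,
    the parallelizable training-mode view."""
    T, n = len(xs), len(a)
    kernel = [sum(c[i] * (a[i] ** j) * b[i] for i in range(n)) for j in range(T)]
    return [sum(kernel[t - s] * xs[s] for s in range(t + 1)) for t in range(T)]

a = [0.9, 0.5, -0.3]   # diagonal state transition (|a_i| < 1 for stability)
b = [1.0, 0.2, 0.7]    # input projection
c = [0.4, 1.0, -0.6]   # output projection
xs = [1.0, 0.0, -2.0, 3.0, 0.5]
y_scan, y_conv = ssm_scan(a, b, c, xs), ssm_conv(a, b, c, xs)
assert all(abs(p - q) < 1e-9 for p, q in zip(y_scan, y_conv))
```

The scan side of this duality is why "inference-first" is a natural design stance for SSMs: decode cost is O(state size) per token regardless of sequence length, which is exactly the quantity Mamba-3 halves relative to Mamba-2.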
FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling · Tri Dao
→ FA4 reaches 71% FP16 utilization on B200 by co-designing the algorithm around hardware asymmetries — tensor core throughput scaled 2.25x from H100 to B200 but SFU count and shared memory bandwidth didn't. The solution pipelines matmul and non-matmul operations to overlap compute with memory access, using Blackwell's Tensor Memory to store intermediates. Backward pass is 3.15x faster than FA2. This is now the default attention backend in vLLM, SGLang, and Hugging Face Transformers.
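The asymmetry argument reduces to simple arithmetic: an attention tile does roughly 4·Br·Bc·d matmul FLOPs but only Br·Bc exponentials, so whether softmax dominates depends on the ratio of matmul to SFU throughput. A back-of-envelope sketch, with hypothetical placeholder throughput numbers rather than measured B200 specs:

```python
def tile_times(br, bc, d, matmul_tflops, exp_gops):
    """Rough time to process one attention tile: the two matmuls
    (QK^T and PV, 2*Br*Bc*d FLOPs each) versus the Br*Bc exponentials
    in the softmax. Throughputs are inputs, not hardware facts."""
    matmul_flops = 4 * br * bc * d
    exp_ops = br * bc
    t_matmul = matmul_flops / (matmul_tflops * 1e12)
    t_exp = exp_ops / (exp_gops * 1e9)
    return t_matmul, t_exp

# Hypothetical accelerator: 2000 TFLOPS FP16 matmul, 500 Gexp/s SFU throughput
t_mm, t_exp = tile_times(br=128, bc=128, d=128, matmul_tflops=2000, exp_gops=500)
print(t_exp > t_mm)  # → True: with these assumed numbers, exp() dominates
```

When the exp term dominates, the only way to recover utilization is to overlap it with the matmuls rather than serialize the two, which is exactly the pipelining FA4 builds its kernel around.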
PyTorch/XLA 2.7: JAX Bridge and GPU Build · PyTorch Blog
→ Experimental JAX-in-PyTorch integration lets you call jax.experimental.shard_alike inside PyTorch/XLA graphs — useful if you have JAX sharding logic you don't want to rewrite. GSPMD workflow integration makes this practical for production distributed training.
Worth Reading
- How Edge DL Becomes Reality: Your Guide to On-Device AI — Practical overview of quantization techniques (GPTQ, AWQ, SmoothQuant) that make sub-1B models viable on mobile hardware.
- We Reverse-Engineered Flash Attention 4 — Modal's team walks through FA4's kernel pipelining from scratch — essential reading if you're writing custom CUDA kernels for Blackwell.
- The AI Research Landscape in 2026: From Agentic AI to Embodiment — Broad survey of where deep learning research is consolidating versus diverging — good context for deciding which bets to make.
The most important chart in deep learning right now isn't a loss curve — it's the gap between tensor core TFLOPS and everything else on the die.