A week dominated by ICLR and a sudden cluster of trillion-parameter open weights produced an unusually clean signal: the dominant architecture is not just empirically strong, it is provably more compact than its alternatives. That changes the framing of every "post-transformer" debate. Meanwhile, the frontier of efficient long-context attention quietly shifted again — KV caches are being treated less like a memory bottleneck and more like a compression problem with optimal rates.
The Big Story
Transformers are Inherently Succinct (ICLR 2026 Outstanding Paper) · April 23 · [arxiv:2510.19315]
→ Bergsträßer, Cotterell, and Lin formalize what practitioners have been observing for a decade: transformers can encode formal languages exponentially more succinctly than RNNs and modern state-space models, and doubly-exponentially more succinctly than finite automata. The paper recasts "expressive power" not as which functions a model class can represent, but as how compactly it can represent them — and proves, as a direct consequence, that verifying transformer properties is EXPSPACE-complete. For a field still arguing about whether attention is fundamental or incidental, this is the first separation result that actually bites.
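In schematic form, the two separations read as follows. The notation here is ours, not the paper's: $|T|$, $|R|$, $|A|$ stand for the size (under the paper's own size measures) of the smallest transformer, RNN/SSM, and DFA recognizing a language, and the exact language family is the one constructed in the paper.

```latex
% Schematic reading of the separation results (notation ours):
% there exists a family of formal languages (L_n) such that
\exists\,(L_n)_{n\in\mathbb{N}}:\quad
  \min_{T\,\text{rec.}\,L_n}|T| \le \mathrm{poly}(n),\qquad
  \min_{R\,\text{rec.}\,L_n}|R| \ge 2^{\Omega(n)},\qquad
  \min_{A\,\text{rec.}\,L_n}|A| \ge 2^{2^{\Omega(n)}}
```

Succinctness, not computability, is doing the separating: all three model classes can recognize the languages, but only the transformer does so at polynomial size.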
Also This Week
Mamba-3: Improved Sequence Modeling using State Space Principles (ICLR 2026 Oral) · April 23 · [arxiv:2603.15569]
→ Princeton's Goomba Lab introduces complex-valued state updates and a MIMO formulation that closes much of the state-tracking gap with attention; at 1.5B scale, Mamba-3 matches Mamba-2 perplexity with half the state size, and the MIMO variant adds 1.8 points of downstream accuracy without touching decode latency.
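The flavor of a complex-valued diagonal state update can be sketched in a few lines. Everything below (shapes, parameterization, the absence of discretization and input-dependent selectivity) is a toy illustration, not Mamba-3's actual block:

```python
import numpy as np

def complex_ssm_scan(u, log_mag, phase, B, C):
    """Toy diagonal SSM scan with a complex-valued state, in the spirit
    of complex state updates (illustrative only; Mamba-3's
    parameterization is not reproduced here).

    u:       (T, d_in)  input sequence
    log_mag: (n,)       log-magnitudes of the diagonal transition (<= 0 for stability)
    phase:   (n,)       rotation angles; eigenvalues a = exp(log_mag + i*phase)
    B:       (n, d_in)  input projection
    C:       (d_out, n) output projection (real part taken at readout)
    """
    a = np.exp(log_mag + 1j * phase)      # complex eigenvalues inside the unit disk
    x = np.zeros(a.shape, dtype=np.complex128)
    ys = []
    for t in range(u.shape[0]):
        x = a * x + B @ u[t]              # elementwise complex recurrence
        ys.append((C @ x).real)           # real-valued readout
    return np.stack(ys)
```

A nonzero phase rotates the state each step, which is what lets a decaying linear recurrence track periodic and modular structure that purely real eigenvalues forget.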
DeepSeek-V4-Pro Technical Report — 1.6T MoE, 1M context, 10% KV cache · April 24 · [Hugging Face]
→ The architecture combines Compressed Sparse Attention with Heavily Compressed Attention, hitting 27% of V3.2's per-token FLOPs and 10% of its KV cache at 1M context; equally notable is FP4 training for the MoE experts while the rest of the network trains in FP8, plus Manifold-Constrained Hyper-Connections — the first frontier model where long-context efficiency is a first-class architectural objective rather than a post-hoc patch.
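To see why the 10% figure matters at 1M context, it helps to run the KV-cache arithmetic. The layer/head/dimension numbers below are hypothetical stand-ins for a large multi-head baseline; only the 10% ratio comes from the report:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt=2):
    """Naive per-sequence KV-cache footprint: one K and one V vector
    per layer, per KV head, per token (factor of 2 is for K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

# Hypothetical baseline shapes (NOT DeepSeek's actual config):
full = kv_cache_bytes(n_layers=61, n_kv_heads=128, head_dim=128,
                      seq_len=1_000_000)
compressed = int(full * 0.10)  # the report's ~10%-of-V3.2 KV figure
print(f"full:       {full / 2**30:.0f} GiB")
print(f"compressed: {compressed / 2**30:.0f} GiB")
```

Even with generous compression elsewhere, an uncompressed multi-head cache at a million tokens runs to terabytes per sequence, which is why the cache is increasingly treated as a compression problem rather than an allocation problem.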
The Art of Scaling Test-Time Compute · April 21 · [arxiv:2512.02008]
→ A 30-billion-token sweep across eight open models (7B-235B) and four reasoning datasets establishes three empirically clean trends: no single TTS strategy dominates, reasoning models show distinct trace-quality regimes by problem difficulty, and within a model family the optimal TTS curve is monotonic in compute. The closest thing the field now has to a "Chinchilla for inference."
Predicting and Improving Test-Time Scaling Laws via Reward Tail-Guided Search · April · [arxiv:2602.01485]
→ Estimating the tail of the reward distribution lets you forecast scaling behavior without exhaustive evaluation, and the proposed Scaling-Law Guided search reallocates compute toward intermediate states with the highest predicted potential — a step toward making test-time compute itself a learned policy rather than a hyperparameter.
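Under a deliberately simplified i.i.d. reading, tail estimation makes the best-of-n curve forecastable in closed form. The paper's estimator and search procedure are considerably richer; this sketch only shows why knowing a single tail probability pins down the whole scaling curve:

```python
def p_success_best_of_n(p_single, n):
    """P(at least one of n i.i.d. samples succeeds) = 1 - (1 - p)^n.
    Estimating the tail probability p_single from a few samples lets you
    forecast the best-of-n curve without running the full sweep
    (an i.i.d. simplification; not the paper's estimator)."""
    return 1.0 - (1.0 - p_single) ** n

def smallest_useful_n(p_single, eps=0.01, n_max=1024):
    """Smallest n at which the next doubling of samples buys less than
    eps additional success probability: a crude stand-in for forecasting
    where the scaling curve flattens and reallocating compute elsewhere."""
    n = 1
    while n < n_max:
        gain = p_success_best_of_n(p_single, 2 * n) - p_success_best_of_n(p_single, n)
        if gain < eps:
            return n
        n *= 2
    return n_max
```

The point of the forecast is exactly this kind of early stopping: once the predicted curve flattens, further samples on that problem are wasted compute.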
Gemma 4 Technical Overview (E2B / E4B / 26B MoE / 31B Dense) · April 22 · [blog.google]
→ Per-Layer Embeddings on the small variants, alternating local sliding-window and global full-context attention, and shared KV cache between global layers; the 31B Dense and 26B MoE both ship with 256K context via proportional RoPE on the global layers — a near-complete catalog of the efficiency tricks the open ecosystem has converged on in the last 12 months.
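The alternating local/global pattern reduces to a per-layer mask schedule. Window size and layer cadence below are illustrative assumptions, not Gemma 4's published configuration:

```python
import numpy as np

def attention_mask(seq_len, layer_idx, window, global_every=6):
    """Causal attention mask for an alternating local/global layer
    schedule (illustrative; the window and the 1-in-6 cadence are
    assumptions). Returns True where query i may attend to key j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if layer_idx % global_every == global_every - 1:
        return causal                    # global full-context layer
    return causal & (i - j < window)     # local sliding-window layer
```

Only the global layers need a full-length KV cache; the local layers can evict anything older than the window, which is where most of the long-context savings come from.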
From the Lab
Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models · [arxiv:2603.19183]
→ SAEs trained on Vision-Language-Action hidden states yield monosemantic features that act as a steering basis for robot policies — pushing mechanistic interpretability out of pure-text territory and into the embodied-agent stack, where most of the deployment risk now lives. Combined with the protein-LM SAE work in PNAS this month, sparse dictionaries are quietly becoming the default lens for any transformer trained on a non-language modality.
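The steering recipe itself is simple once a dictionary is trained. This untrained-weights sketch only fixes the interface (encode a hidden state to sparse features, add a decoder direction back) and makes no claims about the paper's architecture or training:

```python
import numpy as np

class SparseAutoencoder:
    """Minimal ReLU SAE used as a steering basis (interface sketch;
    weights here are random, and real SAEs are trained to reconstruct
    hidden states under a sparsity penalty)."""
    def __init__(self, d_model, d_dict, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0.0, 0.02, (d_dict, d_model))
        self.b_enc = np.zeros(d_dict)
        self.W_dec = rng.normal(0.0, 0.02, (d_model, d_dict))

    def features(self, h):
        # sparse, non-negative feature activations for hidden state h
        return np.maximum(self.W_enc @ h + self.b_enc, 0.0)

    def steer(self, h, feature_idx, alpha):
        # add alpha units of one decoder direction to the hidden state
        return h + alpha * self.W_dec[:, feature_idx]
```

For a VLA policy, `steer` would be applied to the hidden state at a chosen layer during rollout, so a monosemantic feature becomes a control knob on behavior.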
Routing Mamba: Scaling SSMs with Mixture-of-Experts Projection · [Microsoft Research]
→ Sparse linear-projection experts inside Mamba layers; the interesting result is not the raw quality lift but that sparse routing composes cleanly with SSMs at all, which had not been obvious. The post-attention efficient-architecture stack is starting to look more like a lattice than a tree.
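The core mechanism, sparse routing over linear projections, fits in a few lines. Top-1 routing and these shapes are our simplification for illustration, not Routing Mamba's configuration (which also needs load balancing and top-k routing to train well):

```python
import numpy as np

def routed_projection(x, router_W, experts):
    """Top-1 routed linear projection (sketch of sparse projection
    experts inside a sequence-model block; illustrative only).
    x: (d_in,); router_W: (n_experts, d_in); experts: list of (d_out, d_in)."""
    logits = router_W @ x
    k = int(np.argmax(logits))          # pick one expert per token
    return experts[k] @ x, k
```

Only one expert's projection is computed per token, so parameter count scales with the number of experts while per-token FLOPs stay flat; the nontrivial finding is that this composes with the SSM recurrence at all.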
Worth Reading
- [On the Limits of Sparse Autoencoders: A Theoretical Framework and Reweighted Remedy] — The first rigorous account of when SAE recovery actually fails (ground-truth features that aren't extremely sparse), plus a weighted-SAE remedy. Required reading before you trust any feature-attribution claim.
- [Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks] — Disentangles memorization and reasoning regimes under MoE sparsity; the core finding — optimal sparsity is jointly determined by active FLOPs and tokens-per-parameter — should reshape how labs pick activation ratios for their next pretrain.
Two separation results, one MoE-SSM hybrid, and a 1.6T model that fits a million tokens in 10% of yesterday's KV cache. The post-transformer era keeps arriving on transformer terms.
— Alexis