NVIDIA's first CPU ships, a fresh KV-cache paper lands, and the week's real race was cheaper tokens, not bigger models.
Not one frontier lab kicked off a headline training run this week; the action moved entirely to the inference layer. NVIDIA hand-delivered its first custom CPU to four customers, a new arXiv preprint went after the KV-cache memory wall, and Alibaba shipped an agent model that ran autonomously for 35 hours in an internal test. The through-line is economics: nearly every release this week optimized cost-per-token and memory footprint, not parameter count.
Watch & Listen First
AI-Native Healthcare: How Abridge Built the Clinical Intelligence Layer · May 14 · Latent Space
→ Chai Asawa and Janie Lee go deep on the unglamorous ML production stack — in-house models, specialty-specific evals, de-identification, and real-time agents — across 100M+ medical conversations; the best available case study on shipping ML into a high-stakes regulated domain.
The Modern AI Drone Tech Stack & the Economics of Autonomy · May 18 · Latent Space
→ Yaroslav Azhnyuk of The Fourth Law breaks down terminal-guidance CV running on ~$400 hardware and a five-level autonomy framework — a vivid lesson in edge inference under brutal latency, power, and cost constraints.
Key Takeaways
- Inference is the new capex. Custom CPUs, NVL72 racks, and faster serving runtimes are where the spend is going — training-cluster announcements went quiet.
- The KV cache is the bottleneck everyone is attacking. Long-context and multimodal workloads made KV memory the dominant serving cost; sub-4-bit quantization is now a live research front.
- Agent workloads are dictating hardware design. Vera's core layout and Qwen 3.7-Max's 35-hour runs are both architected around long-horizon, branchy, tool-calling loops — not throughput benchmarks.
- Photonics is inching toward the datacenter. Femtojoule-scale optical switching is still lab-stage, but the energy math is getting hard to ignore.
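To see why KV memory dominates long-context serving, a back-of-the-envelope calculation helps. The sketch below assumes a standard transformer cache (one key and one value vector per layer, per KV head, per token) and a hypothetical model shape that is not taken from any release covered here. It also ignores the scale and zero-point metadata that real quantized caches carry, so treat the low-bit figures as lower bounds.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV-cache size in bytes for a decoder-only transformer."""
    # Factor of 2: one key and one value vector per token, per layer, per KV head.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8


# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131_072)

for bits in (16, 4, 2):
    gib = kv_cache_bytes(bits_per_value=bits, **SHAPE) / 2**30
    print(f"{bits:>2}-bit KV cache: {gib:.1f} GiB per sequence")
```

Under these assumptions a single 128K-token sequence needs 16 GiB of cache at 16 bits, 4 GiB at 4 bits, and 2 GiB at 2 bits, which is why each bit shaved off the cache translates fairly directly into longer context or more concurrent sequences per accelerator.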
The Big Story
NVIDIA hand-delivers its first custom CPU, and it is built for agents · May 18, 2026 · Wccftech
→ Vera is NVIDIA's first in-house CPU, with 88 custom "Olympus" cores, 1.2 TB/s of memory bandwidth, and a claimed ~50% per-core speedup, and the first units went by hand to Anthropic, OpenAI, SpaceXAI, and Oracle. The pitch is the Vera Rubin NVL72: roughly 1/10 the cost per token on agentic inference, agent sandboxes 50% faster, and enterprise queries up to 3x faster. The significance is co-design: the CPU is tuned for the small, branchy, tool-calling work of agent loops rather than HPC throughput, and Oracle's plan to deploy hundreds of thousands of Vera chips this year signals that the serving-fleet buildout is now the main event.
Also This Week
Alibaba's Qwen 3.7-Max runs autonomously for 35 hours, paired with its own silicon · May 20, 2026 · TechNode
→ In an internal test the model chained 1,000+ tool calls to optimize a kernel for a ~10x inference speedup and scored 57 on the Artificial Analysis Intelligence Index — a signal that long-horizon agent reliability, not raw benchmark deltas, is the new product axis.
Gemini 3.5 Flash reaches general availability at a fraction of frontier pricing · May 20, 2026 · BuildFastWithAI
→ At $1.50 input / $9 output per 1M tokens, with a 1M-token context and 76.2% on Terminal-Bench 2.1, it offers near-frontier quality at Flash economics, which sharply lowers the cost of running multi-step agent loops in production.
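For a rough sense of what that pricing means for an agent loop, here is a minimal cost sketch. It assumes the quoted $1.50 and $9 are input and output prices per 1M tokens respectively, that every step re-sends the full conversation so far, and that no prompt-caching discount applies. The function name and the workload numbers (50 steps, 2,000-token base prompt, 1,000 tokens of tool output and 500 generated tokens per step) are hypothetical.

```python
def agent_loop_cost(steps, base_prompt_tokens, tool_result_tokens,
                    output_tokens_per_step,
                    input_price_per_m=1.50, output_price_per_m=9.00):
    """Estimate tokens and dollars for a loop that re-sends its full history each step."""
    input_tokens = 0
    context = base_prompt_tokens
    for _ in range(steps):
        # Each call pays for the whole accumulated context as input.
        input_tokens += context
        # The model's output and the tool result are appended for the next call.
        context += output_tokens_per_step + tool_result_tokens
    output_tokens = steps * output_tokens_per_step
    cost = (input_tokens / 1e6) * input_price_per_m \
         + (output_tokens / 1e6) * output_price_per_m
    return input_tokens, output_tokens, cost


tokens_in, tokens_out, dollars = agent_loop_cost(
    steps=50, base_prompt_tokens=2_000,
    tool_result_tokens=1_000, output_tokens_per_step=500,
)
print(f"input={tokens_in:,} output={tokens_out:,} cost=${dollars:.2f}")
```

With these illustrative numbers the loop consumes about 1.94M input tokens and 25K output tokens, for roughly $3.13. Nearly all of that is re-sent context rather than generation, so in this kind of workload the input price and any caching discount matter more than the output price.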
Penn physicists switch optical signals at femtojoule energy, hinting at light-based AI chips · May 18, 2026 · ScienceDaily
→ Bo Zhen's group coupled light into a nanocavity to form exciton-polaritons that interact strongly enough to compute, demonstrating all-optical switching at ~4 femtojoules — orders of magnitude below electronic logic, and a real, if early, path past the energy ceiling on large-model inference.
From the Lab
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond · arXiv:2605.19660
→ The paper names the real obstacle to aggressive KV-cache quantization — Token Norm Imbalance, where shared quantization parameters spanning norm-disparate token groups amplify error. Its fix (Omni-Scaled Canalized Rotation, plus dedicated CUDA kernels) pushes the accuracy-efficiency Pareto front past prior work like ICLR's TurboQuant. If you serve long-context or multimodal models, sub-4-bit KV with negligible quality loss directly buys you longer context and lower memory cost.
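The failure mode is easy to reproduce in a toy setting. The sketch below is not the paper's method (it uses no rotation and no custom kernels); it only illustrates, with plain symmetric absmax quantization on synthetic data, how a single high-norm token can wreck a quantization scale shared across a group. The data, the 50x outlier, the 3-bit width, and the helper names are all assumptions made for illustration.

```python
import numpy as np


def quantize_dequantize(x, bits, axis=None):
    """Symmetric absmax quantization followed by dequantization.

    axis=None uses one scale for the whole array; axis=1 uses one scale per row.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero inputs
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale


def relative_error(reference, approx):
    return np.linalg.norm(reference - approx) / np.linalg.norm(reference)


rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 128))  # 64 synthetic tokens, 128 dims each
keys[0] *= 50.0                        # one token with a much larger norm

shared = quantize_dequantize(keys, bits=3)             # one scale for all tokens
per_token = quantize_dequantize(keys, bits=3, axis=1)  # one scale per token

# Measure the damage on the ordinary tokens only (rows 1 onward).
err_shared = relative_error(keys[1:], shared[1:])
err_per_token = relative_error(keys[1:], per_token[1:])
print(f"shared scale:    {err_shared:.3f}")
print(f"per-token scale: {err_per_token:.3f}")
```

Because the outlier sets the shared scale, the ordinary tokens are pushed into the lowest quantization levels and lose most of their information, while per-token scales keep the error much smaller. Per-token scaling is just the simplest contrast here; the paper's own remedy is the rotation-based scheme named above.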
LLM Pretraining Shapes a Generalizable Manifold: Cross-Modal Transfer to Time Series · arXiv:2605.20449
→ A quieter result worth your time: language pretraining appears to carve out a low-dimensional, generalizable manifold that transfers to non-text sequential data, helping explain why LLM backbones fine-tune so well on forecasting and sensor streams. The practical read — if you're modeling time series, a pretrained transformer is a defensible starting point, not a hack.
Worth Reading
- New AI Models May 2026: The Frontier Took a Breath, Architecture Took the Stage — Documents the exact pivot from scaling to efficiency and architecture that defined this week.
- State of AI: May 2026 — Air Street's monthly synthesis of where compute, research, and deployment actually stand — practitioner-grade, not press-release-grade.
- State of Open Source on Hugging Face: Spring 2026 — Hard numbers on how fast open-weight models are closing the gap — essential for any build-vs-buy call.
The compute bill stopped chasing bigger models and started chasing cheaper tokens — and squeezing the KV cache is its own kind of progress.