magazine.sebastianraschka.com via Reddit

Raschka Maps KV Cache and Attention Compression in LLMs

inference open source ai-research

Key insights

  • MLA, deployed in DeepSeek V3/V4 and GLM-5, compresses KV cache by projecting keys and values into a shared low-rank latent space.
  • The mHC modification targets the residual connection path in transformer blocks, not the attention heads themselves.
  • Compressed Sparse Attention limits which token pairs interact, reducing quadratic memory growth at long context lengths.
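
To give a feel for why caching one shared low-rank latent vector instead of per-head keys and values saves memory, here is a minimal back-of-the-envelope sketch. All dimensions (`n_layers`, `n_heads`, `head_dim`, `latent_dim`, `seq_len`) are hypothetical and not taken from DeepSeek, GLM-5, or the survey, and the latent formula is a simplification that ignores details such as how positional information is handled.

```python
def kv_cache_bytes_full(n_layers, n_heads, head_dim, seq_len, bytes_per_value=2):
    """Standard multi-head attention: one key and one value vector
    per head, per layer, per cached token."""
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_value


def kv_cache_bytes_latent(n_layers, latent_dim, seq_len, bytes_per_value=2):
    """Latent-style cache (simplified): one shared low-rank vector per layer
    per cached token; keys and values are re-projected from it when needed."""
    return n_layers * latent_dim * seq_len * bytes_per_value


# Hypothetical configuration, chosen only to make the arithmetic easy to follow.
n_layers, n_heads, head_dim = 32, 32, 128
latent_dim = 512          # assumed latent width, much smaller than n_heads * head_dim
seq_len = 131072          # 128K tokens of context

full = kv_cache_bytes_full(n_layers, n_heads, head_dim, seq_len)
latent = kv_cache_bytes_latent(n_layers, latent_dim, seq_len)

print(f"full KV cache:   {full / 2**30:.1f} GiB")
print(f"latent KV cache: {latent / 2**30:.1f} GiB")
print(f"reduction:       {full / latent:.0f}x")
```

Under these assumed numbers, the saving comes entirely from the ratio between `2 * n_heads * head_dim` and `latent_dim`; real models will land elsewhere depending on their actual dimensions.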

Why this matters

Inference efficiency is now the primary competitive surface for LLM deployment, and MLA is already running in production in DeepSeek V4-Pro. Teams building on these models need to understand the architectural trade-offs to tune their serving infrastructure correctly. Raschka's consolidation closes a real gap: the relevant advances appeared across separate papers with inconsistent terminology, so forming a coherent mental model meant doing the synthesis yourself. Founders and technical leaders evaluating which model families to build on in 2026 now have a single reference that maps how each architectural choice affects memory, latency, and context-window costs.

Summary

Sebastian Raschka's Ahead of AI newsletter consolidates three LLM architectural advances that had been scattered across recent paper releases: Multi-Head Latent Attention (MLA), Manifold-Constrained Hyper-Connections (mHC), and Compressed Sparse Attention. MLA, deployed in DeepSeek V3, DeepSeek V4, and GLM-5, compresses the KV cache by routing keys and values through a shared low-rank latent space, which cuts memory overhead at long context. mHC operates on the residual connection path rather than on the attention heads. Compressed Sparse Attention limits which token pairs attend to each other, addressing quadratic memory scaling at long context lengths.

In short, DeepSeek and GLM-5 are the production deployments showing that these techniques work at scale, and Raschka's piece is the first single-source reference pulling the fragmented paper drops into one map:

  • MLA in DeepSeek V3/V4 and GLM-5 is the most deployment-validated KV cache compression technique in the survey.
  • mHC modifies the residual connection path, not the attention mechanism itself, making it architecturally distinct from prior compression work.
  • Compressed Sparse Attention is the primary lever for reducing long-context memory overhead without full attention rewrites.

The survey traces a single architectural lineage from GPT-2 through DeepSeek V4-Pro, giving inference engineers a concrete map for choosing between efficiency trade-offs in current production stacks.
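
To illustrate how restricting which token pairs interact changes the growth rate, the sketch below counts attended (query, key) pairs under full causal attention and under a simple sliding window. The sliding window is a generic stand-in chosen for illustration; it is not the Compressed Sparse Attention algorithm covered in the survey, and the function names and the `window` parameter are hypothetical.

```python
def full_causal_pairs(seq_len):
    """Full causal attention: each query attends to itself and every earlier token,
    so the pair count grows quadratically with seq_len."""
    return seq_len * (seq_len + 1) // 2


def windowed_causal_pairs(seq_len, window):
    """Sliding-window causal attention: each query attends to at most `window`
    of the most recent tokens (itself included), so growth is roughly linear."""
    return sum(min(i + 1, window) for i in range(seq_len))


def windowed_mask(seq_len, window):
    """Boolean mask where mask[q][k] is True if query q may attend to key k."""
    return [[0 <= q - k < window for k in range(seq_len)] for q in range(seq_len)]


# Small mask printed as a grid: 'x' marks an allowed (query, key) pair.
for row in windowed_mask(seq_len=8, window=3):
    print("".join("x" if allowed else "." for allowed in row))

# Pair counts at a few illustrative sequence lengths.
for n in (1024, 4096, 16384):
    dense = full_causal_pairs(n)
    sparse = windowed_causal_pairs(n, window=512)
    print(f"seq_len={n}: full={dense}, windowed={sparse}, ratio={dense / sparse:.1f}x")
```

The gap between the two counts widens as the sequence grows, which is the general reason sparse patterns matter most at long context. How much accuracy a given pattern costs is a separate question that this sketch does not address.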

Potential risks and opportunities

Risks

  • Teams that build serving infrastructure optimized around MLA's KV cache layout could face costly refactors if DeepSeek V5 shifts architectures, since the DeepSeek team has not publicly committed to roadmap continuity.
  • Compressed Sparse Attention's accuracy trade-offs at very long contexts remain unquantified in production settings, creating latent quality risks for teams deploying retrieval-heavy workloads on these architectures in the next 6 months.
  • mHC's residual-path modification, if adopted broadly, could break existing model surgery and quantization tooling (llama.cpp, vLLM, AWQ) before maintainers ship compatibility updates, delaying local-inference adoption.

Opportunities

  • Inference serving vendors (vLLM, SGLang, TensorRT-LLM) can capture enterprise adoption by shipping optimized MLA KV cache management for DeepSeek V4-Pro before competitors do.
  • Hardware vendors with HBM roadmaps (SK Hynix, Micron) can use growing practitioner awareness of KV cache as a memory bottleneck to accelerate enterprise memory upgrade conversations in Q3 2026.
  • Quantization and fine-tuning tool maintainers (Unsloth, Hugging Face Transformers, llama.cpp) who add native mHC and Compressed Sparse Attention support early will capture the r/LocalLLaMA and r/MachineLearning communities already engaged with this survey.

What we don't know yet

  • Quantitative KV cache memory reduction percentages for MLA across DeepSeek V3, V4, and GLM-5 are not compared head-to-head in the survey.
  • Whether mHC has been adopted in any production model, or remains at the academic paper stage as of May 2026, is not addressed.
  • Trade-off benchmarks between Compressed Sparse Attention and full attention at specific context lengths (e.g., 128K, 1M tokens) are absent from the piece.