huggingface.co web signal

InfoKV Adds Token Entropy to Outperform Attention-Only KV Cache

TL;DR

  • InfoKV combines token entropy and attention scores to outperform attention-only KV cache compression on long-context reasoning benchmarks.
  • High-entropy tokens show substantially stronger influence on distant future contexts than tokens selected by attention scores alone.
  • On IFEval, retaining 25% of the KV cache with InfoKV surpassed the full-cache baseline for DeepSeek-R1-Distill-Llama-8B.

Standard KV cache compression decides which tokens to keep by looking at attention scores, that is, by which tokens the recent context has attended to most. The intuition seems sound, but a paper from LUMIA Lab at Shanghai Jiao Tong University and the University of Edinburgh exposes a persistent gap: attention scores capture what matters nearby but routinely discard tokens that turn out to matter a great deal for distant future context during long reasoning chains.

The key concept the authors introduce is "Forward Influence," a metric measuring how much removing a given token from the KV cache shifts the model's future predictive distributions. Their analysis, conducted with Llama-3.1-8B-Instruct on documents from Arxiv-Summarization, finds that tokens with high predictive uncertainty (high entropy) have substantially stronger long-range forward influence than tokens selected by attention alone. Content words -- nouns, adjectives, conjugated verbs -- tend to cluster at high entropy. Function words -- conjunctions, determiners, prepositions -- tend to cluster at low entropy. Attention-based selection, according to the paper, is biased toward the function-word end of that spectrum when compressing long sequences.
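As a rough illustration of the Forward Influence idea, the sketch below compares a model's future next-token distributions with and without a given token in the cache. The choice of KL divergence, the averaging over future positions, and the function names are assumptions made for this example; the summary above does not say which divergence the paper uses.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    # eps guards against log(0) when q assigns zero mass to an outcome.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def forward_influence(full_dists, ablated_dists):
    """Mean shift in future predictions caused by evicting one token.

    full_dists:    next-token distributions at future positions, full cache.
    ablated_dists: the same positions, with the token removed from the cache.
    Both are assumed to be precomputed by running the model twice.
    """
    if len(full_dists) != len(ablated_dists):
        raise ValueError("both runs must cover the same future positions")
    if not full_dists:
        return 0.0
    total = sum(kl_divergence(p, q) for p, q in zip(full_dists, ablated_dists))
    return total / len(full_dists)
```

A token whose removal leaves every future distribution unchanged scores 0; larger values mean the token shapes predictions further along in the sequence.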

The proposed framework, InfoKV, combines three signals per token: its standard attention score, its predictive entropy (computed over the top-k most probable tokens rather than the full vocabulary, which the authors find more stable), and a layer-wise measure of how much the token's hidden representation changed from an early layer to the final layer. That third component lets each transformer layer weigh token importance slightly differently, rather than applying a single global entropy score uniformly.
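The three signals can be sketched as follows. This is a minimal illustration, not the paper's method: the top-k entropy follows the description above, but the min-max normalization, the blend weight `alpha`, and the way the per-token representation shift scales the entropy term are all assumptions made here, since the summary does not give the exact combination rule.

```python
import math


def topk_entropy(probs, k):
    """Shannon entropy (nats) over the k most probable tokens, renormalized."""
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    if total <= 0:
        return 0.0
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_scores(attention, entropy, rep_shift, alpha=0.5):
    """Hypothetical blend of attention and shift-weighted entropy per token.

    rep_shift[i] stands in for the layer-wise measure of how much token i's
    hidden representation changed; its use as a multiplier is an assumption.
    """
    a, e = min_max(attention), min_max(entropy)
    return [alpha * ai + (1 - alpha) * si * ei
            for ai, ei, si in zip(a, e, rep_shift)]


def select_tokens(scores, budget):
    """Indices of the highest-scoring tokens that fit the cache budget."""
    keep = max(1, int(len(scores) * budget))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

With `alpha=1.0` the blend reduces to attention-only selection, which makes it easy to compare the two policies on the same inputs.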

Experiments on the LongReason benchmark with Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct at cache budgets of 40% and 20% show InfoKV consistently outperforming SnapKV, PyramidKV, and Expected Attention across context lengths from 16k to 64k tokens, with the advantage growing as sequence length increases. On long decoding tasks -- IFEval, AIME 2024, and LiveCodeBench -- using DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, InfoKV again outperforms the attention-only baseline. The authors report that on IFEval, retaining only 25% or 12.5% of the KV cache with InfoKV surpasses the full-cache baseline for DeepSeek-R1-Distill-Llama-8B. They attribute this to long reasoning trajectories containing substantial redundancy, so that keeping every token can introduce distracting context.
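To put the cache budgets in concrete terms, here is a back-of-the-envelope estimate of KV cache size per sequence. The model dimensions (32 layers, 8 key-value heads, head dimension 128, 16-bit values) are assumed from the publicly released Llama-3.1-8B configuration rather than taken from the paper, "64k" is read as 64 * 1024 tokens, and both helper functions are hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV cache size in bytes for one sequence."""
    # The leading factor of 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def retained_tokens(num_tokens, budget):
    """Number of tokens kept under a fractional cache budget."""
    return int(num_tokens * budget)


if __name__ == "__main__":
    context = 64 * 1024  # assumed reading of "64k" tokens
    for budget in (1.0, 0.4, 0.2):
        kept = retained_tokens(context, budget)
        gib = kv_cache_bytes(kept) / 2**30
        print(f"budget={budget:.0%} tokens={kept} cache={gib:.2f} GiB")
```

Under these assumptions the full cache at 64k tokens comes to 8 GiB per sequence, so a 20% budget holds roughly 1.6 GiB.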

The main caveat is that the code has not been released, so independent reproduction is not yet possible, and the paper does not quantify how much latency the entropy computation adds to inference compared with attention-only methods. Results are also specific to a handful of benchmarks and model families. The paper also does not address whether these gains hold at larger model scales or in retrieval-heavy task distributions. The forward-looking case is clear enough: if entropy-aware selection generalizes, any team running long-context or chain-of-thought-heavy workloads on fixed GPU memory budgets has a practical lever to improve quality while reducing memory pressure.