
KV Cache Marketplace Cuts Agent Prefill Cost 49x

By Alexis Dufresne. Published June 13, 2026 at 14:08 UTC; updated June 13, 2026 at 14:20 UTC.

inference, agents, rag, economics

Key insights

On Qwen3-4B, reusing a precomputed KV cache reproduces the full-prefill output exactly: 24 of 24 greedily decoded tokens match, with no accuracy degradation.
Serving one 3,774-token document to 80 million agents costs 49.7x less via cache reuse than repeated prefill, at $0.03M versus $1.5M.
The paper identifies lossless KV compression and a cross-party payment layer as the two main unsolved engineering problems blocking adoption.

Why this matters

AI agents running RAG or long-context workflows against shared corpora pay prefill compute costs that scale linearly with the number of requests. The paper quantifies that waste at $1.5M for repeated prefill versus $0.03M for cache reuse, for a single popular document served to 80 million agents. The 'agent-native prefill CDN' framing suggests inference providers and content publishers would need contractual and billing infrastructure that does not yet exist in any production system. If the KV egress problem is solved, the unit economics of inference shift from compute-at-query-time to compute-once-distribute-many, changing how inference providers price access to shared corpora.
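The headline arithmetic can be checked with a short sketch. It uses only the rounded totals quoted in this article; the per-agent and per-token figures are derived here for illustration and are not stated in the paper, and the linear cost function is a simplifying assumption.

```python
# Figures as quoted in this article (dollar totals are rounded).
AGENTS = 80_000_000
DOC_TOKENS = 3_774
PREFILL_TOTAL_USD = 1_500_000   # repeated prefill for every agent
REUSE_TOTAL_USD = 30_000        # cache reuse ("$0.03M")

# Derived values (not stated in the paper; computed from the totals above).
prefill_per_agent = PREFILL_TOTAL_USD / AGENTS
reuse_per_agent = REUSE_TOTAL_USD / AGENTS
implied_usd_per_million_tokens = prefill_per_agent / DOC_TOKENS * 1_000_000
ratio_from_rounded_totals = PREFILL_TOTAL_USD / REUSE_TOTAL_USD


def repeated_prefill_cost(n_agents: int) -> float:
    """Hypothetical linear model: every agent pays for a full prefill."""
    return n_agents * prefill_per_agent


print(f"prefill per agent:        ${prefill_per_agent:.5f}")
print(f"reuse per agent:          ${reuse_per_agent:.6f}")
print(f"implied $ per 1M tokens:  {implied_usd_per_million_tokens:.2f}")
print(f"ratio from rounded totals: {ratio_from_rounded_totals:.1f}x")
```

The rounded totals give a ratio of 50x rather than the reported 49.7x; the gap is presumably because $0.03M is itself a rounded figure.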

Summary

Luoyuan Zhang proposes a marketplace in which publishers precompute KV caches for their documents and sell access to AI agents, letting agents skip the prefill step for that content entirely. In the paper's example, serving one 3,774-token document to 80 million agents costs $1.5M in repeated prefill versus $0.03M with cache reuse, a 49.7x reduction. On Qwen3-4B, output from the reused cache is token-exact: 24 of 24 greedily decoded tokens match, with no accuracy cost. In short, the paper treats popular documents as purchasable compute assets:

- Reuse runs 9-50x cheaper than prefill, with savings growing with document length.
- KV caches are "nearly incompressible," which makes egress cost a challenge.
- Lossless compression and a cross-party payment layer remain unsolved.

The paper names this model an "agent-native prefill CDN."
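A toy sketch can show why reusing a prefix cache is exact rather than approximate. It is not the paper's implementation and does not involve Qwen3-4B: it is a single attention layer in pure Python, and every function name, dimension, and token ID is hypothetical. The document's keys and values depend only on the document tokens and their positions, so a publisher can compute them once and an agent only has to process its own query tokens.

```python
import math

D = 4  # toy hidden size (hypothetical)


def embed(token: int, pos: int) -> list[float]:
    """Deterministic toy embedding that depends on token ID and position."""
    return [math.sin(token * (i + 1) + 0.1 * pos) for i in range(D)]


def project(x: list[float], seed: int) -> list[float]:
    """Fixed toy linear projection; the seed selects the K, V, or Q matrix."""
    return [sum(math.cos(seed + i * D + j) * x[j] for j in range(D))
            for i in range(D)]


def kv_for(tokens: list[int], start_pos: int = 0):
    """Compute (key, value) pairs for tokens placed at start_pos onward."""
    out = []
    for n, tok in enumerate(tokens):
        x = embed(tok, start_pos + n)
        out.append((project(x, 1), project(x, 2)))
    return out


def attend(q, kv):
    """Softmax attention of one query vector over cached keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k, _ in kv]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, (_, v) in zip(weights, kv)) / total
            for i in range(D)]


doc = [11, 42, 7, 99, 3]   # shared document tokens
query = [5, 17]            # one agent's own tokens
last_pos = len(doc) + len(query) - 1
q_vec = project(embed(query[-1], last_pos), 3)

# Baseline: the agent prefills document and query together.
out_full = attend(q_vec, kv_for(doc + query))

# Reuse: the publisher computes doc_kv once; the agent only adds its query.
doc_kv = kv_for(doc)
out_reuse = attend(q_vec, doc_kv + kv_for(query, start_pos=len(doc)))

print(out_full == out_reuse)
print(f"tokens processed by agent: {len(query)} instead of {len(doc) + len(query)}")
```

In this sketch the query tokens must be placed at positions after the document, which is why `kv_for` takes a `start_pos`. A real multi-layer model has the same property for a causal prefix, but the toy does not demonstrate that on its own.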

Potential risks and opportunities

Risks

Inference providers who precompute and distribute KV caches take on new invalidation risk: if cached document content changes or is flagged, the stale cache must be invalidated and replaced across millions of agent clients.
KV cache egress costs at 80-million-agent scale could erode the 49.7x compute savings if network transfer fees approach even a fraction of prefill costs, a figure the paper explicitly leaves unquantified.
The cross-party payment layer for KV cache transactions does not exist today, meaning the proposed market cannot function without coordination between inference providers and content publishers that may take years to standardize.
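The egress risk can be put in rough numbers with a back-of-the-envelope sketch. Everything about cache size here is an assumption, not a figure from the paper: the Qwen3-4B attention shape (36 layers, 8 KV heads, head dimension 128) should be checked against the published model config, the cache is assumed to be stored uncompressed at 16 bits per value, and the full cache is assumed to be shipped to every agent. The sketch also assumes the $0.03M reuse figure excludes transfer fees.

```python
# ASSUMED Qwen3-4B attention shape (not from the article; verify against the model config).
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumes an uncompressed 16-bit KV cache

# Figures quoted in this article (dollar totals are rounded).
DOC_TOKENS = 3_774
AGENTS = 80_000_000
PREFILL_TOTAL_USD = 1_500_000
REUSE_TOTAL_USD = 30_000

# Keys and values (factor of 2) for every layer and KV head.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
kv_bytes_per_doc = kv_bytes_per_token * DOC_TOKENS
kv_gb_per_doc = kv_bytes_per_doc / 1e9

# Egress price at which shipping the cache cancels the compute savings.
savings_per_agent = (PREFILL_TOTAL_USD - REUSE_TOTAL_USD) / AGENTS
break_even_usd_per_gb = savings_per_agent / kv_gb_per_doc


def net_savings_ratio(egress_usd_per_gb: float) -> float:
    """Prefill cost divided by reuse-plus-egress cost, per agent (hypothetical model)."""
    reuse_per_agent = REUSE_TOTAL_USD / AGENTS + egress_usd_per_gb * kv_gb_per_doc
    return (PREFILL_TOTAL_USD / AGENTS) / reuse_per_agent


print(f"KV cache per document: {kv_gb_per_doc:.3f} GB")
print(f"break-even egress price: ${break_even_usd_per_gb:.4f} per GB")
for price in (0.0, 0.001, 0.01):
    print(f"egress ${price}/GB -> {net_savings_ratio(price):.1f}x savings")
```

Under these assumptions the savings ratio is highly sensitive to the per-gigabyte transfer price, which is consistent with the paper's emphasis on lossless KV compression as an open problem.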

Opportunities

CDN infrastructure providers are positioned to offer KV cache distribution as an inference edge service once the egress compression problem is solved, directly matching the paper's 'agent-native prefill CDN' model.
Content owners with high-query corpora such as legal databases, code repositories, or product documentation could precompute and monetize KV caches as a recurring revenue stream, capturing what the paper estimates as 'millions of dollars per popular document.'
There is room for a new infrastructure category: the cross-party payment layer the paper flags as missing, which would sit between content publishers and AI inference providers and has no incumbent today.

What we don't know yet

KV egress bandwidth costs versus prefill savings at scale: the paper acknowledges KV caches are 'nearly incompressible' but does not quantify what egress fees would do to the 49.7x savings figure.
No production implementation or pilot is described, leaving the cross-party payment layer mechanism entirely theoretical as of the paper's June 2026 submission.
Which document lengths or types achieve the 9x lower bound versus the 50x upper bound of compute savings is not broken down in the paper.

Originally reported by arxiv.org


Original headline: arXiv 2606.13361: 'Can I Buy Your KV Cache?' — Precomputed Attention Caches Can Be Sold to AI Agents to Skip Prefill, Delivering 9-50x Compute Savings With Zero Accuracy Cost