剩余流是所有你需要的:关于变形器推论中 KV缓存的冗余
The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference

作者
Authors Kaleem Ullah Qasim|Jiashu Zhang|Muhammad Kafeel Shaheen|Razan Alharith|Heying Zhang

期刊
Journal 暂无期刊信息

年份
Year 2026

分类
Category

国家
Country 中国China

🔗 访问原文
🔗 Access Paper

📝 摘要
Abstract

The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit-identically. We verify this across six models from four architecture families (135M to 4B parameters). Cross-task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state. Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested. We build on this result with KV-Direct, a bounded-memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3-4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct maintains 100% token match at every cache budget; all baselines degrade to 5-28%. A per-operation latency analysis shows recomputation runs up to 5x faster than reading cached tensors at moderate batch sizes. Code is available at https://github.com/Kaleemullahqasim/KV-Direct.

📊 文章统计
Article Statistics

基础数据
Basic Stats

308 浏览
Views

0 下载
Downloads

9 引用
Citations

引用趋势
Citation Trend

阅读国家分布
Country Distribution

阅读机构分布
Institution Distribution

月度浏览趋势
Monthly Views

影响因子分析
Impact Analysis

5.90 综合评分
Overall Score

引用影响力
Citation Impact

浏览热度
View Popularity

下载频次
Download Frequency

剩余流是所有你需要的:关于变形器推论中 KV缓存的冗余
The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference

📝 摘要
Abstract

📊 文章统计
Article Statistics

基础数据
Basic Stats

引用趋势
Citation Trend

阅读国家分布
Country Distribution

阅读机构分布
Institution Distribution

月度浏览趋势
Monthly Views

相关关键词
Related Keywords

影响因子分析
Impact Analysis

📄 相关文章
Related Articles

剩余流是所有你需要的:关于变形器推论中 KV缓存的冗余The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference

📝 摘要Abstract

📊 文章统计Article Statistics

基础数据Basic Stats

引用趋势Citation Trend

阅读国家分布Country Distribution

阅读机构分布Institution Distribution

月度浏览趋势Monthly Views

相关关键词Related Keywords

影响因子分析Impact Analysis

📄 相关文章Related Articles

剩余流是所有你需要的:关于变形器推论中 KV缓存的冗余
The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference

📝 摘要
Abstract

📊 文章统计
Article Statistics

基础数据
Basic Stats

引用趋势
Citation Trend

阅读国家分布
Country Distribution

阅读机构分布
Institution Distribution

月度浏览趋势
Monthly Views

相关关键词
Related Keywords

影响因子分析
Impact Analysis

📄 相关文章
Related Articles