This repository contains the official implementation of DefensiveKV and LayerDefensiveKV, two novel KV cache compression methods introduced in our paper. The project is a fork of the excellent KVPRESS library by NVIDIA, with a more efficient implementation of head-wise KV cache methods.
If your issue has not received a response within 48 hours, please feel free to contact us via email and we will try to help. If you find this work useful for your research, we would greatly appreciate a star or a citation. ^_^
We tackle the fragility of existing KV cache eviction methods via defensive aggregation, implementing two variants:
- DefensiveKV: Introduces defensive aggregation on top of the current SOTA method, (Ada-)CriticalKV.
- Layer-DefensiveKV: Extends DefensiveKV by adopting AdaKV-style layer-wise budget allocation.
Defensive aggregation is implemented in just two lines of code, yet yields substantial improvements.
## Mechanism of Defensive Aggregation

```python
max_scores = scores.max(dim=2).values.max(dim=-2).values
scores = max_scores.clamp(min=max_scores.mean(dim=-1, keepdim=True))
```
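To make the operation concrete, here is a minimal, self-contained sketch with dummy tensors. The 5-D score layout `(batch, kv_heads, group, window, seq_len)` is an assumption for illustration only; consult the press implementations in `kvpress/` for the exact shapes used in this repository.

```python
import torch

# Assumed layout for illustration only: (batch, kv_heads, group, window, seq_len).
# The actual presses in kvpress/ may arrange the score tensor differently.
batch, kv_heads, group, window, seq_len = 1, 4, 2, 8, 128
scores = torch.rand(batch, kv_heads, group, window, seq_len)

# The two lines from above:
# 1) Max-pool over the grouped-query (dim=2) and observation-window (dim=-2)
#    dimensions, so each token is rated by its most attentive query.
max_scores = scores.max(dim=2).values.max(dim=-2).values  # (batch, kv_heads, seq_len)

# 2) Floor every score at the per-head mean, so no token is rated below
#    average importance and overly aggressive evictions are avoided.
scores = max_scores.clamp(min=max_scores.mean(dim=-1, keepdim=True))

print(scores.shape)  # torch.Size([1, 4, 128])
```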
## Installation

```bash
# Clone the repository
git clone <repository-url>
cd defensive_kvpress

# Install dependencies
pip install -e .

# Build the kernel for efficient head-wise computation
cd kvpress/csrc/
make

# Install flash attention for better performance
pip install flash-attn --no-build-isolation
```
## Datasets and Environment

Preprocessed LongBench and 4K RULER datasets are already provided in the `datasets` folder and can be downloaded directly:

```bash
# Make sure git-xet is installed (https://hf.co/docs/hub/git-xet)
git clone https://huggingface.co/datasets/yuanfengustc/defensivekv_dataset
```

Set the following environment variables before running evaluations:

```bash
export KVPRESS_DATASETS=/path/to/datasets  # Directory containing evaluation datasets
export MODELS_DIR=/path/to/your/models     # Directory containing HuggingFace models
```

## Quick Evaluation

We provide a quick evaluation on 10% of the RULER benchmark to demonstrate the performance of DefensiveKV and Layer-DefensiveKV at 20% cache size.
- 💡 Rapid verification: validate several popular methods in just 1 hour on a single RTX 4090.
- 📉 The truth: correcting flaws in previous benchmarks reveals that SnapKV drops to 39.0 at 20% cache size, shattering the earlier "lossless" illusion.
- 🚀 Our progress: advancing from AdaKV to CriticalKV to (Layer-)DefensiveKV boosts the average score from 39.0 to 91.4.
- 🧩 Stackable gains: our orthogonal approach integrates seamlessly with existing methods for additive improvements. Explore with us!
```bash
cd evaluation
bash quick_evaluate.sh
```

| Ruler Tasks | cwe | fwe | MK1 | MK2 | MK3 | MQ | MV | S1 | S2 | S3 | qa1 | qa2 | vt | Ave. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SnapKV | 72.6 | 74 | 26 | 8 | 2 | 36.5 | 34 | 74 | 52 | 4 | 44 | 42 | 38 | 39.0 |
| AdaKV | 92.4 | 87.3 | 24 | 20 | 16 | 29.5 | 26.5 | 88 | 56 | 4 | 46 | 42 | 65.6 | 46.0 |
| AdaCriticalKV | 93.8 | 93.3 | 60 | 30 | 26 | 71 | 72 | 88 | 92 | 4 | 56 | 52 | 79.2 | 62.9 |
| DefensiveKV | 95.2 | 92 | 98 | 86 | 72 | 95.5 | 90 | 100 | 100 | 62 | 78 | 46 | 94.4 | 85.3 |
| LayerDefensiveKV | 93.6 | 94 | 100 | 98 | 92 | 100 | 94.5 | 98 | 96 | 92 | 78 | 56 | 96.4 | 91.4 |
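The evaluation scripts wrap the standard KVPRESS pipeline, which can also be used directly in Python. The sketch below assumes this fork exports its methods as press classes in the usual KVPRESS style; the class name `DefensiveKVPress` and the model id are illustrative placeholders, so check `kvpress/__init__.py` for the actual exports.

```python
from transformers import pipeline

# Hypothetical press name; see kvpress/__init__.py for the classes this fork exports.
from kvpress import DefensiveKVPress

# The model id is illustrative; any HuggingFace model supported by KVPRESS should work.
pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cuda",
)

press = DefensiveKVPress(compression_ratio=0.8)  # keep 20% of the KV cache
context = "..."   # a long document
question = "..."  # a query about the document

answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```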
## Full Evaluation

Run the following script to evaluate on LongBench and RULER:

```bash
cd evaluation
bash evaluate.sh
```

This script will:
- Run DefensiveKV and Layer-DefensiveKV on LongBench with compression ratio 0.8 (20% Cache Size)
- Run the same methods on RULER (4096 context length) with compression ratio 0.8 (20% Cache Size)
- Save logs to the `evaluation/logs/` directory
## Efficiency Evaluation

Run the following script to evaluate efficiency:

```bash
cd evaluation
bash efficiency_evaluate.sh
```

The script benchmarks the following methods:

- `efficient_defensivekv` - DefensiveKV (per-head)
- `efficient_layer_defensivekv` - GlobalDefensiveKV (global per-head)
- `criti_adasnapkv` - CriticalKV built on AdaKV
- `adasnapkv` - AdaKV
- `snapkv` - SnapKV
## Analyzing Results

After running evaluations, use the provided scripts to analyze the results:

```bash
cd evaluation/results

# Generate statistics from JSON result files
python statistic.py
```
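For ad-hoc inspection of individual result files, something like the following works. The per-file layout (a flat JSON object with per-task scores) is an assumption for illustration; `statistic.py` remains the authoritative aggregation tool.

```python
import json
from pathlib import Path

# Assumption: each result file is a flat JSON object mapping task names to
# scores. Adjust the parsing to the layout your run actually produced.
for path in sorted(Path(".").glob("*.json")):
    with path.open() as f:
        scores = json.load(f)
    numeric = [v for v in scores.values() if isinstance(v, (int, float))]
    avg = sum(numeric) / len(numeric) if numeric else float("nan")
    print(f"{path.stem}: average = {avg:.1f}")
```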
## Roadmap

Planned updates are listed below. Feel free to open an issue or send a PR with ideas and feature requests:

- Initial release.
- Efficiency evaluation.
- Upgrade Transformers to v4.57 for Qwen-3 support.
- Qwen-3-MoE support.
## Contact

For questions or discussions about this work, please open an issue in this repository.
## Citation

If you find this work useful, please cite our paper:

```bibtex
@article{feng2025taming,
  title={Taming the Fragility of KV Cache Eviction in LLM Inference},
  author={Feng, Yuan and Guo, Haoyu and Lv, JunLin and Zhou, S Kevin and Xie, Xike},
  journal={arXiv preprint arXiv:2510.13334},
  year={2025}
}
```

Paper link: https://arxiv.org/pdf/2510.13334
## Acknowledgments

This project is built upon the KVPRESS library by NVIDIA. We thank the original authors for their excellent work on KV cache compression methods and for the flexible framework that made this research possible.

