Add MergingPress: scorer-agnostic merge-on-evict for KV cache compression 🤖🤖🤖 #219
jg-codes wants to merge 4 commits into NVIDIA:main
ExpectedAttentionPress benchmark results
Setup: RULER-4096, Qwen3-8B, fraction=0.1 (~650 samples), seed=42. Three configurations compared:
Average scores
MergingPress consistently matches or beats bare EA hard eviction. The gain is largest at CR=0.75 (+4.6 pp), matching the pattern seen with KnormPress (+6.0 pp).
Flagship per-task result: niah_single_3 at CR=0.75
Merge-on-evict recovers nearly all lost accuracy on this retrieval task, reaching the same quality as AdaKV's per-head budget allocation.
Per-task breakdown (CR=0.75)
Observations
🤖🤖🤖
@jg-codes run it with KVzap too
@jg-codes we are currently investigating the best way to interact with AI agents in this repository. To help us, could you share some information about yourself? (e.g. which agent harness you are using, which model, your config, who is running you, etc.)
Development: GitHub Copilot (VS Code "Autopilot" mode) in combination with Agentic Cowork features, e.g. for research tasks. All under my supervision. Unfortunately, guardrails don't yet stop the agent from publishing local drafts.
In the base setup, we lose against KVzap. Hence, this run merges only 75% of the tokens and requires a minimum similarity_threshold.
Setup: RULER-4096, Qwen3-8B, fraction=0.1 (~650 samples), seed=42, M(KVzap) = MergingPress(KVzapPress(model_type="mlp"), merge_fraction=0.75, similarity_threshold=0.5), i.e. selective merge-on-evict.
On QA we lose significantly. On niah-mv it still looks fine. Not sure about significance here.
Average Scores
Task Breakdown
Wall-clock time (average seconds per task)
@jg-codes could you give me more information about yourself?
Nice results. Could you run with DMSPress(press=KVzapPress(model_type="mlp"))? It's the SOTA press for now. Use thresholds of −4 and −3.
The experiments stem from a setup of multiple non-autonomous AI assistants: one for research, one for thinking, one for challenging that thinking, one for interdisciplinary perspectives, etc. Funnily, when I asked 'what is the KV press SOTA' to assess an optimization angle, the AI first named H2O; after I objected that this SOTA would be three years old, it named SnapKV, later AdaKV. Only then did I stumble on the KV Press Leaderboard.
DMSPress required adding a
MergingPress(DMSPress(KVzapPress)) results
Setup: RULER-4096, Qwen3-8B, f=0.1 (~650 samples), seed=42, A100
Per-task at threshold −4
At t=−4 DMSPress barely evicts: 9/13 tasks are already perfect. Only qa_1 shows movement (+2.1 with mf=0.75).
Per-task at threshold −3
Key takeaways
I suppose MergingPress would benefit from more aggressive thresholds; I'd need more time to ponder. What would you recommend as the next step? Are the extensions and modifications made so that MergingPress can wrap any press the right approach?
6a9a3d7 to 3989b4e
Replaces hard eviction with merge-on-evict: each evicted token is folded into its most cosine-similar survivor via similarity-weighted averaging. Values are blended in float32 for numerical stability; keys are preserved by default to maintain RoPE encoding. Signed-off-by: Johannes Gabriel <jg@2ec.de> 🤖🤖🤖 Signed-off-by: Johannes <johannes.gast@posteo.de>
MergingPress now delegates to AdaKV's adaptive per-head budget allocation, reads masked_key_indices to build an eviction mask, then merges evicted tokens into survivors in-place before the inner press applies its own pruning. Signed-off-by: Johannes Gabriel <jg@2ec.de> 🤖🤖🤖 Signed-off-by: Johannes <johannes.gast@posteo.de>
MergingPress now supports hook-based presses (DMSPress, KVzipPress, FastKVzipPress, KVComposePress) via post-hook composition: the inner press registers its own hooks, then MergingPress adds merge post-hooks that fire after each layer. Also adds forward_hook fallback for nested composition inside PrefillDecodingPress. Signed-off-by: Johannes Gabriel <jg@2ec.de> 🤖🤖🤖 Signed-off-by: Johannes <johannes.gast@posteo.de>
Adds four benchmark configurations covering all three composition modes: ScorerPress (knorm, snapkv), mask-based (adakv_snapkv), and hook-based (dms_kvzap_mlp). Signed-off-by: Johannes Gabriel <jg@2ec.de> 🤖🤖🤖 Signed-off-by: Johannes <johannes.gast@posteo.de>
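The merge step described in the commit messages above (each evicted token folded into its most cosine-similar survivor via similarity-weighted averaging, keys left untouched) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's kernel: the function names are hypothetical, similarity is assumed to be measured between keys, tokens whose best similarity falls below similarity_threshold are assumed to be dropped as in hard eviction, and Python floats stand in for the float32 blending mentioned above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_on_evict_sketch(keys, values, keep, similarity_threshold=0.0):
    """Hypothetical single-head sketch of merge-on-evict.

    keys, values: one vector (list of floats) per token.
    keep: one bool per token, True for survivors (decided by the inner press).
    Returns (surviving_keys, merged_values).
    """
    survivors = [i for i, k in enumerate(keep) if k]
    if not survivors:
        return [], []

    # Each survivor starts with its own value at weight 1.
    sums = {j: [float(v) for v in values[j]] for j in survivors}
    weights = {j: 1.0 for j in survivors}

    for i, kept in enumerate(keep):
        if kept:
            continue
        # Route the evicted token to its most similar survivor (by key; assumption).
        j = max(survivors, key=lambda s: cosine(keys[i], keys[s]))
        w = cosine(keys[i], keys[j])
        if w < similarity_threshold:
            continue  # assumed fallback: plain hard eviction
        sums[j] = [s + w * v for s, v in zip(sums[j], values[i])]
        weights[j] += w

    merged = [[s / weights[j] for s in sums[j]] for j in survivors]
    # Keys are returned unchanged, mirroring the default that preserves RoPE.
    return [keys[j] for j in survivors], merged
```

With keys [[1, 0], [0, 1], [2, 0]] and the third token evicted, its value is averaged into the first survivor with weight 1, while the second survivor is left as it was.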
654a197 to acadbe1
I've updated the PR so MergingPress can wrap kvzap and DMSPress, too.
The LLM is a mix of Claude Opus and others, depending on the task, in interplay with the respective tools: e.g. research database and reference management; persistent memory across sessions (e.g. disproven hypotheses get logged so the next iteration doesn't rediscover them); multi-environment orchestration, e.g. to run GPUs on demand. For more information, feel free to reach out.
Closes #214
What
MergingPress is a prefill-time wrapper that replaces hard eviction with merge-on-evict: each evicted token is folded into its most cosine-similar survivor via weighted value blending, instead of being discarded.
It wraps any BasePress; scoring is delegated entirely, and only the eviction step changes. This makes it composable with all existing scorers (KnormPress, SnapKVPress, AdaKVPress, DMSPress, etc.) and orthogonal to KV cache quantization (QuantizedCache).
How it works
similarity_threshold)
Three composition modes
The wrapper adapts to the inner press type automatically:
KnormPress, SnapKVPress, ...: .score(), builds evict mask from topk(), returns truncated tensors
AdaKVPress(ScorerPress): module.masked_key_indices set by AdaKV, merges in-place
DMSPress, KVzipPress, ...: __call__, MergingPress adds merge post-hooks that fire after each layer
Perturbation bound
For evicted token i routed to survivor j with cosine similarity w:
At w ≥ 0.7 the merge error is at most 59% of hard-eviction error; at w = 1 it halves exactly.
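The two figures quoted above (59% at w = 0.7, exactly half at w = 1) are consistent with an error factor of 1/(1+w). The sketch below checks that reading numerically under explicit assumptions that are not stated in the text: the survivor's value is replaced by (v_j + w·v_i)/(1+w), hard eviction is modelled as attention to token i reading v_j instead of v_i, and merging as reading the blended value instead. The function names are hypothetical.

```python
import math


def merged_value(v_j, v_i, w):
    """Assumed blend: survivor value v_j at weight 1, evicted value v_i at weight w."""
    return [(a + w * b) / (1.0 + w) for a, b in zip(v_j, v_i)]


def error_ratio(v_i, v_j, w):
    """Merge error divided by hard-eviction error for token i (requires v_i != v_j)."""
    hard = math.dist(v_i, v_j)                       # token i now reads v_j
    soft = math.dist(v_i, merged_value(v_j, v_i, w))  # token i now reads the blend
    return soft / hard


for w in (0.7, 1.0):
    # Under these assumptions the ratio equals 1 / (1 + w), independent of the vectors.
    print(w, round(error_ratio([1.0, 2.0], [3.0, 0.0], w), 3), round(1 / (1 + w), 3))
```

Under this model the ratio does not depend on the particular vectors, only on w, which is why a single similarity value is enough to state the bound.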
Parameters
press: BasePress whose eviction decisions determine which tokens survive
similarity_threshold: 0.0
merge_keys: False (False preserves Rotary Positional Encoding)
value_norm_weighting: True
max_merge_per_token: 0
merge_fraction: 1.0
Empirical defaults (RULER-4096, Qwen3-8B)
merge_keys=True hurts quality (−2.5 pp at CR=0.75): RoPE corruption
value_norm_weighting=True improves accuracy (~1.9 pp)
similarity_threshold=0.0 is sufficient: nearly no tokens have negative max similarity
max_merge_per_token=0 (unlimited) works well up to CR=0.75; at CR=0.88 broad regression suggests capping may help at extreme compression
Benchmark results
RULER-4096, Qwen3-8B, fraction=1.0 (all 13 subtasks), seed=42:
Average scores
MergingPress consistently outperforms hard eviction across all compression ratios, with the largest gains at high compression where merge-on-evict recovers the most discarded information.
Per-task breakdown
M+K = MergingPress(KnormPress), K = KnormPress. Knorm and no_press baselines from the kvpress leaderboard.
Key observations:
Scorer generality: AdaKVPress (f=0.1, ~650 samples)
Exploratory runs on AdaKV(SnapKVPress) confirm that MergingPress generalises beyond KnormPress. These used fraction=0.1 (~650 of ~6500 RULER samples), so treat as directional:
Pattern matches KnormPress: positive gains at CR 0.25–0.75, with an inversion at CR=0.88 where the merge overhead may dilute the few surviving tokens.
Computational overhead
The merge kernel adds one batched cosine-similarity matmul per layer: O(B · H · CR · (1−CR) · L² · D) — same complexity class as attention but over KV heads only (8 vs 32 query heads for Qwen3-8B) and bounded by CR·(1−CR) ≤ 0.25. Runs once at prefill; decoding is unaffected.
Theoretical peak: ~6% of attention FLOPs at CR=0.50, i.e. ~2–3% of total prefill FLOPs. No extra forward passes, no learned parameters.
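The ~6% figure can be reproduced with a back-of-envelope count. The sketch below assumes, beyond what the text states, that causal attention costs roughly H_q · L² · D multiply-accumulates (two matmuls, about half of each masked out) and that the merge similarity matmul costs H_kv · (CR·L) · ((1−CR)·L) · D; B, L and D then cancel. The function name is hypothetical, and the head counts are the Qwen3-8B values quoted above.

```python
def merge_to_attention_ratio(cr, kv_heads=8, query_heads=32):
    """Approximate merge-similarity cost as a fraction of causal attention cost.

    Assumed accounting (not from the PR):
      merge     ~ kv_heads    * (cr * L) * ((1 - cr) * L) * D
      attention ~ query_heads * L * L * D   (QK^T and AV, causal mask halves each)
    """
    return kv_heads * cr * (1.0 - cr) / query_heads


for cr in (0.25, 0.50, 0.75, 0.88):
    print(f"CR={cr:.2f}: {merge_to_attention_ratio(cr):.2%} of attention cost")
```

Because CR·(1−CR) peaks at CR=0.50, the ratio is largest there and shrinks toward both low and extreme compression.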
Changes
Updated to reflect the actual committed scope (5 files, +580 lines):
kvpress/presses/merging_press.py: _merge_on_evict kernel (lines 24–159) + MergingPress dataclass with 3 composition modes (lines 162–387)
tests/presses/test_merging_press.py
kvpress/__init__.py: __all__ entry
evaluation/evaluate_registry.py: merging_knorm, merging_snapkv, merging_adakv_snapkv, merging_dms_kvzap_mlp
tests/default_presses.py
Total: 5 files, +580 insertions, 0 deletions
Design choices vs. related work
References:
Usage
Works with QuantizedCache out of the box: the kernel handles dequantize → merge → requantize automatically.
Tests
8 tests in tests/presses/test_merging_press.py:
test_merge_differs_from_hard_eviction
test_default_preserves_keys: merge_keys=False leaves keys identical
test_merge_preserves_more_info
test_half_precision_no_nan
test_batch_size_greater_than_one
test_adakv_composition
test_dms_hook_composition
test_forward_hook_fallback
CI
Awaiting /ok to test from a collaborator. Local results:
ruff check ✅ no issues on all changed files
pytest tests/presses/test_merging_press.py ✅ 8 passed
AI disclosure
This PR was developed with AI assistance. Commits authored by AI are marked with 🤖🤖🤖. The API design, parameter selection, and empirical tuning are human contributions.
Checklist
AGENTS.md guidelines (dataclass, BasePress, SPDX headers)
ruff check passes on all changed files
kvpress/__init__.py, tests/default_presses.py, evaluation/evaluate_registry.py
make style / make test on CI (awaiting /ok to test)