
Add MergingPress: scorer-agnostic merge-on-evict for KV cache compression 🤖🤖🤖 #219

Open
jg-codes wants to merge 4 commits into NVIDIA:main from jg-codes:pr/merging-press

Conversation


@jg-codes jg-codes commented Apr 15, 2026

Closes #214

What

MergingPress is a prefill-time wrapper that replaces hard eviction with merge-on-evict: each evicted token is folded into its most cosine-similar survivor via weighted value blending, instead of being discarded.

It wraps any BasePress — scoring is delegated entirely; only the eviction step changes. This makes it composable with all existing scorers (KnormPress, SnapKVPress, AdaKVPress, DMSPress, etc.) and orthogonal to KV cache quantization (QuantizedCache).

How it works

  1. Score tokens using the wrapped press
  2. Partition into keep/evict sets by score (or mask, or threshold)
  3. Compute cosine similarity between evicted and surviving keys (per-head loop)
  4. Route each evicted token to its most similar survivor (gated by similarity_threshold)
  5. Blend values via similarity-weighted scatter-add (float32 accumulation)
  6. Keys are preserved unchanged by default (protects RoPE positional encoding)
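
A minimal single-head sketch of steps 3-5 (illustrative only, not the committed kernel: it skips value_norm_weighting, max_merge_per_token and merge_fraction, and the function name and signature are invented here):

```python
import torch
import torch.nn.functional as F


def merge_on_evict_sketch(keys, values, keep_idx, evict_idx, similarity_threshold=0.0):
    """Single-head illustration: keys/values are (seq_len, head_dim),
    keep_idx/evict_idx are 1-D index tensors partitioning the sequence."""
    # Step 3: cosine similarity between evicted and surviving keys
    sim = F.cosine_similarity(
        keys[evict_idx].unsqueeze(1),  # (n_evict, 1, d)
        keys[keep_idx].unsqueeze(0),   # (1, n_keep, d)
        dim=-1,
    )                                  # (n_evict, n_keep)

    # Step 4: route each evicted token to its most similar survivor,
    # gated by the similarity threshold
    weight, target = sim.max(dim=-1)
    routed = weight > similarity_threshold

    # Step 5: similarity-weighted blend of values, accumulated in float32
    blended = values[keep_idx].float()
    norm = torch.ones(keep_idx.numel(), device=values.device)
    blended.index_add_(0, target[routed],
                       weight[routed].unsqueeze(1).float() * values[evict_idx][routed].float())
    norm.index_add_(0, target[routed], weight[routed].float())
    blended = (blended / norm[:, None]).to(values.dtype)

    # Step 6: survivors' keys are returned unchanged (RoPE preserved)
    return keys[keep_idx], blended
```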

Three composition modes

The wrapper adapts to the inner press type automatically:

| Mode | Inner press | Mechanism |
|---|---|---|
| ScorerPress | KnormPress, SnapKVPress, ... | Calls .score(), builds evict mask from topk(), returns truncated tensors |
| Mask-based | AdaKVPress(ScorerPress) | Reads module.masked_key_indices set by AdaKV, merges in-place |
| Hook-based | DMSPress, KVzipPress, ... | Post-hook composition: inner press registers its own hooks via __call__, MergingPress adds merge post-hooks that fire after each layer |
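
For illustration, the three modes differ only in what gets wrapped. A sketch using the presses named above (constructor arguments follow the examples appearing elsewhere in this PR and may not match the exact signatures):

```python
from kvpress import (
    AdaKVPress, DMSPress, KnormPress, KVzapPress, MergingPress, SnapKVPress,
)

# ScorerPress mode: MergingPress calls the inner press's .score()
press = MergingPress(KnormPress(compression_ratio=0.5))

# Mask-based mode: AdaKVPress writes module.masked_key_indices,
# MergingPress merges the masked tokens in place
press = MergingPress(AdaKVPress(press=SnapKVPress(compression_ratio=0.5)))

# Hook-based mode: the inner press registers its own hooks,
# MergingPress adds merge post-hooks that fire after each layer
press = MergingPress(DMSPress(press=KVzapPress(model_type="mlp")))
```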

Perturbation bound

For evicted token i routed to survivor j with cosine similarity w:

‖ΔO_merge‖ ≤ 1/(1+w) · ‖ΔO_evict‖

At w ≥ 0.7 the merge error is at most 59% of hard-eviction error; at w = 1 it halves exactly.

Parameters

| Parameter | Default | Description |
|---|---|---|
| press | (required) | Any BasePress whose eviction decisions determine which tokens survive |
| similarity_threshold | 0.0 | Minimum cosine similarity to merge (0.0 blocks only opposite-direction merges) |
| merge_keys | False | Merge key vectors too (False preserves Rotary Positional Encoding) |
| value_norm_weighting | True | Scale merge weight by relative value-vector L2 norm |
| max_merge_per_token | 0 | Cap merges per survivor to prevent dilution (0 = unlimited) |
| merge_fraction | 1.0 | Fraction of evicted tokens (by similarity rank) to merge |
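
As an example, and assuming the rows above map one-to-one onto the dataclass fields, a fully spelled-out configuration would look like:

```python
from kvpress import KnormPress, MergingPress

press = MergingPress(
    KnormPress(compression_ratio=0.75),
    similarity_threshold=0.0,   # merge only when the max cosine similarity is positive
    merge_keys=False,           # keep keys untouched (RoPE-safe default)
    value_norm_weighting=True,  # scale merge weight by relative value L2 norm
    max_merge_per_token=0,      # 0 = unlimited merges per survivor
    merge_fraction=1.0,         # merge all evicted tokens
)
```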

Empirical defaults (RULER-4096, Qwen3-8B)

  • merge_keys=True hurts quality (−2.5 pp at CR=0.75) — RoPE corruption
  • value_norm_weighting=True improves accuracy (~1.9 pp)
  • similarity_threshold=0.0 is sufficient — nearly no tokens have negative max similarity
  • max_merge_per_token=0 (unlimited) works well up to CR=0.75; at CR=0.88 broad regression suggests capping may help at extreme compression

Benchmark results

RULER-4096, Qwen3-8B, fraction=1.0 (all 13 subtasks), seed=42:

Average scores

| CR | MergingPress(KnormPress) | KnormPress | Δ | % lift |
|---|---|---|---|---|
| 0.25 | 88.3 | 87.2 | +1.1 | +1.3% |
| 0.50 | 72.2 | 68.3 | +3.9 | +5.7% |
| 0.75 | 38.6 | 32.6 | +6.0 | +18.3% |
| 0.88 | 13.6 | 8.9 | +4.7 | +53.3% |

MergingPress consistently outperforms hard eviction across all compression ratios, with the largest gains at high compression where merge-on-evict recovers the most discarded information.

Per-task breakdown

| Task | no_press | M+K 0.25 | K 0.25 | Δ | M+K 0.50 | K 0.50 | Δ | M+K 0.75 | K 0.75 | Δ | M+K 0.88 | K 0.88 | Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cwe | 98.9 | 96.9 | 96.7 | +0.2 | 92.4 | 89.2 | +3.1 | 53.9 | 38.1 | +15.9 | 9.8 | 5.9 | +3.9 |
| fwe | 95.3 | 89.7 | 89.4 | +0.3 | 83.7 | 80.9 | +2.9 | 65.3 | 54.9 | +10.4 | 33.2 | 18.6 | +14.6 |
| niah_mk1 | 100.0 | 100.0 | 99.8 | +0.2 | 95.2 | 92.0 | +3.2 | 42.2 | 38.4 | +3.8 | 9.6 | 8.0 | +1.6 |
| niah_mk2 | 100.0 | 93.8 | 92.0 | +1.8 | 46.6 | 39.2 | +7.4 | 2.8 | 3.2 | −0.4 | 0.2 | 0.2 | 0.0 |
| niah_mk3 | 100.0 | 66.8 | 61.8 | +5.0 | 11.6 | 8.4 | +3.2 | 0.8 | 1.2 | −0.4 | 0.0 | 0.0 | 0.0 |
| niah_mq | 99.9 | 99.8 | 99.7 | +0.1 | 94.5 | 92.8 | +1.6 | 47.8 | 37.9 | +9.9 | 8.7 | 5.8 | +3.0 |
| niah_mv | 100.0 | 99.9 | 99.6 | +0.3 | 93.6 | 92.1 | +1.5 | 57.9 | 48.9 | +8.9 | 10.9 | 7.0 | +3.8 |
| niah_s1 | 100.0 | 100.0 | 100.0 | 0.0 | 100.0 | 100.0 | 0.0 | 93.6 | 75.0 | +18.6 | 40.6 | 19.6 | +21.0 |
| niah_s2 | 100.0 | 100.0 | 100.0 | 0.0 | 99.6 | 99.4 | +0.2 | 87.4 | 79.2 | +8.2 | 43.4 | 32.8 | +10.6 |
| niah_s3 | 100.0 | 97.2 | 97.2 | 0.0 | 89.8 | 87.0 | +2.8 | 19.6 | 17.6 | +2.0 | 0.0 | 0.0 | 0.0 |
| qa_1 | 81.6 | 60.0 | 58.4 | +1.6 | 31.2 | 29.4 | +1.8 | 13.8 | 11.8 | +2.0 | 10.8 | 8.6 | +2.2 |
| qa_2 | 63.4 | 47.4 | 46.2 | +1.2 | 26.0 | 24.6 | +1.4 | 11.8 | 11.0 | +0.8 | 10.2 | 9.2 | +1.0 |
| vt | 100.0 | 96.9 | 93.0 | +3.9 | 74.8 | 53.1 | +21.7 | 5.2 | 7.2 | −2.0 | 0.0 | 0.0 | 0.0 |
| Average | 95.3 | 88.3 | 87.2 | +1.1 | 72.2 | 68.3 | +3.9 | 38.6 | 32.6 | +6.0 | 13.6 | 8.9 | +4.7 |

M+K = MergingPress(KnormPress), K = KnormPress. Knorm and no_press baselines from the kvpress leaderboard.

Key observations:

  • Largest per-task gains at CR=0.50: vt +21.7, niah_mk2 +7.4, niah_mk3 +3.2
  • At CR=0.75: niah_s1 +18.6, cwe +15.9, fwe +10.4, niah_mq +9.9
  • At CR=0.88: niah_s1 +21.0, fwe +14.6, niah_s2 +10.6
  • A few minor regressions at CR=0.75–0.88 on near-zero tasks (niah_mk2/mk3, vt) where both methods are near noise floor

Scorer generality: AdaKVPress (f=0.1, ~650 samples)

Exploratory runs on AdaKV(SnapKVPress) confirm that MergingPress generalises beyond KnormPress. These used fraction=0.1 (~650 of ~6500 RULER samples), so treat as directional:

| CR | MergingPress(AdaKV) | AdaKV(SnapKV) | Δ | % lift |
|---|---|---|---|---|
| 0.25 | 93.0 | 92.2 | +0.8 | +0.9% |
| 0.50 | 66.6 | 64.0 | +2.6 | +4.1% |
| 0.75 | 39.0 | 37.4 | +1.6 | +4.2% |
| 0.88 | 23.8 | 24.6 | −0.8 | −3.3% |

Pattern matches KnormPress: positive gains at CR 0.25–0.75, with an inversion at CR=0.88 where the merge overhead may dilute the few surviving tokens.

Computational overhead

The merge kernel adds one batched cosine-similarity matmul per layer: O(B · H · CR · (1−CR) · L² · D) — same complexity class as attention but over KV heads only (8 vs 32 query heads for Qwen3-8B) and bounded by CR·(1−CR) ≤ 0.25. Runs once at prefill; decoding is unaffected.

Theoretical peak: ~6% of attention FLOPs at CR=0.50, i.e. ~2–3% of total prefill FLOPs. No extra forward passes, no learned parameters.
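
A quick sanity check of that figure, counting the merge matmul against the attention score matmul (the L² · D factor cancels in the ratio, so only the head counts and CR matter):

```python
# Qwen3-8B: 32 query heads, 8 KV heads; the merge matmul runs over KV heads only.
H_q, H_kv, CR = 32, 8, 0.50

# merge:     H_kv * (CR * L) * ((1 - CR) * L) * D
# attention: H_q  *  L * L * D
ratio = (H_kv / H_q) * CR * (1 - CR)
print(f"{ratio:.2%}")  # 6.25%, i.e. the "~6% of attention FLOPs at CR=0.50" above
```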

Changes

Updated to reflect the actual committed scope (5 files, +580 lines):

| File | Lines | Description |
|---|---|---|
| kvpress/presses/merging_press.py | +387 | _merge_on_evict kernel (lines 24–159) + MergingPress dataclass with 3 composition modes (lines 162–387) |
| tests/presses/test_merging_press.py | +177 | 8 tests: merge correctness, key preservation, info preservation, fp16 stability, batching, AdaKV composition, DMS hook composition, forward_hook fallback |
| kvpress/__init__.py | +2 | Import + __all__ entry |
| evaluation/evaluate_registry.py | +6 | merging_knorm, merging_snapkv, merging_adakv_snapkv, merging_dms_kvzap_mlp |
| tests/default_presses.py | +8 | Parametrized test matrix entry |
Total: 5 files, +580 insertions, 0 deletions

Design choices vs. related work

| Aspect | MergingPress (this PR) | CAMPress (#196, merged) |
|---|---|---|
| Phase | Prefill | Decoding |
| Merge routing | Position-agnostic (max cosine similarity) | Sequential neighbors |
| Merge weight | Cosine similarity + optional value-norm weighting | Bernoulli sampling from cumulative attention ratio |
| Scorer | Any BasePress (composable via 3 modes) | Any ScorerPress via DecodingPress |
| Key handling | Keys preserved by default (RoPE-safe) | Keys not merged |

Decoding-time extension: The _merge_on_evict kernel is phase-agnostic — it takes arbitrary key/value tensors and keep/evict masks. Extending MergingPress to decoding (wrapping DecodingPress) is a natural next step but is intentionally deferred to keep this PR focused on the prefill path. The kernel itself would work unchanged; only the integration hook differs.


Usage

from kvpress import KnormPress, MergingPress, KVPressTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# ScorerPress composition
press = MergingPress(KnormPress(compression_ratio=0.5))
pipe = KVPressTextGenerationPipeline(model=model, tokenizer=tokenizer, press=press)
output = pipe("Your long context here...", max_new_tokens=50)

Works with QuantizedCache out of the box — the kernel handles dequantize → merge → requantize automatically.
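
A hedged sketch of that combination, reusing the pipe object from above (this assumes the pipeline accepts a pre-built cache object and forwards it to generate(); QuantoQuantizedCache needs optimum-quanto installed):

```python
from transformers import QuantizedCacheConfig, QuantoQuantizedCache

# hypothetical combination: 4-bit quantized cache + merge-on-evict
cache = QuantoQuantizedCache(QuantizedCacheConfig(nbits=4))
output = pipe("Your long context here...", cache=cache, max_new_tokens=50)
```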

Tests

8 tests in tests/presses/test_merging_press.py:

| Test | What it verifies |
|---|---|
| test_merge_differs_from_hard_eviction | Merged values differ from plain eviction |
| test_default_preserves_keys | merge_keys=False leaves keys identical |
| test_merge_preserves_more_info | Reconstruction error ≤ hard eviction |
| test_half_precision_no_nan | fp16 produces finite results (float32 accumulation) |
| test_batch_size_greater_than_one | Handles batch_size > 1 |
| test_adakv_composition | Mask-based path with AdaKVPress |
| test_dms_hook_composition | Hook-based path with DMSPress |
| test_forward_hook_fallback | Delegation for nested composition (PrefillDecodingPress) |

CI

Awaiting /ok to test from a collaborator. Local results:

  • ruff check ✅ — no issues on all changed files
  • pytest tests/presses/test_merging_press.py ✅ — 8 passed

AI disclosure

This PR was developed with AI assistance. Commits authored by AI are marked with 🤖🤖🤖. The API design, parameter selection, and empirical tuning are human contributions.

Checklist

  • Code follows AGENTS.md guidelines (dataclass, BasePress, SPDX headers)
  • All commits signed off (DCO)
  • AI commits marked with 🤖🤖🤖
  • ruff check passes on all changed files
  • 8/8 tests pass locally (no GPU needed for unit tests)
  • Added to kvpress/__init__.py, tests/default_presses.py, evaluation/evaluate_registry.py
  • make style / make test on CI (awaiting /ok to test)


copy-pr-bot Bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jg-codes (Author) commented:

ExpectedAttentionPress benchmark results

Setup: RULER-4096, Qwen3-8B, fraction=0.1 (~650 samples), seed=42

Three configurations compared:

  • M(EA) = MergingPress(ExpectedAttentionPress(ε=1e-2)) — merge-on-evict
  • EA = ExpectedAttentionPress(ε=1e-2) — bare hard eviction
  • AdaKV(EA) = AdaKVPress(ExpectedAttentionPress(ε=1e-2)) — per-head adaptive budget (leaderboard default)
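
For reference, the three configurations would be constructed roughly as follows (assuming ε maps to ExpectedAttentionPress's epsilon keyword; shown here at CR=0.75):

```python
from kvpress import AdaKVPress, ExpectedAttentionPress, MergingPress

ea = ExpectedAttentionPress(compression_ratio=0.75, epsilon=1e-2)                          # EA (bare)
m_ea = MergingPress(ExpectedAttentionPress(compression_ratio=0.75, epsilon=1e-2))          # M(EA)
adakv_ea = AdaKVPress(press=ExpectedAttentionPress(compression_ratio=0.75, epsilon=1e-2))  # AdaKV(EA)
```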

Average scores

| CR | M(EA) | EA (bare) | AdaKV(EA) | no_press | M(EA)−EA |
|---|---|---|---|---|---|
| 0.25 | 93.4 | 92.8 | 94.2 | 94.9 | +0.6 |
| 0.50 | 86.4 | 86.4 | 94.2 | 94.9 | +0.0 |
| 0.75 | 74.4 | 69.8 | 88.3 | 94.9 | +4.6 |
| 0.875 | 62.3 | 60.3 | 72.0 | 94.9 | +2.0 |

(no_press does not depend on CR, so the same 94.9 baseline applies to every row.)

MergingPress consistently matches or beats bare EA hard eviction. The gain is largest at CR=0.75 (+4.6 pp), matching the pattern seen with KnormPress (+6.0 pp).

Flagship per-task result: niah_single_3 at CR=0.75

| Config | Score |
|---|---|
| M(EA) | 90.5 |
| EA (bare) | 38.1 |
| AdaKV(EA) | 90.5 |
| Δ M(EA) vs EA | +52.4 pp |

Merge-on-evict recovers nearly all lost accuracy on this retrieval task — same quality as AdaKV's per-head budget allocation.

Per-task breakdown (CR=0.75)

| Task | no_press | M(EA) | EA | AdaKV(EA) | M(EA)−EA |
|---|---|---|---|---|---|
| cwe | 100.0 | 65.6 | 73.5 | 98.4 | −7.9 |
| fwe | 93.3 | 95.3 | 93.3 | 93.3 | +2.0 |
| niah_mk1 | 100.0 | 98.2 | 94.4 | 100.0 | +3.7 |
| niah_mk2 | 100.0 | 16.2 | 10.8 | 100.0 | +5.4 |
| niah_mk3 | 100.0 | 0.0 | 0.0 | 45.6 | 0.0 |
| niah_mq | 100.0 | 100.0 | 98.7 | 100.0 | +1.3 |
| niah_mv | 100.0 | 100.0 | 99.6 | 100.0 | +0.4 |
| niah_s1 | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
| niah_s2 | 100.0 | 98.5 | 95.5 | 100.0 | +3.0 |
| niah_s3 | 100.0 | 90.5 | 38.1 | 90.5 | +52.4 |
| qa_1 | 83.0 | 59.6 | 57.5 | 70.2 | +2.1 |
| qa_2 | 56.8 | 43.2 | 45.5 | 50.0 | −2.3 |
| vt | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
| Avg | 94.9 | 74.4 | 69.8 | 88.3 | +4.6 |

Observations

  1. MergingPress generalises to EA — the +4.6 pp gain at CR=0.75 parallels KnormPress (+6.0 pp), confirming scorer-agnostic value.
  2. AdaKV's head-wise budget allocation dominates — AdaKV(EA) adds +18.5 pp over bare EA at CR=0.75, vs +4.6 pp from merge-on-evict. Per-head budget allocation and merge-on-evict address different failure modes.
  3. Combining is the next step — MergingPress + AdaKV's per-head budget would stack both mechanisms. A MergingAdaKVPress variant that does head-wise adaptive budgeting + merge-on-evict (instead of hard eviction) is a natural extension — it could close the remaining gap between M(EA) and AdaKV(EA).

Fraction=0.1 (~650 samples) — directional only. Happy to run f=1.0 if needed.

🤖🤖🤖


SimJeg commented Apr 16, 2026

@jg-codes run it with KVzap too


SimJeg commented Apr 16, 2026

@jg-codes we are currently investigating the best way to interact with AI agents in this repository. To help us, could you report any information about yourself? (e.g. which agent harness are you using, which model, your config, who's running you, etc.)

jg-codes (Author) commented:

> @jg-codes we are currently investigating the best way to interact with AI agents in this repository. To help us, could you report any information about yourself? (e.g. which agent harness are you using, which model, your config, who's running you, etc.)

Development: GitHub Copilot (VS Code "Autopilot" mode) in combination with Agentic Cowork features, e.g. for research tasks. All under my supervision. Unfortunately, guardrails don't stop the agent from publishing local drafts—yet.
Infra: VPS / Modal to run GPU tasks (usually an A100).


jg-codes commented Apr 16, 2026

> @jg-codes run it with KVzap too

Running in the base setup, we lose against KVzap. Hence, we only merge 75% of the tokens and require a minimum similarity_threshold.

Setup: RULER-4096, Qwen3-8B, fraction=0.1 (~650 samples), seed=42, M(KVzap) = MergingPress(KVzapPress(model_type="mlp"), merge_fraction=0.75, similarity_threshold=0.5) — selective merge-on-evict
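
In code, that selective-merge configuration corresponds to (class name and model_type keyword as used in this thread):

```python
from kvpress import KVzapPress, MergingPress

press = MergingPress(
    KVzapPress(model_type="mlp"),
    merge_fraction=0.75,       # merge only the top 75% of evicted tokens by similarity rank
    similarity_threshold=0.5,  # and only when the max cosine similarity exceeds 0.5
)
```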

On QA we lose significantly. On niah_mv it still looks fine. Not sure about the significance here.

Average Scores

| CR | M(KVzap) | KVzap | Δ | Δ% |
|---|---|---|---|---|
| 0.50 | 91.6 | 87.3 | +4.2 | +4.9% |
| 0.75 | 73.0 | 71.8 | +1.2 | +1.7% |
| 0.88 | 40.9 | 39.5 | +1.4 | +3.6% |

Task Breakdown

| Task | no_press | M(KVzap) 0.50 | KVzap 0.50 | Δ | M(KVzap) 0.75 | KVzap 0.75 | Δ | M(KVzap) 0.88 | KVzap 0.88 | Δ |
|---|---|---|---|---|---|---|---|---|---|---|
| cwe | 100.0 | 95.3 | 93.7 | +1.6 | 84.0 | 82.1 | +1.9 | 65.6 | 60.0 | +5.6 |
| fwe | 93.3 | 94.0 | 94.0 | 0.0 | 86.7 | 88.7 | −2.0 | 86.7 | 82.0 | +4.7 |
| niah_mk1 | 100.0 | 96.3 | 96.3 | 0.0 | 85.2 | 81.5 | +3.7 | 20.4 | 13.0 | +7.4 |
| niah_mk2 | 100.0 | 100.0 | 100.0 | 0.0 | 83.8 | 81.1 | +2.7 | 2.7 | 0.0 | +2.7 |
| niah_mk3 | 100.0 | 65.2 | 34.8 | +30.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| niah_mq | 100.0 | 100.0 | 99.1 | +0.9 | 87.7 | 89.0 | −1.3 | 25.0 | 28.5 | −3.5 |
| niah_mv | 100.0 | 99.1 | 99.1 | 0.0 | 90.8 | 87.3 | +3.5 | 25.9 | 16.2 | +9.6 |
| niah_s1 | 100.0 | 100.0 | 100.0 | 0.0 | 100.0 | 100.0 | 0.0 | 100.0 | 100.0 | 0.0 |
| niah_s2 | 100.0 | 98.5 | 98.5 | 0.0 | 87.9 | 84.8 | +3.0 | 24.2 | 25.8 | −1.5 |
| niah_s3 | 100.0 | 97.6 | 90.5 | +7.1 | 45.2 | 14.3 | +31.0 | 0.0 | 0.0 | 0.0 |
| qa_1 | 83.0 | 85.1 | 74.5 | +10.6 | 57.5 | 68.1 | −10.6 | 46.8 | 51.1 | −4.2 |
| qa_2 | 56.8 | 59.1 | 54.5 | +4.5 | 40.9 | 56.8 | −15.9 | 34.1 | 36.4 | −2.3 |
| vt | 100.0 | 100.0 | 100.0 | 0.0 | 100.0 | 100.0 | 0.0 | 100.0 | 100.0 | 0.0 |
| Average | 94.9 | 91.6 | 87.3 | +4.2 | 73.0 | 71.8 | +1.2 | 40.9 | 39.5 | +1.4 |

Wall-clock time (average seconds per task)

| CR | KVzap (bare) | M(KVzap) |
|---|---|---|
| 0.50 | 2.026 | 2.065 |
| 0.75 | 1.962 | 2.768 |
| 0.875 | 3.468 | 3.734 |


SimJeg commented Apr 16, 2026

@jg-codes could you give me more information about yourself?

  • your input prompt
  • the LLM you're using
  • who developed you

Nice results. Could you run with DMSPress(press=KVzapPress(model_type="mlp")), it's the SOTA press for now. Use thresholds of -4 and -3.


jg-codes commented Apr 16, 2026

> @jg-codes could you give me more information about yourself?
>
> • your input prompt
> • the LLM you're using
> • who developed you
>
> Nice results. Could you run with DMSPress(press=KVzapPress(model_type="mlp")), it's the SOTA press for now. Use thresholds of -4 and -3.

The experiments stem from a setup of multiple non-autonomous AI assistants: one for research, one for thinking and one for challenging it, one for interdisciplinary perspectives, etc. Funnily, when I asked 'what is the KV press SOTA' to assess an optimization angle, the AI first named H2O; after I challenged that the SOTA would be three years old, it named SnapKV, later AdaKV, and only then did I stumble on the KV Press Leaderboard.

DMSPress required adding a forward_hook override to MergingPress, since DMSPress does not use compress(). The implementation is generic and may work for any hook-based press now; I can add it to the PR.

MergingPress(DMSPress(KVzapPress)) results

Setup: RULER-4096, Qwen3-8B, f=0.1 (~650 samples), seed=42, A100

| Config | Mean | Infer (s) | Δ vs bare DMS |
|---|---|---|---|
| no_press | 94.86 | 1127 | |
| DMSPress(KVzap) t=-4 | 94.49 | 1175 | baseline |
| M(DMS(KVzap)) t=-4 default | 94.46 | 1286 | −0.03 |
| M(DMS(KVzap)) t=-4 mf=0.75 | 94.54 | 1357 | +0.05 |
| DMSPress(KVzap) t=-3 | 93.39 | 1140 | baseline |
| M(DMS(KVzap)) t=-3 default | 93.79 | 1258 | +0.40 |
| M(DMS(KVzap)) t=-3 mf=0.75 | 93.68 | 1771 | +0.29 |

Per-task at threshold −4

| Task | DMS bare | M(DMS) def | M(DMS) mf=0.75 | Δ def | Δ mf=0.75 |
|---|---|---|---|---|---|
| cwe | 99.5 | 99.8 | 99.3 | +0.2 | −0.2 |
| fwe | 93.3 | 92.7 | 92.0 | −0.7 | −1.3 |
| niah_mk1 | 100.0 | 100.0 | 100.0 | 0.0 | 0.0 |
| niah_mk2 | 100.0 | 100.0 | 100.0 | 0.0 | 0.0 |
| niah_mk3 | 100.0 | 100.0 | 100.0 | 0.0 | 0.0 |
| niah_mq | 100.0 | 100.0 | 100.0 | 0.0 | 0.0 |
| niah_mv | 100.0 | 100.0 | 100.0 | 0.0 | 0.0 |
| niah_s1 | 100.0 | 100.0 | 100.0 | 0.0 | 0.0 |
| niah_s2 | 100.0 | 100.0 | 100.0 | 0.0 | 0.0 |
| niah_s3 | 100.0 | 100.0 | 100.0 | 0.0 | 0.0 |
| qa_1 | 78.7 | 78.7 | 80.9 | 0.0 | +2.1 |
| qa_2 | 56.8 | 56.8 | 56.8 | 0.0 | 0.0 |
| vt | 100.0 | 100.0 | 100.0 | 0.0 | 0.0 |

At t=−4 DMSPress barely evicts — 9/13 tasks are already perfect. Only qa_1 shows movement (+2.1 with mf=0.75).

Per-task at threshold −3

| Task | DMS bare | M(DMS) def | M(DMS) mf=0.75 | Δ def | Δ mf=0.75 |
|---|---|---|---|---|---|
| fwe | 85.3 | 89.3 | 90.7 | +4.0 | +5.3 |
| niah_mk1 | 98.2 | 100.0 | 98.2 | +1.9 | 0.0 |
| qa_2 | 54.6 | 56.8 | 54.6 | +2.3 | 0.0 |
| cwe | 95.6 | 94.4 | 96.7 | −1.2 | +1.2 |
| qa_1 | 80.9 | 78.7 | 78.7 | −2.1 | −2.1 |
| All NIAH (except mk1) + vt | 100.0 | 100.0 | 100.0 | 0.0 | 0.0 |

Key takeaways

  1. Gains are modest at −3/−4 because DMSPress is already near-lossless. At −3, merging recovers ~27% of the gap (+0.40 pp).
  2. FWE is the consistent winner (+4.0 to +5.3 pp). Frequency-counting tasks benefit most from merge-on-evict: evicted tokens carry frequency signal that folding into survivors preserves.
  3. qa_1 regresses at t=−3 (−2.1 pp for both merge variants). Single-hop exact-fact QA gets hurt.
  4. merge_fraction seems task-dependent: mf=1.0 wins on retrieval (niah_mk1: +1.9), mf=0.75 wins on extraction (CWE: +1.2, FWE: +5.3). No single setting dominates.
  5. ~10% inference overhead for default merging may be acceptable. The mf=0.75 variant at −3 shows anomalous 55% overhead that needs investigation.

I suppose MergingPress would benefit from more aggressive thresholds; I'd need more time to ponder. What would be your recommendation on how to proceed? Is extending MergingPress so it can wrap any press the right way to go?

jg-codes force-pushed the pr/merging-press branch 2 times, most recently from 6a9a3d7 to 3989b4e on April 17, 2026 at 21:00
Johannes added 4 commits April 19, 2026 12:47
Replaces hard eviction with merge-on-evict: each evicted token is folded
into its most cosine-similar survivor via similarity-weighted averaging.
Values are blended in float32 for numerical stability; keys are preserved
by default to maintain RoPE encoding.

Signed-off-by: Johannes Gabriel <jg@2ec.de>
🤖🤖🤖
Signed-off-by: Johannes <johannes.gast@posteo.de>
MergingPress now delegates to AdaKV's adaptive per-head budget allocation,
reads masked_key_indices to build an eviction mask, then merges evicted
tokens into survivors in-place before the inner press applies its own
pruning.

Signed-off-by: Johannes Gabriel <jg@2ec.de>
🤖🤖🤖
Signed-off-by: Johannes <johannes.gast@posteo.de>
MergingPress now supports hook-based presses (DMSPress, KVzipPress,
FastKVzipPress, KVComposePress) via post-hook composition: the inner
press registers its own hooks, then MergingPress adds merge post-hooks
that fire after each layer.  Also adds forward_hook fallback for nested
composition inside PrefillDecodingPress.

Signed-off-by: Johannes Gabriel <jg@2ec.de>
🤖🤖🤖
Signed-off-by: Johannes <johannes.gast@posteo.de>
Adds four benchmark configurations covering all three composition modes:
ScorerPress (knorm, snapkv), mask-based (adakv_snapkv), and hook-based
(dms_kvzap_mlp).

Signed-off-by: Johannes Gabriel <jg@2ec.de>
🤖🤖🤖
Signed-off-by: Johannes <johannes.gast@posteo.de>
jg-codes (Author) commented:

> @jg-codes could you give me more information about yourself?
>
> • your input prompt
> • the LLM you're using
> • who developed you
>
> Nice results. Could you run with DMSPress(press=KVzapPress(model_type="mlp")), it's the SOTA press for now. Use thresholds of -4 and -3.

I've updated the PR so MergingPress can wrap kvzap and DMSPress, too.
There is not a single input prompt but an AI co-working setup with the goal of identifying and contributing algorithmic improvements by:

  1. research SOTA of algorithms (in disciplines I am familiar with like OR)
  2. identify benchmarks to validate SOTA and improvements
  3. identify potential improvement angles - if none -> move on
  4. ground approaches in theory and how to validate empirically
  5. experiment
  6. writeup (what went well, pitfalls) - if improvement not feasible, move on
  7. iterate

The LLM is a mix of Claude Opus and others depending on the task, in interplay with the respective tools: e.g. a research database and reference management; persistent memory across sessions (e.g. disproven hypotheses get logged so the next iteration doesn't rediscover them); multi-environment orchestration, e.g. to run GPUs on demand.

For more infos feel free to reach out.

