On RULER with a 25% compression ratio and Llama 3.1 8B Instruct:

cc @FFY0 for the results
@SimJeg These results seem to align well with my previous implementation. In my earlier evaluation, with a 30% compression ratio and Llama 3.1 8B Instruct, SnapKV's average score was 79.9, which increased to 86.9 when combined with AdaKV.
More results on RULER 4k @FFY0, looks great! I hope my implementation has no flaw.
Hi @SimJeg, the results look great and align well with my implementation: at a 0.1 compression ratio, SnapKV improves from 87.7 to 92.9 with Ada-SnapKV. Additionally, thank you for the results of Ada-expected-attention; it looks really promising! I believe this will significantly aid future research on head-specific compression.
@FFY0 I'm launching additional benchmarks using |
@SimJeg Yes, your points are absolutely right, and these directions could indeed further enhance the performance of AdaKV. I'd be happy to share my thoughts as well:

**On the Setting of Alpha**

Our experiments on LongBench indicate that smaller alpha values perform better under smaller budgets, while larger values are more effective for larger budgets. In AdaKV, we deliberately avoided fine-tuning this parameter and instead used a fixed value across all experiments to demonstrate its robustness. However, if further optimization is desired, adjusting alpha could yield noticeable performance improvements. Additionally, I observed unstable performance drops on certain datasets when alpha was set to 0. To mitigate this, I recommend using a very small but non-zero alpha value, such as 0.05, to improve performance for smaller budgets while maintaining stability.

**On Head-Wise Budget Allocation Across Layers**

In the early stages, we experimented with head-wise budget allocation across layers and observed performance gains. However, we eventually discontinued this approach for the following reasons:
That said, in context-only compression scenarios—where the compressed cache can be stored offline and reused for future questions—frequent prefill operations are unnecessary. In such cases, cross-layer scheduling could be a promising optimization direction. While I didn't conduct experiments on this aspect, I believe it is worth exploring, as related works on layer budget allocation have supported the feasibility of this direction.
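For readers unfamiliar with the alpha parameter discussed above, here is a hedged sketch of an AdaKV-style safeguard allocation (the function and variable names are illustrative, not taken from the kvpress codebase): a fraction `alpha` of the total cache budget is split uniformly across heads as a floor, and the remaining fraction is assigned adaptively to the globally top-scoring cache entries across heads.

```python
import torch


def allocate_budgets(head_scores: torch.Tensor, total_budget: int, alpha: float = 0.2) -> torch.Tensor:
    """Sketch of AdaKV-style head-wise budget allocation.

    head_scores: (num_heads, seq_len) importance scores per head.
    Each head first gets a safeguard floor of alpha * total_budget / H;
    the rest is distributed by picking the globally top-scoring entries.
    """
    H, L = head_scores.shape
    base = int(alpha * total_budget / H)          # uniform safeguard share per head
    budgets = torch.full((H,), base, dtype=torch.long)
    remaining = total_budget - base * H
    # Select the top `remaining` entries across all heads at once
    top = head_scores.flatten().topk(remaining).indices
    heads = top // L                              # which head each selected entry belongs to
    budgets += torch.bincount(heads, minlength=H)
    return budgets


budgets = allocate_budgets(torch.rand(4, 16), total_budget=16, alpha=0.5)
print(budgets)  # sums to 16, each head gets at least 2
```

With alpha=1 this degenerates to a uniform per-head budget (the SnapKV setting); with alpha=0 the allocation is fully adaptive, which is where the instability mentioned above can appear.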
Thanks for sharing your thoughts. I will keep using 0.2 as the default, as you initially proposed. With alpha=0 I get the same results on RULER for SnapKV and ExpectedAttention (±0.001). I agree on head-wise allocation across layers; interesting to know you saw performance gains too!
Force-pushed from 32d6269 to 3ac3df2
maxjeblick left a comment:
LGTM, thanks a lot for the neat implementation!
Signed-off-by: Max Jeblick <maximilianjeblick@gmail.com>


PR description
This PR introduces a new press called `AdaKVPress`, following the great work of @FFY0. It is the first press achieving head-wise compression. Instead of adding a new kernel as initially proposed, I "fake" the compression by replacing each pruned key with a fake key K such that exp(QᵀK) ≈ 0 (i.e. the pruned position has no effect in attention). The fake keys are recomputed at every decoding step, which is achieved by patching the newly introduced `ALL_ATTENTION_FUNCTIONS` in transformers. The patch is applied only when the attention module has a `masked_key_indices` attribute which is not None, ensuring compatibility with previous presses.

Tests have not been written yet.
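To illustrate the fake-key trick described above, here is a toy sketch of my own (not the actual kvpress code): for a query q, a key k = -c·q / (q·q) satisfies q·k = -c, so for large c the softmax assigns essentially zero attention weight to the pruned position, while the remaining positions renormalize as usual.

```python
import math

import torch


def fake_key(q: torch.Tensor, scale: float = 1e4) -> torch.Tensor:
    # Toy construction: q @ fake_key(q) == -scale, so exp(score) underflows
    # to ~0 and the pruned position receives ~0 attention weight.
    # Since the key depends on q, it must be recomputed at each decoding step.
    return -scale * q / (q @ q)


torch.manual_seed(0)
d = 8
q = torch.randn(d)
keys = torch.randn(4, d)
keys[2] = fake_key(q)  # "prune" position 2 for this head

attn = torch.softmax(keys @ q / math.sqrt(d), dim=0)
print(attn)  # weight at index 2 is ~0; the others sum to ~1
```

This shows why no custom kernel is needed: the standard attention computation runs unchanged, and the pruned entries simply contribute nothing to the output.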
New press checklist (if applicable)
- Added `mypress_press.py` in the `presses` directory
- Added `MyPress` in `__init__.py`
- Updated `README.md` with a 1 liner about my new press in the Available presses section
- Added the press to the `default_presses` list in `tests/default_presses.py`