On RULER with a 25% compression ratio and Llama 3.1 8B Instruct:

cc @FFY0 for the results
@SimJeg These results seem to align well with my previous implementation. In my earlier evaluation, with a 30% compression ratio and Llama 3.1 8B Instruct, SnapKV's average score was 79.9, which increased to 86.9 when combined with AdaKV.
More results on RULER 4k @FFY0, looks great! I hope my implementation has no flaw.
Hi @SimJeg, the results look great and align well with my implementation: at a 0.1 compression ratio, SnapKV improves from 87.7 to 92.9 with Ada-SnapKV. Additionally, thank you for the results of Ada-expected-attention; it looks really promising! I believe this will significantly aid future research on head-specific compression.
@FFY0 I'm launching additional benchmarks using |
@SimJeg Yes, your points are absolutely right, and these directions could indeed further enhance the performance of AdaKV. I'd be happy to share my thoughts as well:

**On the Setting of Alpha**

Our experiments on LongBench indicate that smaller alpha values perform better under smaller budgets, while larger values are more effective for larger budgets. In AdaKV, we deliberately avoided fine-tuning this parameter and instead used a fixed value across all experiments to demonstrate its robustness. However, if further optimization is desired, adjusting alpha could yield noticeable performance improvements. Additionally, I observed unstable performance drops on certain datasets when alpha was set to 0. To mitigate this, I recommend using a very small but non-zero alpha value, such as 0.05, to improve performance for smaller budgets while maintaining stability.

**On Head-Wise Budget Allocation Across Layers**

In the early stages, we experimented with head-wise budget allocation across layers and observed performance gains. However, we eventually discontinued this approach for the following reasons:
That said, in context-only compression scenarios—where the compressed cache can be stored offline and reused for future questions—frequent prefill operations are unnecessary. In such cases, cross-layer scheduling could be a promising optimization direction. While I didn't conduct experiments on this aspect, I believe it is worth exploring, as related works on layer budget allocation have supported the feasibility of this direction.
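For readers unfamiliar with the alpha parameter discussed above, here is a hedged sketch of an AdaKV-style safeguard allocation (the function and variable names are illustrative, not taken from the kvpress codebase): a fraction `alpha` of the total cache budget is split uniformly across heads as a floor, and the remaining fraction is assigned adaptively to the globally top-scoring cache entries across heads.

```python
import torch


def allocate_budgets(head_scores: torch.Tensor, total_budget: int, alpha: float = 0.2) -> torch.Tensor:
    """Sketch of AdaKV-style head-wise budget allocation.

    head_scores: (num_heads, seq_len) importance scores per head.
    Each head first gets a safeguard floor of alpha * total_budget / H;
    the rest is distributed by picking the globally top-scoring entries.
    """
    H, L = head_scores.shape
    base = int(alpha * total_budget / H)          # uniform safeguard share per head
    budgets = torch.full((H,), base, dtype=torch.long)
    remaining = total_budget - base * H
    # Select the top `remaining` entries across all heads at once
    top = head_scores.flatten().topk(remaining).indices
    heads = top // L                              # which head each selected entry belongs to
    budgets += torch.bincount(heads, minlength=H)
    return budgets


budgets = allocate_budgets(torch.rand(4, 16), total_budget=16, alpha=0.5)
print(budgets)  # sums to 16, each head gets at least 2
```

With alpha=1 this degenerates to a uniform per-head budget (the SnapKV setting); with alpha=0 the allocation is fully adaptive, which is where the instability mentioned above can appear.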
Thanks for sharing your thoughts. I will keep using 0.2 as the default, as you initially proposed. With alpha=0 I get the same results on RULER for SnapKV and ExpectedAttention (±0.001). I agree on head-wise allocation across layers; interesting to know you saw performance gains too!
Force-pushed from 32d6269 to 3ac3df2
maxjeblick left a comment:
LGTM, thanks a lot for the neat implementation!
Signed-off-by: Max Jeblick <maximilianjeblick@gmail.com>


PR description
This PR introduces a new press called `AdaKVPress`, following the great work of @FFY0. It is the first press achieving head-wise compression. Instead of adding a new kernel as initially proposed, I "fake" the compression by replacing each pruned key with a fake key K such that exp(QᵀK) ≈ 0 (i.e. the pruned position has no effect in attention). The fake keys are recomputed at every decoding step, which is achieved by patching the newly introduced `ALL_ATTENTION_FUNCTIONS` in transformers. The patch is applied only when the attention module has a `masked_key_indices` attribute which is not None, ensuring compatibility with previous presses.

Tests have not been written yet.
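To illustrate the fake-key trick described above, here is a toy sketch of my own (not the actual kvpress code): for a query q, a key k = -c·q / (q·q) satisfies q·k = -c, so for large c the softmax assigns essentially zero attention weight to the pruned position, while the remaining positions renormalize as usual.

```python
import math

import torch


def fake_key(q: torch.Tensor, scale: float = 1e4) -> torch.Tensor:
    # Toy construction: q @ fake_key(q) == -scale, so exp(score) underflows
    # to ~0 and the pruned position receives ~0 attention weight.
    # Since the key depends on q, it must be recomputed at each decoding step.
    return -scale * q / (q @ q)


torch.manual_seed(0)
d = 8
q = torch.randn(d)
keys = torch.randn(4, d)
keys[2] = fake_key(q)  # "prune" position 2 for this head

attn = torch.softmax(keys @ q / math.sqrt(d), dim=0)
print(attn)  # weight at index 2 is ~0; the others sum to ~1
```

This shows why no custom kernel is needed: the standard attention computation runs unchanged, and the pruned entries simply contribute nothing to the output.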
New press checklist (if applicable)
- Added `mypress_press.py` in the `presses` directory
- Added `MyPress` in `__init__.py`
- Updated `README.md` with a 1 liner about my new press in the Available presses section
- Added the press to the `default_presses` list in `tests/default_presses.py`