Conversation
Below are additional results for x3 and x4 compression. I'm wondering if I should add
In my opinion, it’s worth adding.😀 This operation provides clear benefits while introducing little overhead in real-world deployment. My co-author, Junlin Lv, and I have developed a Triton kernel that optimizes this computation through kernel fusion, reducing memory usage significantly while maintaining high computational efficiency. We plan to open-source this kernel soon. Even with a naive implementation, I believe its inference overhead remains negligible. Moreover, to the best of my knowledge, this is the first attempt to leverage pre-trained model parameters to identify critical KV cache entries, making it a promising new research direction. I believe this direction is worth exploring further, as incorporating additional pre-trained parameter information could drive meaningful advancements in the future.
I will merge it as is and we'll investigate later! Using ||WoV|| instead of ||V|| makes a lot of sense, but the main contribution of this PR is the addition of epsilon, which is far more important. It still needs to be investigated why this epsilon works so well.
Signed-off-by: Max Jeblick <maximilianjeblick@gmail.com>

Inspired by experiments on CriticalKVPress (#46), I noticed the most important parameter is epsilon. This parameter appears to be the key driver of the large performance boost. In this PR I propose a very simple change to ExpectedAttentionPress to include this epsilon. I get even better performance using ||WoV|| instead of ||V|| (see branch simon/update-vnorm).
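As a rough illustration only (this is not kvpress's actual implementation; the shapes, variable names, and scoring formula below are assumptions for the sketch), the interplay between the epsilon term and rescaling by ||WoV|| rather than ||V|| might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: n cached tokens, head dim d, output projection W_o
n, d, d_model = 8, 4, 4
V = rng.normal(size=(n, d))            # cached value vectors
W_o = rng.normal(size=(d, d_model))    # output projection weights
expected_attention = rng.random(n)     # per-token expected attention weight
epsilon = 1e-2                         # small additive term discussed in this PR

# Baseline: expected attention rescaled by the raw value norm ||V||
score_v = expected_attention * np.linalg.norm(V, axis=1)

# Variant: rescale by ||W_o V||, i.e. each value's magnitude after the
# output projection, so the score reflects its contribution to the output
score_wov = expected_attention * np.linalg.norm(V @ W_o, axis=1)

# Adding epsilon keeps tokens with near-zero expected attention from
# being scored as exactly zero and pruned purely on attention alone
score_eps = (expected_attention + epsilon) * np.linalg.norm(V @ W_o, axis=1)

# Under a compression budget, the lowest-scoring entries are evicted first
evict_order = np.argsort(score_eps)
```

Since epsilon is strictly positive, `score_eps` dominates `score_wov` term by term; the effect is that value magnitude still matters for tokens the attention estimate would otherwise discard entirely.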