Blog post on KV cache quantization #2045
Conversation
younesbelkada
left a comment
Great work @zucchini-nlp! Looking forward to 🚢 this in the HF ecosystem! 🚀 Left a few comments.
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
SunMarc
left a comment
Very nice blog post @zucchini-nlp! 🔥 I left a few comments.
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
gante
left a comment
Very cool blog post!
I've added some tiny nits, but also approved as it could be merged as is (except for the missing thumbnail)
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
This is a new post, right? In that case, we need to add an entry to _blog.yml before this can be merged.
For example: https://github.com/huggingface/blog/blob/main/_blog.yml#L3999-L4007
@pcuenca thanks, added it to the yaml file
It is worth noting that processing input prompt tokens (the pre-fill stage), unlike subsequent generated tokens, still requires computing the entire key-value matrices in one go for the whole input, which can be another memory bottleneck for long contexts. Correspondingly, the latency of generating the first token tends to be higher than that of subsequent tokens. There are other strategies to reduce the memory burden of the pre-fill stage by optimizing the attention computation, such as [Local Windowed Attention](https://arxiv.org/abs/2004.05150) or [Flash-Attention](https://arxiv.org/abs/2307.08691). If you run out of memory during the pre-fill stage, you can use `FlashAttention` in 🤗 Transformers along with KV cache quantization to decrease memory usage even further for long input prompts. See the [docs](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) for more information.
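A minimal sketch of combining the two (the checkpoint and generation settings are illustrative, and this assumes `flash-attn` and the `quanto` backend are installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any model with Flash Attention 2 support works
ckpt = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# Flash Attention 2 cuts the attention memory cost of the pre-fill stage
model = AutoModelForCausalLM.from_pretrained(
    ckpt,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tokenizer("A very long input prompt ...", return_tensors="pt").to(model.device)

# Quantize the KV cache during decoding (here int4 via the quanto backend)
out = model.generate(
    **inputs,
    max_new_tokens=100,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```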
> If you are interested how many tokens we can fit in the context if we were to push the memory usage to its limits, quantized kv cache can support up to 128k tokens with Flash Attention enabled in an 80GB A100. For the cache in half precision, the maximum capacity is 40k tokens.

Suggested change:

> If you are interested to know how many tokens we can fit in the context when we push memory usage to its limits, quantized kv cache can support up to 128k tokens with Flash Attention enabled in an 80GB A100. For the cache in half precision, the maximum capacity is 40k tokens.
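For intuition on where those limits come from, here is a hypothetical back-of-envelope estimate of the cache footprint alone, assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dim 128); these numbers are not from the post itself:

```python
# Assumed (hypothetical) model dimensions, not taken from the post
num_layers, num_kv_heads, head_dim = 32, 32, 128

def kv_cache_bytes_per_token(bytes_per_value: float) -> float:
    # Factor 2 accounts for storing both keys and values at every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

fp16 = kv_cache_bytes_per_token(2.0)  # half precision: 2 bytes per value
int4 = kv_cache_bytes_per_token(0.5)  # 4-bit quantized: 0.5 bytes per value

print(f"fp16 cache, 40k tokens:  {40_000 * fp16 / 1e9:.1f} GB")
print(f"int4 cache, 128k tokens: {128_000 * int4 / 1e9:.1f} GB")
```

These figures cover only the cache tensors; the measured limits quoted above also include model weights, pre-fill activations, and quantization overhead.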
pcuenca
left a comment
Approving to unblock; please make sure the name in the yaml matches the filename :)
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
gante
left a comment
🔥 (ready to merge on my end)
I don't have merge rights on this repo; who can merge?

@zucchini-nlp merged
Add the blog post for KV cache quantization. Mostly the finalized version, but I could not find a way to group figures in a row to save space. Let me know if you have any ideas on how it's done :)
cc @younesbelkada @SunMarc @gante