
Blog post on KV cache quantization#2045

Merged
gante merged 48 commits into huggingface:main from zucchini-nlp:Raushan/CacheQuantization
May 23, 2024

Conversation

@zucchini-nlp
Member

Add the blog post for kv cache quantization. Mostly the finalized version, but I could not find a way to group figures in a row to save space. Let me know if you have any idea on how it's done :)

cc @younesbelkada @SunMarc @gante

Contributor

@younesbelkada younesbelkada left a comment

Great work @zucchini-nlp! Looking forward to 🚢 this in the HF ecosystem! 🚀 Left a few comments

Comment thread kv_cache_quantization.md Outdated
zucchini-nlp and others added 4 commits May 7, 2024 19:27
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Member

@SunMarc SunMarc left a comment

Very nice blog post @zucchini-nlp! 🔥 I left a few comments

Comment thread kv_cache_quantization.md Outdated
zucchini-nlp and others added 7 commits May 7, 2024 20:48
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Contributor

@gante gante left a comment

Very cool blog post!

I've added some tiny nits, but also approved as it could be merged as is (except for the missing thumbnail)

Comment thread kv_cache_quantization.md Outdated
zucchini-nlp and others added 7 commits May 8, 2024 19:48
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
zucchini-nlp and others added 2 commits May 13, 2024 14:58
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Member

@pcuenca pcuenca left a comment

This is a new post, right? In that case, we need to add an entry to _blog.yml before this can be merged.

For example: https://github.com/huggingface/blog/blob/main/_blog.yml#L3999-L4007

@zucchini-nlp
Member Author

@pcuenca thanks, added it to the yaml file

Member

@pcuenca pcuenca left a comment

Very cool, great work! 🔥

Comment thread kv_cache_quantization.md Outdated
Comment thread kv_cache_quantization.md

It is worth noting that processing the input prompt tokens (the pre-fill stage), unlike subsequent generated tokens, still requires computing the entire key-value matrices in one go for the whole input, which can be another memory bottleneck for long contexts. As a result, the latency of generating the first token tends to be higher than that of subsequent tokens. There are other strategies to reduce the memory burden of the pre-fill stage by optimizing the attention computation itself, such as [Local Windowed Attention](https://arxiv.org/abs/2004.05150) or [Flash-Attention](https://arxiv.org/abs/2307.08691). If you run out of memory during the pre-fill stage, you can use `FlashAttention` in 🤗 Transformers along with kv cache quantization to reduce memory usage even further for long input prompts. See the [docs](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) for more information.
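
Combining the two in 🤗 Transformers looks roughly like the sketch below. The checkpoint name and the 4-bit `quanto` settings are illustrative assumptions, and running it requires a GPU with the `flash-attn` and `quanto` packages installed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any decoder-only model with FA2 support works.
ckpt = "meta-llama/Llama-2-7b-chat-hf"

model = AutoModelForCausalLM.from_pretrained(
    ckpt,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # memory-efficient pre-fill
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

inputs = tokenizer("A very long prompt ...", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    cache_implementation="quantized",            # quantize the kv cache
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```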

If you are interested how many tokens we can fit in the context if we were to push the memory usage to its limits, quantized kv cache can support up to 128k tokens with Flash Attention enabled in an 80GB A100. For the cache in half precision, the maximum capacity is 40k tokens.
Member

Suggested change
If you are interested how many tokens we can fit in the context if we were to push the memory usage to its limits, quantized kv cache can support up to 128k tokens with Flash Attention enabled in an 80GB A100. For the cache in half precision, the maximum capacity is 40k tokens.
If you are interested to know how many tokens we can fit in the context when we push memory usage to its limits, quantized kv cache can support up to 128k tokens with Flash Attention enabled in an 80GB A100. For the cache in half precision, the maximum capacity is 40k tokens.
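
For a rough sense of where those limits come from, the cache footprint can be estimated from the model dimensions. A minimal sketch, assuming Llama-2-7B-like shapes (32 layers, 32 KV heads, head dim 128; these numbers are illustrative, not from the post):

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_elem=2.0):
    """Estimated kv cache size: one key and one value matrix per layer."""
    return int(2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem)

GIB = 1024 ** 3
fp16_40k = kv_cache_bytes(40_000, bytes_per_elem=2.0)        # half precision
int4_128k = kv_cache_bytes(128 * 1024, bytes_per_elem=0.5)   # 4-bit quantized
print(f"fp16 cache, 40k tokens:  {fp16_40k / GIB:.1f} GiB")   # ~19.5 GiB
print(f"int4 cache, 128k tokens: {int4_128k / GIB:.1f} GiB")  # 16.0 GiB
```

Under these assumptions, a quantized cache at 128k tokens occupies less memory than a half-precision cache at 40k tokens, which is directionally consistent with the limits quoted above.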

Comment thread kv_cache_quantization.md Outdated
Comment thread _blog.yml
Member

@pcuenca pcuenca left a comment

Approving to unblock, please make sure the name in the yaml matches the filename :)

zucchini-nlp and others added 15 commits May 16, 2024 13:45
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Contributor

@gante gante left a comment

🔥 (ready to merge on my end)

@zucchini-nlp
Member Author

I don't have merge rights on this repo, who can merge?

@gante gante merged commit e6056ef into huggingface:main May 23, 2024
@gante
Contributor

gante commented May 23, 2024

@zucchini-nlp merged



6 participants