Blog post on KV cache quantization #2045
Conversation
younesbelkada
left a comment
Great work @zucchini-nlp! Looking forward to 🚢 this in the HF ecosystem! 🚀 Left a few comments.
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
SunMarc
left a comment
Very nice blog post @zucchini-nlp! 🔥 I left a few comments.
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
gante
left a comment
Very cool blog post!
I've added some tiny nits, but also approved as it could be merged as is (except for the missing thumbnail)
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
This is a new post, right? In that case, we need to add an entry to _blog.yml before this can be merged.
For example: https://github.com/huggingface/blog/blob/main/_blog.yml#L3999-L4007
@pcuenca thanks, added it to the yaml file
It is worth noting that processing input prompt tokens (the pre-fill stage), unlike subsequent generated tokens, still requires computing the entire key-value matrices in one go for the whole input, which can be another memory bottleneck for long contexts. Correspondingly, the latency of generating the first token tends to be higher than that of subsequent tokens. There are other strategies to reduce the memory burden of the pre-fill stage by optimizing the attention computation, such as [Local Windowed Attention](https://arxiv.org/abs/2004.05150) or [Flash-Attention](https://arxiv.org/abs/2307.08691). If you run out of memory during the pre-fill stage, you can use `FlashAttention` in 🤗 Transformers along with KV cache quantization to decrease memory usage even further for long input prompts. See the [docs](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) for more information.
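A minimal sketch of combining the two (the checkpoint and generation settings are illustrative, and this assumes `flash-attn` and the `quanto` backend are installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any model with Flash Attention 2 support works
ckpt = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# Flash Attention 2 cuts the attention memory cost of the pre-fill stage
model = AutoModelForCausalLM.from_pretrained(
    ckpt,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tokenizer("A very long input prompt ...", return_tensors="pt").to(model.device)

# Quantize the KV cache during decoding (here int4 via the quanto backend)
out = model.generate(
    **inputs,
    max_new_tokens=100,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```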
> If you are interested how many tokens we can fit in the context if we were to push the memory usage to its limits, quantized kv cache can support up to 128k tokens with Flash Attention enabled in an 80GB A100. For the cache in half precision, the maximum capacity is 40k tokens.

Suggested change:

> If you are interested to know how many tokens we can fit in the context when we push memory usage to its limits, quantized kv cache can support up to 128k tokens with Flash Attention enabled in an 80GB A100. For the cache in half precision, the maximum capacity is 40k tokens.
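For intuition on where those limits come from, here is a hypothetical back-of-envelope estimate of the cache footprint alone, assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dim 128); these numbers are not from the post itself:

```python
# Assumed (hypothetical) model dimensions, not taken from the post
num_layers, num_kv_heads, head_dim = 32, 32, 128

def kv_cache_bytes_per_token(bytes_per_value: float) -> float:
    # Factor 2 accounts for storing both keys and values at every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

fp16 = kv_cache_bytes_per_token(2.0)  # half precision: 2 bytes per value
int4 = kv_cache_bytes_per_token(0.5)  # 4-bit quantized: 0.5 bytes per value

print(f"fp16 cache, 40k tokens:  {40_000 * fp16 / 1e9:.1f} GB")
print(f"int4 cache, 128k tokens: {128_000 * int4 / 1e9:.1f} GB")
```

These figures cover only the cache tensors; the measured limits quoted above also include model weights, pre-fill activations, and quantization overhead.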
pcuenca
left a comment
Approving to unblock; please make sure the name in the yaml matches the filename :)
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
gante
left a comment
🔥 (ready to merge on my end)
I don't have merge rights on this repo; who can merge?

@zucchini-nlp merged
Add the blog post for KV cache quantization. Mostly the finalized version, but I could not find a way to group figures in a row to save space. Let me know if you have any ideas on how it's done :)
cc @younesbelkada @SunMarc @gante