Fix quantized cache by maxjeblick · Pull Request #30 · NVIDIA/kvpress

maxjeblick · 2024-12-10T17:10:55Z

This PR fixes the quantized cache, which failed when more than one answer was generated.

The quantized cache also keeps a key/value cache (of dummy 0s) that must be pruned after answer generation; failing to prune it caused the original issue.
The quantized cache uses _seen_tokens to determine the cache sequence length, so this value is now adjusted to stay in sync with the actual KV cache.
Added several tests, including integration tests for RULER.
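
The two fixes above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: ToyQuantizedCache, prune_to_context, and make_cache are hypothetical names, and plain Python lists stand in for the tensors a real cache would hold. The only detail taken from the description is that the sequence length is read from _seen_tokens, so that counter has to be reset whenever the cache is pruned back to the context.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyQuantizedCache:
    """Hypothetical stand-in for a quantized KV cache (not the real library class)."""

    # One list of per-token entries per layer; a real cache would hold tensors.
    key_cache: List[List[int]] = field(default_factory=list)
    value_cache: List[List[int]] = field(default_factory=list)
    _seen_tokens: int = 0

    def append(self, n_tokens: int) -> None:
        """Simulate caching n_tokens new tokens in every layer."""
        for layer in range(len(self.key_cache)):
            self.key_cache[layer].extend([0] * n_tokens)
            self.value_cache[layer].extend([0] * n_tokens)
        self._seen_tokens += n_tokens

    def get_seq_length(self) -> int:
        # The sequence length comes from the counter, not from the stored entries.
        return self._seen_tokens


def prune_to_context(cache: ToyQuantizedCache, context_length: int) -> None:
    """Drop everything cached after the context and resync the token counter."""
    for layer in range(len(cache.key_cache)):
        cache.key_cache[layer] = cache.key_cache[layer][:context_length]
        cache.value_cache[layer] = cache.value_cache[layer][:context_length]
    # Without this line the next answer would start from a stale length.
    cache._seen_tokens = context_length


def make_cache(n_layers: int, context_length: int) -> ToyQuantizedCache:
    """Build a toy cache that already holds the (compressed) context."""
    cache = ToyQuantizedCache(
        key_cache=[[] for _ in range(n_layers)],
        value_cache=[[] for _ in range(n_layers)],
    )
    cache.append(context_length)
    return cache
```

In this sketch, generating a first answer grows the cache, prune_to_context restores it to the context length, and a second answer then starts from a consistent state.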

Fixes #27

maxjeblick added 28 commits December 9, 2024 12:00

fix generate_answer for quantized cache

cfa4e58

fix value chache pruning

190b3c4

improve test

4255ba1

improve test

d766ce9

add integration tests

9c30df1

Merge branch 'main' into max/fix_quanto_cache

4df78ef

# Conflicts: # kvpress/pipeline.py # kvpress/presses/base_press.py # tests/presses/test_presses.py # tests/test_per_layer_compression_wrapper.py # tests/test_pipeline.py

get correct context length

f753eae

fix qunatized key cache

7649ab9

fix qunatized key cache

11a4c45

fix test

cfec9b8

fix test

1bfe667

fix test

d46cc55

add more asserts

524fe47

fix test

c07cc4d

fix test

6ff4bc9

fix test

960e05e

Merge branch 'main' into max/fix_quanto_cache

5c5e1b9

# Conflicts: # kvpress/__init__.py # kvpress/presses/scorer_press.py

fix merge conflicts

58bc8ae

fix failing tests

16e8671

import flash attn skip

d9887ea

fix test

f5d9d59

add integration tests

c620cb0

add integration tests

ca4f0ba

add integration tests

702888a

add fixture

4369c02

easen up test

22c9549

undo vvariable extraction

4b66477

undo newlines

8006a75

maxjeblick requested a review from SimJeg December 10, 2024 17:10

maxjeblick mentioned this pull request Dec 11, 2024

Add KeyRerotationPress #31

Merged

Merge branch 'main' into max/fix_quanto_cache

8eb3b3d

SimJeg reviewed Dec 11, 2024

Comment thread tests/integration/test_ruler.py Outdated

Comment thread tests/integration/test_ruler.py

maxjeblick added 3 commits December 11, 2024 15:35

address pr feedback

519222b

fix broken test

f75cb61

fix broken test

261bcea

SimJeg approved these changes Dec 11, 2024

maxjeblick merged commit 7503f0d into main Dec 11, 2024

maxjeblick deleted the max/fix_quanto_cache branch December 11, 2024 16:14

maxjeblick merged 32 commits into main from max/fix_quanto_cache
