
New cache tests and modular Hybrid Cache#37972

Merged
manueldeprada merged 30 commits into huggingface:main from manueldeprada:cache-fix2
May 20, 2025

Conversation

@manueldeprada
Contributor

@manueldeprada manueldeprada commented May 6, 2025

What does this PR do?

  1. Refactor the cache update logic for static and sliding-window attention out into two new utility functions, _static_cache_update_logic and _sliding_cache_update_logic, so there is a single implementation shared by StaticCache, SlidingWindowCache, and HybridCache (a rough sketch of the two helpers follows below).
    @ArthurZucker @gante this is a first step towards per-layer modular cache definitions.
  2. Added new synthetic tests for caches. Fixes #37574 ("Wrong KV cache update for sliding-window attention (SWA) layers when total sequence length reaches window size") and should catch similar bugs.

This is preliminary work for #38077
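
For reference, a rough sketch of what the two shared helpers could look like (illustrative only; the names match the description above, but the actual signatures and details in src/transformers/cache_utils.py may differ):

import torch

def _static_cache_update_logic(k_cache, v_cache, key_states, value_states, cache_position):
    # Write the new key/value states into the preallocated buffers at the given positions.
    k_cache.index_copy_(2, cache_position, key_states)
    v_cache.index_copy_(2, cache_position, value_states)
    return k_cache, v_cache

def _sliding_cache_update_logic(k_cache, v_cache, key_states, value_states, cache_position, max_cache_len):
    # Prefill longer than the window: keep only the last `max_cache_len` tokens in the cache,
    # but return the full states so attention can still see the whole prefill.
    if key_states.shape[2] > max_cache_len:
        k_cache.copy_(key_states[:, :, -max_cache_len:])
        v_cache.copy_(value_states[:, :, -max_cache_len:])
        return key_states, value_states

    # Roll the window by one slot only once the total sequence length exceeds the window size.
    slicing = torch.arange(max_cache_len, device=k_cache.device)
    current_seq_len = cache_position[-1] + 1
    to_shift = current_seq_len > max_cache_len
    indices = (slicing + to_shift.sum()) % max_cache_len
    k_out = k_cache[:, :, indices]
    v_out = v_cache[:, :, indices]

    # Clamp the target positions into the window and write the new states into the shifted view.
    update_position = cache_position.clamp(min=0, max=max_cache_len - 1)
    k_out[:, :, update_position] = key_states
    v_out[:, :, update_position] = value_states

    # Persist the shifted and updated contents back into the cache buffers.
    k_cache.copy_(k_out)
    v_cache.copy_(v_out)
    return k_out, v_out

StaticCache, SlidingWindowCache, and HybridCache can then call these helpers from their update() methods instead of each carrying its own copy of the logic.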

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@gante gante left a comment


In general looks good to me, especially the added tests. 👍

Comment thread src/transformers/cache_utils.py
Comment thread src/transformers/cache_utils.py Outdated
Comment thread src/transformers/cache_utils.py Outdated
Comment thread src/transformers/cache_utils.py Outdated
Comment thread tests/utils/test_cache_utils.py Outdated
Comment thread tests/utils/test_cache_utils.py Outdated
Comment thread tests/utils/test_cache_utils.py Outdated
Comment thread tests/utils/test_cache_utils.py Outdated
Comment thread tests/utils/test_cache_utils.py Outdated
@manueldeprada manueldeprada requested a review from gante May 6, 2025 17:19
@manueldeprada manueldeprada marked this pull request as ready for review May 7, 2025 07:50
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks for working on this!

Comment thread src/transformers/cache_utils.py Outdated
Comment thread src/transformers/cache_utils.py
Comment thread tests/utils/test_cache_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Contributor

@gante gante left a comment


I like the hardcoded tests better (easier to follow) 👍 Added a few comments to align the code style with the other tests.

Comment thread tests/utils/test_cache_utils.py Outdated
Comment thread tests/utils/test_cache_utils.py Outdated
Comment thread tests/utils/test_cache_utils.py Outdated
Comment thread tests/utils/test_cache_utils.py Outdated
@manueldeprada
Contributor Author

All suggestions applied, and all the tests are now clear "hardcoded" ones! Thanks a lot for the feedback :)
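
For context, a "hardcoded" test spells out the expected cache contents literally rather than computing them. A rough sketch of the idea (my own toy example, not the actual code in tests/utils/test_cache_utils.py), here for the long-prefill case of a sliding window:

import torch

def test_sliding_window_keeps_only_last_tokens_on_long_prefill():
    max_cache_len = 4
    k_cache = torch.zeros(1, 1, max_cache_len, 1)

    # Prefill with 6 tokens (values 1..6) while the window only holds 4.
    key_states = torch.arange(1.0, 7.0).view(1, 1, 6, 1)
    k_cache.copy_(key_states[:, :, -max_cache_len:])

    # Hardcoded expectation: only the last 4 tokens remain in the cache.
    expected = torch.tensor([3.0, 4.0, 5.0, 6.0]).view(1, 1, 4, 1)
    torch.testing.assert_close(k_cache, expected)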

@manueldeprada
Contributor Author

Thanks @Cyrilvallez for quickly merging the fix into main (#38046)! In hindsight, I could have split this PR into the fix and the refactor + tests.

That said, I think this version is better long-term: the sliding logic now lives in one place, with clear names and comments:

current_seq_len = cache_position[-1] + 1 # Use last position to determine current length
to_shift = current_seq_len > max_cache_len
indices = (slicing + to_shift.sum()) % max_cache_len
k_out_shifted = k_cache[:, :, indices]
v_out_shifted = v_cache[:, :, indices]
# Clamp cache_position to determine the *target index* within the shifted cache view
update_position = cache_position.clamp(min=0, max=max_cache_len - 1)
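
To make the roll concrete, a toy example (mine, not from the PR) with a window of 4 that already holds tokens t0..t3 when token t4 arrives at absolute position 4; note that the shift triggers only once the total length exceeds the window, which is the boundary case from #37574:

import torch

max_cache_len = 4
k_cache = torch.tensor([0, 1, 2, 3])        # stand-in for cached tokens t0..t3 (one value per slot)
cache_position = torch.tensor([4])          # absolute position of the incoming token t4
slicing = torch.arange(max_cache_len)

current_seq_len = cache_position[-1] + 1    # 5
to_shift = current_seq_len > max_cache_len  # True -> roll by one slot
indices = (slicing + to_shift.sum()) % max_cache_len   # tensor([1, 2, 3, 0])

k_out_shifted = k_cache[indices]            # tensor([1, 2, 3, 0]) -> t1, t2, t3, stale t0
update_position = cache_position.clamp(min=0, max=max_cache_len - 1)  # tensor([3])
k_out_shifted[update_position] = 4          # overwrite the stale slot with t4
print(k_out_shifted)                        # tensor([1, 2, 3, 4]) -> the window now holds t1..t4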

Member

@Cyrilvallez Cyrilvallez left a comment


Hey! Thanks for working on this! Just a few thoughts/performance tips! 🤗

Comment thread src/transformers/cache_utils.py
Comment thread src/transformers/cache_utils.py
Comment thread src/transformers/cache_utils.py
Comment thread src/transformers/cache_utils.py
Comment on lines +1364 to +1365
key_states = key_states.to(self.key_cache[layer_idx].dtype)
value_states = value_states.to(self.value_cache[layer_idx].dtype)
Member


The dtype should already be correct here, no?

Contributor Author


I am happy to remove it; the tests still pass without it. There are similar checks and casts that could be removed too, but I kept them in case existing code relies on them.

Contributor


I think we can remove the cast, yes. Looking at the original PR, these lines were added to handle the case where we don't cast RoPE-based KVs in the model forward pass (RoPE runs in FP32 by default, regardless of the model dtype).

Contributor Author


Thanks for the pointer @gante, I should have traced that down. We can't remove it: the original PR's sample code fails after removing the casts. I am restoring the cast and adding a test...

I agree though with @Cyrilvallez that it is a bad solution to cast everything instead of doing something specific for RoPE. Since this PR has limited scope and is ground work for #38077, I will try to solve it more elegantly there.
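
A hypothetical toy repro of the failure mode (mine, not the PR's actual test): with a model loaded in FP16 but RoPE applied in FP32, the incoming key states can be FP32 while the cache buffer is FP16, and dtype-strict ops such as index_copy_ then fail unless the states are cast first:

import torch

batch, heads, max_cache_len, head_dim = 1, 2, 8, 4
key_cache = torch.zeros(batch, heads, max_cache_len, head_dim, dtype=torch.float16)

# RoPE runs in FP32 by default, so the incoming key states can be FP32
# even though the model (and its cache) were loaded in FP16.
key_states = torch.randn(batch, heads, 1, head_dim, dtype=torch.float32)
cache_position = torch.tensor([0])

try:
    key_cache.index_copy_(2, cache_position, key_states)  # raises: dtypes differ
except RuntimeError as err:
    print("dtype mismatch:", err)

# The cast being discussed aligns the dtypes before the write, so this succeeds.
key_cache.index_copy_(2, cache_position, key_states.to(key_cache.dtype))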

Contributor


[perhaps for a subsequent PR, to avoid bloating/delaying this one:]

It would be more transparent and precise if the casting were done in the model architecture rather than in the cache. In the specific case of GPT-J loaded in FP16, it seems that without a cache the KVs are kept in FP32, while with a cache they are cast to FP16 in the cache class -> the cache introduces performance degradation.

As such, I think it would be positive to remove the cast, and delegate control to the model architectures.
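
A sketch of what delegating this to the model side could look like (hypothetical helper with a simplified RoPE; not an existing transformers function):

import torch

def rotate_half(x):
    # Standard RoPE trick: rotate the last dimension by swapping and negating its halves.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope_then_cast(query_states, key_states, cos, sin, compute_dtype):
    # Run RoPE in FP32 for precision, then cast straight back to the model compute
    # dtype so the cache never has to re-cast what it receives.
    q32, k32 = query_states.float(), key_states.float()
    cos32, sin32 = cos.float(), sin.float()
    q_rot = q32 * cos32 + rotate_half(q32) * sin32
    k_rot = k32 * cos32 + rotate_half(k32) * sin32
    return q_rot.to(compute_dtype), k_rot.to(compute_dtype)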

Contributor Author

@manueldeprada manueldeprada May 13, 2025


Good, I will check whether it's only a GPT-J thing in a subsequent PR. In the meantime, I added the test. It uncovered 3 fixes like this that were applied to StaticCache but not to Hybrid and Offloaded. Please have a quick look at 772b0a0 before I merge.

Member


Especially if it's only there for a given old model -> much better to fix the model rather than the general cache logic!

Comment thread src/transformers/cache_utils.py
Comment thread src/transformers/cache_utils.py
Contributor

@gante gante left a comment


Thank you for iterating 🤗

Comment thread tests/utils/test_cache_utils.py Outdated
Comment thread tests/utils/test_cache_utils.py Outdated
Comment thread tests/utils/test_cache_utils.py Outdated
Comment thread tests/utils/test_cache_utils.py Outdated
Comment thread tests/utils/test_cache_utils.py Outdated
@manueldeprada manueldeprada merged commit d34e21e into huggingface:main May 20, 2025
20 checks passed
faaany pushed a commit to faaany/transformers that referenced this pull request May 21, 2025
xvyv99 pushed a commit to xvyv99/transformers that referenced this pull request May 21, 2025


Development

Successfully merging this pull request may close these issues.

Wrong KV cache update for sliding-window attention (SWA) layers when total sequence length reaches window size

5 participants