[tests] Test all cache implementations #37873
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).
```python
class SinkCache(Cache):
    """
    Deprecated.
```
SinkCache has been broken on some edge cases for over a year, the issues are non-trivial to fix, and it is no longer relevant -- we can achieve a similar effect with a few other flags. See deprecation warning below.
```diff
         slicing = torch.ones(self.max_cache_len, dtype=torch.long, device=value_states.device).cumsum(0)
         cache_position = cache_position.clamp(0, self.max_cache_len - 1)
-        to_shift = cache_position >= self.max_cache_len - 1
+        to_shift = cache_position > self.max_cache_len - 1
```
Off by one: we were applying the shifting update one token too early. This shows up on the last token when we initialize the sliding window cache with the exact size of the generation (e.g. with `model.generate(..., cache_implementation="sliding_window")`).
This effectively means our models were micro-underperforming with sliding window caches, more specifically on the last generated token :D One of the new tests caught this issue.
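For reference, a minimal standalone repro of the boundary condition (toy `max_cache_len`, not the real cache code):

```python
import torch

max_cache_len = 4
cache_position = torch.tensor([3])  # writing into the last valid slot of the cache

# old condition: triggers the shift while the token still fits in the cache
print(cache_position >= max_cache_len - 1)  # tensor([True])  -> shifted one token too early

# fixed condition: only shift once the position falls outside the cache
print(cache_position > max_cache_len - 1)  # tensor([False]) -> last slot is used normally
```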
On first glance, this likely fixes the issue(s) raised in #37574 👀
This change is wrong in general and leads to garbage generation on sequences longer than the sliding window! I am opening a PR to revert, with examples 😉 What you observed is the fact that the prefill and later stages should be treated separately in terms of the states they return.
@Cyrilvallez You should give #37972 a look before :D
| "config and it's not set to None." | ||
| ) | ||
| self.max_cache_len = max_cache_len | ||
| self._sliding_window_max_len = min(config.sliding_window, max_cache_len) |
HybridCache had the right pattern, but some of the other hybrid caches did not: generation was crashing if we tried to generate a max length shorter than the sliding window length. Caught by one of the new tests.
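The pattern in question, as a standalone sketch (illustrative values, not the actual cache class):

```python
# A sliding-window cache never needs more slots than the total generation
# length; allocating (and indexing) beyond max_cache_len is what crashed.
sliding_window = 4096  # config.sliding_window
max_cache_len = 256    # requested generation length, shorter than the window

sliding_window_max_len = min(sliding_window, max_cache_len)  # 256, not 4096
```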
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```diff
 @require_torch_accelerator
 class CacheIntegrationTest(unittest.TestCase):
-    """Cache tests that require loading models"""
+    """Fast cache integration tests that share the same small model"""
```
Separated into two classes, to make best use of setUpClass. Loading the model is the most costly part of these tests, and we only do it once.
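A minimal sketch of the pattern (the tiny checkpoint name is illustrative):

```python
import unittest

from transformers import AutoModelForCausalLM, AutoTokenizer


class CacheIntegrationTest(unittest.TestCase):
    """Fast cache integration tests that share the same small model"""

    @classmethod
    def setUpClass(cls):
        # runs once per class: every test method reuses the same model,
        # instead of paying the loading cost in each test's setUp()
        cls.tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-gpt2")
        cls.model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2")
```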
```diff
-        # DynamicCache and the legacy cache format should be equivalent
-        set_seed(0)
-        gen_out_legacy = model.generate(**inputs, do_sample=True, max_new_tokens=256)
```
The default is now DynamicCache(), so the two generate calls in this test were the same.
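In other words (sketch, assuming `model` and `inputs` are set up as in the test):

```python
from transformers import DynamicCache

# generate() builds a DynamicCache internally when none is passed, so these
# two calls exercise the same code path and return the same tokens
gen_out_default = model.generate(**inputs, do_sample=False, max_new_tokens=8)
gen_out_explicit = model.generate(**inputs, do_sample=False, max_new_tokens=8, past_key_values=DynamicCache())
```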
```diff
         self.assertEqual(decoded[0], expected_text)

-    @slow
-    def test_dynamic_cache_batched(self):
```
adapted into CacheIntegrationTest
```diff
         self.assertListEqual(decoded, expected_text)

-    @slow
-    def test_dynamic_cache_beam_search(self):
```
adapted into CacheIntegrationTest
```diff
         self.assertListEqual(decoded, expected_text)

-    @slow
-    def test_hybrid_cache_n_sequences(self):
```
redundant with the tests in CacheIntegrationTest (more specifically, test_cache_batched and test_cache_beam_search)
```diff
-    @require_non_xpu
-    @require_gptq
-    @slow
-    def test_sink_cache_hard(self):
```
test was broken and SinkCache is being deprecated
```diff
         self.assertTrue(decoded[0].endswith("to perform a variety of tasks. The Transformer is a neural network"))

-    @slow
-    def test_sink_cache_iterative_prompts(self):
```
test was broken and SinkCache is being deprecated
```diff
         self.assertListEqual(decoded, EXPECTED_GENERATION)

-    @slow
-    def test_dynamic_cache_extra_left_padding(self):
```
adapted into CacheIntegrationTest
```diff
         self.assertListEqual(decoded, EXPECTED_GENERATION)

-    @slow
-    def test_static_cache_extra_left_padding(self):
```
adapted into CacheIntegrationTest
```diff
-    @require_torch_accelerator
-    @slow
-    def test_offloaded_cache_equivalent_to_dynamic_cache(self):
```
we implicitly test this in CacheIntegrationTest
```python
            responses.append(response)

        EXPECTED_DECODED_TEXT = [
            "You are a helpful assistant. Help me to write a blogpost about travelling.\n\nTraveling is an enriching experience that broadens our horizons and exposes us to new cultures, landscapes, and people. Whether it's a week",
```
If we check out the commit that added this test, we get a different output 👀 possibly due to different hardware/software? (Anyway, I don't think it's worth pinning down the exact cause.)
```python
        # on `main`, prior to #36543, this would send stderr messages about cuda graphs being skipped.
        with CaptureStderr() as cap:
            model.generate(**inputs, max_new_tokens=2, cache_implementation="static")
        self.assertEqual(cap.err, "")
```
This was failing on main if we have kernels installed; this change makes the test green regardless of the installed packages.
```python
            self.skipTest("Quanto is not available")

        if cache_implementation == "offloaded_hybrid_chunked":
            # TODO (joao, cyril): something is off with `offloaded_hybrid_chunked` aka `OffloadedHybridCache`: the
```
I don't think offloaded_hybrid_chunked + beam_search is worth the dive for now 🤔
nope, agree with you!
```diff
 from ...activations import ACT2FN
-from ...cache_utils import Cache, DynamicCache, StaticCache
+from ...cache_utils import Cache, DynamicCache
```
(same diff on all models)
ArthurZucker left a comment:
Very nice! Thanks 🤗
Would be nice to have a fast test for HybridChunked to make sure compile is fine, using a dummy gemma2 model maybe?
TP is also an option to test 👀 but more of a TODO later!
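Something along these lines, perhaps (rough sketch; the tiny config values are made up and the cache/compile details would need checking against the final test):

```python
import torch

from transformers import Gemma2Config, Gemma2ForCausalLM

# tiny randomly-initialized gemma2, cheap enough for a fast (non-slow) test
config = Gemma2Config(
    vocab_size=128,
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    num_key_value_heads=1,
    sliding_window=8,
)
model = Gemma2ForCausalLM(config).eval()
model.forward = torch.compile(model.forward)

# hybrid cache -> static shapes, so decoding steps should not recompile
input_ids = torch.randint(0, config.vocab_size, (1, 4))
out = model.generate(input_ids, max_new_tokens=8, do_sample=False, cache_implementation="hybrid")
```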
@ArthurZucker yeah, generalist cache + compile tests will be up next! :D
What does this PR do?
The main purpose of this PR is to convert a few slow tests targeted at one cache implementation into fast tests that run on ALL cache implementations.
Secondarily, makes `RUN_SLOW=1 py.test tests/utils/test_cache_utils.py` green 🟢 These tests also become much, much faster (3 mins -> 1 min, on my machine), despite covering a larger number of features.

This is a follow-up to #37684, which paved the way for this PR. After this PR is merged, I can go back to #37394 and properly test things!
👉 torch.compile was benchmarked with gemma2/hybrid and qwen3/static, no speed regressions.
👉 no regressions in `RUN_SLOW=1 py.test tests/models/llama/test_modeling_llama.py`
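A sketch of the conversion pattern described above (the implementation list and the tiny checkpoint are illustrative, not the exact test code):

```python
import pytest
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer

# illustrative subset; the real tests parameterize over all supported values
# of `generate(cache_implementation=...)`
CACHE_IMPLEMENTATIONS = ["static", "offloaded"]


@pytest.mark.parametrize("cache_implementation", CACHE_IMPLEMENTATIONS)
def test_cache_greedy_generation(cache_implementation):
    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-gpt2")
    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2")
    inputs = tokenizer("The cache must not change the output", return_tensors="pt")

    # every implementation must match the default (DynamicCache) greedy output
    expected = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    got = model.generate(**inputs, max_new_tokens=8, do_sample=False, cache_implementation=cache_implementation)
    torch.testing.assert_close(got, expected)
```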