🚨 [Cache] Native mamba & hybrid cache#44950
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
run-slow: mamba2 zamba2 granitemoehybrid falcon_h1 lfm2 lfm2_moe qwen3_5 bamba mamba nemotron_h qwen3_next zamba jamba qwen3_5_moe falcon_mamba |
This comment contains models: ["models/bamba", "models/falcon_h1", "models/falcon_mamba", "models/granitemoehybrid", "models/jamba", "models/lfm2", "models/lfm2_moe", "models/mamba", "models/mamba2", "models/nemotron_h", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_next", "models/zamba", "models/zamba2"] |
Model CI Report: ❌ 4 new failed tests from this PR 😭
For reviewers (@ArthurZucker @vasqu), I checked locally and the 4 failed tests above are failing in exactly the same way on main as on this PR. Once again, I don't know why
vasqu left a comment
Functionality-wise, I don't really have much to complain about. My comments are mostly about avoiding messy names and about standards:
- Mamba is super popular but it is a variation of linear attention
- Not all linear attentions (GDN like qwen) are mamba (e.g. no SSM view)
- This will get messy if we force all linear attentions to be named after mamba
Imo, we should be careful and focus on establishing a good standard here. Let's assume more linear attention flavors will pop up!
Btw, could we have more mixins, e.g. only conv (lfm) and conv x recurrent state (olmo hybrid)? Probably for the future, just a thought
| """ | ||
| mamba_mask = attention_mask | ||
| if (past_key_values is not None and past_key_values.has_previous_state) or ( | ||
| if (past_key_values is not None and past_key_values.has_previous_state()) or ( |
Other point, but maybe we should move this to our mask API - essentially all linear attns will need this and then we can interact with layer types like SWA --> this will also allow vLLM to exchange the linear attn layers
Seeing you already have some attribute maps :D yea that would go hand in hand again then
Agreed, but I'd rather do it later, as this PR already refactors quite a lot of modeling files - easier to do in a second step
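To make the idea discussed above concrete, here is a toy sketch of the kind of shared helper that could live in the mask API. The helper name and signature are made up for illustration and are not part of this PR; only `has_previous_state()` mirrors the diff above.

```python
# Hypothetical sketch only: a shared helper for linear-attention (mamba/conv)
# layers, mirroring the logic in the diff above. Name and signature are made up.
def linear_attention_mask(attention_mask, past_key_values):
    # Once a previous recurrent state exists (decoding), the padding info is
    # already folded into the cached states, so no mask is needed.
    if past_key_values is not None and past_key_values.has_previous_state():
        return None
    # During prefill, keep the 2D padding mask so padded tokens don't pollute
    # the conv / ssm states.
    return attention_mask
```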
        return self.ssm_states


class MambaAndAttentionLayer(MambaLayer, DynamicLayer):
Definitely possible to make a static version as well imo, but no rush, let's get this right first 🫡
Yes, the Static version can basically be a copy/paste of the Dynamic one, but inheriting from StaticLayer. Did not add it yet, as I don't think it's really useful for the time being indeed
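For reference, a minimal sketch of what that static variant could look like. This is not part of the PR: it assumes MambaLayer (added here) and the existing StaticLayer are both importable from transformers.cache_utils, and the class name is hypothetical.

```python
# Hypothetical sketch, not part of this PR. Assumes MambaLayer (added by this
# PR) and StaticLayer (pre-allocated attention cache) both live in cache_utils.
from transformers.cache_utils import MambaLayer, StaticLayer  # assumption

class MambaAndAttentionStaticLayer(MambaLayer, StaticLayer):
    """Fully static hybrid layer: the mamba conv/ssm states are already
    static-shaped, and the attention part uses the pre-allocated StaticLayer
    storage instead of the growing DynamicLayer one."""
```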
ArthurZucker left a comment
IMO also a good time to abstract Layer's Keys?
Would help for say FP8Indexer that can just use set_default / request for cache_keys={"indexer_kv"}
    )
    if ssm_state is not None and cache_params is not None:
-       cache_params.ssm_states[self.layer_idx].copy_(ssm_state)
+       ssm_state = cache_params.update_ssm_state(ssm_state, self.layer_idx)
- ssm_state = cache_params.update_ssm_state(ssm_state, self.layer_idx)
+ ssm_state = cache_params.update("ssm_states", ssm_state, self.layer_idx)
this is what I had in mind TBH, it scales with whatever naming and however many sub caches you have
imagine quantizing this:
ssm_state_scales = cache_params.update("ssm_state_scales", ssm_state_scales, self.layer_idx)
instead of creating a new class
Issue is that they don't all update the same way... So for now, I believe this is the easiest way to proceed, rather than trying a dispatch based on kwarg name (because almost all models pass those as positional args, not kwargs, so we don't have access to the name...).
I do want to explore a more general way with only update everywhere in the future though, but it would be way too many changes (unrelated to mamba caches) for this PR!
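To illustrate the dispatch problem mentioned above, here is a toy sketch of a key-based update. The class name, shapes and update rules are made up and are not the PR's implementation; the point is only that conv states and ssm states need different update semantics behind a single update() entry point.

```python
import torch

# Toy illustration of why a single key-based update() needs per-key rules:
# conv states roll a fixed-size window, ssm states are overwritten in place.
class KeyedMambaLayer:
    def __init__(self, conv_state: torch.Tensor, ssm_state: torch.Tensor):
        # Pre-allocated, static-shaped buffers keyed by name.
        self.states = {"conv_states": conv_state, "ssm_states": ssm_state}

    def update(self, key: str, new_state: torch.Tensor) -> torch.Tensor:
        buf = self.states[key]
        if key == "conv_states":
            # Conv cache: shift the window left and write the newest column;
            # new_state is expected to be (batch, channels).
            buf.copy_(torch.roll(buf, shifts=-1, dims=-1))
            buf[..., -1] = new_state
        else:
            # SSM cache (or e.g. quantization scales): plain overwrite.
            buf.copy_(new_state)
        return buf
```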
vasqu left a comment
LGTM, just some last comments on my side for more details but honestly we could also leave it as-is
if generation_config.cache_implementation != "dynamic_full":
    # linear attention models always need to pass the config, otherwise it will use an Attention cache for the LinearAttention layers
    is_linear_attention = any(
        x in ("mamba", "conv", "linear_attention")
- x in ("mamba", "conv", "linear_attention")
+ x in ("linear_attention_mamba", "conv", "linear_attention_minimax")
Wdyt about this naming convention? I think we will need some BC workarounds / breaking changes, but I think it paves a clear path
Yup, it would probably be very nice in the long run to harmonize all the names for sure - once again something I wanted to follow up on haha. We have way too many different names for the same things rn (from the lack of general coverage of those caches)
if use_precomputed_states:
    previous_states = cache_params.ssm_states[self.layer_idx][:, None, ...].to(device=states.device)
else:
    previous_states = torch.zeros_like(states[:, :1])
I think I opened that on mamba2, but just for clarification where this comes from: Mamba2 can theoretically have an initial recurrent state (and I developed that native torch version so it carried over 😓) - it just never got established as it did not really improve anything perf-wise on tasks. Although, I could imagine this becoming a power feature and maybe necessary for CP support
So I think there was some mistake introduced, because it should only check whether the cache exists and has a prev state - not care for the seq len == 1 case
Yup, cases need harmonization haha - but not related to cache directly!
  # 2. Convolution sequence transformation
- if cache_params is not None and cache_params.has_previous_state:
-     cache_params.update_conv_state(layer_idx=self.layer_idx, new_conv_state=hidden_states_B_C, cache_init=False)
+ is_decoding = cache_params is not None and cache_params.has_previous_state(self.layer_idx)
I swear it's copy-paste mistakes 😠 but yea no worries, not on you, and I don't mind moving this to another PR
[For maintainers] Suggested jobs to run (before merge) run-slow: bamba, falcon_h1, falcon_mamba, granitemoehybrid, jamba, lfm2, lfm2_moe, mamba, mamba2, musicflamingo |
* add Cache and test on Mamba * fix * fix * fix * fix * fix * final fix * test hybrid with jamba * fix tests * fixes * fix * fix * fix * combine both types + zambas * add config mapèping * adjust tests * fix * fix * fix * more models * final mambas * config * finalize almost everything * simplify tests * simplify tests further * fix tests * oupsi * fix * fix broken no_split_modules * fix * fixes * fix * fix * fixes * add layer type * oupsi * fix * style * fix * fixes * final fix * forgot those qwens * tests * offloading * much better static shape native design * oupsi * adjustments in generate * allow cudagraphs * small oupsi * start renaming * revert unrelated what are they doing here * more renaming * revert offloading change * add offloading skips * split shapes for tests * comments and renaming
What does this PR do?
As per the title. This PR finally makes mamba layer caches first-class citizens, and adds native support for them.
It supports the following layer combinations: pure mamba layers, as well as hybrid mamba + attention layers.
For this, it adds the 2 following layer classes:
- MambaLayer: by essence, it has static shapes (i.e. they do not depend on the sequence length), so it was added to both StaticCache and DynamicCache, to blend smoothly with what we already have.
- MambaAndAttentionLayer: only the mamba part is static, while the attention part is a dynamic attention layer. It would however be very easy to add the fully static equivalent if we want to in the future.
Everything integrates smoothly with the existing cache machinery in the case of hybrid attention/mamba archs, i.e. functions such as get_seq_length and get_mask_sizes (used for mask creation notably) will always look at attention layers.
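As a usage sketch (the model id below is a placeholder, not a real checkpoint), the point is that nothing cache-specific should be needed on the user side: generate() is expected to build the appropriate mamba/hybrid cache from the model config.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-hybrid-mamba-model"  # placeholder, not a real checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("The native mamba cache", return_tensors="pt").to(model.device)
# No manual cache construction: the mamba / hybrid cache layers are created internally.
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```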
Compile
Apart from the obvious benefits of having a standardized API that works seamlessly with our Cache construction, the new MambaLayer is fully compatible with compile, including cudagraphs! This means that any mamba model, or alternating mamba/attention model, can now be fully compiled with cudagraphs natively!
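A hedged sketch of what compiled decoding could look like (placeholder model id; requesting cache_implementation="static" here is an assumption on my part, the PR itself only states that the mamba states are static-shaped and cudagraph-compatible):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-mamba-model"  # placeholder, not a real checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# "reduce-overhead" enables cuda graphs for the compiled decode steps.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
# cache_implementation="static" is assumed here for a fully static cache.
out = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```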
BC-breaking
The 🚨 marker here is only because 2 classes (MambaCache and FalconMambaCache) were previously public. They no longer exist, so it's breaking in this way. They should not really have been made directly public imo, and I don't expect much direct usage, so it should be fine!