Port DeepSeek Sparse Attention to MambaModel #3553

Draft

janEbert wants to merge 7 commits into NVIDIA:main from
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Port DeepSeek Sparse Attention (DSA) to MambaModel and add corresponding tests.
- New pytest test `test_dsa_gpt_mamba_equivalence.py` builds both a
GPTModel (DSA, 4 layers) and a MambaModel (pattern S-S-S-S-, 8 layers)
in memory, remaps weights GPT→Mamba, and asserts logprob equivalence
across TP=1/PP=1, TP=2/PP=1, and TP=1/PP=2 distributed configs.
- New checkpoint conversion utility
`tools/checkpoint/remap_gpt_dsa_to_mamba.py` applies the same
layer-key remapping (decoder.layers.{N} → {2N}/{2N+1},
decoder.final_layernorm → decoder.final_norm) to DCP checkpoints; a
minimal sketch of this remapping follows the list below.
- New functional test cases for CI: hybrid_dsa_mamba_logitsmatch_tp1_pp1
and _tp2_pp1, each with model_config.yaml (MambaModel inference) and
placeholder golden values.
- New CI recipe
`tests/test_utils/recipes/h100/mamba-dsa-static-inference.yaml` wiring
the two functional test cases to the h100 pipeline.
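For orientation, here is a minimal sketch of the GPT→Mamba layer-key remapping referenced above, assuming each GPT layer N splits into Mamba decoder layer 2N (attention, "S") and 2N+1 (MLP, "-"), as implied by the S-S-S-S- pattern. The function names are illustrative placeholders; the actual logic lives in the test helper and in `tools/checkpoint/remap_gpt_dsa_to_mamba.py`.

```python
import re

# Illustrative sketch only; not the code in this PR. Assumption: GPT layer N
# maps to Mamba layers 2N (attention half) and 2N+1 (MLP half).
_LAYER_KEY = re.compile(r"^decoder\.layers\.(\d+)\.(.+)$")

def remap_key(key: str) -> str:
    """Map a GPT-DSA state-dict key to the corresponding MambaModel key."""
    if key.startswith("decoder.final_layernorm."):
        # MambaModel names the final norm `final_norm`.
        return key.replace("decoder.final_layernorm.", "decoder.final_norm.", 1)
    m = _LAYER_KEY.match(key)
    if m is None:
        return key  # e.g. embedding and output-layer keys pass through unchanged
    layer, rest = int(m.group(1)), m.group(2)
    # MLP-side parameters (including a real pre_mlp_layernorm in MoE layers)
    # land in layer 2N+1; attention-side parameters land in layer 2N.
    target = 2 * layer + 1 if rest.startswith(("mlp.", "pre_mlp_layernorm.")) else 2 * layer
    return f"decoder.layers.{target}.{rest}"

def remap_state_dict(state_dict: dict) -> dict:
    return {remap_key(k): v for k, v in state_dict.items()}
```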
Extend the DSA GPT/Mamba logprob equivalence suite to cover mixed
dense+MoE architectures, mirroring the real DeepSeek-V3 layout where the
first N layers are dense and the remaining layers use MoE.
Key changes:
- Add `pre_mlp_layernorm.*` routing in `_remap_gpt_to_mamba_state_dict`
and `_remap_key` (checkpoint tool): MoE layers expose a real TENorm
for `pre_mlp_layernorm` (not fused), which maps to MoETransformerLayer
2N+1. Dense layers use IdentityOp and produce no keys, so existing
tests are unaffected.
- Add `_make_dsa_moe_config` with `moe_layer_freq=[0,0,1,1]` (first 2
GPT layers dense, last 2 MoE) and proxy MoE params matching the
DeepSeek-V3 style (4 experts, grouped-gemm, allgather dispatcher,
shared experts); a config sketch follows this list.
- Add `_MOE_MAMBA_PATTERN = "S-S-SESE"` and
`TestDSAMoEGPTMambaEquivalence` with the same three parametrized tests
as the dense suite
(tp=1/2 pp=1/2): logprob match, strict weight loading, and
golden-value recording/comparison.
- Add functional test
configs (`hybrid_dsa_moe_mamba_logitsmatch_tp{1,2}_pp1`) with
placeholder golden-value JSONs and corresponding CI recipe entries in
`mamba-dsa-static-inference.yaml`.
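As a rough illustration of the proxy MoE configuration described above, the keyword arguments below follow Megatron-Core's `TransformerConfig` naming; only the values stated in this description come from the PR, everything else (e.g. the shared-expert size) is a placeholder, and `_make_dsa_moe_config` may set additional fields.

```python
# Hypothetical sketch of the proxy MoE test config; placeholder values are
# marked in the comments.
dsa_moe_config_kwargs = dict(
    num_layers=4,                           # 4 GPT layers -> 8 Mamba layers ("S-S-SESE")
    moe_layer_freq=[0, 0, 1, 1],            # first 2 GPT layers dense, last 2 MoE
    num_moe_experts=4,                      # small proxy for the DeepSeek-V3-style MoE
    moe_grouped_gemm=True,                  # grouped-GEMM expert computation
    moe_token_dispatcher_type="allgather",  # allgather dispatcher
    moe_shared_expert_intermediate_size=512,  # shared experts enabled (size is a placeholder)
)
```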
duncanriach (Contributor) reviewed on Feb 24, 2026
Quick review.
I want this to merge after #3377; it will need to be adjusted to accommodate the changes in that PR.
MAMBA = "M"
ATTENTION = "*"
DSA_ATTENTION = "S"
Contributor
Wondering if 'S' should be reserved for sliding-window attention. Wondering if this should be 'D'. Of course, these choices are arbitrary and hopefully ultimately temporary.
if self.config.fp8:
-   assert (2 * self.d_inner + 2 * self.ngroups * self.d_state + self.nheads) % 16 == 0, (
+   fp8_align_size = get_fp8_align_size(self.config.fp8_recipe)
Contributor
What prompts this fix in this PR?
assert torch.all(padding_logits == 0.0), "Logits for padding tokens are not all zero."

class TestMambaBlockwiseFP8:
Contributor
Ugh, yeah. Checkpoint compatibility is an issue.
What does this PR do?

Make experimental DeepSeek Sparse Attention (DSA) available to MambaModel.

Pre-checks

(Core 0.8)

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label
Add the Expert Review label when your PR is ready for review.

(Step 2): Collect the expert reviewers' reviews
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review
Add the Final Review label.

(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.