
Add MTP support for hybrid models#2363

Merged
deepakn94 merged 55 commits into NVIDIA:main from rkarimimahab:rkarimimahab/mtp
Feb 1, 2026

Conversation

@rkarimimahab
Contributor

@rkarimimahab rkarimimahab commented Nov 23, 2025

What does this PR do ?

(1) Support using hybrid Mamba models as the mtp_model_layer.
(2) Split the MTP loss calculation out of the GPT model's forward pass into a separate function (see the sketch after this list).
(3) Support MTP layer repetition.
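
For context, a minimal sketch of what feature (2) looks like in spirit. The names below (compute_mtp_loss, mtp_layers, loss_fn, and the per-layer return contract) are hypothetical illustrations, not the actual Megatron-LM API introduced by this PR:

```python
# Illustrative sketch only: compute_mtp_loss, mtp_layers, and loss_fn are
# hypothetical names, not the actual Megatron-LM API from this PR.
import torch


def compute_mtp_loss(hidden_states, mtp_layers, loss_fn, labels):
    """Accumulate the losses of all MTP layers outside the backbone forward pass.

    Keeping this in its own function lets both Transformer and hybrid Mamba
    MTP layers share one loss path, and makes layer repetition a simple loop.
    """
    total_loss = torch.zeros((), device=hidden_states.device)
    for depth, layer in enumerate(mtp_layers):
        # Hypothetical contract: each MTP layer returns the next hidden states
        # and the logits for the (depth + 1)-token-ahead prediction.
        hidden_states, logits = layer(hidden_states)
        total_loss = total_loss + loss_fn(logits, labels, depth)
    return total_loss
```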

⚠️ For major changes (either in lines of code or in impact), please make sure to first share and discuss a design doc with the team.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch: the proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@rkarimimahab rkarimimahab requested review from a team as code owners November 23, 2025 11:30
@copy-pr-bot

copy-pr-bot bot commented Nov 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shifangx
Contributor

shifangx commented Dec 8, 2025

This PR implements three features:
(1) Support for using a Mamba layer as the mtp_model_layer.
(2) Splitting the MTP loss calculation in the GPT model's forward pass into a separate function.
(3) Support for MTP layer repetition.
I have no concerns about the implementation of the first two features.
As for the last feature, we need to consider how to support scenarios where multiple MTP layers are placed on different PP ranks for computation. These MTP layers share the same parameter values but are placed on different VPP stages, so that no single VPP stage carries too much of the computation.
For now, maybe we can use an assert to prevent users from placing MTP layers on different VPP stages, and support this feature in the future.
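
A rough illustration of the suggested guard. The config field names used here (mtp_num_layers, virtual_pipeline_model_parallel_size) are assumptions for illustration; the actual fields and check added by the PR may differ:

```python
# Sketch of the suggested guard; the config field names below are assumptions,
# not the exact fields touched by this PR.
def validate_mtp_layer_placement(config):
    """Reject configs that would spread repeated MTP layers across VPP stages."""
    mtp_num_layers = getattr(config, "mtp_num_layers", None) or 0
    vpp_size = getattr(config, "virtual_pipeline_model_parallel_size", None) or 1
    if mtp_num_layers > 1:
        assert vpp_size == 1, (
            "Repeated MTP layers must currently sit on a single virtual pipeline "
            "stage; placement across different VPP stages is not yet supported."
        )
```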

@deepakn94
Contributor

For now, maybe we can use an assert to prevent users from placing MTP layers on different VPP stages, and support this feature in the future.

This is a good point. Let's go with the assertion for now re: 3.

@rkarimimahab
Contributor Author

rkarimimahab commented Dec 9, 2025

This is a good point. Let's go with the assertion for now re: 3.

I added the assert, thanks!

@rkarimimahab rkarimimahab requested review from a team as code owners December 18, 2025 17:57
@deepakn94 deepakn94 self-requested a review December 21, 2025 05:50
@sancha
Contributor

sancha commented Feb 1, 2026

/ok to test 9cc1668

@deepakn94 deepakn94 added this pull request to the merge queue Feb 1, 2026
Merged via the queue into NVIDIA:main with commit 300d1b6 Feb 1, 2026
45 checks passed
ko3n1g added a commit that referenced this pull request Feb 2, 2026
ko3n1g added a commit that referenced this pull request Feb 2, 2026
sancha added a commit to sancha/Megatron-LM that referenced this pull request Feb 2, 2026
arendu pushed a commit to arendu/Megatron-LM that referenced this pull request Feb 5, 2026
@deepakn94
Contributor

For posterity, this PR was re-merged as #3207 with some bugfixes.

arendu added a commit to arendu/Megatron-LM that referenced this pull request Feb 18, 2026
Signed-off-by: adithyare <adithyare@nvidia.com>
arendu added a commit to arendu/Megatron-LM that referenced this pull request Feb 21, 2026
Signed-off-by: adithyare <adithyare@nvidia.com>
yfw pushed a commit to yaoyu-33/Megatron-LM that referenced this pull request Feb 23, 2026
daiyaanarfeen pushed a commit to daiyaanarfeen/Megatron-LM that referenced this pull request Feb 23, 2026
Co-authored-by: Rabeeh Mahabadi <rkarimimahab@nb-hel-cs-001-vscode-02.cm.cluster>
Co-authored-by: Sanjeev Satheesh <sasatheesh@nvidia.com>
Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
daiyaanarfeen pushed a commit to daiyaanarfeen/Megatron-LM that referenced this pull request Feb 23, 2026
daiyaanarfeen pushed a commit to daiyaanarfeen/Megatron-LM that referenced this pull request Feb 23, 2026
