Conversation
Force-pushed from f33edcd to 48e91d2.
Is there any difference between this and #2054?

This is the second MR; we need to merge #2054 first, and then this one (#2000). (The reason the second MR is numbered 2000 while the first is 2054 (>2000) is that they were migrated from GitLab at different times.)

Got it, thanks! Could you please update the title to reflect this?
Force-pushed from 983e5f3 to 11d9960.

/ok to test e0c90c5
Force-pushed from e0c90c5 to 501a5f6.

/ok to test d12ccf1

/ok to test 86581cd
```python
# during pipeline parallelism, it should not be set if sequence length
# is constant during training.
args.variable_seq_lengths = False
if args.sequence_packing:
```
Please move these validations into transformer_config.
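A minimal sketch of that suggestion, assuming hypothetical `sequence_packing` and `variable_seq_lengths` fields on the config (the real `TransformerConfig` layout may differ):

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    # Assumed fields mirroring the args above; names are illustrative.
    sequence_packing: bool = False
    variable_seq_lengths: bool = False

    def __post_init__(self):
        # During pipeline parallelism, variable_seq_lengths should not be
        # set if the sequence length is constant during training.
        if not self.sequence_packing:
            self.variable_seq_lengths = False
```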
```python
max_seqlen = torch.empty(
    1,
    dtype=torch.int32,
    device=torch.cuda.current_device(),
)
```
Could you please clarify why these were removed?
To support PP it would become more complex, so the thd-related logic was moved to a new separate function, get_batch_on_this_rank_for_sequence_packing.
The thd logic was added in the part-1 PR; this just reverts it to how it was before.
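For context, a hedged sketch of the kind of bookkeeping involved, assuming the usual `cu_seqlens` convention (cumulative sequence lengths) for thd-format batches; `compute_max_seqlen` is an illustrative name, not the actual helper:

```python
import torch

def compute_max_seqlen(cu_seqlens: torch.Tensor) -> torch.Tensor:
    # cu_seqlens holds cumulative lengths, e.g. [0, 3, 8, 12], so
    # adjacent differences recover the packed sequence lengths.
    seqlens = cu_seqlens[1:] - cu_seqlens[:-1]
    max_seqlen = torch.empty(
        1, dtype=torch.int32, device=torch.cuda.current_device()
    )
    max_seqlen[0] = seqlens.max()
    return max_seqlen
```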
```python
# Copyright (c) 2025 NVIDIA CORPORATION. All rights reserved.

from typing import Any, List, Optional
import enum
```
Could we put this big change in a separate file?
This data_schedule.py should be the separate file you want; it was included in the part-1 PR.
@asolergi-nv FYI
Signed-off-by: xiaoyao0115 <1804647152@qq.com>
Force-pushed from 3f9564f to ffe8f94.
```python
        {config.sequence_packing_scheduler}"
    )
    scheduler_type = scheduler_type_map[config.sequence_packing_scheduler]
    return wrap_dataloader(data_iterator, config, scheduler_type, pg_collection=None)
```
We should pass the pg_collection through instead of hardcoding it to None.
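A one-line sketch of the suggested fix, assuming the enclosing function receives a `pg_collection` argument it can forward:

```python
# Forward the caller's pg_collection instead of hardcoding None.
return wrap_dataloader(data_iterator, config, scheduler_type,
                       pg_collection=pg_collection)
```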
Signed-off-by: xiaoyao0115 <1804647152@qq.com>
Force-pushed from 61ba6e8 to 8797dbc.
This PR is the second part of hybrid-cp. The first part is: #2054
(PR for main branch: #2304)
Compared to part 1, this PR adds the following:

- `max_seqlen` is set to 12288, and `max_seqlen_per_dp_cp_rank` is set to 3072. In the figure below, 'bshd' refers to running with CP=4, where sequences are padded to `max_seqlen` and executed in the same bshd format as in pretraining; 'thd-packing' refers to using CP=4 while packing variable-length sequences; in 'hybrid-cp', the maximum CP group size is also 4.
- A `dataiterator_wrapper` to minimize code changes. Adding a new scheduling algorithm now only requires adding a new scheduler class, which keeps the logic clear and easier to maintain.

There are many improvements that we want to make in future releases.
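To illustrate that extension point, here is a hedged sketch of what a new scheduler class might look like; the base-class contract and the names `SequencePackingScheduler`, `FirstFitScheduler`, and `schedule` are assumptions for illustration, not the actual interface in data_schedule.py:

```python
from typing import List

class SequencePackingScheduler:
    """Assumed base contract (illustrative): decide how variable-length
    sequences are packed into per-rank bins of bounded total length."""

    def schedule(self, seqlens: List[int], capacity: int) -> List[List[int]]:
        raise NotImplementedError

class FirstFitScheduler(SequencePackingScheduler):
    """Greedy first-fit packing: each sequence goes into the first bin
    with enough remaining capacity; otherwise a new bin is opened."""

    def schedule(self, seqlens: List[int], capacity: int) -> List[List[int]]:
        bins: List[List[int]] = []
        free: List[int] = []  # remaining token capacity per bin
        for s in seqlens:
            for i, cap in enumerate(free):
                if s <= cap:
                    bins[i].append(s)
                    free[i] -= s
                    break
            else:
                bins.append([s])
                free.append(capacity - s)
        return bins
```

A class like this would then only need an entry in `scheduler_type_map` to become selectable through `config.sequence_packing_scheduler`.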
Contribution process
```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks
I want this PR in a versioned release and have added the appropriate Milestone (e.g. Core 0.8).

Code review
The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch
(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Add the `Expert Review` label when your PR is ready for review. `Final Review` might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Add the `Final Review` label.

(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.

For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion. MRs are mergeable after one approval by either `eharper@nvidia.com` or `zijiey@nvidia.com`.

Merging your PR
Any member of `core-adlr` and `core-nemo` will be able to merge your PR.