Reorganize the building of active-masked tensors in context #2929
tdene wants to merge 6 commits into NVIDIA:main
Conversation
Force-pushed from f938a05 to 0c2c1c5
Force-pushed from 0c2c1c5 to 63c19dc
Phlip79 left a comment
Please add tests for the new logic.
Force-pushed from 63c19dc to 682cfca
```python
self.token_to_position_in_request = torch.empty_like(self.token_to_input_ids)
self.token_to_local_position_within_kv_block = torch.empty_like(self.token_to_input_ids)

# Static tensor addresses of active slices to enable fast inference kernels.
```
What's your timeline for merging this? This might be a good argument for re-ordering the context tensors and even simplifying what we already have.
I really don't like the idea of continually adding duplicated tensors, including the ones we already have, if all we need to do is bite the bullet and redesign things.
At a high level, I see this PR as the base that enables several bite-sized optimization PRs and benchmarks them to see how much improvement each optimization leads to, in order to prove their value in a clean apples-to-apples comparison.
One of these bite-sized optimization PRs is the one that reorders the context tensors.
That is, we could reorder the context tensor today, but we would not have a clear idea of how much throughput/latency we gain by doing so. Going through the work of building things on top of this PR would clearly prove how much of a practical win we can get from reordering the context tensor.
But your point is valid. I'm saying that we should do things in order so that we have proper ablation studies. We do not have to actually merge things in that order though; I can do the ablation study, come back with results, and then say "Okay, let's just skip #2929 and go straight to re-ordering the context tensors."
so will your follow-up PRs be relatively independent from this duplicated-tensor style? As in, if we re-order active & paused requests in the near future, will that require changing all of your follow-up PRs significantly?
```python
def build_active_slices(self, batch_size: int):
    """Build the active slices of specific tensors. This is run on every forward step.

    If the context is reordered to active -> paused -> finished, this can be graphed.
```
this method can be graphed? Or we wouldn't need this method at all?
exactly == this method can be graphed?
-or-
exactly == we won't need this method at all?
```python
)

# The following tensor slices are used in various kernels.
self.active_request_ids[:batch_size].copy_(self.request_ids[padded_slice])
```
if we re-order active and paused, can all these copies be avoided?
Everything listed in this PR, yes.
But all PRs that build on top of this one end up needing to add things to pad_active_slices, and that method is used to pad tensors on the interval active_request_count : padded_active_request_count. The need to pad means that we do need copies of the tensors, so that we don't overwrite real data.
We see that today in the attention metadata tensors, which currently do a copy up to padded_active_request_count followed by a pad on the interval active_request_count : padded_active_request_count. Exactly the same pattern is needed for graphed logprobs, graphed sampling, and graphing time-critical parts of _dynamic_step_context_bookkeeping.
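To make the copy-then-pad pattern above concrete, here is a minimal sketch. The tensor and variable names (`src`, `dst`, `active_request_count`, `padded_active_request_count`) are illustrative, not the actual attention-metadata fields:

```python
import torch

# Illustrative sizes: 3 real active requests, padded up to a
# CUDA-graph-friendly batch of 4 (hypothetical numbers).
active_request_count = 3
padded_active_request_count = 4

src = torch.arange(8)  # stand-in for a full context tensor
dst = torch.empty(padded_active_request_count, dtype=src.dtype)

# Copy up to padded_active_request_count into a separate tensor, so the
# padding step below never overwrites real data in `src`.
dst.copy_(src[:padded_active_request_count])
# Pad only the interval active_request_count : padded_active_request_count.
dst[active_request_count:padded_active_request_count].fill_(0)
```

Because `dst` is a separate, statically sized buffer, both the copy and the pad have fixed shapes and addresses, which is what makes the pattern graphable.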
So, in a nutshell, here's what's happening:
- This method, as it is in this PR, copies several tensors that are used in CUDA graphs in follow-up PRs. All of these copies go away if we reorder the context tensors.
- There are also several tensors that are copied & padded in follow-up PRs. If we reorder the context tensors, those copies do not go away, but they become graphable.
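A small sketch of why reordering removes the first class of copies: once active requests form a contiguous prefix, a plain slice is a zero-copy view, whereas a gather over scattered rows must allocate. All names here are illustrative:

```python
import torch

# Hypothetical context tensor, already ordered active -> paused -> finished.
request_ids = torch.tensor([10, 11, 12, 20, 21, 30])
active_count = 3

# Without ordering guarantees, selecting the active rows is a gather,
# which materializes a new tensor (a real copy).
active_idx = torch.arange(active_count)
copied = request_ids.index_select(0, active_idx)

# With the active -> paused -> finished ordering, a prefix slice suffices;
# it is a view into the same storage, so no copy happens at all.
active_view = request_ids[:active_count]
```

The view shares `request_ids`' storage (same `data_ptr()`), while `index_select` does not, which is exactly the copy this PR's follow-ups would eliminate.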
Orthogonal to your PRs and possibly less practical, but I wonder if we can merge the regular and padded tensors and just have a single tensor. We care mostly about the CUDA-graph case (or at least that's our gold standard for performance), so it seems like we could just use the padded tensors in the non-CUDA-graph case as well, for simplicity.
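A rough sketch of that single-tensor idea, assuming the non-graph path can tolerate reading a narrower slice of the padded buffer (names and sizes are hypothetical):

```python
import torch

PAD_VALUE = 0
max_padded = 8  # hypothetical maximum padded batch size

# One statically allocated buffer serves both execution paths.
single_buf = torch.full((max_padded,), PAD_VALUE, dtype=torch.long)

def fill_active(ids: torch.Tensor) -> torch.Tensor:
    """Write active ids into the shared padded buffer and re-pad the tail."""
    n = ids.numel()
    single_buf[:n].copy_(ids)
    single_buf[n:].fill_(PAD_VALUE)
    return single_buf

ids = torch.tensor([4, 5, 6])
padded = fill_active(ids)            # CUDA-graph path consumes the full buffer
eager_view = padded[: ids.numel()]   # eager path just slices the same storage
```

The eager path pays nothing extra since the slice is a view, so maintaining a second unpadded tensor would only be needed if some consumer cannot handle trailing pad values.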
What does this PR do?
This enables future optimizations.