Paged Stashing by nanz-nv · Pull Request #2690 · NVIDIA/Megatron-LM

nanz-nv · 2025-12-17T09:08:47Z

Main contributors (Equal Contribution, sorted alphabetically): Nan Zheng (@nanz-nv), Vasudevan Rengasamy (@vasunvidia)
Other contributors (sorted alphabetically): Dennis Liu (@Victarry), Hongbin Liu (@lhb8125), Qi Zhang (@QiZhangNV), Robin Zhang (@buptzyb), Tong Liu (@Autumn1998), Zijie Yan (@yanring)

Background

In token-dropless MoE training, the number of tokens received by each expert can vary, resulting in dynamically shaped tensors. PyTorch supports dynamically shaped tensors naturally in eager mode: a tensor is created lazily once its shape is known at run time. This works well in eager mode, but it poses a challenge for CUDA graphs, because the size of a tensor cannot be adjusted at run time without the intervention of the host. To remove the host synchronization and enable CUDA graphs, one solution is to oversize the buffers in the expert part. This, however, causes significantly higher memory consumption than the eager-mode baseline, in the form of memory fragmentation.

Idea overview

To address this problem, paged stashing decouples the need for oversized buffers for compute from the need for a properly sized buffer for storing activations for the backward pass. It achieves this by adding one level of indirection: stashing and restoring. The stash operation copies the activation from the oversized static buffer to a pre-allocated stashing buffer after the forward pass of that module is done, and the restore operation does the reverse during the backward pass.
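
A rough back-of-the-envelope comparison illustrates why the indirection can help. This is only a sketch: the layer count, capacity, and per-layer token counts are made-up numbers, the helper names are hypothetical and not part of this PR, and the number of max-capacity compute buffers that are live at once is an assumption that depends on the actual schedule.

```python
def oversized_tokens(num_layers: int, capacity: int) -> int:
    """Tokens held if every layer keeps its max-capacity buffer until backward."""
    return num_layers * capacity


def stashed_tokens(actual_tokens: list[int], capacity: int,
                   live_compute_buffers: int = 1) -> int:
    """Tokens held with stashing: one packed copy per layer, plus the
    max-capacity compute buffers that are live at the same time
    (assumed to be 1 here; the real number depends on scheduling)."""
    return sum(actual_tokens) + live_compute_buffers * capacity


# Hypothetical workload: 4 MoE layers, capacity of 4096 tokens per rank,
# while each layer actually receives far fewer tokens.
actual = [1000, 1500, 900, 1200]
capacity = 4096

baseline = oversized_tokens(len(actual), capacity)  # 4 * 4096 = 16384
packed = stashed_tokens(actual, capacity)           # 4600 + 4096 = 8696
print(baseline, packed)
```

In this toy setting the stashed variant holds roughly half as many tokens; the actual saving depends on how far the real token counts fall below the capacity.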

The key to saving memory is that the stash operation packs the variable-size activations into a contiguous stashing buffer, which reduces memory fragmentation. For simple schedules where activation allocation and deallocation follow a first-in-last-out pattern, stash and restore can be done easily in a bump-allocation manner. To accommodate more complicated schedules, e.g. pipeline parallelism, paging can be used, hence the name paged stashing.
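
The bump-allocation case can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the PR's implementation: a Python list stands in for the pre-allocated stashing tensor, and the class name `BumpStash` is hypothetical.

```python
class BumpStash:
    """LIFO stash buffer: stash appends at the top, restore pops from the top."""

    def __init__(self, capacity: int) -> None:
        self.buffer = [0.0] * capacity  # stands in for a pre-allocated tensor
        self.top = 0                    # bump pointer: index of the first free slot
        self.sizes: list[int] = []      # length of each stashed activation

    def stash(self, activation: list[float]) -> None:
        n = len(activation)
        if self.top + n > len(self.buffer):
            raise MemoryError("stash buffer overflow")
        # Pack the variable-size activation right after the previous one.
        self.buffer[self.top:self.top + n] = activation
        self.top += n
        self.sizes.append(n)

    def restore(self) -> list[float]:
        # First-in-last-out: the most recently stashed activation comes back first.
        n = self.sizes.pop()
        self.top -= n
        return self.buffer[self.top:self.top + n]


stash = BumpStash(capacity=8)
stash.stash([1.0, 2.0, 3.0])  # forward of layer 0
stash.stash([4.0, 5.0])       # forward of layer 1
last = stash.restore()        # backward of layer 1
first = stash.restore()       # backward of layer 0
```

Because restore only ever pops the most recent entry, this scheme breaks down as soon as activations are released out of order, which is the case that paging addresses.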

Page management

To accommodate complex scheduling such as that needed in pipeline parallelism, activations are partitioned into pages, and lightweight GPU memory-management kernels are in charge of allocating and deallocating pages for stashing. These kernels can be fused with the stash/restore GPU kernels. The page manager maintains freelists, each implemented as a circular buffer; each freelist keeps track of one type of page.
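
A circular-buffer freelist of this kind can be sketched on the CPU as follows. This is a minimal illustration, assuming a single page type and host-side bookkeeping; in the PR the bookkeeping runs inside GPU kernels, and the names `PageFreeList` and `pages_needed` are hypothetical.

```python
class PageFreeList:
    """Free page ids kept in a fixed-size circular buffer."""

    def __init__(self, num_pages: int) -> None:
        self.ring = list(range(num_pages))  # initially every page is free
        self.head = 0                       # ring slot of the next free page
        self.count = num_pages              # number of free pages in the ring

    def allocate(self, n: int) -> list[int]:
        if n > self.count:
            raise MemoryError("not enough free pages")
        size = len(self.ring)
        pages = [self.ring[(self.head + i) % size] for i in range(n)]
        self.head = (self.head + n) % size
        self.count -= n
        return pages

    def free(self, pages: list[int]) -> None:
        size = len(self.ring)
        if self.count + len(pages) > size:
            raise ValueError("freeing more pages than were allocated")
        tail = (self.head + self.count) % size  # first unused ring slot
        for i, page in enumerate(pages):
            self.ring[(tail + i) % size] = page
        self.count += len(pages)


def pages_needed(num_tokens: int, page_size: int) -> int:
    """Number of pages covering num_tokens tokens (ceiling division)."""
    return -(-num_tokens // page_size)


free_list = PageFreeList(num_pages=4)
a = free_list.allocate(pages_needed(5, page_size=2))  # 3 pages
b = free_list.allocate(1)
free_list.free(a)            # released before b: not first-in-last-out
c = free_list.allocate(2)    # reuses pages that a gave back
```

Unlike the bump allocator, pages can be returned in any order, so an activation does not need to occupy a contiguous region of the stashing buffer.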

CPU offloading

Paged stashing naturally supports offloading. When the stashing buffer is a pinned CPU tensor, the activation is offloaded to the host memory during forward and is reloaded to the GPU during backward.
Furthermore, one can easily extend the paging management system to accommodate partial offloading or on-demand offloading. This feature is currently WIP.

Scheduling

Stash and restore operations can be overlapped with compute by inserting two autograd functions around the expert compute layer: a pre-scheduler before it and a post-scheduler after it, which schedule the stash and restore operations. The roles of these autograd functions are enumerated below:

Pre-scheduler forward: Wait for the previous stash op to complete and free the max-capacity-sized temporary activations of the completed stash op. The wait is performed here rather than in the post-scheduler forward to reduce peak memory usage, since the following expert compute layer will allocate another set of max-capacity-sized temporary activations.
Post-scheduler forward: Since this runs after expert compute, the stash operations for the current layer's activations are scheduled here. If the next layer to execute is a backward-pass layer, schedule the restore operations for that layer. Additionally, with pipeline parallelism, this can be used to record the pipeline schedule during the first iteration.
Post-scheduler backward: Wait for the previous stash op to complete and free the max-capacity-sized temporary activations of the completed stash op. The wait is performed here rather than in the pre-scheduler backward to reduce peak memory usage, since the following expert compute BPROP layer will allocate another set of max-capacity-sized temporary activations. Then wait for the restore operation of the current layer to complete. Additionally, with pipeline parallelism, this can be used to record the pipeline schedule during the first iteration.
Pre-scheduler backward: If the next layer to execute is a backward-pass layer, schedule the restore operations for that layer.
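
The order in which these four hooks fire can be traced with a small pure-Python simulation. This is a sketch under strong assumptions: a single microbatch, no pipeline parallelism, and all layers running forward before any runs backward. The function name `simulate_schedule` and the event labels are hypothetical and only mirror the roles listed above.

```python
def simulate_schedule(num_layers: int) -> list[str]:
    """Return the sequence of scheduler events for one forward/backward pass."""
    trace: list[str] = []
    pending_stash = None  # layer whose stash op has been scheduled but not waited on

    for layer in range(num_layers):
        # Pre-scheduler forward: wait for the previous stash, freeing its
        # max-capacity temporaries before this layer allocates new ones.
        if pending_stash is not None:
            trace.append(f"wait_stash({pending_stash})")
            pending_stash = None
        trace.append(f"expert_fwd({layer})")
        # Post-scheduler forward: schedule the stash for this layer.
        trace.append(f"schedule_stash({layer})")
        pending_stash = layer
        # If backward runs next (last forward layer), schedule its restore.
        if layer == num_layers - 1:
            trace.append(f"schedule_restore({layer})")

    for layer in reversed(range(num_layers)):
        # Post-scheduler backward runs first in the backward direction.
        if pending_stash is not None:
            trace.append(f"wait_stash({pending_stash})")
            pending_stash = None
        trace.append(f"wait_restore({layer})")
        trace.append(f"expert_bwd({layer})")
        # Pre-scheduler backward: schedule the restore for the next backward layer.
        if layer > 0:
            trace.append(f"schedule_restore({layer - 1})")

    return trace


for event in simulate_schedule(2):
    print(event)
```

In the trace, each restore is scheduled one step before the backward pass that consumes it, which is what gives the copy a chance to overlap with compute.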

copy-pr-bot · 2025-12-17T09:08:50Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Victarry · 2025-12-19T01:23:22Z

/ok to test 3e8c042

github-actions · 2025-12-19T01:23:40Z

Thank you for your contribution!

NVIDIA Megatron-LM is currently transitioning to development on Github. We will aim to review your PR after we complete our transition and stabilize our Github development process.

Thank you for your understanding.

QiZhangNV · 2025-12-19T03:04:30Z

megatron/core/transformer/moe/experts.py

+                    permuted_probs.unsqueeze(-1), actual_tokens_per_expert
+                )
        else:
            permuted_probs = permuted_probs.unsqueeze(-1)


Could we also add an assert here, since device grouped gemm does not support it?

QiZhangNV · 2025-12-19T03:10:16Z

megatron/core/transformer/moe/experts.py

                        permuted_probs,
                        self.config.activation_func_fp8_input_store,
+                        tokens_per_expert.sum()
+                        if (isinstance(tokens_per_expert, torch.Tensor) and tokens_per_expert.is_cuda)


Suggested change

- if (isinstance(tokens_per_expert, torch.Tensor) and tokens_per_expert.is_cuda)
+ if self.config.moe_use_device_initiated_grouped_gemm

QiZhangNV · 2025-12-19T03:15:56Z

megatron/core/transformer/transformer_config.py

    """

+    moe_use_device_initiated_grouped_gemm: bool = False
+    """Use the cutlass grouped gemm kernel, which allows for the token_per_expert tensor on GPU. This can prevent the GPU-CPU synchronization during the grouped gemm."""


cutlass -> device initiated, since there may be other backends

Suggested change

- """Use the cutlass grouped gemm kernel, which allows for the token_per_expert tensor on GPU. This can prevent the GPU-CPU synchronization during the grouped gemm."""
+ """Use the device initiated grouped gemm kernel, which allows for the token_per_expert tensor on GPU. This can prevent the GPU-CPU synchronization during the grouped gemm."""

QiZhangNV · 2025-12-19T03:18:00Z

megatron/training/arguments.py

    group.add_argument('--moe-grouped-gemm', action='store_true',
                       help='When there are multiple experts per rank, launch multiple local GEMM kernels in multiple streams to improve the utilization and performance with GroupedLinear in TransformerEngine.')
+    group.add_argument('--moe-use-device-initiated-grouped-gemm', action='store_true',
+                       help='Use the cutlass grouped gemm kernel, which allows for the token_per_expert tensor on GPU. This can prevent the GPU-CPU synchronization during the grouped gemm.')


Suggested change

- help='Use the cutlass grouped gemm kernel, which allows for the token_per_expert tensor on GPU. This can prevent the GPU-CPU synchronization during the grouped gemm.')
+ help='Use the device initiated grouped gemm kernel, which allows for the token_per_expert tensor on GPU. This can prevent the GPU-CPU synchronization during the grouped gemm.')

QiZhangNV · 2025-12-19T03:20:06Z

megatron/training/utils.py

+            n_tensor = torch.ones(1, dtype=torch.int64, device=dev) * n
+            # n_tensor = torch.tensor(n, dtype=torch.int64, device=dev)


vasunvidia · 2026-01-07T18:26:47Z

megatron/core/transformer/transformer_config.py

+            ), "moe_expert_rank_capacity_factor must be set when moe_paged_stash is enabled."
+
+        # Check that no module is both stashed and offloaded
+        if self.stash_modules and self.offload_modules:


When will this condition be true? Shouldn't offload_modules be empty when paged stashing is enabled?

My expectation is that we can have both fine-grained offloading working on attention part and paged stashing working on the expert part.

vasunvidia · 2026-01-07T22:39:14Z

megatron/core/transformer/moe/experts.py

+            with offload_context:
                bias_act_output = bias_act_func(fc1_output, bias_parallel, permuted_probs)
+        if self.offload_moe_act:
+            (bias_act_output,) = fine_grained_offloading_group_commit(


Why has fine_grained_offloading_group_commit been moved?

The group_commit should be placed after self.activation_checkpoint.discard_output_and_register_recompute().

vasunvidia · 2026-01-07T23:02:13Z

megatron/core/transformer/moe/token_dispatcher.py

+
        return routing_map, probs

-    @jit_fuser


Why is this change required?

vasunvidia · 2026-01-07T23:08:30Z

megatron/training/utils.py

            dev = torch.cuda.current_device()
            n = 0 if cu_seqlens is None else int(cu_seqlens.numel())
-            n_tensor = torch.tensor(n, dtype=torch.int64, device=dev)
+            n_tensor = torch.ones(1, dtype=torch.int64, device=dev) * n


Will this cause any issue during CG replay? Is it safer to use n_tensor.fill_(n) instead?

jianyuh · 2026-01-19T23:27:53Z

megatron/core/transformer/moe/paged_stash.py

+                f"{self.paged_tensors_to_reload[pp_schedule_layer]}"
+            )
+
+    def allocate_stash_buffers(self, stash_buffer_size_factor=1.10):


Curious how stash_buffer_size_factor is going to be determined. Is 1.10 reasonable enough?

github-actions bot added the community-request label Dec 17, 2025

yanring assigned nanz-nv Dec 18, 2025

nanz-nv force-pushed the paged_offloading branch from d99b74f to f733d51 Compare December 18, 2025 08:45

Victarry self-requested a review December 19, 2025 00:10

copy-pr-bot bot temporarily deployed to nemo-ci December 19, 2025 01:23 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci December 19, 2025 01:23 Failure

ko3n1g added this to the Core 0.16 milestone Dec 19, 2025

Victarry requested review from Autumn1998, QiZhangNV and lhb8125 December 19, 2025 01:47

QiZhangNV reviewed Dec 19, 2025

vasunvidia reviewed Jan 7, 2026

Victarry mentioned this pull request Jan 16, 2026

[ROADMAP][Updated on Jan 26] Megatron Core MoE Roadmap #1729

Open

44 tasks

jianyuh reviewed Jan 19, 2026

vasunvidia force-pushed the paged_offloading branch from 63126cc to d4eee90 Compare February 9, 2026 22:27

QiZhangNV and others added 7 commits February 19, 2026 15:56

Add --moe-use-device-initiated-grouped-gemm to allow token_per_expert tensor on GPU

29c453a

Initial change for packed offloading

747b7a8

Bug fix

2679735

Mem Opt

7005df5

Handle MXFP8Tensor offload

bf7b7e6

Enable Packed offloading to CPU pinned memory with PACKED_OFFLOAD_CPU=1

163f839

Enable activation truncation for first step

dca1595

nanz-nv and others added 29 commits February 19, 2026 15:57

Add support for paged stashing

87e63e6

Add the feature of speculative CE stashing

c52f74b

Fix PP schedule

6cddc1f

Use common buffer across VP for paged stashing

de61a22

Disable Packed Offloading for validation

c98d946

Fixe perf issue in packed stash/pop kernels

ac645ee

Minor fix for tensor allocation and padding requirement on budget

f19890d

Packed/paged offloading is current not stream-safe. Need to put stash/restore on the same stream fixed a minor issue in calcualting budget

8536b50

add new hybrid ep

031295c

Remove the overflow check in framework because it is now done by hybridEP

c814af4

Fix one merge conflict

e2e1e30

fix one change that broke full-iter CUDA graph

Code cleanup

3a1195b

Add second autograd to avoid triple buffering

5097957

Avoid unnecessary wait_stream for reload in case of 1f1b

d4ebd17

Check in dynamic-shape-aware SwiGLU triton kernel

b0be51f

Major cleanup and refactor

74e96ae

Get rid of legacy names like packed offloading Move the main code body of paged stash to transformer/moe/

Check in paged_stash.py that was omited in the previous commit

5f8bd1e

Remove d2d page feature for now

632d2c2

Remove unused triton kernel for dropping token in case overflow happens

Update added arguments and add compatibility check

51bf1ec

refine overflow check

42860bc

resolve accidental change in fused_a2a.py

Fixing lint issues

0df30d6

Minor refactor

f8fcd4e

Add unit test for Paged Stashing

d62fa3c

Initial check in of a) force load imbalance b) log overload factors

2e4db72

make overload factor logging work for cuda graph

e08e0b9

1. allocate stashing buffer based on avg token count if STASH_BUFFER_SIZE_FACTOR is positive. 2. fix int32 overflow in some triton kernels when token count is large 3. fix a problem where restored activation might get deallocate prematurely

a53a36e

Reenable overlapping of stashing kernels

6eb84a0

Remove a buggy/redundant reset

9424b23

Cleanup moe-expert-rank-capacity-factor argument.

a1103bb

vasunvidia force-pushed the paged_offloading branch from f30202f to a1103bb Compare February 20, 2026 00:29
