Conversation
d99b74f to
f733d51
Compare
/ok to test 3e8c042 |
Thank you for your contribution! NVIDIA Megatron-LM is currently transitioning to development on Github. We will aim to review your PR after we complete our transition and stabilize our Github development process. Thank you for your understanding. |
| permuted_probs.unsqueeze(-1), actual_tokens_per_expert | ||
| ) | ||
| else: | ||
| permuted_probs = permuted_probs.unsqueeze(-1) |
Could we also add an assert here, since device grouped gemm does not support it?
| permuted_probs, | ||
| self.config.activation_func_fp8_input_store, | ||
| tokens_per_expert.sum() | ||
| if (isinstance(tokens_per_expert, torch.Tensor) and tokens_per_expert.is_cuda) |
| if (isinstance(tokens_per_expert, torch.Tensor) and tokens_per_expert.is_cuda) | |
| if self.config.moe_use_device_initiated_grouped_gemm |
| """ | ||
|
|
||
| moe_use_device_initiated_grouped_gemm: bool = False | ||
| """Use the cutlass grouped gemm kernel, which allows for the token_per_expert tensor on GPU. This can prevent the GPU-CPU synchronization during the grouped gemm.""" |
cutlass -> device initiated, since there may be other backends
| """Use the cutlass grouped gemm kernel, which allows for the token_per_expert tensor on GPU. This can prevent the GPU-CPU synchronization during the grouped gemm.""" | |
| """Use the device initiated grouped gemm kernel, which allows for the token_per_expert tensor on GPU. This can prevent the GPU-CPU synchronization during the grouped gemm.""" |
megatron/training/arguments.py
Outdated
| group.add_argument('--moe-grouped-gemm', action='store_true', | ||
| help='When there are multiple experts per rank, launch multiple local GEMM kernels in multiple streams to improve the utilization and performance with GroupedLinear in TransformerEngine.') | ||
| group.add_argument('--moe-use-device-initiated-grouped-gemm', action='store_true', | ||
| help='Use the cutlass grouped gemm kernel, which allows for the token_per_expert tensor on GPU. This can prevent the GPU-CPU synchronization during the grouped gemm.') |
| help='Use the cutlass grouped gemm kernel, which allows for the token_per_expert tensor on GPU. This can prevent the GPU-CPU synchronization during the grouped gemm.') | |
| help='Use the device initiated grouped gemm kernel, which allows for the token_per_expert tensor on GPU. This can prevent the GPU-CPU synchronization during the grouped gemm.') |
| n_tensor = torch.ones(1, dtype=torch.int64, device=dev) * n | ||
| # n_tensor = torch.tensor(n, dtype=torch.int64, device=dev) |
| ), "moe_expert_rank_capacity_factor must be set when moe_paged_stash is enabled." | ||
|
|
||
| # Check that no module is both stashed and offloaded | ||
| if self.stash_modules and self.offload_modules: |
When will this condition be true? Shouldn't offload_modules be empty when paged stashing is enabled?
My expectation is that we can have both fine-grained offloading working on the attention part and paged stashing working on the expert part.
| with offload_context: | ||
| bias_act_output = bias_act_func(fc1_output, bias_parallel, permuted_probs) | ||
| if self.offload_moe_act: | ||
| (bias_act_output,) = fine_grained_offloading_group_commit( |
Why has fine_grained_offloading_group_commit been moved?
The group_commit should be placed after self.activation_checkpoint.discard_output_and_register_recompute().
|
|
||
| return routing_map, probs | ||
|
|
||
| @jit_fuser |
Why is this change required?
| dev = torch.cuda.current_device() | ||
| n = 0 if cu_seqlens is None else int(cu_seqlens.numel()) | ||
| n_tensor = torch.tensor(n, dtype=torch.int64, device=dev) | ||
| n_tensor = torch.ones(1, dtype=torch.int64, device=dev) * n |
Will this cause any issue during CG replay? Is it safer to use n_tensor.fill_(n) instead?
| f"{self.paged_tensors_to_reload[pp_schedule_layer]}" | ||
| ) | ||
|
|
||
| def allocate_stash_buffers(self, stash_buffer_size_factor=1.10): |
Curious how stash_buffer_size_factor is going to be determined? Is 1.10 reasonable enough?
63126cc to
d4eee90
Compare
…/restore on the same stream fixed a minor issue in calculating budget
fix one change that broke full-iter CUDA graph
Get rid of legacy names like packed offloading Move the main code body of paged stash to transformer/moe/
Remove unused triton kernel for dropping token in case overflow happens
resolve accidental change in fused_a2a.py
…SIZE_FACTOR is positive. 2. fix int32 overflow in some triton kernels when token count is large 3. fix a problem where restored activation might get deallocated prematurely
f30202f to
a1103bb
Compare
Main contributors (Equal Contribution, sorted alphabetically): Nan Zheng (@nanz-nv), Vasudevan Rengasamy (@vasunvidia)
Other contributors (sorted alphabetically): Dennis Liu(@Victarry), Hongbin Liu(@lhb8125), Qi Zhang(@QiZhangNV), Robin Zhang(@buptzyb), Tong Liu(@Autumn1998), Zijie Yan(@yanring)
Background
In token-dropless MoE training, the number of tokens received by each expert can vary, resulting in dynamically shaped tensors. PyTorch supports dynamically shaped tensors naturally thanks to its eager-mode execution: a tensor is created lazily once its shape is known at run time. While this works well in eager mode, dynamically shaped tensors pose a challenge for CUDA graphs because the size of a tensor cannot be adjusted at run time without the intervention of the host. To remove the sync and enable CUDA graphs, one solution is to oversize the buffers in the expert part. This, however, causes significantly higher memory consumption than the eager-mode baseline, in the form of memory fragmentation.
Idea overview
To address this problem, paged stashing decouples the need for oversized buffers for compute from the need for a properly sized buffer for storing activations for the backward pass. Paged stashing achieves this by adding one level of indirection: stashing and restoring. The stash operation copies the activation from the oversized static buffer to a pre-allocated stashing buffer after the forward pass of that module is done, and the restore operation does the reverse during the backward pass.
The key to saving memory is that the stash operation packs the variable-size activations into a contiguous stashing buffer to reduce memory fragmentation. For simple scheduling, where activation allocation and deallocation follow a first-in-last-out pattern, stash and restore can be done easily in a bump-allocation manner. To accommodate more complicated schedules, e.g. pipeline parallelism, paging can be used, hence the name paged stashing.
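The first-in-last-out case can be illustrated with a minimal host-side sketch. This is not the PR's implementation (which runs as GPU kernels on tensors); the class `BumpStash` and its methods are hypothetical names, and plain Python lists stand in for the static and stashing buffers.

```python
class BumpStash:
    """Hypothetical LIFO stash that packs variable-size activations back to back."""

    def __init__(self, capacity):
        self.buffer = [0.0] * capacity  # stands in for the pre-allocated stashing buffer
        self.capacity = capacity
        self.top = 0                    # bump pointer: first unused slot
        self.sizes = []                 # valid lengths of stashed entries, in stash order

    def stash(self, static_buf, valid_len):
        """Copy only the valid prefix of the oversized static buffer into the stash."""
        if self.top + valid_len > self.capacity:
            raise MemoryError("stash buffer overflow")
        self.buffer[self.top:self.top + valid_len] = static_buf[:valid_len]
        self.top += valid_len
        self.sizes.append(valid_len)

    def restore(self, static_buf):
        """Pop the most recently stashed activation back into a static buffer."""
        valid_len = self.sizes.pop()
        self.top -= valid_len
        static_buf[:valid_len] = self.buffer[self.top:self.top + valid_len]
        return valid_len


# Two layers share static buffers of 6 slots, but only 3 and 2 slots are valid.
stash = BumpStash(capacity=8)
stash.stash([1.0, 2.0, 3.0, 0.0, 0.0, 0.0], valid_len=3)
stash.stash([4.0, 5.0, 0.0, 0.0, 0.0, 0.0], valid_len=2)
used_after_forward = stash.top  # 5 slots used instead of 12

out = [0.0] * 6
n_last = stash.restore(out)     # backward visits the last layer first
last_values = out[:n_last]
n_first = stash.restore(out)
first_values = out[:n_first]
```

In this sketch the stash holds 5 elements for two layers whose static buffers total 12, which is the packing effect described above. The design only works when restores happen in exactly the reverse order of stashes.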
Page management
To accommodate complex scheduling such as that needed in pipeline parallelism, activations are partitioned into pages. Pages are allocated and deallocated for stashing by lightweight GPU memory management kernels that can be fused with the stash/restore GPU kernels. The kernels maintain a freelist, which is implemented as a circular buffer. Each freelist keeps track of one type of page.
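A freelist kept in a circular buffer can be sketched on the host as follows. This is an illustration under assumptions, not the PR's GPU kernel: `PageFreelist`, `pages_needed`, and the page size used here are hypothetical.

```python
class PageFreelist:
    """Hypothetical freelist of page ids stored in a fixed-size circular buffer."""

    def __init__(self, num_pages):
        self.ring = list(range(num_pages))  # initially every page is free
        self.capacity = num_pages
        self.head = 0                       # index of the next free page to hand out
        self.count = num_pages              # number of free pages currently in the ring

    def allocate(self, n):
        """Take n page ids from the head of the ring."""
        if n > self.count:
            raise MemoryError("not enough free pages")
        pages = [self.ring[(self.head + i) % self.capacity] for i in range(n)]
        self.head = (self.head + n) % self.capacity
        self.count -= n
        return pages

    def free(self, pages):
        """Return page ids to the tail of the ring, in any order."""
        tail = (self.head + self.count) % self.capacity
        for i, page in enumerate(pages):
            self.ring[(tail + i) % self.capacity] = page
        self.count += len(pages)


def pages_needed(num_tokens, page_size):
    """Number of pages required to hold num_tokens (ceiling division)."""
    return (num_tokens + page_size - 1) // page_size


freelist = PageFreelist(num_pages=4)
a = freelist.allocate(pages_needed(7, page_size=4))  # 2 pages
b = freelist.allocate(pages_needed(3, page_size=4))  # 1 page
freelist.free(b)   # freed out of allocation order, as a pipeline schedule might do
freelist.free(a)
c = freelist.allocate(4)  # all pages are available again, in recycled order
```

Unlike bump allocation, pages can be released in any order, which is what makes the scheme suitable for schedules that do not follow a first-in-last-out pattern.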
CPU offloading
Paged stashing naturally supports offloading. When the stashing buffer is a pinned CPU tensor, the activation is offloaded to the host memory during forward and is reloaded to the GPU during backward.
Furthermore, one can easily extend the paging management system to accommodate partial offloading or on-demand offloading. This feature is currently WIP.
Scheduling
Overlapping stash and restore operations with compute can be implemented by inserting two autograd functions before and after the expert compute layer: a pre-scheduler and a post-scheduler that schedule stash and restore operations. The roles of these autograd functions are enumerated below:
Additionally, in case of pipeline parallelism, this can be used to record the pipeline schedule during the first iteration.
Wait for restore operation for the current layer to complete. Additionally, in case of pipeline parallelism, this can be used to record the pipeline schedule during the first iteration.