feat: add hybridep #1333
Conversation

/ok to test 8d0be12

/ok to test 6233747

/ok to test 0a8d5ee

/claude review
```dockerfile
    TORCH_CUDA_ARCH_LIST="9.0 10.0 12.0" MAX_JOBS=8 pip install --no-build-isolation . && \
    apt-get purge -y libnvidia-ml-dev && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*
```
The old code cleaned up the DeepEP/ source dir and /opt/deepep.patch after install. The new code doesn't, and also leaves rdma-core/ behind. This can add non-trivial bloat to the Docker image.
Suggested change:

```diff
     TORCH_CUDA_ARCH_LIST="9.0 10.0 12.0" MAX_JOBS=8 pip install --no-build-isolation . && \
     apt-get purge -y libnvidia-ml-dev && \
     apt-get autoremove -y && \
-    rm -rf /var/lib/apt/lists/*
+    rm -rf /var/lib/apt/lists/* && \
+    rm -rf /opt/deepep.patch && \
+    cd / && rm -rf DeepEP rdma-core
```
```python
    ) -> torch.Tensor:
        # Reset num_permuted_tokens to None to avoid reusing cached state from a prior dispatch.
        # This can happen in non-reentrant activation checkpointing mode.
        self.num_permuted_tokens = None
        if self.token_probs.dtype != torch.float32:
            self.token_probs = self.token_probs.float()
        dispatched_hidden, self.dispatched_probs, _, tokens_per_expert, self.handle = hybrid_ep_dispatch(
            x=hidden_states,
            routing_map=self.routing_map,
            probs=self.token_probs,
            group=self.group,
            num_local_experts=self.num_local_experts,
            num_sms_dispatch_api=self.moe_hybridep_num_sms,
            num_sms_combine_api=self.moe_hybridep_num_sms,
            num_permuted_tokens=self.num_permuted_tokens,
            pad_multiple=self.pad_multiple,
        )

        self.tokens_per_expert = tokens_per_expert
        self.num_permuted_tokens = self.tokens_per_expert.sum()
```
`_indices_to_multihot` is a pure tensor operation (no multi-GPU or DeepEP dependency needed) and is on the non-`permute_fusion` code path. Consider adding a unit test for it; edge cases like all `-1` indices, duplicate expert assignments, or `topk=1` would be good to cover.
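For concreteness, a minimal self-contained sketch of two of those edge cases. `indices_to_multihot` below is a hypothetical stand-in, assuming the operation maps a `[num_tokens, topk]` index tensor (with `-1` marking unassigned slots) plus matching probs to a `[num_tokens, num_experts]` boolean routing map and multihot probs; a real test would call `_HybridEPManager._indices_to_multihot` instead.

```python
import torch


def indices_to_multihot(indices: torch.Tensor, probs: torch.Tensor, num_experts: int):
    """Stand-in reference: convert [num_tokens, topk] expert indices
    (-1 = unassigned) into a [num_tokens, num_experts] bool routing map
    and float32 multihot probs."""
    num_tokens = indices.shape[0]
    routing_map = torch.zeros(num_tokens, num_experts, dtype=torch.bool)
    multihot_probs = torch.zeros(num_tokens, num_experts, dtype=torch.float32)
    valid = indices >= 0
    token_idx = torch.arange(num_tokens).unsqueeze(1).expand_as(indices)
    routing_map[token_idx[valid], indices[valid]] = True
    multihot_probs[token_idx[valid], indices[valid]] = probs[valid].float()
    return routing_map, multihot_probs


def test_all_invalid():
    # Every slot is -1: nothing should be routed, all probs stay zero.
    indices = torch.full((4, 2), -1, dtype=torch.long)
    probs = torch.rand(4, 2)
    routing_map, multihot_probs = indices_to_multihot(indices, probs, num_experts=8)
    assert not routing_map.any()
    assert multihot_probs.abs().sum() == 0


def test_topk_1():
    # topk=1 with one unassigned token: exactly two entries routed.
    indices = torch.tensor([[0], [3], [-1]])
    probs = torch.tensor([[0.9], [0.8], [0.7]])
    routing_map, multihot_probs = indices_to_multihot(indices, probs, num_experts=4)
    assert routing_map.sum() == 2
    assert routing_map[0, 0] and routing_map[1, 3]
    assert torch.isclose(multihot_probs[1, 3], torch.tensor(0.8))
```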
… tests

- Clean up DeepEP/, rdma-core/, and deepep.patch after Docker build to reduce image bloat
- Add unit tests for `_HybridEPManager._indices_to_multihot` covering basic, topk=1, all-invalid, partial-invalid, and single-token edge cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
/claude review

/ok to test 061d6f4
```dockerfile
# Install HybridEP
## Dependency: RDMA Core
RUN git clone https://github.com/linux-rdma/rdma-core.git && \
    cd rdma-core && git checkout tags/v60.0 && sh build.sh
```
Bug: the `INSTALL_DEEPEP` build arg was removed here (DeepEP is now always installed), but it's still referenced in:

- `CONTRIBUTING.md:85` (`export INSTALL_DEEPEP=True`)
- `CONTRIBUTING.md:91` (`--build-arg INSTALL_DEEPEP=$INSTALL_DEEPEP`)
- `.github/actions/build-container/action.yml:146` (`INSTALL_DEEPEP=True`)
The CI action will silently pass a build arg that no longer exists (harmless but confusing), and the CONTRIBUTING docs are now stale/misleading. These should be cleaned up.
Looks good overall — clean HybridEP integration with solid test coverage for the new paths.
One issue: the INSTALL_DEEPEP build arg was removed from the Dockerfile (DeepEP now builds unconditionally), but stale references remain in CONTRIBUTING.md and .github/actions/build-container/action.yml. See inline comment.
DeepEP is now always installed (no longer behind a build arg), so remove the leftover INSTALL_DEEPEP references from CONTRIBUTING.md and the build-container GitHub Action.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
/claude review

/ok to test 07ffc01
```python
        self.num_permuted_tokens = None
        return hidden_states

    def get_dispached_metadata(self) -> torch.Tensor:
```
Nit: typo: `get_dispached_metadata` should be `get_dispatched_metadata`. (Pre-existing in the base class and `_DeepepManager` too, but this is a good opportunity to fix it across all three since you're adding a new implementation.)
```python
    def dispatch(
        self,
        hidden_states: torch.Tensor,
        async_finish: bool = True,
```
Bug/silent no-op: `async_finish` and `allocate_on_comm_stream` are accepted here but silently ignored. The caller (`MoEFlexTokenDispatcher.token_dispatch` at L653 and `token_combine` at L732) passes these through for both backends, but only `_DeepepManager` honors them.
If HybridEP intentionally doesn't support these, consider either:
- Removing them from the signature and documenting why, or
- Wiring them through to `hybrid_ep_dispatch`/`hybrid_ep_combine` (e.g. `non_blocking` in `dispatch_with_permute` is already derived from `num_permuted_tokens`, but `allocate_on_comm_stream` has no equivalent)
As-is, a caller switching from deepep → hybridep will silently lose async behavior.
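A minimal sketch of what the resolution described in the commit below could look like (keep the parameters for interface compatibility, document that they are ignored). The class shell and docstring wording here are illustrative, not the PR's actual code:

```python
import torch


class _HybridEPManagerSketch:
    """Illustrative shell only; the real class is _HybridEPManager in
    nemo_automodel/components/moe/megatron/token_dispatcher.py."""

    def dispatch(
        self,
        hidden_states: torch.Tensor,
        async_finish: bool = True,
        allocate_on_comm_stream: bool = True,
    ) -> torch.Tensor:
        """Dispatch tokens via HybridEP.

        Note: `async_finish` and `allocate_on_comm_stream` are accepted only
        for interface compatibility with `_DeepepManager`; the HybridEP
        backend has no equivalent and ignores them.
        """
        # ... hybrid_ep_dispatch(...) call as quoted earlier in the review ...
        return hidden_states
```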
- Rename get_dispached_metadata -> get_dispatched_metadata across all three dispatch manager classes
- Document that async_finish and allocate_on_comm_stream are not supported by the HybridEP backend (kept in signature for interface compatibility with callers)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
/claude review

/ok to test b117377
Regenerated .secrets.baseline to keep line numbers in sync with the updated files. All entries are false positives (config keys, test fixtures, high-entropy example strings).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
/ok to test cf65bff

https://wandb.ai/Nemo-automodel/tulu3-convergence/runs/z2jpozm9
Wandb - https://wandb.ai/Nemo-automodel/automodel-moe-dispatcher
This PR adds HybridEP support for MoE token dispatch and updates dependency/docker wiring needed to run it reliably.
Changelog

- `_HybridEPManager` in `nemo_automodel/components/moe/megatron/token_dispatcher.py`.
- `moe_flex_dispatcher_backend` (`"deepep"` or `"hybridep"`), plus backend-specific SM settings.
- `nemo_automodel/components/moe/megatron/fused_a2a.py`: `set_deepep_num_sms`.
- `BackendConfig.dispatcher` now accepts `"hybridep"`.
- `dispatcher_num_sms`.
- `"hybridep"` as valid with `te`/`gmm` experts.
- `MoE`, `GroupedExpertsDeepEP`, and `GroupedExpertsTE`.
- `tests/unit_tests/moe/test_backend_config.py`
- `tests/unit_tests/moe/test_experts.py`
- `tests/unit_tests/moe/test_layers.py`
- `deep_ep` to `7febc6e25660af0f54d95dd781ecdcd62265ecca` (`v1.2.1+7febc6e` metadata).
- `nvidia-cudnn-cu12==9.19.0.56; sys_platform == 'linux'`.
- `uv.lock` accordingly.
- `docker/common/update_pyproject_pytorch.sh` now removes existing `override-dependencies` and reinserts `docker/common/uv-pytorch.toml` under `[tool.uv]` to avoid duplicate/incorrect override blocks.

Additional Information
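As a hypothetical illustration of the new config surface described in the changelog: the field names `dispatcher` and `dispatcher_num_sms` come from the bullets above, but the import path and values below are assumptions, not taken from the PR.

```python
# Hypothetical usage sketch; the import path and the SM value are assumed.
from nemo_automodel.components.moe.utils import BackendConfig  # assumed path

backend = BackendConfig(
    dispatcher="hybridep",   # previously only "deepep" was accepted here
    dispatcher_num_sms=16,   # backend-specific SM budget; 16 is an arbitrary example
)
```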
_HybridEPManagerinnemo_automodel/components/moe/megatron/token_dispatcher.py.moe_flex_dispatcher_backend("deepep"or"hybridep"), plus backend-specific SM settings.nemo_automodel/components/moe/megatron/fused_a2a.py:set_deepep_num_sms.BackendConfig.dispatchernow accepts"hybridep".dispatcher_num_sms."hybridep"as valid withte/gmmexperts.MoE,GroupedExpertsDeepEP, andGroupedExpertsTE.tests/unit_tests/moe/test_backend_config.pytests/unit_tests/moe/test_experts.pytests/unit_tests/moe/test_layers.pydeep_epto7febc6e25660af0f54d95dd781ecdcd62265ecca(v1.2.1+7febc6emetadata).nvidia-cudnn-cu12==9.19.0.56; sys_platform == 'linux'.uv.lockaccordingly.docker/common/update_pyproject_pytorch.shnow removes existingoverride-dependenciesand reinsertsdocker/common/uv-pytorch.tomlunder[tool.uv]to avoid duplicate/incorrect override blocks.Additional Information