[GraphTrainer][AutoDev] Fix missing CP wiring in llama3 and deepseek_v3 parallelize #2910
Draft
SherlockNoMad wants to merge 2 commits into main from
Conversation
…v3 parallelize

Port the fix from PR #2808 to GraphTrainer's experiment parallelize files.

llama3's parallelize_llama was calling apply_tp() without passing enable_cp and enable_sp, causing context parallelism to silently malfunction. Both llama3 and deepseek_v3 were missing the apply_cp_to_attention_module() call that configures attention modules for CP ring attention communication.

Changes:
- Add enable_cp and enable_sp kwargs to the apply_tp() call in llama3
- Add the apply_cp_to_attention_module() call in both llama3 and deepseek_v3
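The shape of the bug can be sketched as follows. This is a toy illustration, not torchtitan code: the function names apply_tp, apply_cp_to_attention_module, and parallelize_llama come from the PR, but their bodies and signatures here are simplified stand-ins that only record which parallelism styles were wired up.

```python
# Toy stand-ins for the real parallelize helpers (illustrative only).

def apply_tp(model, tp_mesh, *, enable_cp=False, enable_sp=False):
    """Stand-in: records which styles the TP plan was built for.
    Before the fix, callers omitted enable_cp/enable_sp, so these
    defaulted to False and CP silently stayed off."""
    model["tp_plan"] = {"cp": enable_cp, "sp": enable_sp}
    return model

def apply_cp_to_attention_module(model):
    """Stand-in for configuring attention for CP ring-attention comms."""
    model["cp_attention"] = True
    return model

def parallelize_llama(model, parallel_dims):
    # The fix: forward the CP/SP flags instead of relying on defaults.
    model = apply_tp(
        model,
        "tp_mesh",
        enable_cp=parallel_dims["cp_enabled"],  # previously missing kwarg
        enable_sp=parallel_dims["sp_enabled"],  # previously missing kwarg
    )
    if parallel_dims["cp_enabled"]:
        # The call both llama3 and deepseek_v3 were missing.
        model = apply_cp_to_attention_module(model)
    return model
```

With CP enabled, the resulting model now carries both the CP-aware TP plan and the attention-module configuration; before the fix, neither was set even when parallel_dims requested CP.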
SherlockNoMad
commented
Apr 9, 2026
```python
)
maybe_enable_async_tp(parallelism, compile_config, tp_mesh)

if parallel_dims.cp_enabled:
```
Contributor
Author
Do we have test coverage for this?
Can you try a config with CP enabled, and see what it takes to support CP?
The same comment applies to dsv3.
AutoDev: There was already test coverage for CP in JIT mode (lines 100-128 for llama3 HSDP+CP and FSDP+TP+CP, line 354-366 for deepseek_v3 FSDP+CP), but no aot_fx_trace CP tests existed.
Added two aot_fx_trace integration tests with context parallelism enabled:
- llama3: aot_fx_trace FSDP+TP+CP with dp_shard=2, tp=2, cp=2 on 8 GPUs (skip_rocm_test=True since aot_fx_trace applies cudagraph by default)
- deepseek_v3: aot_fx_trace FSDP+CP with dp_shard=4, cp=2 on 8 GPUs (follows the same pattern as the existing JIT FSDP+CP test; uses the SDPA config since CP only supports SDPA for dsv3)
Note: dsv3 CP test does not include EP/ETP since the existing JIT CP test also omits it — keeping the test focused on CP wiring.
…xt parallelism

Add integration tests for context parallelism (CP) in aot_fx_trace mode for both llama3 and deepseek_v3, covering the CP wiring added in the parent commit.
- llama3: aot_fx_trace FSDP+TP+CP (dp_shard=2, tp=2, cp=2, 8 GPUs)
- deepseek_v3: aot_fx_trace FSDP+CP (dp_shard=4, cp=2, 8 GPUs)
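As a quick sanity check on the test configurations above: the product of the mesh dimensions must equal the GPU world size. The helper below is illustrative (not part of the test harness); only the dp_shard/tp/cp values are taken from the PR.

```python
# Sanity-check that each CP test config exactly fills the 8-GPU world.

def world_size(dp_shard: int, tp: int = 1, cp: int = 1) -> int:
    """Total ranks needed for a dp_shard x tp x cp device mesh."""
    return dp_shard * tp * cp

# llama3 aot_fx_trace FSDP+TP+CP: 2 * 2 * 2 = 8 ranks
assert world_size(dp_shard=2, tp=2, cp=2) == 8
# deepseek_v3 aot_fx_trace FSDP+CP: 4 * 1 * 2 = 8 ranks
assert world_size(dp_shard=4, cp=2) == 8
```

If these products did not match the launch world size, mesh construction would fail before the CP wiring under test is ever exercised.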
yiming0416
approved these changes
Apr 10, 2026
Force-pushed from 9b439fe to 958b539.
Summary
- parallelize_llama was calling apply_tp() without passing enable_cp and enable_sp, causing context parallelism to silently malfunction
- Both models were missing the apply_cp_to_attention_module() call that configures attention modules for CP ring attention communication

Changes
- llama3: add enable_cp and enable_sp keyword arguments to the apply_tp() call, and add the apply_cp_to_attention_module() call after maybe_enable_async_tp
- deepseek_v3: add the apply_cp_to_attention_module() call after the TP/EP block (deepseek_v3 already had enable_cp/enable_sp in its apply_non_moe_tp() call)

Test plan
- pre-commit run --all-files (passed, no new issues)

Ports fix from #2808.