[graph_trainer] Fix missing CP wiring in llama3 and deepseek_v3 parallelize#2808
Closed
SherlockNoMad wants to merge 1 commit into gh/SherlockNoMad/13/base from
Conversation
…lelize

Llama3's parallelize_llama was calling apply_tp() without passing enable_cp and enable_sp, causing context parallelism to silently malfunction: TP plans wouldn't account for CP sharding, and positions wouldn't be handled correctly. Both llama3 and deepseek_v3 were missing the apply_cp_to_attention_module() call that configures attention modules for CP ring attention communication. Without this, inputs are sharded by CP (via Trainer.post_dataloading_process) but attention modules compute only on the local shard without seeing the full context.

Changes:
- llama3/parallelize.py: Pass enable_cp and enable_sp to apply_tp(), add apply_cp_to_attention_module() call
- deepseek_v3/parallelize.py: Add apply_cp_to_attention_module() call (enable_cp/enable_sp were already passed correctly)

[ghstack-poisoned]
This was referenced Apr 3, 2026

SherlockNoMad added a commit that referenced this pull request on Apr 9, 2026:

…v3 parallelize

Port the fix from PR #2808 to GraphTrainer's experiment parallelize files. llama3's parallelize_llama was calling apply_tp() without passing enable_cp and enable_sp, causing context parallelism to silently malfunction. Both llama3 and deepseek_v3 were missing the apply_cp_to_attention_module() call that configures attention modules for CP ring attention communication.

Changes:
- Add enable_cp and enable_sp kwargs to apply_tp() call in llama3
- Add apply_cp_to_attention_module() call in both llama3 and deepseek_v3
Stack from ghstack (oldest at bottom):
Llama3's parallelize_llama was calling apply_tp() without passing
enable_cp and enable_sp, causing context parallelism to silently
malfunction — TP plans wouldn't account for CP sharding, and positions
wouldn't be handled correctly.
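For illustration, a minimal sketch of the llama3 side of the fix. parallelize_llama and apply_tp() are named in this PR, but the exact signatures, the ParallelDims fields, and the mesh lookup below are assumptions, not the real API:

```python
# Hypothetical sketch only: signatures, config fields, and mesh names are
# assumed for illustration and may not match the actual code.
def parallelize_llama(model, world_mesh, parallel_dims, job_config):
    if parallel_dims.tp_enabled:
        apply_tp(
            model,
            world_mesh["tp"],
            # Previously these kwargs were omitted, so the TP plan was built as
            # if CP/SP were off and never accounted for CP sharding or positions.
            enable_cp=parallel_dims.cp_enabled,  # assumed field name
            enable_sp=parallel_dims.sp_enabled,  # assumed field name
        )
    return model
```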
Both llama3 and deepseek_v3 were missing the apply_cp_to_attention_module()
call that configures attention modules for CP ring attention communication.
Without this, inputs are sharded by CP (via Trainer.post_dataloading_process)
but attention modules compute only on the local shard without seeing the
full context.
Changes:
- llama3/parallelize.py: Pass enable_cp and enable_sp to apply_tp(), add apply_cp_to_attention_module() call
- deepseek_v3/parallelize.py: Add apply_cp_to_attention_module() call (enable_cp/enable_sp were already passed correctly)
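A sketch of the attention wiring both models were missing. apply_cp_to_attention_module() is the helper named in this PR, but its signature, the cp_mesh argument, and the model.layers traversal are assumptions:

```python
# Hypothetical sketch: the helper name comes from the PR; everything else
# (signature, mesh argument, module layout) is assumed for illustration.
def apply_cp(model, cp_mesh):
    for block in model.layers.values():
        # Configure each attention module for CP ring attention so it exchanges
        # KV shards with its CP peers; otherwise inputs arrive CP-sharded (via
        # Trainer.post_dataloading_process) but attention computes only on its
        # local sequence chunk.
        apply_cp_to_attention_module(block.attention, cp_mesh)
```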