[graph_trainer] Fix missing CP wiring in llama3 and deepseek_v3 parallelize#2808
Closed
SherlockNoMad wants to merge 1 commit into gh/SherlockNoMad/13/base from
Conversation
…lelize

Llama3's parallelize_llama was calling apply_tp() without passing enable_cp and enable_sp, causing context parallelism to silently malfunction: TP plans wouldn't account for CP sharding, and positions wouldn't be handled correctly. Both llama3 and deepseek_v3 were missing the apply_cp_to_attention_module() call that configures attention modules for CP ring attention communication. Without this, inputs are sharded by CP (via Trainer.post_dataloading_process) but attention modules compute only on the local shard without seeing the full context.

Changes:
- llama3/parallelize.py: Pass enable_cp and enable_sp to apply_tp(), add apply_cp_to_attention_module() call
- deepseek_v3/parallelize.py: Add apply_cp_to_attention_module() call (enable_cp/enable_sp were already passed correctly)

[ghstack-poisoned]
This was referenced Apr 3, 2026

SherlockNoMad added a commit that referenced this pull request on Apr 9, 2026:

…v3 parallelize

Port the fix from PR #2808 to GraphTrainer's experiment parallelize files. llama3's parallelize_llama was calling apply_tp() without passing enable_cp and enable_sp, causing context parallelism to silently malfunction. Both llama3 and deepseek_v3 were missing the apply_cp_to_attention_module() call that configures attention modules for CP ring attention communication.

Changes:
- Add enable_cp and enable_sp kwargs to apply_tp() call in llama3
- Add apply_cp_to_attention_module() call in both llama3 and deepseek_v3
Stack from ghstack (oldest at bottom):
Llama3's parallelize_llama was calling apply_tp() without passing
enable_cp and enable_sp, causing context parallelism to silently
malfunction — TP plans wouldn't account for CP sharding, and positions
wouldn't be handled correctly.
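For illustration, a minimal sketch of the llama3 side of the fix. parallelize_llama and apply_tp() are named in this PR, but the exact signatures, the ParallelDims fields, and the mesh lookup below are assumptions, not the real API:

```python
# Hypothetical sketch only: signatures, config fields, and mesh names are
# assumed for illustration and may not match the actual code.
def parallelize_llama(model, world_mesh, parallel_dims, job_config):
    if parallel_dims.tp_enabled:
        apply_tp(
            model,
            world_mesh["tp"],
            # Previously these kwargs were omitted, so the TP plan was built as
            # if CP/SP were off and never accounted for CP sharding or positions.
            enable_cp=parallel_dims.cp_enabled,  # assumed field name
            enable_sp=parallel_dims.sp_enabled,  # assumed field name
        )
    return model
```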
Both llama3 and deepseek_v3 were missing the apply_cp_to_attention_module()
call that configures attention modules for CP ring attention communication.
Without this, inputs are sharded by CP (via Trainer.post_dataloading_process)
but attention modules compute only on the local shard without seeing the
full context.
Changes:
- llama3/parallelize.py: Pass enable_cp and enable_sp to apply_tp(), add apply_cp_to_attention_module() call
- deepseek_v3/parallelize.py: Add apply_cp_to_attention_module() call (enable_cp/enable_sp were already passed correctly)
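A sketch of the attention wiring both models were missing. apply_cp_to_attention_module() is the helper named in this PR, but its signature, the cp_mesh argument, and the model.layers traversal are assumptions:

```python
# Hypothetical sketch: the helper name comes from the PR; everything else
# (signature, mesh argument, module layout) is assumed for illustration.
def apply_cp(model, cp_mesh):
    for block in model.layers.values():
        # Configure each attention module for CP ring attention so it exchanges
        # KV shards with its CP peers; otherwise inputs arrive CP-sharded (via
        # Trainer.post_dataloading_process) but attention computes only on its
        # local sequence chunk.
        apply_cp_to_attention_module(block.attention, cp_mesh)
```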