
[graph_trainer] Fix missing CP wiring in llama3 and deepseek_v3 parallelize#2808

Closed
SherlockNoMad wants to merge 1 commit into gh/SherlockNoMad/13/base from gh/SherlockNoMad/13/head

Conversation


SherlockNoMad (Contributor) commented Apr 3, 2026

Stack from ghstack (oldest at bottom):

Llama3's parallelize_llama was calling apply_tp() without passing
enable_cp and enable_sp, causing context parallelism to silently
malfunction — TP plans wouldn't account for CP sharding, and positions
wouldn't be handled correctly.

Both llama3 and deepseek_v3 were missing the apply_cp_to_attention_module()
call that configures attention modules for CP ring attention communication.
Without this, inputs are sharded by CP (via Trainer.post_dataloading_process)
but attention modules compute only on the local shard without seeing the
full context.
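The failure mode described above can be sketched in plain Python. Everything here is hypothetical and stubbed for illustration; in the real trainer, inputs are sharded by `Trainer.post_dataloading_process` and the full context is recovered via ring-attention KV exchange rather than a flag.

```python
# Hypothetical sketch of the CP sharding failure mode described above.
# Each CP rank receives only a contiguous shard of the sequence; without
# the ring-attention wiring, attention attends only to its local tokens.

def shard_sequence(tokens, cp_degree, cp_rank):
    """Mimic CP input sharding: each rank gets a contiguous 1/cp_degree slice."""
    shard_len = len(tokens) // cp_degree
    return tokens[cp_rank * shard_len:(cp_rank + 1) * shard_len]

def visible_context(tokens, cp_degree, cp_rank, cp_wired):
    """Tokens an attention module on this rank can attend to."""
    if cp_wired:
        # Ring attention exchanges KV shards between CP ranks -> full context.
        return tokens
    # Bug: attention silently computes on the local shard only.
    return shard_sequence(tokens, cp_degree, cp_rank)

tokens = list(range(8))
# Without the CP wiring, rank 1 of a 2-way CP group sees only half the context:
assert visible_context(tokens, 2, 1, cp_wired=False) == [4, 5, 6, 7]
# With CP wired, every rank effectively attends over the full sequence:
assert visible_context(tokens, 2, 1, cp_wired=True) == tokens
```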

Changes:

- llama3/parallelize.py: Pass enable_cp and enable_sp to apply_tp(),
  add apply_cp_to_attention_module() call
- deepseek_v3/parallelize.py: Add apply_cp_to_attention_module() call
  (enable_cp/enable_sp were already passed correctly)
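The corrected wiring can be sketched as follows. This is a stubbed, runnable sketch, not the actual torchtitan code: the real `apply_tp` and `apply_cp_to_attention_module` signatures, and the `parallelize_llama` body, are assumed for illustration.

```python
# Hypothetical, stubbed sketch of the parallelize wiring fix. The real
# functions live in the graph_trainer experiment; signatures are assumed.

def apply_tp(model, tp_mesh, *, enable_cp=False, enable_sp=False):
    # Stub: the real apply_tp builds TP plans; with enable_cp it also
    # accounts for CP sharding of positions and activations.
    model["tp"] = {"enable_cp": enable_cp, "enable_sp": enable_sp}

def apply_cp_to_attention_module(model, cp_mesh):
    # Stub: configures attention modules for CP ring-attention communication.
    model["cp_attention"] = True

def parallelize_llama(model, tp_mesh, cp_mesh, enable_cp, enable_sp):
    # Before the fix: apply_tp(model, tp_mesh) dropped the CP/SP flags,
    # and apply_cp_to_attention_module() was never called.
    apply_tp(model, tp_mesh, enable_cp=enable_cp, enable_sp=enable_sp)
    if enable_cp:
        apply_cp_to_attention_module(model, cp_mesh)
    return model

m = parallelize_llama({}, "tp_mesh", "cp_mesh", enable_cp=True, enable_sp=True)
assert m["tp"] == {"enable_cp": True, "enable_sp": True}
assert m["cp_attention"] is True
```

With CP disabled, the attention wiring is skipped and only the TP flags are recorded, mirroring how the flags should flow through unchanged.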

The meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Apr 3, 2026.
SherlockNoMad added a commit that referenced this pull request Apr 9, 2026
…v3 parallelize

Port the fix from PR #2808 to GraphTrainer's experiment parallelize files.

llama3's parallelize_llama was calling apply_tp() without passing enable_cp
and enable_sp, causing context parallelism to silently malfunction. Both
llama3 and deepseek_v3 were missing the apply_cp_to_attention_module() call
that configures attention modules for CP ring attention communication.

Changes:
- Add enable_cp and enable_sp kwargs to apply_tp() call in llama3
- Add apply_cp_to_attention_module() call in both llama3 and deepseek_v3
SherlockNoMad deleted the gh/SherlockNoMad/13/head branch on April 9, 2026.

Labels

ciflow/8gpu, CLA Signed
