support cp, fix qwen3.5 gdn sp #138

Merged
meichangsu1 merged 7 commits into modelscope:main from meichangsu1:fsdp_cp_ljl
Apr 21, 2026
Conversation


@meichangsu1 (Collaborator) commented Apr 2, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

This PR adds context parallel and Qwen3.5 Gated DeltaNet sequence parallel support to the transformers stack, and refactors sequence parallel into a package-based implementation.

Main changes:

  • Refactor sequence_parallel.py into sequence_parallel/ and add shared utilities.
  • Add derived ring / zigzag ring attention support for CP + SP (see the zigzag sharding sketch after this list).
  • Add Qwen3.5 linear attention SP support in linear_attention_sp.py; ring attention is not supported for this path yet.
  • Update transformers model / processor paths to work with the new SP+CP flow.
  • Adjust loss metric aggregation for Ulysses replicated loss behavior.
  • Update cookbook examples for sp_fsdp_dense.
  • Add test coverage for:
    • Qwen3.5 linear attention SP alignment
    • sequence parallel + context parallel behavior
  • Remove outdated tests/moe/test_expert_parallel_qwen3_fsdp_sp.py.
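
To make the zigzag ring attention item concrete, here is a minimal sketch of the sharding idea under the usual assumptions for causal ring attention; the helper name and layout are illustrative, not this PR's actual API. The sequence is split into 2 × world_size chunks and rank i keeps chunks i and 2 × world_size − 1 − i, pairing an early chunk with a late one so every rank does comparable causal-attention work:

```python
import torch

def zigzag_shard(x: torch.Tensor, world_size: int, rank: int, dim: int = 1) -> torch.Tensor:
    # Hypothetical helper, not twinkle's actual API: split the sequence
    # dimension into 2 * world_size chunks and give rank i the pair
    # (i, 2 * world_size - 1 - i), so each rank owns one early and one
    # late chunk of the causal sequence.
    chunks = x.chunk(2 * world_size, dim=dim)
    return torch.cat([chunks[rank], chunks[2 * world_size - 1 - rank]], dim=dim)

# Example: 4 ranks, 16 tokens -> rank 0 holds tokens 0-1 and 14-15.
```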

Experiment results

@meichangsu1 changed the title from "Fsdp cp ljl" to "support cp, fix qwen3.5 gdn sp" on Apr 2, 2026
@gemini-code-assist (Bot) left a comment


Code Review

This pull request significantly enhances sequence parallelism support by implementing ZigZag Ring Attention for long-sequence training and Ulysses-style sequence parallelism for Qwen3.5 linear attention. It also introduces multimodal deepstack patching for Qwen3-VL and refactors the SequenceParallel strategy to better handle complex device meshes and packed/varlen inputs. Feedback focuses on improving code maintainability and robustness, specifically by grouping attributes in the SequenceParallel constructor, removing redundant logic and unused imports, replacing deprecated inspection methods, and centralizing duplicated loss-gathering logic.
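
For reference, the core of an Ulysses-style exchange is a single all-to-all that trades sequence shards for head shards, so each rank sees the full sequence for a subset of heads. The sketch below is a minimal illustration under an assumed (batch, seq, heads, dim) layout and contiguous sequence sharding; it is not the code in this PR:

```python
import torch
import torch.distributed as dist

def seq_to_head_all_to_all(x: torch.Tensor, group: dist.ProcessGroup) -> torch.Tensor:
    # Hypothetical sketch: each of the P ranks enters with
    # (batch, seq_len / P, num_heads, head_dim) and leaves with
    # (batch, seq_len, num_heads / P, head_dim), so attention (or the
    # gated delta rule) runs over the full sequence locally.
    p = dist.get_world_size(group)
    b, s, h, d = x.shape  # s = seq_len / P; h must be divisible by P
    # Arrange one head shard per peer along a leading rank axis.
    x = x.reshape(b, s, p, h // p, d).permute(2, 0, 1, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # The leading axis now indexes source ranks, i.e. sequence shards.
    return out.permute(1, 0, 2, 3, 4).reshape(b, s * p, h // p, d)
```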

Comment thread: src/twinkle/model/transformers/strategy/sequence_parallel/linear_attention_sp.py (outdated)
Comment thread: src/twinkle/model/transformers/transformers.py (outdated)
Comment thread: src/twinkle/metric/loss.py (outdated)
Comment thread: src/twinkle/model/transformers/transformers.py (outdated)
Comment thread: src/twinkle/model/transformers/transformers.py (outdated)
- Refactor linear attention sequence parallel import error message into a constant
- Fix token counting in TransformersModel by using raw DP/FSDP world size instead of data_world_size
- Enhance Framework.gather_object to check distributed initialization before accessing world size (sketched after this list)
- Add test utility for creating padded labels in sequence parallel tests
- Add `num_tokens` field to `ModelOutput` TypedDict for explicit token denominator
- Update `LossOutput` to use `OutputType` for `num_tokens` instead of `int`
- Refactor `LossMetric` to prefer `num_tokens` from outputs, with fallback to labels (see the sketch after this list)
- Remove `_get_raw_dp_fsdp_world_size` helper and use `_device_mesh._get_dp_fsdp_world_size`
- Use `InputProcessor.postprocess_tensor_sp` for loss tensor gathering in TransformersModel
- Simplify sequence-parallel loss normalization by relying on output `num_tokens`
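
Two of the behaviors above are easy to miss, so here is a hypothetical sketch of both: the `num_tokens`-first loss denominator with label fallback, and the `gather_object` guard for uninitialized process groups. Names, signatures, and the `-100` ignore index are assumptions, not twinkle's actual API:

```python
import torch
import torch.distributed as dist

IGNORE_INDEX = -100  # assumption: standard HF-style label padding value

def loss_denominator(outputs: dict, labels: torch.Tensor) -> torch.Tensor:
    # Prefer the explicit token count reported in the model output;
    # fall back to counting non-padded label positions.
    num_tokens = outputs.get("num_tokens")
    if num_tokens is None:
        num_tokens = (labels != IGNORE_INDEX).sum()
    return num_tokens

def gather_object(obj):
    # Guard: only touch the world size once the process group exists;
    # otherwise behave like a single-process gather.
    if not (dist.is_available() and dist.is_initialized()):
        return [obj]
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, obj)
    return gathered
```
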
@meichangsu1 merged commit 62a14d8 into modelscope:main on Apr 21, 2026
1 of 3 checks passed