feat: Qwen3.5 VLM TP+PP support with per-microbatch grad reduce-scatter knob#1859
Merged
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
/ok to test 010ad75
…1813)

* fix: FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params. PR #1711 changed _should_load_before_shard to return False for multi-GPU DP, so models stay on meta device through FSDP wrapping. This broke the __dict__ trick in PR #1710's patch_hf_model. Move the gate computation into _Fp32ParamHolder.forward() so FSDP's unshard/reshard lifecycle fires naturally. Override the CPAwareGatedDeltaNet forward for both CP and non-CP paths to route through the holder.
* chore: remove test yaml not intended for PR.
* fix: add sentinel to prevent __getattr__ re-wrapping. Address Claude review: guard against re-wrapping __getattr__ on repeated patch_hf_model calls by checking a class-level sentinel attribute.
* fix: add upstream version comment to _forward_no_cp. Address Claude review: note the transformers version the forward was copied from, to ease future upstream diffing.
* fix: update MoE test expectations for the _forward_no_cp path. TestForwardFastPath tests expected super().forward() to be called, but the non-CP path now uses _forward_no_cp(). Update the mocks to match.
* test: add coverage for _Fp32ParamHolder, _compute_gate, and the sentinel guard. Unit tests for: _Fp32ParamHolder.forward gate computation and dtype preservation; _compute_gate routing through the holder vs the inline fallback; the patch_hf_model sentinel preventing __getattr__ re-wrapping.
* test: add coverage for _forward_no_cp and forward() dispatch paths. 14 new tests covering the critical _forward_no_cp method (lines 91-193) and the forward() dispatch logic (lines 207-213) to satisfy codecov/patch requirements for PR #1813: _forward_no_cp basic forward, cache_params=None, the causal_conv1d_fn fallback, causal_conv1d_fn set, attention_mask, GQA repeat-interleave, _compute_gate delegation, and output dtype; forward() dispatch when _cp_mesh is None or its size <= 1, parameter pass-through, and extra CP kwargs; _make_fp32_getattr fallback to AttributeError and real attr resolution.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
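The sentinel guard from the "prevent __getattr__ re-wrapping" commit can be sketched roughly as follows; the function name comes from the PR, but the sentinel attribute name and the body are illustrative assumptions, not the actual implementation:

```python
def patch_hf_model(cls):
    """Wrap cls.__getattr__ once; a class-level sentinel prevents
    repeated patch calls from stacking wrappers (illustrative sketch,
    sentinel name assumed)."""
    # Check the class's own __dict__ so an inherited sentinel on a base
    # class does not suppress patching a subclass.
    if cls.__dict__.get("_fp32_getattr_patched", False):
        return cls  # already patched: do nothing

    orig = getattr(cls, "__getattr__", None)

    def __getattr__(self, name):
        # ... fp32-holder attribute routing would go here ...
        if orig is not None:
            return orig(self, name)
        raise AttributeError(name)

    cls.__getattr__ = __getattr__
    cls._fp32_getattr_patched = True  # sentinel: mark as patched
    return cls
```

Calling `patch_hf_model` twice on the same class leaves the first wrapper in place instead of nesting a second one.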
…mparouli/fix_qwen3_5_extract_model_layers Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/ok to test 23803c5
- Use Qwen/Qwen3.5-27B instead of a local checkpoint path
- Add commented-out wandb section so users know how to enable it

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
- Remove duplicate self.pp.update_seq_len call in vlm/finetune.py (line 940 already covers it every batch; update_seq_len short-circuits when seq_len is unchanged).
- Drop the string-keyed Qwen3_5ForConditionalGeneration entry from VLM_MODEL_CLS_TO_LAYERS; the class-keyed entry is sufficient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
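The duplicate call was safe to drop because update_seq_len is idempotent per batch. A minimal sketch of that short-circuit behavior, with a hypothetical coordinator class (the real code lives in the pipeline scheduling path; names here are assumed):

```python
class PipelineCoordinator:
    """Hypothetical sketch: update_seq_len is called every batch but
    returns early when seq_len is unchanged, so a second call in the
    same batch was pure redundancy."""

    def __init__(self):
        self._seq_len = None
        self.rebuilds = 0  # stands in for shape-dependent schedule work

    def update_seq_len(self, seq_len):
        if seq_len == self._seq_len:
            return  # short-circuit: nothing to rebuild
        self._seq_len = seq_len
        self.rebuilds += 1
```

Calling it twice with the same length triggers the rebuild logic only once.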
…allback

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/claude review
/ok to test b29ec79
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/ok to test a73bf2a
The helper previously returned any attr named 'language_model' / 'text_model' / 'text_decoder', including auto-generated unittest Mocks, which broke pipeline_forward tests that passed a plain Mock model. Now only descend into real nn.Module instances. Also explicitly set embed_tokens / layers / norm to None on the mocked text module in the two get_text_module rotary tests, so the now-routed pipeline_forward skips those branches cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
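A pure-Python sketch of that guard, using a stand-in Module class in place of torch.nn.Module (the helper name and attribute list come from the commit; the body is an assumption):

```python
from unittest.mock import Mock


class Module:
    """Stand-in for torch.nn.Module in this sketch."""


def get_text_module(model):
    # Only descend into real Module instances. A plain Mock auto-generates
    # an attribute for any of these names, which is how the old version
    # broke pipeline_forward tests that passed a Mock model.
    for name in ("language_model", "text_model", "text_decoder"):
        attr = getattr(model, name, None)
        if isinstance(attr, Module):
            return attr
    return model
```

With this guard a Mock model is returned unchanged, while a real nested text module is still found.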
Eagerly importing Qwen3_5ForConditionalGeneration at module load pre-loaded transformers.models.qwen3_5 into sys.modules, defeating test_cp_linear_attn_patch.py's module stubbing. Switch to a string-based class-qualname lookup plus __name__ comparison instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
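The __name__-comparison approach can be sketched like this; the table name resembles the one in the PR, but the mapped value and helper name are hypothetical:

```python
# Keyed by class __name__ instead of the class object, so the
# transformers module is never imported at load time and test-time
# sys.modules stubbing keeps working. The mapped value here is a
# hypothetical attribute path, not the real table's contents.
VLM_MODEL_CLS_NAME_TO_LAYERS = {
    "Qwen3_5ForConditionalGeneration": "language_model.layers",
}


def layers_attr_for(model):
    # Dispatch on type(model).__name__: no eager import needed.
    return VLM_MODEL_CLS_NAME_TO_LAYERS.get(type(model).__name__)
```

The trade-off is that the lookup matches any class with that __name__, which is acceptable here since the table is consulted only for models already routed through this code path.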
/ok to test d0bee86
/claude review
HuiyingLi approved these changes (Apr 21, 2026)
Summary
Enables tensor + pipeline parallelism for Qwen3.5-27B VLM end-to-end, and adds a new PipelineConfig.reduce_grad_per_microbatch knob that keeps FSDP gradients sharded across microbatches (saves ~27 GB per rank for a 13B-trainable-param stage).

Changes
Qwen3.5 VLM TP plan (optimized_tp_plans.py, parallelizer.py)
- Register Qwen3_5ForConditionalGeneration in PARALLELIZE_FUNCTIONS; the plan delegates to get_hf_tp_shard_plan, which reads transformers' base_model_tp_plan from Qwen3_5TextConfig and prefixes it with model.language_model.
- Update get_hf_tp_shard_plan's dispatch so inner-model nesting resolves correctly.
- Treat the replicated_with_grad_allreduce style as a no-op under FSDP+TP (norm weights are naturally replicated on the TP mesh; FSDP handles grad sync).
- linear_attn (GatedDeltaNet) layers remain un-TP-sharded: transformers itself doesn't provide a plan for them, since the stock chunk_gated_delta_rule / causal_conv1d_fn kernels aren't TP-aware.

reduce_grad_per_microbatch knob (config.py, autopipeline.py, functional.py, fsdp_mixin.py, kd.py)
- False preserves current behavior: FSDP no_sync across microbatches, reduce-scatter once at the end.
- With True, every microbatch backward calls set_requires_gradient_sync(True) so FSDP reduce-scatters per microbatch. Grads stay sharded; the full-stage no_sync accumulator (stage_trainable_params × 2 bytes) is eliminated.
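The knob's per-microbatch sync decision can be sketched as follows. This is a hedged sketch with duck-typed stand-ins, not the actual fsdp_mixin code; set_requires_gradient_sync is the real FSDP2 method named above, everything else is assumed:

```python
def run_microbatch_backwards(losses, fsdp_module, reduce_grad_per_microbatch):
    """Sketch of the knob's control flow (surrounding names assumed).

    With the knob on, every microbatch reduce-scatters its gradient
    shard; with it off, sync is deferred (FSDP no_sync) and the
    reduce-scatter fires only on the last microbatch.
    """
    n = len(losses)
    for i, loss in enumerate(losses):
        is_last = i == n - 1
        sync_now = reduce_grad_per_microbatch or is_last
        fsdp_module.set_requires_gradient_sync(sync_now)
        loss.backward()
```

The memory win follows directly: with per-microbatch sync, FSDP never materializes the full-stage unsharded gradient accumulator that no_sync would otherwise keep alive across microbatches.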
Recipe: examples/vlm_finetune/qwen3_5/qwen3_5_27b_tp4pp4.yaml, a 2-node (16 GPUs) tp=4, pp=4, dp=1 config with the new knob enabled.

Validation (8 GPUs, pp=2, tp=1, dp=4, lbs=4)
- Directly measured: the full-grad accumulator dropped from 26.9 GB (425 full-size grad tensors) to 6.7 GB (0 full-size grad tensors) after the microbatch-0 backward.
100-step convergence run (wandb)
- qwen35_27b_tp4pp4: the 16-GPU run completed 99+ steps, loss 1.56 → 1.07–1.30, peak ~40–52 GB per rank: https://wandb.ai/Nemo-automodel/huiyingl_workspace/runs/d89hnwou

Test plan
- … (patched_backward_maybe_with_nosync path)
- … (patched_backward_maybe_with_nosync)

🤖 Generated with Claude Code