feat: Qwen3.5 VLM TP+PP support with per-microbatch grad reduce-scatter knob#1859
Merged
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
/ok to test 010ad75
…1813)

* fix: FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params. PR #1711 changed _should_load_before_shard to return False for multi-GPU DP, so models stay on meta device through FSDP wrapping. This broke the __dict__ trick in PR #1710's patch_hf_model. Move the gate computation into _Fp32ParamHolder.forward() so FSDP's unshard/reshard lifecycle fires naturally. Override the CPAwareGatedDeltaNet forward for both CP and non-CP paths to route through the holder.
* chore: remove test yaml not intended for PR.
* fix: add sentinel to prevent __getattr__ re-wrapping. Address Claude review: guard against re-wrapping __getattr__ on repeated patch_hf_model calls by checking a class-level sentinel attribute.
* fix: add upstream version comment to _forward_no_cp. Address Claude review: note the transformers version the forward was copied from, to ease future upstream diffing.
* fix: update MoE test expectations for the _forward_no_cp path. TestForwardFastPath tests expected super().forward() to be called, but the non-CP path now uses _forward_no_cp(). Update the mocks to match.
* test: add coverage for _Fp32ParamHolder, _compute_gate, and the sentinel guard. Unit tests for: _Fp32ParamHolder.forward gate computation and dtype preservation; _compute_gate routing through the holder vs the inline fallback; the patch_hf_model sentinel preventing __getattr__ re-wrapping.
* test: add coverage for _forward_no_cp and forward() dispatch paths. 14 new tests covering the critical _forward_no_cp method (lines 91-193) and the forward() dispatch logic (lines 207-213) to satisfy codecov/patch requirements for PR #1813: _forward_no_cp basic forward, cache_params=None, the causal_conv1d_fn fallback, causal_conv1d_fn set, attention_mask, GQA repeat-interleave, _compute_gate delegation, and output dtype; forward() dispatch when _cp_mesh is None or its size <= 1, parameter pass-through, and extra CP kwargs; _make_fp32_getattr fallback to AttributeError and real attr resolution.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
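The sentinel guard from the "prevent __getattr__ re-wrapping" commit can be sketched roughly as follows; the function name comes from the PR, but the sentinel attribute name and the body are illustrative assumptions, not the actual implementation:

```python
def patch_hf_model(cls):
    """Wrap cls.__getattr__ once; a class-level sentinel prevents
    repeated patch calls from stacking wrappers (illustrative sketch,
    sentinel name assumed)."""
    # Check the class's own __dict__ so an inherited sentinel on a base
    # class does not suppress patching a subclass.
    if cls.__dict__.get("_fp32_getattr_patched", False):
        return cls  # already patched: do nothing

    orig = getattr(cls, "__getattr__", None)

    def __getattr__(self, name):
        # ... fp32-holder attribute routing would go here ...
        if orig is not None:
            return orig(self, name)
        raise AttributeError(name)

    cls.__getattr__ = __getattr__
    cls._fp32_getattr_patched = True  # sentinel: mark as patched
    return cls
```

Calling `patch_hf_model` twice on the same class leaves the first wrapper in place instead of nesting a second one.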
…mparouli/fix_qwen3_5_extract_model_layers Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/ok to test 23803c5
- Use Qwen/Qwen3.5-27B instead of a local checkpoint path
- Add commented-out wandb section so users know how to enable it

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
- Remove duplicate self.pp.update_seq_len call in vlm/finetune.py (line 940 already covers it every batch; update_seq_len short-circuits when seq_len is unchanged).
- Drop the string-keyed Qwen3_5ForConditionalGeneration entry from VLM_MODEL_CLS_TO_LAYERS; the class-keyed entry is sufficient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
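The duplicate call was safe to drop because update_seq_len is idempotent per batch. A minimal sketch of that short-circuit behavior, with a hypothetical coordinator class (the real code lives in the pipeline scheduling path; names here are assumed):

```python
class PipelineCoordinator:
    """Hypothetical sketch: update_seq_len is called every batch but
    returns early when seq_len is unchanged, so a second call in the
    same batch was pure redundancy."""

    def __init__(self):
        self._seq_len = None
        self.rebuilds = 0  # stands in for shape-dependent schedule work

    def update_seq_len(self, seq_len):
        if seq_len == self._seq_len:
            return  # short-circuit: nothing to rebuild
        self._seq_len = seq_len
        self.rebuilds += 1
```

Calling it twice with the same length triggers the rebuild logic only once.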
…allback

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/claude review
/ok to test b29ec79
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/ok to test a73bf2a
The helper previously returned any attr named 'language_model' / 'text_model' / 'text_decoder', including auto-generated unittest Mocks, which broke pipeline_forward tests that passed a plain Mock model. Now only descend into real nn.Module instances. Also explicitly set embed_tokens / layers / norm to None on the mocked text module in the two get_text_module rotary tests, so the now-routed pipeline_forward skips those branches cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
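A pure-Python sketch of that guard, using a stand-in Module class in place of torch.nn.Module (the helper name and attribute list come from the commit; the body is an assumption):

```python
from unittest.mock import Mock


class Module:
    """Stand-in for torch.nn.Module in this sketch."""


def get_text_module(model):
    # Only descend into real Module instances. A plain Mock auto-generates
    # an attribute for any of these names, which is how the old version
    # broke pipeline_forward tests that passed a Mock model.
    for name in ("language_model", "text_model", "text_decoder"):
        attr = getattr(model, name, None)
        if isinstance(attr, Module):
            return attr
    return model
```

With this guard a Mock model is returned unchanged, while a real nested text module is still found.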
Eagerly importing Qwen3_5ForConditionalGeneration at module load pre-loaded transformers.models.qwen3_5 into sys.modules, defeating test_cp_linear_attn_patch.py's module stubbing. Switch to a string-based class-qualname lookup plus __name__ comparison instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
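The __name__-comparison approach can be sketched like this; the table name resembles the one in the PR, but the mapped value and helper name are hypothetical:

```python
# Keyed by class __name__ instead of the class object, so the
# transformers module is never imported at load time and test-time
# sys.modules stubbing keeps working. The mapped value here is a
# hypothetical attribute path, not the real table's contents.
VLM_MODEL_CLS_NAME_TO_LAYERS = {
    "Qwen3_5ForConditionalGeneration": "language_model.layers",
}


def layers_attr_for(model):
    # Dispatch on type(model).__name__: no eager import needed.
    return VLM_MODEL_CLS_NAME_TO_LAYERS.get(type(model).__name__)
```

The trade-off is that the lookup matches any class with that __name__, which is acceptable here since the table is consulted only for models already routed through this code path.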
/ok to test d0bee86
/claude review
HuiyingLi approved these changes (Apr 21, 2026)
Summary
Enables tensor + pipeline parallelism for Qwen3.5-27B VLM end-to-end, and adds a new PipelineConfig.reduce_grad_per_microbatch knob that keeps FSDP gradients sharded across microbatches (saves ~27 GB per rank for a 13B-trainable-param stage).

Changes
Qwen3.5 VLM TP plan (optimized_tp_plans.py, parallelizer.py)
- Register Qwen3_5ForConditionalGeneration in PARALLELIZE_FUNCTIONS; the plan delegates to get_hf_tp_shard_plan, which reads transformers' base_model_tp_plan from Qwen3_5TextConfig and prefixes it with model.language_model.
- Update get_hf_tp_shard_plan's dispatch so inner-model nesting resolves correctly.
- Treat the replicated_with_grad_allreduce style as a no-op under FSDP+TP (norm weights are naturally replicated on the TP mesh; FSDP handles grad sync).
- linear_attn (GatedDeltaNet) layers remain un-TP-sharded: transformers itself doesn't provide a plan for them, since the stock chunk_gated_delta_rule / causal_conv1d_fn kernels aren't TP-aware.

reduce_grad_per_microbatch knob (config.py, autopipeline.py, functional.py, fsdp_mixin.py, kd.py)
- False preserves current behavior: FSDP no_sync across microbatches, reduce-scatter once at the end.
- With True, every microbatch backward calls set_requires_gradient_sync(True) so FSDP reduce-scatters per microbatch. Grads stay sharded; the full-stage no_sync accumulator (stage_trainable_params × 2 bytes) is eliminated.
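The knob's per-microbatch sync decision can be sketched as follows. This is a hedged sketch with duck-typed stand-ins, not the actual fsdp_mixin code; set_requires_gradient_sync is the real FSDP2 method named above, everything else is assumed:

```python
def run_microbatch_backwards(losses, fsdp_module, reduce_grad_per_microbatch):
    """Sketch of the knob's control flow (surrounding names assumed).

    With the knob on, every microbatch reduce-scatters its gradient
    shard; with it off, sync is deferred (FSDP no_sync) and the
    reduce-scatter fires only on the last microbatch.
    """
    n = len(losses)
    for i, loss in enumerate(losses):
        is_last = i == n - 1
        sync_now = reduce_grad_per_microbatch or is_last
        fsdp_module.set_requires_gradient_sync(sync_now)
        loss.backward()
```

The memory win follows directly: with per-microbatch sync, FSDP never materializes the full-stage unsharded gradient accumulator that no_sync would otherwise keep alive across microbatches.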
Recipe: examples/vlm_finetune/qwen3_5/qwen3_5_27b_tp4pp4.yaml, a 2-node (16 GPUs) tp=4, pp=4, dp=1 config with the new knob enabled.

Validation (8 GPUs, pp=2, tp=1, dp=4, lbs=4)
- Directly measured: the full-grad accumulator dropped from 26.9 GB (425 full-size grad tensors) to 6.7 GB (0 full-size grad tensors) after the microbatch-0 backward.
100-step convergence run (wandb)
- qwen35_27b_tp4pp4: the 16-GPU run completed 99+ steps, loss 1.56 → 1.07–1.30, peak ~40–52 GB per rank: https://wandb.ai/Nemo-automodel/huiyingl_workspace/runs/d89hnwou

Test plan
- … (patched_backward_maybe_with_nosync path)
- … (patched_backward_maybe_with_nosync)

🤖 Generated with Claude Code