
fix: resolve VLM CI failures for PP recipes and collate_fn #1799

Merged

akoumpa merged 17 commits into main from huiyingl/fix-vlm-ci-failures-2604 on Apr 17, 2026

Conversation

@HuiyingLi
Contributor

Summary

  • Add drop_last: true, collate_fn, and num_workers to validation dataloaders for PP VLM recipes (kimi25vl_medpix, qwen3_vl_moe_235b, qwen3_5_moe_medpix) to prevent PP microbatch mismatch
  • Make drop_overlong optional (default False) in default_collate_fn: previously, setting max_length always dropped overlong samples, silently shrinking the batch and crashing PP schedules. Samples are now truncated to max_length by default instead (see the sketch after this list)
  • Fix qwen3_5_moe_medpix model ID placeholder → Qwen/Qwen3.5-397B-A17B
  • Reduce qwen3_5_35b_neat_packing packed_sequence lengths 4096 → 2048 to fix OOM
  • Add ffmpeg to Docker image for torchcodec audio/video decoding (phi4_mm_cv17 fix)
  • Update max_length to 2048 across VLM PP recipes
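
A minimal sketch of the new drop_overlong semantics. The signature and field names here are illustrative, not the actual default_collate_fn API, and padding is omitted for brevity:

from typing import Dict, List

import torch

_SEQ_KEYS = ("input_ids", "labels", "attention_mask")  # illustrative field names


def collate(samples: List[Dict[str, torch.Tensor]],
            max_length: int,
            drop_overlong: bool = False) -> Dict[str, torch.Tensor]:
    if drop_overlong:
        # Opt-in (old) behavior: dropping overlong samples shrinks the batch,
        # which desyncs microbatch counts across pipeline-parallel stages.
        samples = [s for s in samples if s["input_ids"].shape[-1] <= max_length]
    else:
        # Default (new) behavior: truncate sequence fields to max_length so
        # the batch size stays fixed.
        samples = [{k: (v[..., :max_length] if k in _SEQ_KEYS else v)
                    for k, v in s.items()} for s in samples]
    return {k: torch.stack([s[k] for s in samples]) for k in samples[0]}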

Test plan

  • Unit tests for drop_overlong default behavior (truncation) and opt-in drop behavior pass
  • Existing _drop_overlong_samples unit tests still pass
  • Reproduced qwen3_vl_30b locally — runs clean with no PP mismatch
  • Reproduced qwen3_5_35b_neat_packing OOM locally — confirmed fix with reduced pack size
  • CI validation on 32-node PP recipes (kimi25vl, qwen3_vl_235b, qwen3_5_moe_397b)

🤖 Generated with Claude Code

@HuiyingLi
Contributor Author

/ok to test c33bb10

@copy-pr-bot

copy-pr-bot Bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

- Add drop_last, collate_fn, and num_workers to validation dataloaders for
  kimi25vl_medpix, qwen3_vl_moe_235b, and qwen3_5_moe_medpix PP recipes
- Fix qwen3_5_moe_medpix model ID placeholder to Qwen/Qwen3.5-397B-A17B
- Reduce qwen3_5_35b_neat_packing packed_sequence lengths from 4096 to 2048
  to fix OOM
- Make drop_overlong optional (default False) in default_collate_fn to prevent
  batch size reduction that crashes PP schedules
- Add ffmpeg to Docker image for torchcodec audio/video decoding
- Update max_length to 2048 across VLM PP recipes
- Add unit tests for drop_overlong behavior

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
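
The dataloader keys named in the commit map onto standard torch.utils.data.DataLoader arguments. A hedged sketch of what the validation-loader change amounts to, with toy stand-ins so it runs (the recipes wire in their own dataset and collate function):

from functools import partial

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the recipes' actual dataset and collate function.
val_dataset = TensorDataset(torch.arange(10))
def default_collate_fn(batch, max_length=2048):
    return batch

val_loader = DataLoader(
    val_dataset,
    batch_size=4,
    collate_fn=partial(default_collate_fn, max_length=2048),
    num_workers=2,       # decode media off the main process
    drop_last=True,      # no ragged final batch: every rank yields the same
                         # number of batches, keeping PP microbatch counts in sync
)
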
@HuiyingLi force-pushed the huiyingl/fix-vlm-ci-failures-2604 branch from c33bb10 to 9ed00d7 on April 13, 2026 07:32
@HuiyingLi
Copy link
Copy Markdown
Contributor Author

/ok to test 9ed00d7

@HuiyingLi
Copy link
Copy Markdown
Contributor Author

/claude review

HuiyingLi and others added 2 commits April 15, 2026 15:44
…ismatch

Replace super().forward(input_ids=None) with inline vision feature
processing. When input_ids=None, HF's get_placeholder_mask falls back
to bfloat16 embedding comparison which produces false positives,
causing "Image features and image tokens do not match" on the 235B.
Inline processing passes the original input_ids for reliable integer
comparison.

Also fix video token ID (151652 -> 151656) and add PP attention mask
size guard to prevent MoE routing errors on non-Stage-0 ranks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
The inline vision processing was passing image_grid_thw as positional
arg where get_rope_index expects mm_token_type_ids, causing shape
mismatch on 235B: "shape of the mask [2047] does not match [3]".
Use keyword arguments to pass mm_token_type_ids from kwargs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
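
For context on the false-positive mechanism: building the placeholder mask from the original input_ids is an exact integer comparison, while the HF fallback used when input_ids=None compares bfloat16 embeddings. A simplified sketch; the token IDs follow the Qwen special-token convention but should be treated as assumptions:

import torch

IMAGE_TOKEN_ID = 151655  # <|image_pad|> (assumed)
VIDEO_TOKEN_ID = 151656  # <|video_pad|>, the corrected ID from this commit


def placeholder_mask(input_ids: torch.Tensor) -> torch.Tensor:
    # Exact integer comparison: no precision issues, one True per media token.
    return (input_ids == IMAGE_TOKEN_ID) | (input_ids == VIDEO_TOKEN_ID)

# The fallback path instead reconstructs the mask from embeddings, roughly:
#   target = embed_tokens(torch.tensor([IMAGE_TOKEN_ID]))
#   mask = (inputs_embeds == target).all(dim=-1)
# In bfloat16, unrelated embeddings can compare equal elementwise, inflating
# the mask and triggering "Image features and image tokens do not match".
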
@HuiyingLi
Contributor Author

/claude review

Comment on lines +789 to +791
t, h, w = grid_thws[0].tolist()
merge_h, merge_w = 2, 2
expected_image_tokens = int((h // merge_h) * (w // merge_w))
Contributor


Bug: For multi-image samples, grid_thws has one row per image. This only computes expected_image_tokens from the first image, but actual_image_tokens on the next line counts all media tokens across all images. The comparison will always fail for multi-image inputs, causing their pixel data to be incorrectly dropped and replaced with pad tokens.

Suggested change
t, h, w = grid_thws[0].tolist()
merge_h, merge_w = 2, 2
expected_image_tokens = int((h // merge_h) * (w // merge_w))
merge_h, merge_w = 2, 2
expected_image_tokens = sum(
    int((h // merge_h) * (w // merge_w))
    for _, h, w in grid_thws.tolist()
)
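
As a worked example (assuming the (2, 2) merge kernel): for two images with grid_thw rows (1, 32, 32) and (1, 16, 16), the first-row-only code expects (32 // 2) * (32 // 2) = 256 tokens, but the sample actually carries 256 + (16 // 2) * (16 // 2) = 320 media tokens, so the check always fails and both images' pixel data is dropped; summing over all rows yields the correct 320.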

Address Claude review: for multi-image samples, sum expected tokens
across all images in grid_thws, not just the first one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/claude review

claude[bot] previously approved these changes Apr 16, 2026
Contributor

claude[bot] left a comment


LGTM

@HuiyingLi
Contributor Author

/ok to test 813a307

Fix ruff I001 (import sort), F401 (unused import), F841 (unused variable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/ok to test abac360

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/ok to test 338c65e

… guard

Move the module-level CUDA skip marker onto the classes that actually
instantiate the full model, then add two new CPU-runnable test classes
(TestQwen3VLMoeModelInlineVisionCpu, TestQwen3VLMoeForConditionalGenerationPpGuardCpu)
that invoke forward() as an unbound method with a MagicMock self.

This lets codecov/patch see coverage of the new inline vision branches
(single-modality, merged images+videos, text-only, dict attention mask)
and the PP attention-mask guard / chunked pixel_values dispatch without
requiring GPU, which was blocking PR merge at 58% patch coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
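
A minimal sketch of the unbound-forward pattern the commit describes, with a toy class standing in for the real model (the real tests target the Qwen3-VL MoE forward; everything here is illustrative):

from unittest.mock import MagicMock

import torch


class TinyModel:
    """Toy stand-in; mimics a PP attention-mask size guard."""

    def forward(self, input_ids, attention_mask=None):
        # Guard under test: drop a mask whose length disagrees with the ids.
        if attention_mask is not None and attention_mask.shape[-1] != input_ids.shape[-1]:
            attention_mask = None
        return self.inner(input_ids, attention_mask)


def test_pp_mask_guard_cpu():
    mock_self = MagicMock()
    ids = torch.randint(0, 100, (1, 8))
    bad_mask = torch.ones(1, 4, dtype=torch.long)  # deliberately wrong length
    # Unbound call: run the method body against a mock self, so no model
    # weights are ever materialized and the test stays CPU-only.
    TinyModel.forward(mock_self, ids, bad_mask)
    assert mock_self.inner.call_args.args[1] is None  # guard dropped the mask
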
@HuiyingLi
Contributor Author

/claude review

@HuiyingLi
Contributor Author

/ok to test 86c5535

if grid_thws is not None:
    all_grid_thws.append(grid_thws)
if "pixel_values" in sample_batch:
    merge_h, merge_w = 2, 2
Contributor


Bug-prone duplication: this hardcodes the same (2, 2) merge kernel that _expand_image_tokens uses as its default merge_kernel_size parameter. If that default ever changes, this check will silently compute the wrong expected_image_tokens and either drop image data it shouldn't or keep corrupted image data.

Consider extracting a module-level constant:

Suggested change
merge_h, merge_w = 2, 2
merge_h, merge_w = _expand_image_tokens.__defaults__[0] # merge_kernel_size default

Or simpler — define a _DEFAULT_MERGE_KERNEL = (2, 2) constant at module level and reference it in both places.
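
A sketch of the constant extraction being suggested; the function names come from the review, while the signatures and elided bodies are guesses:

# Single source of truth for the vision-tower merge kernel.
_DEFAULT_MERGE_KERNEL = (2, 2)


def _expand_image_tokens(grid_thw, merge_kernel_size=_DEFAULT_MERGE_KERNEL):
    ...  # token expansion elided


def kimi_k25_vl_collate_fn(sample_batch):
    if "pixel_values" in sample_batch:
        # Same constant as _expand_image_tokens, so the truncation check
        # can never silently drift from the expansion logic.
        merge_h, merge_w = _DEFAULT_MERGE_KERNEL
    ...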

Contributor

claude[bot] left a comment


Good changes overall — the video token ID fix (151652 → 151656) and the PP mask guard are solid, and the test coverage for the new drop_overlong behavior is thorough. One minor fragility flag posted inline.

Replace the hardcoded (2, 2) merge kernel in kimi_k25_vl_collate_fn's
post-expansion truncation check with a module-level _DEFAULT_MERGE_KERNEL
constant, and reference the same constant from _expand_image_tokens' default.
This ensures both computations stay in sync if the vision-tower merge kernel
ever changes.

Addresses Claude review feedback on commit 86c5535.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/ok to test 19ad3d8

Resolves GitHub-reported merge-conflict state after main advanced by
~37 commits. Auto-merge succeeded with no textual conflicts; the only
overlapping file (tests/unit_tests/models/qwen3_vl_moe/test_qwen3_vl_moe_model.py)
merged cleanly and all 11 new CPU tests still pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/ok to test 6e4af76357cf2ff70b1b68022bc09b2b4de9cb8f

@copy-pr-bot

copy-pr-bot Bot commented Apr 16, 2026

/ok to test 6e4af76357cf2ff70b1b68022bc09b2b4de9cb8f

@HuiyingLi, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@HuiyingLi
Contributor Author

/ok to test 6e4af76


Labels

r0.4.0: Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.
