
fix: resolve VLM CI failures for PP recipes and collate_fn #1799

Merged

akoumpa merged 17 commits into main from huiyingl/fix-vlm-ci-failures-2604 on Apr 17, 2026

Conversation

@HuiyingLi
Contributor

Summary

  • Add drop_last: true, collate_fn, and num_workers to validation dataloaders for PP VLM recipes (kimi25vl_medpix, qwen3_vl_moe_235b, qwen3_5_moe_medpix) to prevent PP microbatch mismatch
  • Make drop_overlong optional (default False) in default_collate_fn: previously, setting max_length always dropped overlong samples, silently shrinking the batch and crashing PP schedules. Samples are now truncated to max_length by default instead (see the sketch after this list)
  • Fix qwen3_5_moe_medpix model ID placeholder → Qwen/Qwen3.5-397B-A17B
  • Reduce qwen3_5_35b_neat_packing packed_sequence lengths 4096 → 2048 to fix OOM
  • Add ffmpeg to Docker image for torchcodec audio/video decoding (phi4_mm_cv17 fix)
  • Update max_length to 2048 across VLM PP recipes
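
A minimal sketch of the new drop_overlong semantics. The signature and field names here are illustrative, not the actual default_collate_fn API, and padding is omitted for brevity:

from typing import Dict, List

import torch

_SEQ_KEYS = ("input_ids", "labels", "attention_mask")  # illustrative field names


def collate(samples: List[Dict[str, torch.Tensor]],
            max_length: int,
            drop_overlong: bool = False) -> Dict[str, torch.Tensor]:
    if drop_overlong:
        # Opt-in (old) behavior: dropping overlong samples shrinks the batch,
        # which desyncs microbatch counts across pipeline-parallel stages.
        samples = [s for s in samples if s["input_ids"].shape[-1] <= max_length]
    else:
        # Default (new) behavior: truncate sequence fields to max_length so
        # the batch size stays fixed.
        samples = [{k: (v[..., :max_length] if k in _SEQ_KEYS else v)
                    for k, v in s.items()} for s in samples]
    return {k: torch.stack([s[k] for s in samples]) for k in samples[0]}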

Test plan

  • Unit tests for drop_overlong default behavior (truncation) and opt-in drop behavior pass
  • Existing _drop_overlong_samples unit tests still pass
  • Reproduced qwen3_vl_30b locally — runs clean with no PP mismatch
  • Reproduced qwen3_5_35b_neat_packing OOM locally — confirmed fix with reduced pack size
  • CI validation on 32-node PP recipes (kimi25vl, qwen3_vl_235b, qwen3_5_moe_397b)

🤖 Generated with Claude Code

@HuiyingLi
Contributor Author

/ok to test c33bb10

@copy-pr-bot

copy-pr-bot Bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

- Add drop_last, collate_fn, and num_workers to validation dataloaders for
  kimi25vl_medpix, qwen3_vl_moe_235b, and qwen3_5_moe_medpix PP recipes
- Fix qwen3_5_moe_medpix model ID placeholder to Qwen/Qwen3.5-397B-A17B
- Reduce qwen3_5_35b_neat_packing packed_sequence lengths from 4096 to 2048
  to fix OOM
- Make drop_overlong optional (default False) in default_collate_fn to prevent
  batch size reduction that crashes PP schedules
- Add ffmpeg to Docker image for torchcodec audio/video decoding
- Update max_length to 2048 across VLM PP recipes
- Add unit tests for drop_overlong behavior

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
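
The dataloader keys named in the commit map onto standard torch.utils.data.DataLoader arguments. A hedged sketch of what the validation-loader change amounts to, with toy stand-ins so it runs (the recipes wire in their own dataset and collate function):

from functools import partial

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the recipes' actual dataset and collate function.
val_dataset = TensorDataset(torch.arange(10))
def default_collate_fn(batch, max_length=2048):
    return batch

val_loader = DataLoader(
    val_dataset,
    batch_size=4,
    collate_fn=partial(default_collate_fn, max_length=2048),
    num_workers=2,       # decode media off the main process
    drop_last=True,      # no ragged final batch: every rank yields the same
                         # number of batches, keeping PP microbatch counts in sync
)
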
@HuiyingLi force-pushed the huiyingl/fix-vlm-ci-failures-2604 branch from c33bb10 to 9ed00d7 on April 13, 2026 07:32
@HuiyingLi
Copy link
Copy Markdown
Contributor Author

/ok to test 9ed00d7

@HuiyingLi
Copy link
Copy Markdown
Contributor Author

/claude review

HuiyingLi and others added 2 commits April 15, 2026 15:44
…ismatch

Replace super().forward(input_ids=None) with inline vision feature
processing. When input_ids=None, HF's get_placeholder_mask falls back
to bfloat16 embedding comparison which produces false positives,
causing "Image features and image tokens do not match" on the 235B.
Inline processing passes the original input_ids for reliable integer
comparison.

Also fix video token ID (151652 -> 151656) and add PP attention mask
size guard to prevent MoE routing errors on non-Stage-0 ranks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
The inline vision processing was passing image_grid_thw as positional
arg where get_rope_index expects mm_token_type_ids, causing shape
mismatch on 235B: "shape of the mask [2047] does not match [3]".
Use keyword arguments to pass mm_token_type_ids from kwargs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
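
For context on the false-positive mechanism: building the placeholder mask from the original input_ids is an exact integer comparison, while the HF fallback used when input_ids=None compares bfloat16 embeddings. A simplified sketch; the token IDs follow the Qwen special-token convention but should be treated as assumptions:

import torch

IMAGE_TOKEN_ID = 151655  # <|image_pad|> (assumed)
VIDEO_TOKEN_ID = 151656  # <|video_pad|>, the corrected ID from this commit


def placeholder_mask(input_ids: torch.Tensor) -> torch.Tensor:
    # Exact integer comparison: no precision issues, one True per media token.
    return (input_ids == IMAGE_TOKEN_ID) | (input_ids == VIDEO_TOKEN_ID)

# The fallback path instead reconstructs the mask from embeddings, roughly:
#   target = embed_tokens(torch.tensor([IMAGE_TOKEN_ID]))
#   mask = (inputs_embeds == target).all(dim=-1)
# In bfloat16, unrelated embeddings can compare equal elementwise, inflating
# the mask and triggering "Image features and image tokens do not match".
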
@HuiyingLi
Contributor Author

/claude review

Comment on lines +789 to +791
t, h, w = grid_thws[0].tolist()
merge_h, merge_w = 2, 2
expected_image_tokens = int((h // merge_h) * (w // merge_w))
Contributor


Bug: For multi-image samples, grid_thws has one row per image. This only computes expected_image_tokens from the first image, but actual_image_tokens on the next line counts all media tokens across all images. The comparison will always fail for multi-image inputs, causing their pixel data to be incorrectly dropped and replaced with pad tokens.

Suggested change
t, h, w = grid_thws[0].tolist()
merge_h, merge_w = 2, 2
expected_image_tokens = int((h // merge_h) * (w // merge_w))
merge_h, merge_w = 2, 2
expected_image_tokens = sum(
    int((h // merge_h) * (w // merge_w))
    for _, h, w in grid_thws.tolist()
)
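
As a worked example (assuming the (2, 2) merge kernel): for two images with grid_thw rows (1, 32, 32) and (1, 16, 16), the first-row-only code expects (32 // 2) * (32 // 2) = 256 tokens, but the sample actually carries 256 + (16 // 2) * (16 // 2) = 320 media tokens, so the check always fails and both images' pixel data is dropped; summing over all rows yields the correct 320.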

Address Claude review: for multi-image samples, sum expected tokens
across all images in grid_thws, not just the first one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/claude review

claude[bot] previously approved these changes Apr 16, 2026
Contributor

claude[bot] left a comment


LGTM

@HuiyingLi
Contributor Author

/ok to test 813a307

Fix ruff I001 (import sort), F401 (unused import), F841 (unused variable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/ok to test abac360

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/ok to test 338c65e

… guard

Move the module-level CUDA skip marker onto the classes that actually
instantiate the full model, then add two new CPU-runnable test classes
(TestQwen3VLMoeModelInlineVisionCpu, TestQwen3VLMoeForConditionalGenerationPpGuardCpu)
that invoke forward() as an unbound method with a MagicMock self.

This lets codecov/patch see coverage of the new inline vision branches
(single-modality, merged images+videos, text-only, dict attention mask)
and the PP attention-mask guard / chunked pixel_values dispatch without
requiring GPU, which was blocking PR merge at 58% patch coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
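
A minimal sketch of the unbound-forward pattern the commit describes, with a toy class standing in for the real model (the real tests target the Qwen3-VL MoE forward; everything here is illustrative):

from unittest.mock import MagicMock

import torch


class TinyModel:
    """Toy stand-in; mimics a PP attention-mask size guard."""

    def forward(self, input_ids, attention_mask=None):
        # Guard under test: drop a mask whose length disagrees with the ids.
        if attention_mask is not None and attention_mask.shape[-1] != input_ids.shape[-1]:
            attention_mask = None
        return self.inner(input_ids, attention_mask)


def test_pp_mask_guard_cpu():
    mock_self = MagicMock()
    ids = torch.randint(0, 100, (1, 8))
    bad_mask = torch.ones(1, 4, dtype=torch.long)  # deliberately wrong length
    # Unbound call: run the method body against a mock self, so no model
    # weights are ever materialized and the test stays CPU-only.
    TinyModel.forward(mock_self, ids, bad_mask)
    assert mock_self.inner.call_args.args[1] is None  # guard dropped the mask
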
@HuiyingLi
Contributor Author

/claude review

@HuiyingLi
Contributor Author

/ok to test 86c5535

if grid_thws is not None:
    all_grid_thws.append(grid_thws)
if "pixel_values" in sample_batch:
    merge_h, merge_w = 2, 2
Contributor


Bug-prone duplication: this hardcodes the same (2, 2) merge kernel that _expand_image_tokens uses as its default merge_kernel_size parameter. If that default ever changes, this check will silently compute the wrong expected_image_tokens and either drop image data it shouldn't or keep corrupted image data.

Consider extracting a module-level constant:

Suggested change
merge_h, merge_w = 2, 2
merge_h, merge_w = _expand_image_tokens.__defaults__[0] # merge_kernel_size default

Or simpler — define a _DEFAULT_MERGE_KERNEL = (2, 2) constant at module level and reference it in both places.
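
A sketch of the constant extraction being suggested; the function names come from the review, while the signatures and elided bodies are guesses:

# Single source of truth for the vision-tower merge kernel.
_DEFAULT_MERGE_KERNEL = (2, 2)


def _expand_image_tokens(grid_thw, merge_kernel_size=_DEFAULT_MERGE_KERNEL):
    ...  # token expansion elided


def kimi_k25_vl_collate_fn(sample_batch):
    if "pixel_values" in sample_batch:
        # Same constant as _expand_image_tokens, so the truncation check
        # can never silently drift from the expansion logic.
        merge_h, merge_w = _DEFAULT_MERGE_KERNEL
    ...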

Contributor

claude[bot] left a comment


Good changes overall — the video token ID fix (151652 → 151656) and the PP mask guard are solid, and the test coverage for the new drop_overlong behavior is thorough. One minor fragility flag posted inline.

Replace the hardcoded (2, 2) merge kernel in kimi_k25_vl_collate_fn's
post-expansion truncation check with a module-level _DEFAULT_MERGE_KERNEL
constant, and reference the same constant from _expand_image_tokens' default.
This ensures both computations stay in sync if the vision-tower merge kernel
ever changes.

Addresses Claude review feedback on commit 86c5535.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/ok to test 19ad3d8

Resolves GitHub-reported merge-conflict state after main advanced by
~37 commits. Auto-merge succeeded with no textual conflicts; the only
overlapping file (tests/unit_tests/models/qwen3_vl_moe/test_qwen3_vl_moe_model.py)
merged cleanly and all 11 new CPU tests still pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/ok to test 6e4af76357cf2ff70b1b68022bc09b2b4de9cb8f

@copy-pr-bot

copy-pr-bot Bot commented Apr 16, 2026

/ok to test 6e4af76357cf2ff70b1b68022bc09b2b4de9cb8f

@HuiyingLi, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@HuiyingLi
Contributor Author

/ok to test 6e4af76


Labels

r0.4.0: Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.
