fix: resolve VLM CI failures for PP recipes and collate_fn #1799
Conversation
/ok to test c33bb10
- Add drop_last, collate_fn, and num_workers to validation dataloaders for kimi25vl_medpix, qwen3_vl_moe_235b, and qwen3_5_moe_medpix PP recipes
- Fix qwen3_5_moe_medpix model ID placeholder to Qwen/Qwen3.5-397B-A17B
- Reduce qwen3_5_35b_neat_packing packed_sequence lengths from 4096 to 2048 to fix OOM
- Make drop_overlong optional (default False) in default_collate_fn to prevent batch size reduction that crashes PP schedules
- Add ffmpeg to Docker image for torchcodec audio/video decoding
- Update max_length to 2048 across VLM PP recipes
- Add unit tests for drop_overlong behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
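The drop_overlong change can be illustrated with a minimal sketch (function and field names here are hypothetical, not the recipe's actual API): dropping overlong samples shrinks the batch, which pipeline-parallel schedules cannot tolerate, so the default now truncates instead.

```python
def collate_with_max_length(samples, max_length, drop_overlong=False):
    """Hedged sketch: keep the batch size stable for pipeline-parallel
    schedules by truncating overlong samples rather than dropping them."""
    if drop_overlong:
        # Opt-in behavior: dropping can shrink the batch and break PP microbatching.
        return [s for s in samples if len(s["input_ids"]) <= max_length]
    # Default behavior: truncate to max_length, preserving the batch size.
    return [{**s, "input_ids": s["input_ids"][:max_length]} for s in samples]
```

With a batch of two samples and max_length=4, the default keeps both samples (truncating the long one), while drop_overlong=True returns a one-sample batch — exactly the size change that crashed the PP schedule.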
Force-pushed c33bb10 to 9ed00d7
/ok to test 9ed00d7

/claude review
…ismatch

Replace super().forward(input_ids=None) with inline vision feature processing. When input_ids=None, HF's get_placeholder_mask falls back to bfloat16 embedding comparison, which produces false positives, causing "Image features and image tokens do not match" on the 235B. Inline processing passes the original input_ids for reliable integer comparison. Also fix the video token ID (151652 -> 151656) and add a PP attention mask size guard to prevent MoE routing errors on non-Stage-0 ranks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
The inline vision processing was passing image_grid_thw as a positional arg where get_rope_index expects mm_token_type_ids, causing a shape mismatch on the 235B: "shape of the mask [2047] does not match [3]". Use keyword arguments to pass mm_token_type_ids from kwargs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
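The mis-binding class of bug can be shown with a toy stand-in (the signature below is hypothetical, not the real HF helper): when a function has several optional, similarly shaped tensor parameters, a positional call silently binds the argument to the wrong slot, while a keyword call cannot.

```python
def get_rope_index(input_ids, mm_token_type_ids=None, image_grid_thw=None):
    """Hypothetical stand-in for the helper's signature; returns what each
    optional parameter received so the mis-binding is visible."""
    return {"mm_token_type_ids": mm_token_type_ids,
            "image_grid_thw": image_grid_thw}

# Positional call: the grid tensor silently lands in the mm_token_type_ids slot.
bad = get_rope_index([1, 2, 3], [[1, 4, 4]])

# Keyword call: each argument binds to the parameter the caller intended.
good = get_rope_index([1, 2, 3], image_grid_thw=[[1, 4, 4]])
```

In the positional call, `bad["mm_token_type_ids"]` holds the grid data and `bad["image_grid_thw"]` is None — the same shape confusion the commit fixes by switching to keyword arguments.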
/claude review
```python
t, h, w = grid_thws[0].tolist()
merge_h, merge_w = 2, 2
expected_image_tokens = int((h // merge_h) * (w // merge_w))
```
Bug: For multi-image samples, grid_thws has one row per image. This only computes expected_image_tokens from the first image, but actual_image_tokens on the next line counts all media tokens across all images. The comparison will always fail for multi-image inputs, causing their pixel data to be incorrectly dropped and replaced with pad tokens.
```diff
-t, h, w = grid_thws[0].tolist()
-merge_h, merge_w = 2, 2
-expected_image_tokens = int((h // merge_h) * (w // merge_w))
+merge_h, merge_w = 2, 2
+expected_image_tokens = sum(
+    int((h // merge_h) * (w // merge_w))
+    for _, h, w in grid_thws.tolist()
+)
```
Address Claude review: for multi-image samples, sum expected tokens across all images in grid_thws, not just the first one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/claude review

/ok to test 813a307
Fix ruff I001 (import sort), F401 (unused import), F841 (unused variable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/ok to test abac360
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/ok to test 338c65e
… guard

Move the module-level CUDA skip marker onto the classes that actually instantiate the full model, then add two new CPU-runnable test classes (TestQwen3VLMoeModelInlineVisionCpu, TestQwen3VLMoeForConditionalGenerationPpGuardCpu) that invoke forward() as an unbound method with a MagicMock self. This lets codecov/patch see coverage of the new inline vision branches (single-modality, merged images+videos, text-only, dict attention mask) and the PP attention-mask guard / chunked pixel_values dispatch without requiring a GPU, which was blocking PR merge at 58% patch coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
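The unbound-forward testing pattern is worth a minimal sketch (the class and attribute names below are illustrative, not the real model's): calling `forward()` through the class with a `MagicMock` as `self` exercises branch logic without ever constructing the heavyweight model.

```python
from unittest.mock import MagicMock

class TinyModel:
    """Stand-in for a heavyweight model class; only forward() matters here."""
    def forward(self, attention_mask):
        # The branch under test reads configuration off self.
        if self.pp_guard_enabled and attention_mask is None:
            return "guarded"
        return "ok"

# Invoke forward() as an unbound method with a MagicMock self, so neither
# real weights nor a GPU are needed to cover both branches.
mock_self = MagicMock()
mock_self.pp_guard_enabled = True
result_guarded = TinyModel.forward(mock_self, None)
result_ok = TinyModel.forward(mock_self, object())
```

Because `MagicMock` auto-creates any attribute `forward()` touches, the test only has to pin down the attributes that control the branch being covered.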
/claude review

/ok to test 86c5535
```python
if grid_thws is not None:
    all_grid_thws.append(grid_thws)
if "pixel_values" in sample_batch:
    merge_h, merge_w = 2, 2
```
Bug-prone duplication: this hardcodes the same (2, 2) merge kernel that _expand_image_tokens uses as its default merge_kernel_size parameter. If that default ever changes, this check will silently compute the wrong expected_image_tokens and either drop image data it shouldn't or keep corrupted image data.
Consider extracting a module-level constant:
```diff
-merge_h, merge_w = 2, 2
+merge_h, merge_w = _expand_image_tokens.__defaults__[0]  # merge_kernel_size default
```
Or simpler — define a _DEFAULT_MERGE_KERNEL = (2, 2) constant at module level and reference it in both places.
Replace the hardcoded (2, 2) merge kernel in kimi_k25_vl_collate_fn's post-expansion truncation check with a module-level _DEFAULT_MERGE_KERNEL constant, and reference the same constant from _expand_image_tokens' default. This ensures both computations stay in sync if the vision-tower merge kernel ever changes. Addresses Claude review feedback on commit 86c5535.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/ok to test 19ad3d8
Resolves GitHub-reported merge-conflict state after main advanced by ~37 commits. Auto-merge succeeded with no textual conflicts; the only overlapping file (tests/unit_tests/models/qwen3_vl_moe/test_qwen3_vl_moe_model.py) merged cleanly and all 11 new CPU tests still pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/ok to test 6e4af76357cf2ff70b1b68022bc09b2b4de9cb8f

@HuiyingLi, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

/ok to test 6e4af76
Summary

- Add `drop_last: true`, `collate_fn`, and `num_workers` to validation dataloaders for PP VLM recipes (kimi25vl_medpix, qwen3_vl_moe_235b, qwen3_5_moe_medpix) to prevent PP microbatch mismatch
- Make `drop_overlong` optional (default `False`) in `default_collate_fn`: previously `max_length` always dropped overlong samples, silently shrinking batch size and crashing PP schedules. Now truncates by default instead
- Fix qwen3_5_moe_medpix model ID placeholder to `Qwen/Qwen3.5-397B-A17B`
- Add `ffmpeg` to Docker image for torchcodec audio/video decoding (phi4_mm_cv17 fix)
- Update `max_length` to 2048 across VLM PP recipes

Test plan

- Unit tests for `drop_overlong` default behavior (truncation) and opt-in drop behavior pass
- Existing `_drop_overlong_samples` unit tests still pass

🤖 Generated with Claude Code