
Remove unnecessary expand_as in get_placeholder_mask across VLMs#44907

Open
syncdoth wants to merge 2 commits into huggingface:main from syncdoth:remove-expand-as-placeholder-mask

Conversation

@syncdoth

Fixes #44906

Summary

  • Remove `.expand_as(inputs_embeds)` from placeholder mask creation in `get_placeholder_mask` and from equivalent inline patterns across all VLM models. `masked_scatter` natively broadcasts a `(B, S, 1)` mask against a `(B, S, H)` tensor, making the expansion unnecessary.
  • Replace the `inputs_embeds[special_image_mask].numel() == image_features.numel()` validation with the equivalent arithmetic `n_tokens * inputs_embeds.shape[-1] == image_features.numel()`, which avoids data-dependent boolean indexing and is more `torch.compile`-friendly.
  • 71 files changed across llava, qwen2_vl, paligemma, gemma3n, chameleon, video_llava, idefics2/3, instructblip, blip_2, and many more.
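The broadcasting claim is easy to verify with a standalone sketch (the shapes and tensors below are illustrative, not taken from any model file):

```python
import torch

B, S, H = 2, 5, 4
inputs_embeds = torch.zeros(B, S, H)

# Illustrative placeholder positions: True where an image token sits.
special_mask = torch.tensor(
    [[1, 0, 1, 0, 0],
     [0, 1, 0, 0, 1]], dtype=torch.bool
).unsqueeze(-1)                        # (B, S, 1)

n_tokens = int(special_mask.sum())     # 4 placeholder tokens
image_features = torch.randn(n_tokens, H)

# masked_scatter broadcasts the (B, S, 1) mask to (B, S, H) internally,
# so both variants scatter the same feature rows into the same slots.
out_broadcast = inputs_embeds.masked_scatter(special_mask, image_features)
out_expanded = inputs_embeds.masked_scatter(
    special_mask.expand_as(inputs_embeds), image_features
)
assert torch.equal(out_broadcast, out_expanded)
```

The expanded variant materializes a full `B × S × H` boolean tensor before scattering; the broadcast variant keeps the mask at `B × S × 1` and yields a bit-identical result.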

How this was developed

The core fix was first implemented and verified on llava/modeling_llava.py, then expanded to all other models following the same pattern using Claude Code. Each file was reviewed to ensure the transformation was appropriate — files with genuinely different expand_as usage (e.g., pe_audio where the mask is later .reshape()-ed) were left unchanged.

Test plan

  • Correctness verified: masked_scatter with broadcast (B,S,1) mask produces identical results to expanded (B,S,H) mask
  • pytest tests/models/llava/test_modeling_llava.py -x -v -k "not slow" — 136 passed
  • pytest tests/models/qwen2_vl/test_modeling_qwen2_vl.py -x -v -k "not slow" — 137 passed
  • pytest tests/models/paligemma/test_modeling_paligemma.py -x -v -k "not slow" — 124 passed
  • ruff check — all checks passed
  • check_modular_conversion.py --fix_and_overwrite — all generated files consistent
  • No duplicate PRs found (gh pr list --search "expand_as placeholder mask" / "get_placeholder_mask")

This PR uses AI assistance (Claude Code). I have reviewed all changes and validated the behavior end-to-end.

The placeholder mask was being expanded from (B, S, 1) to (B, S, H) via
`.expand_as(inputs_embeds)` before being passed to `masked_scatter`. Since
`masked_scatter` natively supports broadcasting, this expansion materializes
a large boolean tensor unnecessarily.

Changes:
- Remove `.expand_as(inputs_embeds)` from mask creation, keeping masks as
  (B, S, 1) and relying on `masked_scatter`/`torch.where` broadcasting
- Replace `inputs_embeds[mask].numel() == features.numel()` validation with
  equivalent arithmetic `n_tokens * hidden_dim == features.numel()`, which
  avoids data-dependent boolean indexing and is more torch.compile-friendly
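The validation change can be sanity-checked the same way; the names and shapes below are illustrative stand-ins, not the actual model code:

```python
import torch

B, S, H = 2, 6, 8
inputs_embeds = torch.randn(B, S, H)

special_image_mask = torch.zeros(B, S, 1, dtype=torch.bool)
special_image_mask[0, :3] = True       # pretend 3 image tokens
image_features = torch.randn(3, H)

# Old check: boolean indexing yields a data-dependent output shape,
# which tends to force a graph break under torch.compile.
old_ok = (
    inputs_embeds[special_image_mask.expand_as(inputs_embeds)].numel()
    == image_features.numel()
)

# New check: pure shape arithmetic, no data-dependent indexing.
n_image_tokens = int(special_image_mask.sum())
new_ok = n_image_tokens * inputs_embeds.shape[-1] == image_features.numel()

assert old_ok and new_ok and old_ok == new_ok
```

Both checks count the same quantity (number of placeholder positions times hidden size), so they accept and reject exactly the same inputs.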
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, aya_vision, blip_2, chameleon, cohere2_vision, colqwen2, deepseek_vl, deepseek_vl_hybrid, emu3, ernie4_5_vl_moe, fast_vlm, florence2, fuyu, gemma3, gemma3n, glm46v

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@zucchini-nlp zucchini-nlp self-requested a review March 23, 2026 10:19
@Rocketknight1
Member

cc @zucchini-nlp it's an agent PR but looks high-quality to me, mostly because I'm a sucker for ways to avoid materializing broadcasts and tensor expansions 😅

@zucchini-nlp
Member

Yea, this should be mostly copy-paste. I'm coming to this later today

