Extract dynamic vision/audio tensors into standalone pure functions #45396

IlyasMoutawwakil wants to merge 63 commits into main from
Conversation
- Create top-level `modeling_vision_utils.py` with shared pure functions: `get_vision_cu_seqlens`, `get_rotary_pos_ids`, `get_rotary_pos_ids_interleaved`, `get_window_index`, `get_pos_embed_indices` (see the sketch below)
- Move audio precompute functions (`chunk_and_pad_features`, `get_audio_cu_seqlens`, `get_valid_indices`, `get_pool_indices`) into modular files directly
- Simplify `VisionRotaryEmbedding.forward` to accept a `pos_ids` tensor directly via a broadcast multiply, eliminating data-dependent table creation
- Make vision/audio encoder forwards accept optional precomputed tensors (`cu_seqlens`, `rotary_pos_ids`, `window_index`, `embed_indices`, etc.)
- Use explicit naming: `get_vision_cu_seqlens` / `get_audio_cu_seqlens`

Models: qwen2_vl, qwen2_5_vl, qwen3_vl, qwen3_5, qwen3_vl_moe, qwen3_5_moe, qwen2_5_omni, qwen3_omni_moe, glm4v, glm4v_moe, glm_image, glm_ocr, ernie4_5_vl_moe, video_llama_3, mlcd, paddleocr_vl

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
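For illustration, a minimal sketch of what two of these pieces can look like; the actual implementations in `modeling_vision_utils.py` may differ, and the exact shapes are assumptions based on the description above:

```python
import torch
import torch.nn.functional as F
from torch import nn


def get_vision_cu_seqlens(grid_thw: torch.Tensor) -> torch.Tensor:
    """Cumulative sequence lengths from a (num_items, 3) grid of (t, h, w)."""
    # one sequence of h*w patches per temporal frame of each image/video
    seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0])
    return F.pad(seqlens.cumsum(dim=0, dtype=torch.int32), (1, 0), value=0)


class VisionRotaryEmbedding(nn.Module):
    def __init__(self, dim: int, theta: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, pos_ids: torch.Tensor) -> torch.Tensor:
        # broadcast multiply over precomputed pos_ids: no data-dependent
        # frequency table sized by max(grid) is built inside forward
        return pos_ids[..., None].float() * self.inv_freq
```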
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull request overview
This PR refactors multimodal (vision/audio) models to share “pure” tensor-building utilities and to optionally accept precomputed tensors (e.g., cu_seqlens / rotary pos ids), reducing duplicated logic across many model implementations and processors.
Changes:
- Added `src/transformers/modeling_vision_utils.py` with standalone helpers (e.g., `get_vision_cu_seqlens`, `get_rotary_pos_ids`, `get_window_index`, `get_pos_embed_indices`) and updated multiple models/processors to use them.
- Updated multiple vision encoders to accept optional precomputed tensors (`cu_seqlens`, `rotary_pos_ids`, `window_index`, `embed_indices`, etc.) and simplified rotary embedding computation to take `pos_ids` directly.
- Refactored audio precompute logic into modular model files and added processor support for returning extra precomputed tensors via `return_extra_tensors`; a usage sketch follows below.
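A hypothetical usage sketch of the `return_extra_tensors` flow described above; the checkpoint name and the exact set of returned keys are assumptions, not the confirmed API:

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")  # example checkpoint
image = Image.open("example.jpg")

inputs = processor(
    text=["Describe this image."],
    images=[image],
    return_tensors="pt",
    return_extra_tensors=True,  # also emit precomputed cu_seqlens / rotary pos ids
)
# the extra tensors ride along with pixel_values, so model(**inputs) no longer
# has to rebuild data-dependent tensors inside forward
```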
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/transformers/utils/auto_docstring.py | Adds new documented processor/model kwargs for precomputed vision tensors. |
| src/transformers/models/video_llama_3/processing_video_llama_3.py | Allows optionally returning precomputed vision tensors from the processor. |
| src/transformers/models/video_llama_3/modular_video_llama_3.py | Switches vision rotary/cu_seqlens generation to shared helpers and adds optional precomputed inputs. |
| src/transformers/models/video_llama_3/modeling_video_llama_3.py | Same as modular: uses shared helpers and updates rotary embedding forward API. |
| src/transformers/models/qwen3_vl/processing_qwen3_vl.py | Adds optional return of precomputed cu_seqlens/rotary pos ids (interleaved variant). |
| src/transformers/models/qwen3_vl/modular_qwen3_vl.py | Moves pos-embed/rotary/cu_seqlens computations to shared helpers; adds optional precomputed inputs. |
| src/transformers/models/qwen3_vl/modeling_qwen3_vl.py | Same refactor as modular file (generated modeling). |
| src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py | Same vision refactor for MoE variant. |
| src/transformers/models/qwen3_omni_moe/modular_qwen3_omni_moe.py | Moves audio chunking/cu_seqlens/valid index logic into pure helpers + forward accepts optional precomputes. |
| src/transformers/models/qwen3_5/modular_qwen3_5.py | Refactors vision pos/rotary/cu_seqlens computations; adds optional precomputed inputs. |
| src/transformers/models/qwen3_5/modeling_qwen3_5.py | Same vision refactor for generated modeling file. |
| src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py | Same vision refactor for MoE variant. |
| src/transformers/models/qwen2_vl/processing_qwen2_vl.py | Adds optional return of precomputed cu_seqlens/rotary pos ids from processor. |
| src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | Uses shared get_rotary_pos_ids / get_vision_cu_seqlens and accepts optional precomputed tensors. |
| src/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py | Adds optional return of precomputed cu_seqlens/rotary pos ids from processor. |
| src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py | Refactors rotary/cu_seqlens/window indexing via shared helpers; adds optional precomputed inputs. |
| src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py | Same refactor for generated modeling file. |
| src/transformers/models/qwen2_5_omni/modular_qwen2_5_omni.py | Moves audio chunking/indices/cu_seqlens/pooling computations into pure helper functions and accepts optional precomputes. |
| src/transformers/models/paddleocr_vl/modular_paddleocr_vl.py | Updates PaddleOCR vision path to use shared rotary/cu_seqlens helpers and renames args (grid_thw). |
| src/transformers/models/paddleocr_vl/modeling_paddleocr_vl.py | Same as modular: shared helper usage and argument renames. |
| src/transformers/models/glm4v/processing_glm4v.py | Adds optional return of precomputed cu_seqlens/rotary pos ids from processor. |
| src/transformers/models/glm4v/modular_glm4v.py | Refactors vision rotary/cu_seqlens computations and video grid flattening logic. |
| src/transformers/models/glm4v/modeling_glm4v.py | Same as modular file, plus updates to rotary embedding forward API. |
| src/transformers/models/glm4v_moe/modeling_glm4v_moe.py | Same refactor for MoE variant. |
| src/transformers/models/glm46v/processing_glm46v.py | Adds optional return of precomputed cu_seqlens/rotary pos ids from processor. |
| src/transformers/models/glm46v/modeling_glm46v.py | Passes optional precomputed vision tensors through get_*_features and vision tower. |
| src/transformers/models/glm_ocr/modular_glm_ocr.py | Refactors vision rotary/cu_seqlens computation to shared helpers. |
| src/transformers/models/glm_ocr/modeling_glm_ocr.py | Same vision refactor for generated modeling file. |
| src/transformers/models/glm_image/modular_glm_image.py | Refactors rotary pos ids and cu_seqlens to shared helpers; adds optional precomputed inputs. |
| src/transformers/models/glm_image/modeling_glm_image.py | Same as modular file, plus updates to rotary embedding forward API. |
| src/transformers/models/esm/configuration_esm.py | Moves rope_theta doc section to align with parameter ordering/docs. |
| src/transformers/models/ernie4_5_vl_moe/modular_ernie4_5_vl_moe.py | Refactors vision rotary/cu_seqlens to shared helpers and accepts optional precomputes. |
| src/transformers/models/ernie4_5_vl_moe/modeling_ernie4_5_vl_moe.py | Same refactor for generated modeling file. |
| src/transformers/modeling_vision_utils.py | New shared pure functions for vision tensor precomputation. |
| docs/source/en/model_doc/nomic_bert.md | Updates NomicBERT paper link. |
Pull request overview
This PR refactors multimodal (vision/audio) models to allow passing precomputed, data-dependent tensors (e.g. cu_seqlens, rotary position IDs, window indices, position-embedding interpolation indices) and centralizes shared vision tensor construction into a new src/transformers/modeling_vision_utils.py.
Changes:
- Add `src/transformers/modeling_vision_utils.py` with shared pure helper functions for vision precomputations (`get_vision_cu_seqlens`, rotary pos IDs, window indices, pos-embed interpolation indices).
- Update many vision model/processor implementations to accept optional precomputed tensors and avoid rebuilding them inside `forward`.
- Move/inline audio precompute helpers into relevant modular/model files and update docstrings/autodoc arg definitions accordingly.
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/transformers/utils/auto_docstring.py | Adds autodoc entries for new optional precomputed image/video tensors. |
| src/transformers/modeling_vision_utils.py | New shared pure functions for vision tensor precomputation (cu_seqlens, rotary pos IDs, window indices, pos-embed indices/weights). |
| src/transformers/models/video_llama_3/processing_video_llama_3.py | Processor can optionally return extra precomputed vision tensors. |
| src/transformers/models/video_llama_3/modular_video_llama_3.py | Vision forward accepts optional precomputed tensors; uses shared vision utils. |
| src/transformers/models/video_llama_3/modeling_video_llama_3.py | Generated modeling updated to accept/use optional precomputed tensors. |
| src/transformers/models/qwen3_vl/processing_qwen3_vl.py | Processor can optionally return extra precomputed vision tensors (incl. interleaved rotary IDs). |
| src/transformers/models/qwen3_vl/modular_qwen3_vl.py | Vision path refactor to accept precomputed tensors; uses shared vision utils for pos-embed and rotary IDs. |
| src/transformers/models/qwen3_vl/modeling_qwen3_vl.py | Generated modeling updated similarly (precomputed tensors + shared utils). |
| src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py | Same precomputed-tensor refactor for MoE variant. |
| src/transformers/models/qwen3_omni_moe/modular_qwen3_omni_moe.py | Moves audio precompute helpers into the modular file; updates audio forward to accept precomputes. |
| src/transformers/models/qwen3_5/modular_qwen3_5.py | Vision refactor to accept optional precomputed tensors; uses shared vision utils. |
| src/transformers/models/qwen3_5/modeling_qwen3_5.py | Generated modeling updated similarly. |
| src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py | Same precomputed-tensor refactor for MoE variant. |
| src/transformers/models/qwen2_vl/processing_qwen2_vl.py | Processor can optionally return extra precomputed vision tensors. |
| src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | Vision forward accepts optional precomputed tensors; uses shared vision utils. |
| src/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py | Processor can optionally return extra precomputed vision tensors. |
| src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py | Vision forward refactor: optional precomputes + shared get_window_index/rotary IDs/cu_seqlens. |
| src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py | Generated modeling updated similarly. |
| src/transformers/models/qwen2_5_omni/modular_qwen2_5_omni.py | Moves audio precompute helpers into the modular file; updates audio forward to accept precomputes. |
| src/transformers/models/paddleocr_vl/modular_paddleocr_vl.py | Vision encoder refactor to accept grid_thw + optional precomputed rotary IDs / cu_seqlens. |
| src/transformers/models/paddleocr_vl/modeling_paddleocr_vl.py | Generated modeling updated similarly. |
| src/transformers/models/glm4v/processing_glm4v.py | Processor can optionally return extra precomputed vision tensors. |
| src/transformers/models/glm4v/modular_glm4v.py | Vision forward accepts optional precomputed tensors; uses shared vision utils; minor tensor construction refactors. |
| src/transformers/models/glm4v/modeling_glm4v.py | Generated modeling updated similarly. |
| src/transformers/models/glm4v_moe/modeling_glm4v_moe.py | Same refactor for MoE variant. |
| src/transformers/models/glm46v/processing_glm46v.py | Processor can optionally return extra precomputed vision tensors. |
| src/transformers/models/glm46v/modeling_glm46v.py | Updates get_{image,video}_features signatures to accept precomputed tensors. |
| src/transformers/models/glm_ocr/modular_glm_ocr.py | Vision forward accepts optional precomputed tensors; uses shared vision utils. |
| src/transformers/models/glm_ocr/modeling_glm_ocr.py | Generated modeling updated similarly. |
| src/transformers/models/glm_image/modular_glm_image.py | Vision forward accepts optional precomputed tensors; uses shared vision utils. |
| src/transformers/models/glm_image/modeling_glm_image.py | Generated modeling updated similarly. |
| src/transformers/models/esm/configuration_esm.py | Docstring reorders rope_theta description (docs-only change). |
| src/transformers/models/ernie4_5_vl_moe/modular_ernie4_5_vl_moe.py | Vision forward accepts optional precomputed tensors; uses shared vision utils. |
| src/transformers/models/ernie4_5_vl_moe/modeling_ernie4_5_vl_moe.py | Generated modeling updated similarly. |
| docs/source/en/model_doc/nomic_bert.md | Updates the paper link URL (docs-only change). |
    video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*):
        The temporal, height and width of feature shape of each video in LLM.
    video_cu_seqlens (`torch.IntTensor`, *optional*):
        Precomputed cumulative sequence lengths for videos (from `get_cu_seqlens`).
    image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, *optional*):
        The temporal, height and width of feature shape of each image in LLM.
    image_cu_seqlens (`torch.IntTensor`, *optional*):
        Precomputed cumulative sequence lengths for images (from `get_cu_seqlens`).
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
run-slow: ernie4_5_vl_moe, esm, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe

This comment contains models: ["models/ernie4_5_vl_moe", "models/esm", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/paddleocr_vl", "models/qwen2_5_omni", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_omni_moe", "models/qwen3_vl", "models/qwen3_vl_moe"]
zucchini-nlp left a comment:
just a few quick comments, randomly chose one model file to review
    self,
    hidden_states: torch.Tensor,
    grid_thw: torch.Tensor,
    cu_seqlens: torch.Tensor | None = None,
    window_index: torch.Tensor | None = None,
    cu_window_seqlens: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    **kwargs: Unpack[TransformersKwargs],
) -> tuple | BaseModelOutputWithPooling:
tbh cu_seqlens and position_ids are already in TransformersKwargs, no?

ah yes, true at this point. should i remove them from the forward and pop them from kwargs?

hmm, however the cu_seqlens in TransformersKwargs are these two:

    cu_seq_lens_q: torch.LongTensor | None
    cu_seq_lens_k: torch.LongTensor | None
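For context, those two fields live in the flash-attention kwargs TypedDict, which roughly (simplified from the current transformers source) looks like:

```python
from typing import TypedDict

import torch


class FlashAttentionKwargs(TypedDict, total=False):
    cu_seq_lens_q: torch.LongTensor | None  # cumulative lengths of query sequences
    cu_seq_lens_k: torch.LongTensor | None  # cumulative lengths of key sequences
    max_length_q: int | None                # longest query sequence in the batch
    max_length_k: int | None                # longest key sequence in the batch
```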
    video_cu_seqlens: torch.Tensor | None = None,
    video_position_ids: torch.Tensor | None = None,

and for utilities, imo we don't yet need these args because they aren't returned by the processor. Unless it is a requirement for export

if we remove them from here they can't be propagated through the model from forward. i can revert them, but that means the model will still not be compilable end-to-end, only its visual encoder will be.

Ah, so we are currently consuming it all in kwargs via XXModel and explicitly writing it out in vision-related models?
I think cu-seq-lens and positions are kinda okay to be consumed with the existing kwargs, because they are assumed to represent FA-related arguments. Then we also have "cu_window_seqlens" indices in some models, which is actually the same cu-len used in FA.
So imo we can consolidate these two, maybe similar to how the attention mask is built? For ex, the model expects a dict of cu_seq_len with keys for layer types (full attn or window attn).
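A rough illustration of that proposal; this is purely hypothetical, nothing in the PR implements it:

```python
import torch

# hypothetical: the two cumulative-length tensors a windowed vision encoder precomputes
cu_seqlens = torch.tensor([0, 64, 128], dtype=torch.int32)
cu_window_seqlens = torch.tensor([0, 16, 32, 48, 64, 80, 96, 112, 128], dtype=torch.int32)

# one consolidated kwarg keyed by layer attention type, mirroring how the
# attention mask mapping is built per layer type
cu_seq_lens_by_layer_type = {
    "full_attention": cu_seqlens,
    "sliding_attention": cu_window_seqlens,
}
```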
also cc @vasqu, wdyt?

Could this maybe be solved through a decorator that filters/maps the kwargs based on modality? I still don't like the args being super visible, because it should remain a power feature, but I also see that it is needed for proper propagation. Wdyt about the decorator solution then?

Actually I don't mind the TypedDict unpacking, as imo these fit perfectly in "typical FA kwargs". It also doesn't look appealing to me when we have 4+ new args. What do you have in mind re decorators?

Something along the lines of @fa_kwargs(modality="vision"), where we make an internal mapping that goes through the kwargs and maps them to the correct naming. This decorator wouldn't apply here but where we actually need it, so down the line in the vision attention for example. At least that's the rough idea.
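A rough sketch of that decorator idea; the name `fa_kwargs` and the mapping are hypothetical, taken from the comment above rather than any existing API:

```python
import functools

# hypothetical mapping from generic FA kwarg names to vision-module names
_VISION_KWARG_MAP = {"cu_seq_lens_q": "cu_seqlens"}


def fa_kwargs(modality: str):
    def decorator(forward):
        @functools.wraps(forward)
        def wrapper(self, *args, **kwargs):
            if modality == "vision":
                # remap generic FA kwargs onto the names this module expects
                for generic, specific in _VISION_KWARG_MAP.items():
                    if generic in kwargs and specific not in kwargs:
                        kwargs[specific] = kwargs.pop(generic)
            return forward(self, *args, **kwargs)
        return wrapper
    return decorator
```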
From my understanding, the naming within the vision model has to be without any prefixes. For ex, rn it accepts images and videos under the same arg name (pixel_values).
But yeah, we might need to prefix it in a general VLM forward in subsequent PRs, if we want to allow users to prepare FA kwargs and pass them down the line.
CI Results (Commit Info)

Model CI Report: ❌ 5 new failed tests from this PR 😭
86e0012 to a7c2277 (Compare)
run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, video_llama_3

This comment contains models: ["models/ernie4_5_vl_moe", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/paddleocr_vl", "models/qwen2_5_omni", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_omni_moe", "models/qwen3_vl", "models/qwen3_vl_moe", "models/video_llama_3"]
CI Results (Commit Info)

Model CI Report: ❌ 9 new failed tests from this PR 😭
run-slow: qwen3_omni_moe, qwen3_vl_moe, video_llama_3

This comment contains models: ["models/qwen3_omni_moe", "models/qwen3_vl_moe", "models/video_llama_3"]
CI Results (Commit Info)

Model CI Report: ❌ 3 new failed tests from this PR 😭
run-slow: qwen3_vl_moe, video_llama_3

[For maintainers] Suggested jobs to run (before merge) run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, video_llama_3
needed both claude and copilot's help on this one 😅 The idea is to make the VLMs and their visual/audio encoders compilable / exportable. Here's a demo of the full model forward being compilable with these precomputed tensors.
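A minimal sketch of what such a demo can look like; the checkpoint, the prompt, and the exact kwarg names are illustrative assumptions, not the confirmed demo code:

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # example checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype=torch.bfloat16)

inputs = processor(
    text=["Describe this image."],
    images=[Image.open("example.jpg")],
    return_tensors="pt",
    return_extra_tensors=True,  # precompute cu_seqlens / rotary pos ids on the CPU side
)

# with the data-dependent tensors built outside the graph, the full forward
# can compile without graph breaks from .item()/.tolist() style calls
compiled_forward = torch.compile(model.forward, fullgraph=True)
out = compiled_forward(**inputs)
```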
What does this PR do?
Fixes # (issue)
Code Agent Policy
The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.
PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.
This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read
CONTRIBUTING.md.

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.