Fix Qwen2.5-VL temporal RoPE scaling applied to still images by Kash6 · Pull Request #45330 · huggingface/transformers

Kash6 · 2026-04-08T23:51:52Z

get_rope_index unconditionally applies tokens_per_second temporal scaling to both images and videos. For still images (modality_type == 1), this shifts the temporal position origin to start_position * tokens_per_second instead of start_position, creating a mismatch with height/width dimensions.

Only apply temporal scaling (tokens_per_second * second_per_grid_ts) for video inputs (modality_type == 2). Still images use time_interval=1, keeping the temporal origin aligned with height and width at start_position.

Qwen3-VL inherits this fix via super().get_rope_index().
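
To make the origin shift concrete, here is a minimal pure-Python sketch of the behaviour described above. It is not the code in `get_rope_index`: the helper names `time_interval` and `vision_positions`, the `fixed` flag, and the formula `(start_position + t) * interval` are assumptions chosen to mirror the description, and the example values (`start_position=5`, `tokens_per_second=2`, a 1x2x2 grid) are illustrative only.

```python
IMAGE, VIDEO = 1, 2  # modality_type values named in the PR description


def time_interval(modality_type, tokens_per_second, second_per_grid_t, fixed=True):
    """Temporal step per grid frame (hypothetical helper).

    fixed=False mimics the old behaviour, where images were scaled too.
    """
    if fixed and modality_type != VIDEO:
        return 1  # still images: no temporal scaling
    return tokens_per_second * second_per_grid_t


def vision_positions(start_position, grid_t, grid_h, grid_w, interval):
    """Return (temporal, height, width) position lists for one vision block."""
    temporal, height, width = [], [], []
    for t in range(grid_t):
        for h in range(grid_h):
            for w in range(grid_w):
                # Assumed formula: scaling multiplies the whole temporal index,
                # so the origin moves to start_position * interval.
                temporal.append((start_position + t) * interval)
                height.append(start_position + h)
                width.append(start_position + w)
    return temporal, height, width


# A single still image (grid_t=1) placed after 5 text tokens.
old_interval = time_interval(IMAGE, 2, 1.0, fixed=False)
new_interval = time_interval(IMAGE, 2, 1.0, fixed=True)
old_t, old_h, old_w = vision_positions(5, 1, 2, 2, old_interval)
new_t, new_h, new_w = vision_positions(5, 1, 2, 2, new_interval)

print(old_t[0], old_h[0], old_w[0])  # 10.0 5 5 -> temporal origin is off
print(new_t[0], new_h[0], new_w[0])  # 5 5 5 -> all three axes aligned
```

In this sketch video inputs still receive `tokens_per_second * second_per_grid_t` as their interval, so only the image path changes.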

What does this PR do?

Fixes #45325

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

I confirm that this is not a pure code agent PR.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@zucchini-nlp @yonigozlan


zucchini-nlp

Great catch! Thanks for fixing

zucchini-nlp · 2026-04-09T09:47:53Z

I'll merge when CI is green. We're waiting internally for the team to skip the hub tests (too flaky).

github-actions · 2026-04-10T09:03:17Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen2_5_vl

HuggingFaceDocBuilderDev · 2026-04-10T09:11:58Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

get_rope_index unconditionally applies tokens_per_second temporal scaling to both images and videos. For still images (modality_type == 1), this shifts the temporal position origin to start_position * tokens_per_second instead of start_position, creating a mismatch with height/width dimensions.

Only apply temporal scaling (tokens_per_second * second_per_grid_ts) for video inputs (modality_type == 2). Still images use time_interval=1, keeping the temporal origin aligned with height and width at start_position.

Qwen3-VL inherits this fix via super().get_rope_index().

Fixes #45325

Co-authored-by: Raushan Turganbay <raushan@huggingface.co>


zucchini-nlp approved these changes Apr 9, 2026


Merge branch 'main' into fix/qwen25vl-image-temporal-rope-45325

99d9a0e

zucchini-nlp enabled auto-merge April 10, 2026 09:02

zucchini-nlp added this pull request to the merge queue Apr 10, 2026

Merged via the queue into huggingface:main with commit a9f5b3a Apr 10, 2026
21 checks passed

zucchini-nlp mentioned this pull request Apr 13, 2026

transformers==5.3.0, qwen2.5-vl video input vision_position_ids seems to be wrong #45381

Closed

4 tasks

zucchini-nlp added the "for patch" label (tag issues / labels that should be included in the next patch) Apr 13, 2026

Kash6 deleted the fix/qwen25vl-image-temporal-rope-45325 branch April 14, 2026 03:21

JustinTong0323 mentioned this pull request Apr 15, 2026

Upgrade transformers to 5.5.3 and refactor hf_transformers_utils into subpackage sgl-project/sglang#21569

Merged

2 tasks

evalstate mentioned this pull request Apr 28, 2026

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#41

Open

zucchini-nlp merged 2 commits into huggingface:main from Kash6:fix/qwen25vl-image-temporal-rope-45325


3 participants
