
Fix Qwen2.5-VL temporal RoPE scaling applied to still images#45330

Merged
zucchini-nlp merged 2 commits into huggingface:main from Kash6:fix/qwen25vl-image-temporal-rope-45325
Apr 10, 2026

Conversation

@Kash6
Contributor

@Kash6 Kash6 commented Apr 8, 2026

get_rope_index unconditionally applies tokens_per_second temporal scaling to both images and videos. For still images (modality_type == 1), this shifts the temporal position origin to start_position * tokens_per_second instead of start_position, creating a mismatch with the height and width dimensions.

Only apply temporal scaling (tokens_per_second * second_per_grid_ts) for video inputs (modality_type == 2). Still images use time_interval=1, keeping the temporal origin aligned with height and width at start_position.

Qwen3-VL inherits this fix via super().get_rope_index().
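The modality-conditional scaling described above can be sketched roughly as follows. This is a minimal illustration, not the actual `get_rope_index` implementation: the helper name and signature are hypothetical, while `modality_type`, `tokens_per_second`, and `second_per_grid_ts` follow the PR description.

```python
def temporal_position_ids(start_position, grid_t, modality_type,
                          tokens_per_second, second_per_grid_ts):
    """Illustrative sketch: temporal position ids for one vision block."""
    if modality_type == 2:
        # Video: scale temporal positions by elapsed wall-clock time.
        time_interval = tokens_per_second * second_per_grid_ts
    else:
        # Still image (modality_type == 1): no temporal extent, so keep the
        # origin at start_position, aligned with height/width dimensions.
        time_interval = 1
    return [int(start_position + t * time_interval) for t in range(grid_t)]

# A still image (grid_t == 1) keeps its temporal index at start_position
# rather than start_position * tokens_per_second:
print(temporal_position_ids(5, 1, 1, 25.0, 0.0))   # [5]
# A video advances by tokens_per_second * second_per_grid_ts per grid step:
print(temporal_position_ids(0, 3, 2, 2.0, 1.0))    # [0, 2, 4]
```

Under this sketch, image temporal positions stay on the same integer grid as the height and width positions, which is the invariant the fix restores.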

What does this PR do?

Fixes #45325

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp @yonigozlan

Member

@zucchini-nlp zucchini-nlp left a comment


Great catch! Thanks for fixing

@zucchini-nlp
Member

I'll merge when CI is green; we're waiting internally for the team to skip the hub tests (too flaky).

@zucchini-nlp zucchini-nlp enabled auto-merge April 10, 2026 09:02
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen2_5_vl

@zucchini-nlp zucchini-nlp added this pull request to the merge queue Apr 10, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Merged via the queue into huggingface:main with commit a9f5b3a Apr 10, 2026
21 checks passed
@zucchini-nlp zucchini-nlp added the for patch label Apr 13, 2026
ArthurZucker pushed a commit that referenced this pull request Apr 13, 2026
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
@Kash6 Kash6 deleted the fix/qwen25vl-image-temporal-rope-45325 branch April 14, 2026 03:21
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026
…face#45330)

Labels

for patch: issues / PRs that should be included in the next patch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Qwen2.5-VL get_rope_index scales still-image temporal position_ids by tokens_per_second in transformers 5.3.0

3 participants