Revert text_positions in Qwen25VL #40177
pengzhenghao wants to merge 2 commits into huggingface:main
Conversation
|
Note that the reach of this issue might not be limited to Qwen2.5VL. But I am not familiar with the other models, so the only thing I do here is start the investigation into this issue. @rahul-tuli let's see if this can get the performance back. |
|
You only modified the modular file, but you need to apply the changes to the modeling file as well, i.e. modeling_qwen2_5_vl.py. |
|
Thanks for investigating, it helps a lot! We are going to include #40161 in a patch for safety, but the final fixes (maybe including this one) will potentially have to wait for 4.56.x. Let's see whether the original author of the MMMU issue can chime in and give feedback |
|
@pengzhenghao Could you please add the issue numbers to the PR description? I guess that it should be OK to write something like this: |
zucchini-nlp left a comment
Thanks for digging into the issue. I think we should not be deleting the textual position ids here, which are still needed for generation with FA2. Just removing the text positions will not solve it, because the root of the issue is in
transformers/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py
Lines 872 to 898 in cf58d70
We need to allow packed sequences with FA2, which have been requested several times by the community, and instead define that
text_pos_ids = None if the position ids do not contain 4 position types in the first dimension
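A minimal sketch of that check, assuming position_ids has shape (num_position_types, batch, seq_len); the helper name split_position_ids is illustrative, not the actual transformers API:

```python
import torch

def split_position_ids(position_ids):
    """Illustrative helper (not the real modeling code): keep textual
    position ids only when the tensor carries 4 position types."""
    if position_ids is not None and position_ids.ndim == 3 and position_ids.shape[0] == 4:
        # Packed-sequence/FA2 path: row 0 holds the textual position ids,
        # rows 1-3 hold the 3D RoPE (temporal, height, width) positions.
        return position_ids[0], position_ids[1:]
    # Plain 3D RoPE input without a textual row: text_pos_ids stays None.
    return None, position_ids
```

This would keep FA2 generation (which needs the textual positions) working while letting non-packed callers pass plain 3D RoPE position ids.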
|
Can you also check if these tests are still passing on the models? transformers/tests/test_modeling_common.py Lines 4265 to 4284 in 2914cec |
You need to update the whole Qwen-VL family, including Qwen-Omni; the issue is relevant to all models using 3D RoPE
|
Hi! Are there any updates on this? I am trying to finetune Qwen2.5-VL and I am noticing significant variations in the loss when using the latest transformers version. I tried a few previous versions, and the latest one that seems to yield reasonable loss during each step is
Thanks, |
|
@pengzhenghao hey, do you still want to merge this PR? I can take it up, if you are busy |
|
I will assume you are busy and take over the issue tomorrow, because Qwen-VL models have high usage |
|
Thank you so much for following this issue! @zucchini-nlp I am indeed busy, and I am also not familiar with the other Qwen-VL models & the FA2 implementation. It would be great if you could take over!
I am a little confused here.
Seems like you are suggesting:
So, I think this naturally implies:
Do I understand correctly? |
|
@pengzhenghao yeah right, since we don't want to force all users to use packed sequences with FA2, we can just set text_pos_ids = None in that case. I will check it tomorrow, and we definitely need more tests in Qwen given its specific RoPE type
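For illustration, the two shapes this distinction would handle, reusing the split_position_ids sketch from the earlier review comment (shapes are assumptions drawn from the discussion, not from the actual implementation):

```python
import torch

# Regular 3D RoPE: (3, batch, seq) -> no textual row, text ids stay None.
pos_3d = torch.zeros(3, 1, 128, dtype=torch.long)
text_ids, rope_ids = split_position_ids(pos_3d)
assert text_ids is None and rope_ids.shape[0] == 3

# Packed sequences with FA2: (4, batch, seq) -> extra leading textual row
# that gets split off, leaving the 3 RoPE rows.
pos_4d = torch.zeros(4, 1, 128, dtype=torch.long)
text_ids, rope_ids = split_position_ids(pos_4d)
assert text_ids is not None and rope_ids.shape[0] == 3
```
|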
Force-pushed from cf58d70 to 9b13d94
|
[For maintainers] Suggested jobs to run (before merge): run-slow: qwen2_5_vl |
|
@pengzhenghao I located the bug, and it is not in using packed or text position ids. The bug was actually in how position ids are computed when generating, which was not the same as computing in
Thanks for your PR and for investigating it
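One way to smoke-test this class of bug (the checkpoint and setup below are illustrative assumptions, not taken from the PR): if generation-time position ids diverge from the prefill computation, greedily re-scoring the full sequence in a single forward pass will not reproduce the tokens that generate() picked.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id)
model.eval()

inputs = processor(text=["Describe the weather today."], return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    # Re-score prompt + generated tokens in one forward pass; the argmax at
    # each generated position should reproduce what generate() emitted.
    logits = model(input_ids=out).logits

prefix_len = inputs["input_ids"].shape[1]
redecoded = logits[:, prefix_len - 1 : -1].argmax(-1)
assert torch.equal(redecoded, out[:, prefix_len:])
```

Running in the default float32 keeps numerical noise from flipping the argmax, so a failed assert points at inconsistent position handling rather than precision.
|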

What does this PR do?
Removes the problematic text_position_ids preparation when generating with the Qwen2.5-VL model.
Fixes #40154 (The Qwen 2.5 VL text position ID issue)
Fixes #40136 (Qwen2.5VL performance drop)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.