Prepare and keep track of position ids in generate #43734

zucchini-nlp merged 26 commits into huggingface:main

Conversation
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43734&sha=929e2c

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
run-slow: colqwen2, ernie4_5_vl_moe, gemma3, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, gpt_neo, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_vl, qwen3_vl_moe, reformer, video_llama_3

This comment contains models: ["models/colqwen2", "models/ernie4_5_vl_moe", "models/gemma3", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/gpt_neo", "models/paddleocr_vl", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_vl", "models/qwen3_vl_moe", "models/reformer", "models/video_llama_3"]
```diff
- position_ids.masked_fill_(attention_mask == 0, 1)
+ position_ids.masked_fill_(attention_mask == 0, 0)
```
It doesn't make a difference which value we use, because the token is masked anyway. Using 0 makes more sense: otherwise, when the sequence has only one unmasked token, we get position ids with a max value of 1 instead of 0.
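A minimal PyTorch sketch of the point above (not code from the PR): the fill value never reaches attention, but it does leak into the max position id, which downstream code (e.g. the mrope deltas in the diff further below) relies on.

```python
import torch

# Left-padded sequence with a single unmasked token.
attention_mask = torch.tensor([[0, 0, 1]])

# New behavior: fill padded slots with 0.
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 0)
print(position_ids)  # tensor([[0, 0, 0]]) -> max position id is 0, as expected

# Old behavior: fill with 1, giving a spurious max position id of 1.
position_ids_old = attention_mask.long().cumsum(-1) - 1
position_ids_old.masked_fill_(attention_mask == 0, 1)
print(position_ids_old)  # tensor([[1, 1, 0]]) -> max position id is 1
```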
```diff
- model_input = model_input[:, -current_input_length:]
+ model_input = model_input[..., -current_input_length:]
  model_input = model_input.clone(memory_format=torch.contiguous_format)
```
3D positions support
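A small illustration with hypothetical shapes of why the ellipsis matters: position ids can now be 3D, `(3, batch, seq)`, and `[..., -n:]` slices the sequence axis for both ranks, while `[:, -n:]` would slice the batch axis of a 3D tensor.

```python
import torch

n = 1
ids_2d = torch.arange(6).view(2, 3)        # (batch=2, seq=3)
ids_3d = torch.arange(18).view(3, 2, 3)    # (3, batch=2, seq=3), mrope-style ids

print(ids_2d[..., -n:].shape)  # torch.Size([2, 1])
print(ids_3d[..., -n:].shape)  # torch.Size([3, 2, 1])
print(ids_3d[:, -n:].shape)    # torch.Size([3, 1, 3]) -- sliced the batch axis, not the sequence
```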
```diff
- position_ids, rope_deltas = self.vlm.model.get_rope_index(
-     input_ids=input_ids,
-     image_grid_thw=image_grid_thw,
-     video_grid_thw=None,
-     attention_mask=attention_mask,
- )
```
No idea why we called it here. Better to let `self.vlm` handle everything.
```diff
- else:
-     if attention_mask is not None:
-         position_ids = attention_mask.long().cumsum(-1) - 1
-         position_ids.masked_fill_(attention_mask == 0, 1)
-         position_ids = position_ids.unsqueeze(0).expand(3, -1, -1).to(attention_mask.device)
-         max_position_ids = position_ids.max(0, keepdim=False)[0].max(-1, keepdim=True)[0]
-         mrope_position_deltas = max_position_ids + 1 - attention_mask.shape[-1]
-     else:
-         position_ids = (
-             torch.arange(input_ids.shape[1], device=input_ids.device)
-             .view(1, 1, -1)
-             .expand(3, input_ids.shape[0], -1)
-         )
-         mrope_position_deltas = torch.zeros(
-             [input_ids.shape[0], 1],
-             device=input_ids.device,
-             dtype=input_ids.dtype,
-         )
  return position_ids, mrope_position_deltas
```
Same as it was for all `get_rope_index` implementations; I just deleted this part.
vasqu left a comment
Just left a few questions / nits. I have a feeling we can use modular a tad more re `compute_3d_position_ids`?
```diff
  image_outputs.pooler_output = image_embeds
  return image_outputs

+ def compute_3d_position_ids(
```
Could we not inherit from `Qwen2_5_VLModel`? Or is there something specific? Let's avoid rewriting where possible.
Ernie uses `mm_token_type_ids` but Qwen2.5-VL has `second_grid_ts`. We can do it if we hide the extra kwargs as `**kwargs`, which will basically look like the above comment.
Ah, yeah, that's a good point. On another note, do we want to change the other VLMs to use mm token type ids here? IIRC, it's much faster(?)
we do! That is my next PR after this one is merged. I have some stuff locally :)
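For reference, a hypothetical sketch of what the `**kwargs` idea from this thread could look like. The body is adapted from the text-only fallback in the deleted diff above, and the keyword names are just the ones mentioned in this thread, not the actual PR code.

```python
import torch

def compute_3d_position_ids(input_ids, attention_mask=None, **mm_kwargs):
    # Model-specific extras (Ernie's mm_token_type_ids, Qwen2.5-VL's
    # second_grid_ts) travel through **mm_kwargs, so the shared signature
    # can be inherited without rewriting the helper per model.
    mm_token_type_ids = mm_kwargs.get("mm_token_type_ids")  # unused in this text-only sketch

    # Text-only fallback: identical positions on all three rope axes.
    position_ids = (
        torch.arange(input_ids.shape[1], device=input_ids.device)
        .view(1, 1, -1)
        .expand(3, input_ids.shape[0], -1)
    )
    return position_ids

ids = torch.tensor([[5, 6, 7]])
print(compute_3d_position_ids(ids, mm_token_type_ids=torch.zeros_like(ids)).shape)
# torch.Size([3, 1, 3])
```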
run-slow might have been broken, so better to rerun after merging with main 👀
vasqu left a comment
Forgot to approve. Can you also check whether qwen3_5(_moe) have inherited as expected? They essentially come from qwen3 vl.
Okay, will run a few slow tests and merge
run-slow: colqwen2, ernie4_5_vl_moe, gemma3, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, gpt_neo, idefics, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_vl, qwen3_vl_moe, reformer

This comment contains models: ["models/colqwen2", "models/ernie4_5_vl_moe", "models/gemma3", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/gpt_neo", "models/idefics", "models/paddleocr_vl", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_vl", "models/qwen3_vl_moe", "models/reformer"]
CI Results: Commit Info
Model CI Report: ❌ 7 new failed tests from this PR 😭
Hey! Sorry I'm a bit late to the party! Instead of having a dedicated …
If we make each model compute its position ids in forward (which already happens now, in a not-so-correct way), we can't just build upon it by incrementing to the next position. Position ids aren't returned from the model the way the cache is, so we would have to start returning them from forward to be able to re-use them. Otherwise we just have to let each model re-compute positions from scratch every time, which is basically what happens now. Actually, I thought at first to make each …
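To make the trade-off concrete, a minimal sketch of the "prepare once, then increment" approach this PR takes (illustrative only, not the actual `generate()` internals): prefill positions come from the attention mask, and each decode step appends last + 1.

```python
import torch

attention_mask = torch.tensor([[0, 1, 1, 1]])  # left-padded prefill

# Prepare positions once for the prefill.
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 0)  # [[0, 0, 1, 2]]

# Decoding: the next position is simply the last one plus one.
for _ in range(3):  # three decode steps
    next_position = position_ids[:, -1:] + 1
    position_ids = torch.cat([position_ids, next_position], dim=-1)

print(position_ids)  # tensor([[0, 0, 1, 2, 3, 4, 5]])
```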
Ah, that was the issue with the padding side: it's supposed to be "left". We don't recompute positions every time, so it doesn't work well with right padding.
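And a matching sketch of why right padding breaks that scheme (illustrative, not the actual code): the last column of the prepared positions belongs to a pad token, so "last + 1" no longer points at the true next position.

```python
import torch

attention_mask = torch.tensor([[1, 1, 1, 0]])  # right-padded prefill

position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 0)  # [[0, 1, 2, 0]]

next_position = position_ids[:, -1:] + 1
print(next_position)  # tensor([[1]]) -- the real next position is 3
```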
Updated on the hub! So ernie should behave somewhat normally now; no idea why the default padding changed, tbh.
run-slow: colqwen2, ernie4_5_vl_moe, qwen2_5_vl

This comment contains models: ["models/colqwen2", "models/ernie4_5_vl_moe", "models/qwen2_5_vl"]
CI Results: Commit Info
Model CI Report: ❌ 5 new failed tests from this PR 😭
run-slow: ernie4_5_vl_moe, qwen2_5_vl

This comment contains models: ["models/ernie4_5_vl_moe", "models/qwen2_5_vl"]
Arhhh, you're right 🥲 Nevermind then!
[For maintainers] Suggested jobs to run (before merge)

run-slow: colqwen2, ernie4_5_vl_moe, gemma3, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, gpt_neo, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe
What does this PR do?
As per the title; lays the groundwork for unifying 3D position ids in qwen-style VLMs.
PR adds a single entrypoint to prepare position ids in `GenerationMixin`, which models can override if needed (qwen-vl for example). This allows users to prepare their own position ids and pass them to `generate()`. In decoding stages, the position ids are simply incremented by one to build the next positions.

Along with it, the PR starts a light unification of 3D positions by splitting them into their own utility fn. Now we have only two or three models with their own `compute_3d_positions`, and all other models copy from there. In the next PR, I will split `get_rope_index` into smaller components, allowing us to copy similarities easily. I am working on it locally, but it's blocked by the current branch.

Review starting from `transformers/generation` and the models from which we copy (qwen2-vl and ernie4_5_vl).

Fixes #29149
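For readers skimming the diff, a hypothetical sketch of the shape of such an entrypoint (method names are illustrative, not the actual `GenerationMixin` API): prefill positions are prepared once, and decoding just increments the last prepared position.

```python
import torch

class GenerationMixinSketch:
    def prepare_position_ids(self, input_ids, attention_mask=None):
        # Prefill: build positions from the attention mask (left padding),
        # or a plain arange when no mask is given. VLMs with 3D rope
        # (qwen-vl style) would override this hook.
        if attention_mask is not None:
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 0)
        else:
            batch, seq_len = input_ids.shape
            position_ids = torch.arange(seq_len, device=input_ids.device).expand(batch, -1)
        return position_ids

    def next_position_ids(self, position_ids):
        # Decoding: the next position is simply the last one plus one.
        return position_ids[..., -1:] + 1
```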