Fix Qwen2.5VL temporal grid positions #45400
Conversation
```python
grid_thw[2].item() // spatial_merge_size,
)
```

```python
image_seq_length = llm_grid_h * llm_grid_w * llm_grid_t
```
Fix the repo reference from qwen2-vl, here and after this.
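For readers following along, here is a minimal sketch of how the merged grid determines the placeholder sequence length in the diff above (the grid and `spatial_merge_size` values are made up; the real ones come from the image processor):

```python
# Hypothetical values; real ones come from the image processor / config.
grid_t, grid_h, grid_w = 1, 6, 8   # patch grid: (temporal, height, width)
spatial_merge_size = 2

# Patches are merged spatially before reaching the LLM, so height and width
# shrink by `spatial_merge_size`; the temporal axis is left untouched.
llm_grid_t = grid_t
llm_grid_h = grid_h // spatial_merge_size
llm_grid_w = grid_w // spatial_merge_size

image_seq_length = llm_grid_h * llm_grid_w * llm_grid_t
print(image_seq_length)  # 3 * 4 * 1 = 12
```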
run-slow: qwen2_vl, qwen2_5_vl, glm4v, qwen3_vl, ernie4_5_vl_moe
This comment contains models: ["models/ernie4_5_vl_moe", "models/glm4v", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_vl"]
```python
def get_rope_index(
    self,
    input_ids: torch.LongTensor,
    mm_token_type_ids: torch.IntTensor,
```
Same thing, just a bit shorter and easier to follow. Copied from qwen3-vl.
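As a rough sketch of why this reads shorter: `mm_token_type_ids` marks which positions belong to vision tokens, so finding the multimodal spans becomes a simple filter rather than a scan for special token ids (the values below are made up for illustration):

```python
# Made-up ids: 0 = text token, 1 = image/video token (illustrative only).
mm_token_type_ids = [0, 0, 1, 1, 1, 1, 0, 0]

# Vision spans fall out of a simple filter on the type ids.
vision_positions = [i for i, t in enumerate(mm_token_type_ids) if t == 1]
print(vision_positions)  # [2, 3, 4, 5]
```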
vasqu left a comment:
IMO it looks good; I just have a few remarks to get some details in. Also, let's really check all models please, like glm image as well; no need to be sparse about running tests here.

One concern: we change one integration test. I just want to make sure this is a proper fix, and that the test wasn't changed merely to align with the fix.
```python
# Repeat the positions per each grid and per video frame. Add start position for temporal grid
# Important to add start positions after applying `time_interval`, order matters
```
Let's move this comment above; I was thinking directly on the first `arange` for the temporal dimension.
Though it doesn't really apply to `arange`; we are repeating and adding the start position afterwards?
I think I was caught up on "Important to add start positions after applying `time_interval`, order matters", but no worries.
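To illustrate why the order matters here, a toy sketch in plain Python (numbers are illustrative; the real code operates on torch tensors with `arange` and `repeat_interleave`):

```python
# Toy numbers, purely illustrative: 3 temporal grid steps, 4 spatial
# positions per frame, a time interval of 2, and a start offset of 10.
llm_grid_t, spatial = 3, 4
time_interval, start = 2, 10

# Repeat each temporal position once per spatial position in the frame.
t_index = [t for t in range(llm_grid_t) for _ in range(spatial)]

# Correct order: scale by `time_interval` first, THEN add the start position.
correct = [t * time_interval + start for t in t_index]

# Wrong order: adding the start first scales the start offset too.
wrong = [(t + start) * time_interval for t in t_index]

print(correct)  # [10, 10, 10, 10, 12, 12, 12, 12, 14, 14, 14, 14]
print(wrong)    # [20, 20, 20, 20, 22, 22, 22, 22, 24, 24, 24, 24]
```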
Not the same as the other ones?
Nope, it only supports image generation from text or another image. It's used only in diffusers as part of their pipeline.
```python
{
    (None, None): [
        'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows an indoor tennis court with a person standing on the service line, preparing to serve. The individual is wearing athletic attire, including a white',
        'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows two individuals playing tennis on an indoor court. The player in the foreground, dressed in a white shirt and black shorts, is preparing to',
```
So this is intentional; was it changed before and we just went along?
Yeah, I changed the second_temporal_grid to a non-zero value by passing the video in the chat template, so it actually tests something now. Previously it wasn't testing positions with video frames at all 😢
(Old processors had no way to decode video, which I added much later, so it's expected that this was missed.)
This test was passing with or without this PR, so its output was kinda useless...
[For maintainers] Suggested jobs to run (before merge): run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe
I chose models with specific position ids, and from different families. The other ones are all just copies, or don't support videos. This fix has no effect when
Re: this applies to having better multimodal tests. AFAIK very few of the VLMs that support video have proper video testing with dummy weights. Nor do we have tests with many modalities in a single input sample. Unfortunately I never had time to push this further.
I will merge this one then, if no objections :)
Nope, should be fine. I think we have to rethink video modalities in general, but that should ideally be part of the VLM tester.
We merged a small MoE update yesterday; maybe merge with main and recheck that nothing got broken.
I agree about video: we definitely need fast tests with dummy models, and we should push contributors more to add proper integration tests for each modality a model supports. It also applies to audio; I remember some Gemma models don't have audio integration tests. @tarekziade maybe for the dummy tester part.

Ah, it got merged before I came to reply. I don't think the MoE change affects positions, so it should be fine anyway.
* interesting
* oops
* test uses better temporal positions now
* fix repo
* re-unite glm and qwen3-vl
* add some fast tests
* dummy import
* missed another dummy import
* move comments around and add more comments
What does this PR do?
Fixes #45381, but it is weird: I remember checking position ids by value in qwen2.5 as well, to verify that time-interval works 🤔
update: I know why. The integration test we have uses `second_grid_its = 0.083`, which rounds to `0.0`, so the multiplication is zero no matter what value we get for vision positions. Great!

For most models we didn't see any diff because each frame is separated by a timestamp and is processed separately. Only the first two Qwen releases process all frames in bulk at once.
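A toy reproduction of why that test value was vacuous (purely illustrative; the real code scales torch position indices and truncates them to integer positions):

```python
# Illustrative only: with the old test's tiny per-grid time value, every
# scaled temporal position truncates to 0, so the test exercised nothing.
second_grid_ts = 0.083
scaled = [int(t * second_grid_ts) for t in range(3)]  # temporal steps 0, 1, 2
print(scaled)  # [0, 0, 0]
```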
In any case, it's worth adding a fast test with expected positions; will do so.