Fix VLM generation issues #32836
Conversation
```python
# Merge text and images in prefill stage
if past_key_values is None:
    if input_ids.shape[1] != 1:
```
```diff
-    if input_ids.shape[1] != 1:
+    if inputs_embeds.shape[1] != 1:
```
Should we not check inputs_embeds instead?
Nope, we don't expect a user to pass inputs_embeds and pixel values at the same time, because for merging we need to slice the image token positions out of input_ids. Yes, we have a ValueError for the case where both are passed, but that is only on latest main.
UPDATE: Okay, this was causing test failures, so I fixed it to check inputs_embeds to make CI happy. This should be enough for the patch; we'll get the new VLMs in v4.45, and I'll try to add mixin tests before then.
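(For reference, the guard mentioned above looks roughly like this on main; a paraphrased sketch, not the exact library code or error message:)

```python
def _check_inputs(pixel_values=None, inputs_embeds=None):
    # Paraphrased sketch of the guard on latest main (wording is assumed,
    # not copied from the library): merging needs the image token positions
    # from input_ids, so embeddings and pixels can't be combined.
    if pixel_values is not None and inputs_embeds is not None:
        raise ValueError(
            "You cannot specify both pixel_values and inputs_embeds at the same time"
        )

_check_inputs(pixel_values="pixels")      # fine
# _check_inputs("pixels", "embeds")       # would raise ValueError
```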
```diff
  # generation with cache, decoding stage
- elif past_key_values is not None and (pixel_values is not None or pixel_values_videos is not None):
+ elif pixel_values is not None or pixel_values_videos is not None:
```
Not sure why this is changed, as the first if is inputs_embeds.shape[1] != 1:, so that covers prefill / a simple forward (and indeed merging should always be done regardless of generation).
But then here you need to index the past key values, which don't always exist (a forward call with use_cache=False, for example).
In a forward call with no cache, we should go into the first 'if' and never reach the elif. Yeah, kinda too many interdependent conditions, but it shouldn't cause an error. I'll add a parameterized case to the new tests (use-cache or no-cache).
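A condensed, runnable sketch of the branch structure (assumed shape, boiled down from the diffs above; not library code) shows why: with use_cache=False, past_key_values stays None and the elif is never reached, so it is safe for the elif to index the cache.

```python
def which_branch(seq_len, past_key_values=None, pixel_values=None):
    # Condensed from the diffs above (assumption: simplified, not library code).
    if past_key_values is None:
        # Covers prefill and any forward call with use_cache=False.
        if seq_len != 1 and pixel_values is not None:
            return "merge text and image features"
        return "plain text forward"
    elif pixel_values is not None:
        # Only reachable with a cache, so indexing past_key_values is safe here.
        return "decode step: slice image token positions against the cache"
    return "decode step, text only"

assert which_branch(seq_len=7, pixel_values="pixels") == "merge text and image features"
assert which_branch(seq_len=1, past_key_values="cache", pixel_values="pixels").startswith("decode")
```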
* fix in one commit
* add parameterized
* fix tests
* fix test flakiness
* maybe that's why flaky
* style
* flakiness...

---------

Co-authored-by: raushan <raushan@huggingface.co>
What does this PR do?
Fixes generation for llava-next-video. It apparently started failing after we moved to the cache class, because some parts of the code were not updated. I checked all llava models; the others are working since their check is done on a different condition.
Yes, we can start using cache_position and rely on that, but we should note that cache_position for VLMs will not be correct: it will contain positions only for the text tokens. Support for cache position will come in the next PR, which is in progress. We'll have to deprecate many things before we can get rid of the current checks to "merge or expand".
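To see why cache_position would be off (a toy illustration with made-up numbers, not library behavior): one image placeholder in input_ids expands into many embedding positions after merging, so positions computed from the text length undercount what is actually in the cache.

```python
import torch

# Toy numbers (assumptions): a 10-token prompt containing one <image>
# placeholder that expands to 576 patch embeddings after merging.
text_len = 10
num_image_embeds = 576

# cache_position derived from text tokens only:
cache_position = torch.arange(text_len)          # tensor([0, 1, ..., 9])

# but the cache actually holds the merged sequence:
merged_len = text_len - 1 + num_image_embeds     # 585 positions
print(cache_position[-1].item(), "vs", merged_len - 1)  # 9 vs 584
```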
This PR ports the changes from #32527 into a proper patch.
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.