Any to any pipeline and auto-mapping#40884
Conversation
merveenoyan
left a comment
From what I understand from the code, what we do is make it possible to load an any-to-any model and still do everything we do with image-text-to-text tasks with it. For me it's a bit confusing, but if we write the docs well it should be OK!
@merveenoyan if you have time to look at the docs section, your advice would be appreciated. Do you think there is anything we should add or highlight? I added basic functionality with examples for now.
Okay, I think this one is ready now, as long as CI turns green.
Thank you! Looking forward to this getting merged 🙏🏻
Cyrilvallez
left a comment
Nice! Left a few comments! Tagging @ArthurZucker as well, since the names we choose for the pipelines and mappings are important here. We will likely be stuck with them for some time, so let's make sure we like them and that they are descriptive enough!
ArthurZucker
left a comment
Overall very nice, but not sure about the naming yet!
jackzhxng
left a comment
Solves our use case perfectly; output_modalities is also very useful to have. Thanks @zucchini-nlp 🙏🏻
Leaving to @ArthurZucker and @Cyrilvallez for approval
Test failures not related!

Test failures not related, kind ping @ArthurZucker whenever you have time.
ArthurZucker
left a comment
Very, very nice, sorry that it took so long to come back to it!
Fan of in/out modalities! Shaping up well!
* initial commit
* fix tests
* fix copies, tests and rename pipe
* another rename
* fix copies again
* activate pipeline mixin in some models
* audio loading
* typo
* fix the test
* stupid typo in filename
* fix copies
* docs
* forgot
* fix pipe tests
* fix copies
* fix test
* lets not pass it explicitly
* final fix
* rename in test files as well
* fix again after reordering...
* add qwen2 audio
* add qwen3-omni
* wait, I didn't push it last time?
* it's only torch from now on
* how was the model merged with docstring issues?
* make style
* requires backend depends on input modalities
* add repr
* fix copies
* fox copies, new models were added
* and now fix copies
What does this PR do?
Adds any-to-any as a pipeline and to the auto classes so that we can have a single mapping for all multimodal models. The model mapping is almost the same as for image-text-to-text, with the inclusion of audio LLMs and omni LLMs. I hope I added all the audio models, but let me know if anything is missing from recent ones.
Fixes #40302 and fixes #37794