Skip to content
This repository was archived by the owner on Nov 1, 2025. It is now read-only.

Conversation

@sedrickkeh
Copy link

@sedrickkeh sedrickkeh commented Apr 26, 2024

This is similar to the Llama implementation in #4, but extended to multimodal HF models.

Idefics2 by HuggingFace supports multiple-image inputs. Its API output format is quite similar to ChatGPT's output format. I initially tried it with 50 frames, which is what GPT4-V was using, but that gave OOM, so I lowered the num_frames to 10.

Other multimodal models on HF should be quite similar to implement, though I think for things like Llava, multi-image input may not be supported off the shelf.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants