Add Audio-Visual Flamingo model #45586

Open

lashahub wants to merge 56 commits into huggingface:main from lashahub:add_AudioVisualFlamingo

Conversation

@lashahub
Contributor

This PR adds support for Audio-Visual Flamingo (AVF), an open audio-visual large language model for joint understanding and reasoning over audio, images, and videos.

The paper, model weights, and project page will be released in May 2026.

In Transformers, AVF pairs a SigLIP vision tower with an AF-Whisper audio encoder and a Qwen2.5-7B causal language model, with separate projectors for visual and audio features. For joint video-audio inputs, AVF aligns synchronized visual and audio chunks, interleaves them along the time axis, applies Constrained Rotary Time Embeddings (CRTE), and feeds the fused sequence to the language model. It also supports Dynamic-S2 preprocessing for high-resolution images and sampled video frames.
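
As a rough sketch of the temporal fusion step, the snippet below interleaves per-chunk visual and audio token tensors along the time axis so that chunks covering the same time span stay adjacent. The function name, shapes, and chunking are illustrative assumptions, and the CRTE rotary embedding itself is not reproduced here:

import torch

def interleave_av_chunks(visual, audio):
    # visual: (T, Nv, D) visual tokens per time chunk (hypothetical shapes)
    # audio:  (T, Na, D) audio tokens for the same time chunks
    # Returns (T * (Nv + Na), D) ordered [v_0, a_0, v_1, a_1, ...] so that
    # synchronized visual and audio chunks sit next to each other in time.
    fused = []
    for t in range(visual.shape[0]):
        fused.append(visual[t])
        fused.append(audio[t])
    return torch.cat(fused, dim=0)

seq = interleave_av_chunks(torch.randn(4, 16, 64), torch.randn(4, 8, 64))
print(seq.shape)  # torch.Size([96, 64])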

This PR introduces:

  • AudioVisualFlamingoConfig
  • AudioVisualFlamingoForConditionalGeneration
  • AudioVisualFlamingoProcessor
  • Joint video-audio handling from a single container via load_audio_in_video=True
  • Dynamic-S2 visual preprocessing (see the tiling sketch after this list)
  • Temporal audio-visual interleaving with CRTE
  • Modeling, processing, tests, and documentation
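
To make the Dynamic-S2 item concrete, here is a minimal multi-scale tiling sketch in its spirit: resize the image to several scales and cut each resized copy into encoder-sized tiles. The tile size, scale list, and square resize are illustrative assumptions, not the model's actual preprocessing constants:

from PIL import Image

def multi_scale_tiles(image, tile=448, scales=(1, 2)):
    # Crop non-overlapping tile x tile patches from each resized copy;
    # scale s contributes s * s tiles.
    tiles = []
    for s in scales:
        resized = image.resize((tile * s, tile * s))
        for y in range(s):
            for x in range(s):
                tiles.append(resized.crop((x * tile, y * tile, (x + 1) * tile, (y + 1) * tile)))
    return tiles

print(len(multi_scale_tiles(Image.new("RGB", (1024, 768)))))  # 1 + 4 = 5 tiles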

Example usage once the checkpoint is public:

from transformers import AudioVisualFlamingoForConditionalGeneration, AutoProcessor

model_id = "nvidia/audio-visual-flamingo-hf"

processor = AutoProcessor.from_pretrained(
    model_id,
    padding_side="left",
    use_fast=False,
    load_audio_in_video=True,
    num_video_frames=128,
    audio_chunk_length="max_3600",
)
model = AudioVisualFlamingoForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    load_audio_in_video=True,
).eval()

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "video.mp4"},
            {
                "type": "text",
                "text": "Describe both the visual scene and the spoken or environmental audio content.",
            },
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

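# Greedy decoding; the slice drops the prompt tokens so only new text is decoded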
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1] :], skip_special_tokens=True)[0])
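
Single images and standalone audio clips should flow through the same chat template; the "image" and "audio" content keys below follow the usual Transformers multimodal conventions and are assumptions until the processor code is final:

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "photo.jpg"},
            {"type": "audio", "audio": "clip.wav"},
            {"type": "text", "text": "Describe the image and transcribe the audio."},
        ],
    }
]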

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: audiovisualflamingo, auto
