Add Audio-Visual Flamingo model #45586

Open

lashahub wants to merge 56 commits into huggingface:main from lashahub:add_AudioVisualFlamingo

Conversation

@lashahub
Contributor

This PR adds support for Audio-Visual Flamingo (AVF), an open audio-visual large language model for joint understanding and reasoning over audio, images, and videos.

The paper, model weights, and project page will be released in May 2026.

In Transformers, AVF pairs a SigLIP vision tower with an AF-Whisper audio encoder and a Qwen2.5-7B causal language model, with separate projectors for visual and audio features. For joint video-audio inputs, AVF aligns synchronized visual and audio chunks, interleaves them along the time axis, applies Constrained Rotary Time Embeddings (CRTE), and feeds the fused sequence to the language model. It also supports Dynamic-S2 preprocessing for high-resolution images and sampled video frames.
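
As a rough sketch of the temporal fusion step, the snippet below interleaves per-chunk visual and audio token tensors along the time axis so that chunks covering the same time span stay adjacent. The function name, shapes, and chunking are illustrative assumptions, and the CRTE rotary embedding itself is not reproduced here:

import torch

def interleave_av_chunks(visual, audio):
    # visual: (T, Nv, D) visual tokens per time chunk (hypothetical shapes)
    # audio:  (T, Na, D) audio tokens for the same time chunks
    # Returns (T * (Nv + Na), D) ordered [v_0, a_0, v_1, a_1, ...] so that
    # synchronized visual and audio chunks sit next to each other in time.
    fused = []
    for t in range(visual.shape[0]):
        fused.append(visual[t])
        fused.append(audio[t])
    return torch.cat(fused, dim=0)

seq = interleave_av_chunks(torch.randn(4, 16, 64), torch.randn(4, 8, 64))
print(seq.shape)  # torch.Size([96, 64])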

This PR introduces:

  • AudioVisualFlamingoConfig
  • AudioVisualFlamingoForConditionalGeneration
  • AudioVisualFlamingoProcessor
  • Joint video-audio handling from a single container via load_audio_in_video=True
  • Dynamic-S2 visual preprocessing (see the tiling sketch after this list)
  • Temporal audio-visual interleaving with CRTE
  • Modeling, processing, tests, and documentation
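
To make the Dynamic-S2 item concrete, here is a minimal multi-scale tiling sketch in its spirit: resize the image to several scales and cut each resized copy into encoder-sized tiles. The tile size, scale list, and square resize are illustrative assumptions, not the model's actual preprocessing constants:

from PIL import Image

def multi_scale_tiles(image, tile=448, scales=(1, 2)):
    # Crop non-overlapping tile x tile patches from each resized copy;
    # scale s contributes s * s tiles.
    tiles = []
    for s in scales:
        resized = image.resize((tile * s, tile * s))
        for y in range(s):
            for x in range(s):
                tiles.append(resized.crop((x * tile, y * tile, (x + 1) * tile, (y + 1) * tile)))
    return tiles

print(len(multi_scale_tiles(Image.new("RGB", (1024, 768)))))  # 1 + 4 = 5 tiles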

Example usage once the checkpoint is public:

from transformers import AudioVisualFlamingoForConditionalGeneration, AutoProcessor

model_id = "nvidia/audio-visual-flamingo-hf"

processor = AutoProcessor.from_pretrained(
    model_id,
    padding_side="left",
    use_fast=False,
    load_audio_in_video=True,
    num_video_frames=128,
    audio_chunk_length="max_3600",
)
model = AudioVisualFlamingoForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    load_audio_in_video=True,
).eval()

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "video.mp4"},
            {
                "type": "text",
                "text": "Describe both the visual scene and the spoken or environmental audio content.",
            },
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

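# Greedy decoding; the slice drops the prompt tokens so only new text is decoded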
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1] :], skip_special_tokens=True)[0])
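
Single images and standalone audio clips should flow through the same chat template; the "image" and "audio" content keys below follow the usual Transformers multimodal conventions and are assumptions until the processor code is final:

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "photo.jpg"},
            {"type": "audio", "audio": "clip.wav"},
            {"type": "text", "text": "Describe the image and transcribe the audio."},
        ],
    }
]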

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: audiovisualflamingo, auto
