
[models] Add AudioFlamingo3 integration #40290

Merged
ydshieh merged 159 commits into huggingface:main from jsalt-2025:audioflamingo3 on Nov 12, 2025

Conversation

@lashahub (Contributor) commented Aug 19, 2025

This PR adds support for AudioFlamingo3 (AF3) — NVIDIA’s open large audio language model capable of reasoning over speech, sounds, and music.

It introduces the following components:

  • AudioFlamingo3 model class
  • AudioFlamingo3Processor for preprocessing text + audio
  • Configuration, modeling, and processing utilities
  • Example usage

With this integration, AF3 can be loaded directly from the Hugging Face Hub:

from transformers import AudioFlamingo3Processor, AudioFlamingo3

processor = AudioFlamingo3Processor.from_pretrained("nvidia/audio-flamingo-3")
model = AudioFlamingo3.from_pretrained("nvidia/audio-flamingo-3")

prompt = "What is happening in the audio?"
audio = "clap.wav"

input_ids, media, media_meta = processor(prompt, audio)
output_ids = model.generate(
    input_ids=input_ids,
    media=media,
    media_meta=media_meta,
    generation_config=model.default_generation_config,
)
print(processor.decode(output_ids))
# Example output: "A crowd is applauding and cheering."

@ebezzam (Contributor) left a comment

Thanks @lashahub for your PR! This is a very exciting model to add to the Transformers library!

I see that you've taken inspiration from Llava, which makes sense as you combine modules of different modalities.

Most of my comments are about rearranging your modules to fit the Transformers conventions, which will make your new model more convenient for others to use and test. To that end, Llava's PR and this PR (for Dia, another audio model) might serve as useful examples of which files will be added/modified.

Below are my suggested steps.

1. Refactoring / reorganizing current files according to Transformers convention

You can take inspiration from the above models for refactoring your configuration, modeling and processing files, specifically:

  • Consolidating your configurations. From what I see you may need only one config like in Llava or two like in Dia for AudioFlamingo3Config and AudioFlamingo3EncoderConfig.
  • Processor: you can take inspiration from Dia to group your feature extractor, text tokenizer, and audio tokenizer into a single component. Loading pre-trained feature extractors and LLMs can be directly handled by the processor without you (or the user) having to manually load the corresponding weights (like you've done below).
# Load components
fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
tok = AutoTokenizer.from_pretrained(
    llm_dir,
    padding_side="right",
    use_fast=True,
    legacy=False,
)

To this end, you'll need to create a model conversion script (with something like this) so the configuration files are generated for the processor to know where to pull the relevant models.

  • The modeling file will diminish quite significantly because we won't apply the audio/text tokenizer inside it but rather in the processor, and the configuration file will also handle pulling the relevant LLM config.
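As a rough illustration of the grouping described above, here is a minimal, framework-free sketch of the composite-processor pattern; the stub component fields and the `from_pretrained` behavior are assumptions for illustration, not the PR's actual implementation:

```python
from dataclasses import dataclass


# Stubs standing in for WhisperFeatureExtractor and the text tokenizer.
@dataclass
class StubFeatureExtractor:
    sampling_rate: int = 16000


@dataclass
class StubTokenizer:
    padding_side: str = "right"


class CompositeProcessor:
    """Groups the feature extractor and tokenizer behind one object,
    so callers never have to load the sub-components by hand."""

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    @classmethod
    def from_pretrained(cls, path):
        # In Transformers, the saved processor config records which
        # sub-components to instantiate; here we just build the stubs.
        return cls(StubFeatureExtractor(), StubTokenizer())


processor = CompositeProcessor.from_pretrained("nvidia/audio-flamingo-3")
print(processor.feature_extractor.sampling_rate)  # 16000
```

The point of the pattern is that one `from_pretrained` call restores every sub-component, which is what lets the manual `WhisperFeatureExtractor` / `AutoTokenizer` loading above disappear from user code.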

2. Conversion script

This will be needed to convert your model weights / configuration to one that is compatible with the one defined in the above files.

This script can also handle uploading the Transformer compatible model and its configuration to the Hugging Face Hub. You can again take inspiration from Llava and Dia.
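For flavor, the core of such a conversion script is usually a key-remapping pass over the checkpoint's state dict. The sketch below is hypothetical: the prefix names are illustrative guesses, not the PR's actual mapping.

```python
# Illustrative prefix renames (hypothetical, not the PR's real mapping).
KEY_MAP = {
    "sound_tower.": "audio_tower.",
    "mm_projector.": "multi_modal_projector.",
}


def convert_state_dict(state_dict: dict) -> dict:
    """Rewrite checkpoint keys to the names the Transformers model expects."""
    converted = {}
    for key, value in state_dict.items():
        for old, new in KEY_MAP.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        converted[key] = value
    return converted


print(convert_state_dict({"sound_tower.conv1.weight": 0, "lm_head.weight": 1}))
```

Keys that match no prefix pass through unchanged, so the same pass can be run over the full checkpoint.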

3. Testing, documentation, etc

Once your model implementation is consistent with other models implemented in Transformers, there's a lot of boilerplate code we can reuse to make using your model convenient and to apply various testing suites. For example, you can see the docs, src/transformers/models/auto, and tests/models/dia folders of the Dia PR for how to prepare / modify the relevant files.

Hope that helps and let me know if you have any questions!

@ebezzam (Contributor) commented Nov 12, 2025

run-slow: audioflamingo3

@github-actions

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3"]
quantizations: []

@github-actions

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@ebezzam (Contributor) commented Nov 12, 2025

@lashahub we still need to keep the models in bf16 for the tests, otherwise they won't load properly (see here)
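As background on why the dtype matters here: bfloat16 keeps float32's full exponent range but truncates the mantissa to 7 bits, so bf16 checkpoints halve memory at a small precision cost. A minimal sketch of the format itself (general background, not the PR's loading code):

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    # bfloat16 is simply the top 16 bits of the IEEE-754 float32 encoding
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_float(b: int) -> float:
    # widen back to float32 by zero-filling the dropped mantissa bits
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x


# 1.0 survives the round trip exactly; most values lose low mantissa bits
print(bf16_bits_to_float(float_to_bf16_bits(1.0)))      # 1.0
print(bf16_bits_to_float(float_to_bf16_bits(3.14159)))  # 3.140625
```

Because the exponent bits are untouched, casting fp32 weights to bf16 never overflows the way fp16 can, which is one reason bf16 is the common storage dtype for large checkpoints.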

@ebezzam (Contributor) commented Nov 12, 2025

run-slow: audioflamingo3

@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, auto, voxtral

@github-actions

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3"]
quantizations: []

@ebezzam (Contributor) commented Nov 12, 2025

@lashahub could you also add this training snippet: 8bf40fa

to the model page: https://huggingface.co/nvidia/audio-flamingo-3-hf

@github-actions

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@ArthurZucker (Collaborator) left a comment

Kudos everyone and thanks @eustlb and @ebezzam very clean 😉

@ebezzam (Contributor) commented Nov 12, 2025

run-slow: audioflamingo3

@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, auto, voxtral

@github-actions

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3"]
quantizations: []

@github-actions

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@ebezzam ebezzam self-requested a review November 12, 2025 13:59
@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, auto, voxtral

@ebezzam (Contributor) left a comment

Thanks @lashahub and @Sreyan88 for the great work, and @eustlb and @ArthurZucker for the feedback 🤗

Merging!

@ebezzam ebezzam enabled auto-merge (squash) November 12, 2025 14:06
@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, auto, voxtral

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ebezzam ebezzam disabled auto-merge November 12, 2025 14:18
@ydshieh ydshieh merged commit 1709ed9 into huggingface:main Nov 12, 2025
21 of 23 checks passed
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* Audio Flamingo 3 initial integration

* Added local Qwen

* Moving to AF3

* Loading directly from HF

* Formatting

* add snapshot_download

* Loading from hub

* Import gating

* Pass audio arrays directly

* Remove requires_backend

* Move constants to config.json

* Remove redundancies

* Separate tokenizer, cleaner from_pretrained

* Remove LlavaMetaModel

* Remove sound tower wrapper

* Merged BasicSoundEncoder

* Some improvements

* Towards AudioFlamingo3

* Migrate LlavaConfig

* Merge LlavaMetaForCausalLM into AudioFlamingo3ForConditionalGeneration

* Remove redundant lines

* Add AudioFlamingo3PreTrainedModel

* Unified model.safetensors

* Inline MM projector

* Tokenizer in root dir

* Default processor from_pretrained

* Remove tokenizer from modeling

* Added types

* Cleanup

* Docs & license

* device handling

* Change year

* Remove redundant methods

* Use BatchFeature

* Streamline audio feature handling

* Batch inference

* Reorder alphabetically

* Make style check

* Make fixup

* Avoid calls to separate functions

* Remove forward_tower()

* Rename encode_sound to get_audio_features for clarity

* Add batch decoding method to AudioFlamingo3Processor

* Use tensors instead of lists

* Move end embed token eval

* Prepare audio_features_mask in the processor

* No hardcoded 750 and 3000

* Remove _load_sound_mask completely and use WhisperFeatureExtractor

* Compute embeddings separately

* MM Projector is audio adaptor

* Simplify AudioFlamingo3Config initialization with default encoder_config

* Add modular

* Clean up

* make fixup

* Cleanup processing, add params to encoder config

* Remove redundant methods

* update config references, improve method names, and enhance logging in processor

* processor: move FE args to audio_kwargs, use common_kwargs for return_tensors

* Qwen-like processor

* Simplified AudioFlamingo3Processor

* Extract common code from generate() and forward()

* Add conversion script for AudioFlamingo3 to Hugging Face format

* Use save_pretrained()

* Don't overwrite gen config

* Use AutoTokenizer and FE to convert the processor

* minor formatting

* Finalize processor, do token expansion inside

* AudioFlamingo3: refactor docs, types, and audio–text feature merge

* AudioFlamingo3 Docs

* Add AudioFlamingo3Processor to AutoProcessor

* Processor tests

* Use audio_config instead of encoder_config

* Add audio_token_id to config

* Cleanup & new keys

* Add links

* Improved processor

* Handle conversational input

* Make processing consistent.

* Add fallback for no sound token, default left padding.

* Cleanup

* Replace manual 4D mask with masking_utils; dtype/device from inputs

* Text only mode

* Finalize processor

* Export processor directly

* Add push_to_hub to converter

* Add model_input_names property to AudioFlamingo3Processor to pass tests

* Processor chat template support

* Added Jinja processor chat template with audio support

* Processor tests

* Model tests

* Added docs

* Don't use common_kwargs in __call__

* Pass 'test_left_padding_compatibility' by never treating padding as content

* Updated docs

* Cleanup docs

* Standardization

* Update conversion script weight mapping.

* Flatten _build_square_attn_mask

* Make style

* Small dim and attn mask fix

* Fix processor padding side bug

* Error handling in converter

* Use position_ids

* Cleanup generation config

* Use precomputed position embeddings in AudioFlamingo3 encoder

* Added usage examples

* Fix generation config

* Integration tests

* Simplify modeling and shift part of mask preparation to processor. And update tests.

* Updated docs

* ASR convenience method

* Fixed tests

* make fixup

* Shift encoder mask preparation to the encoder's forward.

* Change to HF profiles.

* Integration test standardization.

* Clean up before integration test setup.

* Remove strict float32, more similar to Qwen2Audio.

* Use HF dataset links

* Keep weights in BF16

* New audio in tests

* Processor conventions.

* Standardize audio token expansion in processor.

* Add 'strip_prefix' to batch_decode

* Batch decode nits.

* Remove dtype casting.

* Read token ids from tokenizer

* diverse changes according to review

* add training example

* Add missing docstring.

* Fix typos.

* Add audio token docstring.

* Fix fill type.

* Fix docs

* Save converted weights in bf16

* Fix tests

* Keep model in bf16 for tests.

* Update expected results for single.

* Fix integration tests from runner.

* Update reproducer, and dtype nits.

---------

Co-authored-by: Eric B <ebezzam@gmail.com>
Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com>
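One of the commits above ("Standardize audio token expansion in processor") refers to a pattern common in audio LLM processors: a single audio placeholder in the prompt is expanded into one token per pooled audio frame window, so the text sequence lines up one-to-one with the audio embeddings. A hedged sketch of that pattern, with the token string and pooling factor as assumptions rather than the PR's actual values:

```python
import math


def expand_audio_tokens(text: str, num_frames: int,
                        audio_token: str = "<sound>",
                        frames_per_token: int = 8) -> str:
    """Replace the first audio placeholder with one token per pooled
    frame window, so text tokens and audio embeddings align one-to-one.
    The token string and pooling factor here are illustrative."""
    n = math.ceil(num_frames / frames_per_token)
    return text.replace(audio_token, audio_token * n, 1)


# 24 frames at 8 frames per token -> 3 placeholder tokens
print(expand_audio_tokens("<sound> What is happening?", num_frames=24))
```

Doing this expansion in the processor (rather than inside the model's forward) is what several of the refactoring commits above are about: the model can then simply scatter audio embeddings into the placeholder positions.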
@ebezzam ebezzam mentioned this pull request Feb 9, 2026

7 participants