Add AudioFlamingoNext model by lashahub · Pull Request #44830 · huggingface/transformers

lashahub · 2026-03-18T14:31:45Z

This PR adds AudioFlamingoNext as a separate model name that inherits directly from MusicFlamingo #43538 and keeps the same architecture and behavior.

Changes:

add audioflamingonext model files
register it in the auto mappings
add basic tests for the new model name

github-actions · 2026-04-13T07:28:06Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, audioflamingonext, auto, musicflamingo

lashahub · 2026-04-13T07:35:17Z

Hi @ebezzam, the model checkpoint is live at nvidia/audio-flamingo-next-hf. The main difference between Music Flamingo and Audio Flamingo Next is that AF-Next supports up to 30 minutes of audio, while MF supports 20.

Updated some AF3 and MF fixtures too that were failing on my end.

lashahub · 2026-04-15T21:31:08Z

Hi @eustlb, can we run the slow tests? This PR is pretty lightweight: it just registers AudioFlamingoNext as a model name, so it would be great to get it in when convenient. Thanks!

ebezzam · 2026-04-19T08:47:53Z

@lashahub I started my review, should be able to finish this week!

ebezzam · 2026-04-24T18:39:57Z

run-slow: audioflamingonext

ebezzam

@lashahub thanks for the model addition! Could you add links to reproducer scripts for the integration tests, like we did for AF3 and MF?

Also, can you revert the changes you made to the audioflamingo3 and musicflamingo tests? We should only change them if you are changing modeling code that would alter those outputs. Moreover, it's expected that they fail for you: when integrating those models, I had to recompute the expected outputs for our CI machine/setup, as different configurations can lead to different outputs. Someone has also added XPU outputs for Music Flamingo, and we want to keep those.

ebezzam · 2026-04-13T16:09:07Z

+        audio_token="<sound>",
+        audio_bos_token="<|sound_bos|>",
+        audio_eos_token="<|sound_eos|>",
+        max_audio_len=1800,


NOTE: max_audio_len is the only change from Music Flamingo (here)
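As a quick sanity check on that value: assuming max_audio_len is expressed in seconds (the diff does not state the unit), 1800 lines up with the 30-minute limit quoted earlier in the thread, and Music Flamingo's 20 minutes would correspond to 1200. The helper name below is hypothetical and only illustrates the arithmetic.

```python
# Assumption: max_audio_len is measured in seconds; the diff does not say so explicitly.
SECONDS_PER_MINUTE = 60


def minutes_to_max_audio_len(minutes: int) -> int:
    """Convert an audio duration limit in minutes to a value in seconds."""
    return minutes * SECONDS_PER_MINUTE


af_next_max_audio_len = minutes_to_max_audio_len(30)         # AF-Next: up to 30 minutes
music_flamingo_max_audio_len = minutes_to_max_audio_len(20)  # Music Flamingo: 20 minutes

print(af_next_max_audio_len, music_flamingo_max_audio_len)
```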

ebezzam · 2026-04-24T17:42:42Z

@@ -1 +1 @@
-{"transcriptions": ["There is no clear relationship between the barking and the music, as they seem to be independent of each other.", "(B) To indicate that language cannot express clearly, satirizing the inversion of black and white in the world"], "token_ids": [[3862, 374, 902, 2797, 5025, 1948, 279, 293, 33452, 323, 279, 4627, 11, 438, 807, 2803, 311, 387, 9489, 315, 1817, 1008, 13, 151645], [5349, 8, 2014, 13216, 429, 4128, 4157, 3158, 9355, 11, 7578, 404, 4849, 279, 46488, 315, 3691, 323, 4158, 304, 279, 1879, 151645, 151671]]}
+{"transcriptions": ["There is no clear relationship between the barking and the music, as they seem to be independent sound events.", "(B) To indicate that language cannot express clearly, satirizing the inversion of black and white in the world"], "token_ids": [[3862, 374, 902, 2797, 5025, 1948, 279, 293, 33452, 323, 279, 4627, 11, 438, 807, 2803, 311, 387, 9489, 5112, 4357, 13, 151645], [5349, 8, 2014, 13216, 429, 4128, 4157, 3158, 9355, 11, 7578, 404, 4849, 279, 46488, 315, 3691, 323, 4158, 304, 279, 1879, 151645]]}


Can you revert this change? This test should already be passing on our machines (it was passing on my development machine, but this change causes it to fail).

Different hardware can lead to different outputs, and we computed these outputs on a setup similar to our CI.
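To see exactly where the two fixture versions disagree, the token_ids from the diff above can be compared position by position. This is only a sketch: `first_divergence` is a hypothetical helper, not part of the test suite, and the lists are copied from the removed and added fixture lines.

```python
from itertools import zip_longest


def first_divergence(old, new):
    """Return the first index where the sequences differ, or None if they are identical."""
    for index, (a, b) in enumerate(zip_longest(old, new)):
        if a != b:
            return index
    return None


# token_ids copied from the "-" (old) and "+" (new) fixture lines in the diff.
old_first = [3862, 374, 902, 2797, 5025, 1948, 279, 293, 33452, 323, 279, 4627, 11,
             438, 807, 2803, 311, 387, 9489, 315, 1817, 1008, 13, 151645]
new_first = [3862, 374, 902, 2797, 5025, 1948, 279, 293, 33452, 323, 279, 4627, 11,
             438, 807, 2803, 311, 387, 9489, 5112, 4357, 13, 151645]
old_second = [5349, 8, 2014, 13216, 429, 4128, 4157, 3158, 9355, 11, 7578, 404, 4849,
              279, 46488, 315, 3691, 323, 4158, 304, 279, 1879, 151645, 151671]
new_second = [5349, 8, 2014, 13216, 429, 4128, 4157, 3158, 9355, 11, 7578, 404, 4849,
              279, 46488, 315, 3691, 323, 4158, 304, 279, 1879, 151645]

print(first_divergence(old_first, new_first))
print(first_divergence(old_second, new_second))
```

The first sample diverges partway through the sentence ("of each other" versus "sound events"), while the second differs only by one extra trailing token in the old fixture.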

ebezzam · 2026-04-24T18:32:43Z

let's also leave this as before. Someone added expected outputs for XPU hardware since the model addition, and we want to keep that.

ebezzam · 2026-04-24T18:38:42Z

+        cleanup(torch_device, gc_collect=True)
+
+    @slow
+    def test_fixture_single_matches(self):


Can you add the reproducer scripts, so I can reproduce the outputs in case they differ on our CI setup/hardware?

if the original model is private like musicflamingo, we will do something like the musicflamingo tests, where the expected outputs are computed with the model at merge

ebezzam · 2026-04-24T18:45:43Z

    rope_parameters: dict | None = None

    def __post_init__(self, **kwargs):
+        if self.rope_parameters is None:


why did you need to shift it here?

This needs to happen before super().__post_init__() because the base config standardizes rope_parameters=None into {"rope_theta": 10000.0, "rope_type": "default"}. If we set the Music Flamingo default after super(), the fallback no longer runs and the config gets the wrong RoPE value instead of rope_theta=1200 / partial_rotary_factor=0.2.
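A minimal sketch of the ordering issue, using stand-in dataclasses rather than the real transformers config classes. `BaseConfigSketch`, `DefaultsBeforeSuper`, and `DefaultsAfterSuper` are hypothetical names; only the default values come from the explanation above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BaseConfigSketch:
    """Stand-in for the base config: it turns rope_parameters=None into a global default."""

    rope_parameters: dict | None = None

    def __post_init__(self):
        if self.rope_parameters is None:
            self.rope_parameters = {"rope_theta": 10000.0, "rope_type": "default"}


@dataclass
class DefaultsBeforeSuper(BaseConfigSketch):
    def __post_init__(self):
        # The model-specific default is applied first, so the base class sees a dict.
        if self.rope_parameters is None:
            self.rope_parameters = {
                "rope_type": "default",
                "rope_theta": 1200,
                "partial_rotary_factor": 0.2,
            }
        super().__post_init__()


@dataclass
class DefaultsAfterSuper(BaseConfigSketch):
    def __post_init__(self):
        super().__post_init__()
        # By now the base class has already replaced None, so this branch never runs.
        if self.rope_parameters is None:
            self.rope_parameters = {
                "rope_type": "default",
                "rope_theta": 1200,
                "partial_rotary_factor": 0.2,
            }


print(DefaultsBeforeSuper().rope_parameters)
print(DefaultsAfterSuper().rope_parameters)
```

In this sketch only the first variant ends up with rope_theta=1200 and partial_rotary_factor=0.2. The second silently keeps the base default.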

ebezzam · 2026-04-24T18:47:33Z

run-slow: audioflamingonext

HuggingFaceDocBuilderDev · 2026-04-24T18:53:51Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ebezzam

Some fixes are also needed for the config!

ebezzam · 2026-04-24T19:01:30Z

+        if isinstance(self.audio_config, dict):
+            self.audio_config["model_type"] = self.audio_config.get("model_type", "audioflamingo3_encoder")
+        elif self.audio_config is None:
+            self.audio_config = {"model_type": "audioflamingo3_encoder"}
+        if self.rope_parameters is None:
+            self.rope_parameters = {
+                "rope_type": "default",
+                "rope_theta": kwargs.get("rope_theta", 1200),
+                "partial_rotary_factor": kwargs.get("partial_rotary_factor", 0.2),
+            }
+        if isinstance(self.audio_config, dict):
+            self.audio_config["model_type"] = self.audio_config.get("model_type", "audioflamingonext_encoder")
+            self.audio_config = CONFIG_MAPPING[self.audio_config["model_type"]](**self.audio_config)
+        elif self.audio_config is None:
+            self.audio_config = CONFIG_MAPPING["audioflamingonext_encoder"]()


two code blocks for self.audio_config are getting generated
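One consequence of the duplicated handling, sketched under assumptions: the two audio_config blocks from the diff are copied into a standalone function, and `CONFIG_MAPPING` is replaced by a hypothetical stand-in that only records which entry was used. The rope_parameters block in between is omitted because it does not touch audio_config.

```python
# Hypothetical stand-in for the real CONFIG_MAPPING: each entry echoes its kwargs
# and records which key was used to build it.
CONFIG_MAPPING = {
    "audioflamingo3_encoder": lambda **kw: {"built_as": "audioflamingo3_encoder", **kw},
    "audioflamingonext_encoder": lambda **kw: {"built_as": "audioflamingonext_encoder", **kw},
}


def resolve_audio_config(audio_config):
    """Replay both generated audio_config blocks in the order they appear in the diff."""
    # First generated block: fills in the AF3 encoder as the default model_type.
    if isinstance(audio_config, dict):
        audio_config["model_type"] = audio_config.get("model_type", "audioflamingo3_encoder")
    elif audio_config is None:
        audio_config = {"model_type": "audioflamingo3_encoder"}
    # Second generated block: model_type is always set by now, so the
    # "audioflamingonext_encoder" default and the None branch are unreachable.
    if isinstance(audio_config, dict):
        audio_config["model_type"] = audio_config.get("model_type", "audioflamingonext_encoder")
        audio_config = CONFIG_MAPPING[audio_config["model_type"]](**audio_config)
    elif audio_config is None:
        audio_config = CONFIG_MAPPING["audioflamingonext_encoder"]()
    return audio_config


print(resolve_audio_config(None)["built_as"])
print(resolve_audio_config({"model_type": "audioflamingonext_encoder"})["built_as"])
```

With both blocks present, the AF-Next encoder default only takes effect when the caller passes model_type explicitly, which is one more reason to keep a single audio_config block.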

ebezzam · 2026-04-24T19:03:26Z

+        if isinstance(self.audio_config, dict):
+            self.audio_config["model_type"] = self.audio_config.get("model_type", "audioflamingo3_encoder")
+        elif self.audio_config is None:
+            self.audio_config = {"model_type": "audioflamingo3_encoder"}


modular doesn't let you override the other self.audio_config block, unfortunately. But what stops you from using MusicFlamingoConfig as is?

lashahub · 2026-04-25T05:13:10Z

@ebezzam thanks for the reviews!

Reverted the Audio Flamingo 3 / Music Flamingo fixture and test changes.
Added the AF-Next fixture reproducer link in the integration tests: https://gist.github.com/lashahub/5dbee78c5faedd5389e211da85e3066d
Simplified the AF-Next config inheritance and removed the duplicated generated audio_config handling.
Kept the Music Flamingo RoPE default setup before super().__post_init__() because the base config now standardizes rope_parameters=None to the global default before model-specific defaults can be applied.
Reran AF-Next slow modeling tests locally on GPU.

github-actions · 2026-04-25T05:26:55Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44830&sha=205f8c

lashahub added 30 commits December 24, 2025 10:59

Music flamingo

caf33f1

Fix pos embeddings

b69a9d1

Merge branch 'huggingface:main' into main

b4acc17

Method arg docstrings

cf1e9bc

Add tests & docs

44e801b

Merge branch 'huggingface:main' into main

f0956e3

Fix AF3 dtype bug

c973445

Fix the MF performance issue

e3a17fb

Fix pos embeddings

627dee8

Merge branch 'main' of https://github.com/lashahub/transformers

e9df30d

Fix embeddings & format

4c48132

Remove external deps

d67c114

Update processor token names

aedd341

Cleanup

87d55a9

Simplify RotaryEmbedding to lang-only

be22746

Reuse AF3 config classes

e5b4677

Trim+rename rotary embedding

d7e0bcb

Call parent _init_weights first and drop rotary einsum

74af4fa

Precompute rotary cache at init

d622368

Use modular processor pattern for MusicFlamingo

cab3937

Remove audio-only inference example

9dd94a0

Refactor Audio Feature Casting Path

767a1d5

Clarify private source repo

9119660

Clean up modular

fc8ab3a

Move config to modular

5159745

Formatting

4aae4ff

Remove dummy

ffef7db

Derive musicflamingo timing and rotary config

90148ed

Llama style rotary embeddings

87cb03f

Added reproducer comments

5abac95

lashahub added 5 commits April 4, 2026 16:31

Merge branch 'main' into add_AudioFlamingoNext

6d5948e

Regen

a44c774

Fix AF-Next converter

3c7f582

AF-Next readme

8f33aeb

Add tests

d323d1f

lashahub mentioned this pull request Apr 5, 2026

Update MusicFlamingo and add AudioFlamingoNext vllm-project/vllm#39011

Open

5 tasks

Update AF-Next docs

9735a5f

lashahub marked this pull request as draft April 6, 2026 02:08

Fix MusicFlamingo and AFNext context metadata and rope defaults

548e646

lashahub force-pushed the add_AudioFlamingoNext branch from c536301 to 548e646 Compare April 9, 2026 18:13

lashahub added 2 commits April 13, 2026 01:33

Merge branch 'huggingface:main' into add_AudioFlamingoNext

e771363

Update fixtures

8c17076

lashahub marked this pull request as ready for review April 13, 2026 07:27

ebezzam self-assigned this Apr 23, 2026

ebezzam reviewed Apr 24, 2026


Comment thread src/transformers/models/audioflamingonext/modular_audioflamingonext.py Outdated

ebezzam reviewed Apr 24, 2026


lashahub added 2 commits April 25, 2026 00:49

Address reviews

d79fb29

Add reproducer

205f8c8

This was referenced Apr 29, 2026

Cumulative feature and defect updates from recent Transformers PRs evalstate/transformers#42

Open

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#43

Open

