Skip to content

Add AudioFlamingoNext model#44830

Open
lashahub wants to merge 109 commits intohuggingface:mainfrom
lashahub:add_AudioFlamingoNext
Open

Add AudioFlamingoNext model#44830
lashahub wants to merge 109 commits intohuggingface:mainfrom
lashahub:add_AudioFlamingoNext

Conversation

@lashahub
Copy link
Copy Markdown
Contributor

This PR adds AudioFlamingoNext as a separate model name that inherits directly from MusicFlamingo #43538 and keeps the same architecture and behavior.

Changes:

  • add audioflamingonext model files
  • register it in the auto mappings
  • add basic tests for the new model name

@lashahub lashahub marked this pull request as draft April 6, 2026 02:08
@lashahub lashahub force-pushed the add_AudioFlamingoNext branch from c536301 to 548e646 Compare April 9, 2026 18:13
@lashahub lashahub marked this pull request as ready for review April 13, 2026 07:27
@github-actions
Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, audioflamingonext, auto, musicflamingo

@lashahub
Copy link
Copy Markdown
Contributor Author

Hi @ebezzam, the model checkpoint is live at nvidia/audio-flamingo-next-hf. The main difference between Music Flamingo and Audio Flamingo Next is that AF-Next supports up to 30 minutes of audio, while MF supports 20.

Updated some AF3 and MF fixtures too that were failing on my end.

@lashahub
Copy link
Copy Markdown
Contributor Author

Hi @eustlb, can we run the slow tests? Also this PR is pretty lightweight, it just registers AudioFlamingoNext as a model name, would be great to get it in when convenient. Thanks!

@ebezzam
Copy link
Copy Markdown
Contributor

ebezzam commented Apr 19, 2026

@lashahub I started my review, should be able to finish this week!

@ebezzam ebezzam self-assigned this Apr 23, 2026
@ebezzam
Copy link
Copy Markdown
Contributor

ebezzam commented Apr 24, 2026

run-slow: audioflamingonext

Copy link
Copy Markdown
Contributor

@ebezzam ebezzam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lashahub thanks for the model addition! Could you add links to reproducer scripts for the integration tests? Like we did for AF3 and MF

and can you revert the changes you made for the audioflamingo3 and musicflamingo tests? we should only change them if you are changing modeling code that would change those outputs. Moreover, It's expected that they fail for you. When integrating those models, I had to re-compute the expected outputs for our CI machine/setup (as different configurations can lead to difference outputs). Also someone added XPU outputs for music flamingo and we want to leave that.

audio_token="<sound>",
audio_bos_token="<|sound_bos|>",
audio_eos_token="<|sound_eos|>",
max_audio_len=1800,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: max_audio_len is only change from Music Flamingo (here)

@@ -1 +1 @@
{"transcriptions": ["There is no clear relationship between the barking and the music, as they seem to be independent of each other.", "(B) To indicate that language cannot express clearly, satirizing the inversion of black and white in the world"], "token_ids": [[3862, 374, 902, 2797, 5025, 1948, 279, 293, 33452, 323, 279, 4627, 11, 438, 807, 2803, 311, 387, 9489, 315, 1817, 1008, 13, 151645], [5349, 8, 2014, 13216, 429, 4128, 4157, 3158, 9355, 11, 7578, 404, 4849, 279, 46488, 315, 3691, 323, 4158, 304, 279, 1879, 151645, 151671]]} No newline at end of file
{"transcriptions": ["There is no clear relationship between the barking and the music, as they seem to be independent sound events.", "(B) To indicate that language cannot express clearly, satirizing the inversion of black and white in the world"], "token_ids": [[3862, 374, 902, 2797, 5025, 1948, 279, 293, 33452, 323, 279, 4627, 11, 438, 807, 2803, 311, 387, 9489, 5112, 4357, 13, 151645], [5349, 8, 2014, 13216, 429, 4128, 4157, 3158, 9355, 11, 7578, 404, 4849, 279, 46488, 315, 3691, 323, 4158, 304, 279, 1879, 151645]]} No newline at end of file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you revert this change? this test should normally be passing on our machines already (it was passing on my development machine, but this change causes it to fail)

different hardware can lead to different outputs, and we computed these outputs on a similar setup as our CI

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also leave this as before. Someone added expected outputs for XPU hardware since the model addition, and we want to keep that.

cleanup(torch_device, gc_collect=True)

@slow
def test_fixture_single_matches(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the reproducer scripts? so I can reproduce the output in case there are outputs differences with out CI setup/hardware.

if the original model is private like musicflamingo, we will do something like the musicflamingo tests, where the expected outputs are computed with the model at merge

rope_parameters: dict | None = None

def __post_init__(self, **kwargs):
if self.rope_parameters is None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you need to shift it here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to happen before super().__post_init__() because the base config standardizes rope_parameters=None into {"rope_theta": 10000.0, "rope_type": "default"}. If we set the Music Flamingo default after super(), the fallback no longer runs and the config gets the wrong RoPE value instead of rope_theta=1200 / partial_rotary_factor=0.2.

@ebezzam
Copy link
Copy Markdown
Contributor

ebezzam commented Apr 24, 2026

run-slow: audioflamingonext

Comment thread src/transformers/models/audioflamingonext/modular_audioflamingonext.py Outdated
@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Copy Markdown
Contributor

@ebezzam ebezzam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some fixes are also needed for the config!

Comment on lines +75 to +89
if isinstance(self.audio_config, dict):
self.audio_config["model_type"] = self.audio_config.get("model_type", "audioflamingo3_encoder")
elif self.audio_config is None:
self.audio_config = {"model_type": "audioflamingo3_encoder"}
if self.rope_parameters is None:
self.rope_parameters = {
"rope_type": "default",
"rope_theta": kwargs.get("rope_theta", 1200),
"partial_rotary_factor": kwargs.get("partial_rotary_factor", 0.2),
}
if isinstance(self.audio_config, dict):
self.audio_config["model_type"] = self.audio_config.get("model_type", "audioflamingonext_encoder")
self.audio_config = CONFIG_MAPPING[self.audio_config["model_type"]](**self.audio_config)
elif self.audio_config is None:
self.audio_config = CONFIG_MAPPING["audioflamingonext_encoder"]()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two code blocks for self.audio_config are getting genetated

Comment on lines +30 to +33
if isinstance(self.audio_config, dict):
self.audio_config["model_type"] = self.audio_config.get("model_type", "audioflamingo3_encoder")
elif self.audio_config is None:
self.audio_config = {"model_type": "audioflamingo3_encoder"}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modular doesn't let you override the other self.audio_config unfortunately. But what stops from using MusicFlamingoConfig as is?

@lashahub
Copy link
Copy Markdown
Contributor Author

@ebezzam thanks for the reviews!

  • Reverted the Audio Flamingo 3 / Music Flamingo fixture and test changes.
  • Added the AF-Next fixture reproducer link in the integration tests: https://gist.github.com/lashahub/5dbee78c5faedd5389e211da85e3066d
  • Simplified the AF-Next config inheritance and removed the duplicated generated audio_config handling.
  • Kept the Music Flamingo RoPE default setup before super().__post_init__() because the base config now standardizes rope_parameters=None to the global default before model-specific defaults can be applied.
  • Reran AF-Next slow modeling tests locally on GPU.

@github-actions
Copy link
Copy Markdown
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44830&sha=205f8c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants