Add AudioFlamingoNext model#44830
Conversation
c536301 to
548e646
Compare
|
[For maintainers] Suggested jobs to run (before merge) run-slow: audioflamingo3, audioflamingonext, auto, musicflamingo |
|
Hi @ebezzam, the model checkpoint is live at nvidia/audio-flamingo-next-hf. The main difference between Music Flamingo and Audio Flamingo Next is that AF-Next supports up to 30 minutes of audio, while MF supports 20. Updated some AF3 and MF fixtures too that were failing on my end. |
|
Hi @eustlb, can we run the slow tests? Also this PR is pretty lightweight, it just registers |
|
@lashahub I started my review, should be able to finish this week! |
|
run-slow: audioflamingonext |
ebezzam
left a comment
There was a problem hiding this comment.
@lashahub thanks for the model addition! Could you add links to reproducer scripts for the integration tests? Like we did for AF3 and MF
and can you revert the changes you made for the audioflamingo3 and musicflamingo tests? we should only change them if you are changing modeling code that would change those outputs. Moreover, It's expected that they fail for you. When integrating those models, I had to re-compute the expected outputs for our CI machine/setup (as different configurations can lead to difference outputs). Also someone added XPU outputs for music flamingo and we want to leave that.
| audio_token="<sound>", | ||
| audio_bos_token="<|sound_bos|>", | ||
| audio_eos_token="<|sound_eos|>", | ||
| max_audio_len=1800, |
There was a problem hiding this comment.
NOTE: max_audio_len is only change from Music Flamingo (here)
| @@ -1 +1 @@ | |||
| {"transcriptions": ["There is no clear relationship between the barking and the music, as they seem to be independent of each other.", "(B) To indicate that language cannot express clearly, satirizing the inversion of black and white in the world"], "token_ids": [[3862, 374, 902, 2797, 5025, 1948, 279, 293, 33452, 323, 279, 4627, 11, 438, 807, 2803, 311, 387, 9489, 315, 1817, 1008, 13, 151645], [5349, 8, 2014, 13216, 429, 4128, 4157, 3158, 9355, 11, 7578, 404, 4849, 279, 46488, 315, 3691, 323, 4158, 304, 279, 1879, 151645, 151671]]} No newline at end of file | |||
| {"transcriptions": ["There is no clear relationship between the barking and the music, as they seem to be independent sound events.", "(B) To indicate that language cannot express clearly, satirizing the inversion of black and white in the world"], "token_ids": [[3862, 374, 902, 2797, 5025, 1948, 279, 293, 33452, 323, 279, 4627, 11, 438, 807, 2803, 311, 387, 9489, 5112, 4357, 13, 151645], [5349, 8, 2014, 13216, 429, 4128, 4157, 3158, 9355, 11, 7578, 404, 4849, 279, 46488, 315, 3691, 323, 4158, 304, 279, 1879, 151645]]} No newline at end of file | |||
There was a problem hiding this comment.
can you revert this change? this test should normally be passing on our machines already (it was passing on my development machine, but this change causes it to fail)
different hardware can lead to different outputs, and we computed these outputs on a similar setup as our CI
There was a problem hiding this comment.
let's also leave this as before. Someone added expected outputs for XPU hardware since the model addition, and we want to keep that.
| cleanup(torch_device, gc_collect=True) | ||
|
|
||
| @slow | ||
| def test_fixture_single_matches(self): |
There was a problem hiding this comment.
can you add the reproducer scripts? so I can reproduce the output in case there are outputs differences with out CI setup/hardware.
if the original model is private like musicflamingo, we will do something like the musicflamingo tests, where the expected outputs are computed with the model at merge
| rope_parameters: dict | None = None | ||
|
|
||
| def __post_init__(self, **kwargs): | ||
| if self.rope_parameters is None: |
There was a problem hiding this comment.
why did you need to shift it here?
There was a problem hiding this comment.
This needs to happen before super().__post_init__() because the base config standardizes rope_parameters=None into {"rope_theta": 10000.0, "rope_type": "default"}. If we set the Music Flamingo default after super(), the fallback no longer runs and the config gets the wrong RoPE value instead of rope_theta=1200 / partial_rotary_factor=0.2.
|
run-slow: audioflamingonext |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
ebezzam
left a comment
There was a problem hiding this comment.
Some fixes are also needed for the config!
| if isinstance(self.audio_config, dict): | ||
| self.audio_config["model_type"] = self.audio_config.get("model_type", "audioflamingo3_encoder") | ||
| elif self.audio_config is None: | ||
| self.audio_config = {"model_type": "audioflamingo3_encoder"} | ||
| if self.rope_parameters is None: | ||
| self.rope_parameters = { | ||
| "rope_type": "default", | ||
| "rope_theta": kwargs.get("rope_theta", 1200), | ||
| "partial_rotary_factor": kwargs.get("partial_rotary_factor", 0.2), | ||
| } | ||
| if isinstance(self.audio_config, dict): | ||
| self.audio_config["model_type"] = self.audio_config.get("model_type", "audioflamingonext_encoder") | ||
| self.audio_config = CONFIG_MAPPING[self.audio_config["model_type"]](**self.audio_config) | ||
| elif self.audio_config is None: | ||
| self.audio_config = CONFIG_MAPPING["audioflamingonext_encoder"]() |
There was a problem hiding this comment.
two code blocks for self.audio_config are getting genetated
| if isinstance(self.audio_config, dict): | ||
| self.audio_config["model_type"] = self.audio_config.get("model_type", "audioflamingo3_encoder") | ||
| elif self.audio_config is None: | ||
| self.audio_config = {"model_type": "audioflamingo3_encoder"} |
There was a problem hiding this comment.
modular doesn't let you override the other self.audio_config unfortunately. But what stops from using MusicFlamingoConfig as is?
|
@ebezzam thanks for the reviews!
|
|
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44830&sha=205f8c |
This PR adds
AudioFlamingoNextas a separate model name that inherits directly fromMusicFlamingo#43538 and keeps the same architecture and behavior.Changes:
audioflamingonextmodel files