Add Vocos model#39403

Open
Manalelaidouni wants to merge 127 commits into huggingface:main from
Manalelaidouni:add-vocos-model

@Manalelaidouni
Contributor

What does this PR do?

This PR integrates the Vocos model into transformers.

Vocos is a neural vocoder designed for high-quality audio synthesis in TTS pipelines and related tasks. It outperforms HiFi-GAN and is significantly faster. It has two main variants:

  • VocosModel: can be used as a standalone vocoder in an audio generation pipeline; the goal is to use it as a drop-in vocoder in the YuE model. It can also be used together with VocosFeatureExtractor to synthesize audio from mel-spectrogram features.
  • VocosWithEncodecModel: integrates the EnCodec neural audio codec model into Vocos for end-to-end audio compression and reconstruction.

This is a continuation of integrating model components for the new YuE model (mentioned in #36784).

Who can review?

Anyone in the community is free to review the PR once the tests have passed.
@ArthurZucker @eustlb @ylacombe

@Manalelaidouni Manalelaidouni marked this pull request as draft July 14, 2025 22:50
Collaborator

@ArthurZucker ArthurZucker left a comment

Nice! My main comment is to remove the hidden states post processing!

Comment thread src/transformers/models/vocos/modeling_vocos.py Outdated
@ArthurZucker ArthurZucker requested a review from eustlb July 16, 2025 13:33
@Manalelaidouni Manalelaidouni marked this pull request as ready for review July 22, 2025 13:07
@Manalelaidouni Manalelaidouni marked this pull request as draft July 22, 2025 13:26
@Manalelaidouni Manalelaidouni marked this pull request as ready for review July 22, 2025 15:29
@Manalelaidouni
Contributor Author

Manalelaidouni commented Jul 22, 2025

Thanks for reviewing! The failing tests seem unrelated to my changes, but I realized that the latest datasets 4.0.0 loads different audio samples than earlier versions, which was causing the integration tests to fail in CI.

Collaborator

@ArthurZucker ArthurZucker left a comment

Sorry for my late review!

Comment thread src/transformers/models/vocos/modeling_vocos.py Outdated
Comment thread src/transformers/models/vocos/modeling_vocos.py Outdated
Comment thread src/transformers/models/vocos/modeling_vocos.py Outdated
Comment thread src/transformers/models/vocos/feature_extraction_vocos.py Outdated
@ArthurZucker
Collaborator

If you can merge main and address the small comment, we can merge!

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, vocos, vocos_encodec

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=39403&sha=5a8643

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=39403&sha=bd4727

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=39403&sha=b7bac4

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=39403&sha=d55b0f

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=39403&sha=c504d3

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=39403&sha=4cc616

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=39403&sha=39af5a

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=39403&sha=3e8df7

@Manalelaidouni
Contributor Author

Hey @eustlb @ebezzam, I pushed a few changes and the PR is in a mergeable shape again; I would appreciate your review when you have a moment. The CI failures look unrelated, except for the date update in the docs. The changes:

  • Removed return_audio_only from the feature extractor and simplified the processor flow.
  • The feature extractor and processor now return an attention_mask so batch outputs can be trimmed consistently. Both VocosModel and VocosEncodecModel accept it and pass it through, so output trimming for batched audio uses the mask (similar to Parakeet-style handling).
  • Updated the tests, fixtures, gist, and model doc cards accordingly.
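
For readers unfamiliar with mask-based trimming, here is a minimal sketch of the idea in plain Python. It is not the code in this PR: the helper names `pad_batch` and `trim_batch` are hypothetical, and it assumes right-padding with a sample-level mask where 1 marks a real sample and 0 marks padding.

```python
from typing import List, Tuple


def pad_batch(
    waveforms: List[List[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[List[int]]]:
    """Right-pad waveforms to a common length and build a matching mask.

    Hypothetical helper: mask value 1 marks a real sample, 0 marks padding.
    """
    max_len = max(len(w) for w in waveforms)
    padded = [list(w) + [pad_value] * (max_len - len(w)) for w in waveforms]
    mask = [[1] * len(w) + [0] * (max_len - len(w)) for w in waveforms]
    return padded, mask


def trim_batch(
    waveforms: List[List[float]], attention_mask: List[List[int]]
) -> List[List[float]]:
    """Drop padded samples from each item of a right-padded batch.

    Hypothetical helper: assumes the mask has the same shape as the batch.
    """
    if len(waveforms) != len(attention_mask):
        raise ValueError("batch and mask must have the same number of items")
    trimmed = []
    for wav, mask in zip(waveforms, attention_mask):
        if len(wav) != len(mask):
            raise ValueError("each waveform and its mask must have equal length")
        valid_length = sum(mask)  # number of real (unpadded) samples
        trimmed.append(wav[:valid_length])
    return trimmed


# Round trip: padding followed by mask-based trimming recovers the inputs.
batch = [[0.1, 0.2, 0.3], [0.4]]
padded, mask = pad_batch(batch)
restored = trim_batch(padded, mask)
```

In a real vocoder the mask usually lives at the feature-frame level, so the valid length would need to be scaled to samples before trimming; that mapping is left out here because it depends on the model's configuration.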
