Implement VibeVoice #40546
Conversation
ebezzam
left a comment
@pengzhiliang thanks for the PR! This is an exciting model to add 🔥
My first comments are mainly about rearranging content to be consistent with the other models in Transformers, and about creating a modular file so that components can be reused from other models in Transformers.
There are also some other files to modify:
- in `src/transformers/models/auto`
- in docs
- and eventually some tests (for which a lot of code can be copied from other models)

As an example of typical files to create/modify, you can check out the Qwen2.5-Omni PR, which is also multimodal.
Hopefully this PR can get merged; since the VibeVoice repo was deleted, it's not clear the original authors can continue contributing to this PR.

Glad to see that someone has picked this PR up, thanks 🤗
ebezzam
left a comment
@eustlb a self-review with potential discussion points to hopefully help with your review!
- [bezzam/VibeVoice-1.5B](https://huggingface.co/bezzam/VibeVoice-1.5B)
- [bezzam/VibeVoice-7B](https://huggingface.co/bezzam/VibeVoice-7B)
To update
model_id = "bezzam/VibeVoice-1.5Bv2"
# model_id = "bezzam/VibeVoice-7Bv2"
To update in all examples
### Training

TODO

### Full-graph compilation

TODO
Todo if applicable
One key feature of VibeVoice is the use of two continuous speech tokenizers, one for extracting acoustic features (this model) and another for [semantic](./vibevoice_semantic_tokenizer) features.

A model checkpoint is available at [bezzam/VibeVoice-AcousticTokenizer](https://huggingface.co/bezzam/VibeVoice-AcousticTokenizer)
To update
*Note: the semantic tokenizer can only be used to encode audio to extract semantic features.*

A model checkpoint is available at [bezzam/VibeVoice-SemanticTokenizer](https://huggingface.co/bezzam/VibeVoice-SemanticTokenizer)
To update
# TODO (ebezzam) original has an implementation which should be verified (and would need noise scheduler from `diffusers`):
# https://github.com/pengzhiliang/transformers/blob/6e6e60fb95ca908feb0b039483adcc009809f579/src/transformers/models/vibevoice/modeling_vibevoice.py#L407
if acoustic_loss_mask is not None:
    raise ValueError("Diffusion loss computation not implemented yet.")
As said in the comment, computing the diffusion loss (for speech generation) isn't implemented. It would need the `diffusers` library for a noise scheduler. To avoid importing `diffusers` here, perhaps it could be an input (`loss_noise_scheduler`) passed along with the `acoustic_loss_mask`; namely, the user would manually create a noise scheduler outside the model, like here
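A minimal sketch of how such a loss could consume an externally created scheduler, with a toy stand-in for the `diffusers` scheduler so nothing here imports `diffusers` (all names, including `loss_noise_scheduler` and `diffusion_loss`, are hypothetical and not the PR's actual API):

```python
import numpy as np

class ToyScheduler:
    """Stand-in for a diffusers noise scheduler (only `add_noise` is sketched)."""
    def __init__(self, num_train_timesteps=1000):
        self.num_train_timesteps = num_train_timesteps
        betas = np.linspace(1e-4, 0.02, num_train_timesteps)
        self.alphas_cumprod = np.cumprod(1.0 - betas)

    def add_noise(self, latents, noise, timesteps):
        # Standard DDPM forward process: sqrt(a_t) * x + sqrt(1 - a_t) * eps
        a = self.alphas_cumprod[timesteps].reshape(-1, 1)
        return np.sqrt(a) * latents + np.sqrt(1.0 - a) * noise

def diffusion_loss(latents, predict_noise, acoustic_loss_mask, loss_noise_scheduler):
    """Hypothetical loss: noise the latents at random timesteps and regress the
    added noise, averaging only over positions selected by the mask."""
    rng = np.random.default_rng(0)
    timesteps = rng.integers(0, loss_noise_scheduler.num_train_timesteps, latents.shape[0])
    noise = rng.standard_normal(latents.shape)
    noisy = loss_noise_scheduler.add_noise(latents, noise, timesteps)
    pred = predict_noise(noisy, timesteps)
    err = (pred - noise) ** 2
    return err[acoustic_loss_mask].mean()
```

With this shape, the model never needs `diffusers` at import time: the scheduler only has to expose the small surface the loss actually calls (`add_noise`, `num_train_timesteps`).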
# NOTE (ebezzam) original hardcodes scaling within sampling: https://github.com/pengzhiliang/transformers/blob/6e6e60fb95ca908feb0b039483adcc009809f579/src/transformers/models/vibevoice/modular_vibevoice_tokenizer.py#L963
# scaling moved here in case future implementations modify `vae_std` but keep internal scaling
self.vae_std = vae_std
self.vae_scaling_factor = vae_scaling_factor
Thoughts?
One alternative is to remove `vae_scaling_factor` altogether and default `vae_std` to 0.625 (= 0.5 / 0.8), which is essentially the resulting value from their hardcoding.
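A quick numeric check of that claim, assuming the two hardcoded factors in the linked code are 0.5 and 0.8 (variable names illustrative):

```python
# If the tokenizer scales latents by vae_std and then divides by an internal
# hardcoded factor, folding both into a single vae_std gives the same result.
vae_std = 0.5
internal_scaling = 0.8  # assumed hardcoded value in the original implementation

folded_vae_std = vae_std / internal_scaling
print(folded_vae_std)  # 0.625
```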
src/transformers/models/vibevoice_semantic_tokenizer/modular_vibevoice_semantic_tokenizer.py
def require_diffusers(test_case):
    """
    Decorator marking a test that requires diffusers
    """
    return unittest.skipUnless(is_diffusers_available(), "test requires diffusers")(test_case)
In the end, this is needed for the integration test to be able to use a noise scheduler.
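For context, a self-contained sketch of how the decorator would be used on such an integration test (the availability check is inlined here so the snippet runs standalone; the test class and method names are illustrative, not from the PR):

```python
import unittest

# Stand-in for transformers.testing_utils.is_diffusers_available:
# just probe the import so the sketch is self-contained.
def is_diffusers_available():
    try:
        import diffusers  # noqa: F401
        return True
    except ImportError:
        return False

def require_diffusers(test_case):
    """Decorator marking a test that requires diffusers."""
    return unittest.skipUnless(is_diffusers_available(), "test requires diffusers")(test_case)

class VibeVoiceIntegrationTest(unittest.TestCase):
    @require_diffusers
    def test_generation_with_noise_scheduler(self):
        ...  # would build a scheduler via diffusers, then run generation
```

When `diffusers` is missing, the test is reported as skipped rather than failing, mirroring the other `require_*` decorators in `testing_utils`.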
all_model_classes = (CsmForConditionalGeneration,) if is_torch_available() else ()

test_resize_embeddings = False
test_resize_embeddings_untied = False
I had originally copied these settings into my VibeVoice tests, but they needed to be removed after the tied-weights refactoring.
run-slow: vibevoice, vibevoice_acoustic_tokenizer, vibevoice_semantic_tokenizer

This comment contains models: ["models/vibevoice", "models/vibevoice_acoustic_tokenizer", "models/vibevoice_semantic_tokenizer"]

CI Results: Model CI Report ❌ Failed tests
run-slow: vibevoice, vibevoice_acoustic_tokenizer, vibevoice_semantic_tokenizer

This comment contains models: ["models/vibevoice", "models/vibevoice_acoustic_tokenizer", "models/vibevoice_semantic_tokenizer"]

CI Results: ✅ No failing test specific to this PR 🎉!
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, vibevoice, vibevoice_acoustic_tokenizer, vibevoice_semantic_tokenizer
What does this PR do?
Merges the model from https://github.com/microsoft/VibeVoice/tree/main

HF: https://huggingface.co/microsoft/VibeVoice-1.5B