Conversation

@pengzhiliang
Contributor

@ebezzam ebezzam left a comment

@pengzhiliang thanks for the PR! This is an exciting model to add 🔥

My first comments are mainly on rearranging content to be consistent with the other models in Transformers, and on creating a modular file to better reuse components from other models in Transformers.

There are also some other files to modify:

  • in src/transformers/models/auto
  • in docs
  • and eventually some tests (for which a lot of code can be copied from other models)

As an example of typical files to create/modify, you can check out the Qwen2.5-Omni PR, which is also multimodal.

@fakerybakery

Hopefully this PR can get merged; since the VibeVoice repo was deleted, it's unclear whether the original authors can continue contributing to this PR.

Galigator added a commit to Galigator/transformers that referenced this pull request Oct 8, 2025
@fakerybakery

Glad to see that someone has picked this PR up, thanks 🤗

Contributor

@ebezzam ebezzam left a comment

@eustlb a self-review with potential discussion points to hopefully help with your review!

Comment on lines +30 to +31
- [bezzam/VibeVoice-1.5B](https://huggingface.co/bezzam/VibeVoice-1.5B)
- [bezzam/VibeVoice-7B](https://huggingface.co/bezzam/VibeVoice-7B)
Contributor

To update

Comment on lines +72 to +73
model_id = "bezzam/VibeVoice-1.5Bv2"
# model_id = "bezzam/VibeVoice-7Bv2"
Contributor

To update in all examples

Comment on lines +509 to +515
### Training

TODO

### Full-graph compilation

TODO
Contributor

Todo if applicable


One key feature of VibeVoice is the use of two continuous speech tokenizers, one for extracting acoustic features (this model) and another for [semantic](./vibevoice_semantic_tokenizer) features.

A model checkpoint is available at [bezzam/VibeVoice-AcousticTokenizer](https://huggingface.co/bezzam/VibeVoice-AcousticTokenizer)
Contributor

To update


*Note: the semantic tokenizer can only be used to encode audio to extract semantic features.*

A model checkpoint is available at [bezzam/VibeVoice-SemanticTokenizer](https://huggingface.co/bezzam/VibeVoice-SemanticTokenizer)
Contributor

To update

Comment on lines +392 to +395
# TODO (ebezzam) original has an implementation which should be verified (and would need noise scheduler from `diffusers`):
# https://github.com/pengzhiliang/transformers/blob/6e6e60fb95ca908feb0b039483adcc009809f579/src/transformers/models/vibevoice/modeling_vibevoice.py#L407
if acoustic_loss_mask is not None:
    raise ValueError("Diffusion loss computation not implemented yet.")
Contributor

As said in the comment, computing the diffusion loss (for speech generation) isn't implemented. It would need the diffusers library for a noise scheduler. To avoid importing diffusers here, perhaps it could be an input (loss_noise_scheduler) passed along with the acoustic_loss_mask. Namely, a user would manually create a noise scheduler outside the model like here
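To make the suggestion concrete, below is a minimal stdlib sketch of the `add_noise` step that a DDPM-style scheduler (such as one from diffusers) performs, which is what the model would receive via the hypothetical `loss_noise_scheduler` argument. The function names and the linear beta schedule are illustrative assumptions, not the PR's actual implementation.

```python
import math

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule,
    the quantity a DDPM-style scheduler tracks internally."""
    alpha_bar, bars = 1.0, []
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        alpha_bar *= 1.0 - beta
        bars.append(alpha_bar)
    return bars

def add_noise(latent, noise, t, alpha_bars):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    i.e. what a scheduler's add_noise(x_0, eps, t) computes element-wise."""
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    return [a * x + b * e for x, e in zip(latent, noise)]
```

With a real scheduler, the caller would build it once outside the model (e.g. a diffusers DDPMScheduler) and pass it in alongside acoustic_loss_mask, keeping diffusers out of the modeling file.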

Comment on lines +121 to +124
# NOTE (ebezzam) original hardcodes scaling within sampling: https://github.com/pengzhiliang/transformers/blob/6e6e60fb95ca908feb0b039483adcc009809f579/src/transformers/models/vibevoice/modular_vibevoice_tokenizer.py#L963
# scaling moved here in case future implementations modify `vae_std` but keep internal scaling
self.vae_std = vae_std
self.vae_scaling_factor = vae_scaling_factor
Contributor

Thoughts?

One alternative is to remove vae_scaling_factor altogether and default vae_std to 0.625 (= 0.5 / 0.8), which is essentially the resulting value from their hardcoding.
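A tiny sanity check of that arithmetic (the exact placement of the two constants in sampling is a hypothetical here, based on the values in the discussion): dividing by vae_std = 0.5 and applying the hardcoded 0.8 factor is the same as dividing by a single effective std of 0.625.

```python
# Hypothetical sampling scale, with the two constants from the discussion.
vae_std, hardcoded_scale = 0.5, 0.8

latent = 1.2345
two_step = latent / vae_std * hardcoded_scale    # original: two constants
one_step = latent / (vae_std / hardcoded_scale)  # folded: effective std 0.625

assert vae_std / hardcoded_scale == 0.625
assert abs(two_step - one_step) < 1e-12
```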

Comment on lines +1193 to +1199
def require_diffusers(test_case):
    """
    Decorator marking a test that requires diffusers
    """
    return unittest.skipUnless(is_diffusers_available(), "test requires diffusers")(test_case)


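For context, usage of such a decorator would look like the following self-contained sketch; the is_diffusers_available stub stands in for the real check in transformers.utils, and the test body is hypothetical.

```python
import unittest

def is_diffusers_available():
    # Stand-in for the real availability check in transformers.utils.
    try:
        import diffusers  # noqa: F401
        return True
    except ImportError:
        return False

def require_diffusers(test_case):
    """Decorator marking a test that requires diffusers."""
    return unittest.skipUnless(is_diffusers_available(), "test requires diffusers")(test_case)

class NoiseSchedulerTest(unittest.TestCase):
    @require_diffusers
    def test_can_build_scheduler(self):
        # Only runs when diffusers is installed; skipped otherwise.
        from diffusers import DDPMScheduler
        self.assertIsNotNone(DDPMScheduler(num_train_timesteps=10))
```

When diffusers is missing, unittest marks the decorated test as skipped rather than failing, which is the behavior the integration tests rely on.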
Contributor

In the end, this is needed for the integration test to be able to use a noise scheduler.

cc @ArthurZucker

all_model_classes = (CsmForConditionalGeneration,) if is_torch_available() else ()

test_resize_embeddings = False
test_resize_embeddings_untied = False
Contributor

I had originally copied this setting in my VibeVoice tests, but it needed to be removed after the tied weights refactoring

@ebezzam
Contributor

ebezzam commented Dec 5, 2025

run-slow: vibevoice, vibevoice_acoustic_tokenizer, vibevoice_semantic_tokenizer

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

This comment contains run-slow, running the specified jobs:

models: ["models/vibevoice", "models/vibevoice_acoustic_tokenizer", "models/vibevoice_semantic_tokenizer"]
quantizations: []

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

CI Results

Workflow Run ⚙️

Model CI Report

❌ Failed tests

  • vibevoice_acoustic_tokenizer:
    tests/models/vibevoice_acoustic_tokenizer/test_modeling_vibevoice_acoustic_tokenizer.py::VibeVoiceAcousticTokenizerIntegrationTest::test_batch_integration

  • vibevoice_semantic_tokenizer:
    tests/models/vibevoice_semantic_tokenizer/test_modeling_vibevoice_semantic_tokenizer.py::VibeVoiceSemanticTokenizerIntegrationTest::test_batch_integration

@ebezzam
Contributor

ebezzam commented Dec 5, 2025

run-slow: vibevoice, vibevoice_acoustic_tokenizer, vibevoice_semantic_tokenizer

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

This comment contains run-slow, running the specified jobs:

models: ["models/vibevoice", "models/vibevoice_acoustic_tokenizer", "models/vibevoice_semantic_tokenizer"]
quantizations: []

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, vibevoice, vibevoice_acoustic_tokenizer, vibevoice_semantic_tokenizer


5 participants