Implement VibeVoice #40546
Conversation
ebezzam
left a comment
@pengzhiliang thanks for the PR! This is an exciting model to add 🔥
My first comments are mainly about rearranging content to be consistent with the other models in Transformers, and about creating a modular file so that components can be reused from other models in Transformers.
There are also some other files to modify:
- in `src/transformers/models/auto`
- in docs
- and eventually some tests (for which a lot of code can be copied from other models)

As an example of typical files to create/modify, you can check out the Qwen2.5-Omni PR, which is also multimodal.
Hopefully this PR can get merged; since the VibeVoice repo was deleted, it's not clear the original authors can continue contributing to this PR.

Glad to see that someone has picked this PR up, thanks 🤗
ebezzam
left a comment
@eustlb a self-review with potential discussion points to hopefully help with your review!
- [bezzam/VibeVoice-1.5B](https://huggingface.co/bezzam/VibeVoice-1.5B)
- [bezzam/VibeVoice-7B](https://huggingface.co/bezzam/VibeVoice-7B)
To update
model_id = "bezzam/VibeVoice-1.5Bv2"
# model_id = "bezzam/VibeVoice-7Bv2"
To update in all examples
### Training

TODO

### Full-graph compilation

TODO
Todo if applicable
One key feature of VibeVoice is the use of two continuous speech tokenizers, one for extracting acoustic features (this model) and another for [semantic](./vibevoice_semantic_tokenizer) features.

A model checkpoint is available at [bezzam/VibeVoice-AcousticTokenizer](https://huggingface.co/bezzam/VibeVoice-AcousticTokenizer)
To update
*Note: the semantic tokenizer can only be used to encode audio to extract semantic features.*

A model checkpoint is available at [bezzam/VibeVoice-SemanticTokenizer](https://huggingface.co/bezzam/VibeVoice-SemanticTokenizer)
To update
# TODO (ebezzam) original has an implementation which should be verified (and would need noise scheduler from `diffusers`):
# https://github.com/pengzhiliang/transformers/blob/6e6e60fb95ca908feb0b039483adcc009809f579/src/transformers/models/vibevoice/modeling_vibevoice.py#L407
if acoustic_loss_mask is not None:
    raise ValueError("Diffusion loss computation not implemented yet.")
As said in the comment, computing the diffusion loss (for speech generation) isn't implemented. It would need the `diffusers` library for a noise scheduler. To avoid importing `diffusers` here, perhaps it could be an input (`loss_noise_scheduler`) passed along with the `acoustic_loss_mask`; namely, the user would manually create a noise scheduler outside the model, like here
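A minimal sketch of how such a loss could consume an externally created scheduler, with a toy stand-in for the `diffusers` scheduler so nothing here imports `diffusers` (all names, including `loss_noise_scheduler` and `diffusion_loss`, are hypothetical and not the PR's actual API):

```python
import numpy as np

class ToyScheduler:
    """Stand-in for a diffusers noise scheduler (only `add_noise` is sketched)."""
    def __init__(self, num_train_timesteps=1000):
        self.num_train_timesteps = num_train_timesteps
        betas = np.linspace(1e-4, 0.02, num_train_timesteps)
        self.alphas_cumprod = np.cumprod(1.0 - betas)

    def add_noise(self, latents, noise, timesteps):
        # Standard DDPM forward process: sqrt(a_t) * x + sqrt(1 - a_t) * eps
        a = self.alphas_cumprod[timesteps].reshape(-1, 1)
        return np.sqrt(a) * latents + np.sqrt(1.0 - a) * noise

def diffusion_loss(latents, predict_noise, acoustic_loss_mask, loss_noise_scheduler):
    """Hypothetical loss: noise the latents at random timesteps and regress the
    added noise, averaging only over positions selected by the mask."""
    rng = np.random.default_rng(0)
    timesteps = rng.integers(0, loss_noise_scheduler.num_train_timesteps, latents.shape[0])
    noise = rng.standard_normal(latents.shape)
    noisy = loss_noise_scheduler.add_noise(latents, noise, timesteps)
    pred = predict_noise(noisy, timesteps)
    err = (pred - noise) ** 2
    return err[acoustic_loss_mask].mean()
```

With this shape, the model never needs `diffusers` at import time: the scheduler only has to expose the small surface the loss actually calls (`add_noise`, `num_train_timesteps`).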
# NOTE (ebezzam) original hardcodes scaling within sampling: https://github.com/pengzhiliang/transformers/blob/6e6e60fb95ca908feb0b039483adcc009809f579/src/transformers/models/vibevoice/modular_vibevoice_tokenizer.py#L963
# scaling moved here in case future implementations modify `vae_std` but keep internal scaling
self.vae_std = vae_std
self.vae_scaling_factor = vae_scaling_factor
Thoughts?
One alternative is to remove `vae_scaling_factor` altogether and default `vae_std` to 0.625 (= 0.5 / 0.8), which is essentially the resulting value from their hardcoding.
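A quick numeric check of that claim, assuming the two hardcoded factors in the linked code are 0.5 and 0.8 (variable names illustrative):

```python
# If the tokenizer scales latents by vae_std and then divides by an internal
# hardcoded factor, folding both into a single vae_std gives the same result.
vae_std = 0.5
internal_scaling = 0.8  # assumed hardcoded value in the original implementation

folded_vae_std = vae_std / internal_scaling
print(folded_vae_std)  # 0.625
```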
src/transformers/models/vibevoice_semantic_tokenizer/modular_vibevoice_semantic_tokenizer.py
def require_diffusers(test_case):
    """
    Decorator marking a test that requires diffusers
    """
    return unittest.skipUnless(is_diffusers_available(), "test requires diffusers")(test_case)
In the end, this is needed for the integration test to be able to use a noise scheduler.
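For context, a self-contained sketch of how the decorator would be used on such an integration test (the availability check is inlined here so the snippet runs standalone; the test class and method names are illustrative, not from the PR):

```python
import unittest

# Stand-in for transformers.testing_utils.is_diffusers_available:
# just probe the import so the sketch is self-contained.
def is_diffusers_available():
    try:
        import diffusers  # noqa: F401
        return True
    except ImportError:
        return False

def require_diffusers(test_case):
    """Decorator marking a test that requires diffusers."""
    return unittest.skipUnless(is_diffusers_available(), "test requires diffusers")(test_case)

class VibeVoiceIntegrationTest(unittest.TestCase):
    @require_diffusers
    def test_generation_with_noise_scheduler(self):
        ...  # would build a scheduler via diffusers, then run generation
```

When `diffusers` is missing, the test is reported as skipped rather than failing, mirroring the other `require_*` decorators in `testing_utils`.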
all_model_classes = (CsmForConditionalGeneration,) if is_torch_available() else ()

test_resize_embeddings = False
test_resize_embeddings_untied = False
I had originally copied these settings into my VibeVoice tests, but they needed to be removed after the tied-weights refactoring.
run-slow: vibevoice, vibevoice_acoustic_tokenizer, vibevoice_semantic_tokenizer

This comment contains models: ["models/vibevoice", "models/vibevoice_acoustic_tokenizer", "models/vibevoice_semantic_tokenizer"]

CI Results: Model CI Report ❌ Failed tests
run-slow: vibevoice, vibevoice_acoustic_tokenizer, vibevoice_semantic_tokenizer

This comment contains models: ["models/vibevoice", "models/vibevoice_acoustic_tokenizer", "models/vibevoice_semantic_tokenizer"]

CI Results: ✅ No failing test specific to this PR 🎉!
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, vibevoice, vibevoice_acoustic_tokenizer, vibevoice_semantic_tokenizer
What does this PR do?
Merges the model from https://github.com/microsoft/VibeVoice/tree/main

HF: https://huggingface.co/microsoft/VibeVoice-1.5B