
Fix CSM TextToAudioPipeline missing <bos> token #45525

Merged
Rocketknight1 merged 2 commits into huggingface:main from jiqing-feng:csm
Apr 20, 2026

Conversation

@jiqing-feng
Contributor

@jiqing-feng jiqing-feng commented Apr 20, 2026

What does this PR do?

CsmProcessor defaults to add_special_tokens=False (it is designed for apply_chat_template, which includes <bos> in its Jinja template). When the pipeline calls preprocessor(text) directly on raw text input, <bos> (128000) and <eos> (128001) are missing from the tokenized sequence. Without these tokens the model receives malformed input it was never trained on, making generation unstable: certain seed/sampling parameter combinations cause the model to emit all-zero codebook frames, which are treated as EOS (codebook_eos_token_id=0). The result is an empty audio tensor that crashes Mimi's Conv1d decoder.

Fix: set add_special_tokens=True for CSM in pipeline preprocess.
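
The effect can be sketched with a toy stand-in for the processor. The stub below is hypothetical (the real CsmProcessor wraps a full tokenizer); only the token ids 128000/128001 come from the description above:

```python
# Hypothetical stub illustrating why add_special_tokens matters for the
# pipeline's raw-text path: without it, <bos>/<eos> never reach the model.
BOS_ID, EOS_ID = 128000, 128001  # ids cited in the PR description

def stub_tokenize(text, add_special_tokens=False):
    """Stand-in for processor(text): maps each word to a fake id."""
    ids = [hash(w) % 1000 + 2 for w in text.split()]
    if add_special_tokens:
        ids = [BOS_ID] + ids + [EOS_ID]
    return ids

raw = stub_tokenize("Hello, my dog is cooler than you!")        # pipeline before the fix
fixed = stub_tokenize("Hello, my dog is cooler than you!",      # pipeline after the fix
                      add_special_tokens=True)
assert BOS_ID not in raw                        # malformed input reaches the model
assert fixed[0] == BOS_ID and fixed[-1] == EOS_ID
```

The chat-template path is unaffected, since its Jinja template already emits <bos> itself.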

Reproduction

from transformers import pipeline, set_seed

pipe = pipeline("text-to-speech", model="sesame/csm-1b")
set_seed(777)
output = pipe(
    "Hello, my dog is cooler than you!",
    forward_params={"do_sample": True, "temperature": 0.7, "top_k": 50, "top_p": 0.95},
)
# RuntimeError: Calculated padded input size per channel: (0). Kernel size: (1).
# Kernel size can't be greater than actual input size

Error traceback:

......
  File "/home/jiqing/transformers/src/transformers/models/csm/generation_csm.py", line 478, in generate
    codec_decode_output = self.codec_model.decode(audio_codes_batch.transpose(0, 1).unsqueeze(0))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/models/mimi/modeling_mimi.py", line 1666, in decode
    audio_values, decoder_past_key_values = self._decode_frame(
                                            ^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/models/mimi/modeling_mimi.py", line 1619, in _decode_frame
    embeddings = self.quantizer.decode(codes)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/models/mimi/modeling_mimi.py", line 1344, in decode
    quantized_out = self.semantic_residual_vector_quantizer.decode(codes[:, : self.num_semantic_quantizers])
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/models/mimi/modeling_mimi.py", line 1292, in decode
    quantized_out = self.output_proj(quantized_out)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 385, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 380, in _conv_forward
    return F.conv1d(
           ^^^^^^^^^
RuntimeError: Calculated padded input size per channel: (0). Kernel size: (1). Kernel size can't be greater than actual input size
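
The traceback bottoms out in F.conv1d receiving a zero-length input. Under that assumption, the final failure can be reproduced in isolation; the shapes here are illustrative, not Mimi's real dimensions:

```python
import torch
import torch.nn.functional as F

# Hypothetical minimal reproduction of the bottom frame of the traceback:
# Mimi's output projection is a kernel-size-1 convolution, and an empty
# audio-codes tensor hands it a zero-length time dimension.
weight = torch.randn(8, 4, 1)    # (out_channels, in_channels, kernel_size=1)
empty = torch.zeros(1, 4, 0)     # (batch, channels, 0 time steps)

try:
    F.conv1d(empty, weight)
    error = None
except RuntimeError as e:
    error = e  # "Kernel size can't be greater than actual input size"

print(error)
```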

Hi @Rocketknight1 . Would you please review this PR? Thanks!

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng jiqing-feng changed the title fix csm pipeline Fix CSM TextToAudioPipeline missing <bos> token Apr 20, 2026
@jiqing-feng jiqing-feng marked this pull request as ready for review April 20, 2026 06:23
Member

@Rocketknight1 Rocketknight1 left a comment


Yes, the fix makes sense, thank you!

@Rocketknight1 Rocketknight1 enabled auto-merge April 20, 2026 15:29
@Rocketknight1 Rocketknight1 added this pull request to the merge queue Apr 20, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Merged via the queue into huggingface:main with commit ce77bc3 Apr 20, 2026
16 checks passed
lvliang-intel pushed a commit to lvliang-intel/transformers that referenced this pull request Apr 21, 2026
fix csm pipeline

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
artem-spector pushed a commit to artem-spector/transformers that referenced this pull request Apr 21, 2026
fix csm pipeline

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>


3 participants