
Fix whisper return_language with return_timestamp=word #39938

Closed
Metric-Void wants to merge 5 commits into huggingface:main from Metric-Void:whisper-langtoken-fix

Conversation


@Metric-Void Metric-Void commented Aug 5, 2025

What does this PR do?

Fixes #39404.

Adds a switch to Whisper.generate() that allows preserving some special tokens, which are then stripped in retrieve_segments to ensure timestamp alignment.

Tested on short and long audio clips, in English, French, and Cantonese. Predictions and timestamps align, and the language is detected correctly.
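The alignment idea can be sketched with plain Python. This is an illustrative sketch only: the helper name, token strings, and timestamp values below are made up for demonstration and are not the actual transformers internals. The point is that timestamps are assigned while the language token is still present, and the token is stripped afterwards, so the remaining word/timestamp pairs stay aligned.

```python
# Illustrative sketch only: helper name, token strings, and timestamps are
# made up for demonstration and are not the actual transformers internals.
LANG_TOKENS = {"<|en|>", "<|fr|>", "<|yue|>"}

def strip_language_tokens(tokens, timestamps):
    """Drop language special tokens while keeping every remaining token
    paired with its original timestamp."""
    kept = [(tok, ts) for tok, ts in zip(tokens, timestamps) if tok not in LANG_TOKENS]
    return [tok for tok, _ in kept], [ts for _, ts in kept]

# A decoded segment with the language token still in place.
tokens = ["<|en|>", " I", " have", " a", " dream,"]
timestamps = [(0.0, 0.0), (0.0, 1.36), (1.36, 1.68), (1.68, 1.94), (1.94, 3.76)]

words, word_ts = strip_language_tokens(tokens, timestamps)
# words   -> [" I", " have", " a", " dream,"]
# word_ts -> [(0.0, 1.36), (1.36, 1.68), (1.68, 1.94), (1.94, 3.76)]
```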

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@eustlb @ebezzam

Local failed tests (WSL2, RUN_SLOW)

$ pytest tests/models/whisper
================================================================================================================= short test summary info ==================================================================================================================
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperModelTest::test_flex_attention_with_grads - torch._inductor.exc.InductorError: LoweringException: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpw7mv95z8/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpw7mv95z8/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', ...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperModelTest::test_sdpa_can_compile_dynamic - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp2jbthzzq/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmp2jbthzzq/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperEncoderModelTest::test_flex_attention_with_grads - torch._inductor.exc.InductorError: LoweringException: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpbszajy61/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpbszajy61/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', ...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperEncoderModelTest::test_sdpa_can_compile_dynamic - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpevr_eml0/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpevr_eml0/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_generate_compilation_all_outputs - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpghb4htrw/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpghb4htrw/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_generate_compile_model_forward - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpb3fj6t8c/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpb3fj6t8c/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_generate_from_inputs_embeds_with_static_cache - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp122w6v5o/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmp122w6v5o/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_generate_with_static_cache - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpee6hyznt/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpee6hyznt/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_sdpa_can_compile_dynamic - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpbz2lnr80/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpbz2lnr80/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_padding_side_in_kwargs - ImportError: 
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_tokenizer_initialization_with_conflicting_key - ImportError: 
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_tokenizer_mismatch_warning - ImportError: 
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_truncation_side_in_kwargs - ImportError: 
=========================================================================================== 13 failed, 445 passed, 295 skipped, 36 warnings in 166.72s (0:02:46) ===========================================================================================

I don't think any of these failures are related to this PR.


ebezzam commented Aug 14, 2025

@Metric-Void thanks for the PR!

Does the same example in #39404 (below), now return the expected timestamps and language? Could you share the output?

import torch
from transformers import pipeline
from transformers.configuration_utils import PretrainedConfig

pipeline = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-tiny",
    torch_dtype=torch.float16,
    config=PretrainedConfig(
      attn_implementation="flash_attention_2"
    )
)
result = pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac", return_language=True, return_timestamps='word')

result["chunks"]

Regarding errors, the ones you are getting are related to missing libraries. When I run the tests, I get the following:

# pytest tests/models/whisper
==== short test summary info ====
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperModelTest::test_multi_gpu_data_parallel_forward - TypeError: EncoderDecoderCache.__init__() missing 1 required positional argument: 'cross_attention_cache'
==== 1 failed, 467 passed, 285 skipped, 37 warnings in 418.89s (0:06:58) ========

It is consistent before and after your changes, so you haven't introduced any failing tests 👍

I would still wait for @eustlb's input on how to adjust Whisper's generate code.

@ebezzam ebezzam requested a review from eustlb August 14, 2025 11:23

Metric-Void commented Aug 14, 2025

@ebezzam Yes, here's the output. #39404 was my issue, so it only makes sense to confirm that this fixes it.

[{'text': ' I', 'timestamp': (0.0, 1.36), 'language': 'english'},
 {'text': ' have', 'timestamp': (1.36, 1.68), 'language': 'english'},
 {'text': ' a', 'timestamp': (1.68, 1.94), 'language': 'english'},
 {'text': ' dream,', 'timestamp': (1.94, 3.76), 'language': 'english'},
 {'text': ' but', 'timestamp': (3.76, 3.94), 'language': 'english'},
 {'text': ' one', 'timestamp': (3.94, 4.18), 'language': 'english'},
 {'text': ' day,', 'timestamp': (4.18, 6.16), 'language': 'english'},
 {'text': ' this', 'timestamp': (6.16, 6.58), 'language': 'english'},
 {'text': ' nation', 'timestamp': (6.58, 7.2), 'language': 'english'},
 {'text': ' will', 'timestamp': (7.2, 7.82), 'language': 'english'},
 {'text': ' rise', 'timestamp': (7.82, 8.3), 'language': 'english'},
 {'text': ' up,', 'timestamp': (8.3, 10.18), 'language': 'english'},
 {'text': ' live', 'timestamp': (10.18, 10.56), 'language': 'english'},
 {'text': ' out', 'timestamp': (10.56, 10.98), 'language': 'english'},
 {'text': ' the', 'timestamp': (10.98, 11.02), 'language': 'english'},
 {'text': ' true', 'timestamp': (11.02, 11.3), 'language': 'english'},
 {'text': ' meaning', 'timestamp': (11.3, 11.6), 'language': 'english'},
 {'text': ' of', 'timestamp': (11.6, 11.84), 'language': 'english'},
 {'text': ' its', 'timestamp': (11.84, 12.08), 'language': 'english'},
 {'text': ' dream.', 'timestamp': (12.54, 12.98), 'language': 'english'}]

More tests in https://gist.github.com/Metric-Void/79f7fcecc432d0e648af0fd896b5016a. That said, it seems Whisper (at least the tiny model) does not predict additional language tokens when the language changes.

For the long canterville.ogg, I diff'd the outputs before and after the fix. The only change is the addition of language tags.

I'm not sure whether I should add tests for this use case. There used to be such a test, but it was removed later.

def test_return_timestamps_and_language_in_preprocess(self):
    pipe = pipeline(
        task="automatic-speech-recognition",
        model="openai/whisper-tiny",
        chunk_length_s=8,
        stride_length_s=1,
        return_language=True,
    )
    data = load_dataset("openslr/librispeech_asr", "clean", split="test", streaming=True, trust_remote_code=True)
    sample = next(iter(data))
    res = pipe(sample["audio"]["array"])
    self.assertEqual(
        res,
        {
            "text": " Conquered returned to its place amidst the tents.",
            "chunks": [{"language": "english", "text": " Conquered returned to its place amidst the tents."}],
        },
    )
    res = pipe(sample["audio"]["array"], return_timestamps=True)
    self.assertEqual(
        res,
        {
            "text": " Conquered returned to its place amidst the tents.",
            "chunks": [
                {
                    "timestamp": (0.0, 3.36),
                    "language": "english",
                    "text": " Conquered returned to its place amidst the tents.",
                }
            ],
        },
    )
    res = pipe(sample["audio"]["array"], return_timestamps="word")
    # fmt: off
    self.assertEqual(
        res,
        {
            'text': ' Conquered returned to its place amidst the tents.',
            'chunks': [
                {"language": "english", 'text': ' Conquered', 'timestamp': (0.5, 1.2)},
                {"language": "english", 'text': ' returned', 'timestamp': (1.2, 1.64)},
                {"language": "english", 'text': ' to', 'timestamp': (1.64, 1.84)},
                {"language": "english", 'text': ' its', 'timestamp': (1.84, 2.02)},
                {"language": "english", 'text': ' place', 'timestamp': (2.02, 2.28)},
                {"language": "english", 'text': ' amidst', 'timestamp': (2.28, 2.8)},
                {"language": "english", 'text': ' the', 'timestamp': (2.8, 2.98)},
                {"language": "english", 'text': ' tents.', 'timestamp': (2.98, 3.48)},
            ],
        },
    )


@ebezzam ebezzam left a comment


thanks @Metric-Void for sharing the outputs and tests!

Could you add some of your tests to test_modeling_whisper.py so that we don't get this problem again? Thanks 👍

Comment thread on src/transformers/pipelines/automatic_speech_recognition.py
@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: whisper

@Metric-Void

I've added tests to test_pipelines_automatic_speech_recognition.py, since this feature depends on being called through the pipeline. That's also where the test originally lived.

I also added comments explaining why there are two tokens.

@Metric-Void Metric-Void requested a review from ebezzam August 20, 2025 20:10

@eustlb eustlb left a comment


Hey @Metric-Void, thanks for the work! 🤗

Actually, adding such a parameter isn’t necessary since the decoder input ids can be retrieved from tokens['segments'][0][0]['result']['sequences']. I’m strongly against adding it, as a lot of effort and thorough testing already went into fixing the Whisper generation logic and ensuring a 1-to-1 correspondence with the OAI implementation.

As you noticed, language changes aren’t detected because only the first 30 seconds of the input are used for language detection. Would you mind reworking the logic to remove changes to generation_whisper.py and instead handle the decoder input IDs directly as mentioned above?

If you prefer, I can also quickly open a PR to supersede this one and add you as a co-author.
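For reference, the access pattern described above can be sketched against a mock output structure. The nesting below simply mirrors the quoted path; the token id values are placeholders for illustration, not verified Whisper ids:

```python
# Mock of the generation output structure, showing only the nesting needed
# to reach tokens['segments'][0][0]['result']['sequences'].
# The id values below are placeholders, not verified Whisper token ids.
tokens = {
    "segments": [
        [  # batch item 0
            {
                "result": {
                    "sequences": [[50258, 50259, 50359, 50363]],  # placeholder ids
                }
            }
        ]
    ]
}

# The access path quoted in the review comment:
decoder_input_ids = tokens["segments"][0][0]["result"]["sequences"]
```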


eustlb commented Oct 6, 2025

@Metric-Void any updates on this?

mavibirdesmi added a commit to mavibirdesmi/transformers that referenced this pull request Oct 28, 2025
mavibirdesmi added a commit to mavibirdesmi/transformers that referenced this pull request Oct 28, 2025

FredHaa commented Nov 16, 2025

Since no progress has been made in this PR, I have created a new one that fixes the issue without touching generation_whisper.py, as requested by @eustlb:

#42227

@Metric-Void

Thank you. I couldn't find a way to make the modification without changing the pipeline, which would risk compatibility with pipelines that don't have these two switches enabled.

@FredHaa FredHaa mentioned this pull request Jan 15, 2026

Successfully merging this pull request may close these issues.

Whisper return_language with pipeline no longer working

4 participants