
Fix whisper return_language with return_timestamp=word #39938

Closed
Metric-Void wants to merge 5 commits into huggingface:main from Metric-Void:whisper-langtoken-fix

Conversation


@Metric-Void Metric-Void commented Aug 5, 2025

What does this PR do?

Fixes #39404.

Adds a switch to Whisper.generate() that allows preserving some special tokens, which are then stripped in retrieve_segments to ensure timestamp alignment.

Tested on short and long audio clips, in English, French, and Cantonese. Predictions and timestamps align, and the language is detected correctly.
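The alignment idea can be sketched with plain Python. This is an illustrative sketch only: the helper name, token strings, and timestamp values below are made up for demonstration and are not the actual transformers internals. The point is that timestamps are assigned while the language token is still present, and the token is stripped afterwards, so the remaining word/timestamp pairs stay aligned.

```python
# Illustrative sketch only: helper name, token strings, and timestamps are
# made up for demonstration and are not the actual transformers internals.
LANG_TOKENS = {"<|en|>", "<|fr|>", "<|yue|>"}

def strip_language_tokens(tokens, timestamps):
    """Drop language special tokens while keeping every remaining token
    paired with its original timestamp."""
    kept = [(tok, ts) for tok, ts in zip(tokens, timestamps) if tok not in LANG_TOKENS]
    return [tok for tok, _ in kept], [ts for _, ts in kept]

# A decoded segment with the language token still in place.
tokens = ["<|en|>", " I", " have", " a", " dream,"]
timestamps = [(0.0, 0.0), (0.0, 1.36), (1.36, 1.68), (1.68, 1.94), (1.94, 3.76)]

words, word_ts = strip_language_tokens(tokens, timestamps)
# words   -> [" I", " have", " a", " dream,"]
# word_ts -> [(0.0, 1.36), (1.36, 1.68), (1.68, 1.94), (1.94, 3.76)]
```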

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@eustlb @ebezzam

Local failed tests (WSL2, RUN_SLOW)

$ pytest tests/models/whisper
================================================================================================================= short test summary info ==================================================================================================================
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperModelTest::test_flex_attention_with_grads - torch._inductor.exc.InductorError: LoweringException: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpw7mv95z8/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpw7mv95z8/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', ...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperModelTest::test_sdpa_can_compile_dynamic - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp2jbthzzq/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmp2jbthzzq/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperEncoderModelTest::test_flex_attention_with_grads - torch._inductor.exc.InductorError: LoweringException: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpbszajy61/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpbszajy61/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', ...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperEncoderModelTest::test_sdpa_can_compile_dynamic - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpevr_eml0/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpevr_eml0/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_generate_compilation_all_outputs - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpghb4htrw/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpghb4htrw/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_generate_compile_model_forward - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpb3fj6t8c/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpb3fj6t8c/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_generate_from_inputs_embeds_with_static_cache - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp122w6v5o/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmp122w6v5o/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_generate_with_static_cache - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpee6hyznt/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpee6hyznt/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_sdpa_can_compile_dynamic - torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpbz2lnr80/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpbz2lnr80/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/metricvoid...
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_padding_side_in_kwargs - ImportError: 
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_tokenizer_initialization_with_conflicting_key - ImportError: 
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_tokenizer_mismatch_warning - ImportError: 
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_truncation_side_in_kwargs - ImportError: 
=========================================================================================== 13 failed, 445 passed, 295 skipped, 36 warnings in 166.72s (0:02:46) ===========================================================================================

I don't think any of these failures are related to this PR.


ebezzam commented Aug 14, 2025

@Metric-Void thanks for the PR!

Does the same example in #39404 (below), now return the expected timestamps and language? Could you share the output?

import torch
from transformers import pipeline
from transformers.configuration_utils import PretrainedConfig

pipeline = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-tiny",
    torch_dtype=torch.float16,
    config=PretrainedConfig(
      attn_implementation="flash_attention_2"
    )
)
result = pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac", return_language=True, return_timestamps='word')

result["chunks"]

Regarding errors, the ones you are getting are related to missing libraries. When I run the tests, I get the following:

# pytest tests/models/whisper
==== short test summary info ====
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperModelTest::test_multi_gpu_data_parallel_forward - TypeError: EncoderDecoderCache.__init__() missing 1 required positional argument: 'cross_attention_cache'
==== 1 failed, 467 passed, 285 skipped, 37 warnings in 418.89s (0:06:58) ========

It is consistent before and after your changes, so you haven't introduced any failing tests 👍

I would still wait for @eustlb's input on how to adjust Whisper's generate code.

@ebezzam ebezzam requested a review from eustlb August 14, 2025 11:23

Metric-Void commented Aug 14, 2025

@ebezzam Yes, here's the output. #39404 was my issue, so it only makes sense to confirm that this fixes it.

[{'text': ' I', 'timestamp': (0.0, 1.36), 'language': 'english'},
 {'text': ' have', 'timestamp': (1.36, 1.68), 'language': 'english'},
 {'text': ' a', 'timestamp': (1.68, 1.94), 'language': 'english'},
 {'text': ' dream,', 'timestamp': (1.94, 3.76), 'language': 'english'},
 {'text': ' but', 'timestamp': (3.76, 3.94), 'language': 'english'},
 {'text': ' one', 'timestamp': (3.94, 4.18), 'language': 'english'},
 {'text': ' day,', 'timestamp': (4.18, 6.16), 'language': 'english'},
 {'text': ' this', 'timestamp': (6.16, 6.58), 'language': 'english'},
 {'text': ' nation', 'timestamp': (6.58, 7.2), 'language': 'english'},
 {'text': ' will', 'timestamp': (7.2, 7.82), 'language': 'english'},
 {'text': ' rise', 'timestamp': (7.82, 8.3), 'language': 'english'},
 {'text': ' up,', 'timestamp': (8.3, 10.18), 'language': 'english'},
 {'text': ' live', 'timestamp': (10.18, 10.56), 'language': 'english'},
 {'text': ' out', 'timestamp': (10.56, 10.98), 'language': 'english'},
 {'text': ' the', 'timestamp': (10.98, 11.02), 'language': 'english'},
 {'text': ' true', 'timestamp': (11.02, 11.3), 'language': 'english'},
 {'text': ' meaning', 'timestamp': (11.3, 11.6), 'language': 'english'},
 {'text': ' of', 'timestamp': (11.6, 11.84), 'language': 'english'},
 {'text': ' its', 'timestamp': (11.84, 12.08), 'language': 'english'},
 {'text': ' dream.', 'timestamp': (12.54, 12.98), 'language': 'english'}]

More tests in https://gist.github.com/Metric-Void/79f7fcecc432d0e648af0fd896b5016a. That said, it seems Whisper (at least the tiny model) does not predict additional language tokens when the language changes.

For the long canterville.ogg, I diff'd the outputs before and after the fix. The only change is the addition of language tags.

I'm not sure whether I should add tests for this use case. There used to be such a test, but it was removed later.

def test_return_timestamps_and_language_in_preprocess(self):
    pipe = pipeline(
        task="automatic-speech-recognition",
        model="openai/whisper-tiny",
        chunk_length_s=8,
        stride_length_s=1,
        return_language=True,
    )
    data = load_dataset("openslr/librispeech_asr", "clean", split="test", streaming=True, trust_remote_code=True)
    sample = next(iter(data))
    res = pipe(sample["audio"]["array"])
    self.assertEqual(
        res,
        {
            "text": " Conquered returned to its place amidst the tents.",
            "chunks": [{"language": "english", "text": " Conquered returned to its place amidst the tents."}],
        },
    )
    res = pipe(sample["audio"]["array"], return_timestamps=True)
    self.assertEqual(
        res,
        {
            "text": " Conquered returned to its place amidst the tents.",
            "chunks": [
                {
                    "timestamp": (0.0, 3.36),
                    "language": "english",
                    "text": " Conquered returned to its place amidst the tents.",
                }
            ],
        },
    )
    res = pipe(sample["audio"]["array"], return_timestamps="word")
    # fmt: off
    self.assertEqual(
        res,
        {
            'text': ' Conquered returned to its place amidst the tents.',
            'chunks': [
                {"language": "english", 'text': ' Conquered', 'timestamp': (0.5, 1.2)},
                {"language": "english", 'text': ' returned', 'timestamp': (1.2, 1.64)},
                {"language": "english", 'text': ' to', 'timestamp': (1.64, 1.84)},
                {"language": "english", 'text': ' its', 'timestamp': (1.84, 2.02)},
                {"language": "english", 'text': ' place', 'timestamp': (2.02, 2.28)},
                {"language": "english", 'text': ' amidst', 'timestamp': (2.28, 2.8)},
                {"language": "english", 'text': ' the', 'timestamp': (2.8, 2.98)},
                {"language": "english", 'text': ' tents.', 'timestamp': (2.98, 3.48)},
            ],
        },
    )


@ebezzam ebezzam left a comment


thanks @Metric-Void for sharing the outputs and tests!

Could you add some of your tests to test_modeling_whisper.py so that we don't get this problem again? Thanks 👍

Comment thread on src/transformers/pipelines/automatic_speech_recognition.py
@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: whisper

@Metric-Void

I've added tests to test_pipelines_automatic_speech_recognition.py, since this feature depends on being called through the pipeline. That's also where the test originally lived.

I also added comments explaining why there are two tokens.

@Metric-Void Metric-Void requested a review from ebezzam August 20, 2025 20:10

@eustlb eustlb left a comment


Hey @Metric-Void, thanks for the work! 🤗

Actually, adding such a parameter isn’t necessary since the decoder input ids can be retrieved from tokens['segments'][0][0]['result']['sequences']. I’m strongly against adding it, as a lot of effort and thorough testing already went into fixing the Whisper generation logic and ensuring a 1-to-1 correspondence with the OAI implementation.

As you noticed, language changes aren’t detected because only the first 30 seconds of the input are used for language detection. Would you mind reworking the logic to remove changes to generation_whisper.py and instead handle the decoder input IDs directly as mentioned above?

If you prefer, I can also quickly open a PR to supersede this one and add you as a co-author.
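For reference, the access pattern described above can be sketched against a mock output structure. The nesting below simply mirrors the quoted path; the token id values are placeholders for illustration, not verified Whisper ids:

```python
# Mock of the generation output structure, showing only the nesting needed
# to reach tokens['segments'][0][0]['result']['sequences'].
# The id values below are placeholders, not verified Whisper token ids.
tokens = {
    "segments": [
        [  # batch item 0
            {
                "result": {
                    "sequences": [[50258, 50259, 50359, 50363]],  # placeholder ids
                }
            }
        ]
    ]
}

# The access path quoted in the review comment:
decoder_input_ids = tokens["segments"][0][0]["result"]["sequences"]
```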


eustlb commented Oct 6, 2025

@Metric-Void any updates on this?

mavibirdesmi added a commit to mavibirdesmi/transformers that referenced this pull request Oct 28, 2025
mavibirdesmi added a commit to mavibirdesmi/transformers that referenced this pull request Oct 28, 2025

FredHaa commented Nov 16, 2025

Since no progress has been made in this PR, I have created a new one that fixes the issue without touching generation_whisper.py, as requested by @eustlb:

#42227

@Metric-Void

Thank you. I couldn't find a way to make the modification without changing the pipeline, which would risk compatibility with pipelines that don't have these two switches enabled.

@FredHaa FredHaa mentioned this pull request Jan 15, 2026

Successfully merging this pull request may close these issues.

Whisper return_language with pipeline no longer working

4 participants