[Whisper] 🚨 Fix pipeline word timestamp: timestamp token is end of token time !!! #36632
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the `Ready for review` button (at the bottom of the PR page).
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I applied the mentioned fixes in this PR on transformers==4.49.0, but the issue persists:
ArthurZucker
left a comment
I would rather put this breaking change in the next release!
Otherwise it would be nice to have a "visual" example of the fix (to just see a string of timestamps), but LGTM!
```python
warnings.warn(
    f"`return_token_timestamps` is deprecated for {self.__class__.__name__} and will be removed in Transformers v5. Use `return_attention_mask` instead, as the number of frames can be inferred from it."
)
```
Use `logger.warning_once` please.
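For reference, a minimal sketch of the suggested pattern (using transformers' logging utilities, with the message shortened):

```python
from transformers.utils import logging

logger = logging.get_logger(__name__)

# emits the warning once per process instead of on every call
logger.warning_once(
    "`return_token_timestamps` is deprecated and will be removed in Transformers v5. "
    "Use `return_attention_mask` instead, as the number of frames can be inferred from it."
)
```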
```python
else:
    generation_config.num_frames = torch.tensor(generation_config.num_frames)

warnings.warn(
```
Why use both `warnings` and `logger`? haha, let's also only warn once, and warn as little as possible.
Please merge, I can't run Whisper on any of my large files.
Tested on my machine, this does fix the issue.
Addressed the logging comments + added a visual example of the fix to the PR's top comment @ArthurZucker 😊
cc @Cyrilvallez, I've addressed Arthur's comments, can you approve the PR please?
Cyrilvallez
left a comment
LGTM on principle, but looks like whisper tokenization tests are not happy. You probably need to fix the tests there to reflect the new behavior! 🤗
Cyrilvallez
left a comment
Yes, LGTM! Time to ship it! 🤗🚀
[Whisper] 🚨 Fix pipeline word timestamp: timestamp token is end of token time !!! (huggingface#36632)

* timestamp token is end of token time !!!
* ensure correct alignment between tokens and timestamp tokens
* ignore input tokens for DTW computation
* use num_frames to avoid token timestamp hallucinations
* token timestamps test updates !
* num_frames: deprecate and use attention_mask instead
* avoid breaking change
* fix the pipeline usage for chunk approach
* make style
* better logging
* better logging
* make style
* update tests with correct values
Fixes #33552, #36228
What is happening?
After a careful review of the original OpenAI Whisper codebase:

1️⃣ Context
OpenAI’s implementation follows a slightly different approach:
• First, they compute text tokens.
• Then, they redo a forward pass to retrieve cross-attention weights (which is inefficient—hence our different approach).
• Their forward pass takes as input: `[SOT sequence + no_timestamps token + all text tokens + EOS token]`
• A hook retrieves cross-attention weights, meaning each token gets its cross-attention values (shape: `[num_heads, 1500]`).
• After scaling operations and dynamic time warping, they compute alignments between each token and its corresponding audio sequence index (a value between 0 and 1499).
• Since each audio frame index represents 0.02 s of audio, it can be mapped to a timestamp (see the sketch below).
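To make the mapping concrete, here is a toy sketch (the frame indices are made up): the encoder emits 1500 frames for 30 s of audio, so one frame covers 0.02 s.

```python
# one encoder frame covers 30 s / 1500 frames = 0.02 s of audio
TIME_PER_FRAME = 30.0 / 1500

def frames_to_timestamps(frame_indices):
    """Map DTW-aligned audio frame indices (0..1499) to timestamps in seconds."""
    return [round(i * TIME_PER_FRAME, 2) for i in frame_indices]

# hypothetical DTW output: the frame where each text token ends
print(frames_to_timestamps([0, 42, 57, 110]))  # [0.0, 0.84, 1.14, 2.2]
```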
2️⃣ The important part
• These timestamp values are used as end-of-word times when merging tokens into words.
• Each timestamp represents the timing for the end of a token.
• But wait—how do they determine both start and end times when boundaries require N+1 timestamps for N tokens? 🤔
• Simple: they retrieve timestamps for the N text tokens and use the `no_timestamps` token as a boundary marker (which always ends up as 0.0 s); see the sketch below.
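A toy sketch (all values hypothetical) of how the N + 1 boundaries yield start and end times for N tokens, with the `no_timestamps` token contributing the leading 0.0 s boundary:

```python
tokens = ["Hello", " world"]     # N = 2 text tokens
boundaries = [0.0, 0.48, 0.92]   # N + 1 boundaries: [no_timestamps, end(tok 1), end(tok 2)]

words = [
    {"word": tok, "start": boundaries[i], "end": boundaries[i + 1]}
    for i, tok in enumerate(tokens)
]
print(words)
# [{'word': 'Hello', 'start': 0.0, 'end': 0.48}, {'word': ' world', 'start': 0.48, 'end': 0.92}]
```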
What is incorrect in our implementation?
Our pipeline incorrectly treated timestamp tokens as start times instead of end times.
Moreover, `token_timestamps` are not correctly aligned in the current implementation: the last `jump_times` entry (corresponding to a token timestamp) ends up associated with the EOS token, while by design the EOS token cannot have a token timestamp (remember that we do not have access to its cross-attention weights). Instead, it is better to replicate the last predicted token timestamp for the EOS token. This is not equivalent to the current implementation, where token timestamps are taken as start times and are misaligned by one position relative to the tokens; that setup loses the last token timestamp (it is associated with the EOS token and then cut out when concatenating sequences). Also, we take the `decoder_input_ids` into account when computing the DTW, which can negatively impact its precision, while the OpenAI implementation doesn't.
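A minimal sketch of the corrected handling (tensor values are made up): since the EOS token has no cross-attention weights of its own, we replicate the last predicted token's timestamp instead of shifting everything by one and losing it:

```python
import torch

# DTW end times for 3 generated text tokens (hypothetical values)
jump_times = torch.tensor([0.30, 0.52, 0.96])

# the EOS token gets no timestamp from DTW: reuse the last predicted token's value
token_timestamps = torch.cat([jump_times, jump_times[-1:]])
print(token_timestamps)  # tensor([0.3000, 0.5200, 0.9600, 0.9600])
```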
The fix this PR brings
This PR fixes that by:
• treating timestamp tokens as end-of-token times and ensuring correct alignment between tokens and token timestamps
• replicating the last predicted token timestamp for the EOS token
• ignoring the `decoder_input_ids` when computing DTW and setting them as 0.0 s

Other changes
The `num_frames` kwarg is broken (and was not documented anyway)! The kwarg `return_token_timestamps` for the processor, which was supposed to add `num_frames` as a kwarg for Whisper's `generate`, is not useful! The `attention_mask` can be used to infer the number of frames, and IMO it is not good practice to silently require `return_token_timestamps=True` for the processor (it is not mentioned in Whisper's `generate` doc) in order to get precise token timestamps in `generate` (and this is even why our test was not correct 🫠). Instead, we want the user to pass the attention mask to use `return_token_timestamps`, and warn if they do not!
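A hedged usage sketch of the intended pattern (the model checkpoint and the `audio_array` variable are placeholders): ask the processor for the attention mask and pass it to `generate` so the number of frames can be inferred from it:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# audio_array: a 1-D float array of 16 kHz samples (placeholder)
inputs = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
    return_attention_mask=True,  # instead of relying on return_token_timestamps
)

out = model.generate(
    inputs.input_features,
    attention_mask=inputs.attention_mask,
    return_token_timestamps=True,  # per-token timestamps computed during generate
)
print(out["token_timestamps"])
```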
🚨 Changes for the user
This is kinda breaking: token timestamps are now aligned with the tokens and represent the end time of each token, while before they were all shifted by one and represented the start time of each token.
snippet
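A minimal sketch of the kind of pipeline call affected (checkpoint and audio file are placeholders):

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# "sample.wav" is a placeholder; word timestamps come back as (start, end) tuples
out = pipe("sample.wav", chunk_length_s=30, return_timestamps="word")
for chunk in out["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```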
current output with main
output with this PR