[Whisper] 🚨 Fix pipeline word timestamp: timestamp token is end of token time !!! #36632
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the `Ready for review` button (at the bottom of the PR page).
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I applied the mentioned fixes in this PR on transformers==4.49.0, but the issue persists:
ArthurZucker
left a comment
I would rather put this breaking change in the next release!
Otherwise it would be nice to have a "visual" example of the fix (to just see a string of timestamps), but LGTM!
```python
warnings.warn(
    f"`return_token_timestamps` is deprecated for {self.__class__.__name__} and will be removed in Transformers v5. Use `return_attention_mask` instead, as the number of frames can be inferred from it."
)
```
Use `logger.warning_once` please.
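For reference, a minimal sketch of the suggested pattern (using transformers' logging utilities, with the message shortened):

```python
from transformers.utils import logging

logger = logging.get_logger(__name__)

# emits the warning once per process instead of on every call
logger.warning_once(
    "`return_token_timestamps` is deprecated and will be removed in Transformers v5. "
    "Use `return_attention_mask` instead, as the number of frames can be inferred from it."
)
```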
```python
else:
    generation_config.num_frames = torch.tensor(generation_config.num_frames)

warnings.warn(
```
Why use both `warnings` and `logger`? haha, let's also only warn once, and warn as little as possible.
Please merge, I can't run Whisper on any of my large files.
Tested on my machine, this does fix the issue.
Addressed the logging comments + added a visual example of the fix to the PR's top comment @ArthurZucker 😊
cc @Cyrilvallez, I've addressed Arthur's comments, can you approve the PR please?
Cyrilvallez
left a comment
LGTM on principle, but looks like whisper tokenization tests are not happy. You probably need to fix the tests there to reflect the new behavior! 🤗
Cyrilvallez
left a comment
Yes, LGTM! Time to ship it! 🤗🚀
[Whisper] 🚨 Fix pipeline word timestamp: timestamp token is end of token time !!! (huggingface#36632)

* timestamp token is end of token time !!!
* ensure correct alignment between tokens and timestamp tokens
* ignore input tokens for DTW computation
* use num_frames to avoid token timestamp hallucinations
* token timestamps test updates !
* num_frames: deprecate and use attention_mask instead
* avoid breaking change
* fix the pipeline usage for chunk approach
* make style
* better logging
* better logging
* make style
* update tests with correct values
Fixes #33552, #36228
What is happening?
After a careful review of the original OpenAI Whisper codebase:

1️⃣ Context
OpenAI’s implementation follows a slightly different approach:
• First, they compute text tokens.
• Then, they redo a forward pass to retrieve cross-attention weights (which is inefficient—hence our different approach).
• Their forward pass takes as input: `[SOT sequence + no_timestamps token + all text tokens + EOS token]`
• A hook retrieves cross-attention weights, meaning each token gets its cross-attention values (shape: `[num_heads, 1500]`).
• After scaling operations and dynamic time warping, they compute alignments between each token and its corresponding audio sequence index (a value between 0 and 1499).
• Since each audio frame index represents 0.02 s of audio, it can be mapped to a timestamp (see the sketch below).
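To make the mapping concrete, here is a toy sketch (the frame indices are made up): the encoder emits 1500 frames for 30 s of audio, so one frame covers 0.02 s.

```python
# one encoder frame covers 30 s / 1500 frames = 0.02 s of audio
TIME_PER_FRAME = 30.0 / 1500

def frames_to_timestamps(frame_indices):
    """Map DTW-aligned audio frame indices (0..1499) to timestamps in seconds."""
    return [round(i * TIME_PER_FRAME, 2) for i in frame_indices]

# hypothetical DTW output: the frame where each text token ends
print(frames_to_timestamps([0, 42, 57, 110]))  # [0.0, 0.84, 1.14, 2.2]
```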
2️⃣ The important part
• These timestamp values are used as end-of-word times when merging tokens into words.
• Each timestamp represents the timing for the end of a token.
• But wait—how do they determine both start and end times when boundaries require N+1 timestamps for N tokens? 🤔
• Simple: they retrieve timestamps for the N text tokens and use the `no_timestamps` token as a boundary marker (which always ends up as 0.0 s); see the sketch below.
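A toy sketch (all values hypothetical) of how the N + 1 boundaries yield start and end times for N tokens, with the `no_timestamps` token contributing the leading 0.0 s boundary:

```python
tokens = ["Hello", " world"]     # N = 2 text tokens
boundaries = [0.0, 0.48, 0.92]   # N + 1 boundaries: [no_timestamps, end(tok 1), end(tok 2)]

words = [
    {"word": tok, "start": boundaries[i], "end": boundaries[i + 1]}
    for i, tok in enumerate(tokens)
]
print(words)
# [{'word': 'Hello', 'start': 0.0, 'end': 0.48}, {'word': ' world', 'start': 0.48, 'end': 0.92}]
```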
What is incorrect in our implementation?
Our pipeline incorrectly treated timestamp tokens as start times instead of end times.
Moreover, `token_timestamps` are not correctly aligned in the current implementation: the last `jump_times` entry (corresponding to a token timestamp) ends up associated with the EOS token, while by design the EOS token cannot have a token timestamp (remember that we do not have access to its cross-attention weights). Instead, it is better to replicate the last predicted token timestamp for the EOS token. This is not equivalent to the current implementation, where token timestamps are taken as start times and are misaligned by one position relative to the tokens; that setup loses the last token timestamp (it is associated with the EOS token and then cut out when concatenating sequences). Also, we take the `decoder_input_ids` into account when computing the DTW, which can negatively impact its precision, while the OpenAI implementation doesn't.
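A minimal sketch of the corrected handling (tensor values are made up): since the EOS token has no cross-attention weights of its own, we replicate the last predicted token's timestamp instead of shifting everything by one and losing it:

```python
import torch

# DTW end times for 3 generated text tokens (hypothetical values)
jump_times = torch.tensor([0.30, 0.52, 0.96])

# the EOS token gets no timestamp from DTW: reuse the last predicted token's value
token_timestamps = torch.cat([jump_times, jump_times[-1:]])
print(token_timestamps)  # tensor([0.3000, 0.5200, 0.9600, 0.9600])
```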
The fix this PR brings
This PR fixes that by:
• treating timestamp tokens as end-of-token times and ensuring correct alignment between tokens and token timestamps
• replicating the last predicted token timestamp for the EOS token
• ignoring the `decoder_input_ids` when computing DTW and setting them as 0.0 s

Other changes
The `num_frames` kwarg is broken (and was not documented anyway)! The kwarg `return_token_timestamps` for the processor, which was supposed to add `num_frames` as a kwarg for Whisper's `generate`, is not useful! The `attention_mask` can be used to infer the number of frames, and IMO it is not good practice to silently require `return_token_timestamps=True` for the processor (it is not mentioned in Whisper's `generate` doc) in order to get precise token timestamps in `generate` (and this is even why our test was not correct 🫠). Instead, we want the user to pass the attention mask to use `return_token_timestamps`, and warn if they do not!
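A hedged usage sketch of the intended pattern (the model checkpoint and the `audio_array` variable are placeholders): ask the processor for the attention mask and pass it to `generate` so the number of frames can be inferred from it:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# audio_array: a 1-D float array of 16 kHz samples (placeholder)
inputs = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
    return_attention_mask=True,  # instead of relying on return_token_timestamps
)

out = model.generate(
    inputs.input_features,
    attention_mask=inputs.attention_mask,
    return_token_timestamps=True,  # per-token timestamps computed during generate
)
print(out["token_timestamps"])
```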
🚨 Changes for the user
This is kinda breaking: token timestamps are now aligned with the tokens and represent the end time of each token, while before they were all shifted by one and represented the start time of each token.
snippet
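A minimal sketch of the kind of pipeline call affected (checkpoint and audio file are placeholders):

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# "sample.wav" is a placeholder; word timestamps come back as (start, end) tuples
out = pipe("sample.wav", chunk_length_s=30, return_timestamps="word")
for chunk in out["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```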
current output with main
output with this PR