[Whisper] Pipeline: handle long form generation #35750
eustlb merged 15 commits into huggingface:main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker left a comment:
It's missing a test IMO! 🤗
```python
elif self.type == "seq2seq_whisper" and not ignore_warning:
    logger.warning(
        "Using `chunk_length_s` with Whisper models is not recommended and will result in unreliable results, as it uses its own chunking mechanism "
        "(cf. Whisper original paper, section 3.8. Long-form Transcription)."
    )
```
As I mentioned offline, it would be a pity not to use that batch algo in some cases! But up for debate!
True! I just want to make sure:
- the user knows that for seq2seq models the pipeline's chunking mechanism is unreliable. A warning already exists for that; it just wasn't taking Whisper into account...
- the user is not using the pipeline's chunking to do long-form transcription (or at least knows there is something more reliable when it comes to Whisper)!!
I've updated the warning accordingly
I'm a bit confused. Can we rely on the pipeline to produce accurate word-level timestamps for long-form transcription with Whisper models? If not, what are the potential pitfalls or failure modes? The PR "Token-level timestamps for long-form generation in Whisper" doesn't seem to discuss this issue in much detail.
I tried the code in this PR on a sample audio. The chunk timestamps go out of sync with the audio, and it gets worse the longer the input audio is.
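For reference, a minimal script of the kind that exposes this drift might look as follows (checkpoint and file name are placeholders):

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# On a long file (several minutes), compare the returned chunk timestamps
# against the audio: the reported desynchronization grows over time when
# segment-relative timestamps are not offset after each seek.
out = pipe("sample_long_audio.wav", return_timestamps=True)
for chunk in out["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```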
Hi @eustlb, could you please tell us the status of this PR?
Hey @as-suvorov, it's waiting for a core maintainer's approval to merge. @ArthurZucker I've addressed your comment, ready to merge 🤗
```python
cut_off_length=None,
return_token_timestamps=False,
force_unique_generate_call=False,
skip_ending_double_timestamps=False,
```
We are missing documentation on this one, no?
Added a comment explaining these hidden parameters, with links to the related PRs that explain why we need them.
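For readers following the thread, a hedged sketch of what such an inline comment could convey, paraphrased from this PR's description rather than quoted from the merged code:

```python
# Hedged paraphrase of the added documentation (not the merged code verbatim):
#
# skip_ending_double_timestamps (`bool`, defaults to `False`):
#     Internal flag used by the ASR pipeline. Whisper segments can end with two
#     consecutive timestamp tokens; when `condition_on_prev_tokens=True`, one of
#     them must be skipped while building `decoder_input_ids` for the next
#     segment to match the OAI implementation (cf. #34537), while the full token
#     sequence is still needed for decoding.
```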
* handle long form generation
* add warning
* correct incorrect in place token change
* update test to catch edge case
* make style
* update warning
* add doc
What does this PR do?
Fixes #34210 #31942 #36602
In the tokenizer decoding logic for the pipeline, timestamp offsetting is now applied when the call to Whisper's `generate` involves seeking (meaning generating for a new segment).
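To illustrate the idea, a minimal sketch of the offsetting logic; `offset_timestamps`, the `segments` structure, and the frames-per-second constant are illustrative assumptions, not the pipeline's actual internals:

```python
# Illustrative sketch: Whisper emits timestamps relative to the current 30 s
# window, so each decoded segment must be shifted by the absolute position of
# its seek. Names and data layout here are hypothetical.
def offset_timestamps(segments, frames_per_second=100):
    """segments: list of (seek_in_frames, [(text, rel_start, rel_end), ...])."""
    chunks = []
    for seek, pieces in segments:
        offset = seek / frames_per_second  # Whisper features use 10 ms frames
        for text, rel_start, rel_end in pieces:
            chunks.append({"text": text, "timestamp": (rel_start + offset, rel_end + offset)})
    return chunks

# Example: the second segment seeked to frame 3000 (= 30 s), so its relative
# (0.0, 4.2) becomes the absolute (30.0, 34.2).
print(offset_timestamps([(0, [("Hello.", 0.0, 1.5)]), (3000, [("World.", 0.0, 4.2)])]))
```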
EDIT
This also fixes another issue, indirectly spotted in #36612: when `condition_on_prev_tokens=True`, we need to use the last generated tokens as `decoder_input_ids`. Nevertheless, this requires skipping one of the double ending timestamp tokens (cf. #34537) to match the OAI implementation, done via the in-place modification `tokens = tokens[:-1]`. But we actually need this token to be kept for further decoding (also cf. #34537)!! A sketch of the fix follows the TODO list below.

TODO

- What happens with `chunk_length_s=60`, e.g.? → actually Whisper just should not be used with `chunk_length_s` set! Added a warning.
- `test_large_timestamp_generation` with `condition_on_prev_tokens=True`
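A hypothetical sketch of the copy-instead-of-mutate fix described in the EDIT above; the constant and helper name are illustrative, not the merged code (timestamp token ids start at 50364 for multilingual checkpoints):

```python
# Whisper segments can end with two consecutive timestamp tokens; to match the
# OAI implementation, one of them is skipped when reusing the segment as
# decoder_input_ids for the next segment, but the full sequence must be kept
# for decoding the current one.
TIMESTAMP_BEGIN = 50364  # illustrative; first timestamp token id (multilingual)

def tokens_for_conditioning(tokens):
    """Tokens to reuse as decoder_input_ids, leaving `tokens` itself intact."""
    if len(tokens) >= 2 and tokens[-1] >= TIMESTAMP_BEGIN and tokens[-2] >= TIMESTAMP_BEGIN:
        return tokens[:-1]  # slicing copies; the caller keeps the last timestamp
    return tokens
```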