Support generating with fallback for short form audio in Whisper by kamilakesbi · Pull Request #30984 · huggingface/transformers

kamilakesbi · 2024-05-23T10:30:38Z

What does this PR do?

The aim of this PR is to refactor the Whisper generate method so that short form and long form audio generation are handled in the same way. It will support short form audio generation with fallback (as requested in #29508).
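For readers unfamiliar with the fallback idea, here is a minimal sketch of it, not the code from this PR. It assumes a hypothetical `decode` callable that returns tokens and an average log-probability for a given temperature, and it only checks a `logprob_threshold`; the function name, the temperature values, and the toy decoder are illustrative assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical decoder signature (an assumption for this sketch):
# takes a temperature and returns (tokens, average log-probability).
DecodeFn = Callable[[float], Tuple[List[int], float]]


def generate_with_fallback_sketch(
    decode: DecodeFn,
    temperatures: Sequence[float] = (0.0, 0.2, 0.4),
    logprob_threshold: float = -1.0,
) -> Tuple[List[int], float, Optional[float]]:
    """Try each temperature in order and stop at the first acceptable result.

    A result is accepted when its average log-probability is at least
    `logprob_threshold`. If no attempt is accepted, the last one is returned.
    """
    tokens: List[int] = []
    avg_logprob = float("-inf")
    used_temperature: Optional[float] = None
    for temperature in temperatures:
        tokens, avg_logprob = decode(temperature)
        used_temperature = temperature
        if avg_logprob >= logprob_threshold:
            break  # confident enough, no further fallback needed
    return tokens, avg_logprob, used_temperature


def toy_decode(temperature: float) -> Tuple[List[int], float]:
    # Toy stand-in: pretend greedy decoding is low-confidence
    # and that sampling at a higher temperature does better.
    if temperature == 0.0:
        return [1, 2, 3], -1.5
    return [1, 2, 4], -0.4


result = generate_with_fallback_sketch(toy_decode)
```

In this toy run the greedy attempt falls below the threshold, so the second temperature is tried and its result is kept.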

Here's what I've done:

Removed previous short-form scripts:

I've removed the part of the code used for short form generation. This involves lines 562 to 603 and lines 498 to 505 in main. Now when a short form audio (or a batch of short form audios) is passed to generate, it's processed by the part of the code previously used for long form generation.

Use is_shortform to still distinguish between short form and long form in some cases:

In the _postprocess_outputs method, we only return past_key_values if the audios are short form. For long form audios it is too expensive (cf. this line).
In _retrieve_max_frames_and_seek: for long form audios we necessarily need to pass an attention mask, but not for short form audios. We can thus compute max_frames and seek without relying on the attention mask for short form audios.
I've also updated the split_by_batch_index method: the previous method was broken when return_dict_in_generate was set to True for several short form audio cases. Now it handles both short form and long form audios.
I've removed the is_shortform parameter from the inputs to the _retrieve_logit_processors method to allow the use of generation_config.no_speech_threshold for short form audios.
I've removed the is_shortform parameter from the inputs to the _set_return_outputs method to allow the use of logprob_threshold for short form audios.
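The max_frames and seek point above can be illustrated with a small sketch. It uses plain Python lists instead of tensors, and the function name, argument names, and the choice to only require a mask for batched long form input are assumptions made for illustration rather than the exact logic in generation_whisper.py.

```python
from typing import List, Optional, Tuple


def retrieve_max_frames_and_seek_sketch(
    batch_size: int,
    total_input_frames: int,
    is_shortform: bool,
    attention_mask: Optional[List[List[int]]] = None,
) -> Tuple[List[int], List[int]]:
    """Return (max_frames, seek) per batch item.

    Short form: every item is assumed to span the full input, so no mask is needed.
    Batched long form: the valid length of each item is read from the attention mask.
    """
    if batch_size > 1 and not is_shortform:
        if attention_mask is None:
            raise ValueError("Batched long form audio requires an attention mask.")
        # Number of valid (non-padded) frames for each item.
        max_frames = [sum(row) for row in attention_mask]
    else:
        max_frames = [total_input_frames] * batch_size
    # Decoding always starts at frame 0 for every item.
    seek = [0] * batch_size
    return max_frames, seek


short_form = retrieve_max_frames_and_seek_sketch(2, 3000, is_shortform=True)
long_form = retrieve_max_frames_and_seek_sketch(
    2, 4, is_shortform=False, attention_mask=[[1, 1, 1, 0], [1, 1, 0, 0]]
)
```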

Make num_return_sequences>1 compatible with generate_with_fallback:

This is a bit tricky because generate_with_fallback can't handle num_return_sequences>1 by design. I've added a new method, called _expand_variables_for_generation, which expands the different variables before passing them into generate_with_fallback when generation_config.num_return_sequences>1. After expansion, it sets generation_config.num_return_sequences to 1 for compatibility with generate_with_fallback.
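The expansion step could look roughly like the sketch below. It is only an illustration on plain Python lists: `ToyGenerationConfig`, the function name, the set of expanded variables, and the interleaved ordering (each item repeated consecutively) are assumptions, not the signature of _expand_variables_for_generation.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class ToyGenerationConfig:
    # Stand-in for the real generation config (hypothetical, for illustration).
    num_return_sequences: int = 1


def expand_for_generation_sketch(
    input_features: List[str],
    seek: List[int],
    max_frames: List[int],
    config: ToyGenerationConfig,
) -> Tuple[List[str], List[int], List[int], ToyGenerationConfig]:
    """Repeat every per-item variable num_return_sequences times, then reset it to 1."""
    n = config.num_return_sequences
    if n <= 1:
        return input_features, seek, max_frames, config

    def repeat(items: list) -> list:
        # [a, b] with n=2 becomes [a, a, b, b].
        return [item for item in items for _ in range(n)]

    expanded_config = replace(config, num_return_sequences=1)
    return repeat(input_features), repeat(seek), repeat(max_frames), expanded_config


features, seek, max_frames, config = expand_for_generation_sketch(
    ["audio_a", "audio_b"], [0, 0], [10, 20], ToyGenerationConfig(num_return_sequences=2)
)
```

After this step the batch has grown from 2 to 4 items, and the fallback loop only ever sees one returned sequence per item.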

Ensure that the output format for short form audio is compatible with the output format in main:

The output format for long-form audio is different from that for short-form audio. To ensure that the output matches what main produces when processing short form audio, we need to add a few post-processing steps. This is what is done in lines 721 to 765. In particular:

We add an EOS token to the output sequence as it was removed during generation with fallback.
We return the token timestamps if return_token_timestamps is True in the correct format (see here).
If return_dict_in_generate is True, we use the new method _stack_split_outputs to get the output dict, containing all attributes (scores, encoder_attentions, etc.), in the right format. _stack_split_outputs basically performs the opposite operations to split_by_batch_index.
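To make the split/stack relationship and the EOS step concrete, here is a toy round trip on dictionaries of lists. The helper names, the dictionary layout, and the token ids are hypothetical; the real methods operate on tensors and on more attributes than shown here.

```python
from typing import Dict, List

EOS_TOKEN_ID = 99  # arbitrary toy id, not a real Whisper token id


def split_by_batch_index_sketch(outputs: Dict[str, list], batch_idx: int) -> Dict[str, list]:
    """Pick the entry belonging to one batch item out of every attribute."""
    return {key: values[batch_idx] for key, values in outputs.items()}


def stack_split_outputs_sketch(per_item: List[Dict[str, list]]) -> Dict[str, list]:
    """Inverse of the split: regroup per-item dicts into one batched dict."""
    return {key: [item[key] for item in per_item] for key in per_item[0]}


def append_eos_sketch(sequence: List[int], eos_token_id: int = EOS_TOKEN_ID) -> List[int]:
    """Add the EOS token back if it is not already the last token."""
    if sequence and sequence[-1] == eos_token_id:
        return sequence
    return sequence + [eos_token_id]


batched = {
    "sequences": [[1, 7, 8], [1, 9]],
    "scores": [[-0.1, -0.2], [-0.3]],
}
per_item = [split_by_batch_index_sketch(batched, i) for i in range(2)]
restacked = stack_split_outputs_sketch(per_item)
with_eos = [append_eos_sketch(seq) for seq in restacked["sequences"]]
```

Splitting and then stacking returns the original batched structure, which is the property the post-processing relies on.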

Make failing slow tests pass:

I've updated some failing slow tests and made them pass (see here).

Add new tests to make sure generation with fallback works for short form audios:

I've added two tests: test_whisper_shortform_single_batch_prev_cond and test_whisper_shortform_multi_batch_hard_prev_cond.

Who can review:

@sanchit-gandhi

HuggingFaceDocBuilderDev · 2024-05-23T10:53:41Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

sanchit-gandhi

Looks like a good start @kamilakesbi. Two biggest suggestions are related to the designs of i) assisted generation, and ii) num return sequences. Think both can be simplified and assisted generation made more rigorous.

Two further design questions:

Should we return the original decoder_input_ids and EOS tokens in the sequences for long-form generation as well? IMO it is an inconsistency that we return them for short-form but not long-form, and I would be in favour of unifying the two in this PR.
Is it correct to deactivate beam search when temperature>0? We currently don't do this for long-form generation, but given that the original Whisper repo does, it would be good to determine whether this is a 'bug' or an intended design decision.

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

kamilakesbi · 2024-07-17T09:38:30Z

@ArthurZucker thanks for your review! I took your remarks into account :)

Failing tests are unrelated to this PR. If this is ok for you we can perhaps merge or wait for the CI to be green...

ArthurZucker · 2024-07-18T12:16:58Z

Let's wait for the full CI. It seems alright now!

ArthurZucker · 2024-07-18T12:17:30Z

Also, a question was not answered!

kamilakesbi · 2024-07-18T12:50:17Z

The CI is green yes :) if it's ok for you I can merge!

ArthurZucker

Thanks! Last two nits and you can merge!

amyeroberts added the Audio label May 23, 2024

kamilakesbi added 8 commits May 24, 2024 16:02

remove is_shortform

f752535

adapt _retrieve_max_frames_and_seek for short_form

d58d3fa

return bos token in short and long form

9eda3eb

add decoder_input_ids to short form audios

7a97483

add eos token for short form

ba36c8e

handle short form token_timestamps

d347633

no need to return scores

7eddab3

add is_shortform conditions

07e7db3

kamilakesbi force-pushed the fallback_short_form branch from 956cfb4 to 07e7db3 Compare May 24, 2024 14:02

kamilakesbi added 10 commits May 24, 2024 17:25

handle when max_new_tokens is None - short form

fe1da29

handle assistant decoding

24769d7

fix

54ad952

handle return_dict_in_generate

1534c8e

handle split_by_batch for encoder_attentions attribute

b37372e

handle num_beams>1

f58af54

handle num_return_sequences>1 in generate_with_fallback

9125277

handle num_return_sequences>1 with return_dict_in_generate=True

78252b1

raise error if max_new_tokens + decoder_inputs_ids > max_target_pos

0d0c720

fix

a47e9ba

sanchit-gandhi reviewed May 29, 2024

kamilakesbi and others added 8 commits May 29, 2024 19:02

apply review suggestions

78c3842

fix

779b741

Update src/transformers/models/whisper/generation_whisper.py

2cc9813

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

Update src/transformers/models/whisper/generation_whisper.py

51abdcb

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

Update src/transformers/models/whisper/generation_whisper.py

accf48c

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

fix

49faeb9

logits for both short form and long form

6c2d76d

handle if logits_processor is None

e2cd613

update test

73a9805

kamilakesbi requested a review from ArthurZucker July 17, 2024 09:38

kamilakesbi and others added 6 commits July 18, 2024 10:07

Merge branch 'main' into fallback_short_form

2e36400

make style

a96d600

small fix

299de5d

fix

17b9d47

fix test_new_cache_format

3ef601e

fix past_key_values

d1cc8d3

ArthurZucker reviewed Jul 18, 2024

Comment thread src/transformers/models/whisper/generation_whisper.py Outdated

kamilakesbi added 2 commits July 18, 2024 15:49

fix

ed8cc34

make style

6b7b3d6

kamilakesbi force-pushed the fallback_short_form branch from a00d2e8 to 6b7b3d6 Compare July 18, 2024 14:45

fix slow tests

b503c14

ArthurZucker approved these changes Jul 19, 2024

Comment thread src/transformers/models/whisper/generation_whisper.py Outdated

Comment thread src/transformers/models/whisper/generation_whisper.py Outdated

fix

cb4201e

kamilakesbi merged commit 89575b5 into huggingface:main Jul 19, 2024

kamilakesbi mentioned this pull request Jul 27, 2024

Finish short form / long from generation integration in Whisper #32263

Closed

drewhouston mentioned this pull request Aug 1, 2024

Some Whisper beam search output (sequences_scores, etc.) is lost in _stack_split_outputs #32373

Closed

4 tasks

ylacombe mentioned this pull request Sep 12, 2024

Fix missing sequences_scores in the Whisper beam search output #32970

Merged

5 tasks

Nik-Kras mentioned this pull request Sep 12, 2024

Whisper Beam Search doesn't work #33445

Closed

4 tasks

eustlb mentioned this pull request Oct 13, 2024

[Whisper] 🚨 Fix whisper decoding 🚨 #34135

Merged

5 participants
