[Whisper + beam search] fix usage of beam_indices #38259

gante merged 6 commits into huggingface:main

Conversation
```python
        tensor containing the timestamps in seconds for each predicted token
    """
    # Create a list with `decoder_layers` elements, each a tensor of shape
    # (batch size, attention_heads, output length, input length).
```
The shape comments were incorrect for the case with beam search.
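For context, a minimal sketch (illustrative shapes and values only, not code from the PR) of why the old comment was misleading under beam search: the leading dimension of each cross-attention tensor is `batch_size * num_beams`, not `batch_size`:

```python
import torch

batch_size, num_beams, attention_heads = 2, 3, 4
output_length, input_length = 10, 1500  # made-up lengths for illustration

# One tensor per decoder layer; with beam search the leading dimension is
# batch_size * num_beams, so shape comments written for greedy decoding
# (plain batch_size) no longer describe what is actually being indexed.
layer_weights = torch.randn(
    batch_size * num_beams, attention_heads, output_length, input_length
)
print(layer_weights.shape)  # torch.Size([6, 4, 10, 1500])
```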
```python
weight_length = None

if "beam_indices" in generate_outputs:
    # If beam search has been used, the output sequences may have been generated
    # for more timesteps than their sequence_lengths
```
In this `if` block, I've rewritten the comments to better explain what's happening.
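A small illustration (token values invented) of the situation those comments describe: with beam search, all returned sequences are padded to the length of the longest one, so a sequence can carry more timesteps than its own `sequence_length`:

```python
import torch

# Two beams of different true lengths, padded to a common length of 5.
sequences = torch.tensor(
    [[50258, 123, 456, 50257, 50257],   # ended early, then padded
     [50258, 789, 101, 112, 50257]]     # ran for the full length
)
# beam_indices mirrors this padding with -1 entries past each sequence's end,
# which is what lets the code recover the real per-sequence lengths later.
print(sequences.shape)  # torch.Size([2, 5])
```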
```python
# beam search takes `decoder_input_ids` into account in the `beam_indices` length
# but forgot to shift the beam_indices by the number of `decoder_input_ids`
beam_indices = torch.zeros_like(generate_outputs.beam_indices[:, :weight_length])
# we actually shift the beam indices here
beam_indices[:, num_input_ids:] = generate_outputs.beam_indices[:, : weight_length - num_input_ids]
```
This is no longer correct after the beam search refactor (#35802): `beam_indices` was corrected to have the same output length as the other optional outputs (i.e. the length of the generated tokens).
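To make the mismatch concrete, here is a hedged sketch (all values invented) of what the old shifting workaround did, and why it misaligns beams once `beam_indices` already matches the generated-token length:

```python
import torch

weight_length, num_input_ids = 8, 2
# Pre-#35802, beam_indices also covered the decoder prompt, so the code
# shifted it right by num_input_ids before use:
old_beam_indices = torch.tensor([[0, 1, 1, 0, 0, 1, -1, -1]])

shifted = torch.zeros_like(old_beam_indices[:, :weight_length])
shifted[:, num_input_ids:] = old_beam_indices[:, : weight_length - num_input_ids]
print(shifted)  # tensor([[0, 0, 0, 1, 1, 0, 0, 1]])

# Post-#35802, beam_indices already has one entry per generated token, so the
# same shift would misalign entries and gather attentions from the wrong beams.
```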
```python
# In that case, the `cross_attentions` weights are too long and we have to make sure
# that they have the right `output_length`

weights = weights[:, :, :weight_length]
```
Redundant: we rebuild `weights` below with sequence length `range(unrolled_beam_indices.shape[1])` (= `weight_length`).
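A sketch of the kind of rebuild the reviewer is referring to (shapes and everything except the `unrolled_beam_indices` name are assumptions, not the PR's code): gathering one row per timestep already bounds the sequence dimension, so the earlier truncation does nothing extra:

```python
import torch

batch, beams, heads, seq_len, frames = 2, 3, 4, 6, 10
weights = torch.randn(batch * beams, heads, seq_len, frames)
# Which flat beam each (sample, timestep) token came from:
unrolled_beam_indices = torch.randint(0, batch * beams, (batch, seq_len))

# Rebuild weights token by token; the output's sequence length is
# unrolled_beam_indices.shape[1] by construction, i.e. weight_length.
rebuilt = torch.stack(
    [weights[unrolled_beam_indices[:, t], :, t] for t in range(unrolled_beam_indices.shape[1])],
    dim=2,
)
print(rebuilt.shape)  # torch.Size([2, 4, 6, 10])
```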
```python
# since the beam search strategy chooses the most probable sequences at the end of the search.
# In that case, the cross_attentions weights are too long and we have to make sure that
# they have the right output_length
weight_length = (generate_outputs.beam_indices != -1).sum(-1).max()
weight_length = weight_length if num_input_ids is None else weight_length + num_input_ids
```
Root issue of #36093: `weight_length` is off by one. The comments in the new version explain why :)
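A minimal illustration (invented values) of what this computation produces: counting non-`-1` entries gives the number of generated tokens, and the interaction with `num_input_ids` is where, per the review, the off-by-one crept in:

```python
import torch

# -1 marks positions after a sequence has finished.
beam_indices = torch.tensor([[0, 1, 1, -1],
                             [2, 2, 2, 2]])
num_input_ids = 1

weight_length = (beam_indices != -1).sum(-1).max()  # tensor(4): generated tokens
weight_length = weight_length if num_input_ids is None else weight_length + num_input_ids
print(weight_length)  # tensor(5): generated tokens plus prompt tokens
```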
vasqu left a comment:

Just some nits on the shape comments. An important step to have something "work" again, even if it's not producing the correct output quality-wise at first :)
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
What does this PR do?
Fixes the shape issues reported in #36093, which have been around since the code was added 👀. It doesn't fix the quality of word-timestamp outputs (see e.g. #36632); rather, it fixes how we gather the cross attentions from the right beams when beam search is used, which was broken.
`test_tiny_token_timestamp_batch_generation` is a test with the same pattern (beam search + timestamps) and is failing on `main` with the same exception as reported in #36093. This PR does NOT fix that test, but it allows the test to move past the shape exception to the output quality checks, which are still broken 🙃
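For reference, a minimal repro sketch of the failing pattern (the model id, audio file, and beam count are assumptions, not taken from the test): word timestamps combined with beam search is what used to raise the shape exception on `main`:

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
out = asr(
    "sample.wav",                      # any short audio clip
    return_timestamps="word",          # requests per-token cross attentions
    generate_kwargs={"num_beams": 3},  # beam search triggers the broken path
)
print(out["chunks"])
```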