Finish short form / long form generation integration in Whisper  #32263

@kamilakesbi

Feature request

Unify the output format of the Whisper generate method for short-form and long-form generation.

Motivation

In PR #30984, short-form and long-form generation in Whisper were unified so that both benefit from generation with fallback.

However, the output format of the generate method still varies depending on whether we're doing short-form or long-form generation, as we can see in this line.

  • For short-form generation, the output is either a torch tensor containing the sequence of token ids or, if return_dict_in_generate is set to True, an instance of ModelOutput with additional information (attention masks, hidden states, ...). We can now also use return_segments with short-form generation.

  • For long-form generation, the output is either a torch tensor with the sequence of token ids or, if return_segments is set to True, a dict containing the sequences of token ids and a list of all segments. Note that if both return_dict_in_generate and return_segments are set to True, the additional information (attention masks, hidden states) is contained in segments. However, at the moment we can't get an instance of ModelOutput as output with long-form generation.
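
To make the inconsistency concrete, here is a minimal sketch of the branching a caller currently needs just to get the token ids back. It does not call Whisper: `FakeModelOutput` and `extract_sequences` are hypothetical names introduced for illustration, and plain Python lists stand in for torch tensors.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeModelOutput:
    """Hypothetical stand-in for a ModelOutput-style object
    (short form with return_dict_in_generate=True)."""
    sequences: list
    extras: dict = field(default_factory=dict)  # e.g. attention masks, hidden states


def extract_sequences(output: Any):
    """Return the token-id sequences from any of the three output shapes."""
    # Long form with return_segments=True: a plain dict holding
    # "sequences" and "segments".
    if isinstance(output, dict):
        return output["sequences"]
    # Short form with return_dict_in_generate=True: an object exposing .sequences.
    if hasattr(output, "sequences"):
        return output.sequences
    # Default in both modes: the token ids themselves (a tensor in practice).
    return output


# Stand-ins for the three shapes described above.
plain = [[1, 2, 3]]
long_form = {"sequences": [[1, 2, 3]], "segments": [[{"tokens": [1, 2, 3]}]]}
short_form = FakeModelOutput(sequences=[[1, 2, 3]])

for out in (plain, long_form, short_form):
    print(extract_sequences(out))
```

With a unified output format, a helper like this would no longer be needed on the user side.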

Should we work on this?

Ideally, we should also unify the output format of the Whisper generate method so that users don't have to distinguish between short-form and long-form audio. They should only have to specify whether they want to perform sequential generation (non-chunked) or parallel generation (chunked) with the pipeline.

The aim of PR #30984 was to implement all the modifications needed to allow generation with fallback for short-form audio without breaking backward compatibility on main. If we further unify the output format, we would break backward compatibility and have to adapt several tests.

cc @sanchit-gandhi @ArthurZucker Do you think we should complete the unification of Whisper Generation by unifying the output format?
