Feature request
Unify the output format of the Whisper `generate` method for short-form and long-form generation.
Motivation
In PR #30984, short-form and long-form generation in Whisper were unified so that both benefit from generation with fallback.
However, the format of the `generate` method's output still varies depending on whether we're doing short-form or long-form generation, as we can see in this line.
- For short-form generation, the output is either a torch tensor containing the sequence of token ids, or an instance of `ModelOutput` with additional information (attention masks, hidden states, ...) if `return_dict_in_generate` is set to `True` (we can now also use `return_segments` with short-form generation).
- For long-form generation, the output is either a torch tensor with the sequence of token ids, or a dict containing the `sequences` of token ids and a list of all `segments` if `return_segments` is set to `True`. Note that if both `return_dict_in_generate` and `return_segments` are set to `True`, the additional information (attention masks, hidden states) will be contained in `segments`. However, at the moment we can't get an instance of `ModelOutput` as output with long-form generation.
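To make the divergence concrete, here is a minimal sketch of the normalization a caller currently has to do. `normalize_generate_output` is a hypothetical helper (not part of `transformers`), and plain Python lists/dicts stand in for torch tensors and `ModelOutput` instances:

```python
# Hypothetical helper illustrating the branching a caller needs today to
# consume Whisper generate() outputs uniformly. Plain lists/dicts stand in
# for torch tensors and ModelOutput instances.

def normalize_generate_output(output):
    """Return (token_ids, segments) regardless of which shape generate() produced.

    - short form: a plain sequence of token ids (or a ModelOutput whose ids
      live under "sequences")
    - long form with return_segments=True: a dict with "sequences" and "segments"
    """
    if isinstance(output, dict):
        return output["sequences"], output.get("segments")
    return output, None

# Short-form style output: just the token ids.
short_form = [50258, 50259, 50359]

# Long-form style output with return_segments=True.
long_form = {
    "sequences": [50258, 50259, 50359],
    "segments": [[{"start": 0.0, "end": 5.0}]],
}

print(normalize_generate_output(short_form))  # ids, no segments
print(normalize_generate_output(long_form))   # ids plus segments
```

If the output format were unified, this branching would disappear and users could rely on a single return type.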
Should we work on this?
Ideally, we should also unify the output format of the Whisper `generate` method so that users don't have to distinguish between short- and long-form audio. They should only have to specify whether they want to perform sequential generation (non-chunked) or parallel generation (chunked) with the pipeline.
The aim of PR #30984 was to implement all the modifications needed to allow generation with fallback for short-form audio without breaking backward compatibility on main. If we further unify the output format, we would break backward compatibility and have to adapt several tests.
cc @sanchit-gandhi @ArthurZucker Do you think we should complete the unification of Whisper Generation by unifying the output format?