Feature request
Unify the output format of the Whisper `generate` method for short-form and long-form generation.
Motivation
In PR #30984, short-form and long-form generation in Whisper were unified so that both benefit from generation with fallback.
However, the format of the `generate` method's output still varies depending on whether we're doing short-form or long-form generation, as we can see in this line.
- For short-form generation, the output is either a torch tensor containing the sequence of token ids, or an instance of `ModelOutput` with additional information (attention masks, hidden states, ...) if `return_dict_in_generate` is set to `True` (we can now also use `return_segments` with short-form generation).
- For long-form generation, the output is either a torch tensor with the sequence of token ids, or a dict containing the `sequences` of token ids and a list of all `segments` if `return_segments` is set to `True`. Note that if both `return_dict_in_generate` and `return_segments` are set to `True`, the additional information (attention masks, hidden states) will be contained in `segments`. However, at the moment we can't get an instance of `ModelOutput` as output with long-form generation.
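To make the divergence concrete, here is a minimal sketch of the normalization a caller currently has to do. `normalize_generate_output` is a hypothetical helper (not part of `transformers`), and plain Python lists/dicts stand in for torch tensors and `ModelOutput` instances:

```python
# Hypothetical helper illustrating the branching a caller needs today to
# consume Whisper generate() outputs uniformly. Plain lists/dicts stand in
# for torch tensors and ModelOutput instances.

def normalize_generate_output(output):
    """Return (token_ids, segments) regardless of which shape generate() produced.

    - short form: a plain sequence of token ids (or a ModelOutput whose ids
      live under "sequences")
    - long form with return_segments=True: a dict with "sequences" and "segments"
    """
    if isinstance(output, dict):
        return output["sequences"], output.get("segments")
    return output, None

# Short-form style output: just the token ids.
short_form = [50258, 50259, 50359]

# Long-form style output with return_segments=True.
long_form = {
    "sequences": [50258, 50259, 50359],
    "segments": [[{"start": 0.0, "end": 5.0}]],
}

print(normalize_generate_output(short_form))  # ids, no segments
print(normalize_generate_output(long_form))   # ids plus segments
```

If the output format were unified, this branching would disappear and users could rely on a single return type.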
Should we work on this?
Ideally, we should also unify the output format of the Whisper `generate` method so that users don't have to distinguish between short- and long-form audio. They should only have to specify whether they want to perform sequential generation (non-chunked) or parallel generation (chunked) with the pipeline.
The aim of PR #30984 was to implement all the modifications needed to allow generation with fallback for short-form audio without breaking backward compatibility on main. If we further unify the output format, we would break backward compatibility and have to adapt several tests.
cc @sanchit-gandhi @ArthurZucker Do you think we should complete the unification of Whisper Generation by unifying the output format?