Missing timestamp offset using Whisper with pipeline and sequential decoding

### System Info

- `transformers` version: 4.45.2
- Platform: macOS-15.0.1-arm64-arm-64bit
- Python version: 3.12.1
- Huggingface_hub version: 0.23.3
- Safetensors version: 0.4.3
- Accelerate version: 0.34.2
- Accelerate config:    not found
- PyTorch version (GPU?): 2.4.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no

### Who can help?

@Rocketknight1  @gante @ylacombe

### Information

- [X] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)

### Reproduction

1. `pip install transformers==4.45.2`
2. Set up a Whisper pipeline with `chunk_length_s=0` (sequential long-form decoding according to the model card, at least for large-v3) and `return_timestamps=True`
3. Transcribe an audio file longer than 30 s

    ```py
    from transformers import pipeline
    import torch

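    audio_file = '<an-audio-file-longer-than-30-s>'
    # False -> sequential long-form decoding (chunk_length_s=0); True -> chunked decoding (chunk_length_s=30)
    chunked = False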

    pipe = pipeline(
        'automatic-speech-recognition',
        model='openai/whisper-small',
        chunk_length_s=30 if chunked else 0,
        return_timestamps=True,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device='cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu',
    )

    result = pipe(audio_file)
    transcript = '\n'.join(
        f"({chunk['timestamp'][0]}, {chunk['timestamp'][1]})\t{chunk['text']}" for chunk in result['chunks']
    )
    print(transcript)
    ```

4. Observe that the timestamps restart at 0.0 s once the audio passes 30 s

    ```
    (0.0, 4.44)      Er hatte schon mal eine Schnauze voll von allem und jedem.
    (4.44, 6.28)     Und er hat den Schluss getroffen.
    (6.28, 7.8)      Es hilft nichts mehr.
    (7.8, 9.28)      Ich wandere aus.
    (9.28, 11.4)     Das kann ein Grund sein,
    (11.4, 14.48)    wieso er eine Heimat für immer der Rückenträger will.
    (14.48, 16.72)   Oder es ist etwas ganz anderes.
    (16.72, 19.24)   Der wohl bekannt ist Grund...
    (19.24, 20.36)  ... die Liebe.
    (20.36, 22.44)   So ist es bei Hans Muster.
    (22.44, 24.72)   Die Liebe hat ihn nach Deutschland gezogen.
    (24.72, 27.0)    Und dort ist er seit vier Jahren.
    (27.0, 29.4)     Aber welter der für immer dort bleibt.
    (0.0, 1.0)       Gute Frage.
    (1.0, 4.0)       Ich stelle mir einen Gart am Viertel vor im PO bei den Leuten.
    (4.0, 7.0)       Und bis dort her, mein Name ist Peter Müller.
    (7.0, 11.0)      Und ich bin Wassermelone Heines vom Harry Styles.
    ```
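
To confirm the reset programmatically instead of reading the transcript by eye, a small check along these lines could be run on `result['chunks']`. This is only a sketch: `find_timestamp_resets` is a hypothetical helper (not part of `transformers`), and it assumes each chunk is a dict with a `timestamp` pair as shown in the output above. It also tolerates an end time of `None` defensively.

```python
def find_timestamp_resets(chunks):
    """Return indices of chunks that start before the previous chunk ended."""
    resets = []
    prev_end = None
    for i, chunk in enumerate(chunks):
        start, end = chunk["timestamp"]
        # A start time earlier than the previous end means the clock jumped back.
        if prev_end is not None and start is not None and start < prev_end:
            resets.append(i)
        if end is not None:
            prev_end = end
    return resets


# Values copied from the output above (text shortened for brevity).
chunks = [
    {"timestamp": (24.72, 27.0), "text": "..."},
    {"timestamp": (27.0, 29.4), "text": "..."},
    {"timestamp": (0.0, 1.0), "text": "..."},
    {"timestamp": (1.0, 4.0), "text": "..."},
]
print(find_timestamp_resets(chunks))  # [2]
```

An empty list would mean the timestamps are monotonic across the whole file.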


### Expected behavior

The timestamps should remain correct when the audio is longer than 30 s, just as they are with the chunked algorithm:

```
(0.0, 4.44)      Er hatte schon mal eine Schnauze voll von allem und jedem.
(4.44, 6.28)     Und er hat den Schluss getroffen.
(6.28, 7.8)      Es hilft nichts mehr.
(7.8, 9.28)      Ich wandere aus.
(9.28, 11.4)     Das kann ein Grund sein,
(11.4, 14.48)    wieso er eine Heimat für immer der Rückenträger will.
(14.48, 16.72)   Oder es ist etwas ganz anderes.
(16.72, 19.24)   Der wohl bekannt ist Grund...
(19.24, 20.36)  ... die Liebe.
(20.36, 22.44)   So ist es bei Hans Muster.
(22.44, 24.72)   Die Liebe hat ihn nach Deutschland gezogen.
(24.72, 26.0)    Und dort ist er seit vier Jahren.
(26.0, 29.0)     Aber welter der für immer dort bleibt, gute Frage.
(29.0, 32.0)     Wir stellen es dir an, am Viertel vor, im PO bei den Leuten.
(32.0, 35.0)     Und bis dort her, mein Name ist Peter Müller.
(35.0, 39.0)     Und jetzt ein Wassermelon Heines vom Harry Styles.
```

The output above is from the same script with `chunked = True`.
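
Until the offset is applied by the pipeline itself, a post-processing workaround along the following lines might be usable. It is a heuristic sketch, not the library's method: `rebase_timestamps` is a hypothetical helper, and it assumes that a start time earlier than the previous end marks a new decoding window, which it then shifts by the end time reached so far. The real offset is the position the decoder seeked to, which the pipeline output does not expose, so the result can drift (for example when a window ends in silence) and will not match the chunked output exactly.

```python
def rebase_timestamps(chunks):
    """Return a copy of `chunks` with timestamps shifted to be monotonic.

    Assumption: whenever a chunk starts before the previous chunk ended
    (in raw, unshifted time), a new decoding window has begun.
    """
    fixed = []
    offset = 0.0          # seconds added to timestamps in the current window
    prev_raw_end = None   # last raw end time seen (before shifting)
    last_abs_end = 0.0    # last shifted end time emitted
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if prev_raw_end is not None and start < prev_raw_end:
            # Clock jumped back: continue from where the previous window stopped.
            offset = last_abs_end
        new_start = round(start + offset, 2)
        new_end = None if end is None else round(end + offset, 2)
        fixed.append({**chunk, "timestamp": (new_start, new_end)})
        if end is not None:
            prev_raw_end = end
            last_abs_end = new_end
    return fixed


# Values copied from the reproduction output (text shortened for brevity).
sample = [
    {"timestamp": (24.72, 27.0), "text": "..."},
    {"timestamp": (27.0, 29.4), "text": "..."},
    {"timestamp": (0.0, 1.0), "text": "..."},
    {"timestamp": (1.0, 4.0), "text": "..."},
]
for chunk in rebase_timestamps(sample):
    print(chunk["timestamp"])
```

In the reproduction script this would be applied as `rebase_timestamps(result['chunks'])` before building the transcript string.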

Missing timestamp offset using Whisper with pipeline and sequential decoding #34210
