Tokenizer `decode` with `decode_with_timestamps=True` fails for extended vocabulary

### System Info

python=3.10.13
transformers==4.44.1
torch==2.1.2

### Who can help?

@sanchit-gandhi @ylacombe @eustlb @arthurz

### Information

- [X] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)

### Reproduction

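Decoding with timestamps produces unexpected results when the vocabulary is extended: the added token is rendered as a timestamp (`<|30.02|>`), and the timestamp that follows it is shifted from `<|0.22|>` to `<|30.24|>`.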

```
>>> from transformers import WhisperTokenizer, AddedToken
>>> tokenizer = WhisperTokenizer.from_pretrained('openai/whisper-base', language="English", task="transcribe", predict_timestamps=True)
>>> extended_vocab = ['newword1']
>>> extended_vocab = [AddedToken(t, single_word=True, lstrip=True) for t in extended_vocab]
>>> tokenizer.add_tokens(extended_vocab)
1
>>> print(len(tokenizer))
51866
>>> print(tokenizer.convert_ids_to_tokens(51865))
newword1
>>> tokens = tokenizer('<|0.00|> newword1 <|0.22|>').input_ids
>>> tokens
[50258, 50259, 50359, 50364, 51865, 220, 50375, 50257]
>>> tokenizer.decode(tokens, skip_special_tokens=True)
'newword1 '
>>> tokenizer.decode(tokens, skip_special_tokens=False)
'<|startoftranscript|><|en|><|transcribe|>newword1 <|endoftext|>'
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|><|30.02|> <|30.24|><|endoftext|>'
>>> tokens = tokenizer('<|0.00|> word <|0.22|>').input_ids # something in the vocabulary
>>> tokenizer.decode(tokens, skip_special_tokens=True)
' word '
>>> tokenizer.decode(tokens, skip_special_tokens=False)
'<|startoftranscript|><|en|><|transcribe|> word <|endoftext|>'
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|> word <|0.22|><|endoftext|>'
```

The problem arises in [tokenization_whisper.py#L546](https://github.com/huggingface/transformers/blob/9613933b022ddbf085e2c593ed4ceea4c734179a/src/transformers/models/whisper/tokenization_whisper.py#L546).

See also issue [#20225](https://github.com/huggingface/transformers/issues/20225).
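
The numbers in the output above are consistent with the decoder treating every id at or above the id of `<|0.00|>` (50364) as a timestamp, with no upper bound. The following is a minimal standalone sketch of that logic, not the actual `transformers` code: `render`, `bounded`, and the constants are hypothetical names for illustration, and the count of 1501 timestamp tokens is inferred from the original vocabulary size (51865 - 50364). It assumes the decoder also adds an offset whenever a timestamp is smaller than the previous one, which would explain `<|30.24|>`.

```python
# Sketch only: mimics the suspected timestamp logic without importing transformers.
TIMESTAMP_BEGIN = 50364  # id of <|0.00|> in the reproduction above
NUM_TIMESTAMPS = 1501    # inferred: 51865 (original vocab size) - 50364
TIME_PRECISION = 0.02    # seconds per timestamp step


def render(token_ids, bounded):
    """Render timestamp ids as <|t|>; all other ids as [id] placeholders.

    bounded=False: any id >= TIMESTAMP_BEGIN counts as a timestamp (suspected bug).
    bounded=True:  only ids inside the timestamp range count (possible fix).
    """
    timestamp_end = TIMESTAMP_BEGIN + NUM_TIMESTAMPS  # exclusive upper bound
    pieces = []
    cur_max = 0.0  # last timestamp seen in the current segment
    offset = 0.0   # accumulated length of previous segments
    for tok in token_ids:
        is_timestamp = tok >= TIMESTAMP_BEGIN and (not bounded or tok < timestamp_end)
        if is_timestamp:
            ts = (tok - TIMESTAMP_BEGIN) * TIME_PRECISION
            if ts < cur_max:
                # A smaller timestamp is taken to mean a new segment started.
                offset += cur_max
            cur_max = ts
            pieces.append(f"<|{ts + offset:.2f}|>")
        else:
            pieces.append(f"[{tok}]")
    return "".join(pieces)


# <|0.00|>, newword1 (added token), space, <|0.22|>
ids = [50364, 51865, 220, 50375]
print(render(ids, bounded=False))  # added token misread as a timestamp
print(render(ids, bounded=True))   # added token left as a regular token
```

With `bounded=False` the added token id 51865 maps to (51865 - 50364) * 0.02 = 30.02, and the next timestamp is then shifted by that amount, matching the reproduction. Restricting the check to the actual timestamp range, as in `bounded=True`, is one possible way to keep added tokens out of the timestamp path.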


### Expected behavior

I would expect the timestamps to remain consistent between tokenizing and decoding, and the added token to be decoded as text rather than as a timestamp.

```
>>> tokens = tokenizer('<|0.00|> newword1 <|0.22|>').input_ids
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|> newword1<|0.22|><|endoftext|>'
```
