Decoding with timestamps produces unexpected results when the vocabulary is extended
>>> from transformers import WhisperTokenizer, AddedToken
>>> tokenizer = WhisperTokenizer.from_pretrained('openai/whisper-base', language="English", task="transcribe", predict_timestamps=True)
>>> extended_vocab = ['newword1']
>>> extended_vocab = [AddedToken(t, single_word=True, lstrip=True) for t in extended_vocab]
>>> tokenizer.add_tokens(extended_vocab)
1
>>> print(len(tokenizer))
51866
>>> print(tokenizer.convert_ids_to_tokens(51865))
newword1
>>> tokens = tokenizer('<|0.00|> newword1 <|0.22|>').input_ids
>>> tokens
[50258, 50259, 50359, 50364, 51865, 220, 50375, 50257]
>>> tokenizer.decode(tokens, skip_special_tokens=True)
'newword1 '
>>> tokenizer.decode(tokens, skip_special_tokens=False)
'<|startoftranscript|><|en|><|transcribe|>newword1 <|endoftext|>'
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|><|30.02|> <|30.24|><|endoftext|>'
>>> tokens = tokenizer('<|0.00|> word <|0.22|>').input_ids # something in the vocabulary
>>> tokenizer.decode(tokens, skip_special_tokens=True)
' word '
>>> tokenizer.decode(tokens, skip_special_tokens=False)
'<|startoftranscript|><|en|><|transcribe|> word <|endoftext|>'
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|> word <|0.22|><|endoftext|>'
I would expect the timestamps to remain consistent from tokenizing and decoding.
>>> tokens = tokenizer('<|0.00|> newword1 <|0.22|>').input_ids
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|> newword1<|0.22|><|endoftext|>'
System Info
python=3.10.13
transformers==4.44.1
torch==2.1.2
Who can help?
@sanchit-gandhi @ylacombe @eustlb @ArthurZ
Information
Tasks
examplesfolder (such as GLUE/SQuAD, ...)Reproduction
Decoding with timestamps produces unexpected results when the vocabulary is extended
The problem arises in
transformers/src/transformers/models/whisper/tokenization_whisper.py
Line 546 in 9613933
see issue 20225
Expected behavior
I would expect the timestamps to remain consistent from tokenizing and decoding.