Skip to content

tokenizer decode decode with timestamp fails for extended vocabulary #35330

@bnestor

Description

@bnestor

System Info

python=3.10.13
transformers==4.44.1
torch==2.1.2

Who can help?

@sanchit-gandhi @ylacombe @eustlb @ArthurZ

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Decoding with timestamps produces unexpected results when the vocabulary is extended

>>> from transformers import WhisperTokenizer, AddedToken
>>> tokenizer = WhisperTokenizer.from_pretrained('openai/whisper-base', language="English", task="transcribe", predict_timestamps=True)
>>> extended_vocab = ['newword1']
>>> extended_vocab = [AddedToken(t, single_word=True, lstrip=True) for t in extended_vocab]
>>> tokenizer.add_tokens(extended_vocab)
1
>>> print(len(tokenizer))
51866
>>> print(tokenizer.convert_ids_to_tokens(51865))
newword1
>>> tokens = tokenizer('<|0.00|> newword1 <|0.22|>').input_ids
>>> tokens
[50258, 50259, 50359, 50364, 51865, 220, 50375, 50257]
>>> tokenizer.decode(tokens, skip_special_tokens=True)
'newword1 '
>>> tokenizer.decode(tokens, skip_special_tokens=False)
'<|startoftranscript|><|en|><|transcribe|>newword1 <|endoftext|>'
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|><|30.02|> <|30.24|><|endoftext|>'
>>> tokens = tokenizer('<|0.00|> word <|0.22|>').input_ids # something in the vocabulary
>>> tokenizer.decode(tokens, skip_special_tokens=True)
' word '
>>> tokenizer.decode(tokens, skip_special_tokens=False)
'<|startoftranscript|><|en|><|transcribe|> word <|endoftext|>'
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|> word <|0.22|><|endoftext|>'

The problem arises in

see issue 20225

Expected behavior

I would expect the timestamps to remain consistent from tokenizing and decoding.

>>> tokens = tokenizer('<|0.00|> newword1 <|0.22|>').input_ids
>>> tokenizer.decode(tokens, skip_special_tokens=False, decode_with_timestamps=True)
'<|startoftranscript|><|en|><|transcribe|><|0.00|> newword1<|0.22|><|endoftext|>'

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions