fix(models): Resolve regressions in Wav2Vec2PhonemeCTCTokenizer (wav2vec2-lv-60-espeak-cv-ft)#45199

Merged
itazap merged 5 commits into huggingface:main from
harshaljanjani:fix/wav2vec2-phoneme-tokenizer-regressions
Apr 14, 2026

Conversation

@harshaljanjani
Contributor

@harshaljanjani harshaljanjani commented Apr 2, 2026

What does this PR do?

The following Wav2Vec2PhonemeCTCTokenizer regressions were identified and fixed in this PR:

05c0e1d ("rm slow tokenizers") added self.backend = kwargs.pop("backend", None). Wav2Vec2PhonemeCTCTokenizer already used self.backend for its phonemizer EspeakBackend object set in init_backend. Regardless of call order, one clobbers the other: either the base class overwrites the phonemizer object with None (breaking phonemize()), or the phonemizer object overwrites the base class's serializable value (breaking save_pretrained with an "EspeakBackend is not JSON serializable" error). Renamed to self._phonemizer_backend so both attributes coexist, following the same naming convention used for _word_delimiter_token and _phone_delimiter_token in the same file.
→ The same commit consolidated tokenization_utils.py into tokenization_python.py. In the old code, _encode_plus had return_offsets_mapping as a named param and raised NotImplementedError before reaching tokenize(). After the refactor, return_offsets_mapping is no longer a named param in _encode_plus, so it flows through **kwargs → tokenize() → prepare_for_tokenization(), which had a fixed signature. Added **kwargs to match the base class contract at tokenization_python.py#L836-L838. No other models are affected; Wav2Vec2PhonemeCTCTokenizer is the only override of prepare_for_tokenization that was missing **kwargs :)
→ For more details on reproducing the bug and the output screenshots, please visit the linked issue!
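The backend clash can be sketched with simplified stand-in classes (hypothetical names, not the actual transformers code):

```python
# Minimal sketch of the attribute clash. The base tokenizer pops a
# serializable "backend" config value into self.backend, while the
# phoneme tokenizer used to store its live phonemizer object under
# the same name, so one always clobbered the other.

class BaseTokenizer:
    def __init__(self, **kwargs):
        self.backend = kwargs.pop("backend", None)  # serializable config value


class PhonemeTokenizer(BaseTokenizer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Fixed: the live phonemizer lives under a private name, so it no
        # longer overwrites (or gets overwritten by) the base attribute.
        self._phonemizer_backend = object()  # stand-in for EspeakBackend


tok = PhonemeTokenizer(backend="espeak")
print(tok.backend)              # config value survives for serialization
print(tok._phonemizer_backend)  # live object available for phonemize()
```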
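The prepare_for_tokenization signature issue can be sketched with hypothetical minimal classes (not the real base class):

```python
# Simplified sketch of why prepare_for_tokenization needs **kwargs:
# after the refactor, extra encode arguments such as
# return_offsets_mapping flow through tokenize() into
# prepare_for_tokenization() instead of being rejected earlier.

class StrictSignatureTokenizer:
    def tokenize(self, text, **kwargs):
        # kwargs like return_offsets_mapping=True are forwarded here
        text, kwargs = self.prepare_for_tokenization(text, **kwargs)
        return text.split()

    def prepare_for_tokenization(self, text, is_split_into_words=False):
        # fixed signature: any unexpected kwarg raises TypeError
        return text, {}


class PatchedTokenizer(StrictSignatureTokenizer):
    def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs):
        # matches the base-class contract: unknown kwargs are accepted
        return text, kwargs


try:
    StrictSignatureTokenizer().tokenize("h ɛ l oʊ", return_offsets_mapping=True)
except TypeError as exc:
    print("broken:", exc)

tokens = PatchedTokenizer().tokenize("h ɛ l oʊ", return_offsets_mapping=True)
print("fixed:", tokens)  # ['h', 'ɛ', 'l', 'oʊ']
```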

Fixes #45198

cc: @Rocketknight1 @itazap

CI run test coverage of this behavior, fixed by this PR (as suggested for inclusion in the PR):

models/wav2vec2_phoneme/test_tokenization_wav2vec2_phoneme.py::Wav2Vec2PhonemeCTCTokenizerTest:

test_batch_encode_dynamic_overflowing, test_batch_encode_plus_batch_sequence_length, test_batch_encode_plus_padding, test_call, test_case_insensitive, test_change_phonemizer_lang, test_chat_template, test_chat_template_batched, test_decode_with_del, test_empty_input_string, test_encode, test_encode_basic_padding, test_encode_decode, test_encode_decode_with_del, test_encode_decode_with_del_filter, test_encode_plus_with_padding_0, test_encode_plus_with_padding_1, test_encode_with_del, test_mask_output, test_maximum_encoding_length_pair_input, test_maximum_encoding_length_single_input, test_number_of_added_tokens, test_offsets, test_offsets_batch, test_padding_to_multiple_of, test_phonemize, test_phonemize_with_word_del, test_prepare_seq2seq_batch, test_pretokenized_inputs, test_right_and_left_truncation, test_save_and_load_tokenizer, test_special_tokens_mask, test_special_tokens_mask_input_pairs, test_token_type_ids, test_tokenizer_add_new_tokens

Repro output after the fixes (feel free to cross-check):

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you fix any necessary existing tests?

@harshaljanjani harshaljanjani marked this pull request as ready for review April 2, 2026 20:13
@github-actions github-actions Bot requested a review from ArthurZucker April 2, 2026 20:13
Collaborator

@itazap itazap left a comment

Thank you! good catch, apologies for the overlap in naming

@itazap
Collaborator

itazap commented Apr 9, 2026

let's just add a small test so that we catch this next time 😉

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

    token_ids = tokenizer("maɪ c", do_phonemize=False).input_ids
    self.assertEqual(token_ids, [3, 200])  # mai should be <unk> (=3)

    def test_phonemizer_backend_not_clobbered(self):
Contributor Author

@harshaljanjani harshaljanjani Apr 9, 2026

@itazap Added, please check and let me know if it's alright :)

Contributor Author

Removed as per the review.

@harshaljanjani harshaljanjani requested a review from itazap April 9, 2026 08:08
@harshaljanjani
Contributor Author

Good day @itazap; just checking in to see if there have been any updates :)

@itazap
Collaborator

itazap commented Apr 13, 2026

Hey, sorry! The good news is that the existing tests like test_phonemize, test_encode, etc. already call _phonemize, so we don't need a new test 😅 The bad news is that these tests are failing both with and without this change.

@harshaljanjani
Contributor Author

@itazap Makes sense, removed :)

@itazap
Collaborator

itazap commented Apr 13, 2026

I'd prefer to wait to merge a fix that allows these tests to pass! Looking into it as well

@harshaljanjani
Contributor Author

harshaljanjani commented Apr 13, 2026

Ah, sorry, I missed what you meant in the previous comment. I've completed the investigation and expanded the PR's coverage. Here's an explanation of why this is a bit more involved now; could you double-check whether this fixes all the tests and whether the fixes make sense?

tokenization_wav2vec2_phoneme.py: dropped word_delimiter_token / phone_delimiter_token from super().__init__() (they're delimiters, not vocab tokens), so V5 auto-promotion no longer adds " " into the vocab. Round-trip is still preserved by reinjecting them into init_kwargs and re-reading them from model_specific_special_tokens on load.
→ Found a get_tokenizer() precedence bug: kwargs.update(cls.special_tokens_map) was clobbering user kwargs (e.g. pad_token="[PAD]" → <pad>). Copied the correct wav2vec2 pattern verbatim :))
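The precedence bug can be sketched like this (simplified names; a hypothetical minimal helper, not the exact test code):

```python
# dict.update overwrites existing keys, so the order of the merges
# decides whether arguments passed by the caller survive or get
# clobbered by class-level defaults.

special_tokens_map = {"pad_token": "<pad>"}  # class-level defaults

def get_tokenizer_broken(**kwargs):
    kwargs.update(special_tokens_map)  # defaults clobber user kwargs
    return kwargs

def get_tokenizer_fixed(**kwargs):
    merged = dict(special_tokens_map)  # defaults first...
    merged.update(kwargs)              # ...then user kwargs win
    return merged

print(get_tokenizer_broken(pad_token="[PAD]"))  # {'pad_token': '<pad>'}
print(get_tokenizer_fixed(pad_token="[PAD]"))   # {'pad_token': '[PAD]'}
```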

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: wav2vec2_phoneme

@itazap itazap enabled auto-merge April 14, 2026 13:46
@itazap
Collaborator

itazap commented Apr 14, 2026

perfect, thanks a lot! 🤗

@itazap itazap added this pull request to the merge queue Apr 14, 2026
Merged via the queue into huggingface:main with commit 5b565a5 Apr 14, 2026
17 checks passed
@harshaljanjani harshaljanjani deleted the fix/wav2vec2-phoneme-tokenizer-regressions branch April 14, 2026 14:00
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026
…vec2-lv-60-espeak-cv-ft) (huggingface#45199)

* fix: Resolve regressions from tokenizer refactor

* chore: Add regression test

* nit: Remove the test

* fix: Expand test coverage to all tests

---------

Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

[BUG] Wav2Vec2 wav2vec2-lv-60-espeak-cv-ft: save_pretrained and tokenization fail

3 participants