fix(models): Resolve regressions in Wav2Vec2PhonemeCTCTokenizer (wav2vec2-lv-60-espeak-cv-ft)#45199

Merged
itazap merged 5 commits into huggingface:main from
harshaljanjani:fix/wav2vec2-phoneme-tokenizer-regressions
Apr 14, 2026

Conversation

@harshaljanjani
Contributor

@harshaljanjani harshaljanjani commented Apr 2, 2026

What does this PR do?

The following Wav2Vec2PhonemeCTCTokenizer regressions were identified and fixed in this PR:

05c0e1d ("rm slow tokenizers") added self.backend = kwargs.pop("backend", None). Wav2Vec2PhonemeCTCTokenizer already used self.backend for its phonemizer EspeakBackend object set in init_backend. Regardless of call order, one clobbers the other: either the base class overwrites the phonemizer object with None (breaking phonemize()), or the phonemizer object overwrites the base class's serializable value (breaking save_pretrained with an "EspeakBackend is not JSON serializable" error). Renamed to self._phonemizer_backend so both attributes coexist, following the same naming convention used for _word_delimiter_token and _phone_delimiter_token in the same file.
→ The same commit consolidated tokenization_utils.py into tokenization_python.py. In the old code, _encode_plus had return_offsets_mapping as a named param and raised NotImplementedError before reaching tokenize(). After the refactor, return_offsets_mapping is no longer a named param in _encode_plus, so it flows through **kwargs → tokenize() → prepare_for_tokenization(), which had a fixed signature. Added **kwargs to match the base class contract at tokenization_python.py#L836-L838. No other models are affected; Wav2Vec2PhonemeCTCTokenizer is the only override of prepare_for_tokenization that was missing **kwargs :)
→ For more details on reproducing the bug and the output screenshots, please visit the linked issue!
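The backend clash can be sketched with simplified stand-in classes (hypothetical names, not the actual transformers code):

```python
# Minimal sketch of the attribute clash. The base tokenizer pops a
# serializable "backend" config value into self.backend, while the
# phoneme tokenizer used to store its live phonemizer object under
# the same name, so one always clobbered the other.

class BaseTokenizer:
    def __init__(self, **kwargs):
        self.backend = kwargs.pop("backend", None)  # serializable config value


class PhonemeTokenizer(BaseTokenizer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Fixed: the live phonemizer lives under a private name, so it no
        # longer overwrites (or gets overwritten by) the base attribute.
        self._phonemizer_backend = object()  # stand-in for EspeakBackend


tok = PhonemeTokenizer(backend="espeak")
print(tok.backend)              # config value survives for serialization
print(tok._phonemizer_backend)  # live object available for phonemize()
```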
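The prepare_for_tokenization signature issue can be sketched with hypothetical minimal classes (not the real base class):

```python
# Simplified sketch of why prepare_for_tokenization needs **kwargs:
# after the refactor, extra encode arguments such as
# return_offsets_mapping flow through tokenize() into
# prepare_for_tokenization() instead of being rejected earlier.

class StrictSignatureTokenizer:
    def tokenize(self, text, **kwargs):
        # kwargs like return_offsets_mapping=True are forwarded here
        text, kwargs = self.prepare_for_tokenization(text, **kwargs)
        return text.split()

    def prepare_for_tokenization(self, text, is_split_into_words=False):
        # fixed signature: any unexpected kwarg raises TypeError
        return text, {}


class PatchedTokenizer(StrictSignatureTokenizer):
    def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs):
        # matches the base-class contract: unknown kwargs are accepted
        return text, kwargs


try:
    StrictSignatureTokenizer().tokenize("h ɛ l oʊ", return_offsets_mapping=True)
except TypeError as exc:
    print("broken:", exc)

tokens = PatchedTokenizer().tokenize("h ɛ l oʊ", return_offsets_mapping=True)
print("fixed:", tokens)  # ['h', 'ɛ', 'l', 'oʊ']
```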

Fixes #45198

cc: @Rocketknight1 @itazap

CI run test coverage of this behavior, fixed by this PR (as suggested for inclusion in the PR):

models/wav2vec2_phoneme/test_tokenization_wav2vec2_phoneme.py::Wav2Vec2PhonemeCTCTokenizerTest:

test_batch_encode_dynamic_overflowing, test_batch_encode_plus_batch_sequence_length, test_batch_encode_plus_padding, test_call, test_case_insensitive, test_change_phonemizer_lang, test_chat_template, test_chat_template_batched, test_decode_with_del, test_empty_input_string, test_encode, test_encode_basic_padding, test_encode_decode, test_encode_decode_with_del, test_encode_decode_with_del_filter, test_encode_plus_with_padding_0, test_encode_plus_with_padding_1, test_encode_with_del, test_mask_output, test_maximum_encoding_length_pair_input, test_maximum_encoding_length_single_input, test_number_of_added_tokens, test_offsets, test_offsets_batch, test_padding_to_multiple_of, test_phonemize, test_phonemize_with_word_del, test_prepare_seq2seq_batch, test_pretokenized_inputs, test_right_and_left_truncation, test_save_and_load_tokenizer, test_special_tokens_mask, test_special_tokens_mask_input_pairs, test_token_type_ids, test_tokenizer_add_new_tokens

Repro output after the fixes (feel free to cross-check):

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you fix any necessary existing tests?

@harshaljanjani harshaljanjani marked this pull request as ready for review April 2, 2026 20:13
@github-actions github-actions Bot requested a review from ArthurZucker April 2, 2026 20:13
Collaborator

@itazap itazap left a comment

Thank you! good catch, apologies for the overlap in naming

@itazap
Collaborator

itazap commented Apr 9, 2026

let's just add a small test so that we catch this next time 😉

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

    token_ids = tokenizer("maɪ c", do_phonemize=False).input_ids
    self.assertEqual(token_ids, [3, 200])  # mai should be <unk> (=3)

    def test_phonemizer_backend_not_clobbered(self):
Contributor Author

@harshaljanjani harshaljanjani Apr 9, 2026

@itazap Added, please check and let me know if it's alright :)

Contributor Author

Removed as per the review.

@harshaljanjani harshaljanjani requested a review from itazap April 9, 2026 08:08
@harshaljanjani
Contributor Author

Good day @itazap; just checking in to see if there have been any updates :)

@itazap
Collaborator

itazap commented Apr 13, 2026

Hey, sorry! The good news is that the existing tests like test_phonemize, test_encode, etc. already call _phonemize, so we don't need a new test 😅 The bad news is that these tests are failing both with and without this change.

@harshaljanjani
Contributor Author

@itazap Makes sense, removed :)

@itazap
Collaborator

itazap commented Apr 13, 2026

I'd prefer to wait to merge a fix that allows these tests to pass! Looking into it as well

@harshaljanjani
Contributor Author

harshaljanjani commented Apr 13, 2026

Ah, sorry, I missed what you meant in the previous comment. I've completed the investigation and expanded the PR's coverage. Here's an explanation of why this is a bit more involved now; could you double-check whether this fixes all the tests and whether the fixes make sense?

tokenization_wav2vec2_phoneme.py: dropped word_delimiter_token / phone_delimiter_token from super().__init__() (they're delimiters, not vocab tokens), so V5 auto-promotion no longer adds " " into the vocab. Round-trip is still preserved by reinjecting them into init_kwargs and re-reading them from model_specific_special_tokens on load.
→ Found a get_tokenizer() precedence bug: kwargs.update(cls.special_tokens_map) was clobbering user kwargs (e.g. pad_token="[PAD]" → <pad>). Copied the correct wav2vec2 pattern verbatim :))
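The precedence bug can be sketched like this (simplified names; a hypothetical minimal helper, not the exact test code):

```python
# dict.update overwrites existing keys, so the order of the merges
# decides whether arguments passed by the caller survive or get
# clobbered by class-level defaults.

special_tokens_map = {"pad_token": "<pad>"}  # class-level defaults

def get_tokenizer_broken(**kwargs):
    kwargs.update(special_tokens_map)  # defaults clobber user kwargs
    return kwargs

def get_tokenizer_fixed(**kwargs):
    merged = dict(special_tokens_map)  # defaults first...
    merged.update(kwargs)              # ...then user kwargs win
    return merged

print(get_tokenizer_broken(pad_token="[PAD]"))  # {'pad_token': '<pad>'}
print(get_tokenizer_fixed(pad_token="[PAD]"))   # {'pad_token': '[PAD]'}
```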

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: wav2vec2_phoneme

@itazap itazap enabled auto-merge April 14, 2026 13:46
@itazap
Collaborator

itazap commented Apr 14, 2026

perfect, thanks a lot! 🤗

@itazap itazap added this pull request to the merge queue Apr 14, 2026
Merged via the queue into huggingface:main with commit 5b565a5 Apr 14, 2026
17 checks passed
@harshaljanjani harshaljanjani deleted the fix/wav2vec2-phoneme-tokenizer-regressions branch April 14, 2026 14:00
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026
…vec2-lv-60-espeak-cv-ft) (huggingface#45199)

* fix: Resolve regressions from tokenizer refactor

* chore: Add regression test

* nit: Remove the test

* fix: Expand test coverage to all tests

---------

Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

[BUG] Wav2Vec2 wav2vec2-lv-60-espeak-cv-ft: save_pretrained and tokenization fail

3 participants