fix(models): Resolve regressions in Wav2Vec2PhonemeCTCTokenizer (wav2vec2-lv-60-espeak-cv-ft) #45199
Conversation
itazap left a comment:
Thank you! Good catch; apologies for the overlap in naming.
Let's just add a small test so that we catch this next time 😉
Review comment on the suggested test:

```python
def test_phonemizer_backend_not_clobbered(self):
```

```python
token_ids = tokenizer("maɪ c", do_phonemize=False).input_ids
self.assertEqual(token_ids, [3, 200])  # mai should be <unk> (=3)
```
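As a standalone illustration of what the quoted assertion checks, here is a minimal sketch with a hypothetical stand-in tokenizer (not the Hugging Face implementation): with phonemization disabled, tokens are looked up as-is, and out-of-vocabulary tokens fall back to the `<unk>` id.

```python
# Hypothetical minimal vocab mirroring the ids in the quoted test:
# <unk> has id 3, and "c" happens to map to 200.
VOCAB = {"<unk>": 3, "c": 200}

def encode(text, do_phonemize=True):
    # With do_phonemize=False the raw whitespace-split tokens are looked up
    # directly; anything absent from the vocab maps to the <unk> id.
    tokens = text.split()
    return [VOCAB.get(t, VOCAB["<unk>"]) for t in tokens]

print(encode("maɪ c", do_phonemize=False))  # [3, 200] — "maɪ" is OOV, "c" is known
```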
@itazap Added, please check and let me know if it's alright :)
Removed as per the review.
Good day @itazap; just checking in to see if there have been any updates :)
hey sorry, good news is actually the existing tests like
@itazap Makes sense, removed :)
I'd prefer to wait to merge a fix that allows these tests to pass! Looking into it as well
Ah, sorry, I missed what you truly meant in the previous comment. I've completed the investigation and expanded the PR coverage. Dropping an explanation of why this is a bit more involved now; could you double-check whether this fixes all the tests and whether the fixes make sense? → tokenization_wav2vec2_phoneme.py
[For maintainers] Suggested jobs to run (before merge): run-slow: wav2vec2_phoneme
perfect, thanks a lot! 🤗
…vec2-lv-60-espeak-cv-ft) (huggingface#45199)

* fix: Resolve regressions from tokenizer refactor
* chore: Add regression test
* nit: Remove the test
* fix: Expand test coverage to all tests

Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>

What does this PR do?
The following Wav2Vec2PhonemeCTCTokenizer regressions were identified and fixed in this PR:
→ 05c0e1d ("rm slow tokenizers") added `self.backend = kwargs.pop("backend", None)`.
`Wav2Vec2PhonemeCTCTokenizer` already used `self.backend` for its phonemizer `EspeakBackend` object, set in `init_backend`. Regardless of call order, one clobbers the other: either the base class overwrites the phonemizer object with `None` (breaking `phonemize()`), or the phonemizer object overwrites the base class's serializable value (breaking `save_pretrained` with `EspeakBackend is not JSON serializable`). Renamed to `self._phonemizer_backend` so both attributes coexist, following the same naming convention used for `_word_delimiter_token` and `_phone_delimiter_token` in the same file. → Same commit consolidated
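The clobbering can be reduced to a minimal sketch with hypothetical classes (the names below are illustrative, not the actual `transformers` code): a base class and a subclass both writing to `self.backend`, where renaming the subclass attribute lets the two values coexist.

```python
class BaseTokenizer:
    def __init__(self, **kwargs):
        # Base class stores a serializable config value under `self.backend`.
        self.backend = kwargs.pop("backend", None)

class PhonemeTokenizer(BaseTokenizer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Before the fix, this line also wrote to `self.backend`, clobbering
        # the base class's value with a non-serializable object. Renaming the
        # attribute (as the PR does) keeps both values intact.
        self._phonemizer_backend = object()  # stands in for EspeakBackend

tok = PhonemeTokenizer(backend="espeak")
print(tok.backend)                           # serializable value survives
print(tok._phonemizer_backend is not None)   # phonemizer object coexists
```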
`tokenization_utils.py` into `tokenization_python.py`. In the old code, `_encode_plus` had `return_offsets_mapping` as a named param and raised `NotImplementedError` before reaching `tokenize()`. After the refactor, `return_offsets_mapping` is no longer a named param in `_encode_plus`, so it flows through `**kwargs` → `tokenize()` → `prepare_for_tokenization()`, which had a fixed signature. Added `**kwargs` to match the base class contract at tokenization_python.py#L836-L838. No other models are affected; `Wav2Vec2PhonemeCTCTokenizer` is the only override of `prepare_for_tokenization` that was missing `**kwargs` :) → For more details on reproducing the bug and the output screenshots, please visit the linked issue!
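The signature mismatch can also be sketched in isolation (hypothetical classes, not the actual `transformers` code): when a keyword flows down a call chain via `**kwargs`, an override with a fixed signature raises `TypeError`, while an override that accepts `**kwargs` honors the base-class contract.

```python
class Base:
    def encode(self, text, **kwargs):
        # Unrecognized keywords are forwarded down the call chain,
        # mirroring _encode_plus -> tokenize -> prepare_for_tokenization.
        return self.prepare_for_tokenization(text, **kwargs)

    def prepare_for_tokenization(self, text, **kwargs):
        return text, kwargs

class Broken(Base):
    # Fixed signature: an unexpected keyword arriving here raises TypeError.
    def prepare_for_tokenization(self, text, is_split_into_words=False):
        return text, {}

class Fixed(Base):
    # Adding **kwargs restores compatibility with the base-class contract.
    def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs):
        return text, kwargs

try:
    Broken().encode("hi", return_offsets_mapping=True)
except TypeError as exc:
    print("Broken override:", exc)

print(Fixed().encode("hi", return_offsets_mapping=True))
```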
Fixes #45198
cc: @Rocketknight1 @itazap
CI test coverage of this behavior, fixed by this PR (as suggested for inclusion in the PR):
→ models/wav2vec2_phoneme/test_tokenization_wav2vec2_phoneme.py::Wav2Vec2PhonemeCTCTokenizerTest:
Repro output after the fixes (feel free to cross-check):