use TokenizersBackend #42894
| ("udop", "UdopTokenizer" if is_tokenizers_available() else None), | ||
| ("umt5", "T5Tokenizer" if is_tokenizers_available() else None), | ||
| ("video_llava", "LlamaTokenizer" if is_tokenizers_available() else None), | ||
| ("video_llava", "TokenizersBackend" if is_tokenizers_available() else None), |
@ArthurZucker Just wondering if the LlamaTokenizer for all of these was causing issues?
Yep! Because we now enforce Llama whenever a model is mapped to Llama, all of these were assumed to have a Llama-like pre-tokenizer, but they really don't!
We should / could just remove them entirely from the mapping, defaulting to `TokenizersBackend`!
force-pushed from 6a942de to b71b245
force-pushed from 273d2cb to a31bb4f
[For maintainers] Suggested jobs to run (before merge): run-slow: aria, auto, blenderbot, canine, chameleon, chinese_clip, code_llama, deepseek_vl, deepseek_vl_hybrid, ernie4_5_vl_moe, granite_speech, layoutlmv2, nougat, parakeet, pixtral

View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42894&sha=0354e3
* Fix tokenizer auto_map being ignored for custom models (#43202)

  PR #42894 added an early-exit to TokenizersBackend when tokenizer_class doesn't match the registered tokenizer for a model_type. However, this early-exit was placed before the auto_map check, causing custom tokenizers with trust_remote_code to be ignored. This fix moves the auto_map extraction before the early-exit check and adds `tokenizer_auto_map is None` to the condition, so models with custom tokenizers properly use the dynamic module loading path.

* style

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
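In rough pseudocode, the fix amounts to something like the following minimal sketch (this is not the actual transformers source; names like `registered_class` are stand-ins):

```python
def should_early_exit_to_backend(tokenizer_config: dict, registered_class: str) -> bool:
    """Sketch of the corrected early-exit check described above."""
    tokenizer_class = tokenizer_config.get("tokenizer_class")

    # Fix part 1: extract auto_map BEFORE the early-exit decision.
    tokenizer_auto_map = tokenizer_config.get("auto_map", {}).get("AutoTokenizer")

    # Fix part 2: only early-exit to TokenizersBackend when no custom
    # tokenizer is mapped, so trust_remote_code tokenizers are not ignored.
    return (
        tokenizer_class is not None
        and tokenizer_class != registered_class
        and tokenizer_auto_map is None  # the added condition
    )
```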
* use `TokenizersBackend`
* fixes
* prioritize mapping
* prioritize mapping
* only use mapping for some models
* fix fallback
* undo debug thing
* add case to tokenizersbackend init
* add default bos eos token to tok backend
* set bos eos
* fix more models
* mistral idefics
* fix stopping criteria test
* fix stopping criteria test
* try stopping criteria fix
* rebase
* update tokenizer model for stopping criteria test
* fix tuple mapping for ministral
* ignore `tokenizer_class` as it is always wrong
* up
* try to fix idefics
* fix unispeech and maybe others: fall back to the saved class if conversion was not possible
* nits
* fixup
* TIL that it was ALSO saved in config.json...
* arf
* fallback to tok config if no config json
* people who map to Llama probably don't even want llama either..
* processors to load tokbackend
* auto fix order
* try diff order
* mistral fix for weird chars
* reorder
* random fix attempt for failing tests that are failing locally so idk how to check these
* trying an older commit
* fix mistral
* map unispeech
* try something out
* update
* nits
* trying to be a little bit more restrictive
* token type ids for tokenizers should be explicit... let's see which tests fail and we'll add them to the specific classes?
* Nit
* idefics 1-2 are actually the only ones that should map to llama force
* small fixes
* fix layout
* fixup
* fix some tests
* 1 nit
* aria fix
* style
* canine
* fixup
* very small test
* style
* update to tokenizersbackend

Co-authored-by: itazap <ita.zaporozhets@huggingface.co>
Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
transformers 5.0.0rc1 changed from a ByteLevel decoder to a Sequence decoder that strips the space markers (▁) from SentencePiece token pieces, causing all spaces to be lost during decode. This silently broke DeepSeek (and likely all SentencePiece-based models). Pinned to <5.0.0 until the fix in huggingface/transformers#42894 ships in a stable release.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
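A minimal sketch of the failure mode with the `tokenizers` decoders API (the token pieces are made up, and this is not the actual DeepSeek decoder config):

```python
from tokenizers import decoders

pieces = ["▁Deep", "Seek", "▁rocks"]

# SentencePiece-style decode: each "▁" marker becomes a space.
good = decoders.Sequence([decoders.Replace("▁", " "), decoders.Fuse()])
print(good.decode(pieces))  # " DeepSeek rocks"

# A Sequence decoder that strips the marker instead loses every space.
bad = decoders.Sequence([decoders.Replace("▁", ""), decoders.Fuse()])
print(bad.decode(pieces))   # "DeepSeekrocks"
```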
* Add an integration test for LASR using pipe and chunked decoding
* Revise goldens in LasrForCTCIntegrationTest.test_model_integration_batched
* Enable LasrForCTCIntegrationTest
* add require_torch_accelerator
* Use a publicly accessible test model for LASR and update integration test goldens
* Correct the tokenizer mapping for LASR models

  Because of the out-of-date tokenizer mapping, AutoTokenizer started returning TokenizersBackend instead of LasrTokenizer after #42894, which caused Google-Health/medasr#12.

* Remove require_read_token since we now use a publicly accessible test checkpoint
* update values for runners

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com>
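A quick way to check that the corrected mapping is in effect (the checkpoint id below is a placeholder, not a real repo):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/lasr-checkpoint")  # placeholder id
print(type(tok).__name__)  # expect "LasrTokenizer", not "TokenizersBackend"
```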
What does this PR do?
Fixes a bunch of issues (including #42874)
With this PR, we now kinda enforce that if there is a specific python path to be used (meaning there is a `XXXXTokenizer` class that has python code, whatever the backend), the tokenizer's saved `tokenizer_class` needs to match the class mapped in `tokenization_auto` to the `model_type`:

- If there is no `config.json`, we are just gonna use the serialized `tokenizer_class`, but in most cases it's gonna produce gibberish outputs, because many, many models on the hub use say `LlamaTokenizer` when the tokenizer they actually need is completely different.
- If there is a `config.json`, we extract the `model_type` and check if `TOKENIZER_MAPPING[model_type]` matches the serialized `tokenizer_class`.
- If it does not match, or the `model_type` maps to `TokenizersBackend`: we assume the serialized class is not intended + it's not a special python path + it's a recent model -> we use `TokenizersBackend`, falling back to the `tokenizer_class` upon failure to convert.
- We rely on the `model_type` to pick the `tokenizer` architecture.
- With `trust_remote_code=True` we always use the class people want.

We cannot rely on the `tokenizer_class`, and we should not rely on it; we rely on the model type instead. Also, 90% of the tokenizers on the hub don't need special code and are just supported OOB by `tokenizers`; that's the motivation behind this decision.

We cannot fix ALL `tokenizer_config.json` files on the hub + we actually don't want to. We want to just read the `tokenizer.json` in general, using `TokenizersBackend`.
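In pseudocode, the resolution order described above looks roughly like this minimal, self-contained sketch (not the actual transformers internals; the mapping shown is an illustrative subset):

```python
import json
import os

TOKENIZER_MAPPING = {"llama": "LlamaTokenizer"}  # illustrative subset

def resolve_tokenizer(repo_dir: str, trust_remote_code: bool = False) -> str:
    with open(os.path.join(repo_dir, "tokenizer_config.json")) as f:
        tok_config = json.load(f)
    saved_class = tok_config.get("tokenizer_class")

    # trust_remote_code=True: always use the class people want.
    if trust_remote_code and "auto_map" in tok_config:
        return tok_config["auto_map"]["AutoTokenizer"]

    config_path = os.path.join(repo_dir, "config.json")
    if not os.path.exists(config_path):
        # No config.json: fall back to the serialized class (often wrong!).
        return saved_class

    with open(config_path) as f:
        model_type = json.load(f)["model_type"]

    mapped = TOKENIZER_MAPPING.get(model_type)
    if mapped is not None and mapped == saved_class:
        return mapped  # the special python path really is intended

    # Mismatch: assume the saved class was not intended -> TokenizersBackend,
    # falling back to saved_class if the conversion fails.
    return "TokenizersBackend"
```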
What matters, breaking changes 🔴 🔴 🔴

`TokenizersBackend` no longer defaults to having `token_type_ids`. If you want them, set the flag (see the example at the end).

Given:
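For instance (the repo id and file contents below are made up), a checkpoint that ships a `tokenizer_config.json` pinning a class but no `config.json` at all:

```python
from transformers import AutoTokenizer

# tokenizer_config.json on the hub:
#   {"tokenizer_class": "LlamaTokenizer", ...}
# and no config.json next to it.
tok = AutoTokenizer.from_pretrained("some-user/old-sentencepiece-model")
```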
This still works, because if there is no `config.json` we do fall back to the serialized `tokenizer_class` in the `tokenizer_config.json`. But this class, for MOST of the models out there, is WRONG! 😉
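And for the `token_type_ids` breaking change above, requesting them explicitly at encode time looks like this (`return_token_type_ids` is the standard encode kwarg; whether it is the exact flag meant here is an assumption, and the checkpoint id is a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/some-model")  # placeholder id
enc = tok("hello world", return_token_type_ids=True)  # ask for them explicitly
print(enc["token_type_ids"])
```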