
use TokenizersBackend #42894

Merged
ArthurZucker merged 61 commits into main from fix-tokenizer-auto on Jan 7, 2026

Conversation

@ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented Dec 16, 2025

What does this PR do?

Fixes a bunch of issues (including #42874)

With this PR, we now essentially enforce that if there is a specific Python path to be used (meaning there is a XXXXTokenizer class that carries Python code, whatever the backend), the tokenizer's saved tokenizer_class needs to match the class that tokenization_auto maps to the model_type.

  • If there is no config.json, we just use the serialized tokenizer_class, but in most cases this will produce gibberish outputs, because many, many models on the Hub declare, say, LlamaTokenizer when the tokenizer they actually need is completely different.
  • If there is a config.json, we extract the model_type and check whether TOKENIZER_MAPPING[model_type] matches the serialized tokenizer_class.
    • If NO: we just use TokenizersBackend. We assume the mismatch is not intended, there is no special Python path, and it is a recent model, so we use TokenizersBackend, falling back to the serialized tokenizer_class if conversion fails.
    • If YES: it is safe to say this is intended; we have a good match/mapping and can enforce the tokenizer architecture.
  • If it is remote code (trust_remote_code=True), we always use the class people want.

We cannot and should not rely on the tokenizer_class; instead, we rely on the model_type.

Also, 90% of the tokenizers on the Hub don't need special code and are supported out of the box by tokenizers; that is the motivation behind this decision.

We cannot fix ALL the tokenizer_config.json files on the Hub, and we actually don't want to. In general, we just want to read the tokenizer.json using TokenizersBackend.
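A minimal sketch of that resolution order (illustrative only; resolve_tokenizer_class and the stand-in TOKENIZER_MAPPING_NAMES dict below are not the actual implementation):

# Illustrative stand-in for the real model_type -> tokenizer-class mapping.
TOKENIZER_MAPPING_NAMES = {"qwen2": "Qwen2Tokenizer", "llama": "LlamaTokenizer"}

def resolve_tokenizer_class(config, tokenizer_config, trust_remote_code=False):
    serialized_class = tokenizer_config.get("tokenizer_class")

    # Remote code always wins: use the class people explicitly asked for.
    if trust_remote_code and "auto_map" in tokenizer_config:
        return tokenizer_config["auto_map"]["AutoTokenizer"]

    # No config.json: fall back to the serialized class (often wrong on the Hub).
    if config is None:
        return serialized_class

    # config.json present: trust the model_type mapping over the serialized class.
    mapped_class = TOKENIZER_MAPPING_NAMES.get(config["model_type"])
    if mapped_class is not None and mapped_class == serialized_class:
        # The match looks intended: enforce the model-specific tokenizer class.
        return mapped_class

    # Mismatch or unmapped: no special Python path is assumed, so read
    # tokenizer.json with TokenizersBackend (falling back to the serialized
    # class if conversion fails).
    return "TokenizersBackend"

So resolve_tokenizer_class({"model_type": "qwen2"}, {"tokenizer_class": "Qwen2Tokenizer"}) enforces Qwen2Tokenizer, while a mismatched tokenizer_class falls through to TokenizersBackend.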

What matters: breaking changes 🔴 🔴 🔴 🔴 🔴 🔴

  1. TokenizersBackend no longer defaults to returning token_type_ids. If you want them, set the flag explicitly, for example as in the sketch below.
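A minimal call-time sketch, assuming the standard return_token_type_ids kwarg is the flag in question here:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
# token_type_ids are no longer included by default; request them explicitly
# (assuming the usual return_token_type_ids kwarg applies to TokenizersBackend).
enc = tok("hello world", return_token_type_ids=True)
print(enc.keys())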

Given:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
tok.push_to_hub("ArthurZ/MyTokenizer") 

This:

from transformers import AutoTokenizer, Qwen2Tokenizer

tok = AutoTokenizer.from_pretrained("ArthurZ/MyTokenizer")
assert tok.__class__ == Qwen2Tokenizer

still works, because if there is no config.json we do fall back to the serialized tokenizer_class in tokenizer_config.json. But for MOST of the models out there, this class is WRONG! 😉

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

("udop", "UdopTokenizer" if is_tokenizers_available() else None),
("umt5", "T5Tokenizer" if is_tokenizers_available() else None),
("video_llava", "LlamaTokenizer" if is_tokenizers_available() else None),
("video_llava", "TokenizersBackend" if is_tokenizers_available() else None),
Collaborator


@ArthurZucker Just wondering if the LlamaTokenizer for all of these was causing issues?

Collaborator Author


Yep! Because we now enforce LlamaTokenizer when a model_type maps to it, all of these were assumed to have a Llama-like pre-tokenizer, but they really don't!
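A small way to check this, sketched against a downloaded tokenizer.json using the tokenizers library (the file path is illustrative):

from tokenizers import Tokenizer

# The serialized tokenizer.json records the actual pre-tokenizer, regardless of
# what tokenizer_class in tokenizer_config.json claims.
tok = Tokenizer.from_file("tokenizer.json")
print(type(tok.pre_tokenizer).__name__)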

Collaborator Author


We should / could just remove them from the mapping entirely, defaulting to TokenizersBackend!

@itazap itazap force-pushed the fix-tokenizer-auto branch from 273d2cb to a31bb4f on January 4, 2026 23:12
@itazap itazap marked this pull request as ready for review January 5, 2026 15:15
@itazap itazap marked this pull request as draft January 5, 2026 15:16
@itazap itazap mentioned this pull request Jan 5, 2026
@ArthurZucker ArthurZucker marked this pull request as ready for review January 6, 2026 10:08
@github-actions
Contributor

github-actions Bot commented Jan 7, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, auto, blenderbot, canine, chameleon, chinese_clip, code_llama, deepseek_vl, deepseek_vl_hybrid, ernie4_5_vl_moe, granite_speech, layoutlmv2, nougat, parakeet, pixtral

@github-actions
Contributor

github-actions Bot commented Jan 7, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42894&sha=0354e3

@ArthurZucker ArthurZucker merged commit 9daee2e into main Jan 7, 2026
24 of 26 checks passed
@ArthurZucker ArthurZucker deleted the fix-tokenizer-auto branch January 7, 2026 16:49
Anri-Lombard added a commit to Anri-Lombard/transformers that referenced this pull request Jan 11, 2026
…3202)

PR huggingface#42894 added an early-exit to TokenizersBackend when tokenizer_class
doesn't match the registered tokenizer for a model_type. However, this
early-exit was placed before the auto_map check, causing custom tokenizers
with trust_remote_code to be ignored.

This fix moves the auto_map extraction before the early-exit check and adds
tokenizer_auto_map is None to the condition, so models with custom tokenizers
properly use the dynamic module loading path.
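The shape of that fix, as a rough sketch (illustrative names, not the actual source):

def should_early_exit_to_tokenizers_backend(tokenizer_config, mapped_class):
    serialized_class = tokenizer_config.get("tokenizer_class")
    # Extract the auto_map *before* deciding, so custom tokenizers are honored.
    tokenizer_auto_map = tokenizer_config.get("auto_map", {}).get("AutoTokenizer")
    # Take the TokenizersBackend early-exit only when no custom tokenizer is declared.
    return mapped_class != serialized_class and tokenizer_auto_map is None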
@vasqu vasqu mentioned this pull request Jan 12, 2026
vasqu added a commit that referenced this pull request Jan 22, 2026
* Fix tokenizer auto_map being ignored for custom models (#43202)

PR #42894 added an early-exit to TokenizersBackend when tokenizer_class
doesn't match the registered tokenizer for a model_type. However, this
early-exit was placed before the auto_map check, causing custom tokenizers
with trust_remote_code to be ignored.

This fix moves the auto_map extraction before the early-exit check and adds
tokenizer_auto_map is None to the condition, so models with custom tokenizers
properly use the dynamic module loading path.

* style

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* us `TokenizersBackend`

* fixes

* pioritize mapping

* pioritize mapping

* only use mapping for some models

* fix fallback

* undo debug thing

* add case to tokenizersbackend init

* add default bos eos token to tok backend

* set bos eos

* fix more models

* mistrla idefics

* fix stopping criteria test

* fix stopping criteria test

* try stopping criteria fix

* rebase

* update tokenizer model for stopping criteria test

* fix tuple mapping for ministral

* ignore `tokenizer_class` as it is always wrong

* up

* try to fix idefics

* fix unispeech and maybe other: fallback if conversion was not possible to the saveclass

* nits

* fixup

* TIL that it was ALSO saved in config.json...

* arf

* fallback to tok config if no config json

* people who map to Llama probably don't even want llama either..

* processors to load tokbackend

* auto fix order

* try diff order

* mistral fix for weird chars

* reorder

* random fix attempt for failing tests that are failing locally so idk how to check these

* trying an older commit

* fix mistral

* map unispeech

* try something out

* update

* nits

* trying to be a little bit more restrictive

* token type ids for tokenizers should be explicits... let's see which test fail this and we'll add to the specific classes?

* Nit

* idefics 1-2 are actually the only ones that should map to llama force

* small fixes

* fix layout

* fixup

* fix some tests

* 1 nit

* aria fix

* style

* canine

* fixup

* very small test

* style

* update to tokenizersbackend

---------

Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-164-45.ec2.internal>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-168-52.ec2.internal>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-174-196.ec2.internal>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-167-217.ec2.internal>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-167-111.ec2.internal>
Co-authored-by: itazap <ita.zaporozhets@huggingface.co>
Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-164-75.ec2.internal>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-160-100.ec2.internal>
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
…3219)

* Fix tokenizer auto_map being ignored for custom models (huggingface#43202)

PR huggingface#42894 added an early-exit to TokenizersBackend when tokenizer_class
doesn't match the registered tokenizer for a model_type. However, this
early-exit was placed before the auto_map check, causing custom tokenizers
with trust_remote_code to be ignored.

This fix moves the auto_map extraction before the early-exit check and adds
tokenizer_auto_map is None to the condition, so models with custom tokenizers
properly use the dynamic module loading path.

* style

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
yshk-mxim added a commit to yshk-mxim/agent-memory that referenced this pull request Feb 13, 2026
transformers 5.0.0rc1 changed from ByteLevel decoder to a Sequence decoder
that strips space markers (▁) from SentencePiece token pieces, causing all
spaces to be lost during decode. This silently broke DeepSeek (and likely
all SentencePiece-based models). Pinned to <5.0.0 until the fix in
huggingface/transformers#42894 ships in a stable release.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
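Roughly what the symptom looks like, sketched with the tokenizers decoder API (the token pieces are made up):

from tokenizers import decoders

pieces = ["▁Hello", "▁world"]
# A Metaspace-style decoder turns the ▁ markers back into spaces.
print(decoders.Metaspace().decode(pieces))
# Merely stripping the markers loses the spaces, which is the breakage described above.
print("".join(piece.replace("▁", "") for piece in pieces))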
kho added a commit to kho/transformers that referenced this pull request Mar 6, 2026
Because of the out-of-date tokenizer mapping, AutoTokenizer started returning TokenizersBackend instead of LasrTokenizer after huggingface#42894, which caused Google-Health/medasr#12.
github-merge-queue Bot pushed a commit that referenced this pull request Mar 12, 2026
)

* Add an integration test for LASR using pipe and chunked decoding

* Revise goldens in LasrForCTCIntegrationTest.test_model_integration_batched

* Enable LasrForCTCIntegrationTest

* add require_torch_accelerator

* Use a publicly accessible test model for LASR and update integration test goldens

* Correct the tokenizer mapping for LASR models

Because of the out-of-date tokenizer mapping, AutoTokenizer started returning TokenizersBackend instead of LasrTokenizer after #42894, which caused Google-Health/medasr#12.

* Remove require_read_token since we now use a publicly assessible test checkpoint

* update values for runners

---------

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com>
michaelbenayoun pushed a commit to michaelbenayoun/transformers that referenced this pull request Mar 12, 2026
…gingface#42823)

* Add an integration test for LASR using pipe and chunked decoding

* Revise goldens in LasrForCTCIntegrationTest.test_model_integration_batched

* Enable LasrForCTCIntegrationTest

* add require_torch_accelerator

* Use a publicly accessible test model for LASR and update integration test goldens

* Correct the tokenizer mapping for LASR models

Because of the out-of-date tokenizer mapping, AutoTokenizer started returning TokenizersBackend instead of LasrTokenizer after huggingface#42894, which caused Google-Health/medasr#12.

* Remove require_read_token since we now use a publicly assessible test checkpoint

* update values for runners

---------

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com>