
use TokenizersBackend #42894

Merged
ArthurZucker merged 61 commits into main from fix-tokenizer-auto on Jan 7, 2026

Conversation

@ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented Dec 16, 2025

What does this PR do?

Fixes a bunch of issues (including #42874)

With this PR, we now essentially enforce that if there is a specific Python path to be used (meaning there is a XXXXTokenizer class that carries Python code, whatever the backend), the tokenizer's saved tokenizer_class needs to match the class that tokenization_auto maps to the model_type.

  • If there is no config.json, we just use the serialized tokenizer_class, but in most cases this will produce gibberish outputs, because many, many models on the Hub declare, say, LlamaTokenizer when the tokenizer they actually need is completely different.
  • If there is a config.json, we extract the model_type and check whether TOKENIZER_MAPPING[model_type] matches the serialized tokenizer_class.
    • If NO: we just use TokenizersBackend. We assume the mismatch is not intended, there is no special Python path, and it is a recent model, so we use TokenizersBackend, falling back to the serialized tokenizer_class if conversion fails.
    • If YES: it is safe to say this is intended; we have a good match/mapping and can enforce the tokenizer architecture.
  • If it is remote code (trust_remote_code=True), we always use the class people want.

We cannot and should not rely on the tokenizer_class; instead, we rely on the model_type.

Also, 90% of the tokenizers on the Hub don't need special code and are supported out of the box by tokenizers; that is the motivation behind this decision.

We cannot fix ALL the tokenizer_config.json files on the Hub, and we actually don't want to. In general, we just want to read the tokenizer.json using TokenizersBackend.
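A minimal sketch of that resolution order (illustrative only; resolve_tokenizer_class and the stand-in TOKENIZER_MAPPING_NAMES dict below are not the actual implementation):

# Illustrative stand-in for the real model_type -> tokenizer-class mapping.
TOKENIZER_MAPPING_NAMES = {"qwen2": "Qwen2Tokenizer", "llama": "LlamaTokenizer"}

def resolve_tokenizer_class(config, tokenizer_config, trust_remote_code=False):
    serialized_class = tokenizer_config.get("tokenizer_class")

    # Remote code always wins: use the class people explicitly asked for.
    if trust_remote_code and "auto_map" in tokenizer_config:
        return tokenizer_config["auto_map"]["AutoTokenizer"]

    # No config.json: fall back to the serialized class (often wrong on the Hub).
    if config is None:
        return serialized_class

    # config.json present: trust the model_type mapping over the serialized class.
    mapped_class = TOKENIZER_MAPPING_NAMES.get(config["model_type"])
    if mapped_class is not None and mapped_class == serialized_class:
        # The match looks intended: enforce the model-specific tokenizer class.
        return mapped_class

    # Mismatch or unmapped: no special Python path is assumed, so read
    # tokenizer.json with TokenizersBackend (falling back to the serialized
    # class if conversion fails).
    return "TokenizersBackend"

So resolve_tokenizer_class({"model_type": "qwen2"}, {"tokenizer_class": "Qwen2Tokenizer"}) enforces Qwen2Tokenizer, while a mismatched tokenizer_class falls through to TokenizersBackend.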

What matters: breaking changes 🔴 🔴 🔴 🔴 🔴 🔴

  1. TokenizersBackend no longer defaults to returning token_type_ids. If you want them, set the flag explicitly, for example as in the sketch below.
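A minimal call-time sketch, assuming the standard return_token_type_ids kwarg is the flag in question here:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
# token_type_ids are no longer included by default; request them explicitly
# (assuming the usual return_token_type_ids kwarg applies to TokenizersBackend).
enc = tok("hello world", return_token_type_ids=True)
print(enc.keys())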

Given:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
tok.push_to_hub("ArthurZ/MyTokenizer") 

This:

from transformers import AutoTokenizer, Qwen2Tokenizer

tok = AutoTokenizer.from_pretrained("ArthurZ/MyTokenizer")
assert tok.__class__ == Qwen2Tokenizer

still works, because if there is no config.json we do fall back to the serialized tokenizer_class in tokenizer_config.json. But for MOST of the models out there, this class is WRONG! 😉

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

("udop", "UdopTokenizer" if is_tokenizers_available() else None),
("umt5", "T5Tokenizer" if is_tokenizers_available() else None),
("video_llava", "LlamaTokenizer" if is_tokenizers_available() else None),
("video_llava", "TokenizersBackend" if is_tokenizers_available() else None),
Collaborator


@ArthurZucker Just wondering if the LlamaTokenizer for all of these was causing issues?

Collaborator Author


Yep! Because we now enforce LlamaTokenizer when a model_type maps to it, all of these were assumed to have a Llama-like pre-tokenizer, but they really don't!
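A small way to check this, sketched against a downloaded tokenizer.json using the tokenizers library (the file path is illustrative):

from tokenizers import Tokenizer

# The serialized tokenizer.json records the actual pre-tokenizer, regardless of
# what tokenizer_class in tokenizer_config.json claims.
tok = Tokenizer.from_file("tokenizer.json")
print(type(tok.pre_tokenizer).__name__)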

Collaborator Author


We should / could just remove them from the mapping entirely, defaulting to TokenizersBackend!

@itazap itazap force-pushed the fix-tokenizer-auto branch from 273d2cb to a31bb4f on January 4, 2026 23:12
@itazap itazap marked this pull request as ready for review January 5, 2026 15:15
@itazap itazap marked this pull request as draft January 5, 2026 15:16
@itazap itazap mentioned this pull request Jan 5, 2026
@ArthurZucker ArthurZucker marked this pull request as ready for review January 6, 2026 10:08
@github-actions
Contributor

github-actions Bot commented Jan 7, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, auto, blenderbot, canine, chameleon, chinese_clip, code_llama, deepseek_vl, deepseek_vl_hybrid, ernie4_5_vl_moe, granite_speech, layoutlmv2, nougat, parakeet, pixtral

@github-actions
Contributor

github-actions Bot commented Jan 7, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42894&sha=0354e3

@ArthurZucker ArthurZucker merged commit 9daee2e into main Jan 7, 2026
24 of 26 checks passed
@ArthurZucker ArthurZucker deleted the fix-tokenizer-auto branch January 7, 2026 16:49
Anri-Lombard added a commit to Anri-Lombard/transformers that referenced this pull request Jan 11, 2026
…3202)

PR huggingface#42894 added an early-exit to TokenizersBackend when tokenizer_class
doesn't match the registered tokenizer for a model_type. However, this
early-exit was placed before the auto_map check, causing custom tokenizers
with trust_remote_code to be ignored.

This fix moves the auto_map extraction before the early-exit check and adds
tokenizer_auto_map is None to the condition, so models with custom tokenizers
properly use the dynamic module loading path.
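The shape of that fix, as a rough sketch (illustrative names, not the actual source):

def should_early_exit_to_tokenizers_backend(tokenizer_config, mapped_class):
    serialized_class = tokenizer_config.get("tokenizer_class")
    # Extract the auto_map *before* deciding, so custom tokenizers are honored.
    tokenizer_auto_map = tokenizer_config.get("auto_map", {}).get("AutoTokenizer")
    # Take the TokenizersBackend early-exit only when no custom tokenizer is declared.
    return mapped_class != serialized_class and tokenizer_auto_map is None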
@vasqu vasqu mentioned this pull request Jan 12, 2026
vasqu added a commit that referenced this pull request Jan 22, 2026
* Fix tokenizer auto_map being ignored for custom models (#43202)

PR #42894 added an early-exit to TokenizersBackend when tokenizer_class
doesn't match the registered tokenizer for a model_type. However, this
early-exit was placed before the auto_map check, causing custom tokenizers
with trust_remote_code to be ignored.

This fix moves the auto_map extraction before the early-exit check and adds
tokenizer_auto_map is None to the condition, so models with custom tokenizers
properly use the dynamic module loading path.

* style

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* us `TokenizersBackend`

* fixes

* pioritize mapping

* pioritize mapping

* only use mapping for some models

* fix fallback

* undo debug thing

* add case to tokenizersbackend init

* add default bos eos token to tok backend

* set bos eos

* fix more models

* mistrla idefics

* fix stopping criteria test

* fix stopping criteria test

* try stopping criteria fix

* rebase

* update tokenizer model for stopping criteria test

* fix tuple mapping for ministral

* ignore `tokenizer_class` as it is always wrong

* up

* try to fix idefics

* fix unispeech and maybe other: fallback if conversion was not possible to the saveclass

* nits

* fixup

* TIL that it was ALSO saved in config.json...

* arf

* fallback to tok config if no config json

* people who map to Llama probably don't even want llama either..

* processors to load tokbackend

* auto fix order

* try diff order

* mistral fix for weird chars

* reorder

* random fix attempt for failing tests that are failing locally so idk how to check these

* trying an older commit

* fix mistral

* map unispeech

* try something out

* update

* nits

* trying to be a little bit more restrictive

* token type ids for tokenizers should be explicits... let's see which test fail this and we'll add to the specific classes?

* Nit

* idefics 1-2 are actually the only ones that should map to llama force

* small fixes

* fix layout

* fixup

* fix some tests

* 1 nit

* aria fix

* style

* canine

* fixup

* very small test

* style

* update to tokenizersbackend

---------

Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-164-45.ec2.internal>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-168-52.ec2.internal>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-174-196.ec2.internal>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-167-217.ec2.internal>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-167-111.ec2.internal>
Co-authored-by: itazap <ita.zaporozhets@huggingface.co>
Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-164-75.ec2.internal>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-160-100.ec2.internal>
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
…3219)

* Fix tokenizer auto_map being ignored for custom models (huggingface#43202)

PR huggingface#42894 added an early-exit to TokenizersBackend when tokenizer_class
doesn't match the registered tokenizer for a model_type. However, this
early-exit was placed before the auto_map check, causing custom tokenizers
with trust_remote_code to be ignored.

This fix moves the auto_map extraction before the early-exit check and adds
tokenizer_auto_map is None to the condition, so models with custom tokenizers
properly use the dynamic module loading path.

* style

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
yshk-mxim added a commit to yshk-mxim/agent-memory that referenced this pull request Feb 13, 2026
transformers 5.0.0rc1 changed from ByteLevel decoder to a Sequence decoder
that strips space markers (▁) from SentencePiece token pieces, causing all
spaces to be lost during decode. This silently broke DeepSeek (and likely
all SentencePiece-based models). Pinned to <5.0.0 until the fix in
huggingface/transformers#42894 ships in a stable release.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
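Roughly what the symptom looks like, sketched with the tokenizers decoder API (the token pieces are made up):

from tokenizers import decoders

pieces = ["▁Hello", "▁world"]
# A Metaspace-style decoder turns the ▁ markers back into spaces.
print(decoders.Metaspace().decode(pieces))
# Merely stripping the markers loses the spaces, which is the breakage described above.
print("".join(piece.replace("▁", "") for piece in pieces))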
kho added a commit to kho/transformers that referenced this pull request Mar 6, 2026
Because of the out-of-date tokenizer mapping, AutoTokenizer started returning TokenizersBackend instead of LasrTokenizer after huggingface#42894, which caused Google-Health/medasr#12.
github-merge-queue Bot pushed a commit that referenced this pull request Mar 12, 2026
)

* Add an integration test for LASR using pipe and chunked decoding

* Revise goldens in LasrForCTCIntegrationTest.test_model_integration_batched

* Enable LasrForCTCIntegrationTest

* add require_torch_accelerator

* Use a publicly accessible test model for LASR and update integration test goldens

* Correct the tokenizer mapping for LASR models

Because of the out-of-date tokenizer mapping, AutoTokenizer started returning TokenizersBackend instead of LasrTokenizer after #42894, which caused Google-Health/medasr#12.

* Remove require_read_token since we now use a publicly assessible test checkpoint

* update values for runners

---------

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com>
michaelbenayoun pushed a commit to michaelbenayoun/transformers that referenced this pull request Mar 12, 2026
…gingface#42823)

* Add an integration test for LASR using pipe and chunked decoding

* Revise goldens in LasrForCTCIntegrationTest.test_model_integration_batched

* Enable LasrForCTCIntegrationTest

* add require_torch_accelerator

* Use a publicly accessible test model for LASR and update integration test goldens

* Correct the tokenizer mapping for LASR models

Because of the out-of-date tokenizer mapping, AutoTokenizer started returning TokenizersBackend instead of LasrTokenizer after huggingface#42894, which caused Google-Health/medasr#12.

* Remove require_read_token since we now use a publicly assessible test checkpoint

* update values for runners

---------

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com>