Fix local_files_only tokenizer fallback when tokenizer files are missing (Issue 45538) by Brianzhengca · Pull Request #45541 · huggingface/transformers

Brianzhengca · 2026-04-21T01:29:43Z

What does this PR do?

Root Cause

With local_files_only=True, from_pretrained(...) let missing tokenizer files resolve to None and still proceeded into _from_pretrained(...). A stub tokenizer was therefore initialized instead of an error being raised, which produced the bogus, very large model_max_length (10**30 in issue #45538).

Describe the Fix

The fix adds a fail-closed check in PreTrainedTokenizerBase.from_pretrained.

After resolving files, from_pretrained now checks whether all real tokenizer assets for that class resolved to None.
If they did, it raises the existing "Can't load tokenizer..." OSError instead of continuing into
_from_pretrained(...) and constructing a stub tokenizer.
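
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the helper name ensure_tokenizer_files_resolved and its parameters are hypothetical, and skipping classes that declare no file-backed assets is an assumption drawn from the later commit titles ("allow fileless custom tokenizers", "scope tokenizer guard"). The error text follows the OSError message quoted in the failing test output.

```python
from typing import Iterable, Mapping, Optional


def ensure_tokenizer_files_resolved(
    resolved_files: Mapping[str, Optional[str]],
    required_keys: Iterable[str],
    model_id: str,
    class_name: str,
) -> None:
    """Fail closed: raise OSError when every required asset resolved to None.

    Hypothetical helper for illustration; not the actual transformers code.
    """
    required_paths = [resolved_files.get(key) for key in required_keys]
    # Assumption: a class that declares no file-backed assets is left alone.
    if not required_paths:
        return
    # If at least one real asset was found, loading may proceed as before.
    if all(path is None for path in required_paths):
        raise OSError(
            f"Can't load tokenizer for '{model_id}'. If you were trying to load it from "
            "'https://huggingface.co/models', make sure you don't have a local directory "
            f"with the same name. Otherwise, make sure '{model_id}' is the correct path to "
            f"a directory containing all relevant files for a {class_name} tokenizer."
        )
```

In this sketch, a call such as ensure_tokenizer_files_resolved({"vocab_file": None, "merges_file": None}, ["vocab_file", "merges_file"], "missing/model", "CLIPTokenizer") raises OSError, while the same call with at least one resolved path returns without error.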

Local Tests

Checked BertTokenizer and CLIPTokenizer using a nonexistent model id and
local_files_only=True.
Before the fix, both loaded unexpectedly with the huge fallback max length.
After the fix, both raised the expected OSError.
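
A local check along these lines could look like the sketch below. The helper loads_fail_closed and the model id "nonexistent-org/nonexistent-model" are made up for illustration; only BertTokenizer.from_pretrained, CLIPTokenizer.from_pretrained, and the local_files_only argument come from the library. The script is not the test added in the PR, and it skips the real check when transformers is not installed.

```python
from typing import Callable


def loads_fail_closed(load: Callable[..., object], model_id: str) -> bool:
    """Return True if load(model_id, local_files_only=True) raises OSError."""
    try:
        load(model_id, local_files_only=True)
    except OSError:
        return True
    return False


if __name__ == "__main__":
    try:
        from transformers import BertTokenizer, CLIPTokenizer
    except ImportError:
        print("transformers is not installed; skipping the real check")
    else:
        # Hypothetical id that should not exist locally or on the Hub.
        missing_id = "nonexistent-org/nonexistent-model"
        for tokenizer_cls in (BertTokenizer, CLIPTokenizer):
            outcome = loads_fail_closed(tokenizer_cls.from_pretrained, missing_id)
            print(f"{tokenizer_cls.__name__}: raised OSError = {outcome}")
```

According to the description above, both classes should report True with the fix applied and False before it.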

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

I confirm that this is not a pure code agent PR.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. (CLIPTokenizer uses 10**30 as model_max_length #45538)
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker @Cyrilvallez

ArthurZucker

Thanks.
Failing tests are related:

FAILED tests/models/auto/test_tokenization_auto.py::AutoTokenizerTest::test_custom_tokenizer_with_mismatched_tokenizer_class - OSError: Can't load tokenizer for 'hf-internal-testing/test_unregistered_dynamic'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'hf-internal-testing/test_unregistered_dynamic' is the correct path to a directory containing all relevant files for a NopTokenizer tokenizer.
FAILED tests/models/auto/test_tokenization_auto.py::AutoTokenizerTest::test_init_tokenizer_with_trust - OSError: Can't load tokenizer for 'hf-internal-testing/test_unregistered_dynamic'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'hf-internal-testing/test_unregistered_dynamic' is the correct path to a directory containing all relevant files for a NopTokenizer tokenizer.

and DIA ones, can you please fix them accordingly? 🤗

HuggingFaceDocBuilderDev · 2026-04-22T04:58:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Brianzhengca · 2026-04-22T08:16:38Z

@ArthurZucker, just fixed my code for the tests. Please take a look when you have time. Thank you!

Fix local tokenizer load

249d2ed

ArthurZucker approved these changes Apr 22, 2026


JarJuicy and others added 3 commits April 22, 2026 00:35

fix failing tests: allow fileless custom tokenizers

995d4bf

Merge branch 'main' into clip_tokenizer_max_model_length

b3ca380

fix failing tests: scope tokenizer guard

6637bac

Merge branch 'main' into clip_tokenizer_max_model_length

e902b1b

This was referenced Apr 28, 2026

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#41

Open

Cumulative feature and defect updates from recent Transformers PRs evalstate/transformers#42

Open

Brianzhengca wants to merge 5 commits into huggingface:main from Brianzhengca:clip_tokenizer_max_model_length


4 participants
