[fix] Always early return for non-Mistral models in _patch_mistral_regex #45444

Merged
tomaarsen merged 4 commits into huggingface:main from tomaarsen:fix/spurious_mistral_regex
Apr 16, 2026

Conversation

@tomaarsen
Member

What does this PR do?

Resolves huggingface/sentence-transformers#3724

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Details

The non-Mistral model_type early exit in _patch_mistral_regex was nested inside the `if transformers_version <= "4.57.2"` check. When that check failed, mistral_config_detected was set to True and the warning fired for any local large-vocab non-Mistral tokenizer (Qwen3, Gemma3, etc.). Reproducer:

import json, tempfile
from transformers import AutoTokenizer

with tempfile.TemporaryDirectory() as tmp_dir:
    AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B").save_pretrained(tmp_dir)
    with open(f"{tmp_dir}/config.json", "w") as f:
        json.dump({"model_type": "qwen3", "transformers_version": "4.57.2"}, f)
    tokenizer = AutoTokenizer.from_pretrained(tmp_dir)
    print(type(tokenizer))
[transformers] The tokenizer you are loading from 'C:\Users\tom\AppData\Local\Temp\tmp_orxp87p' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

This warning is obviously nonsensical: this isn't even a Mistral model. The fix is simple: separate the model_type check from the version check so that it runs regardless of transformers_version. With that change, the reproducer no longer emits the warning.

P.S. I know I could merge the two consecutive if statements, as they both just return tokenizer, but I don't love massive if-branches.

Who can review?

cc @vasqu @zucchini-nlp


@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu
Contributor

vasqu commented Apr 14, 2026

Have you rechecked on latest main?

I can't repro with

import json, tempfile
from transformers import AutoTokenizer

with tempfile.TemporaryDirectory() as tmp_dir:
    AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B").save_pretrained(tmp_dir)
    with open(f"{tmp_dir}/config.json", "w") as f:
        json.dump({"model_type": "qwen3", "transformers_version": "4.57.2"}, f)
    tokenizer = AutoTokenizer.from_pretrained(tmp_dir)
    print(type(tokenizer))

to emit a warning

@zucchini-nlp
Member

I guess this makes sense if transformers_version is None or absent in saved config files, but I also don't know when/why it can be None.

@tomaarsen
Member Author

tomaarsen commented Apr 14, 2026

My env:

  • transformers version: 5.6.0.dev0
  • Platform: Windows-10-10.0.26200-SP0
  • Python version: 3.11.13
  • Huggingface_hub version: 1.10.1
  • Safetensors version: 0.6.2
  • Accelerate version: 1.11.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.9.0+cu126 (CUDA)
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: No
  • GPU type: NVIDIA GeForce RTX 3090

My main locally was on 27fbb51. I can also reproduce it on the latest main a.k.a. 7028c30.
Using this script:

import json, tempfile
from transformers import AutoTokenizer

with tempfile.TemporaryDirectory() as tmp_dir:
    AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B").save_pretrained(tmp_dir)
    with open(f"{tmp_dir}/config.json", "w") as f:
        json.dump({"model_type": "qwen3", "transformers_version": "4.57.3"}, f)
    AutoTokenizer.from_pretrained(tmp_dir)
[transformers] The tokenizer you are loading from 'C:\Users\tom\AppData\Local\Temp\tmp7vgecfgt' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

I can also reproduce it on WSL2.

[transformers] The tokenizer you are loading from '/tmp/tmp2gt4rlwa' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

Comment on lines +1327 to +1335
if is_local and transformers_model_type not in [
    "mistral",
    "mistral3",
    "voxtral",
    "ministral",
    "pixtral",
]:
    return tokenizer
if transformers_version and version.parse(transformers_version) > version.parse("4.57.3"):
Contributor


Can repro now, but isn't this a case where the versioning boundaries were wrong? Version 4.57.3 was missing in between; the last comparison should be version.parse(transformers_version) > version.parse("4.57.2") instead.

The current version also assumes the model type exists, which can fail (it may be None). And honestly, I don't want to accept a case where the version is None; that should indicate that we should still check. Maybe add a warning in that case for more user information.
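The boundary gap can be checked directly with packaging's version parsing: 4.57.3 satisfies neither the old `<= 4.57.2` branch nor the `> 4.57.3` comparison shown in the diff, so it falls through both checks.

```python
# Demonstrates the boundary gap: 4.57.3 is caught by neither comparison.
from packaging import version

v = version.parse("4.57.3")
in_old_branch = v <= version.parse("4.57.2")        # False
skipped_by_new_check = v > version.parse("4.57.3")  # False
print(in_old_branch, skipped_by_new_check)          # neither check handles 4.57.3
```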

Member Author

@tomaarsen tomaarsen Apr 15, 2026


I'm not 100% sure whether it should be 4.57.3 or 4.57.2, but I can reproduce it while using main transformers myself, so I feel the fix is not just about moving the boundary one patch version.

Edit: I was hardcoding the config to 4.57.3, so it might be a boundary issue.

If you'd like, I can update the mistral, mistral3, etc. list to also include None; then if model_type is None, it will not early exit and we assume mistral_config_detected=True. In my opinion that's not necessarily great, though, as the "The tokenizer you are loading from '...' with an incorrect regex pattern ... You should set the fix_mistral_regex=True flag ..." warning would then always trigger when model_type is None.

So then we had 2 bugs:

1. 4.57.3 always warns
2. faulty mistral tokenizers saved with version 4.57.{3,4,5,6} will load incorrectly without a warning

@tomaarsen
Member Author

tomaarsen commented Apr 15, 2026

Okay, I'm changing this up after some more discussion: reverting to the old approach, but using fully new boundaries. I think the 4.57.2/3 boundaries were put in place because the expectation was that those would be the last patches. However, when more patches were released, the boundaries weren't updated, and we ended up with 2 bugs:

  1. 4.57.3 always warns because it was missed from the boundaries
  2. faulty mistral tokenizers saved with version 4.57.{3,4,5,6} will load incorrectly without a warning

For reference, I think #42389 is the actual fix, released in v5.0.0rc0, so everything before that should check for mistral-like tokenizers, and everything after should be skipped. I'm using v5.0.0 instead as we don't need to include release candidates.
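A minimal sketch of that final boundary logic (the function name is hypothetical; the v5.0.0 cutoff and the None-means-still-check behavior follow the discussion above):

```python
# Hypothetical helper illustrating the final boundary: tokenizers saved with
# transformers < 5.0.0 should still be checked for the faulty Mistral regex;
# 5.0.0 and later already contain the fix from #42389.
from packaging import version

def needs_mistral_regex_check(transformers_version):
    # A missing saved version is treated conservatively: still check.
    if not transformers_version:
        return True
    return version.parse(transformers_version) < version.parse("5.0.0")
```

Note that packaging orders pre-releases before the final release, so a config saved with 5.0.0rc0 would still be (harmlessly) checked under this sketch, consistent with not needing to special-case release candidates.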

cc @vasqu


Contributor

@vasqu vasqu left a comment


LGTM, cc @ArthurZucker for viz

Resolved comment thread on tests/models/auto/test_tokenization_auto.py (outdated)
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@tomaarsen tomaarsen added this pull request to the merge queue Apr 16, 2026
Merged via the queue into huggingface:main with commit 4d6e51b Apr 16, 2026
28 checks passed
@tomaarsen tomaarsen deleted the fix/spurious_mistral_regex branch April 16, 2026 12:19
@ArthurZucker
Collaborator

indeed ty!

