[fix] Always early return for non-Mistral models in _patch_mistral_regex #45444

Merged
tomaarsen merged 4 commits into huggingface:main from tomaarsen:fix/spurious_mistral_regex
Apr 16, 2026

Conversation

@tomaarsen
Member

What does this PR do?

Resolves huggingface/sentence-transformers#3724

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Details

The non-Mistral model_type early exit in _patch_mistral_regex was nested inside the `if transformers_version <= "4.57.2"` check. When that check failed, mistral_config_detected was set to True and the warning fired for any local large-vocab non-Mistral tokenizer (Qwen3, Gemma3, etc.). Reproducer:

import json, tempfile
from transformers import AutoTokenizer

with tempfile.TemporaryDirectory() as tmp_dir:
    AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B").save_pretrained(tmp_dir)
    with open(f"{tmp_dir}/config.json", "w") as f:
        json.dump({"model_type": "qwen3", "transformers_version": "4.57.2"}, f)
    tokenizer = AutoTokenizer.from_pretrained(tmp_dir)
    print(type(tokenizer))
[transformers] The tokenizer you are loading from 'C:\Users\tom\AppData\Local\Temp\tmp_orxp87p' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

This warning is obviously nonsensical: this isn't even a Mistral model. The fix is simple: separate the model_type check from the version check so that it runs regardless of transformers_version. With that change, the reproducer no longer emits the warning.

P.S. I know I could merge the two consecutive if statements, as they both just return tokenizer, but I don't love massive if-branches.

Who can review?

cc @vasqu @zucchini-nlp


@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu
Contributor

vasqu commented Apr 14, 2026

Have you rechecked on latest main?

I can't repro with

import json, tempfile
from transformers import AutoTokenizer

with tempfile.TemporaryDirectory() as tmp_dir:
    AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B").save_pretrained(tmp_dir)
    with open(f"{tmp_dir}/config.json", "w") as f:
        json.dump({"model_type": "qwen3", "transformers_version": "4.57.2"}, f)
    tokenizer = AutoTokenizer.from_pretrained(tmp_dir)
    print(type(tokenizer))

to emit a warning

@zucchini-nlp
Member

I guess this makes sense if transformers_version is None or absent in saved config files, but I also don't know when/why it can be None.

@tomaarsen
Member Author

tomaarsen commented Apr 14, 2026

My env:

  • transformers version: 5.6.0.dev0
  • Platform: Windows-10-10.0.26200-SP0
  • Python version: 3.11.13
  • Huggingface_hub version: 1.10.1
  • Safetensors version: 0.6.2
  • Accelerate version: 1.11.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.9.0+cu126 (CUDA)
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: No
  • GPU type: NVIDIA GeForce RTX 3090

My main locally was on 27fbb51. I can also reproduce it on the latest main a.k.a. 7028c30.
Using this script:

import json, tempfile
from transformers import AutoTokenizer

with tempfile.TemporaryDirectory() as tmp_dir:
    AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B").save_pretrained(tmp_dir)
    with open(f"{tmp_dir}/config.json", "w") as f:
        json.dump({"model_type": "qwen3", "transformers_version": "4.57.3"}, f)
    AutoTokenizer.from_pretrained(tmp_dir)
[transformers] The tokenizer you are loading from 'C:\Users\tom\AppData\Local\Temp\tmp7vgecfgt' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

I can also reproduce it on WSL2.

[transformers] The tokenizer you are loading from '/tmp/tmp2gt4rlwa' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

Comment on lines +1327 to +1335
if is_local and transformers_model_type not in [
    "mistral",
    "mistral3",
    "voxtral",
    "ministral",
    "pixtral",
]:
    return tokenizer
if transformers_version and version.parse(transformers_version) > version.parse("4.57.3"):
Contributor


Can repro now, but isn't this a case where the versioning boundaries were wrong? Version 4.57.3 was missing in between; the last comparison should be version.parse(transformers_version) > version.parse("4.57.2") instead.

The current version also assumes the model type exists, which can fail (it may be None). And honestly, I don't want to accept a case where the version is None; that should indicate that we should still check. Maybe add a warning in that case for more user information.
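The boundary gap can be checked directly with packaging's version parsing: 4.57.3 satisfies neither the old `<= 4.57.2` branch nor the `> 4.57.3` comparison shown in the diff, so it falls through both checks.

```python
# Demonstrates the boundary gap: 4.57.3 is caught by neither comparison.
from packaging import version

v = version.parse("4.57.3")
in_old_branch = v <= version.parse("4.57.2")        # False
skipped_by_new_check = v > version.parse("4.57.3")  # False
print(in_old_branch, skipped_by_new_check)          # neither check handles 4.57.3
```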

Member Author

@tomaarsen tomaarsen Apr 15, 2026


I'm not 100% sure whether it should be 4.57.3 or 4.57.2, but I can reproduce it while using main transformers myself, so I feel the fix is not just about moving the boundary one patch version.

Edit: I was hardcoding the config to 4.57.3, so it might be a boundary issue.

If you'd like, I can update the mistral, mistral3, etc. list to also include None; then if model_type is None, it will not early exit and we assume mistral_config_detected=True. In my opinion that's not necessarily great, though, as the "The tokenizer you are loading from '...' with an incorrect regex pattern ... You should set the fix_mistral_regex=True flag ..." warning would then always trigger when model_type is None.

So then we had 2 bugs:

1. 4.57.3 always warns
2. faulty mistral tokenizers saved with version 4.57.{3,4,5,6} will load incorrectly without a warning

@tomaarsen
Member Author

tomaarsen commented Apr 15, 2026

Okay, I'm changing this up after some more discussion: reverting to the old approach, but using fully new boundaries. I think the 4.57.2/3 boundaries were put in place because the expectation was that those would be the last patches. However, when more patches were released, the boundaries weren't updated, and we ended up with 2 bugs:

  1. 4.57.3 always warns because it was missed from the boundaries
  2. faulty mistral tokenizers saved with version 4.57.{3,4,5,6} will load incorrectly without a warning

For reference, I think #42389 is the actual fix, released in v5.0.0rc0, so everything before that should check for mistral-like tokenizers, and everything after should be skipped. I'm using v5.0.0 instead as we don't need to include release candidates.
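A minimal sketch of that final boundary logic (the function name is hypothetical; the v5.0.0 cutoff and the None-means-still-check behavior follow the discussion above):

```python
# Hypothetical helper illustrating the final boundary: tokenizers saved with
# transformers < 5.0.0 should still be checked for the faulty Mistral regex;
# 5.0.0 and later already contain the fix from #42389.
from packaging import version

def needs_mistral_regex_check(transformers_version):
    # A missing saved version is treated conservatively: still check.
    if not transformers_version:
        return True
    return version.parse(transformers_version) < version.parse("5.0.0")
```

Note that packaging orders pre-releases before the final release, so a config saved with 5.0.0rc0 would still be (harmlessly) checked under this sketch, consistent with not needing to special-case release candidates.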

cc @vasqu


Contributor

@vasqu vasqu left a comment


LGTM, cc @ArthurZucker for viz

Resolved comment thread on tests/models/auto/test_tokenization_auto.py (outdated)
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@tomaarsen tomaarsen added this pull request to the merge queue Apr 16, 2026
Merged via the queue into huggingface:main with commit 4d6e51b Apr 16, 2026
28 checks passed
@tomaarsen tomaarsen deleted the fix/spurious_mistral_regex branch April 16, 2026 12:19
@ArthurZucker
Collaborator

indeed ty!

