[fix] Always early return for non-Mistral models in `_patch_mistral_regex` #45444
Conversation
Regardless of transformers version
Have you rechecked on latest main? I can't repro with the following snippet, which I'd expect to emit a warning:

```python
import json, tempfile
from transformers import AutoTokenizer

with tempfile.TemporaryDirectory() as tmp_dir:
    AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B").save_pretrained(tmp_dir)
    with open(f"{tmp_dir}/config.json", "w") as f:
        json.dump({"model_type": "qwen3", "transformers_version": "4.57.2"}, f)
    tokenizer = AutoTokenizer.from_pretrained(tmp_dir)
    print(type(tokenizer))
```
I guess this makes sense if …
My env: …

```python
import json, tempfile
from transformers import AutoTokenizer

with tempfile.TemporaryDirectory() as tmp_dir:
    AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B").save_pretrained(tmp_dir)
    with open(f"{tmp_dir}/config.json", "w") as f:
        json.dump({"model_type": "qwen3", "transformers_version": "4.57.3"}, f)
    AutoTokenizer.from_pretrained(tmp_dir)
```

I can also reproduce it on WSL2.
```python
if is_local and transformers_model_type not in [
    "mistral",
    "mistral3",
    "voxtral",
    "ministral",
    "pixtral",
]:
    return tokenizer
if transformers_version and version.parse(transformers_version) > version.parse("4.57.3"):
```
Can repro now, but isn't this a case where the versioning boundaries were wrong? Version 4.57.3 was missing in between; the last comparison should be `version.parse(transformers_version) > version.parse("4.57.2")` instead.

The current version also assumes `model_type` exists, which can fail (`None`). And honestly, I don't want to accept the case where the version is `None`; that should mean we still run the check. Maybe add a warning in that case to give the user more information.
I'm not 100% sure whether it should be 4.57.3 or 4.57.2, but I can reproduce it on:

- https://huggingface.co/Qwen/Qwen3-0.6B/blob/main/config.json#L26 (v4.51.0)
- https://huggingface.co/microsoft/harrier-oss-v1-270m/blob/main/config.json#L51 (v4.57.6)

while using main transformers myself. So I feel like the fix is not just about moving the boundary one patch version.

Edit: I was hardcoding the config to 4.57.3, so it might be a boundary issue after all.

If you'd like, I can update the `mistral`, `mistral3`, etc. list to also include `None`; then if `model_type` is `None`, we don't early exit and assume `mistral_config_detected=True`. (In my opinion that's not necessarily great, as the "The tokenizer you are loading from '...' with an incorrect regex pattern: This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue." warning would then always trigger when `model_type` is `None`.)
So then we had 2 bugs:
1. 4.57.3 always warns
2. faulty mistral tokenizers saved with version 4.57.{3,4,5,6} will load incorrectly without a warning
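To make the two bugs concrete, here is how `packaging.version` comparisons behave around the old boundary (a small illustration, not code from the PR; the `4.57.3` guard is taken from the snippet quoted above):

```python
from packaging import version

OLD_BOUNDARY = version.parse("4.57.3")

# Bug 1: 4.57.3 itself is not strictly greater than the boundary,
# so the check (and the spurious warning path) still ran for it.
assert not version.parse("4.57.3") > OLD_BOUNDARY

# Bug 2: 4.57.4 through 4.57.6 clear the boundary, so faulty mistral
# tokenizers saved with those versions skipped the check silently.
for v in ("4.57.4", "4.57.5", "4.57.6"):
    assert version.parse(v) > OLD_BOUNDARY
```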
Okay, I'm changing this up after some more discussion. Reverting to the old approach, but with entirely new boundaries. I think the 4.57.2/3 boundaries were put in place because the expectation was that those would be the last patches. However, when more patch releases came out, the boundaries weren't updated, and we ended up with the 2 bugs above.
For reference, I think #42389 is the actual fix, released in v5.0.0rc0, so everything before that should check for mistral-like tokenizers, and everything after should be skipped. I'm using v5.0.0 instead as we don't need to include release candidates. cc @vasqu
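On the release-candidate point: `packaging.version` orders pre-releases before the final release, so a plain `5.0.0` boundary already behaves sensibly (a quick check of `packaging` semantics, not code from the PR):

```python
from packaging import version

# Pre-releases sort before the final release, so a "< 5.0.0" boundary
# also keeps checking tokenizers saved with v5.0.0rc0. Using plain
# "5.0.0" is the conservative choice: at worst, rc0 tokenizers
# (which already contain the fix) get re-checked.
assert version.parse("5.0.0rc0") < version.parse("5.0.0")
assert version.parse("4.57.6") < version.parse("5.0.0")
assert not version.parse("5.0.0") < version.parse("5.0.0")
```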
vasqu left a comment:
LGTM, cc @ArthurZucker for viz
indeed ty!
What does this PR do?
Resolves huggingface/sentence-transformers#3724
Details
The non-mistral `model_type` early exit in `_patch_mistral_regex` was nested inside `if transformers_version <= "4.57.2"`. When `transformers_version` was above that bound, the early exit never ran, `mistral_config_detected` was set to `True`, and the warning fired for any local large-vocab non-mistral tokenizer (Qwen3, Gemma3, etc.). Reproducer: the snippet from the conversation above.

This warning is obviously nonsensical: this isn't even a mistral model. It's fixed very simply by separating the `model_type` check from the version check, so it runs regardless of `transformers_version`. The reproducer no longer emits a warning.

P.s. I know I could merge the two consecutive `if ...` blocks, as they both just `return tokenizer`, but I don't love massive if-branches.

Who can review?
cc @vasqu @zucchini-nlp
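To summarize the control flow this PR aims for, here is a simplified sketch (illustrative only — the function name, signature, and boolean return are mine, not the actual transformers code; the model-type list and the v5.0.0 boundary come from the discussion above):

```python
from packaging import version

MISTRAL_MODEL_TYPES = {"mistral", "mistral3", "voxtral", "ministral", "pixtral"}
FIX_RELEASED_IN = version.parse("5.0.0")  # upstream fix landed in v5.0.0rc0

def should_check_mistral_regex(model_type, transformers_version):
    """Return True if the tokenizer should be checked for the faulty regex."""
    # Early return for non-Mistral models, regardless of version.
    if model_type is not None and model_type not in MISTRAL_MODEL_TYPES:
        return False
    # Tokenizers saved at or after the fixed release need no patching.
    if transformers_version and version.parse(transformers_version) >= FIX_RELEASED_IN:
        return False
    # Unknown model_type or version: keep checking (and possibly warn).
    return True
```

With this shape, a `None` model type or version still falls through to the check, matching the reviewer's preference above.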