Fix unintended Hub metadata calls from _patch_mistral_regex#43603
Conversation
| if is_offline_mode(): | ||
| is_local = True |
There was a problem hiding this comment.
Wouldn't it make more sense to just adjust these lines here to something along local_files_only or is_offline_mode() - to me it looks like the core issue is that is_local can be false even if we have local_files_only=True
Also let's add a small test
There was a problem hiding this comment.
@vasqu Thanks, agreed on the direction.
I think we’re largely aligned. my understanding from reproducing this is that the core issue isn’t just how is_local is set, but that we can hit a Hub metadata call before the offline / local-only intent is fully respected.
In particular, this helper:
def is_base_mistral(model_id: str) -> bool:
model = model_info(model_id)
if model.tags is not None:
if re.search("base_model:.*mistralai", "".join(model.tags)):
return True
return False
unconditionally calls model_info(model_id), which triggers a /api/models/<repo> request. In my repro, that happens even when local_files_only=True, because the call occurs before we can short-circuit based on offline intent.
I initially tried forcing is_local = True when local_files_only or is_offline_mode(), but since the model_info() call is reached regardless, it didn’t fully prevent the network access in practice. That’s why I opted for an early return before we ever reach is_base_mistral() for non-Mistral / offline cases.
also, will add a test for this.
There was a problem hiding this comment.
Sorry, I pushed to your branch directly 990bc92 - I think this is easier that way than to collect the lines on git 😅
The problem was that local_files_only was never passed making it always default to false. The issue is that the API call only happens when is_local=False and hence local_files_only had no effect, it made the call regardless
There was a problem hiding this comment.
Makes sense, its way more cleaner approach. thanks a lot for pushing that change 😃 @vasqu
One thing, even after this, in online mode local_files_only=False we still call model_info() for any model that hits the patch path, e.g. Qwen2Tokenizer -> _patch_mistral_regex -> is_base_mistral() -> model_info().
My repro PHASE 2 blocks /api/models/<repo> for non-Mistral and it still triggers for Qwen, because is_base_mistral currently always calls model_info().
If we want to avoid unnecessary Hub metadata calls for non-Mistral models, we likely need a cheap guard (e.g. only run is_base_mistral if repo_id looks Mistral-ish: mistralai/* or contains “mistral”), either as an early return or inside is_base_mistral().
I am going to paste the test I am running to prove my point. Thanks again for your time.
There was a problem hiding this comment.
I get your point but the problem is with custom repos and custom models, we do not have much freedom here and have to check just in case. We cannot assume that only mistral ai will use mistral tokenizers.
And in this case, it's fairly inexpensive call for making sure we catch as many edge cases as possible
There was a problem hiding this comment.
makes sense, I will leave it as is then and work on adding a test for this.
| init_kwargs=self.init_kwargs, | ||
| fix_mistral_regex=kwargs.get("fix_mistral_regex"), | ||
| **kwargs, |
There was a problem hiding this comment.
That didn't really make sense, we pass kwargs either way 👀
|
the test I am running is following Step 1: preparing cache Step2: run online and offline test using following script test results: @vasqu I pulled the changes you pushed and re ran the test that I ran when the issue was reported in #43502. please let me know if my test is good or if I am missing anything. this is only happening with |
|
Re tests, I think it makes sense to extend (the following and other similar tests for other tokenizer types) transformers/tests/test_tokenization_common.py Line 2704 in 071e178 |
sure, will extend this. |
ArthurZucker
left a comment
There was a problem hiding this comment.
Nice, small patch — thanks! A few follow-ups worth tacking on while we're here:
- Cache
is_base_mistralwith@lru_cacheso repeated loads of the same Hub id (notebooks, rollout loops, DDP workers) don't each hit/api/models/.... - Wrap
model_info()intry/exceptand returnFalseon any error — a Hub hiccup / 5xx / ratelimit shouldn't break tokenizer init for non-Mistral models. - Worth pairing this with #43212 (offline-load regression test) or adding a minimal test here that monkeypatches
huggingface_hub.model_infoto assert it isn't called for non-Mistral local paths.
Inline suggestions below.
- Wrap is_base_mistral with lru_cache so repeated loads of the same repo id (notebooks, rollout loops, DDP workers) don't each hit the Hub. - Swallow any Hub error in model_info — a 5xx/ratelimit/network hiccup must not block tokenizer init for non-Mistral models. - Add regression tests: (a) local_files_only=True never calls model_info, (b) a Hub failure does not break _patch_mistral_regex.
8d2aa0d to
558b20c
Compare
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
…ace#43603) * Fix unintended Hub metadata calls from _patch_mistral_regex * ruff fixes * pass local files only * Cache and fail-closed model_info call, add regression tests - Wrap is_base_mistral with lru_cache so repeated loads of the same repo id (notebooks, rollout loops, DDP workers) don't each hit the Hub. - Swallow any Hub error in model_info — a 5xx/ratelimit/network hiccup must not block tokenizer init for non-Mistral models. - Add regression tests: (a) local_files_only=True never calls model_info, (b) a Hub failure does not break _patch_mistral_regex. --------- Co-authored-by: vasqu <antonprogamer@gmail.com> Co-authored-by: Arthur <arthur.zucker@gmail.com>
What does this PR do?
TokenizersBackend._patch_mistral_regex()is a Mistral-specific tokenizer patch, but the current implementation may callhuggingface_hub.model_info()during detection. That triggers an HTTP request to/api/models/<repo_id>and can occur even for non-Mistral repos in environments where outbound network calls are blocked.This PR adds minimal guardrails:
local_files_only=Trueor in offline mode.model_info().This keeps the Mistral behavior unchanged while preventing unnecessary metadata network requests for non-Mistral models.
Fixes #43502
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@zucchini-nlp @ArthurZucker
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.