Fix unintended Hub metadata calls from _patch_mistral_regex #43603

Merged

ArthurZucker merged 4 commits into huggingface:main from vaibhav-research:fix/mistral-regex-no-model-info on Apr 13, 2026

Conversation

@vaibhav-research
Contributor

What does this PR do?

TokenizersBackend._patch_mistral_regex() is a Mistral-specific tokenizer patch, but the current implementation may call huggingface_hub.model_info() during detection. That triggers an HTTP request to /api/models/<repo_id>, which can fire even for non-Mistral repos and breaks loading in environments where outbound network calls are blocked.

This PR adds minimal guardrails:

  • Return early when local_files_only=True or in offline mode.
  • Return early for non-Mistral repo ids before calling model_info().

This keeps the Mistral behavior unchanged while preventing unnecessary metadata network requests for non-Mistral models.
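As a rough illustration, the guard shape described above as a standalone sketch (the helper name, signature, and the name-based check are hypothetical, not the actual patch):

from transformers.utils import is_offline_mode

def should_patch_mistral_regex(repo_id: str, local_files_only: bool) -> bool:
    # Early return 1: respect offline / local-only intent before any Hub call.
    if local_files_only or is_offline_mode():
        return False
    # Early return 2: skip repo ids that cannot be Mistral derivatives, so
    # model_info() is never reached for them (placeholder heuristic).
    if "mistral" not in repo_id.lower():
        return False
    return True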

Fixes #43502

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp @ArthurZucker

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Comment on lines -1206 to -1207
if is_offline_mode():
is_local = True
Contributor

@vasqu Jan 29, 2026


Wouldn't it make more sense to just adjust these lines to something along the lines of local_files_only or is_offline_mode()? To me it looks like the core issue is that is_local can be False even when we have local_files_only=True.

Also, let's add a small test.

Contributor Author

@vaibhav-research Jan 29, 2026


@vasqu Thanks, agreed on the direction.

I think we're largely aligned. My understanding from reproducing this is that the core issue isn't just how is_local is set, but that we can hit a Hub metadata call before the offline / local-only intent is respected.

In particular, this helper:

def is_base_mistral(model_id: str) -> bool:
    model = model_info(model_id)
    if model.tags is not None:
        if re.search("base_model:.*mistralai", "".join(model.tags)):
            return True
    return False

unconditionally calls model_info(model_id), which triggers a /api/models/<repo> request. In my repro, that happens even when local_files_only=True, because the call occurs before we can short-circuit based on offline intent.

I initially tried forcing is_local = True when local_files_only or is_offline_mode(), but since the model_info() call is reached regardless, it didn’t fully prevent the network access in practice. That’s why I opted for an early return before we ever reach is_base_mistral() for non-Mistral / offline cases.

Also, I will add a test for this.

Contributor


Sorry, I pushed to your branch directly 990bc92 - I think that's easier than collecting the lines on git 😅

The problem was that local_files_only was never passed through, so it always defaulted to False. The API call only happens when is_local=False, and since local_files_only had no effect on that check, the call was made regardless.
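For illustration, the corrected gate as a standalone sketch (hypothetical helper; the actual fix in 990bc92 threads local_files_only into the existing method):

from transformers.utils import is_offline_mode

def should_query_hub(is_local: bool, local_files_only: bool) -> bool:
    # The metadata call is only safe when the checkpoint resolves as remote
    # AND the caller did not ask for local/offline behavior.
    return not (is_local or local_files_only or is_offline_mode())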

Contributor Author

@vaibhav-research Jan 29, 2026


Makes sense, it's a much cleaner approach. Thanks a lot for pushing that change 😃 @vasqu

One thing: even after this, in online mode (local_files_only=False) we still call model_info() for any model that hits the patch path, e.g. Qwen2Tokenizer -> _patch_mistral_regex -> is_base_mistral() -> model_info().

Phase 2 of my repro blocks /api/models/<repo> for non-Mistral repos, and it still triggers for Qwen, because is_base_mistral() currently always calls model_info().

If we want to avoid unnecessary Hub metadata calls for non-Mistral models, we likely need a cheap guard (e.g. only run is_base_mistral if repo_id looks Mistral-ish: mistralai/* or contains “mistral”), either as an early return or inside is_base_mistral().
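For illustration, that guard placed inside is_base_mistral() (a hypothetical sketch; the maintainers argue against this heuristic in the next reply):

import re

from huggingface_hub import model_info

def is_base_mistral(model_id: str) -> bool:
    # Hypothetical cheap guard: skip the Hub metadata call outright when the
    # repo id cannot plausibly be a Mistral derivative.
    if "mistral" not in model_id.lower():
        return False
    model = model_info(model_id)
    if model.tags is not None:
        return re.search("base_model:.*mistralai", "".join(model.tags)) is not None
    return False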

I am going to paste the test I am running to prove my point. Thanks again for your time.

Contributor


I get your point, but the problem is custom repos and custom models: we don't have much freedom here and have to check just in case. We cannot assume that only Mistral AI will use Mistral tokenizers.

And in this case, it's a fairly inexpensive call for making sure we catch as many edge cases as possible.

Contributor Author


Makes sense, I will leave it as is then and work on adding a test for this.

Comment on lines 379 to 380
init_kwargs=self.init_kwargs,
fix_mistral_regex=kwargs.get("fix_mistral_regex"),
**kwargs,
Contributor


That didn't really make sense; we pass kwargs either way 👀

@vaibhav-research
Contributor Author

vaibhav-research commented Jan 29, 2026

The test I am running is the following.

Step 1: prepare the cache (prepare_cache.py)

from huggingface_hub import snapshot_download

MODELS = [
    "Qwen/Qwen3-30B-A3B",
    "Qwen/Qwen2.5-7B-Instruct",
    "Qwen/Qwen2-7B-Instruct",
    "mistralai/Mistral-7B-v0.1",
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "microsoft/phi-3-mini-4k-instruct",
    "gpt2",
]

ALLOW = [
    "config.json",
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "added_tokens.json",
    "merges.txt",
    "vocab.json",
    "*.tiktoken",
]

for model_id in MODELS:
    snapshot_download(repo_id=model_id, allow_patterns=ALLOW)
    print(f"cached minimal: {model_id}")
python prepare_cache.py
cached minimal: Qwen/Qwen3-30B-A3B
cached minimal: Qwen/Qwen2.5-7B-Instruct
cached minimal: Qwen/Qwen2-7B-Instruct
cached minimal: mistralai/Mistral-7B-v0.1
cached minimal: TinyLlama/TinyLlama-1.1B-Chat-v1.0
cached minimal: microsoft/phi-3-mini-4k-instruct
cached minimal: gpt2

Step 2: run the online and offline tests with the following script (repro.py)

import httpx
import traceback
from transformers import AutoTokenizer

MODELS = [
    "Qwen/Qwen3-30B-A3B",
    "Qwen/Qwen2.5-7B-Instruct",
    "Qwen/Qwen2-7B-Instruct",
    "mistralai/Mistral-7B-v0.1",
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "microsoft/phi-3-mini-4k-instruct",
    "gpt2",
]

_real_send = httpx.Client.send


def _is_hf_model_info_root(req: httpx.Request) -> bool:
    # Match only GET https://huggingface.co/api/models/<repo_id> metadata
    # requests, not /tree/ file listings.
    try:
        host = (req.url.host or "").lower()
        path = req.url.path or ""
    except Exception:
        return False

    if "huggingface.co" not in host:
        return False
    if not path.startswith("/api/models/"):
        return False
    return "/tree/" not in path


def _blocked_send_all(self, request, *args, **kwargs):
    raise RuntimeError(f"blocked HTTP call: {request.method} {request.url}")


def _blocked_send_model_info_for_non_mistral(self, request, *args, **kwargs):
    if _is_hf_model_info_root(request):
        repo_id = request.url.path[len("/api/models/") :].split("?", 1)[0]
        repo_id_l = repo_id.lower()
        mistralish = repo_id_l.startswith("mistralai/") or ("mistral" in repo_id_l)
        if not mistralish:
            raise RuntimeError(f"blocked model_info for non-mistral: {request.method} {request.url}")
    return _real_send(self, request, *args, **kwargs)


def run_phase(name: str, *, local_files_only: bool, block_mode: str, show_stack: bool = False):
    if block_mode == "all":
        httpx.Client.send = _blocked_send_all
    elif block_mode == "model_info_non_mistral":
        httpx.Client.send = _blocked_send_model_info_for_non_mistral
    elif block_mode == "none":
        httpx.Client.send = _real_send
    else:
        raise ValueError(f"unknown block_mode={block_mode}")

    failures = []

    for model_id in MODELS:
        try:
            AutoTokenizer.from_pretrained(
                model_id,
                local_files_only=local_files_only,
                trust_remote_code=False,
            )
        except Exception as e:
            failures.append((model_id, e))
            if show_stack:
                print("\n--- stack (trimmed) ---")
                traceback.print_exc(limit=12)

    print("\n" + "=" * 88)
    print(name)
    print("=" * 88)

    if not failures:
        print("result: OK (no failures)")
        return True

    print(f"result: FAIL ({len(failures)} failures)")
    for model_id, e in failures:
        print(f"- {model_id}: {repr(e)}")

    return False


def main():
    ok1 = run_phase(
        "phase 1: local_files_only=True, block all HTTP (should be fully offline)",
        local_files_only=True,
        block_mode="all",
        show_stack=False,
    )

    ok2 = run_phase(
        "phase 2: local_files_only=False, block model_info(/api/models/<repo>) for non-mistral",
        local_files_only=False,
        block_mode="model_info_non_mistral",
        show_stack=False,
    )

    httpx.Client.send = _real_send

    print("\nsummary:")
    print(f"- phase 1: {'ok' if ok1 else 'fail'}")
    print(f"- phase 2: {'ok' if ok2 else 'fail'}")


if __name__ == "__main__":
    main()
    

Test results:

python repro.py

========================================================================================
phase 1: local_files_only=True, block all HTTP (should be fully offline)
========================================================================================
result: OK (no failures)

========================================================================================
phase 2: local_files_only=False, block model_info(/api/models/<repo>) for non-mistral
========================================================================================
result: FAIL (3 failures)
- Qwen/Qwen3-30B-A3B: RuntimeError('blocked model_info for non-mistral: GET https://huggingface.co/api/models/Qwen/Qwen3-30B-A3B')
- Qwen/Qwen2.5-7B-Instruct: RuntimeError('blocked model_info for non-mistral: GET https://huggingface.co/api/models/Qwen/Qwen2.5-7B-Instruct')
- Qwen/Qwen2-7B-Instruct: RuntimeError('blocked model_info for non-mistral: GET https://huggingface.co/api/models/Qwen/Qwen2-7B-Instruct')

summary:
- phase 1: ok
- phase 2: fail

@vasqu I pulled the changes you pushed and re-ran the test from when the issue was reported in #43502. Please let me know if my test is sound or if I am missing anything. To be specific, this only happens with Qwen.

@vasqu
Contributor

vasqu commented Jan 29, 2026

Re tests, I think it makes sense to extend the following (and similar tests for other tokenizer types):

def test_local_files_only(self):
👀

@vaibhav-research
Contributor Author

Re tests, I think it makes sense to extend the following (and similar tests for other tokenizer types):

def test_local_files_only(self):

👀

Sure, will extend this.
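For illustration, a minimal regression test along these lines (the test name and patch target are hypothetical; the target must match where model_info is imported, and the repo is assumed to be in the local cache already):

from unittest.mock import patch

from transformers import AutoTokenizer

def test_local_files_only_skips_model_info():
    # With local_files_only=True, loading a cached non-Mistral tokenizer
    # must never reach the Hub metadata endpoint.
    with patch("transformers.tokenization_utils_tokenizers.model_info") as mocked:
        AutoTokenizer.from_pretrained("gpt2", local_files_only=True)
        mocked.assert_not_called()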

Collaborator

@ArthurZucker left a comment


Nice, small patch — thanks! A few follow-ups worth tacking on while we're here:

  1. Cache is_base_mistral with @lru_cache so repeated loads of the same Hub id (notebooks, rollout loops, DDP workers) don't each hit /api/models/....
  2. Wrap model_info() in try/except and return False on any error — a Hub hiccup / 5xx / ratelimit shouldn't break tokenizer init for non-Mistral models.
  3. Worth pairing this with #43212 (offline-load regression test) or adding a minimal test here that monkeypatches huggingface_hub.model_info to assert it isn't called for non-Mistral local paths.

Inline suggestions below.
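For illustration, points 1 and 2 combined into a sketch of the helper quoted earlier in the thread (shape assumed from that quote, not the merged code):

import re
from functools import lru_cache

from huggingface_hub import model_info

@lru_cache(maxsize=None)
def is_base_mistral(model_id: str) -> bool:
    # Cached per repo id so repeated loads don't each hit /api/models/...,
    # and failing closed so a Hub hiccup never breaks tokenizer init.
    try:
        model = model_info(model_id)
    except Exception:
        return False
    if model.tags is not None:
        return re.search("base_model:.*mistralai", "".join(model.tags)) is not None
    return False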

Comment thread: src/transformers/tokenization_utils_tokenizers.py
vaibhav-research and others added 4 commits April 13, 2026 11:06
- Wrap is_base_mistral with lru_cache so repeated loads of the same repo
  id (notebooks, rollout loops, DDP workers) don't each hit the Hub.
- Swallow any Hub error in model_info — a 5xx/ratelimit/network hiccup
  must not block tokenizer init for non-Mistral models.
- Add regression tests: (a) local_files_only=True never calls
  model_info, (b) a Hub failure does not break _patch_mistral_regex.
@ArthurZucker force-pushed the fix/mistral-regex-no-model-info branch from 8d2aa0d to 558b20c on April 13, 2026 09:09
@ArthurZucker added the for patch label on Apr 13, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker merged commit def8e6a into huggingface:main on Apr 13, 2026
29 checks passed
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026
…ace#43603)

* Fix unintended Hub metadata calls from _patch_mistral_regex

* ruff fixes

* pass local files only

* Cache and fail-closed model_info call, add regression tests

- Wrap is_base_mistral with lru_cache so repeated loads of the same repo
  id (notebooks, rollout loops, DDP workers) don't each hit the Hub.
- Swallow any Hub error in model_info — a 5xx/ratelimit/network hiccup
  must not block tokenizer init for non-Mistral models.
- Add regression tests: (a) local_files_only=True never calls
  model_info, (b) a Hub failure does not break _patch_mistral_regex.

---------

Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Arthur <arthur.zucker@gmail.com>

Labels

for patch: Tag issues / labels that should be included in the next patch

Development

Successfully merging this pull request may close these issues.

API requests are made despite setting local_files_only=True.

4 participants