Auto convert tekken.json #42299
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Cyrilvallez left a comment
Nice! Just left a few comments!
```python
else ("LlamaTokenizer" if is_sentencepiece_available() else None),
"LlamaTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None,
else ("PreTrainedTokenizerFast" if is_tokenizers_available() else None),
"PreTrainedTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None,
```
Usually the pattern is (slow, fast); here it's (fast, fast), not sure if intended.
Maybe it should instead be:
`(None, "MistralCommonTokenizer" if is_mistral_common_available() else ("PreTrainedTokenizerFast" if is_tokenizers_available() else None))` so that we never have a slow one anyway?
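The suggestion above can be sketched as a small helper. This is a minimal, hypothetical rendering of the proposed (slow, fast) mapping entry, with plain booleans standing in for the real `is_mistral_common_available()` / `is_tokenizers_available()` checks:

```python
def pick_tokenizers(mistral_common_available: bool, tokenizers_available: bool):
    """Return a (slow, fast) pair: slow is always None, fast is resolved by availability."""
    slow = None  # per the suggestion, never register a slow tokenizer here
    fast = (
        "MistralCommonTokenizer"
        if mistral_common_available
        else ("PreTrainedTokenizerFast" if tokenizers_available else None)
    )
    return (slow, fast)
```

With both backends available this yields `(None, "MistralCommonTokenizer")`; with only `tokenizers` it falls back to `(None, "PreTrainedTokenizerFast")`.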
```python
def is_base_mistral(model_id: str) -> bool:
    model = model_info(model_id)
    if model.tags is not None:
        if re.search("base_model:.*mistralai", "".join(model.tags)):
```
Hmm, so that's only for the mistral org, no? Should we directly check `model_type in ["mistral", ...]` so that it also works for other orgs?
We can't do that until we've downloaded the config / the config is there.
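Once the config has been downloaded, the `model_type`-based check suggested above could look like this. A minimal sketch, assuming `config` is the plain dict loaded from `config.json`; the set of model types is taken from the version-guard snippet elsewhere in this PR and may not be exhaustive:

```python
# Model types covered by the guard in this PR (assumption: this list is complete).
MISTRAL_MODEL_TYPES = {"mistral", "mistral3", "voxstral", "ministral", "pixtral"}

def is_mistral_family(config: dict) -> bool:
    """Config-based check that also covers fine-tunes outside the mistralai org."""
    return config.get("model_type") in MISTRAL_MODEL_TYPES
```

Unlike the tag-based `is_base_mistral`, this needs no Hub call, but it only works after the config is available locally.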
[For maintainers] Suggested jobs to run (before merge): run-slow: auto
* auto convert tekken.json
* fix conversion
* simplify
* nit
* model info based on the fly fix
* up
* last nit
* fixup
* call it fix mistral regex
* fix behaviour for local or only tok is saved
* style
* rm comment at wrong palce
* fix escaping
* style
* fix backend tokenizer attr to _tokenizer
* update
* up
* update
* fix the last red tests
```python
if transformers_version and version.parse(transformers_version) <= version.parse("4.57.2"):
    if _is_local and _config.model_type not in [
        "mistral",
        "mistral3",
        "voxstral",
        "ministral",
        "pixtral",
    ]:
        return tokenizer
```
The use of the non-existent attribute `_config.model_type` is causing massive loading failures everywhere (including CI); please consider making a hotfix ASAP. :)
Change it to `_config.get("model_type")`
I have no idea why the CI was full green
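The hotfix amounts to swapping attribute access for a dict-style lookup. A minimal sketch of the difference, assuming the failure mode is that `_config` is still a plain dict loaded from `config.json` (not yet a `PretrainedConfig` object) and the key may be absent:

```python
# A plain dict standing in for a loaded config.json without a "model_type" key.
_config = {"vocab_size": 32000}

# Attribute access on a dict raises AttributeError, which would abort loading.
try:
    model_type = _config.model_type
except AttributeError:
    model_type = None

# dict.get degrades gracefully: returns None when the key is missing.
model_type = _config.get("model_type")
```

This is why CI only catches the bug on configs where the code path is actually exercised: the guard returns early for most model types before the attribute is ever touched.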
What does this PR do?
When `mistral-common` is not installed, we can always convert to `tokenizer.json`. Supersedes #41592 and #41718 for now; fixes #41553, fixes #42283.