Fix gemma gguf tokenizer #41609
Open
CodersAcademy006 wants to merge 4 commits into huggingface:main from
Conversation
Isotr0py reviewed on Oct 15, 2025
Comment on lines +1075 to +1095
```python
# --- START OF PROPOSED FIX ---
if gguf_file:
    # If a GGUF file is provided, inspect its metadata to ensure the correct
    # tokenizer is used, especially for cases like Gemma where the model type
    # alone is not enough.
    try:
        from ...utils.gguf import GGUFReader

        # We need to resolve the path to the GGUF file
        gguf_path = cached_file(pretrained_model_name_or_path, gguf_file, **kwargs)
        if gguf_path is not None:
            reader = GGUFReader(gguf_path)
            architecture = reader.get_architecture()
            tokenizer_model = reader.get_tokenizer_model()

            # This is the key condition
            if architecture == "gemma" and tokenizer_model == "bpe":
                logger.info(
                    "Gemma GGUF with BPE tokenizer detected. Forcing tokenizer class to 'GemmaTokenizer'."
                )
                tokenizer_config["tokenizer_class"] = "GemmaTokenizer"
    except Exception as e:
        logger.warning(f"Could not read GGUF metadata to determine tokenizer class: {e}")
# --- END OF PROPOSED FIX ---
```
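The override logic in the proposed diff above boils down to a small, self-contained check. A minimal sketch of just that decision (the helper name `apply_gguf_tokenizer_override` is illustrative, not part of the transformers API):

```python
# Minimal sketch of the override in the proposed fix: given the architecture
# and tokenizer model read from a GGUF file's metadata, patch the tokenizer
# config in place so the correct class is instantiated downstream.

def apply_gguf_tokenizer_override(tokenizer_config, architecture, tokenizer_model):
    """Force GemmaTokenizer for Gemma GGUF files that declare a BPE tokenizer."""
    if architecture == "gemma" and tokenizer_model == "bpe":
        tokenizer_config["tokenizer_class"] = "GemmaTokenizer"
    return tokenizer_config

# A config that would otherwise pick the wrong (Unigram-based) class:
config = {"tokenizer_class": "LlamaTokenizer"}
apply_gguf_tokenizer_override(config, "gemma", "bpe")
print(config["tokenizer_class"])  # GemmaTokenizer
```

Any other architecture/tokenizer-model pair falls through and leaves the config untouched.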
Contributor
I don't think this is a good place to add such a fix for the Gemma special case. Can we propose a fix in the ggml integration instead?
transformers/src/transformers/integrations/ggml.py
Lines 675 to 737 in e2122c4
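One way the special case could live in the ggml integration instead, as the reviewer suggests, is a lookup table keyed by architecture and tokenizer model that is consulted when choosing the tokenizer class. The names below are illustrative assumptions, not the actual structures in `integrations/ggml.py`:

```python
# Hypothetical sketch of a table-driven override in the ggml integration:
# instead of a Gemma-specific branch in AutoTokenizer, map
# (architecture, tokenizer model) pairs to the class that should be used.

GGUF_TOKENIZER_OVERRIDES = {
    ("gemma", "bpe"): "GemmaTokenizer",
    # Future special cases would be added here rather than in auto-loading code.
}

def resolve_gguf_tokenizer_class(architecture, tokenizer_model, default):
    """Return the override for this (architecture, tokenizer model) pair, else the default."""
    return GGUF_TOKENIZER_OVERRIDES.get((architecture, tokenizer_model), default)

print(resolve_gguf_tokenizer_class("gemma", "bpe", "LlamaTokenizer"))   # GemmaTokenizer
print(resolve_gguf_tokenizer_class("llama", "bpe", "LlamaTokenizer"))   # LlamaTokenizer
```

Keeping the mapping in the GGUF integration keeps model-specific knowledge out of the generic auto-loading path.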
Member
cc @SunMarc @MekkCyber for GGUF
Contributor
[For maintainers] Suggested jobs to run (before merge): run-slow: auto
Contributor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=41609&sha=3b7770
This was referenced Apr 29, 2026
This PR fixes an issue where AutoTokenizer.from_pretrained would instantiate an incorrect Unigram tokenizer when loading from a Gemma GGUF file, instead of the correct BPE-based GemmaTokenizer. The root cause was that the GGUF loading logic had no explicit check for the Gemma architecture. This fix introduces a check that:

- runs only when a gguf_file is being loaded;
- reads the GGUF metadata to confirm the architecture is gemma and the tokenizer model is bpe;
- overrides the tokenizer_class in the loaded configuration to GemmaTokenizer, forcing the rest of the function to instantiate the correct class.

This ensures that loading a tokenizer from a Gemma GGUF file produces the same tokenization as loading from the original Hugging Face repository.
Fixes #41494