
Fix gemma gguf tokenizer #41609

Open
CodersAcademy006 wants to merge 4 commits into huggingface:main from CodersAcademy006:fix-gemma-gguf-tokenizer

Conversation

@CodersAcademy006

This PR fixes an issue where AutoTokenizer.from_pretrained would instantiate an incorrect Unigram tokenizer when loading from a Gemma GGUF file, instead of the correct BPE-based GemmaTokenizer.

The root cause was that the GGUF loading logic did not have an explicit check to handle the Gemma architecture specifically. This fix introduces a check that:

  1. Detects if a gguf_file is being loaded.
  2. Reads the GGUF metadata to confirm the architecture is gemma and the tokenizer model is bpe.
  3. Overrides the tokenizer_class in the loaded configuration to GemmaTokenizer, forcing the rest of the function to instantiate the correct class.

This ensures that loading a tokenizer from a Gemma GGUF file produces the same tokenization as loading from the original Hugging Face repository.

Fixes #41494
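
A minimal sketch of how the reported mismatch can be checked, assuming the `gguf_file` argument of `AutoTokenizer.from_pretrained`; the repo id and GGUF filename below are placeholders for illustration, not paths taken from this PR:

from transformers import AutoTokenizer

# Placeholder repo id and GGUF filename, used only for illustration;
# substitute a real Gemma GGUF checkpoint.
gguf_tokenizer = AutoTokenizer.from_pretrained(
    "google/gemma-2b-GGUF",
    gguf_file="gemma-2b.Q4_K_M.gguf",
)
hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

text = "Hello Gemma!"
# With the fix applied, both tokenizers should produce the same input ids.
assert gguf_tokenizer(text)["input_ids"] == hf_tokenizer(text)["input_ids"]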

Comment on lines +1075 to +1095
# --- START OF PROPOSED FIX ---
if gguf_file:
    # If a GGUF file is provided, we inspect its metadata to ensure the correct tokenizer is used,
    # especially for cases like Gemma where the model type alone is not enough.
    try:
        from ...utils.gguf import GGUFReader

        # We need to resolve the path to the GGUF file
        gguf_path = cached_file(pretrained_model_name_or_path, gguf_file, **kwargs)
        if gguf_path is not None:
            reader = GGUFReader(gguf_path)
            architecture = reader.get_architecture()
            tokenizer_model = reader.get_tokenizer_model()

            # This is the key condition
            if architecture == "gemma" and tokenizer_model == "bpe":
                logger.info("Gemma GGUF with BPE tokenizer detected. Forcing tokenizer class to 'GemmaTokenizer'.")
                tokenizer_config["tokenizer_class"] = "GemmaTokenizer"
    except Exception as e:
        logger.warning(f"Could not read GGUF metadata to determine tokenizer class: {e}")
# --- END OF PROPOSED FIX ---
Contributor

I think this is not a good place to add such a fix for the Gemma special case. Can we propose a fix in the ggml integration instead?

class GGUFGemmaConverter(GemmaConverter):
    def __init__(self, tokenizer_dict):
        # set dummy data to avoid unnecessary merges calculation
        tokenizer_dict["merges"] = ["dummy text"]
        self.proto = GGUFTokenizerSkeleton(tokenizer_dict)
        self.original_tokenizer = self.proto
        self.additional_kwargs = {}

    def vocab(self, proto):
        original_vocab = list(zip(proto.tokens, proto.scores))
        updated_vocab = []
        for token, score in original_vocab:
            if token == "<0x09>":
                # GGUF stores the tab character as a byte token
                updated_vocab.append(("\t", score))
            elif " " in token and len(token.strip()) == 0:
                # whitespace-only tokens are mapped back to SentencePiece-style "▁" runs
                underscores = "▁" * len(token)
                updated_vocab.append((underscores, score))
            else:
                updated_vocab.append((token, score))
        return updated_vocab

    def normalizer(self, proto):
        return normalizers.Replace(" ", "▁")

    def decoder(self, replacement, add_prefix_space):
        sequence = [
            decoders.Replace("▁", " "),
            decoders.ByteFallback(),
            decoders.Fuse(),
        ]
        if add_prefix_space:
            sequence += [decoders.Strip(content=" ", left=1)]
        return decoders.Sequence(sequence)

    def converted(self) -> Tokenizer:
        # build a Unigram tokenizer from the GGUF vocab and scores
        vocab_scores = self.vocab(self.proto)
        tokenizer = Tokenizer(
            Unigram(
                vocab_scores,
                unk_id=self.proto.unk_token_id,
                byte_fallback=self.handle_byte_fallback,
            )
        )

        normalizer = self.normalizer(self.proto)
        if normalizer is not None:
            tokenizer.normalizer = normalizer

        replacement = "▁"
        add_prefix_space = True
        if hasattr(self.original_tokenizer, "add_prefix_space"):
            add_prefix_space = self.original_tokenizer.add_prefix_space

        tokenizer.decoder = self.decoder(replacement, add_prefix_space)
        pre_tokenizer = self.pre_tokenizer(replacement, add_prefix_space)
        if pre_tokenizer is not None:
            tokenizer.pre_tokenizer = pre_tokenizer

        return tokenizer
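
For reference, a sketch of how the GGUF tokenizer conversion is dispatched per architecture, which is why a Gemma-specific adjustment would sit in the converter above rather than in AutoTokenizer. The `GGUF_TO_FAST_CONVERTERS` mapping and `convert_gguf_tokenizer` names follow integrations/ggml.py, but treat the exact signatures here as assumptions:

# Sketch of the dispatch in integrations/ggml.py (assumed structure, trimmed).
GGUF_TO_FAST_CONVERTERS = {
    # ...
    "gemma": GGUFGemmaConverter,
    # ...
}

def convert_gguf_tokenizer(architecture, tokenizer_dict):
    # Look up the architecture-specific converter and build the fast tokenizer,
    # so per-model quirks (like Gemma's vocab handling) stay inside the converter.
    converter = GGUF_TO_FAST_CONVERTERS[architecture](tokenizer_dict)
    fast_tokenizer = converter.converted()
    return fast_tokenizer, converter.additional_kwargs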

@Rocketknight1
Member

cc @SunMarc @MekkCyber for GGUF

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=41609&sha=3b7770
