convert : fix gemma v1 tokenizer convert #8248

Merged: ggerganov merged 2 commits into master from gg/fix-gemma on Jul 4, 2024

ggerganov commented Jul 2, 2024

Follow up on #8244

It seems that Gemma v1 tokenization has always been broken because the converted model was missing the add_space_prefix == false flag. This PR fixes the conversion and also adds tokenizer tests for both Gemma and Gemma-2.
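
To illustrate why the flag matters, here is a toy sketch of SentencePiece-style input preprocessing. It is not the llama.cpp implementation: the function name `spm_preprocess` is hypothetical, and the real tokenizer handles more cases (special tokens, text fragments). It only shows that a space-prefix setting left at "true" changes the text the tokenizer sees, so the first word can map to a different token than the one the reference tokenizer produces.

```python
SPM_SPACE = "\u2581"  # the SentencePiece whitespace marker


def spm_preprocess(text: str, add_space_prefix: bool) -> str:
    """Toy model of SentencePiece-style preprocessing (hypothetical helper)."""
    if add_space_prefix:
        # A leading space is prepended, so the first word gets the marker too.
        text = " " + text
    # Every space is replaced by the visible whitespace marker.
    return text.replace(" ", SPM_SPACE)


with_prefix = spm_preprocess("Hello world", add_space_prefix=True)
without_prefix = spm_preprocess("Hello world", add_space_prefix=False)

print(with_prefix)     # first word starts with the marker
print(without_prefix)  # first word has no marker
```

Under this simplified model, a model that expects add_space_prefix == false but is converted without the flag would see the first variant, which is presumably the kind of mismatch the tokenizer tests are meant to catch.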

```bash
# get tokenizers
python3 convert-hf-to-gguf-update.py <hf_token>

# generate ggml vocabs and tests
python3 convert-hf-to-gguf.py models/tokenizers/gemma/   --outfile models/ggml-vocab-gemma.gguf   --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/gemma-2/ --outfile models/ggml-vocab-gemma-2.gguf --vocab-only

# run the tests
make -j tests
./tests/test-tokenizer-0 models/ggml-vocab-gemma.gguf
./tests/test-tokenizer-0 models/ggml-vocab-gemma-2.gguf
```
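
The conversion and test steps above repeat the same pattern per tokenizer, so they can be sketched as a small dry-run helper. This is a convenience sketch, not part of the PR: the function name `build_commands` is hypothetical, and the paths and flags are copied from the commands above. It assumes the tokenizers have already been fetched with convert-hf-to-gguf-update.py, and it only prints the commands rather than running them.

```python
import shlex


def build_commands(names):
    """Return the convert, build, and test commands for each tokenizer name."""
    commands = []
    # One vocab-only conversion per tokenizer directory.
    for name in names:
        commands.append([
            "python3", "convert-hf-to-gguf.py", f"models/tokenizers/{name}/",
            "--outfile", f"models/ggml-vocab-{name}.gguf", "--vocab-only",
        ])
    # Build the test binaries once.
    commands.append(["make", "-j", "tests"])
    # Run the tokenizer test against each generated vocab file.
    for name in names:
        commands.append(["./tests/test-tokenizer-0", f"models/ggml-vocab-{name}.gguf"])
    return commands


# Dry run: print the commands instead of executing them.
for cmd in build_commands(["gemma", "gemma-2"]):
    print(shlex.join(cmd))
```

To actually execute the steps from the repository root, each list could be passed to `subprocess.run(cmd, check=True)`, which stops at the first failing command.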

github-actions bot added the python (python script changes) label Jul 2, 2024
mofosyne added the Review Complexity : Medium label (generally requires more time to grok but manageable by beginner to medium expertise level), and added then removed the medium severity label (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features that are still usable) Jul 3, 2024
ggerganov merged commit 20fc380 into master Jul 4, 2024
ggerganov deleted the gg/fix-gemma branch July 4, 2024 07:41
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 7, 2024
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
Labels

python (python script changes)
Review Complexity : Medium (generally requires more time to grok but manageable by beginner to medium expertise level)

2 participants