Fix gemma2 tokenizer convert #8244
Conversation
@ngxson this is closer to the HF tokenizer in my tests; however, when trying out the CLI / server I've noticed that newline generation seems broken (newlines don't occur at all).
@abetlen Thanks for testing that. I've just tried on my side, and I can confirm that new line generation is also broken.
Edit: sorry, I made a mistake. The output new line token is correct (token ID 108); I'm investigating this further.
@abetlen Turns out the new line token and all tokens from ID 108 onward were marked as control, while they should be normal tokens. I fixed my code and it should work correctly now. Tokenized (main.log): I also split the code into
This seems like the correct one. It even properly tokenizes the prompt containing `discards` (noted in the discussion) as `disc` and `ards`. @ggerganov if you want to take a look
```python
for i in range(108):
    # including <unusedX>, <start_of_turn>, <end_of_turn>
    toktypes[i] = SentencePieceTokenTypes.CONTROL

self.gguf_writer.add_tokenizer_model("llama")
```
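A minimal self-contained sketch of what this fix amounts to, assuming the GGUF token-type values used by `gguf-py` (`NORMAL = 1`, `CONTROL = 3`) and Gemma 2's 256000-entry vocabulary. The key point is that only IDs 0-107 are marked as control, so the newline token (ID 108) stays a normal token:

```python
from enum import IntEnum

class SentencePieceTokenTypes(IntEnum):
    # Token-type values as defined in gguf-py
    NORMAL = 1
    CONTROL = 3

vocab_size = 256_000  # Gemma 2 vocabulary size
toktypes = [SentencePieceTokenTypes.NORMAL] * vocab_size

# The fix: only the first 108 IDs (<pad>, <bos>, <eos>, <unusedX>,
# <start_of_turn>, <end_of_turn>, ...) are control tokens.
for i in range(108):
    toktypes[i] = SentencePieceTokenTypes.CONTROL

# The newline token (ID 108) is no longer misclassified as control:
assert toktypes[107] == SentencePieceTokenTypes.CONTROL
assert toktypes[108] == SentencePieceTokenTypes.NORMAL
```

Before the fix, the cutoff effectively included ID 108 and everything after it, which is why newlines (and any token with a higher ID) were suppressed as control tokens during generation.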
I know it's merged, it's a nitpick, and it ignores the rule of three, but especially for someone with little understanding of what the code is actually doing (it's all magic to me), it would benefit from a separate method covering this and the sequence of calls on lines 582-586.
Besides removing a small duplication, it could also serve as a helper for understanding what the code does. And someday someone may fix something in one place and not the other.
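A hedged sketch of the refactor this comment suggests. The helper name `write_sentencepiece_vocab` and the stub writer are hypothetical, purely for illustration; the real code calls these methods on `gguf.GGUFWriter` directly at both call sites:

```python
# Stand-in for gguf.GGUFWriter that records calls, for demonstration only.
class TokenizerWriterStub:
    def __init__(self):
        self.calls = []
    def add_tokenizer_model(self, model):
        self.calls.append(("tokenizer_model", model))
    def add_token_list(self, tokens):
        self.calls.append(("token_list", tokens))
    def add_token_types(self, toktypes):
        self.calls.append(("token_types", toktypes))

def write_sentencepiece_vocab(writer, tokens, toktypes):
    """Hypothetical shared helper so both call sites emit the vocab identically."""
    writer.add_tokenizer_model("llama")
    writer.add_token_list(tokens)
    writer.add_token_types(toktypes)

writer = TokenizerWriterStub()
write_sentencepiece_vocab(writer, ["<pad>", "<bos>"], [3, 3])
```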
❤️
I think you're new to the code base. If you have a look at the other parts of the file, there are even more duplications. It's not that we don't care about this, but sometimes duplication makes it more visible what the code does.
* fix gemma2 tokenizer convert
* remove scores
* improve code, fix new line issue

Ref comment:
The output model is capable of tokenizing special tokens (used in chat templates):
Perplexity is also improved, from 8.9711 to 7.8952 (I'm using q8_0 because the Colab notebook does not have enough VRAM for f16).