Fix gemma2 tokenizer convert #8244
Conversation
@ngxson this is closer to the HF tokenizer in my tests; however, when trying out the CLI / server I've noticed that newline generation seems broken (newlines don't occur at all).
@abetlen Thanks for testing that. I've just tried on my side, and I can confirm that new line generation is also broken.
Edit: sorry, I made a mistake. The output new line token is correct (token ID 108); I'm investigating this further.
@abetlen Turns out the new line token and all tokens from ID 108 onward were marked as control, while they should be normal tokens. I fixed my code and it should work correctly now. Tokenized (main.log): I also split the code into
This seems like the correct one. It even properly tokenizes the prompt containing `discards` (noted in the discussion) as `disc` and `ards`. @ggerganov if you want to take a look
```python
for i in range(108):
    # including <unusedX>, <start_of_turn>, <end_of_turn>
    toktypes[i] = SentencePieceTokenTypes.CONTROL

self.gguf_writer.add_tokenizer_model("llama")
```
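A minimal self-contained sketch of what this fix amounts to, assuming the GGUF token-type values used by `gguf-py` (`NORMAL = 1`, `CONTROL = 3`) and Gemma 2's 256000-entry vocabulary. The key point is that only IDs 0-107 are marked as control, so the newline token (ID 108) stays a normal token:

```python
from enum import IntEnum

class SentencePieceTokenTypes(IntEnum):
    # Token-type values as defined in gguf-py
    NORMAL = 1
    CONTROL = 3

vocab_size = 256_000  # Gemma 2 vocabulary size
toktypes = [SentencePieceTokenTypes.NORMAL] * vocab_size

# The fix: only the first 108 IDs (<pad>, <bos>, <eos>, <unusedX>,
# <start_of_turn>, <end_of_turn>, ...) are control tokens.
for i in range(108):
    toktypes[i] = SentencePieceTokenTypes.CONTROL

# The newline token (ID 108) is no longer misclassified as control:
assert toktypes[107] == SentencePieceTokenTypes.CONTROL
assert toktypes[108] == SentencePieceTokenTypes.NORMAL
```

Before the fix, the cutoff effectively included ID 108 and everything after it, which is why newlines (and any token with a higher ID) were suppressed as control tokens during generation.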
I know it's merged, it's a nitpick, and it ignores the rule of three, but especially for someone with little understanding of what the code is actually doing (it's all magic to me), it would benefit from a separate method covering this and the sequence of calls on lines 582-586.
Besides removing a small duplication, it could also serve as a helper for understanding what the code does. And someday someone may fix something in one place and not the other.
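A hedged sketch of the refactor this comment suggests. The helper name `write_sentencepiece_vocab` and the stub writer are hypothetical, purely for illustration; the real code calls these methods on `gguf.GGUFWriter` directly at both call sites:

```python
# Stand-in for gguf.GGUFWriter that records calls, for demonstration only.
class TokenizerWriterStub:
    def __init__(self):
        self.calls = []
    def add_tokenizer_model(self, model):
        self.calls.append(("tokenizer_model", model))
    def add_token_list(self, tokens):
        self.calls.append(("token_list", tokens))
    def add_token_types(self, toktypes):
        self.calls.append(("token_types", toktypes))

def write_sentencepiece_vocab(writer, tokens, toktypes):
    """Hypothetical shared helper so both call sites emit the vocab identically."""
    writer.add_tokenizer_model("llama")
    writer.add_token_list(tokens)
    writer.add_token_types(toktypes)

writer = TokenizerWriterStub()
write_sentencepiece_vocab(writer, ["<pad>", "<bos>"], [3, 3])
```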
❤️
I think you're new to the code base. If you have a look at the other parts of the file, there are even more duplications. It's not that we don't care about this, but sometimes duplication makes it more visible what the code does.
* fix gemma2 tokenizer convert
* remove scores
* improve code, fix new line issue

Ref comment:
The output model is capable of tokenizing special tokens (used in chat templates):
Perplexity is also improved, from 8.9711 to 7.8952 (I'm using q8_0 because the Colab notebook does not have enough VRAM for f16).