What happened?
$ ./tokenize codegemma-2b.gguf $'\t\t\t\t\t\ttest'
[snip]
2 -> '<bos>'
255970 -> '\t\t\t'
255970 -> '\t\t\t'
2121 -> ' test'
$ echo $'\t\t\t\t\t\ttest' | spm_encode --model codegemma-2b.model --input /dev/stdin --output_format id
255973 2195
$ echo "255970 255970 2121" | spm_decode --model codegemma-2b.model --input /dev/stdin --input_format id | jq -R .
"\t\t\t\t\t\t test"
$ echo "255973 2195" | spm_decode --model codegemma-2b.model --input /dev/stdin --input_format id | jq -R .
"\t\t\t\t\t\ttest"
Note that the input is six tabs followed by "test", i.e. "\t\t\t\t\t\ttest". Take care not to accidentally use spaces when reproducing.
Note that this is not just inserting a stray space before "test": it also breaks the tabs into two sets of 3 instead of a single set of 6.
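To make reproduction less error-prone (literal tabs are easily turned into spaces by terminals and editors), here is a sketch that builds the input with printf. It assumes the same binaries and model files as the transcript above; /tmp/tabs.txt is just an arbitrary scratch path:

$ printf '\t\t\t\t\t\ttest' > /tmp/tabs.txt
$ ./tokenize codegemma-2b.gguf "$(cat /tmp/tabs.txt)"
$ spm_encode --model codegemma-2b.model --input /tmp/tabs.txt --output_format id

The first command should show the 255970 255970 2121 split reported above, while spm_encode should print 255973 2195.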
Inputs like this (leading indentation followed by text) happen a lot with code.
There are three issues here:
- Mismatch between the tokenization the model was trained on and the one llama.cpp produces. Inserting a space is definitely out-of-distribution, particularly for languages with strong formatting opinions (Go) or significant whitespace (Python).
- The llama.cpp tokenizer doesn't round-trip (it inserts an extraneous space); a quick check is sketched after this list.
- The llama.cpp tokenizer uses more tokens to represent the input (3 instead of 2 here).
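For completeness, here is a sketch of a round-trip check using only the SentencePiece reference tooling (same model file as above; /tmp/in.txt and /tmp/out.txt are arbitrary scratch paths). Encoding and immediately decoding should reproduce the input, so diff should print nothing; substituting llama.cpp's ids (255970 255970 2121 from the transcript) into the decode step instead yields the version with the extra space:

$ printf '\t\t\t\t\t\ttest\n' > /tmp/in.txt
$ spm_encode --model codegemma-2b.model --input /tmp/in.txt --output_format id \
    | spm_decode --model codegemma-2b.model --input /dev/stdin --input_format id > /tmp/out.txt
$ diff /tmp/in.txt /tmp/out.txt && echo "round-trips"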
Thanks!
Name and Version
$ ./llama-cli --version
version: 3325 (87e25a1d)
(head as of Sat Jul 6 09:22:16 2024 +0200)
What operating system are you seeing the problem on?
Linux, Mac
Relevant log output
No response