What happened?
Consider this code snippet:
auto chat_ml_tokens = llama_tokenize(model, "<|im_start|><|im_end|>\n", false, true);
std::cout << "chat_ml_tokens found:";
for (const auto t : chat_ml_tokens) {
    std::cout << " " << t;
}
With the latest version, this generates the following output:
chat_ml_tokens found: 32001 32000 28705 13
An earlier version of llama.cpp generated the correct tokenization:
chat_ml_tokens found: 32001 32000 13
This work is based on https://huggingface.co/cognitivecomputations/dolphin-2.8-mistral-7b-v02.
If I tokenize each component separately, I get the correct results for each token. However, tokenizing <|im_end|>\n results in an extra 28705 token in the output. Interestingly enough, <|im_start|>\n is also correct. There's something extra special about this <|im_end|>. I haven't methodically gone over previous commits to see when this problem was introduced. Let me know if that'll help narrow the cause down.
I'm pretty confident that I can work around this problem just by tokenizing each element separately. I'll do that and run the model through some tests. That being said, there may be some other tokenization issues in the code that are being surfaced by this.
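For reference, here is a minimal sketch of that workaround, using the same llama_tokenize helper and flags as the snippet above. The exact split into pieces is just my assumption; adjust it to whatever elements your prompt builder already produces:

// Workaround sketch: tokenize each chat-ML element on its own and concatenate,
// instead of tokenizing "<|im_start|><|im_end|>\n" as a single string.
std::vector<llama_token> chat_ml_tokens;
for (const std::string piece : {"<|im_start|>", "<|im_end|>", "\n"}) {
    const auto toks = llama_tokenize(model, piece, false, true);
    chat_ml_tokens.insert(chat_ml_tokens.end(), toks.begin(), toks.end());
}

Concatenating the per-piece results avoids whatever merge is inserting the extra 28705 token, at the cost of bypassing any cross-piece tokenization.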
Name and Version
This is a custom app using tag b3040 of llama.cpp. https://github.com/ggerganov/llama.cpp/releases/tag/b3040
What operating system are you seeing the problem on?
Linux
Relevant log output
No response