llama : fix bpe tokenize from byte #2889
Conversation
This helps, but it seems like there are still pretty serious issues with BPE handling. For BPE, I tried changing https://github.com/ggerganov/llama.cpp/blob/ad9ddcff6ef322db5cf13785bd7c856b610d242e/llama.cpp#L6155-L6157 to

```cpp
if (llama_is_normal_token(model->vocab, token)
    || llama_is_user_defined_token(model->vocab, token)) {
```

which helps, but since everything is marked as user-defined, things like byte tokens don't get decoded. I'd guess what
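To illustrate the point about byte tokens (a self-contained mock, not llama.cpp's actual API): a byte token such as `<0x41>` needs its own decode branch that turns it back into the raw byte, otherwise its literal text leaks into the output.

```cpp
#include <cstdio>
#include <string>

enum token_type { NORMAL, USER_DEFINED, BYTE };

struct mock_token {
    std::string text;
    token_type  type;
};

// return the piece of text a token stands for
static std::string token_to_piece(const mock_token & t) {
    if (t.type == NORMAL || t.type == USER_DEFINED) {
        return t.text; // fine for real text tokens
    }
    // BYTE: decode "<0xNN>" back to the raw byte it stands for
    int byte = std::stoi(t.text.substr(3, 2), nullptr, 16);
    return std::string(1, (char) byte);
}

int main() {
    mock_token byte_tok = { "<0x41>", BYTE };
    // without the BYTE branch this would print the literal "<0x41>"
    std::printf("decoded: '%s'\n", token_to_piece(byte_tok).c_str()); // 'A'
    return 0;
}
```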
ggerganov
left a comment
This wasn't needed in the reference implementation, which is supposed to work 100% correctly:
I think this is a workaround for the missing Unicode support in llama.cpp's BPE tokenizer.
Is it better to add this sort of patch, or should we try to implement a 100% correct BPE tokenizer? For the latter to happen, we would need help from someone with enough knowledge to implement it in a compact way.
From what you said over here (#2938 (review)), it seems like an argument against the workaround. Basically, at least for the BPE models I tested, it got them a little closer to doing something other than just crashing, but they were still pretty far from being able to produce usable output. I might be wrong, but my impression is that without actually supporting the Unicode stuff it won't be possible to really use them, with or without the workaround. @opparco Seems like the model you were testing with is
The byte-fallback feature is an option in the training phase.

In google/sentencepiece:

In huggingface/tokenizers:

```python
class BPE(Model):
    """
    An implementation of the BPE (Byte-Pair Encoding) algorithm

    ...

    byte_fallback (:obj:`bool`, `optional`):
        Whether to use spm byte-fallback trick (defaults to False)
    """
```

Implementation: sentencepiece bpe compatible tokenizer #252

```cpp
for (int i = 0; i != -1; i = symbols_[i].next) {
    auto & symbol = symbols_[i];
    auto token = vocab_.token_to_id.find(std::string(symbol.text));

    if (token == vocab_.token_to_id.end()) {
        // output any symbols that did not form tokens as bytes.
        for (int j = 0; j < symbol.text.size(); ++j) {
            gpt_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;
            output.push_back(token_id);
        }
    } else {
        output.push_back((*token).second);
    }
}
```

This code maps bytes to ids correctly. In libfalcon.cpp, the byte-fallback feature is missing, so tokenization there is broken.
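For context on the `+ 3`: it assumes the LLaMA-style SentencePiece id layout, where ids 0-2 are `<unk>`, `<s>`, `</s>` and the 256 byte tokens `<0x00>` through `<0xFF>` occupy ids 3-258. A tiny standalone sketch (the helper name is hypothetical):

```cpp
#include <cstdint>
#include <cstdio>

// same "+ 3" as in the snippet above: skip <unk>, <s>, </s>
static int byte_to_id(uint8_t ch) {
    return static_cast<int>(ch) + 3;
}

int main() {
    // 'A' is byte 0x41 (65), so it maps to id 68 under this layout
    std::printf("byte 0x%02X -> id %d\n", (unsigned) 'A', byte_to_id('A'));
    return 0;
}
```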
ggerganov
left a comment
Thank you for the explanation. So it seems that if byte-fallback was enabled during training, we do need this patch.
I guess it is OK to merge then.
One more question: if we use the byte-fallback feature with a model that did not have it enabled during training, would the tokenization end up wrong? Or would we never even hit the byte-fallback branch?
My guess is the latter, but I just want to confirm.
Even in the current code, there are invalid input prompts that reach the ERROR branch, e.g. a prompt containing a byte (such as part of a multi-byte UTF-8 sequence) that has no single-byte entry in the vocab:

```cpp
const auto token = vocab.token_to_id.find(str);

if (token == vocab.token_to_id.end()) {
    for (auto j = str.begin(); j != str.end(); ++j) {
        std::string byte_str(1, *j);
        auto token_multibyte = vocab.token_to_id.find(byte_str);
        if (token_multibyte == vocab.token_to_id.end()) {
            fprintf(stderr, "ERROR: byte not found in vocab: '%s'\n", byte_str.c_str());
        }
        output.push_back((*token_multibyte).second); // note: dereferences end() when the byte was not found
    }
} else {
    output.push_back((*token).second);
}
```

Even if the ERROR branch is reached with a model trained with byte_fallback = false, byte-fallback is not involved in that case.
In llm_tokenizer_bpe::tokenize, if the lookup in vocab.token_to_id fails, llama_byte_to_token(vocab, ch) should be tried.
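A minimal sketch of what that could look like, mirroring the snippet above (an approximation of the idea, not the exact patch; it assumes llama_byte_to_token reports a missing byte token by throwing, e.g. std::out_of_range from a map .at() lookup):

```cpp
const auto token = vocab.token_to_id.find(str);

if (token == vocab.token_to_id.end()) {
    for (auto j = str.begin(); j != str.end(); ++j) {
        std::string byte_str(1, *j);
        auto token_multibyte = vocab.token_to_id.find(byte_str);
        if (token_multibyte != vocab.token_to_id.end()) {
            output.push_back(token_multibyte->second);
            continue;
        }
        // byte-fallback: try the byte-token form of this byte, e.g. "<0x41>"
        try {
            output.push_back(llama_byte_to_token(vocab, *j));
        } catch (const std::out_of_range &) {
            // only now is the byte truly absent from the vocab
            fprintf(stderr, "ERROR: byte not found in vocab: '%s'\n", byte_str.c_str());
        }
    }
} else {
    output.push_back(token->second);
}
```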