
Bug: SPM tokenization breaks in at least one specific case. #7629

@snichols

Description

What happened?

Consider this code snippet:

#include <iostream>
#include <vector>
#include "common.h" // provides the llama_tokenize convenience wrapper

auto chat_ml_tokens = llama_tokenize(model, "<|im_start|><|im_end|>\n", false, true);
std::cout << "chat_ml_tokens found:";
for (const auto t : chat_ml_tokens) {
    std::cout << " " << t;
}
std::cout << "\n";

With the latest version, this generates the following output:

chat_ml_tokens found: 32001 32000 28705 13

An earlier version of llama.cpp produced the correct tokenization:

chat_ml_tokens found: 32001 32000 13

This work is based on https://huggingface.co/cognitivecomputations/dolphin-2.8-mistral-7b-v02.

If I tokenize each component separately, I get the correct result for each token. However, tokenizing <|im_end|>\n produces the extra 28705 token, while <|im_start|>\n tokenizes correctly, so there is something special about this <|im_end|>. I haven't methodically gone over previous commits to find when the problem was introduced; let me know if that would help narrow down the cause.
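
For reference, here's roughly what the per-component check looks like; it assumes the same model handle and llama_tokenize wrapper as the snippet above:

for (const std::string & piece : {"<|im_start|>", "<|im_end|>", "\n", "<|im_start|>\n", "<|im_end|>\n"}) {
    // Tokenize each piece in isolation and dump the resulting IDs.
    const auto toks = llama_tokenize(model, piece, false, true);
    std::cout << "piece ->";
    for (const auto t : toks) {
        std::cout << " " << t;
    }
    std::cout << "\n";
}

Every entry except <|im_end|>\n comes back as expected.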

I'm fairly confident I can work around the problem by tokenizing each element separately, so I'll do that and run the model through some tests. That said, this may be surfacing other tokenization issues in the code.
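
The workaround I have in mind is something like the sketch below; tokenize_pieces is just a hypothetical helper name, built on the same llama_tokenize wrapper:

static std::vector<llama_token> tokenize_pieces(const llama_model * model, const std::vector<std::string> & pieces) {
    // Tokenize each element on its own and splice the results, so the
    // problematic "<|im_end|>\n" combination is never handed to the tokenizer.
    std::vector<llama_token> out;
    for (const auto & p : pieces) {
        const auto toks = llama_tokenize(model, p, false, true);
        out.insert(out.end(), toks.begin(), toks.end());
    }
    return out;
}

// Should yield: 32001 32000 13
auto tokens = tokenize_pieces(model, {"<|im_start|>", "<|im_end|>", "\n"});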

Name and Version

This is a custom app built against tag b3040 of llama.cpp: https://github.com/ggerganov/llama.cpp/releases/tag/b3040

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Labels

bug-unconfirmed, medium severity (malfunctioning feature but still usable), stale