
Bug: SPM tokenization breaks in at least one specific case. #7629

@snichols

Description

What happened?

Consider this code snippet:

#include <iostream>
#include <vector>
#include "common.h" // provides the llama_tokenize convenience wrapper

auto chat_ml_tokens = llama_tokenize(model, "<|im_start|><|im_end|>\n", false, true);
std::cout << "chat_ml_tokens found:";
for (const auto t : chat_ml_tokens) {
    std::cout << " " << t;
}
std::cout << "\n";

With the latest version, this generates the following output:

chat_ml_tokens found: 32001 32000 28705 13

An earlier version of llama.cpp produced the correct tokenization:

chat_ml_tokens found: 32001 32000 13

This work is based on https://huggingface.co/cognitivecomputations/dolphin-2.8-mistral-7b-v02.

If I tokenize each component separately, I get the correct result for each token. However, tokenizing <|im_end|>\n produces the extra 28705 token, while <|im_start|>\n tokenizes correctly, so there is something special about this <|im_end|>. I haven't methodically gone over previous commits to find when the problem was introduced; let me know if that would help narrow down the cause.
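
For reference, here's roughly what the per-component check looks like; it assumes the same model handle and llama_tokenize wrapper as the snippet above:

for (const std::string & piece : {"<|im_start|>", "<|im_end|>", "\n", "<|im_start|>\n", "<|im_end|>\n"}) {
    // Tokenize each piece in isolation and dump the resulting IDs.
    const auto toks = llama_tokenize(model, piece, false, true);
    std::cout << "piece ->";
    for (const auto t : toks) {
        std::cout << " " << t;
    }
    std::cout << "\n";
}

Every entry except <|im_end|>\n comes back as expected.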

I'm fairly confident I can work around the problem by tokenizing each element separately, so I'll do that and run the model through some tests. That said, this may be surfacing other tokenization issues in the code.
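
The workaround I have in mind is something like the sketch below; tokenize_pieces is just a hypothetical helper name, built on the same llama_tokenize wrapper:

static std::vector<llama_token> tokenize_pieces(const llama_model * model, const std::vector<std::string> & pieces) {
    // Tokenize each element on its own and splice the results, so the
    // problematic "<|im_end|>\n" combination is never handed to the tokenizer.
    std::vector<llama_token> out;
    for (const auto & p : pieces) {
        const auto toks = llama_tokenize(model, p, false, true);
        out.insert(out.end(), toks.begin(), toks.end());
    }
    return out;
}

// Should yield: 32001 32000 13
auto tokens = tokenize_pieces(model, {"<|im_start|>", "<|im_end|>", "\n"});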

Name and Version

This is a custom app built against tag b3040 of llama.cpp: https://github.com/ggerganov/llama.cpp/releases/tag/b3040

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Labels

bug-unconfirmed, medium severity (malfunctioning feature but still usable), stale