Name and Version
611aa91
Operating systems
Linux
GGML backends
CUDA
Hardware
NVIDIA
Models
Qwen3
Problem description & steps to reproduce
I don't know if it's too soon, but I'm opening this to keep track of the issue.
The original Qwen3 template is not supported yet, but the bug can be reproduced by using a modified template.
The qwen3 template contains the following check (stripped down to the relevant section):
```jinja
{%- if loop.index0 > ns.last_query_index %}
  {%- if loop.last or (not loop.last and reasoning_content) %}
    {# KEEP REASONING TOKENS #}
```
In the common case this means the reasoning tokens are kept when the last role is `assistant`, and discarded when the last role is `user`.
The problem is that at the start of the turn, the following pseudo-code is executed:
```
messages.append(user_message)
fmt_past_msg = apply_chat_template(messages)
messages.append(assistant_message)
fmt_new_msg = apply_chat_template(messages)
diff = fmt_new_msg - fmt_past_msg
```
The diff is not computed correctly, since the assistant message in the first rendering (`fmt_past_msg`) has its thinking tokens preserved while the same message in the second rendering (`fmt_new_msg`) has them removed, so `fmt_past_msg` is no longer a prefix of `fmt_new_msg` and the `substr` extracts the wrong span.
Relevant section of the code (`llama.cpp/common/chat.cpp`, line 320 at 611aa91):

```cpp
ss << fmt_new_msg.substr(fmt_past_msg.size(), fmt_new_msg.size() - fmt_past_msg.size());
```
First Bad Commit
No response
Relevant log output
```cpp
std::string common_chat_format_single(...) {
    [...]
    fmt_past_msg = common_chat_templates_apply(tmpls, inputs).prompt;
    [...]
    inputs.messages.push_back(new_msg);
    [...]
    auto fmt_new_msg = common_chat_templates_apply(tmpls, inputs).prompt;
    // get the diff part
    ss << fmt_new_msg.substr(fmt_past_msg.size(), fmt_new_msg.size() - fmt_past_msg.size());
```