common/gemma4 : handle parsing edge cases #21760
Conversation
|
Made a discussion, but you're already working on a fix. Thanks! This is what I'm hitting quite often: Is it the same issue? |
|
@En3Tho this one might be related to the prompt issue, but I'll add it just in case. |
|
@Dampfinchen it should for the |
It was the second update, I believe. Recently, the third update was released with the latest chat template. Since it's an older quant, I downloaded the new chat template from Google and injected it using --chat-template-file. I am also running an up-to-date llama.cpp build that includes the fixes from PR #21704. The exact llama.cpp command I used:
Please note I have not downloaded and tested this PR yet; I simply thought it was a good idea to share this in case it is one of the edge cases that can be improved. In Hermes Agent it usually seems to work fine, but these issues appeared after I had worked with the agent for a little while. At one point (around 64K of context) I noticed it was stuck, endlessly generating, so I stopped the agent, which led to the printout of the strange generation that has you worried. |
Yeah, I notice the same thing on occasion at high context. Have you tried another agent? Some of them don't return the reasoning traces properly, and it's difficult to test everything out there. I've been running this PR for a bit on multiple runs at high context, and it feels like it handles all the parsing issues I've noticed. There is one particularly nasty failure condition with 26B A4B: it will sometimes try to back out of a tool call, reason again, and then generate a new tool call. I had Claude make this visualization: https://cdn.alde.dev/llama.cpp/examples/gemma4/tool-call-derailment.html This is an intractable problem if we stream tool calls. By the time it reconsiders, we have already sent the tool call deltas to the client. The only real way to address this is to block until completion of the generation and then attempt to parse tool call(s) from the end. |
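For illustration, here is a rough C++ sketch of that "block until completion" approach. The struct and callbacks are hypothetical, not actual llama.cpp code: tool-call text is buffered instead of streamed, and only handed to the parser once generation has finished.

```cpp
#include <functional>
#include <string>

// Hypothetical sketch, not real llama.cpp code: hold back tool-call text
// instead of streaming it, so the model can still "back out" of a tool call
// without the client ever seeing the abandoned deltas.
struct tool_call_buffer {
    std::string pending;            // accumulated (unsent) tool-call text
    bool        in_tool_call = false;

    // Called for every generated chunk. Regular content is streamed right
    // away; anything inside a tool call is buffered instead.
    void on_chunk(const std::string & chunk,
                  const std::function<void(const std::string &)> & stream_content) {
        if (in_tool_call) {
            pending += chunk;
        } else {
            stream_content(chunk);
        }
    }

    // Called once at the end of generation: only now do we commit, handing
    // the complete buffered text to the real tool-call parser.
    std::string take_final_text() {
        std::string text = pending;
        pending.clear();
        in_tool_call = false;
        return text;
    }
};
```

The obvious cost is latency: clients get no tool-call deltas until generation finishes, which is exactly the blocking behavior described above.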
Not yet, Hermes Agent is my only Agent. I, too, have been running this PR for a little while now using Hermes Agent and it appears stable. At one point Hermes printed "empty response after tool calls — using earlier content as final answer", but I didn't find any repeating or leaking like before, so this is a good improvement. |
|
@aldehir Working on this too in https://github.com/emansom/llama.cpp/tree/gemma4-fixes (emansom@d12ebae). Feel free to copy whatever makes sense into your code/branch, and/or tell me what to open as a pull request. |
|
I think we really need a proper state machine implementation for the code to make sense, or we should clean up the current PEG parsing handling so it is more logically structured: a state machine that follows the possible states (and sub-states) defined in the official prompt format docs:
|
@Dampfinchen I had this with 26B in OpenCode |
I'm open to recommendations, but a state machine is nothing more than a DFA which is representable by PEG. There's an argument to be made about parsing at a token level, which is not impossible, but requires some fundamental changes.
llama.cpp is more than just Gemma 4. Keep that in mind and keep the conversation productive. I appreciate the links to your changes. I think I handled most of them here, but we could use more testing around tool call parsing. |
A state machine doesn't help if the model output doesn't align with its own format. Unless you're suggesting masking logits in adherence to the state machine, which is a good idea but out of scope for this PR.
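Purely as a hypothetical sketch of that idea (and still out of scope here), masking could look roughly like this, assuming the format state machine can report which token ids are legal in its current state; none of these names correspond to actual llama.cpp APIs:

```cpp
#include <cmath>
#include <unordered_set>
#include <vector>

// Hypothetical illustration only: forbid every token the format state
// machine does not allow in its current state by pushing its logit to -inf,
// so sampling can never leave the expected format.
void mask_logits_to_state(std::vector<float> & logits,
                          const std::unordered_set<int> & allowed_token_ids) {
    for (int id = 0; id < (int) logits.size(); ++id) {
        if (allowed_token_ids.count(id) == 0) {
            logits[id] = -INFINITY;
        }
    }
}
```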
Their parsing file is 700+ lines of Python vs. 104 lines of C++ here. A majority of the tool call parsers in vLLM are recursive descent parsers stitched together with regex and edge cases to handle streaming. That's not a criticism; it's a straightforward approach. The point is that Gemma 4 tool call parsing was captured here in relatively few lines of code, with streaming support included.
Could you clarify what you're seeing fail? These come in as a single token, so the entire piece is added to the prompt. There's no way to receive two halves in separate generations. That said, the PEG parser handles this well by design: if a literal is only partially matched, it waits for more input before appending to the AST. If you think a different implementation approach would be better for the project overall, I'd suggest opening a discussion to get feedback from the other maintainers. That'd be a better venue for that conversation than this PR. |
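To make that partial-matching behavior concrete, here is a small standalone sketch (not the actual PEG parser code) of how a streaming literal match can report that it needs more input instead of committing anything to the AST:

```cpp
#include <string>

// Illustrative only: a streaming literal match can end three ways. If the
// input runs out in the middle of the literal, the matcher reports "partial"
// and the parser waits for more text instead of committing to the AST.
enum class match_result { fail, partial, success };

match_result match_literal(const std::string & input, size_t pos, const std::string & literal) {
    size_t i = 0;
    while (i < literal.size() && pos + i < input.size()) {
        if (input[pos + i] != literal[i]) {
            return match_result::fail;      // definite mismatch
        }
        ++i;
    }
    // ran out of input before the literal ended -> ask for more data
    return i == literal.size() ? match_result::success : match_result::partial;
}
```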
I'm not familiar enough with that part of the codebase to weigh in. Someone more familiar with the rotary embedding layers might be able to say whether that applies here. |
I was not suggesting masking logits in adherence to a Gemma 4 format state machine; however, now that you bring it up, that may be an interesting experiment indeed.
The number of lines of code is not a determining factor of quality. I wasn't aware their approach relied on regexes; that doesn't seem very robust either. The format parser implementation from Google in LiteRT-LM is a real parser, as far as I can tell. I am bringing it up to suggest comparing notes with that implementation, to ensure the PEG parser handles the Gemma 4 model the same way.
Is that being tested, in a similar way to the LiteRT-LM conversation tests?
I'm not entirely sure. Google intended the Gemma 4 model to be used with stateful parsers (see their LiteRT-LM implementation). If the PEG parser can handle all the niche cases in the same way and produce the same state, then that's a good solution too. |
|
The

There's also a series of partial matching tests for the PEG parser in particular: llama.cpp/tests/peg-parser/test-basic.cpp, lines 214 to 366 (873c825).

Overall, I think we're in a good place with Gemma 4. There are some interesting edge cases to handle, but I believe PEG can address them quite succinctly, especially with the

That said, PEG has some downsides. Error reporting, in particular, can be a challenge. There is a trade-off.

For context on why we moved to this approach: previously, the parsers looked very similar to vLLM's, and grammars for grammar-constrained decoding were created separately, independent of the parsing. Many implementations were incomplete because each model had to support both response format and tool calling. Now we define something that closely represents a grammar, and it builds both the parser and the GBNF grammar. Model support comes much quicker (usually within a day, not a week), and fixes come in as we learn more about how the model behaves in practice versus what's written in documentation.

There's definitely room for improvement, and I have several ideas in the back of my mind. Like I said, it's not perfect but quite good enough (in my opinion). |
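As a rough illustration of that approach (all names invented, not the real API): the idea is that a single grammar-like definition can both drive parsing and be serialized to GBNF, so the parser and the grammar used for constrained decoding cannot drift apart.

```cpp
#include <string>
#include <vector>

// Invented names, only to illustrate the idea described above: one
// grammar-like definition that both drives an incremental parser and can be
// printed as a GBNF grammar for constrained decoding.
struct rule {
    std::string name;
    std::string gbnf_body;   // how the rule prints as GBNF
    // a real implementation would also carry the parsing behavior here
};

struct grammar_def {
    std::vector<rule> rules;

    // Emit a GBNF grammar from the same definition the parser uses, so the
    // two artifacts always agree.
    std::string to_gbnf() const {
        std::string out;
        for (const auto & r : rules) {
            out += r.name + " ::= " + r.gbnf_body + "\n";
        }
        return out;
    }
};
```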
Stumbled on something in the LiteRT-LM source that may be of interest: #21836 |
|
@pwilkin I don't think I have anything more to add at the moment. Note the new |
|
I've been running this since yesterday after seeing issues mentioned at the top, like

I'm running it locally, but I'd suggest a merge ASAP. 👍 |
pwilkin left a comment:
Let's keep it as it is and if we find more testable edge cases, we can do a follow-up PR.
|
@ggml-org/llama-common |
ngxson left a comment:
I haven't reviewed this, but just giving the approval
common/gemma4 : handle parsing edge cases (ggml-org#21760)
This reverts commit e21cdc1.
|
Is there a possible fix for the endless thinking? It happens even without tool calls. While this fixes the parsing, the endless reasoning loops still happen quite often with Gemma 4 26B A4B. |
Try this branch and let me know if it fixes it for you: https://github.com/emansom/llama.cpp/tree/gemma4-fixes |
|
This breaks parsing for chat completion after thinking. When streaming, it outputs the reasoning and then the stream stops, but the server keeps generating tokens that are never streamed. When not streaming, the reasoning text is output as the actual response, but the logs say it generated more tokens. In both cases, I get an error 500 when I ask it to continue (by prefilling the reasoning):

I substituted out the IP and the actual response text, but the logs showed an actual response was generated, just never sent to the frontend. |
|
How did you prefill? It is not supported for reasoning models, and thus not a legitimate form of reproduction unless done correctly. Judging by the

It does seem like llama.cpp was able to parse the reasoning but then failed to parse the content. However, the failure doesn't propagate as an error, and therefore you see generation occurring but no new content. You would need to set

Which model did you use, 26B A4B? |
I use 31B. I forgot reasoning and prefill weren't supposed to be used together, since I fixed that the first day. I'm using ST, and I need chat completion in order to use vision (it doesn't send images in text completion), but in order to use the "continue generation" function (which I use often), prefill is needed. So I added to the additional body parameters to "disable" reasoning, then I prefill it with |
I'll take a look, thanks. It should work, since we recently started parsing from the start of the generation prompt vs. the start of the generation itself. I have an idea that could help; it's just hard to reproduce these errors to verify it works. |
|
I am quite confused by your prompt: are you using a custom template, or are you intentionally injecting Gemma 4 tokens into your conversation? e.g., |
|
I'm using SillyTavern's chat completion, which makes a call to the OpenAI-compatible endpoint at /v1/chat/completions. What I'm sending is JSON that includes a list of system, user, and assistant messages. The only Gemma 4 token I'm intentionally adding is <|channel>; everything else you see is added by llama.cpp when transforming that call into a text completion prompt for the AI. |
|
I created a fix for that problem in #22325. I've tested it, and it works both in my case, using more extensive prefills, and when generating normally with no prefill.



Overview
Fix a few edge cases for Gemma 4 26B A4B. I don't see these artifacts with the 31B variant.
Additional information
Issue 1
If the model generates content + tool call, the template will incorrectly format the prompt without the generation prompt (`<|turn>model\n`):

Causing 26B to produce a broken thinking sequence:

Instead of

This is fixed by adding the generation prompt if not present and the prompt ends with `<turn|>\n`.
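A minimal sketch of that check, using the token strings quoted above (the helper itself is hypothetical, not the actual implementation):

```cpp
#include <string>

// Illustrative sketch of the fix described above. Token strings follow the
// ones quoted in this description; the helper name is hypothetical.
static bool ends_with(const std::string & s, const std::string & suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

void ensure_generation_prompt(std::string & prompt) {
    const std::string end_of_turn = "<turn|>\n";
    const std::string gen_prompt  = "<|turn>model\n";

    // If the template stopped right after the end-of-turn marker, the
    // generation prompt was never added, so append it here.
    if (ends_with(prompt, end_of_turn)) {
        prompt += gen_prompt;
    }
}
```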
Issue 2

Occasionally 26B will emit a trailing `<channel|>`, particularly when it does not reason but produces a content message before a tool call:

Fixed by scanning until `<channel|>`, then consuming until `<|tool_call>` or the end.
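One way to read that scan, as a small illustrative sketch with the token strings from above and a hypothetical helper name:

```cpp
#include <string>

// Illustrative sketch only, not the actual parser code: find a trailing
// "<channel|>" and drop everything from it up to the next "<|tool_call>",
// or to the end of the text if no tool call follows.
std::string drop_trailing_channel(const std::string & text) {
    const size_t channel = text.rfind("<channel|>");
    if (channel == std::string::npos) {
        return text;                          // nothing to clean up
    }
    const size_t tool_call = text.find("<|tool_call>", channel);
    if (tool_call == std::string::npos) {
        return text.substr(0, channel);       // consume until the end
    }
    return text.substr(0, channel) + text.substr(tool_call);
}
```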
Issue 3

At the start of the generation, 26B may emit multiple `<|channel>` tokens.

Unsure if this is related to the bad prompt above, but it's easy enough to handle by consuming all `<|channel>` tokens that do not precede `thought`.

Requirements