When using the newly implemented Gemma 4 (peg-gemma4) chat format in llama-server, the model enters an infinite repetition loop during tool-calling. The server appears to re-parse the entire model response turn for every single token generated, and the model continuously repeats the same tool call without ever reaching an EOT (End of Turn) or EOS.
Environment
Build: llama.cpp (full-vulkan)
WARNING: radv is not a conformant Vulkan implementation, testing use only.
load_backend: loaded Vulkan backend from /app/libggml-vulkan.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
version: 8643 (f49e917)
built with GNU 15.2.0 for Linux x86_64
Server: llama-server
Model: gemma-4-26B-A4B-it-UD-Q4_K_M.gguf
Actual Behavior
The model generates a tool call, but instead of finishing the turn, it repeats the same call indefinitely. The logs (see the log output below) show the PEG parser running a full re-parse of the accumulated turn for every generated token.
The server keeps streaming chunks for the same tool-call index (e.g., index: 83), and the model never exits the <|tool_call|> block.
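For context on why the per-token re-parse matters: if the parser re-reads the entire accumulated turn after each new token, total parsing work grows quadratically with turn length. A minimal sketch of the cost model (not llama.cpp code, just an illustration):

```python
# Sketch of the cost model, not actual llama.cpp code.

def full_reparse_work(n_tokens: int) -> int:
    # token i triggers a parse over all i tokens emitted so far
    return sum(i for i in range(1, n_tokens + 1))

def incremental_work(n_tokens: int) -> int:
    # an incremental parser consumes each token exactly once
    return n_tokens

print(full_reparse_work(1000))  # 500500 token-parses
print(incremental_work(1000))   # 1000 token-parses
```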
Expected Behavior
The model should generate the tool call once, close the tags correctly (e.g., <tool_call|>), and stop generating to wait for the tool output.
Operating systems
Linux
GGML backends
Vulkan
Hardware
E5-2640v4 on Supermicro X10DRi with ReBAR BIOS Patch applied (32G BAR available)
Radeon AI Pro R9700 32G
Models
gemma-4-26B-A4B-it-UD-Q4_K_M.gguf
Problem description & steps to reproduce
Steps to Reproduce
1. Run llama-server with a Gemma 4 model using the default auto-detected peg-gemma4 format.
2. Provide a system prompt containing tool definitions (e.g., Home Assistant Assist tools).
3. Issue a prompt that triggers a tool call (e.g., "What is the outside temperature?").
4. Observe the server logs and the client response.
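A request along these lines reproduces it via llama-server's OpenAI-compatible /v1/chat/completions endpoint. The tool schema here is a hypothetical stand-in for the Home Assistant Assist tools; only the tool name assist__GetLiveContext is taken from the logs:

```python
# Hypothetical reproduction payload; the tool definition is a stand-in
# for the Home Assistant Assist tools mentioned in the steps above.
import json

payload = {
    "model": "gemma-4-26B-A4B-it-UD-Q4_K_M.gguf",
    "stream": True,
    "messages": [
        {"role": "user", "content": "What is the outside temperature?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "assist__GetLiveContext",
                "description": "Fetch live state from Home Assistant.",
                "parameters": {"type": "object", "properties": {}},
            },
        }
    ],
}

print(json.dumps(payload, indent=2))
```

POST the printed JSON to the running server (e.g., http://localhost:8080/v1/chat/completions) and watch the streamed chunks repeat the same tool-call index.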
First Bad Commit
No response
Relevant log output
[llama-cpp-vulkan] Parsing PEG input with format peg-gemma4: <|turn>model
[llama-cpp-vulkan] <|channel>thought
[llama-cpp-vulkan] <channel|><|tool_call>call:assist__GetLiveContext{}...
[llama-cpp-vulkan] slot process_toke: id 0 | next token: 70940 'assist'
[llama-cpp-vulkan] Parsing PEG input with format peg-gemma4: <|turn>model
[llama-cpp-vulkan] <|channel>thought
[llama-cpp-vulkan] <channel|><|tool_call>call:assist__GetLiveContext{}...
[llama-cpp-vulkan] slot process_toke: id 0 | next token: 1269 '__'
[llama-cpp-vulkan] Parsing PEG input with format peg-gemma4: <|turn>model
...