[Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility #45257
lucianommartins wants to merge 2 commits into huggingface:main from
Conversation
[Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility

- Chat Template: Added a handler for OpenAI-standard `role: "tool"` messages to render them inline as `<|tool_response>` blocks without initiating a new `<|turn>` block.
- Chat Template: Extended the turn-close condition to inhibit `<turn|>` emission when the model has pending `tool_calls` without corresponding responses, preserving the continuous turn structure.
- Generation Config: Updated the `eos_token_id` derivation in `convert_gemma4_weights.py` to prioritize the terminal `<tool_call|>` token over the starting `<|tool_response>` token, resolving post-call generation hallucinations in Hugging Face inference.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
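For context, a minimal sketch of the OpenAI Chat Completions message shape the first change targets; the function name, arguments, and IDs below are illustrative only, not taken from the PR:

```python
# Illustrative OpenAI-style messages; the tool details are made up for the
# example and do not come from the PR.
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }],
    },
    # With the patched template, this message renders inline as a
    # <|tool_response> block instead of initiating a new <|turn> block.
    {"role": "tool", "tool_call_id": "call_1", "content": '{"temp_c": 21}'},
]
```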
Chat template patcher (`_patch_template_for_openai_tool_role`):
- Inject a `format_tool_response_block` macro after `strip_thinking` to DRY up tool-response rendering (used by both the legacy and OpenAI paths)
- Replace the entire message loop instead of two point patches:
  * Skip `role: 'tool'` messages in the outer loop; render them proactively via forward-scan from the preceding assistant message
  * Suppress duplicate `<|turn>model` on consecutive assistant messages separated only by tool messages (multi-round tool-call loops)
  * Resolve `tool_call_id` back to the function name from the originating `tool_calls` array (prevents the `response: unknown` fallback)
  * Handle tool response `content` as both plain strings and OpenAI content-parts arrays (`[{type: 'text', text: '...'}]`)
  * Render `reasoning`/`reasoning_content` fields as `<|channel>thought` blocks (supports both vLLM and older inference-server variants)
- Preserve legacy `tool_responses` on assistant messages (Gemma native)
- Pre-scan `loop_messages` for `last_user_idx` to guard reasoning injection

Stop tokens (`eos_token_id`):
- Remove `<tool_call|>` (`etc_token`) from the stop token list
- Keep only `<eos>` + `<turn|>` (`eot_token`)
- Enables parallel tool calls without premature truncation after the first `<tool_call|>`; `<turn|>` still terminates the model turn correctly

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
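The Jinja patch itself is not shown in this thread, so here is a short Python sketch of the loop semantics the bullets above describe (skip `role: 'tool'` in the outer loop, forward-scan rendering, `tool_call_id` resolution, content-parts handling). The token strings follow the commit message; everything else, including the duplicate-turn suppression being omitted, is an assumption for illustration:

```python
# Python sketch of the patched loop semantics; not the actual Jinja template.
def render_tool_responses(messages, i):
    """Forward-scan from assistant message i and render its tool results."""
    # Map tool_call_id -> function name: the fix for the `response: unknown`
    # fallback described above.
    names = {c["id"]: c["function"]["name"]
             for c in messages[i].get("tool_calls", [])}
    out = []
    for msg in messages[i + 1:]:
        if msg.get("role") != "tool":
            break  # tool results for this turn are contiguous
        content = msg.get("content", "")
        if isinstance(content, list):  # OpenAI content-parts arrays
            content = "".join(p["text"] for p in content
                              if p.get("type") == "text")
        name = names.get(msg.get("tool_call_id"), "unknown")
        out.append(f"<|tool_response>{name}\n{content}\n")
    return "".join(out)

def render(messages):
    out = []
    for i, msg in enumerate(messages):
        if msg.get("role") == "tool":
            continue  # rendered proactively from the preceding assistant turn
        role = "model" if msg["role"] == "assistant" else msg["role"]
        out.append(f"<|turn>{role}\n{msg.get('content') or ''}\n")
        if msg.get("tool_calls"):
            out.append(render_tool_responses(messages, i))
        out.append("<turn|>\n")  # duplicate-turn suppression omitted for brevity
    return "".join(out)
```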
[For maintainers] Suggested jobs to run (before merge): `run-slow: gemma4`
cc @Rocketknight1 for chat templates and tool calling, but this seems to be only the conversion. Probably the latest changes you made before release didn't make it to the conversion script 😅
The template updates seem good, but I think it'd make more sense and be easier to review if we just embedded the entire file in this script, rather than all the replacements, since I think the size of the replacements approaches the size of the original file! That would mean we're not dependent on the files in the model repos too.
Also linking an additional bug report in the template: #45331
@lucianommartins and I pushed chat template updates directly to the Hub yesterday, so I think these changes are no longer needed and we can probably close this. |
[Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility
What does this PR do?
Rewrites the `_patch_template_for_openai_tool_role()` function in `convert_gemma4_weights.py` to fully support OpenAI Chat Completions tool-calling semantics for Gemma4 (E4B and 31B).

Chat template patcher

- `role: "tool"` messages are skipped in the outer loop and rendered proactively as `<|tool_response>` blocks from the preceding assistant turn that issued the `tool_calls`
- Suppresses duplicate `<|turn>model` when consecutive assistant messages are separated only by tool messages (multi-round tool-call loops)
- `tool_call_id` resolution: matches tool results back to the originating `tool_calls` array by ID to resolve function names correctly (prevents `response: unknown`)
- Handles `content` as both plain strings and OpenAI content-parts arrays (`[{type: "text", text: "..."}]`)
- `format_tool_response_block` macro: injects a reusable macro to centralize tool-response rendering (used by both the legacy Gemma-native `tool_responses` path and the OpenAI-style `role: "tool"` path)
- `reasoning`/`reasoning_content` support: renders thinking fields as `<|channel>thought` blocks (compatible with vLLM, DeepSeek, and o1-style inference servers)
- Preserves legacy `tool_responses` on assistant messages (Google/Gemma format)

Stop tokens (`eos_token_id`)

- Removes `<tool_call|>` (`etc_token`) from the stop token list
- Keeps only `<eos>` + `<turn|>` (`eot_token`)
- Enables parallel tool calls without premature truncation after the first `<tool_call|>`; `<turn|>` still terminates the model turn correctly
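A hedged end-to-end usage sketch: the repo id is a placeholder, the message contents are invented, and the exact rendered string depends on whatever the patched template produces:

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute a real Gemma4 checkpoint.
tokenizer = AutoTokenizer.from_pretrained("google/gemma4-e4b")

# Invented OpenAI-style conversation ending in a role:"tool" result.
messages = [
    {"role": "user", "content": "ping"},
    {"role": "assistant", "tool_calls": [{
        "id": "c1", "type": "function",
        "function": {"name": "ping_tool", "arguments": "{}"},
    }]},
    {"role": "tool", "tool_call_id": "c1", "content": "pong"},
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Expected with the patched template: the role:"tool" entry appears as an
# inline <|tool_response> block, and no premature <turn|> is emitted while
# tool_calls are still pending.
print(prompt)
```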
Testing

Validated with 17 functional test scenarios across both the E4B and 31B templates:

- No duplicate `<|turn>model` emitted
- Legacy Gemma-native `tool_responses`, `tool_call_id` resolution, content-parts arrays
- `reasoning`/`reasoning_content` field rendering
- `add_generation_prompt` correctness, Jinja2 syntax validation (see the sketch below)

Before submitting
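As a minimal sketch of the last check (Jinja2 syntax validation), assuming plain `jinja2` is available; `patched_template` is a hypothetical variable name:

```python
import jinja2

def assert_template_parses(template_source: str) -> None:
    # Raises jinja2.TemplateSyntaxError if the patched template is malformed.
    jinja2.Environment().parse(template_source)

# assert_template_parses(patched_template)  # `patched_template` is hypothetical
```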
Who can review?
Models:
Library: