[Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility #45257
lucianommartins wants to merge 2 commits into huggingface:main from
Conversation
[Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility

- Chat Template: Added a handler for OpenAI-standard `role: "tool"` messages to render them inline as `<|tool_response>` blocks without initiating a new `<|turn>` block.
- Chat Template: Extended the turn-close condition to inhibit `<turn|>` emission when the model has pending `tool_calls` without corresponding responses, preserving the continuous turn structure.
- Generation Config: Updated the `eos_token_id` derivation in `convert_gemma4_weights.py` to prioritize the terminal `<tool_call|>` token over the starting `<|tool_response>` token, resolving post-call generation hallucinations in Hugging Face inference.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
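For context, a minimal sketch of the OpenAI Chat Completions message shape the first change targets; the function name, arguments, and IDs below are illustrative only, not taken from the PR:

```python
# Illustrative OpenAI-style messages; the tool details are made up for the
# example and do not come from the PR.
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }],
    },
    # With the patched template, this message renders inline as a
    # <|tool_response> block instead of initiating a new <|turn> block.
    {"role": "tool", "tool_call_id": "call_1", "content": '{"temp_c": 21}'},
]
```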
Chat template patcher (`_patch_template_for_openai_tool_role`):
- Inject a `format_tool_response_block` macro after `strip_thinking` to DRY up tool-response rendering (used by both the legacy and OpenAI paths)
- Replace the entire message loop instead of two point patches:
  * Skip `role: 'tool'` messages in the outer loop; render them proactively via forward-scan from the preceding assistant message
  * Suppress duplicate `<|turn>model` on consecutive assistant messages separated only by tool messages (multi-round tool-call loops)
  * Resolve `tool_call_id` back to the function name from the originating `tool_calls` array (prevents the `response: unknown` fallback)
  * Handle tool response `content` as both plain strings and OpenAI content-parts arrays (`[{type: 'text', text: '...'}]`)
  * Render `reasoning`/`reasoning_content` fields as `<|channel>thought` blocks (supports both vLLM and older inference-server variants)
- Preserve legacy `tool_responses` on assistant messages (Gemma native)
- Pre-scan `loop_messages` for `last_user_idx` to guard reasoning injection

Stop tokens (`eos_token_id`):
- Remove `<tool_call|>` (`etc_token`) from the stop token list
- Keep only `<eos>` + `<turn|>` (`eot_token`)
- Enables parallel tool calls without premature truncation after the first `<tool_call|>`; `<turn|>` still terminates the model turn correctly

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
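The Jinja patch itself is not shown in this thread, so here is a short Python sketch of the loop semantics the bullets above describe (skip `role: 'tool'` in the outer loop, forward-scan rendering, `tool_call_id` resolution, content-parts handling). The token strings follow the commit message; everything else, including the duplicate-turn suppression being omitted, is an assumption for illustration:

```python
# Python sketch of the patched loop semantics; not the actual Jinja template.
def render_tool_responses(messages, i):
    """Forward-scan from assistant message i and render its tool results."""
    # Map tool_call_id -> function name: the fix for the `response: unknown`
    # fallback described above.
    names = {c["id"]: c["function"]["name"]
             for c in messages[i].get("tool_calls", [])}
    out = []
    for msg in messages[i + 1:]:
        if msg.get("role") != "tool":
            break  # tool results for this turn are contiguous
        content = msg.get("content", "")
        if isinstance(content, list):  # OpenAI content-parts arrays
            content = "".join(p["text"] for p in content
                              if p.get("type") == "text")
        name = names.get(msg.get("tool_call_id"), "unknown")
        out.append(f"<|tool_response>{name}\n{content}\n")
    return "".join(out)

def render(messages):
    out = []
    for i, msg in enumerate(messages):
        if msg.get("role") == "tool":
            continue  # rendered proactively from the preceding assistant turn
        role = "model" if msg["role"] == "assistant" else msg["role"]
        out.append(f"<|turn>{role}\n{msg.get('content') or ''}\n")
        if msg.get("tool_calls"):
            out.append(render_tool_responses(messages, i))
        out.append("<turn|>\n")  # duplicate-turn suppression omitted for brevity
    return "".join(out)
```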
[For maintainers] Suggested jobs to run (before merge): `run-slow: gemma4`
cc @Rocketknight1 for chat templates and tool calling, but this seems to be only the conversion. Probably the latest changes you made before release didn't make it to the conversion script 😅
The template updates seem good, but I think it'd make more sense and be easier to review if we just embedded the entire file in this script, rather than all the replacements, since I think the size of the replacements approaches the size of the original file! That would mean we're not dependent on the files in the model repos too.
Also linking an additional bug report in the template: #45331
@lucianommartins and I pushed chat template updates directly to the Hub yesterday, so I think these changes are no longer needed and we can probably close this. |
[Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility
What does this PR do?
Rewrites the `_patch_template_for_openai_tool_role()` function in `convert_gemma4_weights.py` to fully support OpenAI Chat Completions tool-calling semantics for Gemma4 (E4B and 31B).

Chat template patcher

- `role: "tool"` messages are skipped in the outer loop and rendered proactively as `<|tool_response>` blocks from the preceding assistant turn that issued the `tool_calls`
- Suppresses duplicate `<|turn>model` when consecutive assistant messages are separated only by tool messages (multi-round tool-call loops)
- `tool_call_id` resolution: matches tool results back to the originating `tool_calls` array by ID to resolve function names correctly (prevents `response: unknown`)
- Handles `content` as both plain strings and OpenAI content-parts arrays (`[{type: "text", text: "..."}]`)
- `format_tool_response_block` macro: injects a reusable macro to centralize tool-response rendering (used by both the legacy Gemma-native `tool_responses` path and the OpenAI-style `role: "tool"` path)
- `reasoning`/`reasoning_content` support: renders thinking fields as `<|channel>thought` blocks (compatible with vLLM, DeepSeek, and o1-style inference servers)
- Preserves legacy `tool_responses` on assistant messages (Google/Gemma format)

Stop tokens (`eos_token_id`)

- Removes `<tool_call|>` (`etc_token`) from the stop token list
- Keeps only `<eos>` + `<turn|>` (`eot_token`)
- Enables parallel tool calls without premature truncation after the first `<tool_call|>`; `<turn|>` still terminates the model turn correctly
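A hedged end-to-end usage sketch: the repo id is a placeholder, the message contents are invented, and the exact rendered string depends on whatever the patched template produces:

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute a real Gemma4 checkpoint.
tokenizer = AutoTokenizer.from_pretrained("google/gemma4-e4b")

# Invented OpenAI-style conversation ending in a role:"tool" result.
messages = [
    {"role": "user", "content": "ping"},
    {"role": "assistant", "tool_calls": [{
        "id": "c1", "type": "function",
        "function": {"name": "ping_tool", "arguments": "{}"},
    }]},
    {"role": "tool", "tool_call_id": "c1", "content": "pong"},
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Expected with the patched template: the role:"tool" entry appears as an
# inline <|tool_response> block, and no premature <turn|> is emitted while
# tool_calls are still pending.
print(prompt)
```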
Testing

Validated with 17 functional test scenarios across both the E4B and 31B templates:

- No duplicate `<|turn>model` emitted
- Legacy Gemma-native `tool_responses`, `tool_call_id` resolution, content-parts arrays
- `reasoning`/`reasoning_content` field rendering
- `add_generation_prompt` correctness, Jinja2 syntax validation (see the sketch below)

Before submitting
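As a minimal sketch of the last check (Jinja2 syntax validation), assuming plain `jinja2` is available; `patched_template` is a hypothetical variable name:

```python
import jinja2

def assert_template_parses(template_source: str) -> None:
    # Raises jinja2.TemplateSyntaxError if the patched template is malformed.
    jinja2.Environment().parse(template_source)

# assert_template_parses(patched_template)  # `patched_template` is hypothetical
```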
Who can review?
Models:
Library: