[serve] Update tool call to switch to parse_response#45485
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
run-slow: cli
This comment contains models: ["cli"]
LysandreJik
left a comment
Nice, looks good! Let's indeed try with some tool-enabled frontends to see how it performs
# 5. Tool calls are parsed after generation completes (not during streaming),
# because the full token sequence is needed for reliable parsing.
I'm not sure this is a satisfactory workaround: it does mean that we'll need to wait for the generation to complete before handling tool calls, right? Let's see how others do it in the ecosystem, but from memory they do on-the-fly tool-call parsing
can be revisited after the PR maybe
I think that if there is one tool call, it shouldn't matter much. Note that usually the generation ends just after the tool_call request, so we have to wait until the end in any case. But if we have multiple tool calls, it might be nice to have on-the-fly tool-call parsing. However, I'm not sure we'd see much speed gain. In any case, this should be quite easy to fix, but I'll check whether it's worth adding the extra logic when testing with a real use case.
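To illustrate the on-the-fly alternative discussed above, here is a minimal sketch of an incremental tool-call parser that scans the growing streamed text and emits each call as soon as its closing tag arrives, instead of waiting for generation to finish. The `<tool_call>` tag format and JSON payload shape are assumptions for illustration; real chat templates vary per model, and this is not the parser used in the PR.

```python
import json
import re

# Assumed tag format for this sketch; actual templates differ per model.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

class IncrementalToolCallParser:
    """Scan a growing text buffer and yield each tool call as soon as
    its closing tag has been streamed, rather than at end of generation."""

    def __init__(self):
        self.buffer = ""
        self.consumed = 0  # index just past the last fully-parsed call

    def feed(self, delta: str):
        """Append newly streamed text; yield any newly completed tool calls."""
        self.buffer += delta
        for match in TOOL_CALL_RE.finditer(self.buffer, self.consumed):
            self.consumed = match.end()
            try:
                yield json.loads(match.group(1))
            except json.JSONDecodeError:
                # Malformed payload: skip it rather than break the stream.
                continue

# Simulate a token stream containing two tool calls split across chunks.
parser = IncrementalToolCallParser()
calls = []
for chunk in [
    '<tool_call>{"name": "get_w',
    'eather", "arguments": {"city": "Paris"}}',
    '</tool_call> text <tool_call>{"name": "get_time", "arguments": {}}</tool_call>',
]:
    calls.extend(parser.feed(chunk))
print([c["name"] for c in calls])  # -> ['get_weather', 'get_time']
```

With a single tool call this buys little, since generation typically ends right after it; with multiple calls, each one becomes available before the stream closes.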
This is on me - the schema parser assumes a complete message. If incremental processing is a hard requirement, I might need to go back and revisit the spec, and how exactly we handle parsing.
Parsing is significantly harder than just chat formatting because of things like this - it might be a sign that we need arbitrary code to do it correctly, which is what other frameworks have, and we can't just have user-authorable schemas embedded in config files. If so, we'd need to maintain a list of message parser functions either in Transformers or some other HF repo!
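The "list of message parser functions" idea above could look something like a small registry mapping a model family to an arbitrary parser callable. This is purely a hypothetical sketch of the design being floated, not existing Transformers code; the registry name, decorator, and stub parser are all invented for illustration.

```python
from typing import Callable

# Hypothetical registry: model family -> parser function. Arbitrary code
# per model, instead of user-authorable schemas embedded in config files.
MESSAGE_PARSERS: dict[str, Callable[[str], dict]] = {}

def register_parser(model_type: str):
    """Decorator registering a parser for a given model family."""
    def decorator(fn: Callable[[str], dict]) -> Callable[[str], dict]:
        MESSAGE_PARSERS[model_type] = fn
        return fn
    return decorator

@register_parser("qwen")
def parse_qwen(text: str) -> dict:
    # A real parser would extract tool-call blocks here; stub for illustration.
    return {"content": text, "tool_calls": []}

parser = MESSAGE_PARSERS["qwen"]
print(parser("hello")["content"])  # -> hello
```

Such a registry could live in Transformers itself or in a dedicated HF repo, as suggested.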
Co-authored-by: Lysandre Debut <hi@lysand.re>
run-slow: cli
This comment contains models: ["cli"]
LysandreJik
left a comment
Alright, let's do it! Thanks @SunMarc
…5485) * fix tool call to support parse response * simplify parser * simplification ! * Update src/transformers/cli/serving/response.py Co-authored-by: Lysandre Debut <hi@lysand.re> --------- Co-authored-by: Lysandre Debut <hi@lysand.re>
What does this PR do?
This PR updates the tool calling support in serve to switch to
parse_response. I've updated Qwen support to use the same template as parse_response so that we only keep one implementation. With this, gemma4 tool calling is supported. I've also made other changes around apply_chat_template: it was causing issues with gemma4, as the content was put after the tool call and the tool response. But I think in general it should be fine to not show the content. We can revisit this if needed, as I'm not entirely sure about this choice. For example, if there is a thinking phase, it could make sense to have it in the context.
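The "not show the content" behavior described above can be sketched as a small helper that drops the free-text content of an assistant message whenever tool calls are present, with an opt-out for cases like a thinking phase. The function name, flag, and message shape are assumptions for illustration, not the actual implementation in this PR.

```python
def format_assistant_message(parsed: dict, keep_content: bool = False) -> dict:
    """Hypothetical sketch: when a parsed response contains tool calls,
    omit the free-text content unless explicitly kept (e.g. to preserve
    a thinking phase in the conversation context)."""
    msg = {"role": "assistant", "tool_calls": parsed.get("tool_calls", [])}
    if keep_content or not msg["tool_calls"]:
        msg["content"] = parsed.get("content", "")
    return msg

parsed = {"content": "thinking...", "tool_calls": [{"name": "get_time"}]}
print(format_assistant_message(parsed))
# -> {'role': 'assistant', 'tool_calls': [{'name': 'get_time'}]}
```

Keeping the content after the tool call was what broke the gemma4 template, which is why dropping it by default is the simpler choice here.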