test(parsers): Top-N models to have extra CASE.6+ coverage (case3)#9035
Conversation
WalkthroughAdds comprehensive test coverage for tool-calling functionality across multiple LLM parser implementations (glm47, minimax_m2, qwen3_coder, nemotron_deci, harmony). Includes integration-level test matrices validating Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
lib/llm/tests/tool_choice.rs (1)
1173-1241: 💤 Low valueLGTM — harmony CASE.11 block.
Minor note: the five
TODO(CASE.11)comments across this file (lines 902, 973, 1047, 1120, 1193) reference "cross-parser tool_choice parametrisation work-item (tracked separately)" but don't include a trackable GitHub issue number. Linking them to a concrete issue (e.g.#NNNN) would prevent these from becoming orphaned once the current PR context is gone. Not blocking, but worth addressing before the TODO count accumulates.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/llm/tests/tool_choice.rs` around lines 1173 - 1241, Update the TODO(CASE.11) comment annotations so they include a concrete, trackable GitHub issue number (e.g. change TODO(CASE.11) to TODO(CASE.11, `#1234`)) to avoid orphaned TODOs; specifically edit the TODO comments near the harmony test block (the comment immediately above test_harmony_tool_choice_required_pins_current_behavior and the four other TODO(CASE.11) instances referenced in this file) so each TODO contains the same issue reference number and a brief one-line linkable note (no behavioral code changes).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@lib/llm/tests/tool_choice.rs`:
- Around line 1173-1241: Update the TODO(CASE.11) comment annotations so they
include a concrete, trackable GitHub issue number (e.g. change TODO(CASE.11) to
TODO(CASE.11, `#1234`)) to avoid orphaned TODOs; specifically edit the TODO
comments near the harmony test block (the comment immediately above
test_harmony_tool_choice_required_pins_current_behavior and the four other
TODO(CASE.11) instances referenced in this file) so each TODO contains the same
issue reference number and a brief one-line linkable note (no behavioral code
changes).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d53a68d6-189b-4ee8-aabf-885d143293fc
📒 Files selected for processing (5)
lib/llm/tests/tool_choice.rslib/parsers/src/tool_calling/harmony/harmony_parser.rslib/parsers/src/tool_calling/json/mod.rslib/parsers/src/tool_calling/xml/glm47_parser.rslib/parsers/src/tool_calling/xml/parser.rs
|
/ok to test a8999a7 |
a8999a7 to
885c4b1
Compare
885c4b1 to
6e23ce1
Compare
6e23ce1 to
b53e8c8
Compare
Adds 39 tests across 5 files filling CASE.6/11/12/14/15 gaps for glm47, minimax_m2, qwen3_coder, nemotron_deci, and harmony — mirrors the kimi_k2 + dsv4 pattern from #8946. Test-only, no parser code change. Per-parser surface tests (CASE.6 empty-args, CASE.12 finish_reason byte-stability, CASE.14 empty/whitespace inputs, CASE.15 duplicate calls same name) added to: - lib/parsers/src/tool_calling/xml/glm47_parser.rs - lib/parsers/src/tool_calling/xml/parser.rs (qwen3_coder + minimax_m2) - lib/parsers/src/tool_calling/json/mod.rs (nemotron_deci) - lib/parsers/src/tool_calling/harmony/harmony_parser.rs CASE.11 tool_choice integration tests (auto / required-pinning / named-correct / named-wrong, 4 per parser × 5 parsers = 20) added to: - lib/llm/tests/tool_choice.rs Pinned harmony's whitespace handling explicitly: unlike the XML/JSON parsers (which trim to `Some("")`), harmony passes empty/whitespace input verbatim through normal_text. cargo test -p dynamo-parsers --lib: 498 passed (was 479). cargo test -p dynamo-llm --test tool_choice: 36 passed (was 16). cargo clippy: clean. Refs: DIS-1842 Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
b53e8c8 to
14816a4
Compare
| #[tokio::test] // CASE.15 — gpt-oss | ||
| async fn test_parse_harmony_duplicate_calls_same_name() { | ||
| let text = r#"<|channel|>commentary to=functions.get_weather <|constrain|>json<|message|>{"city":"NYC"}<|call|><|start|>assistant<|channel|>commentary to=functions.get_weather <|constrain|>json<|message|>{"city":"LA"}<|call|>"#; | ||
| let (tool_calls, _) = parse_tool_calls_harmony_complete(text, &Default::default(), None) |
There was a problem hiding this comment.
test_parse_harmony_duplicate_calls_same_name uses the default Harmony config, so the known multi-commentary-block path will not take the regex recovery fallback and the test will fail instead of returning two calls. Fix: pass JsonParserConfig { allow_eof_recovery: true, ..Default::default() } as in the existing multiple-call recovery test.
There was a problem hiding this comment.
oops, didn't mean to merge before address thing. I'll address in in a follow-up PR.
…9035) Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com> Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
… family Pythonic was the only top-N family untouched by the recent coverage PRs (#8888, #8946, #8846, #9035, #8852). Add three small batch-mode tests that mirror the top-N quartet pattern landed for harmony/glm47/qwen3/etc. in #9035, plus the parameterless-call shape from vLLM's pythonic test file. - `test_parse_tool_call_parse_pythonic_empty_args` (PARSER.batch.6) — `[get_weather()]` returns one call with `arguments={}`. Mirrors vLLM `test_pythonic_tool_parser.py::test_tool_call[parameterless_*]`. - `test_parse_pythonic_empty_and_whitespace_inputs` (PARSER.batch.9) — empty / whitespace-only inputs return 0 calls and empty content without panicking. Mirrors the #9035 quartet contract. - `test_parse_pythonic_duplicate_calls_same_name` (PARSER.batch.10) — two `get_weather` calls in one list surface with distinct ids (`call-1`, `call-2`). Pins the auto-generated-id contract for the duplicate-name case. `cargo test -p dynamo-parsers --lib tool_calling::pythonic` — 19/19 pass. Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
…audit Output of DIS-1926 (research vLLM parser test coverage gaps). Doc-only change to `lib/parsers/PARSER_CASES.md`; no source touched. `cargo check -p dynamo-parsers --tests` passes. Refinements driven by gaps surfaced during a bidirectional diff against vLLM `tests/tool_parsers/*` at commit b53c507bc91f87e28b03e9b54bbff7c76e97d58b: - Split PARSER.fmt.1 (function-name surface) from new PARSER.fmt.5 (argument-envelope shape: native call-ID preservation, JSON field-order tolerance, arguments↔parameters key alias). The old CASE.21 (and an earlier draft of PARSER.fmt.1) conflated both axes. - Broaden PARSER.fmt.3 examples beyond Kimi K2's singular vs plural section tokens to include Mistral pre-v11 vs v11+ wire formats, Llama 3 with vs without `<|python_tag|>`, Hermes `qwen25` registry alias. - Add `Known production gaps` section flagging Mistral v11+ wire format (`[TOOL_CALLS]name{...args}` name-then-object) — Dynamo's `ToolCallConfig::mistral()` only handles pre-v11 (JSON-array body), while vLLM tests v11 extensively. v11 is the current Mistral-Small / Mistral-Large production path. Largest single Dynamo parser gap surfaced by the audit. - Promote regex-timeout / parser-exception containment to Universal Gaps (vLLM has explicit `test_regex_timeout_handling` for llama3_json / llama4_pythonic / pythonic and `*_streaming_exception_returns_none` for Mistral; Dynamo relies on Rust regex linear-time guarantees but does not pin failure-containment paths). - Cross-ref PARSER.batch.1 happy-path → PARSER.fmt.5 native-ID sub-axis. - Update Applicability summary and `Adding a new parser` minimum viable set to cover fmt.{1..5}. The full per-test bidirectional audit (493 test rows across 36 parser families, mapped onto the new taxonomy) lives outside this commit. It informed every refinement above; the audit itself is not committed because it's a working artifact rather than a stable reference doc. Top-3 P0 gap status from the audit: 1. Mistral v11 wire format — STILL OPEN (parser doesn't exist; flagged in the new `Known production gaps` section). 2. PARSER.stream.{1..4} parser-tier — partial; DSv4 (#8946) and Gemma 4 (#8852) added coverage; Kimi K2 / Qwen3 / Hermes / Pythonic / Mistral parser-tier streaming tests still gap. 3. CASE.25 / FRONTEND.3 (`adjust_request`) — CLOSED for 7 families via 28 new tests in `lib/llm/tests/tool_choice.rs` (#8946 + #9035). Coverage PRs since 2026-05-05 baseline: #8888 (silent-drop recoveries), #8946 (DSv4 + Kimi K2 coverage), #9035 (top-N CASE.6+ quartet), #8852 (Gemma 4 family), #9127 (taxonomy rename). Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
Companion artifact to PR #9290 (PARSER_CASES.md taxonomy refinement). Adds the full per-test bidirectional audit that informed every change in that PR — every vLLM tool-parser test mapped onto the new (PR #9127) taxonomy with a clickable source link. `lib/parsers/VLLM_TEST_AUDIT.md` (new file, 906 lines, 493 distinct test rows): - **Source**: vLLM `main` at commit b53c507bc91f87e28b03e9b54bbff7c76e97d58b (`vllm/tool_parsers/*`, `tests/tool_parsers/*`, `tests/tool_use/*`, `tests/entrypoints/openai/tool_parsers/*`). - **Scope**: 421 explicit test functions + 72 inherited common-suite rows from `ToolParserTests`. - **Bucketing**: every row carries one or more `PARSER_CASES.md` / `REASONING_CASES.md` / `PIPELINE_CASES.md` / `FRONTEND_CASES.md` tags, plus a one-line behavioral note. Re-bucketing transformations applied (vs the original CASE.* labels the audit was first written against, before PR #9127): - 244 streaming rows split per-row into PARSER.stream.{1,2,3,4} (single-call assembly / multi-call assembly / partial-token chunking / streaming termination) - 26 fmt rows split per-row into PARSER.fmt.1 (function-name) vs PARSER.fmt.5 (argument-shape: native ID, JSON field-order, arguments↔parameters alias) - Out-of-PARSER-scope buckets relocated to sibling docs: CASE.{11,18,25} → FRONTEND.{1,3,5,6}; CASE.12 → PIPELINE.finish_reason; CASE.{9,10,17} → REASONING.batch.{1,2}; CASE.20 → `// helper`; CASE.16 → inline-regression annotation; CASE.26 dissolved into PARSER.batch.4 impl-defined recovery contract Two mis-bucketings caught and fixed during review: - FunctionGemma::test_multiple_tool_calls and Gemma4::TestExtractToolCalls.test_multiple_tool_calls were both labeled CASE.1 but assert len(tool_calls) == 2 — corrected to PARSER.batch.2. Four bucket-assignment refinements caught by review: - test_unique_tool_call_ids (DSv3.2) drops fmt.5 (no native call-ID surface; just parallel-call distinctness). - test_invalid_funcall_id_skipped (Kimi K2) moves fmt.5 → fmt.1 (validation, not preservation). - 3 Mistral `argument_before_name*` parametrized rows gain fmt.5 (canonical field-order swap test set referenced by PARSER_CASES.md). A staleness banner at the top documents the re-bucketing transformation and mis-bucket fixes for traceability. Top findings the audit informed (already addressed in PR #9290 or flagged for follow-up): 1. Mistral v11+ wire format — STILL OPEN (parser doesn't exist; flagged in PARSER_CASES.md "Known production gaps"). 2. PARSER.stream.{1..4} parser-tier coverage gap in 5 families (Kimi K2 / Qwen3 / Hermes / Pythonic / Mistral) — partial closure via DSv4 (#8946) and Gemma 4 (#8852). 3. CASE.25 / FRONTEND.3 (`adjust_request`) — CLOSED for 7 families via 28 new tests in `lib/llm/tests/tool_choice.rs` (#8946 + #9035). Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
Overview:
Part 3 of the DIS-1842 parser-coverage work, continuing #8888 (CASE.1-5 silent-drop recovery) into the next set of cases. While looking at the coverage chart, I saw GLM 5.1, MiniMax 2.7, Qwen 3.5, Nemotron, and gpt-oss had a bunch of
nccells in CASE.6/11/12/14/15 — basic per-parser coverage gaps the chart was tracking but nothing had pinned yet.Note
Test-only — NO parser logic change. All 39 new tests assert the existing parser behavior at the per-parser surface; no
.rsfile underlib/parsers/src/tool_calling/*/had its non-test code touched.Coverage chart — before → after
Legend (3 states + qualifiers):
_silent_drop/_loses_Xstyle withTODO(CASE.N) — BUG, NEEDS FIX:).Cells with
→flipped in this PR.Kimi K2.6 and DSv4 columns shown for context — covered by #8946 (and earlier PRs); this PR doesn't change them.
23 cells flipped in CASE.6/11/12/14/15. Remaining gaps (CASE.8 streaming, CASE.9 reasoning + tool) are out of scope for this PR — separate work-items.
Details:
I was able to repro the gap by running
cargo test -p dynamo-parsers --libagainst main and listing the parsers with no dedicated CASE.6/11/12/14/15 tests — five came up short: glm47, minimax_m2, qwen3_coder, nemotron_deci, harmony. So I decided to mirror the kimi_k2 + dsv4 pattern from #8946 across them, and realized the parsers actually handle these mostly fine — every assertion lined up cleanly except harmony's whitespace handling, which passes input verbatim throughnormal_textinstead of trimming like the XML/JSON parsers do, so I pinned that distinction explicitly. The fix was test-only — no parser code change. I then re-ran and everything came up green. I think this should cover it well. I did run the full suites: 498 parser-lib tests pass (was 479), 36 tool_choice tests pass (was 16),cargo clippyclean.Where should the reviewer start?
lib/parsers/src/tool_calling/xml/glm47_parser.rs(cleanest example of the per-parser surface pattern), thenlib/llm/tests/tool_choice.rsfor the CASE.11 integration pattern.Related Issues:
DIS-1842
/coderabbit profile chill
Summary by CodeRabbit