test(parsers): Top-N models to have extra CASE.6+ coverage (case3) #9035

Merged
ayushag-nv merged 1 commit into main from keivenchang/DIS-1842__glm-minimax-nemotron-coverage
May 4, 2026

Conversation

@keivenchang

@keivenchang keivenchang commented May 1, 2026

Overview:

Part 3 of the DIS-1842 parser-coverage work, continuing #8888 (CASE.1-5 silent-drop recovery) into the next set of cases. While looking at the coverage chart, I saw GLM 5.1, MiniMax 2.7, Qwen 3.5, Nemotron, and gpt-oss had a bunch of nc cells in CASE.6/11/12/14/15 — basic per-parser coverage gaps the chart was tracking but nothing had pinned yet.

Note

Test-only — NO parser logic change. All 39 new tests assert the existing parser behavior at the per-parser surface; no .rs file under lib/parsers/src/tool_calling/*/ had its non-test code touched.

Coverage chart — before → after

Legend (3 states + qualifiers):

  • ✓ — covered + passing (test exists and asserts correct parser output).
  • x — covered + failing (test exists but pins broken behavior — _silent_drop / _loses_X style with TODO(CASE.N) — BUG, NEEDS FIX:).
  • nc — no coverage.
  • ~ — partial / via shared parser only.
  • N/A — not applicable to this parser family.

Cells shown with an arrow (for example nc → ✓) flipped in this PR.

Case GLM 5.1 MiniMax 2.7 Qwen 3.5 Nemotron gpt-oss Kimi K2.6 DSv4
CASE.1 (single call)
CASE.2 (multiple calls)
CASE.3 (no tool call)
CASE.4 (malformed JSON args) x x
CASE.5 (missing end-token)
CASE.6 (empty args) nc → ✓ nc → ✓ ~ → ✓ nc → ✓
CASE.7 (complex arg types)
CASE.8 (streaming) ~ ~ ~ ~
CASE.9 (reasoning + tool) ~ ~ ~
CASE.10 (reasoning only)
CASE.11 (tool_choice) nc → ✓ nc → ✓ ~ → ✓ nc → ✓ nc → ✓
CASE.12 (finish_reason) nc → ✓ nc → ✓ nc → ✓ nc → ✓
CASE.13 (text interleaved)
CASE.14 (empty content) ~ → ✓ nc → ✓ nc → ✓ nc → ✓ nc → ✓
CASE.15 (duplicate calls) nc → ✓ nc → ✓ nc → ✓ nc → ✓ nc → ✓
CASE.xml1 (XML entities) N/A N/A N/A N/A
CASE.xml2 (schema-aware coercion) N/A N/A N/A N/A
CASE.harmony1 (channel parsing) N/A N/A N/A N/A N/A N/A

Kimi K2.6 and DSv4 columns shown for context — covered by #8946 (and earlier PRs); this PR doesn't change them.

23 cells flipped in CASE.6/11/12/14/15. Remaining gaps (CASE.8 streaming, CASE.9 reasoning + tool) are out of scope for this PR — separate work-items.

Details:

I reproduced the gap by running cargo test -p dynamo-parsers --lib against main and listing the parsers with no dedicated CASE.6/11/12/14/15 tests. Five came up short: glm47, minimax_m2, qwen3_coder, nemotron_deci, harmony. I mirrored the kimi_k2 + dsv4 pattern from #8946 across them and found that the parsers handle these cases mostly fine: every assertion lined up cleanly except harmony's whitespace handling, which passes input verbatim through normal_text instead of trimming like the XML/JSON parsers do, so I pinned that distinction explicitly. The change is test-only, with no parser code change. After re-running, the full suites are green: 498 parser-lib tests pass (was 479), 36 tool_choice tests pass (was 16), and cargo clippy is clean.
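To make the pinned whitespace distinction concrete, here is a minimal sketch of the two contracts. Both functions are toy stand-ins written for this illustration, not the dynamo-parsers API: one models the trimming behavior attributed to the XML/JSON parsers (whitespace-only input becomes `Some("")`), the other models harmony passing the input through normal_text verbatim.

```rust
/// Toy stand-in for the XML/JSON parsers' described behavior:
/// surrounding whitespace is trimmed, so whitespace-only input becomes `Some("")`.
fn trimming_normal_text(input: &str) -> Option<String> {
    Some(input.trim().to_string())
}

/// Toy stand-in for harmony's described behavior:
/// the input is returned verbatim, whitespace included.
fn verbatim_normal_text(input: &str) -> Option<String> {
    Some(input.to_string())
}

fn main() {
    for input in ["", "   ", "\n\t "] {
        // Trimming contract: every empty or whitespace-only input collapses to "".
        assert_eq!(trimming_normal_text(input), Some(String::new()));
        // Verbatim contract: the exact bytes come back unchanged.
        assert_eq!(verbatim_normal_text(input).as_deref(), Some(input));
    }
    println!("both whitespace contracts hold");
}
```

The two contracts only diverge on non-empty whitespace, which is why the tests pin that input class separately for harmony.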

Where should the reviewer start?

lib/parsers/src/tool_calling/xml/glm47_parser.rs (cleanest example of the per-parser surface pattern), then lib/llm/tests/tool_choice.rs for the CASE.11 integration pattern.
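For orientation, the CASE.11 integration pattern amounts to a small matrix: the commit message counts four tool_choice scenarios (auto, required-pinning, named-correct, named-wrong) for each of the five parsers, 20 tests in total. A minimal sketch of that enumeration follows; the `case11_matrix` function and the `parser/scenario` labels are illustrative assumptions, not names taken from lib/llm/tests/tool_choice.rs.

```rust
// The five parsers and four scenarios listed in this PR's commit message.
const PARSERS: [&str; 5] = ["glm47", "minimax_m2", "qwen3_coder", "nemotron_deci", "harmony"];
const SCENARIOS: [&str; 4] = ["auto", "required-pinning", "named-correct", "named-wrong"];

/// Hypothetical helper: enumerate every (parser, scenario) pair as a label.
fn case11_matrix() -> Vec<String> {
    let mut cells = Vec::new();
    for parser in PARSERS {
        for scenario in SCENARIOS {
            cells.push(format!("{}/{}", parser, scenario));
        }
    }
    cells
}

fn main() {
    let cells = case11_matrix();
    // 4 scenarios per parser x 5 parsers.
    assert_eq!(cells.len(), 20);
    for cell in &cells {
        println!("{}", cell);
    }
}
```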

Related Issues:

DIS-1842

/coderabbit profile chill



Summary by CodeRabbit

  • Tests
    • Expanded test coverage for tool-calling parsers across multiple LLM formats, including additional scenarios for empty arguments, input edge cases, and duplicate function calls.

@keivenchang keivenchang requested a review from a team as a code owner May 1, 2026 22:17
@keivenchang keivenchang self-assigned this May 1, 2026
@copy-pr-bot

copy-pr-bot Bot commented May 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 1, 2026

Walkthrough

Adds comprehensive test coverage for tool-calling functionality across multiple LLM parser implementations (glm47, minimax_m2, qwen3_coder, nemotron_deci, harmony). Includes integration-level test matrices validating tool_choice modes (auto, required, named) and parser-specific unit tests covering edge cases such as empty arguments, whitespace inputs, duplicate function calls, and finish reason invariance.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Integration-level tool_choice tests: lib/llm/tests/tool_choice.rs | Adds new test matrix covering auto, required, named with correct/incorrect tools for additional model formats, each with model-specific payload constants for tool-call envelope testing. |
| XML parser tests: lib/parsers/src/tool_calling/xml/glm47_parser.rs, lib/parsers/src/tool_calling/xml/parser.rs | Adds finish reason invariance assertions, empty/whitespace input handling, and duplicate function call extraction tests; introduces minimax_m2_config() builder for qwen3_coder and minimax_m2 model testing. |
| JSON parser tests: lib/parsers/src/tool_calling/json/mod.rs | Extends coverage with nemotron_deci-specific JsonParserConfig for <TOOLCALL> delimiters, validating empty object preservation, whitespace collapsing, and duplicate call extraction with distinct IDs. |
| Harmony parser tests: lib/parsers/src/tool_calling/harmony/harmony_parser.rs | Adds four new test cases verifying finish reason independence, no-arg call handling, empty input filtering, and duplicate commentary block parsing. |
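The duplicate-call rows above (CASE.15) pin that two calls to the same function surface as two separate entries with distinct IDs rather than being merged. A minimal sketch of that contract follows. The `ToolCall` struct and `assign_ids` helper are hypothetical stand-ins, not the dynamo-parsers types, and the sequential `call-N` ID format is an assumption borrowed from the later pythonic commit referenced on this PR.

```rust
/// Hypothetical stand-in for a parsed tool call.
#[derive(Debug, Clone, PartialEq)]
struct ToolCall {
    id: String,
    name: String,
    arguments: String,
}

/// Assign sequential ids ("call-1", "call-2", ...) in input order,
/// keeping duplicates of the same function name as separate calls.
fn assign_ids(calls: &[(&str, &str)]) -> Vec<ToolCall> {
    calls
        .iter()
        .enumerate()
        .map(|(i, (name, args))| ToolCall {
            id: format!("call-{}", i + 1),
            name: name.to_string(),
            arguments: args.to_string(),
        })
        .collect()
}

fn main() {
    let calls = assign_ids(&[
        ("get_weather", r#"{"city":"NYC"}"#),
        ("get_weather", r#"{"city":"LA"}"#),
    ]);
    // Same name twice, but two entries with different ids and arguments.
    assert_eq!(calls.len(), 2);
    assert_eq!(calls[0].name, calls[1].name);
    assert_ne!(calls[0].id, calls[1].id);
    assert_ne!(calls[0].arguments, calls[1].arguments);
    println!("{:?}", calls);
}
```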

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title references 'Top-N models' and 'CASE.6+' but the PR actually adds coverage for CASE.6, CASE.11, CASE.12, CASE.14, and CASE.15 across five specific parsers, which is not clearly conveyed by the abbreviated title. | Clarify the title to be more specific: consider 'test(parsers): Add corner-case coverage (CASE 6/11/12/14/15) for five parsers' or similar to accurately reflect the scope of cases and models covered. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description comprehensively covers all required template sections with detailed context, coverage tracking, and clear guidance for reviewers. |


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
lib/llm/tests/tool_choice.rs (1)

1173-1241: 💤 Low value

LGTM — harmony CASE.11 block.

Minor note: the five TODO(CASE.11) comments across this file (lines 902, 973, 1047, 1120, 1193) reference "cross-parser tool_choice parametrisation work-item (tracked separately)" but don't include a trackable GitHub issue number. Linking them to a concrete issue (e.g. #NNNN) would prevent these from becoming orphaned once the current PR context is gone. Not blocking, but worth addressing before the TODO count accumulates.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/tests/tool_choice.rs` around lines 1173 - 1241, Update the
TODO(CASE.11) comment annotations so they include a concrete, trackable GitHub
issue number (e.g. change TODO(CASE.11) to TODO(CASE.11, `#1234`)) to avoid
orphaned TODOs; specifically edit the TODO comments near the harmony test block
(the comment immediately above
test_harmony_tool_choice_required_pins_current_behavior and the four other
TODO(CASE.11) instances referenced in this file) so each TODO contains the same
issue reference number and a brief one-line linkable note (no behavioral code
changes).

📥 Commits

Reviewing files that changed from the base of the PR and between eb91792 and a8999a7.

📒 Files selected for processing (5)
  • lib/llm/tests/tool_choice.rs
  • lib/parsers/src/tool_calling/harmony/harmony_parser.rs
  • lib/parsers/src/tool_calling/json/mod.rs
  • lib/parsers/src/tool_calling/xml/glm47_parser.rs
  • lib/parsers/src/tool_calling/xml/parser.rs

@keivenchang keivenchang changed the title test(parsers): top-7 corner-case coverage (CASE.6/11/12/14/15) test(parsers): GLM/MiniMax/Qwen/Nemotron/gpt-oss CASE.6+ coverage May 1, 2026
@keivenchang keivenchang changed the title test(parsers): GLM/MiniMax/Qwen/Nemotron/gpt-oss CASE.6+ coverage test(parsers): part 3 — GLM/MiniMax/Qwen/Nemotron/gpt-oss CASE.6+ May 1, 2026
@rmccorm4

rmccorm4 commented May 1, 2026

/ok to test a8999a7

@keivenchang keivenchang force-pushed the keivenchang/DIS-1842__glm-minimax-nemotron-coverage branch from a8999a7 to 885c4b1 Compare May 2, 2026 01:50
@keivenchang keivenchang force-pushed the keivenchang/DIS-1842__glm-minimax-nemotron-coverage branch from 885c4b1 to 6e23ce1 Compare May 2, 2026 15:19
@keivenchang keivenchang force-pushed the keivenchang/DIS-1842__glm-minimax-nemotron-coverage branch from 6e23ce1 to b53e8c8 Compare May 3, 2026 05:23
Adds 39 tests across 5 files filling CASE.6/11/12/14/15 gaps for glm47,
minimax_m2, qwen3_coder, nemotron_deci, and harmony — mirrors the kimi_k2
+ dsv4 pattern from #8946. Test-only, no parser code change.

Per-parser surface tests (CASE.6 empty-args, CASE.12 finish_reason
byte-stability, CASE.14 empty/whitespace inputs, CASE.15 duplicate calls
same name) added to:
- lib/parsers/src/tool_calling/xml/glm47_parser.rs
- lib/parsers/src/tool_calling/xml/parser.rs (qwen3_coder + minimax_m2)
- lib/parsers/src/tool_calling/json/mod.rs (nemotron_deci)
- lib/parsers/src/tool_calling/harmony/harmony_parser.rs

CASE.11 tool_choice integration tests (auto / required-pinning /
named-correct / named-wrong, 4 per parser × 5 parsers = 20) added to:
- lib/llm/tests/tool_choice.rs

Pinned harmony's whitespace handling explicitly: unlike the XML/JSON
parsers (which trim to `Some("")`), harmony passes empty/whitespace
input verbatim through normal_text.

cargo test -p dynamo-parsers --lib: 498 passed (was 479).
cargo test -p dynamo-llm --test tool_choice: 36 passed (was 16).
cargo clippy: clean.

Refs: DIS-1842
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
@keivenchang keivenchang force-pushed the keivenchang/DIS-1842__glm-minimax-nemotron-coverage branch from b53e8c8 to 14816a4 Compare May 3, 2026 15:59
@keivenchang keivenchang changed the title test(parsers): part 3 — GLM/MiniMax/Qwen/Nemotron/gpt-oss CASE.6+ test(parsers): Top-N models to have extra CASE.6+ coverage (case3) May 4, 2026

@ayushag-nv ayushag-nv left a comment


lgtm !

@ayushag-nv ayushag-nv merged commit 8c0b50d into main May 4, 2026
222 of 226 checks passed
@ayushag-nv ayushag-nv deleted the keivenchang/DIS-1842__glm-minimax-nemotron-coverage branch May 4, 2026 16:52
#[tokio::test] // CASE.15 — gpt-oss
async fn test_parse_harmony_duplicate_calls_same_name() {
let text = r#"<|channel|>commentary to=functions.get_weather <|constrain|>json<|message|>{"city":"NYC"}<|call|><|start|>assistant<|channel|>commentary to=functions.get_weather <|constrain|>json<|message|>{"city":"LA"}<|call|>"#;
let (tool_calls, _) = parse_tool_calls_harmony_complete(text, &Default::default(), None)

test_parse_harmony_duplicate_calls_same_name uses the default Harmony config, so the known multi-commentary-block path will not take the regex recovery fallback and the test will fail instead of returning two calls. Fix: pass JsonParserConfig { allow_eof_recovery: true, ..Default::default() } as in the existing multiple-call recovery test.
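The suggested fix relies on Rust's struct update syntax: set the one field explicitly and take every other field from Default. A minimal sketch with a stand-in struct follows. Only the allow_eof_recovery field name comes from the comment above; the `other_setting` field and the derived default of false are assumptions made for illustration, not the real JsonParserConfig definition.

```rust
/// Hypothetical stand-in for the real config type.
/// `other_setting` is a placeholder for whatever other fields exist.
#[derive(Debug, Default, Clone, PartialEq)]
struct JsonParserConfig {
    allow_eof_recovery: bool,
    other_setting: u32,
}

/// Build a config that opts in to recovery and leaves everything else at default.
fn recovery_config() -> JsonParserConfig {
    JsonParserConfig {
        allow_eof_recovery: true,
        ..Default::default()
    }
}

fn main() {
    let default_cfg = JsonParserConfig::default();
    let cfg = recovery_config();
    // In this stand-in, the derived default leaves recovery off.
    assert!(!default_cfg.allow_eof_recovery);
    // The override flips only the named field.
    assert!(cfg.allow_eof_recovery);
    assert_eq!(cfg.other_setting, default_cfg.other_setting);
    println!("{:?}", cfg);
}
```

In the test under review, a value built this way would replace the `&Default::default()` argument, assuming the function accepts a reference to this config type.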

Contributor Author

Oops, didn't mean to merge before addressing this. I'll address it in a follow-up PR.

jthomson04 pushed a commit that referenced this pull request May 4, 2026
…9035)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
zhongdaor-nv added a commit that referenced this pull request May 8, 2026
… family

Pythonic was the only top-N family untouched by the recent coverage PRs
(#8888, #8946, #8846, #9035, #8852). Add three small batch-mode tests
that mirror the top-N quartet pattern landed for harmony/glm47/qwen3/etc.
in #9035, plus the parameterless-call shape from vLLM's pythonic test
file.

- `test_parse_tool_call_parse_pythonic_empty_args` (PARSER.batch.6) —
  `[get_weather()]` returns one call with `arguments={}`. Mirrors
  vLLM `test_pythonic_tool_parser.py::test_tool_call[parameterless_*]`.
- `test_parse_pythonic_empty_and_whitespace_inputs` (PARSER.batch.9) —
  empty / whitespace-only inputs return 0 calls and empty content
  without panicking. Mirrors the #9035 quartet contract.
- `test_parse_pythonic_duplicate_calls_same_name` (PARSER.batch.10) —
  two `get_weather` calls in one list surface with distinct ids
  (`call-1`, `call-2`). Pins the auto-generated-id contract for the
  duplicate-name case.

`cargo test -p dynamo-parsers --lib tool_calling::pythonic` — 19/19 pass.

Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
zhongdaor-nv added a commit that referenced this pull request May 8, 2026
…audit

Output of DIS-1926 (research vLLM parser test coverage gaps). Doc-only
change to `lib/parsers/PARSER_CASES.md`; no source touched. `cargo check
-p dynamo-parsers --tests` passes.

Refinements driven by gaps surfaced during a bidirectional diff against
vLLM `tests/tool_parsers/*` at commit b53c507bc91f87e28b03e9b54bbff7c76e97d58b:

- Split PARSER.fmt.1 (function-name surface) from new PARSER.fmt.5
  (argument-envelope shape: native call-ID preservation, JSON field-order
  tolerance, arguments↔parameters key alias). The old CASE.21 (and an
  earlier draft of PARSER.fmt.1) conflated both axes.
- Broaden PARSER.fmt.3 examples beyond Kimi K2's singular vs plural
  section tokens to include Mistral pre-v11 vs v11+ wire formats, Llama 3
  with vs without `<|python_tag|>`, Hermes `qwen25` registry alias.
- Add `Known production gaps` section flagging Mistral v11+ wire format
  (`[TOOL_CALLS]name{...args}` name-then-object) — Dynamo's
  `ToolCallConfig::mistral()` only handles pre-v11 (JSON-array body),
  while vLLM tests v11 extensively. v11 is the current Mistral-Small /
  Mistral-Large production path. Largest single Dynamo parser gap
  surfaced by the audit.
- Promote regex-timeout / parser-exception containment to Universal Gaps
  (vLLM has explicit `test_regex_timeout_handling` for llama3_json /
  llama4_pythonic / pythonic and `*_streaming_exception_returns_none` for
  Mistral; Dynamo relies on Rust regex linear-time guarantees but does
  not pin failure-containment paths).
- Cross-ref PARSER.batch.1 happy-path → PARSER.fmt.5 native-ID sub-axis.
- Update Applicability summary and `Adding a new parser` minimum viable
  set to cover fmt.{1..5}.

The full per-test bidirectional audit (493 test rows across 36 parser
families, mapped onto the new taxonomy) lives outside this commit. It
informed every refinement above; the audit itself is not committed
because it's a working artifact rather than a stable reference doc.

Top-3 P0 gap status from the audit:

1. Mistral v11 wire format — STILL OPEN (parser doesn't exist; flagged in
   the new `Known production gaps` section).
2. PARSER.stream.{1..4} parser-tier — partial; DSv4 (#8946) and Gemma 4
   (#8852) added coverage; Kimi K2 / Qwen3 / Hermes / Pythonic / Mistral
   parser-tier streaming tests still gap.
3. CASE.25 / FRONTEND.3 (`adjust_request`) — CLOSED for 7 families via 28
   new tests in `lib/llm/tests/tool_choice.rs` (#8946 + #9035).

Coverage PRs since 2026-05-05 baseline: #8888 (silent-drop recoveries),
#8946 (DSv4 + Kimi K2 coverage), #9035 (top-N CASE.6+ quartet), #8852
(Gemma 4 family), #9127 (taxonomy rename).

Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
keivenchang pushed a commit that referenced this pull request May 15, 2026
Companion artifact to PR #9290 (PARSER_CASES.md taxonomy refinement).
Adds the full per-test bidirectional audit that informed every change in
that PR — every vLLM tool-parser test mapped onto the new (PR #9127)
taxonomy with a clickable source link.

`lib/parsers/VLLM_TEST_AUDIT.md` (new file, 906 lines, 493 distinct
test rows):

- **Source**: vLLM `main` at commit b53c507bc91f87e28b03e9b54bbff7c76e97d58b
  (`vllm/tool_parsers/*`, `tests/tool_parsers/*`, `tests/tool_use/*`,
  `tests/entrypoints/openai/tool_parsers/*`).
- **Scope**: 421 explicit test functions + 72 inherited common-suite
  rows from `ToolParserTests`.
- **Bucketing**: every row carries one or more `PARSER_CASES.md` /
  `REASONING_CASES.md` / `PIPELINE_CASES.md` / `FRONTEND_CASES.md`
  tags, plus a one-line behavioral note.

Re-bucketing transformations applied (vs the original CASE.* labels
the audit was first written against, before PR #9127):

- 244 streaming rows split per-row into PARSER.stream.{1,2,3,4}
  (single-call assembly / multi-call assembly / partial-token
  chunking / streaming termination)
- 26 fmt rows split per-row into PARSER.fmt.1 (function-name) vs
  PARSER.fmt.5 (argument-shape: native ID, JSON field-order,
  arguments↔parameters alias)
- Out-of-PARSER-scope buckets relocated to sibling docs:
  CASE.{11,18,25} → FRONTEND.{1,3,5,6}; CASE.12 →
  PIPELINE.finish_reason; CASE.{9,10,17} → REASONING.batch.{1,2};
  CASE.20 → `// helper`; CASE.16 → inline-regression annotation;
  CASE.26 dissolved into PARSER.batch.4 impl-defined recovery
  contract

Two mis-bucketings caught and fixed during review:
- FunctionGemma::test_multiple_tool_calls and
  Gemma4::TestExtractToolCalls.test_multiple_tool_calls were both
  labeled CASE.1 but assert len(tool_calls) == 2 — corrected to
  PARSER.batch.2.

Four bucket-assignment refinements caught by review:
- test_unique_tool_call_ids (DSv3.2) drops fmt.5 (no native call-ID
  surface; just parallel-call distinctness).
- test_invalid_funcall_id_skipped (Kimi K2) moves fmt.5 → fmt.1
  (validation, not preservation).
- 3 Mistral `argument_before_name*` parametrized rows gain fmt.5
  (canonical field-order swap test set referenced by PARSER_CASES.md).

A staleness banner at the top documents the re-bucketing transformation
and mis-bucket fixes for traceability.

Top findings the audit informed (already addressed in PR #9290 or
flagged for follow-up):

1. Mistral v11+ wire format — STILL OPEN (parser doesn't exist;
   flagged in PARSER_CASES.md "Known production gaps").
2. PARSER.stream.{1..4} parser-tier coverage gap in 5 families
   (Kimi K2 / Qwen3 / Hermes / Pythonic / Mistral) — partial
   closure via DSv4 (#8946) and Gemma 4 (#8852).
3. CASE.25 / FRONTEND.3 (`adjust_request`) — CLOSED for 7 families
   via 28 new tests in `lib/llm/tests/tool_choice.rs` (#8946 + #9035).

Signed-off-by: zhongdaor <zhongdaor@nvidia.com>