
docs(parsers): add DIS-1926 vLLM tool-parser test audit #9329

Draft
zhongdaor-nv wants to merge 3 commits into main from zhongdaor/dis-1926-vllm-test-audit-doc

Conversation

@zhongdaor-nv
Contributor

Overview

Companion artifact to #9290 (already merged), which refined the PARSER_CASES.md taxonomy based on this audit. This PR adds the full per-test bidirectional audit that informed every change in #9290.

Marked draft because reviewers may want to argue per-row bucket assignments; the audit is a working reference doc rather than a stable contract.

What's in this PR

lib/parsers/VLLM_TEST_AUDIT.md (new file, 906 lines, 493 distinct test rows):

  • Source: vLLM main at b53c507bc91f87e28b03e9b54bbff7c76e97d58b (vllm/tool_parsers/*, tests/tool_parsers/*, tests/tool_use/*, tests/entrypoints/openai/tool_parsers/*).
  • Scope: 421 explicit test functions + 72 inherited common-suite rows from ToolParserTests.
  • Bucketing: every row carries one or more PARSER_CASES.md / REASONING_CASES.md / PIPELINE_CASES.md / FRONTEND_CASES.md tags, plus a clickable GitHub source link, plus a one-line behavioral note.

Re-bucketing summary (post PR #9127 taxonomy)

The audit was originally written against the old CASE.* labels in lib/parsers/TEST_CASES.md (deleted by #9127). It has since been migrated through mechanical renames plus per-row reclassification:

  • 244 streaming rows split per-row into PARSER.stream.{1,2,3,4} (single-call assembly / multi-call assembly / partial-token chunking / streaming termination)
  • 26 fmt rows split per-row into PARSER.fmt.1 (function-name surface) vs PARSER.fmt.5 (argument-envelope shape: native ID, JSON field-order, arguments↔parameters alias)
  • Out-of-PARSER-scope buckets relocated: CASE.{11,18,25} → FRONTEND.{1,3,5,6}; CASE.12 → PIPELINE.finish_reason; CASE.{9,10,17} → REASONING.batch.{1,2}; CASE.20 → // helper; CASE.16 → inline-regression annotation; CASE.26 dissolved into PARSER.batch.4 impl-defined recovery contract
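
For reviewers who want to reason about the relocations programmatically, the list above can be written down as data. This is only a sketch: `RELOCATIONS`, `STREAM_BUCKETS`, and `candidate_buckets` are hypothetical names, not part of the repo, and the description gives group-to-group mappings only, so no per-label pairing is implied.

```python
# Hypothetical encoding of the re-bucketing summary. Each entry maps a group
# of old CASE.* labels to the group of destinations they were relocated to.
# The PR description does not say which old label lands on which new bucket
# within a group, so the lookup returns the whole candidate group.
RELOCATIONS = [
    (("CASE.11", "CASE.18", "CASE.25"),
     ("FRONTEND.1", "FRONTEND.3", "FRONTEND.5", "FRONTEND.6")),
    (("CASE.12",), ("PIPELINE.finish_reason",)),
    (("CASE.9", "CASE.10", "CASE.17"),
     ("REASONING.batch.1", "REASONING.batch.2")),
    (("CASE.20",), ("// helper",)),
    (("CASE.16",), ("inline-regression annotation",)),
    (("CASE.26",), ("PARSER.batch.4",)),
]

# The four streaming sub-buckets, with the glosses used in the summary.
STREAM_BUCKETS = {
    "PARSER.stream.1": "single-call assembly",
    "PARSER.stream.2": "multi-call assembly",
    "PARSER.stream.3": "partial-token chunking",
    "PARSER.stream.4": "streaming termination",
}


def candidate_buckets(old_label: str) -> tuple[str, ...]:
    """Return the destination group for an old label, or () if not relocated."""
    for old_group, new_group in RELOCATIONS:
        if old_label in old_group:
            return new_group
    return ()
```

Labels that were split per row (the streaming and fmt rows) are deliberately absent from `RELOCATIONS`, since their new bucket depends on what each individual test asserts.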

Mis-bucket fixes caught by review

  • FunctionGemma::test_multiple_tool_calls and Gemma4::TestExtractToolCalls.test_multiple_tool_calls were both labeled CASE.1 but assert len(tool_calls) == 2 — corrected to PARSER.batch.2.
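
The signal behind this fix (a test asserting `len(tool_calls) == 2`) can be scanned for mechanically. The following is a rough heuristic sketch, not the method used in the audit; `asserted_call_counts` and `looks_multi_call` are hypothetical helpers, and the regex only covers the plain `assert len(...tool_calls) == N` form.

```python
import re

# Matches e.g. `assert len(tool_calls) == 2` or
# `assert len(result.tool_calls) == 2`. Other assertion styles are not covered.
_LEN_ASSERT = re.compile(
    r"assert\s+len\(\s*(?:\w+\.)*tool_calls\s*\)\s*==\s*(\d+)"
)


def asserted_call_counts(test_source: str) -> list[int]:
    """Return every tool-call count the test body asserts on."""
    return [int(n) for n in _LEN_ASSERT.findall(test_source)]


def looks_multi_call(test_source: str) -> bool:
    """True if the test asserts two or more tool calls in one output."""
    return any(n >= 2 for n in asserted_call_counts(test_source))
```

A row whose test body satisfies `looks_multi_call` but carries a single-call label would be a candidate for manual review; the final bucket still needs a human read of the test.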

Bucket-assignment refinements caught by review

  • test_unique_tool_call_ids (DSv3.2): drops fmt.5 (no native call-ID surface)
  • test_invalid_funcall_id_skipped (Kimi K2): fmt.5 → fmt.1 (validation, not preservation)
  • 3 Mistral argument_before_name* parametrized rows: added missing fmt.5 tag

A staleness banner at the top documents the re-bucketing transformation and mis-bucket fixes for traceability.

Top findings (already in PR #9290 or flagged for follow-up)

  1. Mistral v11+ wire format — STILL OPEN (parser doesn't exist; flagged in PARSER_CASES.md "Known production gaps")
  2. PARSER.stream.{1..4} parser-tier in Kimi K2 / Qwen3 / Hermes / Pythonic / Mistral — partial closure via DSv4 (test(parsers): DIS-1842 — DSv4 + Kimi K2 unit-test coverage gaps #8946) + Gemma 4 (chore(frontend): Add Gemma 4 parser support + Test Cases #8852); 5 families remain
  3. FRONTEND.3 (adjust_request) — CLOSED for 7 families via 28 new tests in lib/llm/tests/tool_choice.rs (test(parsers): DIS-1842 — DSv4 + Kimi K2 unit-test coverage gaps #8946 + test(parsers): Top-N models to have extra CASE.6+ coverage (case3) #9035)

Test plan

  • cargo check -p dynamo-parsers --tests passes (docs-only; new file)
  • All 493 test rows carry at least one new-taxonomy bucket
  • Bucket-summary tally adds up; "Total (distinct test rows) = 493" matches header
  • No leftover old CASE.* labels outside the staleness banner / "Old label" column / mis-bucket annotation notes
  • Internal review caught 4 bucket-assignment issues + 2 mis-bucketings; all fixed
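
The row-count and tag checks above could be reproduced with a short script. This is a minimal sketch under stated assumptions: that audit rows are Markdown table lines containing a GitHub source link (the description says every row carries one), and that new-taxonomy tags look like `PARSER.…`, `REASONING.…`, `PIPELINE.…`, `FRONTEND.…`, or `// helper`. The function names are hypothetical and the actual layout of VLLM_TEST_AUDIT.md may differ.

```python
import re

EXPLICIT_TESTS = 421   # explicit test functions (from the PR description)
INHERITED_ROWS = 72    # inherited common-suite rows from ToolParserTests
EXPECTED_TOTAL = EXPLICIT_TESTS + INHERITED_ROWS  # 493

# Assumed tag shapes; adjust if the audit uses a different spelling.
NEW_TAG = re.compile(r"\b(?:PARSER|REASONING|PIPELINE|FRONTEND)\.[\w.]+|// helper")
OLD_TAG = re.compile(r"\bCASE\.\d+\b")


def data_rows(markdown: str) -> list[str]:
    """Table lines that carry a GitHub link; skips header and separator rows."""
    rows = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and "https://github.com/" in stripped:
            rows.append(stripped)
    return rows


def untagged_rows(rows: list[str]) -> list[str]:
    """Rows with no new-taxonomy tag at all."""
    return [r for r in rows if not NEW_TAG.search(r)]


def stale_rows(rows: list[str]) -> list[str]:
    """Rows still mentioning an old CASE.* label.

    This over-reports: the test plan allows old labels in the "Old label"
    column and in mis-bucket annotation notes.
    """
    return [r for r in rows if OLD_TAG.search(r)]
```

Against the real file one would expect `len(data_rows(text)) == EXPECTED_TOTAL` and `untagged_rows(...)` to be empty, with `stale_rows(...)` reviewed by hand.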

Out of scope / follow-ups

  • Per-family ticket spawning: the 5 PARSER.stream.* gaps and the Mistral v11 parser are separate work items
  • DIS-1906 (cross-impl parser parity harness) can use this audit as fixture-row source

Closes the audit half of DIS-1926 (the taxonomy half landed in #9290).

@github-actions github-actions Bot added the docs and documentation (Improvements or additions to documentation) labels May 8, 2026
@github-actions
Contributor

github-actions Bot commented May 8, 2026

zhongdaor-nv and others added 3 commits May 15, 2026 09:34
Companion artifact to PR #9290 (PARSER_CASES.md taxonomy refinement).
Adds the full per-test bidirectional audit that informed every change in
that PR — every vLLM tool-parser test mapped onto the new (PR #9127)
taxonomy with a clickable source link.

`lib/parsers/VLLM_TEST_AUDIT.md` (new file, 906 lines, 493 distinct
test rows):

- **Source**: vLLM `main` at commit b53c507bc91f87e28b03e9b54bbff7c76e97d58b
  (`vllm/tool_parsers/*`, `tests/tool_parsers/*`, `tests/tool_use/*`,
  `tests/entrypoints/openai/tool_parsers/*`).
- **Scope**: 421 explicit test functions + 72 inherited common-suite
  rows from `ToolParserTests`.
- **Bucketing**: every row carries one or more `PARSER_CASES.md` /
  `REASONING_CASES.md` / `PIPELINE_CASES.md` / `FRONTEND_CASES.md`
  tags, plus a one-line behavioral note.

Re-bucketing transformations applied (vs the original CASE.* labels
the audit was first written against, before PR #9127):

- 244 streaming rows split per-row into PARSER.stream.{1,2,3,4}
  (single-call assembly / multi-call assembly / partial-token
  chunking / streaming termination)
- 26 fmt rows split per-row into PARSER.fmt.1 (function-name) vs
  PARSER.fmt.5 (argument-shape: native ID, JSON field-order,
  arguments↔parameters alias)
- Out-of-PARSER-scope buckets relocated to sibling docs:
  CASE.{11,18,25} → FRONTEND.{1,3,5,6}; CASE.12 →
  PIPELINE.finish_reason; CASE.{9,10,17} → REASONING.batch.{1,2};
  CASE.20 → `// helper`; CASE.16 → inline-regression annotation;
  CASE.26 dissolved into PARSER.batch.4 impl-defined recovery
  contract

Two mis-bucketings caught and fixed during review:
- FunctionGemma::test_multiple_tool_calls and
  Gemma4::TestExtractToolCalls.test_multiple_tool_calls were both
  labeled CASE.1 but assert len(tool_calls) == 2 — corrected to
  PARSER.batch.2.

Four bucket-assignment refinements caught by review:
- test_unique_tool_call_ids (DSv3.2) drops fmt.5 (no native call-ID
  surface; just parallel-call distinctness).
- test_invalid_funcall_id_skipped (Kimi K2) moves fmt.5 → fmt.1
  (validation, not preservation).
- 3 Mistral `argument_before_name*` parametrized rows gain fmt.5
  (canonical field-order swap test set referenced by PARSER_CASES.md).

A staleness banner at the top documents the re-bucketing transformation
and mis-bucket fixes for traceability.

Top findings the audit informed (already addressed in PR #9290 or
flagged for follow-up):

1. Mistral v11+ wire format — STILL OPEN (parser doesn't exist;
   flagged in PARSER_CASES.md "Known production gaps").
2. PARSER.stream.{1..4} parser-tier coverage gap in 5 families
   (Kimi K2 / Qwen3 / Hermes / Pythonic / Mistral) — partial
   closure via DSv4 (#8946) and Gemma 4 (#8852).
3. CASE.25 / FRONTEND.3 (`adjust_request`) — CLOSED for 7 families
   via 28 new tests in `lib/llm/tests/tool_choice.rs` (#8946 + #9035).

Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
…er tags

- Split PARSER.batch.8 into .a/.b/.c/.d sub-buckets per narration
  position (before / after / sandwich / between-multi); 43 rows updated.
- Helper-tag dedup per PARSER_CASES.md:35-38: rows previously
  double-tagged as PARSER.<bucket>.<n> + // helper now carry // helper
  only. 35 rows updated; PARSER.batch.7 86 -> 58.
- Drop "Old label" column and staleness banner from Bucket Summary
  (taxonomy migration is settled). Adds 8.a-d, PARSER.harmony.2, and
  the sibling-doc dissolved row. Count refinements: stream.1 178 -> 177,
  stream.3 63 -> 58, fmt.2 9 -> 8, REASONING.batch.2 18 -> 36.
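
The helper-tag dedup rule in this commit can be illustrated as follows. This is a sketch only: `dedup_helper_tags` is a hypothetical name, and keeping any non-PARSER tags on a helper row is an assumption, since the commit message describes only the PARSER.<bucket>.<n> + // helper case.

```python
HELPER = "// helper"


def dedup_helper_tags(tags: list[str]) -> list[str]:
    """Drop PARSER.* tags from a row that is also tagged `// helper`.

    Rows without the helper tag are returned unchanged. Non-PARSER tags on a
    helper row are kept (an assumption; the commit message does not cover
    that case).
    """
    if HELPER not in tags:
        return list(tags)
    return [t for t in tags if not t.startswith("PARSER.")]
```

For example, a row tagged `["PARSER.batch.7", "// helper"]` becomes `["// helper"]`, which is consistent with the PARSER.batch.7 count dropping once such rows are no longer counted in that bucket.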

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
@keivenchang keivenchang force-pushed the zhongdaor/dis-1926-vllm-test-audit-doc branch from f10e682 to fd87db4 May 15, 2026 16:39

Labels

docs, documentation (Improvements or additions to documentation), size/XXL


2 participants