docs(parsers): add DIS-1926 vLLM tool-parser test audit #9329
Draft
zhongdaor-nv wants to merge 3 commits into
Companion artifact to PR #9290 (PARSER_CASES.md taxonomy refinement). Adds the full per-test bidirectional audit that informed every change in that PR: every vLLM tool-parser test mapped onto the new (PR #9127) taxonomy with a clickable source link.

`lib/parsers/VLLM_TEST_AUDIT.md` (new file, 906 lines, 493 distinct test rows):

- **Source**: vLLM `main` at commit b53c507bc91f87e28b03e9b54bbff7c76e97d58b (`vllm/tool_parsers/*`, `tests/tool_parsers/*`, `tests/tool_use/*`, `tests/entrypoints/openai/tool_parsers/*`).
- **Scope**: 421 explicit test functions + 72 inherited common-suite rows from `ToolParserTests`.
- **Bucketing**: every row carries one or more `PARSER_CASES.md` / `REASONING_CASES.md` / `PIPELINE_CASES.md` / `FRONTEND_CASES.md` tags, plus a one-line behavioral note.

Re-bucketing transformations applied (vs the original CASE.* labels the audit was first written against, before PR #9127):

- 244 streaming rows split per-row into PARSER.stream.{1,2,3,4} (single-call assembly / multi-call assembly / partial-token chunking / streaming termination)
- 26 fmt rows split per-row into PARSER.fmt.1 (function-name) vs PARSER.fmt.5 (argument-shape: native ID, JSON field-order, arguments↔parameters alias)
- Out-of-PARSER-scope buckets relocated to sibling docs: CASE.{11,18,25} → FRONTEND.{1,3,5,6}; CASE.12 → PIPELINE.finish_reason; CASE.{9,10,17} → REASONING.batch.{1,2}; CASE.20 → `// helper`; CASE.16 → inline-regression annotation; CASE.26 dissolved into PARSER.batch.4 impl-defined recovery contract

Two mis-bucketings caught and fixed during review:

- FunctionGemma::test_multiple_tool_calls and Gemma4::TestExtractToolCalls.test_multiple_tool_calls were both labeled CASE.1 but assert len(tool_calls) == 2; corrected to PARSER.batch.2.

Four bucket-assignment refinements caught by review:

- test_unique_tool_call_ids (DSv3.2) drops fmt.5 (no native call-ID surface; just parallel-call distinctness).
- test_invalid_funcall_id_skipped (Kimi K2) moves fmt.5 → fmt.1 (validation, not preservation).
- 3 Mistral `argument_before_name*` parametrized rows gain fmt.5 (canonical field-order swap test set referenced by PARSER_CASES.md).

A staleness banner at the top documents the re-bucketing transformation and mis-bucket fixes for traceability.

Top findings the audit informed (already addressed in PR #9290 or flagged for follow-up):

1. Mistral v11+ wire format: STILL OPEN (parser doesn't exist; flagged in PARSER_CASES.md "Known production gaps").
2. PARSER.stream.{1..4} parser-tier coverage gap in 5 families (Kimi K2 / Qwen3 / Hermes / Pythonic / Mistral); partial closure via DSv4 (#8946) and Gemma 4 (#8852).
3. CASE.25 / FRONTEND.3 (`adjust_request`): CLOSED for 7 families via 28 new tests in `lib/llm/tests/tool_choice.rs` (#8946 + #9035).

Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
…er tags

- Split PARSER.batch.8 into .a/.b/.c/.d sub-buckets per narration position (before / after / sandwich / between-multi); 43 rows updated.
- Helper-tag dedup per PARSER_CASES.md:35-38: rows previously double-tagged as PARSER.<bucket>.<n> + // helper now carry // helper only. 35 rows updated; PARSER.batch.7 86 -> 58.
- Drop "Old label" column and staleness banner from Bucket Summary (taxonomy migration is settled). Adds 8.a-d, PARSER.harmony.2, and the sibling-doc dissolved row. Count refinements: stream.1 178 -> 177, stream.3 63 -> 58, fmt.2 9 -> 8, REASONING.batch.2 18 -> 36.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
f10e682 to
fd87db4
Overview
Companion artifact to #9290 (already merged; refined `PARSER_CASES.md` taxonomy from this audit). This PR adds the full per-test bidirectional audit that informed every change in #9290.

Marked draft because reviewers may want to argue per-row bucket assignments; the audit is a working reference doc rather than a stable contract.
What's in this PR
`lib/parsers/VLLM_TEST_AUDIT.md` (new file, 906 lines, 493 distinct test rows):

- **Source**: vLLM `main` at `b53c507bc91f87e28b03e9b54bbff7c76e97d58b` (`vllm/tool_parsers/*`, `tests/tool_parsers/*`, `tests/tool_use/*`, `tests/entrypoints/openai/tool_parsers/*`).
- **Scope**: 421 explicit test functions + 72 inherited common-suite rows from `ToolParserTests`.
- **Bucketing**: every row carries one or more `PARSER_CASES.md` / `REASONING_CASES.md` / `PIPELINE_CASES.md` / `FRONTEND_CASES.md` tags, plus a clickable GitHub source link, plus a one-line behavioral note.

Re-bucketing summary (post PR #9127 taxonomy)
The audit was originally written against the old `CASE.*` labels in `lib/parsers/TEST_CASES.md` (deleted by #9127). Mechanical renames + per-row classification done:

- 244 streaming rows split per-row into `PARSER.stream.{1,2,3,4}` (single-call assembly / multi-call assembly / partial-token chunking / streaming termination)
- 26 fmt rows split per-row into `PARSER.fmt.1` (function-name surface) vs `PARSER.fmt.5` (argument-envelope shape: native ID, JSON field-order, `arguments`↔`parameters` alias)
- Out-of-`PARSER`-scope buckets relocated: `CASE.{11,18,25}` → `FRONTEND.{1,3,5,6}`; `CASE.12` → `PIPELINE.finish_reason`; `CASE.{9,10,17}` → `REASONING.batch.{1,2}`; `CASE.20` → `// helper`; `CASE.16` → inline-regression annotation; `CASE.26` dissolved into `PARSER.batch.4` impl-defined recovery contract

Mis-bucket fixes caught by review
- `FunctionGemma::test_multiple_tool_calls` and `Gemma4::TestExtractToolCalls.test_multiple_tool_calls` were both labeled `CASE.1` but assert `len(tool_calls) == 2`; corrected to `PARSER.batch.2`.

Bucket-assignment refinements caught by review
- `test_unique_tool_call_ids` (DSv3.2): drops `fmt.5` (no native call-ID surface)
- `test_invalid_funcall_id_skipped` (Kimi K2): `fmt.5` → `fmt.1` (validation, not preservation)
- 3 Mistral `argument_before_name*` parametrized rows: added missing `fmt.5` tag

A staleness banner at the top documents the re-bucketing transformation and mis-bucket fixes for traceability.
Top findings (already in PR #9290 or flagged for follow-up)
1. Mistral v11+ wire format: STILL OPEN (parser doesn't exist; flagged in `PARSER_CASES.md` "Known production gaps")
2. `PARSER.stream.{1..4}` parser-tier in Kimi K2 / Qwen3 / Hermes / Pythonic / Mistral: partial closure via DSv4 (#8946, "test(parsers): DIS-1842 - DSv4 + Kimi K2 unit-test coverage gaps") + Gemma 4 (#8852, "chore(frontend): Add Gemma 4 parser support + Test Cases"); 5 families remain
3. `CASE.25` / `FRONTEND.3` (`adjust_request`): CLOSED for 7 families via 28 new tests in `lib/llm/tests/tool_choice.rs` (#8946 + #9035, "test(parsers): Top-N models to have extra CASE.6+ coverage (case3)")

Test plan
- `cargo check -p dynamo-parsers --tests` passes (docs-only; new file)
- `CASE.*` labels outside the staleness banner / "Old label" column / mis-bucket annotation notes

Out of scope / follow-ups
- `PARSER.stream.*` gaps and the Mistral v11 parser are separate work items

Closes the audit half of DIS-1926 (the taxonomy half landed in #9290).
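The test plan's item about `CASE.*` labels suggests scanning the audit for leftover legacy labels. One way such a scan might look is sketched below; the regular expression and the function name `find_legacy_labels` are assumptions for illustration, and the sketch does not model the exemptions the test plan mentions (staleness banner, "Old label" column, mis-bucket annotation notes).

```python
import re

# Assumed label shapes: CASE.<number>, CASE.{n,m,...}, and the wildcard CASE.*
# The leading \b keeps file names such as PARSER_CASES.md from matching.
LEGACY = re.compile(r"\bCASE\.(?:\d+|\{[\d,]+\}|\*)")


def find_legacy_labels(text: str) -> list[tuple[int, str]]:
    """Return (line_number, label) pairs for every legacy label found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in LEGACY.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

Running it over the contents of `lib/parsers/VLLM_TEST_AUDIT.md` would list each remaining label with its line number, which a reviewer could then compare against the exempted sections by hand.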