
feat(sdk): improve AI tool definitions for LLM accuracy (25% → 95% pass rate) #2446

Merged
harbournick merged 14 commits into main from feature/eval-improvements
Mar 19, 2026

Contributor

@tupizz tupizz commented Mar 18, 2026

TL;DR

LLMs using our SDK tools failed 75% of execution tests because many parameter schemas lacked descriptions. This PR raises description coverage from 60% to 90% of parameters and improves the codegen pipeline. The execution test pass rate went from 25% to 95%.


Problem

Our 9 grouped SDK tools expose ~115 parameters to LLMs via JSON Schema. Most had no description field — models saw names and types but no guidance on format, valid values, or when params are required.

Common failures:

  • Model passed at: {kind: "end"} — schema requires "documentEnd" but nothing said so
  • Model used value instead of text for replacement content
  • Model omitted changeMode (required) because nothing marked it as mandatory
  • Model passed target: {ref: "..."} instead of using the ref param directly
  • Model tried {type: "heading"} instead of {type: "node", nodeType: "heading"}
  • List creation burned 5 iterations guessing kind, mode, at, target requirements

Solution

Three layers of improvements:

1. Descriptions on schemas (schemas.ts, operation-params.ts)

Added descriptions with format examples to all major operations:

// Before: model guesses wrong format
at: { oneOf: [...] }

// After: model knows exactly what to pass  
at: { oneOf: [...], description: "Position: {kind:'documentEnd'} to append, {kind:'before'|'after', target:{kind:'block', nodeType:'...', nodeId:'...'}} for relative placement." }
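
For reference, the shape spelled out in that description can be written down as a TypeScript type with a runtime guard. This is a minimal sketch built only from the description text above; the names `AtPosition`, `BlockTarget`, and `isAtPosition` are hypothetical and are not taken from schemas.ts.

```typescript
// Hypothetical types derived from the `at` description text, not from schemas.ts.
type BlockTarget = { kind: "block"; nodeType: string; nodeId: string };
type AtPosition =
  | { kind: "documentEnd" }
  | { kind: "before" | "after"; target: BlockTarget };

// Runtime guard: true only for the two shapes the description allows.
function isAtPosition(value: unknown): value is AtPosition {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { kind?: unknown; target?: unknown };
  if (v.kind === "documentEnd") return true;
  if (v.kind !== "before" && v.kind !== "after") return false;
  const t = v.target as
    | { kind?: unknown; nodeType?: unknown; nodeId?: unknown }
    | null
    | undefined;
  return (
    typeof t === "object" &&
    t !== null &&
    t.kind === "block" &&
    typeof t.nodeType === "string" &&
    typeof t.nodeId === "string"
  );
}

// The baseline failure ({kind: "end"}) is rejected; the documented form passes.
console.log(isAtPosition({ kind: "end" })); // false
console.log(isAtPosition({ kind: "documentEnd" })); // true
```

A guard like this could back an eval assertion, but the description string is what the model actually reads, so the two should be kept in sync.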

2. Smarter codegen (generate-intent-tools.mjs)

  • Auto-annotates conditional requirements: adds "Required for action 'create'" when a param is required by some actions but not all
  • Checks contract-level required arrays: previously only checked CLI params, missing requirements from if/then/else schemas
  • Removes empty {} oneOf branches: bold: oneOf:[{type:"boolean"}, {}] → bold: {type:"boolean"} (42 inline properties simplified)
  • Deduplicates same-type oneOf branches: collapses ref: oneOf:[{type:"string",...},{type:"string",...}] into single {type:"string"}
  • Fallback descriptions: auto-adds descriptions to target, ref, content, inline when the schema doesn't provide one
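
To illustrate the auto-annotation and fallback steps, here is a rough sketch of how a param that is required by some actions but not all could pick up a "Required for action ..." note. It assumes a simplified contract shape; `ActionContract`, `annotateParams`, and the fallback wording are hypothetical and do not mirror the actual code in generate-intent-tools.mjs.

```typescript
// Hypothetical, simplified shapes; the real contract types may differ.
type ParamSchema = { type?: string; description?: string };

interface ActionContract {
  action: string;
  required: string[]; // params this action requires
}

// Illustrative fallback text, not the wording shipped in this PR.
const FALLBACK_DESCRIPTIONS: Record<string, string> = {
  target: "Address of the node or range to operate on.",
  ref: "Handle ref returned by a previous search.",
};

function annotateParams(
  properties: Record<string, ParamSchema>,
  actions: ActionContract[],
): Record<string, ParamSchema> {
  const out: Record<string, ParamSchema> = {};
  for (const [name, schema] of Object.entries(properties)) {
    // Use the schema's own description, else a fallback if one exists.
    let description: string | undefined =
      schema.description ?? FALLBACK_DESCRIPTIONS[name];
    const requiredBy = actions
      .filter((a) => a.required.includes(name))
      .map((a) => a.action);
    // Annotate only when some, but not all, actions require the param.
    if (requiredBy.length > 0 && requiredBy.length < actions.length) {
      const note = `Required for action ${requiredBy.map((a) => `'${a}'`).join(", ")}.`;
      description = description ? `${description} ${note}` : note;
    }
    out[name] = description === undefined ? { ...schema } : { ...schema, description };
  }
  return out;
}

const annotated = annotateParams(
  { at: { type: "object" }, ref: { type: "string" } },
  [
    { action: "create", required: ["at"] },
    { action: "format", required: [] },
  ],
);
console.log(annotated);
```

Params required by every action get no note in this sketch, on the assumption that the schema's top-level required array already covers them.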

3. System prompt guidance (system-prompt.md)

  • "Always take action" — stop asking clarifying questions
  • "Placing content near specific text" — search by text, not by node type
  • ref vs target clarification — use ref for inline formatting
  • List creation workflow — two modes explained with examples
  • select.type must be "text" or "node" — prevents {type: "heading"} error

Results

Execution tests (same 20 tests, same model):

| Stage | Pass rate | Key fix |
| --- | --- | --- |
| Baseline | 25% | No descriptions |
| +create at desc | 60% | Model stops guessing {kind:"end"} |
| +changeMode, select, target | 85% | Required params and format examples |
| +mutations changeMode: "Required" | 95% | Last major failure fixed |

Description coverage: 60% → 90% of parameters
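
A coverage number like this can be recomputed with a few lines of code. The sketch below assumes the generated tool file follows the OpenAI function-tool layout (`{type: "function", function: {name, parameters}}`) and counts only top-level properties; `ToolDef` and `descriptionCoverage` are hypothetical names, and the real measurement may also count nested params.

```typescript
// Assumed OpenAI-style tool definition; only the fields used here are typed.
interface ToolDef {
  type: "function";
  function: {
    name: string;
    parameters: { properties?: Record<string, { description?: string }> };
  };
}

// Counts top-level params that carry a non-empty description.
function descriptionCoverage(tools: ToolDef[]): {
  described: number;
  total: number;
  ratio: number;
} {
  let described = 0;
  let total = 0;
  for (const tool of tools) {
    const props = tool.function.parameters.properties ?? {};
    for (const param of Object.values(props)) {
      total += 1;
      if (typeof param.description === "string" && param.description.trim() !== "") {
        described += 1;
      }
    }
  }
  return { described, total, ratio: total === 0 ? 0 : described / total };
}

// Made-up sample data for illustration only.
const sample: ToolDef[] = [
  {
    type: "function",
    function: {
      name: "example_tool",
      parameters: {
        properties: {
          at: { description: "Position to insert at." },
          text: { description: "Content to insert." },
          changeMode: {},
        },
      },
    },
  },
];
console.log(descriptionCoverage(sample));
```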

Schema token cost: ~11,175 → ~11,001 tokens (empty branch removal)

The 1 remaining failure is a table creation test where the tool surface doesn't support the operation yet.

Files

| Area | Files | What |
| --- | --- | --- |
| Schema descriptions | schemas.ts | Descriptions on 30+ properties across all tools |
| CLI params | operation-params.ts, types.ts, export-sdk-contract.ts | Descriptions on envelope + flat params, removed dead agentRequired |
| Codegen | generate-intent-tools.mjs | Auto-annotation, dedup, empty branch removal, fallbacks |
| System prompt | system-prompt.md | Targeting guide, list workflow, action-first instruction |
| Eval tests | tool-quality.yaml, execution.yaml, checks.cjs | 4-group structure, trace assertions, partial credit scoring |
| Reference docs | apps/docs/**/*.mdx | Regenerated with new descriptions |

Test plan

  • pnpm run generate:all passes
  • tools.openai.json has descriptions on 104/115 params
  • Execution tests: 19/20 (95%)
  • Tool quality tests: 172/174 (99%)
  • Reference docs regenerated
  • Cross-provider validation (Anthropic, Google)

…ed assertions

- Updated execution tests to validate tool execution mechanics, including trace and content assertions.
- Improved tool quality tests to assess LLM tool selection accuracy and argument structure.
- Added comprehensive checks for tool call sequences and success rates in execution traces.
- Refined assertions to ensure correct tool usage and argument validation across various document operations.
- Enhanced CLI operation parameter specifications by adding human-readable descriptions for better usability and documentation.
- Updated existing parameters to include descriptions, improving clarity for users interacting with the CLI.
- Modified the `CliOperationParamSpec` type to include an optional `description` field for enhanced schema documentation.
@tupizz tupizz self-assigned this Mar 18, 2026
@tupizz tupizz marked this pull request as ready for review March 18, 2026 22:23

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7fd1ad12eb


Comment thread evals/providers/superdoc-agent-gateway.mjs
Comment thread evals/lib/checks.cjs Outdated
Comment thread evals/lib/checks.cjs
tupizz added 3 commits March 18, 2026 19:46
- Updated the documentation for list creation and insertion operations to include detailed descriptions for required parameters, improving clarity for users.
- Added specific formatting instructions for the `at` and `target` parameters in the `create` and `insert` operations, respectively.
- Regenerated the manifest file to reflect the updated source hash.
…dle in format documentation

- Enhanced the descriptions for the `target` and `ref` properties across multiple format-related documentation files to clarify usage.
- Updated the `target` description to recommend using 'ref' for search result handles.
- Improved the `ref` description to specify passing the handle.ref value directly for inline formatting.
- Regenerated the manifest file to reflect the updated source hash.
- Add descriptions to SelectionPoint, nestingPolicy, and inline formatting
- Fix codegen to check contract-level required arrays (not just CLI params)
- Remove empty {} oneOf branches from inline properties (42 simplified)
- Deduplicate same-type oneOf branches (e.g. duplicate string refs)
- Collapse single-branch oneOf to plain type
- Add fallback descriptions for target, ref, content, inline params
- Add "placing content near text" workflow to system prompt
- Clarify search select.type must be "text" or "node"
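
The three oneOf cleanups in this commit (drop empty branches, deduplicate same-type branches, collapse a single remaining branch) could look roughly like the sketch below. It is an illustration under simplifying assumptions, not the code in generate-intent-tools.mjs: `Schema` and `simplifyOneOf` are hypothetical names, and only branches consisting of a bare `type` (plus an optional `description`) are treated as duplicates.

```typescript
// Hypothetical minimal JSON Schema shape for this sketch.
type Schema = {
  type?: string;
  description?: string;
  oneOf?: Schema[];
  [key: string]: unknown;
};

function simplifyOneOf(schema: Schema): Schema {
  const { oneOf, ...rest } = schema;
  if (oneOf === undefined) return schema;

  // 1. Drop empty {} branches.
  const nonEmpty = oneOf.filter((branch) => Object.keys(branch).length > 0);

  // 2. Deduplicate branches that are only a bare type (description ignored).
  const seenTypes = new Set<string>();
  const branches: Schema[] = [];
  for (const branch of nonEmpty) {
    const keys = Object.keys(branch).filter((k) => k !== "description");
    const t = branch.type;
    if (keys.length === 1 && keys[0] === "type" && t !== undefined) {
      if (seenTypes.has(t)) continue; // keep the first branch of each type
      seenTypes.add(t);
    }
    branches.push(branch);
  }

  // 3. Collapse a single remaining branch into the parent schema.
  const [only] = branches;
  if (branches.length === 1 && only !== undefined) {
    return { ...only, ...rest }; // outer keys (e.g. description) win
  }
  return { ...rest, oneOf: branches };
}

console.log(JSON.stringify(simplifyOneOf({ oneOf: [{ type: "boolean" }, {}] })));
// {"type":"boolean"}
```

Note that removing an empty {} branch narrows what the schema accepts, since {} matches any value, so a transform like this is only safe where the looser branch was never intended.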
@tupizz tupizz changed the title feat: improvements on promptfoo setup and evaluation tests feat(sdk): improve AI tool definitions for LLM accuracy (25% → 95% pass rate) Mar 19, 2026
Contributor

@andrii-harbour andrii-harbour left a comment

well done!

minor comments here

Comment thread apps/docs/document-api/reference/create/heading.mdx Outdated
Comment thread evals/promptfooconfig.e2e.yaml Outdated
tupizz added 4 commits March 19, 2026 15:52
- Fix heading level description "1-9" → "1-6" to match schema max: 6
- Add comment explaining commented-out providers are templates
- Add zero tool calls guard to traceAllOk assertion
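
The zero-tool-calls guard matters because an "every call succeeded" check passes vacuously on an empty trace (`[].every(...)` is `true` in JavaScript). A minimal sketch of the idea follows; the trace shape and return value here are assumptions loosely modeled on a pass/score/reason grading result, not the actual implementation in checks.cjs.

```typescript
// Hypothetical trace entry; the real trace format in the evals may differ.
interface ToolCallTrace {
  tool: string;
  ok: boolean;
}

interface CheckResult {
  pass: boolean;
  score: number;
  reason: string;
}

function traceAllOk(trace: ToolCallTrace[]): CheckResult {
  // Guard: without this, an empty trace would pass because there are no failures.
  if (trace.length === 0) {
    return { pass: false, score: 0, reason: "No tool calls recorded" };
  }
  const failed = trace.filter((call) => !call.ok);
  // Partial credit: fraction of calls that succeeded.
  const score = (trace.length - failed.length) / trace.length;
  return {
    pass: failed.length === 0,
    score,
    reason:
      failed.length === 0
        ? "All tool calls succeeded"
        : `Failed calls: ${failed.map((call) => call.tool).join(", ")}`,
  };
}

console.log(traceAllOk([]).pass); // false
```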
…ents

# Conflicts:
#	apps/docs/document-api/reference/_generated-manifest.json
Collaborator

@harbournick harbournick left a comment

LGTM - really nice!

@harbournick harbournick merged commit 2e10e26 into main Mar 19, 2026
8 checks passed
@harbournick harbournick deleted the feature/eval-improvements branch March 19, 2026 19:40
Contributor

superdoc-bot Bot commented Mar 19, 2026

🎉 This PR is included in superdoc-cli v0.3.0-next.38

The release is available on GitHub release

superdoc-bot Bot pushed a commit that referenced this pull request Mar 20, 2026
# [0.3.0](cli-v0.2.0...cli-v0.3.0) (2026-03-20)

### Bug Fixes

* arrow key navigation through and out of tables (SD-2236) ([#2476](#2476)) ([d5317ef](d5317ef))
* behavior tests ([#2436](#2436)) ([2d087f2](2d087f2))
* bug text edit commands fail on targets returned by find ([#2488](#2488)) ([7a9a448](7a9a448))
* change default link protocol ([#2319](#2319)) ([1deda06](1deda06))
* clear linked style for the next paragraph ([#2344](#2344)) ([9714ffb](9714ffb))
* clear selection on undo/redo ([#2385](#2385)) ([6473acf](6473acf))
* cli skills install ([ed7436a](ed7436a))
* **cli:** include allowed values in oneOf const validation errors ([#2455](#2455)) ([8802f90](8802f90))
* **cli:** restore tracked diff redline roundtrip ([#2438](#2438)) ([f609371](f609371))
* close toolbar overflow menu on click outside ([#2377](#2377)) ([ba74245](ba74245))
* **collaboration:** preserve body section properties across Yjs sync ([#2356](#2356)) ([ea702d6](ea702d6))
* **comments:** keep floating comment bubbles aligned with the selected thread (SD-2210 and SD-2223) ([#2390](#2390)) ([b014618](b014618))
* **comments:** resolve double-click activation and edit mode issues (SD-2035) ([#2259](#2259)) ([d9465aa](d9465aa))
* declare w15 namespace when bootstrapping numbering.xml ([#2470](#2470)) ([a14004d](a14004d))
* **diffing:** ignore volatile OOXML attrs in image and paragraph diff comparison ([#2421](#2421)) ([ca91225](ca91225))
* disable table resizing UI in viewing mode ([#2403](#2403)) ([697e799](697e799))
* doc-api story regressions and export app.xml stats ([#2478](#2478)) ([d06ff4e](d06ff4e))
* **doc-api:** gate textStyle attrs and sync reference coverage ([#2430](#2430)) ([e2d6ca6](e2d6ca6))
* **docs:** coherence pass on doc api, clean up dead code, update CLI SKILL.md ([#2424](#2424)) ([bf0d4b8](bf0d4b8))
* **document-api:** add document diff API and fix tracked diff replay in CLI host session ([#2418](#2418)) ([2a804f7](2a804f7))
* **document-api:** add mutation-ready cell addresses to tables.getCells ([#2461](#2461)) ([99bd4e5](99bd4e5))
* **document-api:** clear styles before paragraph.setStyle ([#2449](#2449)) ([bce4bb8](bce4bb8))
* **document-api:** make find/get treat content controls as sdt ([6688b8c](6688b8c))
* **document-api:** rename atRowIndex to rowIndex in tables.split ([#2473](#2473)) ([7de2864](7de2864))
* **document-api:** return fresh table ref in mutation responses ([#2453](#2453)) ([af6de73](af6de73))
* **document-api:** return NodeAddress from find and getNode instead of SDAddress (SD-2168) ([#2342](#2342)) ([edcb3c6](edcb3c6))
* **editor:** arrow key navigation across page boundaries and auto-scroll (SD-1950) ([#2191](#2191)) ([f7961d7](f7961d7))
* **editor:** prevent scroll-to-top when clicking toolbar buttons ([#2236](#2236)) ([ab30a36](ab30a36))
* ensure ruler 0 is visible ([#2487](#2487)) ([096d9f0](096d9f0))
* **export:** prevent DOCX corruption from UTF-16 XML parts and schema violations (SD-2170) ([#2349](#2349)) ([fed1d6b](fed1d6b))
* faulty TOC import/export (SD-2183) ([#2371](#2371)) ([45b4452](45b4452))
* guard drawing export against invalid structures and zero IDs (SD-824) ([#2363](#2363)) ([9c7fc2e](9c7fc2e))
* **header-footer:** normalize page-relative anchor layout ([#2484](#2484)) ([6e62198](6e62198))
* **image:** sync headless image media to Y.Doc for collab persistence ([#2313](#2313)) ([72c64ed](72c64ed))
* import regression ([#2452](#2452)) ([cac5e24](cac5e24))
* improve document API dry runs, query matching, and reference block mutations ([#2498](#2498)) ([5959c5f](5959c5f))
* improve multi-column rendering ([#2369](#2369)) ([d231640](d231640))
* isolate document surface and toolbar/ruler stacking contexts ([#2491](#2491)) ([976ce14](976ce14))
* issue with vertical cells merging ([#2387](#2387)) ([e8f1c10](e8f1c10))
* **layout-engine:** match partial-row split height to renderer semantics ([#2486](#2486)) ([e0982da](e0982da))
* **layout-engine:** require bilateral opt-in for contextual spacing ([#2475](#2475)) ([40e04c2](40e04c2))
* **layout-engine:** skip redundant pageBreakBefore after page-forcing section breaks ([a950ed2](a950ed2))
* **lists:** stabilize list item addresses for docs without paraIds ([#2429](#2429)) ([0070de6](0070de6))
* match Word list marker geometry and section-carrier pagination ([#2358](#2358)) ([36d562f](36d562f))
* merged table cells owning outer borders in DOM painter ([c55f65a](c55f65a))
* newline formatting inheritance without serializing style-derived formatting (SD-2228) ([#2417](#2417)) ([5a3318f](5a3318f))
* open links in view mode ([#2350](#2350)) ([25f0aad](25f0aad))
* **painter-dom:** skip non-scrollable scroll container in virtualization (SD-2199) ([#2383](#2383)) ([1e075f6](1e075f6))
* **presentation-editor:** arrow key scroll-into-view with unconstrained containers (SD-1950) ([#2411](#2411)) ([fa8afc8](fa8afc8))
* preserve imported letter spacing through editor and layout ([ca9cf6a](ca9cf6a))
* preserve tracked format changes through DOCX export roundtrip ([#2395](#2395)) ([0ee9fa0](0ee9fa0))
* register DOCX numbering metadata for lists.create ([#2432](#2432)) ([129772f](129772f))
* remove syncing of runProperties with paragraph (SD-2143) ([#2343](#2343)) ([3e74426](3e74426))
* **rendering:** apply superscript/subscript font-size scaling during layout ([#2340](#2340)) ([7e9c24f](7e9c24f))
* **rendering:** show comment highlight on text with Word highlight formatting (SD-2188) ([#2370](#2370)) ([8fe0afd](8fe0afd))
* replace file running twice ([#2396](#2396)) ([a79fcaa](a79fcaa))
* **sdk:** improve agent tool definitions for better LLM accuracy ([#2494](#2494)) ([e914af7](e914af7)), closes [#8](#8) [#9](#9) [#10](#10)
* seed base docx package for collaboration exports ([#2416](#2416)) ([df36853](df36853))
* show correct paragraph font in toolbar when selection is empty (SD-2145) ([#2402](#2402)) ([39e1477](39e1477))
* **super-editor:** guard against style definition nodes without elements ([#2379](#2379)) ([7dd57f8](7dd57f8))
* **super-editor:** make notes-part mutations canonical for footnotes ([#2361](#2361)) ([e232129](e232129))
* **super-editor:** preserve fontFamily in runProperties when set via document API (SD-2249) ([#2433](#2433)) ([491c3fe](491c3fe))
* **super-editor:** preserve root doc attrs during collaboration seeding ([#2359](#2359)) ([018469a](018469a))
* **super-editor:** prevent cursor jump when changing font from toolbar ([#2468](#2468)) ([c315599](c315599))
* **super-editor:** reconcile OPC package metadata during DOCX export ([#2357](#2357)) ([863254a](863254a))
* **superdoc:** expose header/footer edits in update callbacks ([#2368](#2368)) ([78d0056](78d0056))
* **superdoc:** prevent duplicate prosemirror-view bundles in dist ([32c1045](32c1045))
* surface hyperlink tracked changes in comments ([#2485](#2485)) ([ae55118](ae55118))
* **tables:** handle insertColumn right of last column ([#2451](#2451)) ([74c37ff](74c37ff))
* **tables:** prevent resize overlay artifacts during drag ([#2479](#2479)) ([1b1e712](1b1e712))
* text selection inside headers/footers ([#2404](#2404)) ([09677dc](09677dc))
* **toc:** anchor scroll precision within pages navigation (SD-2186) ([#2372](#2372)) ([cfb9a72](cfb9a72))
* **toc:** inject _Toc bookmarks so exported DOCX TOC links work without manual Update Table ([#2431](#2431)) ([54c5aa7](54c5aa7))
* toolbar state after document load (SD-2145) ([#2448](#2448)) ([6347ffe](6347ffe))
* **track-changes:** allow linked style changes in suggesting mode (SD-2182) ([#2373](#2373)) ([6400a1f](6400a1f))
* **track-changes:** cancel tracked format changes when reverted to original (SD-2181) ([#2365](#2365)) ([72077b2](72077b2))
* **track-changes:** remove logic that combines adjacent TCs with different IDs ([#2326](#2326)) ([b2f088b](b2f088b))
* **tracked-changes:** do not render empty space in TC within lists ([#2316](#2316)) ([00672dc](00672dc))
* **tracked-changes:** sync tracked changes store on undo and redo ([#2164](#2164)) ([94f0056](94f0056))
* **tracked-changes:** undo/redo applies to both document and comment bubbles ([#2437](#2437)) ([bc7cba3](bc7cba3))
* **types:** fix broken .d.ts imports in published superdoc package (SD-2227) ([#2392](#2392)) ([77807e5](77807e5))
* update list marker font before adding list item ([#2312](#2312)) ([8721614](8721614))
* update skill file ([240fb66](240fb66))
* watermark shading mismatch (SD-2147) ([#2353](#2353)) ([c94320c](c94320c))

### Features

* charts ([#2322](#2322)) ([dff2edc](dff2edc))
* cli improvements, block deletion ([#2360](#2360)) ([26972ff](26972ff))
* **cli:** add --version flag ([6199a9c](6199a9c))
* **collab:** wait for Y fragment settling before initializing editor ([b75ee17](b75ee17))
* **comments:** add scrollToComment API ([#2440](#2440)) ([0132d0e](0132d0e))
* diffing extension for comparing documents (SD-1324 and SD-89) ([#2306](#2306)) ([33e2ce6](33e2ce6))
* **doc-info:** add live page counts to doc.info ([#2435](#2435)) ([e631f4b](e631f4b))
* **doc-info:** live doc.info counts for characters, tracked changes, SDT fields, and lists ([#2428](#2428)) ([2978507](2978507))
* **document-api:** accept table coordinates in unmergeCells ([#2462](#2462)) ([5eca65b](5eca65b))
* **document-api:** add 'story' targeting for parts targeting with main api functions ([#2477](#2477)) ([49dc4ef](49dc4ef))
* **document-api:** add paragraph direction ops and clarify format.rtl ([#2474](#2474)) ([86600ac](86600ac))
* **document-api:** add table convenience ops and sync reference doc ([#2471](#2471)) ([137b1d9](137b1d9))
* **document-api:** content controls ([#2320](#2320)) ([2747e81](2747e81))
* **document-api:** headers & footers ([#2323](#2323)) ([b6511ca](b6511ca))
* **document-api:** improve cross block selection and deleting ([#2391](#2391)) ([cb8fedd](cb8fedd))
* **document-api:** insert/replace structural content ([#2305](#2305)) ([ce0c719](ce0c719))
* **document-api:** list creation and style edit commands ([#2457](#2457)) ([1d6d4bb](1d6d4bb))
* **document-api:** references ([#2321](#2321)) ([6da4d9c](6da4d9c))
* **docx:** support Word document statistic fields and F9 field updates ([#2460](#2460)) ([57b3ecc](57b3ecc))
* **headless:** collaborative comment and tracked-change parity ([#2315](#2315)) ([4dc1be1](4dc1be1))
* **layout:** implement AutoFit table layout algorithm (SD-2174) ([#2355](#2355)) ([5c05535](5c05535))
* llm tools beta ([#2393](#2393)) ([f725f36](f725f36))
* make images uploaded into table cell adjust to width of cell ([#2317](#2317)) ([c79b1d1](c79b1d1))
* parts sync system including yjs ([#2325](#2325)) ([84d8945](84d8945))
* **presentation-editor:** enhance zoom functionality in web layout ([#2408](#2408)) ([d44de69](d44de69))
* remove naive ui ([#2240](#2240)) ([fd5444f](fd5444f))
* **sdk:** ensure sdk clients are not global, change open() to return document handle ([#2497](#2497)) ([3b6eede](3b6eede))
* **sdk:** improve AI tool definitions for LLM accuracy (25% → 95% pass rate) ([#2446](#2446)) ([2e10e26](2e10e26))
* seed blank docx parts when loading JSON into editor ([#2401](#2401)) ([89c982f](89c982f))
* **super-editor:** bridge editor selection into Document API commands ([#2458](#2458)) ([26cef26](26cef26))
* support paragraph between borders (w:pBdr/w:between) ([#2324](#2324)) ([03f8207](03f8207)), closes [#2074](#2074)
* **tables:** support lastRow style options with OOXML roundtrip parity ([#2467](#2467)) ([e84f695](e84f695))
* theming with css variables ([#2386](#2386)) ([529c500](529c500)), closes [#2441](#2441) [#2445](#2445) [#2469](#2469)

### Performance Improvements

* **test:** move one-time setup to beforeAll in contract-conformance ([#2483](#2483)) ([ed4839b](ed4839b))
* **test:** speed up unit tests and migrate to bun ([#2492](#2492)) ([af44051](af44051))

### Reverts

* Revert "fix(types): fix broken .d.ts imports in published superdoc package (S…" ([#2443](#2443)) ([33215ee](33215ee))
* Revert "fix(types): fix broken .d.ts imports in published superdoc package (S…" ([#2443](#2443)) ([#2444](#2444)) ([2bde895](2bde895))
Contributor

superdoc-bot Bot commented Mar 20, 2026

🎉 This PR is included in superdoc-cli v0.3.0

The release is available on GitHub release
