From 7055d27cafb987a7949bed99666b9a05a6fb6ce2 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Wed, 25 Mar 2026 02:14:26 +0000 Subject: [PATCH 01/11] docs: add design for --threshold flag (#698) Design document for suite-level quality gate threshold flag that fails CI when mean eval score drops below a specified value. Co-Authored-By: Claude Opus 4.6 Entire-Checkpoint: 1604663b3709 --- .../plans/2026-03-25-threshold-flag-design.md | 76 +++++++++++++++++++ 1 file changed, 76 insertions(+) create mode 100644 docs/plans/2026-03-25-threshold-flag-design.md diff --git a/docs/plans/2026-03-25-threshold-flag-design.md b/docs/plans/2026-03-25-threshold-flag-design.md new file mode 100644 index 000000000..29c6b5e74 --- /dev/null +++ b/docs/plans/2026-03-25-threshold-flag-design.md @@ -0,0 +1,76 @@ +# Design: `--threshold` flag for suite-level quality gates + +**Issue:** #698 +**Date:** 2026-03-25 + +## Objective + +Add a `--threshold` CLI flag to `agentv eval` that fails (exit 1) if the mean score across all tests falls below the specified threshold. This enables CI/CD quality gating without needing `agentv compare --baseline`. + +## CLI Flag + +- `--threshold ` on `agentv eval run` (0–1 scale) +- Optional — if omitted, no threshold check (current behavior preserved) +- Overrides `execution.threshold` from YAML if both set + +## YAML Config + +Add `threshold` to the `execution` block in eval YAML files: + +```yaml +execution: + threshold: 0.8 +``` + +Both `threshold` and `execution.threshold` accepted (snake_case wire format convention). + +## Score Evaluation + +After all tests complete: + +1. Compute mean score from quality results only (excluding `execution_error` tests — same as existing `calculateEvaluationSummary()`) +2. If mean score < threshold → exit code 1 +3. Execution errors fail independently via existing `fail_on_error` mechanism (separate concern) +4. 
If no quality results exist (all execution errors), threshold check is skipped + +## Output + +When threshold is active, append a summary line after the existing result summary: + +``` +Suite score: 0.53 (threshold: 0.60) — FAIL +``` + +or: + +``` +Suite score: 0.85 (threshold: 0.60) — PASS +``` + +## JUnit Integration + +The JUnit writer uses the threshold for per-test pass/fail: + +- If threshold is set: `score < threshold` → `` element +- If threshold is not set: `score < 0.5` (current hardcoded behavior preserved) + +## Exit Code + +- Exit 0: mean score >= threshold (or no threshold set) +- Exit 1: mean score < threshold +- Execution errors handled separately by `fail_on_error` + +## Files to Modify + +1. `packages/core/src/evaluation/validation/eval-file.schema.ts` — add `threshold` to ExecutionSchema +2. `apps/cli/src/commands/eval/commands/run.ts` — add `--threshold` CLI flag +3. `apps/cli/src/commands/eval/run-eval.ts` — pass threshold through, check after results +4. `apps/cli/src/commands/eval/statistics.ts` — add threshold summary formatting +5. `apps/cli/src/commands/eval/junit-writer.ts` — use threshold for pass/fail +6. Tests for new behavior + +## Non-Goals + +- Per-test threshold override (use `required` for that) +- Replacement for `agentv compare` regression gating +- Severity levels (#334) From 44587b9eec003fd02d5aefee60e48ddb4c1def27 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Wed, 25 Mar 2026 02:19:22 +0000 Subject: [PATCH 02/11] docs: add implementation plan for --threshold flag (#698) 8-task TDD plan covering core extractor, YAML schema, CLI flag, threshold check, JUnit integration, and manual UAT. 
Co-Authored-By: Claude Opus 4.6 Entire-Checkpoint: 6cfbff7718f7 --- docs/plans/2026-03-25-threshold-flag-plan.md | 562 +++++++++++++++++++ 1 file changed, 562 insertions(+) create mode 100644 docs/plans/2026-03-25-threshold-flag-plan.md diff --git a/docs/plans/2026-03-25-threshold-flag-plan.md b/docs/plans/2026-03-25-threshold-flag-plan.md new file mode 100644 index 000000000..57ba2eb53 --- /dev/null +++ b/docs/plans/2026-03-25-threshold-flag-plan.md @@ -0,0 +1,562 @@ +# `--threshold` Flag Implementation Plan + +> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. + +**Goal:** Add a `--threshold` CLI flag and `execution.threshold` YAML field to `agentv eval` that exits 1 when mean quality score falls below the threshold. + +**Architecture:** The threshold value flows from CLI flag or YAML config through the existing options pipeline. After all tests complete, the summary is checked against the threshold. JUnit writer also uses the threshold for per-test pass/fail. 
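The gate described in the architecture note can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not the code in `run-eval.ts`: `QualityResult`, `GateOutcome`, `checkThreshold`, and the `executionError` field are hypothetical names used only for illustration. In the real pipeline the mean comes from `calculateEvaluationSummary()`.

```typescript
// Hypothetical result shape; the real EvaluationResult in @agentv/core has more fields.
interface QualityResult {
  readonly score: number;
  readonly executionError?: boolean; // assumption: marks tests that failed to execute
}

type GateOutcome =
  | { readonly checked: false }
  | { readonly checked: true; readonly mean: number; readonly passed: boolean };

// No threshold, or no quality results at all, means the gate is skipped.
function checkThreshold(
  results: readonly QualityResult[],
  threshold: number | undefined,
): GateOutcome {
  if (threshold === undefined) return { checked: false };
  // Execution errors are excluded; they are handled separately by fail_on_error.
  const quality = results.filter((r) => !r.executionError);
  if (quality.length === 0) return { checked: false };
  const mean = quality.reduce((sum, r) => sum + r.score, 0) / quality.length;
  return { checked: true, mean, passed: mean >= threshold };
}

const outcome = checkThreshold(
  [{ score: 0.9 }, { score: 0.5 }, { score: 0, executionError: true }],
  0.6,
);
// In the CLI, a checked-and-failed outcome would set process.exitCode = 1.
console.log(outcome);
```

Keeping the check pure makes the skip cases (no threshold, only execution errors) easy to unit test apart from the exit-code side effect.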
+ +**Tech Stack:** TypeScript, cmd-ts (CLI parsing), Zod (schema validation), Vitest (testing) + +--- + +### Task 1: Add `extractThreshold` to core config-loader + +**Files:** +- Modify: `packages/core/src/evaluation/loaders/config-loader.ts:287` (after `extractTotalBudgetUsd`) +- Test: `packages/core/test/evaluation/loaders/config-loader.test.ts` + +**Step 1: Write the failing tests** + +Add to `packages/core/test/evaluation/loaders/config-loader.test.ts` after the `extractFailOnError` describe block: + +```typescript +describe('extractThreshold', () => { + it('returns undefined when no execution block', () => { + const suite: JsonObject = { tests: [] }; + expect(extractThreshold(suite)).toBeUndefined(); + }); + + it('returns undefined when threshold not set', () => { + const suite: JsonObject = { execution: { target: 'default' } }; + expect(extractThreshold(suite)).toBeUndefined(); + }); + + it('parses valid threshold', () => { + const suite: JsonObject = { execution: { threshold: 0.8 } }; + expect(extractThreshold(suite)).toBe(0.8); + }); + + it('accepts 0 as threshold', () => { + const suite: JsonObject = { execution: { threshold: 0 } }; + expect(extractThreshold(suite)).toBe(0); + }); + + it('accepts 1 as threshold', () => { + const suite: JsonObject = { execution: { threshold: 1 } }; + expect(extractThreshold(suite)).toBe(1); + }); + + it('returns undefined for negative threshold', () => { + const suite: JsonObject = { execution: { threshold: -0.1 } }; + expect(extractThreshold(suite)).toBeUndefined(); + }); + + it('returns undefined for threshold > 1', () => { + const suite: JsonObject = { execution: { threshold: 1.5 } }; + expect(extractThreshold(suite)).toBeUndefined(); + }); + + it('returns undefined for non-number threshold', () => { + const suite: JsonObject = { execution: { threshold: 'high' } }; + expect(extractThreshold(suite)).toBeUndefined(); + }); +}); +``` + +Also add `extractThreshold` to the import at the top of the test file. 
+ +**Step 2: Run tests to verify they fail** + +Run: `bun test packages/core/test/evaluation/loaders/config-loader.test.ts` +Expected: FAIL (`extractThreshold` not found) + +**Step 3: Implement `extractThreshold`** + +Add to `packages/core/src/evaluation/loaders/config-loader.ts` after `extractTotalBudgetUsd` (after line ~308): + +```typescript +/** + * Extract `execution.threshold` from parsed eval suite. + * Accepts a number in [0, 1] range. + * Returns undefined when not specified. + */ +export function extractThreshold(suite: JsonObject): number | undefined { + const execution = suite.execution; + if (!execution || typeof execution !== 'object' || Array.isArray(execution)) { + return undefined; + } + + const executionObj = execution as Record<string, unknown>; + const raw = executionObj.threshold; + + if (raw === undefined || raw === null) { + return undefined; + } + + if (typeof raw === 'number' && raw >= 0 && raw <= 1) { + return raw; + } + + logWarning( + `Invalid execution.threshold: ${raw}. Must be a number between 0 and 1. 
Ignoring.`, + ); + return undefined; +} +``` + +**Step 4: Run tests to verify they pass** + +Run: `bun test packages/core/test/evaluation/loaders/config-loader.test.ts` +Expected: PASS + +**Step 5: Commit** + +```bash +git add packages/core/src/evaluation/loaders/config-loader.ts packages/core/test/evaluation/loaders/config-loader.test.ts +git commit -m "feat(core): add extractThreshold for execution.threshold YAML field (#698)" +``` + +--- + +### Task 2: Wire `extractThreshold` through YAML parser and schema + +**Files:** +- Modify: `packages/core/src/evaluation/yaml-parser.ts:12` (imports), `:58` (re-exports), `:204` (loadTestSuite) +- Modify: `packages/core/src/evaluation/yaml-parser.ts:168` (EvalSuiteResult type) +- Modify: `packages/core/src/evaluation/validation/eval-file.schema.ts:317` (ExecutionSchema) + +**Step 1: Add `threshold` to ExecutionSchema in eval-file.schema.ts** + +In `packages/core/src/evaluation/validation/eval-file.schema.ts`, add to the `ExecutionSchema` object (after `failOnError` at line 330): + +```typescript + threshold: z.number().min(0).max(1).optional(), +``` + +**Step 2: Add to EvalSuiteResult type in yaml-parser.ts** + +In `packages/core/src/evaluation/yaml-parser.ts`, add to the `EvalSuiteResult` type (after `failOnError` at line 182): + +```typescript + /** Suite-level quality threshold (0-1) — suite fails if mean score is below */ + readonly threshold?: number; +``` + +**Step 3: Import and re-export `extractThreshold` in yaml-parser.ts** + +Add `extractThreshold` to the import from `./loaders/config-loader.js` (line 12 area) and the re-export block (line 58 area). 
+ +**Step 4: Use in `loadTestSuite`** + +In the `loadTestSuite` function (around line 203), extract and return threshold: + +```typescript + const threshold = extractThreshold(parsed); + return { + tests, + trials: extractTrialsConfig(parsed), + targets: extractTargetsFromSuite(parsed), + workers: extractWorkersFromSuite(parsed), + cacheConfig: extractCacheConfig(parsed), + totalBudgetUsd: extractTotalBudgetUsd(parsed), + ...(metadata !== undefined && { metadata }), + ...(failOnError !== undefined && { failOnError }), + ...(threshold !== undefined && { threshold }), + }; +``` + +**Step 5: Regenerate the JSON schema** + +Run: `bun run generate:schema` + +**Step 6: Run core tests** + +Run: `bun test packages/core/test/evaluation/loaders/config-loader.test.ts` +Expected: PASS + +**Step 7: Commit** + +```bash +git add packages/core/src/evaluation/validation/eval-file.schema.ts packages/core/src/evaluation/yaml-parser.ts +git commit -m "feat(core): wire extractThreshold through YAML parser and schema (#698)" +``` + +--- + +### Task 3: Add `--threshold` CLI flag and pass through to run-eval + +**Files:** +- Modify: `apps/cli/src/commands/eval/commands/run.ts` (add CLI flag) +- Modify: `apps/cli/src/commands/eval/run-eval.ts` (NormalizedOptions, normalizeOptions, handler return) + +**Step 1: Add CLI flag to run.ts** + +In `apps/cli/src/commands/eval/commands/run.ts`, add after the `model` option (around line 171): + +```typescript + threshold: option({ + type: optional(number), + long: 'threshold', + description: 'Suite-level quality gate: exit 1 if mean score falls below this value (0-1)', + }), +``` + +And add `threshold: args.threshold` to the `rawOptions` object in the handler (around line 219). 
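One gap worth noting: the YAML path rejects values outside [0, 1] (in `extractThreshold` and in the Zod schema), but the flag definition shown in Step 1 does not by itself constrain the range. A minimal sketch of a guard follows. `parseThresholdArg` is a hypothetical helper that is not part of this plan, and throwing (rather than warning and ignoring, as the YAML path does) is an assumption about the desired CLI behavior.

```typescript
// Hypothetical helper: validates a raw --threshold value before it enters the options pipeline.
function parseThresholdArg(raw: number | undefined): number | undefined {
  if (raw === undefined) return undefined; // flag omitted: no gate
  if (!Number.isFinite(raw) || raw < 0 || raw > 1) {
    throw new Error(`--threshold must be a number between 0 and 1, got ${raw}`);
  }
  return raw;
}

console.log(parseThresholdArg(0.8));
console.log(parseThresholdArg(undefined));
```

Failing fast on a bad flag value avoids a CI job that silently runs with no gate because of a typo such as `--threshold 80`.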
+ +**Step 2: Add to NormalizedOptions in run-eval.ts** + +In `apps/cli/src/commands/eval/run-eval.ts`, add to the `NormalizedOptions` interface: + +```typescript + readonly threshold?: number; +``` + +**Step 3: Add to normalizeOptions** + +In the `normalizeOptions` function, add threshold resolution (CLI > YAML): + +```typescript + // Resolve threshold: CLI --threshold > YAML execution.threshold + const cliThreshold = normalizeOptionalNumber(rawOptions.threshold); +``` + +And in the return statement: + +```typescript + threshold: cliThreshold, +``` + +**Step 4: Wire YAML threshold into normalized options** + +In `runEvalCommand`, after `prepareEvalFile` returns, merge the YAML threshold if CLI didn't set one. In the loop over eval files (around the `prepareEvalFile` call), capture `suite.threshold` and pass it through. + +The cleanest approach: read the YAML threshold in `prepareEvalFile` and return it alongside the other fields. Then in the main `runEvalCommand`, resolve CLI vs YAML threshold. 
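The CLI-over-YAML precedence can be sketched as below. `resolveThreshold` is a hypothetical name, and taking the first YAML threshold that is set when several eval files are run is one possible policy rather than a confirmed one. The sketch mainly shows why nullish coalescing (`??`) matters here: it keeps an explicit `0`, whereas `||` would treat `0` as unset.

```typescript
// Hypothetical helper: the CLI flag wins; otherwise use the first YAML threshold that is set.
function resolveThreshold(
  cliThreshold: number | undefined,
  yamlThresholds: readonly (number | undefined)[],
): number | undefined {
  const firstYaml = yamlThresholds.find((t) => t !== undefined);
  // `??` only falls through on undefined/null, so --threshold 0 still overrides YAML.
  return cliThreshold ?? firstYaml;
}

console.log(resolveThreshold(0, [0.8]));
console.log(resolveThreshold(undefined, [undefined, 0.7]));
```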
+ +Add `threshold` to the `prepareEvalFile` return type (alongside `failOnError`): + +```typescript + readonly threshold?: number; +``` + +And in `prepareEvalFile`, add after `failOnError: suite.failOnError`: + +```typescript + threshold: suite.threshold, +``` + +**Step 5: Commit** + +```bash +git add apps/cli/src/commands/eval/commands/run.ts apps/cli/src/commands/eval/run-eval.ts +git commit -m "feat(cli): add --threshold flag and wire through options pipeline (#698)" +``` + +--- + +### Task 4: Add threshold check and summary output after eval completes + +**Files:** +- Modify: `apps/cli/src/commands/eval/run-eval.ts` (after summary calculation ~line 1152) +- Modify: `apps/cli/src/commands/eval/statistics.ts` (add `formatThresholdSummary`) +- Test: `apps/cli/test/commands/eval/threshold.test.ts` (new) + +**Step 1: Write failing tests** + +Create `apps/cli/test/commands/eval/threshold.test.ts`: + +```typescript +import { describe, expect, it } from 'bun:test'; + +import type { EvaluationResult } from '@agentv/core'; + +import { formatThresholdSummary } from '../../../src/commands/eval/statistics.js'; + +function makeResult(overrides: Partial<EvaluationResult> = {}): EvaluationResult { + return { + timestamp: '2024-01-01T00:00:00Z', + testId: 'test-1', + score: 1.0, + assertions: [{ text: 'criterion-1', passed: true }], + output: [{ role: 'assistant' as const, content: 'answer' }], + target: 'default', + ...overrides, + }; +} + +describe('formatThresholdSummary', () => { + it('returns PASS when mean score meets threshold', () => { + const result = formatThresholdSummary(0.85, 0.6); + expect(result.passed).toBe(true); + expect(result.message).toContain('0.85'); + expect(result.message).toContain('0.60'); + expect(result.message).toContain('PASS'); + }); + + it('returns FAIL when mean score is below threshold', () => { + const result = formatThresholdSummary(0.53, 0.6); + expect(result.passed).toBe(false); + expect(result.message).toContain('0.53'); + 
expect(result.message).toContain('0.60'); + expect(result.message).toContain('FAIL'); + }); + + it('returns PASS when mean score exactly equals threshold', () => { + const result = formatThresholdSummary(0.6, 0.6); + expect(result.passed).toBe(true); + }); + + it('returns PASS for threshold 0 with any score', () => { + const result = formatThresholdSummary(0, 0); + expect(result.passed).toBe(true); + }); +}); +``` + +**Step 2: Run tests to verify they fail** + +Run: `bun test apps/cli/test/commands/eval/threshold.test.ts` +Expected: FAIL — `formatThresholdSummary` not found + +**Step 3: Implement `formatThresholdSummary` in statistics.ts** + +Add to `apps/cli/src/commands/eval/statistics.ts`: + +```typescript +/** + * Format a threshold check summary line. + * Returns whether the threshold was met and the formatted message. + */ +export function formatThresholdSummary( + meanScore: number, + threshold: number, +): { passed: boolean; message: string } { + const passed = meanScore >= threshold; + const verdict = passed ? 'PASS' : 'FAIL'; + const message = `Suite score: ${meanScore.toFixed(2)} (threshold: ${threshold.toFixed(2)}) — ${verdict}`; + return { passed, message }; +} +``` + +**Step 4: Run tests to verify they pass** + +Run: `bun test apps/cli/test/commands/eval/threshold.test.ts` +Expected: PASS + +**Step 5: Wire the threshold check into run-eval.ts** + +In `apps/cli/src/commands/eval/run-eval.ts`, after the summary is printed (around line 1153), add: + +```typescript + // Threshold quality gate check + const resolvedThreshold = options.threshold ?? yamlThreshold; + if (resolvedThreshold !== undefined) { + const { formatThresholdSummary } = await import('./statistics.js'); + const thresholdResult = formatThresholdSummary(summary.mean, resolvedThreshold); + console.log(`\n${thresholdResult.message}`); + if (!thresholdResult.passed) { + process.exitCode = 1; + } + } +``` + +Note: `yamlThreshold` needs to be captured from the `prepareEvalFile` results. 
If multiple eval files are run, use the first non-undefined threshold (or the CLI value). + +Import `formatThresholdSummary` statically at the top (preferred over dynamic import since it's in the same package): + +```typescript +import { + calculateEvaluationSummary, + formatEvaluationSummary, + formatMatrixSummary, + formatThresholdSummary, +} from './statistics.js'; +``` + +**Step 6: Commit** + +```bash +git add apps/cli/src/commands/eval/statistics.ts apps/cli/src/commands/eval/run-eval.ts apps/cli/test/commands/eval/threshold.test.ts +git commit -m "feat(cli): add threshold check with summary output after eval (#698)" +``` + +--- + +### Task 5: JUnit writer uses threshold for per-test pass/fail + +**Files:** +- Modify: `apps/cli/src/commands/eval/junit-writer.ts` +- Modify: `apps/cli/test/commands/eval/output-writers.test.ts` (add tests) + +**Step 1: Write failing tests** + +Add to `apps/cli/test/commands/eval/output-writers.test.ts` in the JUnit describe block: + +```typescript + it('uses custom threshold for pass/fail when provided', async () => { + const filePath = path.join(testDir, `junit-threshold-${Date.now()}.xml`); + const writer = await JunitWriter.open(filePath, { threshold: 0.8 }); + + await writer.append(makeResult({ testId: 'high', score: 0.9 })); + await writer.append(makeResult({ testId: 'mid', score: 0.6 })); + await writer.close(); + + const xml = await readFile(filePath, 'utf8'); + expect(xml).not.toContain(' { + const filePath = path.join(testDir, `junit-default-${Date.now()}.xml`); + const writer = await JunitWriter.open(filePath); + + await writer.append(makeResult({ testId: 'pass', score: 0.6 })); + await writer.append(makeResult({ testId: 'fail', score: 0.3 })); + await writer.close(); + + const xml = await readFile(filePath, 'utf8'); + expect(xml).not.toContain(' { + await mkdir(path.dirname(filePath), { recursive: true }); + return new JunitWriter(filePath, options); + } +``` + +Then replace all `r.score < 0.5` with `r.score < 
this.threshold` in the `close()` method. + +**Step 4: Pass threshold to JunitWriter in output-writer.ts** + +In `apps/cli/src/commands/eval/output-writer.ts`, where JunitWriter is created, pass the threshold. Check how output writers are created and thread the threshold through. + +**Step 5: Run tests to verify they pass** + +Run: `bun test apps/cli/test/commands/eval/output-writers.test.ts` +Expected: PASS + +**Step 6: Commit** + +```bash +git add apps/cli/src/commands/eval/junit-writer.ts apps/cli/src/commands/eval/output-writer.ts apps/cli/test/commands/eval/output-writers.test.ts +git commit -m "feat(cli): JUnit writer uses --threshold for per-test pass/fail (#698)" +``` + +--- + +### Task 6: Add `threshold` to Zod schema and regenerate JSON schema + +**Files:** +- Modify: `packages/core/src/evaluation/validation/eval-file.schema.ts` (already done in Task 2) +- Run: `bun run generate:schema` + +**Step 1: Verify threshold is in ExecutionSchema** + +Read `packages/core/src/evaluation/validation/eval-file.schema.ts` and confirm `threshold` was added in Task 2. 
+ +**Step 2: Regenerate JSON schema** + +Run: `bun run generate:schema` + +**Step 3: Run validate:examples to check existing YAML files still pass** + +Run: `bun run validate:examples` +Expected: PASS (threshold is optional, so existing files are unaffected) + +**Step 4: Commit if schema file changed** + +```bash +git add packages/core/ +git commit -m "chore: regenerate eval-schema.json with threshold field (#698)" +``` + +--- + +### Task 7: Run full test suite and verify + +**Step 1: Run all tests** + +Run: `bun run test` +Expected: PASS (except any pre-existing known failures) + +**Step 2: Run typecheck** + +Run: `bun run typecheck` +Expected: PASS + +**Step 3: Run lint** + +Run: `bun run lint` +Expected: PASS + +**Step 4: Run build** + +Run: `bun run build` +Expected: PASS + +--- + +### Task 8: Manual red/green UAT + +**Step 1: Red — verify no threshold behavior on main** + +Run an eval without --threshold: + +```bash +bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id summary-1 +``` + +Confirm: no "Suite score" line in output, exit code is 0. + +**Step 2: Green — verify --threshold works** + +Run with a threshold that should PASS: + +```bash +bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id summary-1 --threshold 0.3 +``` + +Confirm: "Suite score: X.XX (threshold: 0.30) — PASS" printed, exit code 0. + +Run with a threshold that should FAIL: + +```bash +bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id summary-1 --threshold 0.99 +``` + +Confirm: "Suite score: X.XX (threshold: 0.99) — FAIL" printed, exit code 1. + +**Step 3: Verify JUnit output uses threshold** + +```bash +bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id summary-1 --threshold 0.9 -o /tmp/test-threshold.xml +``` + +Inspect the XML: tests with score < 0.9 should have `` elements. 
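The per-test rule that Step 3 inspects can be summarized as a single predicate. This is a sketch only: `isJunitFailure` and `DEFAULT_JUNIT_THRESHOLD` are hypothetical names, and the actual writer applies the comparison inside its `close()` method.

```typescript
// The design doc states that 0.5 is the existing hardcoded cutoff, kept when no threshold is set.
const DEFAULT_JUNIT_THRESHOLD = 0.5;

// Hypothetical predicate: a test is reported as failed when its score is strictly below the cutoff.
function isJunitFailure(score: number, threshold?: number): boolean {
  return score < (threshold ?? DEFAULT_JUNIT_THRESHOLD);
}

console.log(isJunitFailure(0.6, 0.8)); // below a custom threshold
console.log(isJunitFailure(0.6)); // compared against the default cutoff
```

Because the comparison is strict, a score exactly equal to the threshold passes, which matches the suite-level rule `mean score >= threshold`.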
From a1f283979757096f9ad9899a14d41ec00099576c Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Wed, 25 Mar 2026 02:20:29 +0000 Subject: [PATCH 03/11] docs(repo): restore dropped sections in AGENTS.md PR #757 moved content from CLAUDE.md to AGENTS.md but accidentally dropped several sections: Evaluator Type System, Git Workflow (issue claiming, PRs, worktrees), Version Management, Package Publishing, and Python Scripts. Co-Authored-By: Claude Opus 4.6 Entire-Checkpoint: 1ed266d094ed --- AGENTS.md | 182 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 182 insertions(+) diff --git a/AGENTS.md b/AGENTS.md index e0a6b1aa8..cb022cb1b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -258,3 +258,185 @@ When making changes to functionality: 2. **Skill files** (`plugins/agentv-dev/skills/agentv-eval-builder/`): Update the AI-focused reference card if the change affects YAML schema, evaluator types, or CLI commands. Keep concise — link to docs site for details. 3. **Examples** (`examples/`): Update any example code, scripts, or eval YAML files that exercise the changed functionality. Examples are both documentation and integration tests. + +4. **README.md**: Keep minimal. Links point to agentv.dev. + +## Evaluator Type System + +Evaluator types use **kebab-case** everywhere (matching promptfoo convention): + +- **YAML config:** `type: llm-grader`, `type: is-json`, `type: execution-metrics` +- **Internal TypeScript:** `EvaluatorKind = 'llm-grader' | 'is-json' | ...` +- **Output `scores[].type`:** `"llm-grader"`, `"is-json"` +- **Registry keys:** `registry.register('llm-grader', ...)` + +**Source of truth:** `EVALUATOR_KIND_VALUES` array in `packages/core/src/evaluation/types.ts` + +**Backward compatibility:** Snake_case is accepted in YAML (`llm_judge` → `llm-grader`) via `normalizeEvaluatorType()` in `evaluator-parser.ts`. Single-word types (`contains`, `equals`, `regex`, `latency`, `cost`) have no separator and are unchanged. 
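The backward-compatibility rule can be illustrated with a short sketch. The real `normalizeEvaluatorType()` in `evaluator-parser.ts` is not shown here, so everything below is an assumption about its shape: `normalizeEvaluatorTypeSketch` and `LEGACY_ALIASES` are hypothetical names, and `llm_judge` → `llm-grader` is the only mapping the text confirms.

```typescript
// Assumption: renamed evaluators go through an alias table after separator normalization.
const LEGACY_ALIASES: ReadonlyMap<string, string> = new Map([
  ['llm-judge', 'llm-grader'], // the one alias documented above
]);

// Hypothetical sketch: convert snake_case to kebab-case, then apply any legacy alias.
function normalizeEvaluatorTypeSketch(type: string): string {
  const kebab = type.replace(/_/g, '-');
  return LEGACY_ALIASES.get(kebab) ?? kebab;
}

console.log(normalizeEvaluatorTypeSketch('llm_judge'));
console.log(normalizeEvaluatorTypeSketch('contains'));
```

Single-word types contain no separator, so they pass through unchanged, consistent with the note above.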
+ +**Two type definitions exist:** +- `EvaluatorKind` in `packages/core/src/evaluation/types.ts` — internal, canonical +- `AssertionType` in `packages/eval/src/assertion.ts` — SDK-facing, must stay in sync + +## Git Workflow + +### Commit Convention + +Follow conventional commits: `type(scope): description` + +Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore` + +### Issue Workflow + +When working on a GitHub issue, **ALWAYS** follow this workflow: + +1. **Claim the issue** — prevents other agents from duplicating work: + ```bash + # Load AGENT_ID from .env; if not set, ask the user or default to - + # Harness = the coding tool (claude-code, opencode, codex-cli, cursor, etc.) + # Model = the LLM (opus, sonnet, o3, etc.) + # Examples: "claude-code-opus", "opencode-sonnet", "cursor-o3", "codex-cli-o3" + # In this local dev environment, default to "devbox2-codex" unless the user specifies another AGENT_ID. + # Do NOT use hostname or machine name. + source .env 2>/dev/null + if [ -z "$AGENT_ID" ]; then + echo "AGENT_ID is not set. Ask the user for an agent identifier, or default to devbox2-codex in this environment (otherwise use -)." 
+ fi + + # Check if already claimed + gh issue view --json labels --jq '.labels[].name' | grep -q "in-progress" && echo "SKIP — already claimed" && exit 1 + + # Claim it — label + project roadmap status + gh issue edit --add-label "in-progress" + + # Update project roadmap: set status to "In Progress" and stamp Agent ID + ITEM_ID=$(gh project item-list 1 --owner EntityProcess --format json | jq -r '.items[] | select(.content.number == and .content.repository == "agentv") | .id') + if [ -n "$ITEM_ID" ]; then + gh project item-edit --project-id PVT_kwDOAIbbRc4BSmjF --id "$ITEM_ID" --field-id PVTSSF_lADOAIbbRc4BSmjFzhAFomw --single-select-option-id 47fc9ee4 + gh project item-edit --project-id PVT_kwDOAIbbRc4BSmjF --id "$ITEM_ID" --field-id PVTF_lADOAIbbRc4BSmjFzhAHSnk --text "$AGENT_ID" + fi + ``` + If the issue has the `in-progress` label, **do not work on it** — pick a different issue. + +2. **Create a worktree** with a feature branch: + ```bash + git worktree add agentv.worktrees/ -b /- + cd agentv.worktrees/ + bun install + cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env + # Example: git worktree add agentv.worktrees/feat/42-add-new-embedder -b feat/42-add-new-embedder + ``` + +3. **Implement the changes** and commit following the commit convention + +4. **Push the branch and create a Pull Request**: + ```bash + git push -u origin + gh pr create --title "(scope): description" --body "Closes #" + ``` + +5. **Before merging**, ensure: + - **E2E verification completed** (see "Completing Work — E2E Checklist") + - CI pipeline passes (all checks green) + - Code has been reviewed if required + - No merge conflicts with `main` + +The `in-progress` label stays on the issue until the PR is merged and the issue is closed. Do not remove it manually. + +**IMPORTANT:** Never push directly to `main`. Always use branches and PRs. + +### Tracker Conventions + +- The roadmap project is the source of truth for prioritization. 
+- Issues in the roadmap are prioritized; issues outside it are not. +- `bug` marks defects. +- Issues without `bug` are non-bug work by default. +- `in-progress` marks an issue as claimed by an agent — do not start work on it. +- `core`, `wui`, and `tui` are area labels. +- Keep issue bodies focused on the handoff contract: objective, design latitude, acceptance signals, non-goals, and related links. +- Do not put priority metadata in issue bodies. + +### Pull Requests + +**Always use squash merge** when merging PRs to main. This keeps the commit history clean with one commit per feature/fix. + +```bash +# Using GitHub CLI to squash merge a PR +gh pr merge --squash --delete-branch + +# Or with auto-merge enabled +gh pr merge --squash --auto +``` + +Do NOT use regular merge or rebase merge, as these create noisy commit history with intermediate commits. + +### After Squash Merge + +Once a PR is squash-merged, its source branch diverges from main. **Do NOT** try to push additional commits from that branch—you will get merge conflicts. + +For follow-up fixes: +```bash +git checkout main +git pull origin main +git checkout -b fix/ +# Apply fixes on the fresh branch +``` + +### Plans and Worktrees + +#### Plans + +Design documents and implementation plans are stored in `docs/plans/` inside the worktree (not the main repo). Save plans to the worktree so they are committed on the feature branch and visible in the draft PR. + +**Path warning:** When working in a worktree, use paths relative to the worktree root (e.g., `docs/plans/plan.md`). Do NOT prefix with the worktree directory from the main repo (e.g., `agentv.worktrees/feat/xxx/docs/plans/plan.md`) — this creates accidental nested directories inside the worktree. + +Plans are temporary working materials. **Before merging the PR**, delete the plan file and incorporate any user-relevant details into the official documentation. 
+ +#### Git Worktrees + +Use the sibling `../agentv.worktrees/` directory for all AgentV worktrees. This overrides any generic skill or default preference for `.worktrees/` or `worktrees/` inside the repository. Do not create new AgentV worktrees inside the repository root. + +After creating a worktree, always run setup: +```bash +bun install # worktrees do NOT share node_modules +cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env # required for e2e tests and LLM operations +``` +Both steps are required before running builds, tests, or evals in the worktree. + +## Version Management + +This project uses a simple release script for version bumping. The git commit history serves as the changelog. + +### Releasing a new version + +Run the release script for a version bump: + +```bash +bun run release # patch bump (default) +bun run release minor # minor bump +bun run release major # major bump +``` + +The script will: +1. Validate you're on the `main` branch with no uncommitted changes +2. Pull latest changes from origin +3. Bump version in all package.json files +4. Commit the version bump +5. 
Create and push a git tag + +Recommended publish flow: +```bash +bun run publish:next # publish current version to npm `next` +bun run promote:latest # promote same version to npm `latest` +bun run tag:next 2.18.0 +bun run promote:latest 2.18.0 +``` + +## Package Publishing +- Core package (`packages/core/`) - Core evaluation engine and grading logic (published as `@agentv/core`) +- CLI package (`apps/cli/`) is published as `agentv` on npm +- Uses tsup with `noExternal: ["@agentv/core"]` to bundle workspace dependencies +- Install command: `bun install -g agentv` (preferred) or `npm install -g agentv` + +## Python Scripts +When running Python scripts, always use: `uv run ` From 430418a510b5c179ee4d4bb79d2c8c0c660a5adf Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Wed, 25 Mar 2026 02:23:00 +0000 Subject: [PATCH 04/11] feat(core): add extractThreshold for execution.threshold YAML field (#698) Co-Authored-By: Claude Opus 4.6 --- .../src/evaluation/loaders/config-loader.ts | 28 ++++++++++++ .../evaluation/loaders/config-loader.test.ts | 43 +++++++++++++++++++ 2 files changed, 71 insertions(+) diff --git a/packages/core/src/evaluation/loaders/config-loader.ts b/packages/core/src/evaluation/loaders/config-loader.ts index 4835dcbd2..daa2aa7aa 100644 --- a/packages/core/src/evaluation/loaders/config-loader.ts +++ b/packages/core/src/evaluation/loaders/config-loader.ts @@ -333,6 +333,34 @@ export function extractFailOnError(suite: JsonObject): FailOnError | undefined { return undefined; } +/** + * Extract `execution.threshold` from parsed eval suite. + * Accepts a number in [0, 1] range. + * Returns undefined when not specified. 
+ */ +export function extractThreshold(suite: JsonObject): number | undefined { + const execution = suite.execution; + if (!execution || typeof execution !== 'object' || Array.isArray(execution)) { + return undefined; + } + + const executionObj = execution as Record<string, unknown>; + const raw = executionObj.threshold; + + if (raw === undefined || raw === null) { + return undefined; + } + + if (typeof raw === 'number' && raw >= 0 && raw <= 1) { + return raw; + } + + logWarning( + `Invalid execution.threshold: ${raw}. Must be a number between 0 and 1. Ignoring.`, + ); + return undefined; +} + export function parseExecutionDefaults( raw: unknown, configPath: string, diff --git a/packages/core/test/evaluation/loaders/config-loader.test.ts b/packages/core/test/evaluation/loaders/config-loader.test.ts index 27dd52c1e..ac68e0eb9 100644 --- a/packages/core/test/evaluation/loaders/config-loader.test.ts +++ b/packages/core/test/evaluation/loaders/config-loader.test.ts @@ -5,6 +5,7 @@ import { extractTargetFromSuite, extractTargetsFromSuite, extractTargetsFromTestCase, + extractThreshold, extractTotalBudgetUsd, extractTrialsConfig, parseExecutionDefaults, @@ -302,6 +303,48 @@ describe('extractFailOnError', () => { }); }); +describe('extractThreshold', () => { + it('returns undefined when no execution block', () => { + const suite: JsonObject = { tests: [] }; + expect(extractThreshold(suite)).toBeUndefined(); + }); + + it('returns undefined when threshold not set', () => { + const suite: JsonObject = { execution: { target: 'default' } }; + expect(extractThreshold(suite)).toBeUndefined(); + }); + + it('parses valid threshold', () => { + const suite: JsonObject = { execution: { threshold: 0.8 } }; + expect(extractThreshold(suite)).toBe(0.8); + }); + + it('accepts 0 as threshold', () => { + const suite: JsonObject = { execution: { threshold: 0 } }; + expect(extractThreshold(suite)).toBe(0); + }); + + it('accepts 1 as threshold', () => { + const suite: JsonObject = { execution: { threshold: 1 } 
}; + expect(extractThreshold(suite)).toBe(1); + }); + + it('returns undefined for negative threshold', () => { + const suite: JsonObject = { execution: { threshold: -0.1 } }; + expect(extractThreshold(suite)).toBeUndefined(); + }); + + it('returns undefined for threshold > 1', () => { + const suite: JsonObject = { execution: { threshold: 1.5 } }; + expect(extractThreshold(suite)).toBeUndefined(); + }); + + it('returns undefined for non-number threshold', () => { + const suite: JsonObject = { execution: { threshold: 'high' } }; + expect(extractThreshold(suite)).toBeUndefined(); + }); +}); + describe('parseExecutionDefaults', () => { it('returns undefined when no execution block', () => { expect(parseExecutionDefaults(undefined, '/test/config.yaml')).toBeUndefined(); From 5b6a80d8a378216bafe7332e3c7ecd05edc7a714 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Wed, 25 Mar 2026 02:27:42 +0000 Subject: [PATCH 05/11] feat(core): wire extractThreshold through YAML parser and schema (#698) Add threshold to ExecutionSchema in Zod, wire extractThreshold through yaml-parser.ts (import, re-export, EvalSuiteResult type, loadTestSuite), and regenerate eval-schema.json. 

Co-Authored-By: Claude Opus 4.6
---
 .../evaluation/validation/eval-file.schema.ts |    1 +
 packages/core/src/evaluation/yaml-parser.ts   |    6 +
 .../references/eval-schema.json               | 3545 +++++++++++++----
 3 files changed, 2881 insertions(+), 671 deletions(-)

diff --git a/packages/core/src/evaluation/validation/eval-file.schema.ts b/packages/core/src/evaluation/validation/eval-file.schema.ts
index 9ebf758a9..084eb3466 100644
--- a/packages/core/src/evaluation/validation/eval-file.schema.ts
+++ b/packages/core/src/evaluation/validation/eval-file.schema.ts
@@ -328,6 +328,7 @@ const ExecutionSchema = z.object({
   totalBudgetUsd: z.number().min(0).optional(),
   fail_on_error: FailOnErrorSchema.optional(),
   failOnError: FailOnErrorSchema.optional(),
+  threshold: z.number().min(0).max(1).optional(),
 });
 
 // ---------------------------------------------------------------------------
diff --git a/packages/core/src/evaluation/yaml-parser.ts b/packages/core/src/evaluation/yaml-parser.ts
index 119cadb47..e8004bbc9 100644
--- a/packages/core/src/evaluation/yaml-parser.ts
+++ b/packages/core/src/evaluation/yaml-parser.ts
@@ -13,6 +13,7 @@ import {
   extractTargetFromSuite,
   extractTargetsFromSuite,
   extractTargetsFromTestCase,
+  extractThreshold,
   extractTotalBudgetUsd,
   extractTrialsConfig,
   extractWorkersFromSuite,
@@ -59,6 +60,7 @@ export {
   extractTargetFromSuite,
   extractTargetsFromSuite,
   extractTargetsFromTestCase,
+  extractThreshold,
   extractTrialsConfig,
   extractWorkersFromSuite,
   loadConfig,
@@ -180,6 +182,8 @@ export type EvalSuiteResult = {
   readonly totalBudgetUsd?: number;
   /** Execution error tolerance: true or false */
   readonly failOnError?: import('./types.js').FailOnError;
+  /** Suite-level quality threshold (0-1) — suite fails if mean score is below */
+  readonly threshold?: number;
 };
 
 /**
@@ -201,6 +205,7 @@ export async function loadTestSuite(
   const { tests, parsed } = await loadTestsFromYaml(evalFilePath, repoRoot, options);
   const metadata = parseMetadata(parsed);
   const failOnError =
extractFailOnError(parsed); + const threshold = extractThreshold(parsed); return { tests, trials: extractTrialsConfig(parsed), @@ -210,6 +215,7 @@ export async function loadTestSuite( totalBudgetUsd: extractTotalBudgetUsd(parsed), ...(metadata !== undefined && { metadata }), ...(failOnError !== undefined && { failOnError }), + ...(threshold !== undefined && { threshold }), }; } diff --git a/plugins/agentv-dev/skills/agentv-eval-writer/references/eval-schema.json b/plugins/agentv-dev/skills/agentv-eval-writer/references/eval-schema.json index a7be8362b..4df59a334 100644 --- a/plugins/agentv-dev/skills/agentv-eval-writer/references/eval-schema.json +++ b/plugins/agentv-dev/skills/agentv-eval-writer/references/eval-schema.json @@ -53,7 +53,12 @@ "properties": { "role": { "type": "string", - "enum": ["system", "user", "assistant", "tool"] + "enum": [ + "system", + "user", + "assistant", + "tool" + ] }, "content": { "anyOf": [ @@ -67,20 +72,29 @@ "properties": { "type": { "type": "string", - "enum": ["text", "file"] + "enum": [ + "text", + "file" + ] }, "value": { "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false } } ] } }, - "required": ["role", "content"], + "required": [ + "role", + "content" + ], "additionalProperties": false } } @@ -121,7 +135,12 @@ "properties": { "role": { "type": "string", - "enum": ["system", "user", "assistant", "tool"] + "enum": [ + "system", + "user", + "assistant", + "tool" + ] }, "content": { "anyOf": [ @@ -135,20 +154,29 @@ "properties": { "type": { "type": "string", - "enum": ["text", "file"] + "enum": [ + "text", + "file" + ] }, "value": { "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false } } ] } }, - "required": ["role", "content"], + "required": [ + "role", + "content" + ], "additionalProperties": false } } @@ -176,7 +204,12 @@ "properties": { "role": { "type": "string", - "enum": 
["system", "user", "assistant", "tool"] + "enum": [ + "system", + "user", + "assistant", + "tool" + ] }, "content": { "anyOf": [ @@ -190,20 +223,29 @@ "properties": { "type": { "type": "string", - "enum": ["text", "file"] + "enum": [ + "text", + "file" + ] }, "value": { "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false } } ] } }, - "required": ["role", "content"], + "required": [ + "role", + "content" + ], "additionalProperties": false } } @@ -240,7 +282,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -292,7 +339,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -322,7 +372,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -416,7 +471,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -445,7 +503,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -505,7 +565,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -521,7 +583,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -538,7 +603,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -555,13 +623,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - 
"required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -591,11 +664,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -636,7 +718,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -650,7 +737,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -661,7 +753,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -669,7 +763,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -683,7 +782,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -694,7 +798,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -724,7 +831,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -736,7 +846,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -758,17 +872,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], 
"additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -805,7 +928,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -842,7 +968,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -872,7 +1001,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -887,7 +1019,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -917,7 +1051,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -949,7 +1086,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -985,7 +1124,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -1021,7 +1163,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -1051,10 +1196,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -1090,7 +1240,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -1171,7 +1324,10 @@ "minLength": 1 } }, - "required": 
["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -1181,7 +1337,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -1218,7 +1377,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -1270,7 +1434,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -1300,7 +1467,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -1394,7 +1566,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -1423,7 +1598,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -1483,7 +1660,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -1499,7 +1678,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -1516,7 +1698,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -1533,13 +1718,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -1569,11 +1759,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + 
"tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -1614,7 +1813,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -1628,7 +1832,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -1639,7 +1848,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -1647,7 +1858,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -1661,7 +1877,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -1672,7 +1893,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -1702,7 +1926,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -1714,7 +1941,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -1736,17 +1967,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", 
"fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -1783,7 +2023,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -1820,7 +2063,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -1850,7 +2096,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -1865,7 +2114,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -1895,7 +2146,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -1927,7 +2181,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -1963,7 +2219,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -1999,7 +2258,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -2029,10 +2291,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -2068,7 +2335,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -2149,7 +2419,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -2159,7 +2432,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + 
"type", + "criteria" + ], "additionalProperties": false } ] @@ -2196,7 +2472,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -2248,7 +2529,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -2278,7 +2562,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -2372,7 +2661,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -2401,7 +2693,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -2461,7 +2755,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -2477,7 +2773,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -2494,7 +2793,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -2511,13 +2813,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -2547,11 +2854,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + 
"superset" + ] }, "minimums": { "type": "object", @@ -2592,7 +2908,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -2606,7 +2927,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -2617,7 +2943,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -2625,7 +2953,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -2639,7 +2972,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -2650,7 +2988,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -2680,7 +3021,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -2692,7 +3036,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -2714,17 +3062,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -2761,7 +3118,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], 
"additionalProperties": false }, { @@ -2798,7 +3158,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -2828,7 +3191,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -2843,7 +3209,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -2873,7 +3241,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -2905,7 +3276,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -2941,7 +3314,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -2977,7 +3353,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -3007,10 +3386,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -3046,7 +3430,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -3127,7 +3514,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -3137,7 +3527,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -3191,7 +3584,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + 
"code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -3243,7 +3641,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -3273,7 +3674,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -3367,7 +3773,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -3396,7 +3805,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -3456,7 +3867,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -3472,7 +3885,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -3489,7 +3905,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -3506,13 +3925,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -3542,11 +3966,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -3587,7 +4020,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, 
{ "type": "array", @@ -3601,7 +4039,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -3612,7 +4055,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -3620,7 +4065,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -3634,7 +4084,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -3645,7 +4100,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -3675,7 +4133,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -3687,7 +4148,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -3709,17 +4174,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -3756,7 +4230,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -3793,7 +4270,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -3823,7 +4303,10 @@ }, "type": { 
"type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -3838,7 +4321,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -3868,7 +4353,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -3900,7 +4388,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -3936,7 +4426,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -3972,7 +4465,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -4002,10 +4498,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -4041,7 +4542,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -4122,7 +4626,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -4132,7 +4639,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -4169,7 +4679,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -4221,7 +4736,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], 
"additionalProperties": false }, { @@ -4251,7 +4769,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -4345,7 +4868,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -4374,7 +4900,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -4434,7 +4962,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -4450,7 +4980,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -4467,7 +5000,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -4484,13 +5020,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -4520,11 +5061,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -4565,7 +5115,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -4579,7 +5134,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", 
@@ -4590,7 +5150,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -4598,7 +5160,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -4612,7 +5179,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -4623,7 +5195,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -4653,7 +5228,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -4665,7 +5243,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -4687,17 +5269,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -4734,7 +5325,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -4771,7 +5365,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -4801,7 +5398,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -4816,7 +5416,9 @@ "minimum": 0 } }, - "required": ["type"], + 
"required": [ + "type" + ], "additionalProperties": false }, { @@ -4846,7 +5448,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -4878,7 +5483,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -4914,7 +5521,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -4950,7 +5560,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -4980,10 +5593,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -5019,7 +5637,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -5100,7 +5721,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -5110,7 +5734,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -5147,7 +5774,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -5199,7 +5831,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -5229,7 +5864,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + 
"llm_judge" + ] }, "prompt": { "anyOf": [ @@ -5323,7 +5963,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -5352,7 +5995,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -5412,7 +6057,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -5428,7 +6075,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -5445,7 +6095,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -5462,13 +6115,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -5498,11 +6156,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -5543,7 +6210,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -5557,7 +6229,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -5568,7 +6245,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -5576,7 +6255,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", 
"superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -5590,7 +6274,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -5601,7 +6290,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -5631,7 +6323,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -5643,7 +6338,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -5665,17 +6364,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -5712,7 +6420,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -5749,7 +6460,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -5779,7 +6493,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -5794,7 +6511,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -5824,7 +6543,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + 
"execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -5856,7 +6578,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -5892,7 +6616,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -5928,7 +6655,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -5958,10 +6688,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -5997,7 +6732,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -6078,7 +6816,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -6088,7 +6829,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -6109,7 +6853,11 @@ }, "strategy": { "type": "string", - "enum": ["pass_at_k", "mean", "confidence_interval"] + "enum": [ + "pass_at_k", + "mean", + "confidence_interval" + ] }, "cost_limit_usd": { "type": "number", @@ -6120,7 +6868,9 @@ "minimum": 0 } }, - "required": ["count"], + "required": [ + "count" + ], "additionalProperties": false }, "total_budget_usd": { @@ -6136,6 +6886,11 @@ }, "failOnError": { "type": "boolean" + }, + "threshold": { + "type": "number", + "minimum": 0, + "maximum": 1 } }, "additionalProperties": false @@ -6148,7 +6903,10 @@ }, "isolation": { "type": "string", - "enum": ["shared", "per_test"] + "enum": [ + "shared", + "per_test" + ] }, "repos": { "type": "array", @@ -6172,7 +6930,10 @@ "format": "uri" } }, - "required": ["type", "url"], + "required": [ + 
"type", + "url" + ], "additionalProperties": false }, { @@ -6186,7 +6947,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false } ] @@ -6199,7 +6963,10 @@ }, "resolve": { "type": "string", - "enum": ["remote", "local"] + "enum": [ + "remote", + "local" + ] }, "ancestor": { "type": "integer", @@ -6228,7 +6995,10 @@ "additionalProperties": false } }, - "required": ["path", "source"], + "required": [ + "path", + "source" + ], "additionalProperties": false } }, @@ -6264,7 +7034,11 @@ }, "reset": { "type": "string", - "enum": ["none", "fast", "strict"] + "enum": [ + "none", + "fast", + "strict" + ] } }, "additionalProperties": false @@ -6295,7 +7069,11 @@ }, "reset": { "type": "string", - "enum": ["none", "fast", "strict"] + "enum": [ + "none", + "fast", + "strict" + ] } }, "additionalProperties": false @@ -6326,7 +7104,11 @@ }, "reset": { "type": "string", - "enum": ["none", "fast", "strict"] + "enum": [ + "none", + "fast", + "strict" + ] } }, "additionalProperties": false @@ -6357,7 +7139,11 @@ }, "reset": { "type": "string", - "enum": ["none", "fast", "strict"] + "enum": [ + "none", + "fast", + "strict" + ] } }, "additionalProperties": false @@ -6367,7 +7153,11 @@ }, "mode": { "type": "string", - "enum": ["pooled", "temp", "static"] + "enum": [ + "pooled", + "temp", + "static" + ] }, "path": { "type": "string" @@ -6389,7 +7179,9 @@ "type": "string" } }, - "required": ["id"], + "required": [ + "id" + ], "additionalProperties": false } }, @@ -6427,7 +7219,12 @@ "properties": { "role": { "type": "string", - "enum": ["system", "user", "assistant", "tool"] + "enum": [ + "system", + "user", + "assistant", + "tool" + ] }, "content": { "anyOf": [ @@ -6441,20 +7238,29 @@ "properties": { "type": { "type": "string", - "enum": ["text", "file"] + "enum": [ + "text", + "file" + ] }, "value": { "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], 
"additionalProperties": false } } ] } }, - "required": ["role", "content"], + "required": [ + "role", + "content" + ], "additionalProperties": false } } @@ -6482,7 +7288,12 @@ "properties": { "role": { "type": "string", - "enum": ["system", "user", "assistant", "tool"] + "enum": [ + "system", + "user", + "assistant", + "tool" + ] }, "content": { "anyOf": [ @@ -6496,20 +7307,29 @@ "properties": { "type": { "type": "string", - "enum": ["text", "file"] + "enum": [ + "text", + "file" + ] }, "value": { "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false } } ] } }, - "required": ["role", "content"], + "required": [ + "role", + "content" + ], "additionalProperties": false } } @@ -6546,7 +7366,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -6598,7 +7423,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -6628,7 +7456,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -6722,7 +7555,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -6751,7 +7587,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -6811,7 +7649,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -6827,7 +7667,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -6844,7 +7687,10 @@ "type": "string" } }, - 
"required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -6861,13 +7707,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -6897,11 +7748,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -6942,7 +7802,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -6956,7 +7821,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -6967,7 +7837,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -6975,7 +7847,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -6989,7 +7866,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -7000,7 +7882,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -7030,7 +7915,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -7042,7 +7930,11 @@ }, "match": { "type": 
"string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -7064,17 +7956,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -7111,7 +8012,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -7148,7 +8052,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -7178,7 +8085,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -7193,7 +8103,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -7223,7 +8135,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -7255,7 +8170,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -7291,7 +8208,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -7327,7 +8247,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -7357,10 +8280,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + 
"required": [ + "type" + ], "additionalProperties": false }, { @@ -7396,7 +8324,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -7477,7 +8408,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -7487,7 +8421,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -7524,7 +8461,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -7576,7 +8518,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -7606,7 +8551,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -7700,7 +8650,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -7729,7 +8682,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -7789,7 +8744,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -7805,7 +8762,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -7822,7 +8782,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -7839,13 +8802,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], 
"additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -7875,11 +8843,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -7920,7 +8897,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -7934,7 +8916,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -7945,7 +8932,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -7953,7 +8942,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -7967,7 +8961,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -7978,7 +8977,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -8008,7 +9010,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -8020,7 +9025,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -8042,17 +9051,26 @@ } } }, - "required": ["path", 
"match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -8089,7 +9107,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -8126,7 +9147,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -8156,7 +9180,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -8171,7 +9198,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -8201,7 +9230,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -8233,7 +9265,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -8269,7 +9303,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -8305,7 +9342,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -8335,10 +9375,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -8374,7 +9419,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { 
@@ -8455,7 +9503,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -8465,7 +9516,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -8502,7 +9556,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -8554,7 +9613,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -8584,7 +9646,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -8678,7 +9745,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -8707,7 +9777,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -8767,7 +9839,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -8783,7 +9857,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -8800,7 +9877,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -8817,13 +9897,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -8853,11 +9938,20 @@ }, "type": { "type": "string", - "enum": 
["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -8898,7 +9992,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -8912,7 +10011,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -8923,7 +10027,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -8931,7 +10037,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -8945,7 +10056,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -8956,7 +10072,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -8986,7 +10105,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -8998,7 +10120,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -9020,17 +10146,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", 
+ "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -9067,7 +10202,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -9104,7 +10242,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -9134,7 +10275,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -9149,7 +10293,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -9179,7 +10325,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -9211,7 +10360,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -9247,7 +10398,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -9283,7 +10437,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -9313,10 +10470,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -9352,7 +10514,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -9433,7 +10598,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -9443,7 +10611,10 @@ "minItems": 1 
} }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -9497,7 +10668,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -9549,7 +10725,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -9579,7 +10758,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -9673,7 +10857,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -9702,7 +10889,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -9762,7 +10951,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -9778,7 +10969,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -9795,7 +10989,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -9812,13 +11009,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -9848,11 +11050,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + 
"enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -9893,7 +11104,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -9907,7 +11123,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -9918,7 +11139,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -9926,7 +11149,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -9940,7 +11168,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -9951,7 +11184,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -9981,7 +11217,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -9993,7 +11232,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -10015,17 +11258,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -10062,7 +11314,10 @@ "minimum": 0 } }, - 
"required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -10099,7 +11354,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -10129,7 +11387,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -10144,7 +11405,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -10174,7 +11437,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -10206,7 +11472,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -10242,7 +11510,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -10278,7 +11549,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -10308,10 +11582,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -10347,7 +11626,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -10428,7 +11710,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -10438,7 +11723,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -10475,7 +11763,12 @@ }, "type": { "type": 
"string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -10527,7 +11820,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -10557,7 +11853,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -10651,7 +11952,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -10680,7 +11984,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -10740,7 +12046,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -10756,7 +12064,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -10773,7 +12084,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -10790,13 +12104,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -10826,11 +12145,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -10871,7 +12199,12 @@ "anyOf": [ { "type": 
"string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -10885,7 +12218,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -10896,7 +12234,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -10904,7 +12244,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -10918,7 +12263,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -10929,7 +12279,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -10959,7 +12312,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -10971,7 +12327,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -10993,17 +12353,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -11040,7 +12409,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -11077,7 +12449,10 @@ "minimum": 0 } }, - 
"required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -11107,7 +12482,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -11122,7 +12500,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -11152,7 +12532,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -11184,7 +12567,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -11220,7 +12605,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -11256,7 +12644,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -11286,10 +12677,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -11325,7 +12721,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -11406,7 +12805,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -11416,7 +12818,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -11453,7 +12858,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, 
"command": { "anyOf": [ @@ -11505,7 +12915,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -11535,7 +12948,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -11629,7 +13047,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -11658,7 +13079,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -11718,7 +13141,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -11734,7 +13159,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -11751,7 +13179,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -11768,13 +13199,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -11804,11 +13240,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -11849,7 +13294,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -11863,7 
+13313,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -11874,7 +13329,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -11882,7 +13339,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -11896,7 +13358,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -11907,7 +13374,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -11937,7 +13407,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -11949,7 +13422,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -11971,17 +13448,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -12018,7 +13504,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -12055,7 +13544,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -12085,7 +13577,10 @@ }, "type": { "type": 
"string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -12100,7 +13595,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -12130,7 +13627,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -12162,7 +13662,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -12198,7 +13700,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -12234,7 +13739,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -12264,10 +13772,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -12303,7 +13816,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -12384,7 +13900,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -12394,7 +13913,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -12415,7 +13937,11 @@ }, "strategy": { "type": "string", - "enum": ["pass_at_k", "mean", "confidence_interval"] + "enum": [ + "pass_at_k", + "mean", + "confidence_interval" + ] }, "cost_limit_usd": { "type": "number", @@ -12426,7 +13952,9 @@ "minimum": 0 } }, - "required": ["count"], + "required": [ + "count" + ], "additionalProperties": false }, 
"total_budget_usd": { @@ -12442,6 +13970,11 @@ }, "failOnError": { "type": "boolean" + }, + "threshold": { + "type": "number", + "minimum": 0, + "maximum": 1 } }, "additionalProperties": false @@ -12454,7 +13987,10 @@ }, "isolation": { "type": "string", - "enum": ["shared", "per_test"] + "enum": [ + "shared", + "per_test" + ] }, "repos": { "type": "array", @@ -12478,7 +14014,10 @@ "format": "uri" } }, - "required": ["type", "url"], + "required": [ + "type", + "url" + ], "additionalProperties": false }, { @@ -12492,7 +14031,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false } ] @@ -12505,7 +14047,10 @@ }, "resolve": { "type": "string", - "enum": ["remote", "local"] + "enum": [ + "remote", + "local" + ] }, "ancestor": { "type": "integer", @@ -12534,7 +14079,10 @@ "additionalProperties": false } }, - "required": ["path", "source"], + "required": [ + "path", + "source" + ], "additionalProperties": false } }, @@ -12570,7 +14118,11 @@ }, "reset": { "type": "string", - "enum": ["none", "fast", "strict"] + "enum": [ + "none", + "fast", + "strict" + ] } }, "additionalProperties": false @@ -12601,7 +14153,11 @@ }, "reset": { "type": "string", - "enum": ["none", "fast", "strict"] + "enum": [ + "none", + "fast", + "strict" + ] } }, "additionalProperties": false @@ -12632,7 +14188,11 @@ }, "reset": { "type": "string", - "enum": ["none", "fast", "strict"] + "enum": [ + "none", + "fast", + "strict" + ] } }, "additionalProperties": false @@ -12663,7 +14223,11 @@ }, "reset": { "type": "string", - "enum": ["none", "fast", "strict"] + "enum": [ + "none", + "fast", + "strict" + ] } }, "additionalProperties": false @@ -12673,7 +14237,11 @@ }, "mode": { "type": "string", - "enum": ["pooled", "temp", "static"] + "enum": [ + "pooled", + "temp", + "static" + ] }, "path": { "type": "string" @@ -12695,7 +14263,9 @@ "type": "string" } }, - "required": ["id"], + "required": [ + "id" + ], "additionalProperties": 
false } }, @@ -12755,7 +14325,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -12807,7 +14382,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -12837,7 +14415,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -12931,7 +14514,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -12960,7 +14546,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -13020,7 +14608,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -13036,7 +14626,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -13053,7 +14646,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -13070,13 +14666,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -13106,11 +14707,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { 
"type": "object", @@ -13151,7 +14761,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -13165,7 +14780,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -13176,7 +14796,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -13184,7 +14806,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -13198,7 +14825,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -13209,7 +14841,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -13239,7 +14874,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -13251,7 +14889,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -13273,17 +14915,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -13320,7 +14971,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], 
"additionalProperties": false }, { @@ -13357,7 +15011,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -13387,7 +15044,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -13402,7 +15062,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -13432,7 +15094,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -13464,7 +15129,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -13500,7 +15167,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -13536,7 +15206,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -13566,10 +15239,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -13605,7 +15283,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -13686,7 +15367,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -13696,7 +15380,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -13733,7 +15420,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + 
"enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -13785,7 +15477,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -13815,7 +15510,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -13909,7 +15609,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -13938,7 +15641,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -13998,7 +15703,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -14014,7 +15721,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -14031,7 +15741,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -14048,13 +15761,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -14084,11 +15802,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -14129,7 +15856,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + 
"exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -14143,7 +15875,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -14154,7 +15891,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -14162,7 +15901,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -14176,7 +15920,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -14187,7 +15936,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -14217,7 +15969,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -14229,7 +15984,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -14251,17 +16010,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -14298,7 +16066,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -14335,7 +16106,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], 
"additionalProperties": false }, { @@ -14365,7 +16139,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -14380,7 +16157,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -14410,7 +16189,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -14442,7 +16224,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -14478,7 +16262,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -14514,7 +16301,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -14544,10 +16334,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -14583,7 +16378,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -14664,7 +16462,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -14674,7 +16475,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -14711,7 +16515,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -14763,7 +16572,10 @@ "additionalProperties": {} } 
}, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -14793,7 +16605,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -14887,7 +16704,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -14916,7 +16736,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -14976,7 +16798,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -14992,7 +16816,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -15009,7 +16836,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -15026,13 +16856,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -15062,11 +16897,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -15107,7 +16951,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -15121,7 +16970,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", 
"subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -15132,7 +16986,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -15140,7 +16996,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -15154,7 +17015,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -15165,7 +17031,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -15195,7 +17064,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -15207,7 +17079,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -15229,17 +17105,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -15276,7 +17161,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -15313,7 +17201,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -15343,7 +17234,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", 
+ "token_usage" + ] }, "max_total": { "type": "number", @@ -15358,7 +17252,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -15388,7 +17284,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -15420,7 +17319,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -15456,7 +17357,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -15492,7 +17396,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -15522,10 +17429,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -15561,7 +17473,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -15642,7 +17557,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -15652,7 +17570,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -15673,7 +17594,11 @@ }, "strategy": { "type": "string", - "enum": ["pass_at_k", "mean", "confidence_interval"] + "enum": [ + "pass_at_k", + "mean", + "confidence_interval" + ] }, "cost_limit_usd": { "type": "number", @@ -15684,7 +17609,9 @@ "minimum": 0 } }, - "required": ["count"], + "required": [ + "count" + ], "additionalProperties": false }, "total_budget_usd": { @@ -15700,6 +17627,11 @@ }, "failOnError": { "type": "boolean" + }, + 
"threshold": { + "type": "number", + "minimum": 0, + "maximum": 1 } }, "additionalProperties": false @@ -15735,7 +17667,12 @@ }, "type": { "type": "string", - "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -15787,7 +17724,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -15817,7 +17757,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -15911,7 +17856,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -15940,7 +17888,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -16000,7 +17950,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -16016,7 +17968,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -16033,7 +17988,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -16050,13 +18008,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -16086,11 +18049,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + 
"any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -16131,7 +18103,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -16145,7 +18122,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -16156,7 +18138,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -16164,7 +18148,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -16178,7 +18167,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -16189,7 +18183,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -16219,7 +18216,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -16231,7 +18231,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -16253,17 +18257,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -16300,7 +18313,10 @@ "minimum": 0 } }, - "required": 
["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -16337,7 +18353,10 @@ "minimum": 0 } }, - "required": ["type", "budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -16367,7 +18386,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -16382,7 +18404,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -16412,7 +18436,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -16444,7 +18471,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -16480,7 +18509,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -16516,7 +18548,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -16546,10 +18581,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -16585,7 +18625,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -16666,7 +18709,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -16676,7 +18722,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -16713,7 +18762,12 @@ }, "type": { "type": "string", - "enum": 
["code-grader", "code_grader", "code-judge", "code_judge"] + "enum": [ + "code-grader", + "code_grader", + "code-judge", + "code_judge" + ] }, "command": { "anyOf": [ @@ -16765,7 +18819,10 @@ "additionalProperties": {} } }, - "required": ["type", "command"], + "required": [ + "type", + "command" + ], "additionalProperties": false }, { @@ -16795,7 +18852,12 @@ }, "type": { "type": "string", - "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] + "enum": [ + "llm-grader", + "llm_grader", + "llm-judge", + "llm_judge" + ] }, "prompt": { "anyOf": [ @@ -16889,7 +18951,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -16918,7 +18983,9 @@ "maximum": 2 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -16978,7 +19045,9 @@ } } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -16994,7 +19063,10 @@ "maximum": 1 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -17011,7 +19083,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false }, { @@ -17028,13 +19103,18 @@ "type": "string" } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false } ] } }, - "required": ["type", "aggregator"], + "required": [ + "type", + "aggregator" + ], "additionalProperties": false }, { @@ -17064,11 +19144,20 @@ }, "type": { "type": "string", - "enum": ["tool-trajectory", "tool_trajectory"] + "enum": [ + "tool-trajectory", + "tool_trajectory" + ] }, "mode": { "type": "string", - "enum": ["any_order", "in_order", "exact", "subset", "superset"] + "enum": [ + "any_order", + "in_order", + "exact", + "subset", + "superset" + ] }, "minimums": { "type": "object", @@ -17109,7 +19198,12 @@ "anyOf": [ { "type": "string", - "enum": 
["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -17123,7 +19217,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -17134,7 +19233,9 @@ ] } }, - "required": ["tool"], + "required": [ + "tool" + ], "additionalProperties": false } }, @@ -17142,7 +19243,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -17156,7 +19262,12 @@ "anyOf": [ { "type": "string", - "enum": ["exact", "ignore", "subset", "superset"] + "enum": [ + "exact", + "ignore", + "subset", + "superset" + ] }, { "type": "array", @@ -17167,7 +19278,10 @@ ] } }, - "required": ["type", "mode"], + "required": [ + "type", + "mode" + ], "additionalProperties": false }, { @@ -17197,7 +19311,10 @@ }, "type": { "type": "string", - "enum": ["field-accuracy", "field_accuracy"] + "enum": [ + "field-accuracy", + "field_accuracy" + ] }, "fields": { "type": "array", @@ -17209,7 +19326,11 @@ }, "match": { "type": "string", - "enum": ["exact", "numeric_tolerance", "date"] + "enum": [ + "exact", + "numeric_tolerance", + "date" + ] }, "required": { "type": "boolean" @@ -17231,17 +19352,26 @@ } } }, - "required": ["path", "match"], + "required": [ + "path", + "match" + ], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": ["weighted_average", "all_or_nothing"] + "enum": [ + "weighted_average", + "all_or_nothing" + ] } }, - "required": ["type", "fields"], + "required": [ + "type", + "fields" + ], "additionalProperties": false }, { @@ -17278,7 +19408,10 @@ "minimum": 0 } }, - "required": ["type", "threshold"], + "required": [ + "type", + "threshold" + ], "additionalProperties": false }, { @@ -17315,7 +19448,10 @@ "minimum": 0 } }, - "required": ["type", 
"budget"], + "required": [ + "type", + "budget" + ], "additionalProperties": false }, { @@ -17345,7 +19481,10 @@ }, "type": { "type": "string", - "enum": ["token-usage", "token_usage"] + "enum": [ + "token-usage", + "token_usage" + ] }, "max_total": { "type": "number", @@ -17360,7 +19499,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -17390,7 +19531,10 @@ }, "type": { "type": "string", - "enum": ["execution-metrics", "execution_metrics"] + "enum": [ + "execution-metrics", + "execution_metrics" + ] }, "max_tool_calls": { "type": "number", @@ -17422,7 +19566,9 @@ "minimum": 0 } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -17458,7 +19604,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -17494,7 +19643,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -17524,10 +19676,15 @@ }, "type": { "type": "string", - "enum": ["is-json", "is_json"] + "enum": [ + "is-json", + "is_json" + ] } }, - "required": ["type"], + "required": [ + "type" + ], "additionalProperties": false }, { @@ -17563,7 +19720,10 @@ "type": "string" } }, - "required": ["type", "value"], + "required": [ + "type", + "value" + ], "additionalProperties": false }, { @@ -17644,7 +19804,10 @@ "minLength": 1 } }, - "required": ["score_range", "outcome"], + "required": [ + "score_range", + "outcome" + ], "additionalProperties": false } } @@ -17654,7 +19817,10 @@ "minItems": 1 } }, - "required": ["type", "criteria"], + "required": [ + "type", + "criteria" + ], "additionalProperties": false } ] @@ -17670,7 +19836,10 @@ }, "isolation": { "type": "string", - "enum": ["shared", "per_test"] + "enum": [ + "shared", + "per_test" + ] }, "repos": { "type": "array", @@ -17694,7 +19863,10 @@ "format": "uri" } }, - "required": ["type", 
"url"], + "required": [ + "type", + "url" + ], "additionalProperties": false }, { @@ -17708,7 +19880,10 @@ "type": "string" } }, - "required": ["type", "path"], + "required": [ + "type", + "path" + ], "additionalProperties": false } ] @@ -17721,7 +19896,10 @@ }, "resolve": { "type": "string", - "enum": ["remote", "local"] + "enum": [ + "remote", + "local" + ] }, "ancestor": { "type": "integer", @@ -17750,7 +19928,10 @@ "additionalProperties": false } }, - "required": ["path", "source"], + "required": [ + "path", + "source" + ], "additionalProperties": false } }, @@ -17786,7 +19967,11 @@ }, "reset": { "type": "string", - "enum": ["none", "fast", "strict"] + "enum": [ + "none", + "fast", + "strict" + ] } }, "additionalProperties": false @@ -17817,7 +20002,11 @@ }, "reset": { "type": "string", - "enum": ["none", "fast", "strict"] + "enum": [ + "none", + "fast", + "strict" + ] } }, "additionalProperties": false @@ -17848,7 +20037,11 @@ }, "reset": { "type": "string", - "enum": ["none", "fast", "strict"] + "enum": [ + "none", + "fast", + "strict" + ] } }, "additionalProperties": false @@ -17879,7 +20072,11 @@ }, "reset": { "type": "string", - "enum": ["none", "fast", "strict"] + "enum": [ + "none", + "fast", + "strict" + ] } }, "additionalProperties": false @@ -17889,7 +20086,11 @@ }, "mode": { "type": "string", - "enum": ["pooled", "temp", "static"] + "enum": [ + "pooled", + "temp", + "static" + ] }, "path": { "type": "string" @@ -17903,7 +20104,9 @@ ] } }, - "required": ["tests"], + "required": [ + "tests" + ], "additionalProperties": false } } From f3d2bebb4afbae228385bf2b8210735543901acf Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Wed, 25 Mar 2026 02:33:37 +0000 Subject: [PATCH 06/11] feat(cli): add --threshold flag and wire through options pipeline (#698) Co-Authored-By: Claude Opus 4.6 --- apps/cli/src/commands/eval/commands/run.ts | 6 ++++++ apps/cli/src/commands/eval/run-eval.ts | 9 +++++++++ 2 files changed, 15 insertions(+) diff --git 
a/apps/cli/src/commands/eval/commands/run.ts b/apps/cli/src/commands/eval/commands/run.ts index e680301f3..713366e7b 100644 --- a/apps/cli/src/commands/eval/commands/run.ts +++ b/apps/cli/src/commands/eval/commands/run.ts @@ -175,6 +175,11 @@ export const evalRunCommand = command({ description: 'Number of trailing messages to include in results output (default: 1, or "all")', }), + threshold: option({ + type: optional(number), + long: 'threshold', + description: 'Suite-level quality gate: exit 1 if mean score falls below this value (0-1)', + }), }, handler: async (args) => { // Launch interactive wizard when no eval paths and stdin is a TTY @@ -217,6 +222,7 @@ export const evalRunCommand = command({ graderTarget: args.graderTarget, model: args.model, outputMessages: args.outputMessages, + threshold: args.threshold, }; await runEvalCommand({ testFiles: resolvedPaths, rawOptions }); }, diff --git a/apps/cli/src/commands/eval/run-eval.ts b/apps/cli/src/commands/eval/run-eval.ts index 2d486eab2..ace542fdd 100644 --- a/apps/cli/src/commands/eval/run-eval.ts +++ b/apps/cli/src/commands/eval/run-eval.ts @@ -86,6 +86,7 @@ interface NormalizedOptions { readonly graderTarget?: string; readonly model?: string; readonly outputMessages: number | 'all'; + readonly threshold?: number; } function normalizeBoolean(value: unknown): boolean { @@ -301,6 +302,7 @@ function normalizeOptions( graderTarget: normalizeString(rawOptions.graderTarget), model: normalizeString(rawOptions.model), outputMessages: normalizeOutputMessages(normalizeString(rawOptions.outputMessages)), + threshold: normalizeOptionalNumber(rawOptions.threshold), } satisfies NormalizedOptions; } @@ -430,6 +432,7 @@ async function prepareFileMetadata(params: { readonly yamlCachePath?: string; readonly totalBudgetUsd?: number; readonly failOnError?: FailOnError; + readonly threshold?: number; }> { const { testFilePath, repoRoot, cwd, options } = params; @@ -515,6 +518,7 @@ async function prepareFileMetadata(params: { 
yamlCachePath: suite.cacheConfig?.cachePath, totalBudgetUsd: suite.totalBudgetUsd, failOnError: suite.failOnError, + threshold: suite.threshold, }; } @@ -951,6 +955,7 @@ export async function runEvalCommand( readonly yamlCachePath?: string; readonly totalBudgetUsd?: number; readonly failOnError?: FailOnError; + readonly threshold?: number; } >(); // Separate TypeScript/JS eval files from YAML files. @@ -1006,6 +1011,10 @@ export async function runEvalCommand( console.log(`Response cache: enabled${yamlCachePath ? ` (${yamlCachePath})` : ''}`); } + // Resolve suite-level threshold: CLI --threshold takes precedence over YAML execution.threshold + const yamlThreshold = firstMeta?.threshold; + const resolvedThreshold = options.threshold ?? yamlThreshold; + // Detect matrix mode: multiple targets for any file const isMatrixMode = Array.from(fileMetadata.values()).some((meta) => meta.selections.length > 1); From 1ea5e067bc9bc45c58ca37641962f88146bf1037 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Wed, 25 Mar 2026 02:36:01 +0000 Subject: [PATCH 07/11] feat(cli): add threshold check with summary output after eval (#698) Co-Authored-By: Claude Opus 4.6 --- apps/cli/src/commands/eval/run-eval.ts | 10 ++++++ apps/cli/src/commands/eval/statistics.ts | 14 +++++++++ apps/cli/test/commands/eval/threshold.test.ts | 31 +++++++++++++++++++ 3 files changed, 55 insertions(+) create mode 100644 apps/cli/test/commands/eval/threshold.test.ts diff --git a/apps/cli/src/commands/eval/run-eval.ts b/apps/cli/src/commands/eval/run-eval.ts index ace542fdd..7deac0024 100644 --- a/apps/cli/src/commands/eval/run-eval.ts +++ b/apps/cli/src/commands/eval/run-eval.ts @@ -45,6 +45,7 @@ import { calculateEvaluationSummary, formatEvaluationSummary, formatMatrixSummary, + formatThresholdSummary, } from './statistics.js'; import { type TargetSelection, selectMultipleTargets, selectTarget } from './targets.js'; @@ -1161,6 +1162,15 @@ export async function runEvalCommand( const summary = 
calculateEvaluationSummary(allResults); console.log(formatEvaluationSummary(summary)); + // Threshold quality gate check + if (resolvedThreshold !== undefined) { + const thresholdResult = formatThresholdSummary(summary.mean, resolvedThreshold); + console.log(`\n${thresholdResult.message}`); + if (!thresholdResult.passed) { + process.exitCode = 1; + } + } + // Print matrix summary when multiple targets were evaluated if (isMatrixMode && allResults.length > 0) { console.log(formatMatrixSummary(allResults)); diff --git a/apps/cli/src/commands/eval/statistics.ts b/apps/cli/src/commands/eval/statistics.ts index e47a65791..910052d24 100644 --- a/apps/cli/src/commands/eval/statistics.ts +++ b/apps/cli/src/commands/eval/statistics.ts @@ -334,3 +334,17 @@ export function formatMatrixSummary(results: readonly EvaluationResult[]): strin return lines.join('\n'); } + +/** + * Format a threshold check summary line. + * Returns whether the threshold was met and the formatted message. + */ +export function formatThresholdSummary( + meanScore: number, + threshold: number, +): { passed: boolean; message: string } { + const passed = meanScore >= threshold; + const verdict = passed ? 
'PASS' : 'FAIL'; + const message = `Suite score: ${meanScore.toFixed(2)} (threshold: ${threshold.toFixed(2)}) — ${verdict}`; + return { passed, message }; +} diff --git a/apps/cli/test/commands/eval/threshold.test.ts b/apps/cli/test/commands/eval/threshold.test.ts new file mode 100644 index 000000000..65c059167 --- /dev/null +++ b/apps/cli/test/commands/eval/threshold.test.ts @@ -0,0 +1,31 @@ +import { describe, expect, it } from 'bun:test'; + +import { formatThresholdSummary } from '../../../src/commands/eval/statistics.js'; + +describe('formatThresholdSummary', () => { + it('returns PASS when mean score meets threshold', () => { + const result = formatThresholdSummary(0.85, 0.6); + expect(result.passed).toBe(true); + expect(result.message).toContain('0.85'); + expect(result.message).toContain('0.60'); + expect(result.message).toContain('PASS'); + }); + + it('returns FAIL when mean score is below threshold', () => { + const result = formatThresholdSummary(0.53, 0.6); + expect(result.passed).toBe(false); + expect(result.message).toContain('0.53'); + expect(result.message).toContain('0.60'); + expect(result.message).toContain('FAIL'); + }); + + it('returns PASS when mean score exactly equals threshold', () => { + const result = formatThresholdSummary(0.6, 0.6); + expect(result.passed).toBe(true); + }); + + it('returns PASS for threshold 0 with any score', () => { + const result = formatThresholdSummary(0, 0); + expect(result.passed).toBe(true); + }); +}); From e6261a1efbde6d336596c9cd55abd5b918c5a6e7 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Wed, 25 Mar 2026 02:41:10 +0000 Subject: [PATCH 08/11] feat(cli): JUnit writer uses --threshold for per-test pass/fail (#698) Co-Authored-By: Claude Opus 4.6 --- apps/cli/src/commands/eval/junit-writer.ts | 18 ++++++++----- apps/cli/src/commands/eval/output-writer.ts | 18 ++++++++++--- apps/cli/src/commands/eval/run-eval.ts | 13 +++++++--- .../test/commands/eval/output-writers.test.ts | 26 +++++++++++++++++++ 4 files 
changed, 62 insertions(+), 13 deletions(-) diff --git a/apps/cli/src/commands/eval/junit-writer.ts b/apps/cli/src/commands/eval/junit-writer.ts index f3bfb7f18..514b24585 100644 --- a/apps/cli/src/commands/eval/junit-writer.ts +++ b/apps/cli/src/commands/eval/junit-writer.ts @@ -3,6 +3,10 @@ import path from 'node:path'; import type { EvaluationResult } from '@agentv/core'; +export interface JunitWriterOptions { + readonly threshold?: number; +} + export function escapeXml(str: string): string { return str .replace(/&/g, '&amp;') @@ -15,15 +19,17 @@ export function escapeXml(str: string): string { export class JunitWriter { private readonly filePath: string; private readonly results: EvaluationResult[] = []; + private readonly threshold: number; private closed = false; - private constructor(filePath: string) { + private constructor(filePath: string, options?: JunitWriterOptions) { this.filePath = filePath; + this.threshold = options?.threshold ?? 0.5; } - static async open(filePath: string): Promise<JunitWriter> { + static async open(filePath: string, options?: JunitWriterOptions): Promise<JunitWriter> { await mkdir(path.dirname(filePath), { recursive: true }); - return new JunitWriter(filePath); + return new JunitWriter(filePath, options); } async append(result: EvaluationResult): Promise<void> { @@ -52,7 +58,7 @@ export class JunitWriter { const suiteXmls: string[] = []; for (const [suiteName, results] of grouped) { - const failures = results.filter((r) => r.score < 0.5).length; + const failures = results.filter((r) => r.score < this.threshold).length; const errors = results.filter((r) => r.error !== undefined).length; const testCases = results.map((r) => { @@ -61,7 +67,7 @@ export class JunitWriter { let inner = ''; if (r.error) { inner = `\n ${escapeXml(r.error)}\n `; - } else if (r.score < 0.5) { + } else if (r.score < this.threshold) { const message = `score=${r.score.toFixed(3)}`; const failedAssertions = r.assertions.filter((a) => !a.passed); const detail = [ @@ -84,7 +90,7 @@ export class 
JunitWriter { } const totalTests = this.results.length; - const totalFailures = this.results.filter((r) => r.score < 0.5).length; + const totalFailures = this.results.filter((r) => r.score < this.threshold).length; const totalErrors = this.results.filter((r) => r.error !== undefined).length; const xml = `\n\n${suiteXmls.join('\n')}\n\n`; diff --git a/apps/cli/src/commands/eval/output-writer.ts b/apps/cli/src/commands/eval/output-writer.ts index acaf757fe..e4d2cebd8 100644 --- a/apps/cli/src/commands/eval/output-writer.ts +++ b/apps/cli/src/commands/eval/output-writer.ts @@ -15,6 +15,10 @@ export interface OutputWriter { close(): Promise<void>; } +export interface WriterOptions { + readonly threshold?: number; +} + export async function createOutputWriter( filePath: string, format: OutputFormat, @@ -35,7 +39,10 @@ export async function createOutputWriter( const SUPPORTED_EXTENSIONS = new Set(['.jsonl', '.json', '.xml', '.yaml', '.yml', '.html', '.htm']); -export function createWriterFromPath(filePath: string): Promise<OutputWriter> { +export function createWriterFromPath( + filePath: string, + options?: WriterOptions, +): Promise<OutputWriter> { const ext = path.extname(filePath).toLowerCase(); switch (ext) { case '.jsonl': @@ -43,7 +50,7 @@ export function createWriterFromPath(filePath: string): Promise<OutputWriter> { case '.json': return JsonWriter.open(filePath); case '.xml': - return JunitWriter.open(filePath); + return JunitWriter.open(filePath, { threshold: options?.threshold }); case '.yaml': case '.yml': return YamlWriter.open(filePath); @@ -57,8 +64,11 @@ export function createWriterFromPath(filePath: string): Promise<OutputWriter> { } } -export async function createMultiWriter(filePaths: readonly string[]): Promise<OutputWriter> { - const writers = await Promise.all(filePaths.map((fp) => createWriterFromPath(fp))); +export async function createMultiWriter( + filePaths: readonly string[], + options?: WriterOptions, +): Promise<OutputWriter> { + const writers = await Promise.all(filePaths.map((fp) => createWriterFromPath(fp, options))); return { 
async append(result: EvaluationResult): Promise<void> { await Promise.all(writers.map((w) => w.append(result))); diff --git a/apps/cli/src/commands/eval/run-eval.ts b/apps/cli/src/commands/eval/run-eval.ts index 7deac0024..cd09e8e22 100644 --- a/apps/cli/src/commands/eval/run-eval.ts +++ b/apps/cli/src/commands/eval/run-eval.ts @@ -906,12 +906,9 @@ export async function runEvalCommand( extraOutputPaths.length > 0 ? [outputPath, ...extraOutputPaths] : [outputPath]; const uniqueReportedOutputPaths = [...new Set(reportedOutputPaths)]; - let outputWriter: OutputWriter; if (uniqueOutputPaths.length === 1) { - outputWriter = await createOutputWriter(primaryWritePath, options.format); console.log(`Output path: ${outputPath}`); } else { - outputWriter = await createMultiWriter(uniqueOutputPaths); console.log('Output paths:'); for (const p of uniqueReportedOutputPaths) { console.log(` ${p}`); @@ -1016,6 +1013,16 @@ export async function runEvalCommand( const yamlThreshold = firstMeta?.threshold; const resolvedThreshold = options.threshold ?? yamlThreshold; + // Build the output writer (deferred until after threshold is resolved so JUnit + // writer can use the resolved threshold for per-test pass/fail decisions) + const writerOptions = resolvedThreshold !== undefined ? 
{ threshold: resolvedThreshold } : undefined; + let outputWriter: OutputWriter; + if (uniqueOutputPaths.length === 1) { + outputWriter = await createOutputWriter(primaryWritePath, options.format); + } else { + outputWriter = await createMultiWriter(uniqueOutputPaths, writerOptions); + } + // Detect matrix mode: multiple targets for any file const isMatrixMode = Array.from(fileMetadata.values()).some((meta) => meta.selections.length > 1); diff --git a/apps/cli/test/commands/eval/output-writers.test.ts b/apps/cli/test/commands/eval/output-writers.test.ts index 8c1ea67fb..75ff80da2 100644 --- a/apps/cli/test/commands/eval/output-writers.test.ts +++ b/apps/cli/test/commands/eval/output-writers.test.ts @@ -162,6 +162,32 @@ describe('JunitWriter', () => { 'Cannot write to closed JUnit writer', ); }); + + it('uses custom threshold for pass/fail when provided', async () => { + const filePath = path.join(testDir, `junit-threshold-${Date.now()}.xml`); + const writer = await JunitWriter.open(filePath, { threshold: 0.8 }); + + await writer.append(makeResult({ testId: 'high', score: 0.9 })); + await writer.append(makeResult({ testId: 'mid', score: 0.6 })); + await writer.close(); + + const xml = await readFile(filePath, 'utf8'); + expect(xml).not.toContain(' { + const filePath = path.join(testDir, `junit-default-${Date.now()}.xml`); + const writer = await JunitWriter.open(filePath); + + await writer.append(makeResult({ testId: 'pass', score: 0.6 })); + await writer.append(makeResult({ testId: 'fail', score: 0.3 })); + await writer.close(); + + const xml = await readFile(filePath, 'utf8'); + expect(xml).not.toContain(' { From b66323dfb5eed7bdb858388564c42d9e688d5cdd Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Wed, 25 Mar 2026 02:46:07 +0000 Subject: [PATCH 09/11] style: fix biome formatting in threshold implementation files Co-Authored-By: Claude Opus 4.6 --- apps/cli/src/commands/eval/run-eval.ts | 3 +- .../src/evaluation/loaders/config-loader.ts | 4 +- 
.../references/eval-schema.json | 3530 ++++------------- 3 files changed, 674 insertions(+), 2863 deletions(-) diff --git a/apps/cli/src/commands/eval/run-eval.ts b/apps/cli/src/commands/eval/run-eval.ts index cd09e8e22..8dc114969 100644 --- a/apps/cli/src/commands/eval/run-eval.ts +++ b/apps/cli/src/commands/eval/run-eval.ts @@ -1015,7 +1015,8 @@ export async function runEvalCommand( // Build the output writer (deferred until after threshold is resolved so JUnit // writer can use the resolved threshold for per-test pass/fail decisions) - const writerOptions = resolvedThreshold !== undefined ? { threshold: resolvedThreshold } : undefined; + const writerOptions = + resolvedThreshold !== undefined ? { threshold: resolvedThreshold } : undefined; let outputWriter: OutputWriter; if (uniqueOutputPaths.length === 1) { outputWriter = await createOutputWriter(primaryWritePath, options.format); diff --git a/packages/core/src/evaluation/loaders/config-loader.ts b/packages/core/src/evaluation/loaders/config-loader.ts index daa2aa7aa..54505cddc 100644 --- a/packages/core/src/evaluation/loaders/config-loader.ts +++ b/packages/core/src/evaluation/loaders/config-loader.ts @@ -355,9 +355,7 @@ export function extractThreshold(suite: JsonObject): number | undefined { return raw; } - logWarning( - `Invalid execution.threshold: ${raw}. Must be a number between 0 and 1. Ignoring.`, - ); + logWarning(`Invalid execution.threshold: ${raw}. Must be a number between 0 and 1. 
Ignoring.`); return undefined; } diff --git a/plugins/agentv-dev/skills/agentv-eval-writer/references/eval-schema.json b/plugins/agentv-dev/skills/agentv-eval-writer/references/eval-schema.json index 4df59a334..9827ee04c 100644 --- a/plugins/agentv-dev/skills/agentv-eval-writer/references/eval-schema.json +++ b/plugins/agentv-dev/skills/agentv-eval-writer/references/eval-schema.json @@ -53,12 +53,7 @@ "properties": { "role": { "type": "string", - "enum": [ - "system", - "user", - "assistant", - "tool" - ] + "enum": ["system", "user", "assistant", "tool"] }, "content": { "anyOf": [ @@ -72,29 +67,20 @@ "properties": { "type": { "type": "string", - "enum": [ - "text", - "file" - ] + "enum": ["text", "file"] }, "value": { "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false } } ] } }, - "required": [ - "role", - "content" - ], + "required": ["role", "content"], "additionalProperties": false } } @@ -135,12 +121,7 @@ "properties": { "role": { "type": "string", - "enum": [ - "system", - "user", - "assistant", - "tool" - ] + "enum": ["system", "user", "assistant", "tool"] }, "content": { "anyOf": [ @@ -154,29 +135,20 @@ "properties": { "type": { "type": "string", - "enum": [ - "text", - "file" - ] + "enum": ["text", "file"] }, "value": { "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false } } ] } }, - "required": [ - "role", - "content" - ], + "required": ["role", "content"], "additionalProperties": false } } @@ -204,12 +176,7 @@ "properties": { "role": { "type": "string", - "enum": [ - "system", - "user", - "assistant", - "tool" - ] + "enum": ["system", "user", "assistant", "tool"] }, "content": { "anyOf": [ @@ -223,29 +190,20 @@ "properties": { "type": { "type": "string", - "enum": [ - "text", - "file" - ] + "enum": ["text", "file"] }, "value": { "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": 
["type", "value"], "additionalProperties": false } } ] } }, - "required": [ - "role", - "content" - ], + "required": ["role", "content"], "additionalProperties": false } } @@ -282,12 +240,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -339,10 +292,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -372,12 +322,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -471,10 +416,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -503,9 +445,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -565,9 +505,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -583,10 +521,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -603,10 +538,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -623,18 +555,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -664,20 +591,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - 
"exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -718,12 +636,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -737,12 +650,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -753,9 +661,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -763,12 +669,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -782,12 +683,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -798,10 +694,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -831,10 +724,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -846,11 +736,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -872,26 +758,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -928,10 +805,7 @@ "minimum": 0 } }, - "required": [ - 
"type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -968,10 +842,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -1001,10 +872,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -1019,9 +887,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -1051,10 +917,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -1086,9 +949,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -1124,10 +985,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -1163,10 +1021,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -1196,15 +1051,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -1240,10 +1090,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -1324,10 +1171,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -1337,10 +1181,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -1377,12 +1218,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - 
"code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -1434,10 +1270,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -1467,12 +1300,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -1566,10 +1394,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -1598,9 +1423,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -1660,9 +1483,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -1678,10 +1499,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -1698,10 +1516,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -1718,18 +1533,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -1759,20 +1569,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -1813,12 +1614,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - 
"superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -1832,12 +1628,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -1848,9 +1639,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -1858,12 +1647,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -1877,12 +1661,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -1893,10 +1672,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -1926,10 +1702,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -1941,11 +1714,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -1967,26 +1736,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -2023,10 +1783,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -2063,10 +1820,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], 
"additionalProperties": false }, { @@ -2096,10 +1850,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -2114,9 +1865,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -2146,10 +1895,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -2181,9 +1927,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -2219,10 +1963,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -2258,10 +1999,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -2291,15 +2029,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -2335,10 +2068,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -2419,10 +2149,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -2432,10 +2159,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -2472,12 +2196,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -2529,10 +2248,7 @@ "additionalProperties": {} } }, - "required": [ - 
"type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -2562,12 +2278,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -2661,10 +2372,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -2693,9 +2401,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -2755,9 +2461,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -2773,10 +2477,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -2793,10 +2494,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -2813,18 +2511,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -2854,20 +2547,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -2908,12 +2592,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -2927,12 +2606,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": 
["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -2943,9 +2617,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -2953,12 +2625,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -2972,12 +2639,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -2988,10 +2650,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -3021,10 +2680,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -3036,11 +2692,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -3062,26 +2714,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -3118,10 +2761,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -3158,10 +2798,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -3191,10 +2828,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ 
-3209,9 +2843,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -3241,10 +2873,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -3276,9 +2905,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -3314,10 +2941,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -3353,10 +2977,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -3386,15 +3007,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -3430,10 +3046,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -3514,10 +3127,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -3527,10 +3137,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -3584,12 +3191,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -3641,10 +3243,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -3674,12 +3273,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" 
- ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -3773,10 +3367,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -3805,9 +3396,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -3867,9 +3456,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -3885,10 +3472,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -3905,10 +3489,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -3925,18 +3506,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -3966,20 +3542,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -4020,12 +3587,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -4039,12 +3601,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -4055,9 +3612,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -4065,12 +3620,7 @@ "anyOf": [ { "type": 
"string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -4084,12 +3634,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -4100,10 +3645,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -4133,10 +3675,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -4148,11 +3687,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -4174,26 +3709,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -4230,10 +3756,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -4270,10 +3793,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -4303,10 +3823,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -4321,9 +3838,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -4353,10 +3868,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - 
"execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -4388,9 +3900,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -4426,10 +3936,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -4465,10 +3972,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -4498,15 +4002,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -4542,10 +4041,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -4626,10 +4122,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -4639,10 +4132,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -4679,12 +4169,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -4736,10 +4221,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -4769,12 +4251,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -4868,10 +4345,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": 
["score_range", "outcome"], "additionalProperties": false } } @@ -4900,9 +4374,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -4962,9 +4434,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -4980,10 +4450,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -5000,10 +4467,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -5020,18 +4484,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -5061,20 +4520,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -5115,12 +4565,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -5134,12 +4579,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -5150,9 +4590,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -5160,12 +4598,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -5179,12 +4612,7 @@ "anyOf": [ { "type": "string", 
- "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -5195,10 +4623,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -5228,10 +4653,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -5243,11 +4665,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -5269,26 +4687,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -5325,10 +4734,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -5365,10 +4771,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -5398,10 +4801,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -5416,9 +4816,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -5448,10 +4846,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -5483,9 +4878,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], 
"additionalProperties": false }, { @@ -5521,10 +4914,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -5560,10 +4950,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -5593,15 +4980,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -5637,10 +5019,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -5721,10 +5100,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -5734,10 +5110,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -5774,12 +5147,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -5831,10 +5199,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -5864,12 +5229,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -5963,10 +5323,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -5995,9 +5352,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -6057,9 +5412,7 @@ } } }, - 
"required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -6075,10 +5428,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -6095,10 +5445,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -6115,18 +5462,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -6156,20 +5498,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -6210,12 +5543,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -6229,12 +5557,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -6245,9 +5568,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -6255,12 +5576,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -6274,12 +5590,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -6290,10 +5601,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": 
["type", "mode"], "additionalProperties": false }, { @@ -6323,10 +5631,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -6338,11 +5643,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -6364,26 +5665,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -6420,10 +5712,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -6460,10 +5749,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -6493,10 +5779,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -6511,9 +5794,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -6543,10 +5824,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -6578,9 +5856,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -6616,10 +5892,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -6655,10 +5928,7 @@ "type": 
"string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -6688,15 +5958,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -6732,10 +5997,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -6816,10 +6078,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -6829,10 +6088,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -6853,11 +6109,7 @@ }, "strategy": { "type": "string", - "enum": [ - "pass_at_k", - "mean", - "confidence_interval" - ] + "enum": ["pass_at_k", "mean", "confidence_interval"] }, "cost_limit_usd": { "type": "number", @@ -6868,9 +6120,7 @@ "minimum": 0 } }, - "required": [ - "count" - ], + "required": ["count"], "additionalProperties": false }, "total_budget_usd": { @@ -6903,10 +6153,7 @@ }, "isolation": { "type": "string", - "enum": [ - "shared", - "per_test" - ] + "enum": ["shared", "per_test"] }, "repos": { "type": "array", @@ -6930,10 +6177,7 @@ "format": "uri" } }, - "required": [ - "type", - "url" - ], + "required": ["type", "url"], "additionalProperties": false }, { @@ -6947,10 +6191,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false } ] @@ -6963,10 +6204,7 @@ }, "resolve": { "type": "string", - "enum": [ - "remote", - "local" - ] + "enum": ["remote", "local"] }, "ancestor": { "type": "integer", @@ -6995,10 +6233,7 @@ "additionalProperties": false } }, - "required": [ - "path", - "source" - ], + "required": ["path", "source"], "additionalProperties": false } }, @@ 
-7034,11 +6269,7 @@ }, "reset": { "type": "string", - "enum": [ - "none", - "fast", - "strict" - ] + "enum": ["none", "fast", "strict"] } }, "additionalProperties": false @@ -7069,11 +6300,7 @@ }, "reset": { "type": "string", - "enum": [ - "none", - "fast", - "strict" - ] + "enum": ["none", "fast", "strict"] } }, "additionalProperties": false @@ -7104,11 +6331,7 @@ }, "reset": { "type": "string", - "enum": [ - "none", - "fast", - "strict" - ] + "enum": ["none", "fast", "strict"] } }, "additionalProperties": false @@ -7139,11 +6362,7 @@ }, "reset": { "type": "string", - "enum": [ - "none", - "fast", - "strict" - ] + "enum": ["none", "fast", "strict"] } }, "additionalProperties": false @@ -7153,11 +6372,7 @@ }, "mode": { "type": "string", - "enum": [ - "pooled", - "temp", - "static" - ] + "enum": ["pooled", "temp", "static"] }, "path": { "type": "string" @@ -7179,9 +6394,7 @@ "type": "string" } }, - "required": [ - "id" - ], + "required": ["id"], "additionalProperties": false } }, @@ -7219,12 +6432,7 @@ "properties": { "role": { "type": "string", - "enum": [ - "system", - "user", - "assistant", - "tool" - ] + "enum": ["system", "user", "assistant", "tool"] }, "content": { "anyOf": [ @@ -7238,29 +6446,20 @@ "properties": { "type": { "type": "string", - "enum": [ - "text", - "file" - ] + "enum": ["text", "file"] }, "value": { "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false } } ] } }, - "required": [ - "role", - "content" - ], + "required": ["role", "content"], "additionalProperties": false } } @@ -7288,12 +6487,7 @@ "properties": { "role": { "type": "string", - "enum": [ - "system", - "user", - "assistant", - "tool" - ] + "enum": ["system", "user", "assistant", "tool"] }, "content": { "anyOf": [ @@ -7307,29 +6501,20 @@ "properties": { "type": { "type": "string", - "enum": [ - "text", - "file" - ] + "enum": ["text", "file"] }, "value": { "type": "string" } }, - "required": [ - "type", - 
"value" - ], + "required": ["type", "value"], "additionalProperties": false } } ] } }, - "required": [ - "role", - "content" - ], + "required": ["role", "content"], "additionalProperties": false } } @@ -7366,12 +6551,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -7423,10 +6603,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -7456,12 +6633,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -7555,10 +6727,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -7587,9 +6756,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -7649,9 +6816,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -7667,10 +6832,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -7687,10 +6849,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -7707,18 +6866,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -7748,20 +6902,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - 
"enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -7802,12 +6947,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -7821,12 +6961,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -7837,9 +6972,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -7847,12 +6980,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -7866,12 +6994,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -7882,10 +7005,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -7915,10 +7035,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -7930,11 +7047,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -7956,26 +7069,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, 
{ @@ -8012,10 +7116,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -8052,10 +7153,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -8085,10 +7183,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -8103,9 +7198,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -8135,10 +7228,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -8170,9 +7260,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -8208,10 +7296,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -8247,10 +7332,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -8280,15 +7362,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -8324,10 +7401,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -8408,10 +7482,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -8421,10 +7492,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -8461,12 +7529,7 @@ }, 
"type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -8518,10 +7581,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -8551,12 +7611,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -8650,10 +7705,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -8682,9 +7734,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -8744,9 +7794,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -8762,10 +7810,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -8782,10 +7827,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -8802,18 +7844,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -8843,20 +7880,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -8897,12 +7925,7 @@ "anyOf": [ { "type": 
"string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -8916,12 +7939,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -8932,9 +7950,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -8942,12 +7958,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -8961,12 +7972,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -8977,10 +7983,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -9010,10 +8013,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -9025,11 +8025,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -9051,26 +8047,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -9107,10 +8094,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -9147,10 +8131,7 @@ "minimum": 0 } }, - "required": [ - 
"type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -9180,10 +8161,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -9198,9 +8176,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -9230,10 +8206,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -9265,9 +8238,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -9303,10 +8274,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -9342,10 +8310,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -9375,15 +8340,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -9419,10 +8379,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -9503,10 +8460,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -9516,10 +8470,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -9556,12 +8507,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -9613,10 
+8559,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -9646,12 +8589,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -9745,10 +8683,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -9777,9 +8712,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -9839,9 +8772,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -9857,10 +8788,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -9877,10 +8805,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -9897,18 +8822,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -9938,20 +8858,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -9992,12 +8903,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -10011,12 +8917,7 @@ "anyOf": [ { "type": "string", - "enum": [ - 
"exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -10027,9 +8928,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -10037,12 +8936,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -10056,12 +8950,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -10072,10 +8961,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -10105,10 +8991,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -10120,11 +9003,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -10146,26 +9025,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -10202,10 +9072,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -10242,10 +9109,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -10275,10 +9139,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": 
["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -10293,9 +9154,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -10325,10 +9184,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -10360,9 +9216,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -10398,10 +9252,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -10437,10 +9288,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -10470,15 +9318,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -10514,10 +9357,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -10598,10 +9438,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -10611,10 +9448,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -10668,12 +9502,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -10725,10 +9554,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -10758,12 +9584,7 @@ }, "type": { "type": 
"string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -10857,10 +9678,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -10889,9 +9707,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -10951,9 +9767,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -10969,10 +9783,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -10989,10 +9800,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -11009,18 +9817,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -11050,20 +9853,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -11104,12 +9898,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -11123,12 +9912,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -11139,9 +9923,7 @@ ] } }, - "required": [ - "tool" - ], + 
"required": ["tool"], "additionalProperties": false } }, @@ -11149,12 +9931,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -11168,12 +9945,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -11184,10 +9956,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -11217,10 +9986,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -11232,11 +9998,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -11258,26 +10020,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -11314,10 +10067,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -11354,10 +10104,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -11387,10 +10134,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -11405,9 +10149,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": 
false }, { @@ -11437,10 +10179,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -11472,9 +10211,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -11510,10 +10247,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -11549,10 +10283,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -11582,15 +10313,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -11626,10 +10352,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -11710,10 +10433,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -11723,10 +10443,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -11763,12 +10480,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -11820,10 +10532,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -11853,12 +10562,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": 
[ @@ -11952,10 +10656,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -11984,9 +10685,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -12046,9 +10745,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -12064,10 +10761,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -12084,10 +10778,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -12104,18 +10795,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -12145,20 +10831,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -12199,12 +10876,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -12218,12 +10890,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -12234,9 +10901,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -12244,12 +10909,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" 
- ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -12263,12 +10923,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -12279,10 +10934,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -12312,10 +10964,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -12327,11 +10976,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -12353,26 +10998,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -12409,10 +11045,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -12449,10 +11082,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -12482,10 +11112,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -12500,9 +11127,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -12532,10 +11157,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", 
"execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -12567,9 +11189,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -12605,10 +11225,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -12644,10 +11261,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -12677,15 +11291,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -12721,10 +11330,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -12805,10 +11411,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -12818,10 +11421,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -12858,12 +11458,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -12915,10 +11510,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -12948,12 +11540,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -13047,10 +11634,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], 
"additionalProperties": false } } @@ -13079,9 +11663,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -13141,9 +11723,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -13159,10 +11739,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -13179,10 +11756,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -13199,18 +11773,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -13240,20 +11809,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -13294,12 +11854,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -13313,12 +11868,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -13329,9 +11879,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -13339,12 +11887,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -13358,12 +11901,7 @@ "anyOf": [ { "type": "string", - 
"enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -13374,10 +11912,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -13407,10 +11942,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -13422,11 +11954,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -13448,26 +11976,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -13504,10 +12023,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -13544,10 +12060,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -13577,10 +12090,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -13595,9 +12105,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -13627,10 +12135,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -13662,9 +12167,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + 
"required": ["type"], "additionalProperties": false }, { @@ -13700,10 +12203,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -13739,10 +12239,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -13772,15 +12269,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -13816,10 +12308,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -13900,10 +12389,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -13913,10 +12399,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -13937,11 +12420,7 @@ }, "strategy": { "type": "string", - "enum": [ - "pass_at_k", - "mean", - "confidence_interval" - ] + "enum": ["pass_at_k", "mean", "confidence_interval"] }, "cost_limit_usd": { "type": "number", @@ -13952,9 +12431,7 @@ "minimum": 0 } }, - "required": [ - "count" - ], + "required": ["count"], "additionalProperties": false }, "total_budget_usd": { @@ -13987,10 +12464,7 @@ }, "isolation": { "type": "string", - "enum": [ - "shared", - "per_test" - ] + "enum": ["shared", "per_test"] }, "repos": { "type": "array", @@ -14014,10 +12488,7 @@ "format": "uri" } }, - "required": [ - "type", - "url" - ], + "required": ["type", "url"], "additionalProperties": false }, { @@ -14031,10 +12502,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false } ] @@ -14047,10 +12515,7 @@ }, "resolve": { "type": "string", - "enum": [ - 
"remote", - "local" - ] + "enum": ["remote", "local"] }, "ancestor": { "type": "integer", @@ -14079,10 +12544,7 @@ "additionalProperties": false } }, - "required": [ - "path", - "source" - ], + "required": ["path", "source"], "additionalProperties": false } }, @@ -14118,11 +12580,7 @@ }, "reset": { "type": "string", - "enum": [ - "none", - "fast", - "strict" - ] + "enum": ["none", "fast", "strict"] } }, "additionalProperties": false @@ -14153,11 +12611,7 @@ }, "reset": { "type": "string", - "enum": [ - "none", - "fast", - "strict" - ] + "enum": ["none", "fast", "strict"] } }, "additionalProperties": false @@ -14188,11 +12642,7 @@ }, "reset": { "type": "string", - "enum": [ - "none", - "fast", - "strict" - ] + "enum": ["none", "fast", "strict"] } }, "additionalProperties": false @@ -14223,11 +12673,7 @@ }, "reset": { "type": "string", - "enum": [ - "none", - "fast", - "strict" - ] + "enum": ["none", "fast", "strict"] } }, "additionalProperties": false @@ -14237,11 +12683,7 @@ }, "mode": { "type": "string", - "enum": [ - "pooled", - "temp", - "static" - ] + "enum": ["pooled", "temp", "static"] }, "path": { "type": "string" @@ -14263,9 +12705,7 @@ "type": "string" } }, - "required": [ - "id" - ], + "required": ["id"], "additionalProperties": false } }, @@ -14325,12 +12765,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -14382,10 +12817,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -14415,12 +12847,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -14514,10 +12941,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + 
"required": ["score_range", "outcome"], "additionalProperties": false } } @@ -14546,9 +12970,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -14608,9 +13030,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -14626,10 +13046,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -14646,10 +13063,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -14666,18 +13080,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -14707,20 +13116,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -14761,12 +13161,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -14780,12 +13175,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -14796,9 +13186,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -14806,12 +13194,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -14825,12 +13208,7 
@@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -14841,10 +13219,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -14874,10 +13249,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -14889,11 +13261,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -14915,26 +13283,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -14971,10 +13330,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -15011,10 +13367,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -15044,10 +13397,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -15062,9 +13412,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -15094,10 +13442,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -15129,9 +13474,7 @@ "minimum": 0 } }, - 
"required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -15167,10 +13510,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -15206,10 +13546,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -15239,15 +13576,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -15283,10 +13615,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -15367,10 +13696,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -15380,10 +13706,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -15420,12 +13743,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -15477,10 +13795,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -15510,12 +13825,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -15609,10 +13919,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -15641,9 +13948,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": 
["type"], "additionalProperties": false }, { @@ -15703,9 +14008,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -15721,10 +14024,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -15741,10 +14041,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -15761,18 +14058,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -15802,20 +14094,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -15856,12 +14139,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -15875,12 +14153,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -15891,9 +14164,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -15901,12 +14172,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -15920,12 +14186,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { 
"type": "array", @@ -15936,10 +14197,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -15969,10 +14227,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -15984,11 +14239,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -16010,26 +14261,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -16066,10 +14308,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -16106,10 +14345,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -16139,10 +14375,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -16157,9 +14390,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -16189,10 +14420,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -16224,9 +14452,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -16262,10 +14488,7 @@ "type": "string" } }, - "required": [ - 
"type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -16301,10 +14524,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -16334,15 +14554,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -16378,10 +14593,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -16462,10 +14674,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -16475,10 +14684,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -16515,12 +14721,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -16572,10 +14773,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -16605,12 +14803,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -16704,10 +14897,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -16736,9 +14926,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -16798,9 +14986,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": 
false }, { @@ -16816,10 +15002,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -16836,10 +15019,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -16856,18 +15036,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -16897,20 +15072,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -16951,12 +15117,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -16970,12 +15131,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -16986,9 +15142,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -16996,12 +15150,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -17015,12 +15164,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -17031,10 +15175,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ 
-17064,10 +15205,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -17079,11 +15217,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -17105,26 +15239,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -17161,10 +15286,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -17201,10 +15323,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -17234,10 +15353,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -17252,9 +15368,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -17284,10 +15398,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -17319,9 +15430,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -17357,10 +15466,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -17396,10 +15502,7 @@ "type": "string" } }, - "required": [ - "type", 
- "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -17429,15 +15532,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -17473,10 +15571,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -17557,10 +15652,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -17570,10 +15662,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -17594,11 +15683,7 @@ }, "strategy": { "type": "string", - "enum": [ - "pass_at_k", - "mean", - "confidence_interval" - ] + "enum": ["pass_at_k", "mean", "confidence_interval"] }, "cost_limit_usd": { "type": "number", @@ -17609,9 +15694,7 @@ "minimum": 0 } }, - "required": [ - "count" - ], + "required": ["count"], "additionalProperties": false }, "total_budget_usd": { @@ -17667,12 +15750,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -17724,10 +15802,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -17757,12 +15832,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -17856,10 +15926,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -17888,9 +15955,7 @@ "maximum": 
2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -17950,9 +16015,7 @@ } } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -17968,10 +16031,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -17988,10 +16048,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -18008,18 +16065,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -18049,20 +16101,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -18103,12 +16146,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -18122,12 +16160,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -18138,9 +16171,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -18148,12 +16179,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -18167,12 +16193,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": 
["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -18183,10 +16204,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", "mode"], "additionalProperties": false }, { @@ -18216,10 +16234,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -18231,11 +16246,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -18257,26 +16268,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -18313,10 +16315,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -18353,10 +16352,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -18386,10 +16382,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -18404,9 +16397,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -18436,10 +16427,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -18471,9 +16459,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -18509,10 +16495,7 
@@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -18548,10 +16531,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -18581,15 +16561,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -18625,10 +16600,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -18709,10 +16681,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -18722,10 +16691,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -18762,12 +16728,7 @@ }, "type": { "type": "string", - "enum": [ - "code-grader", - "code_grader", - "code-judge", - "code_judge" - ] + "enum": ["code-grader", "code_grader", "code-judge", "code_judge"] }, "command": { "anyOf": [ @@ -18819,10 +16780,7 @@ "additionalProperties": {} } }, - "required": [ - "type", - "command" - ], + "required": ["type", "command"], "additionalProperties": false }, { @@ -18852,12 +16810,7 @@ }, "type": { "type": "string", - "enum": [ - "llm-grader", - "llm_grader", - "llm-judge", - "llm_judge" - ] + "enum": ["llm-grader", "llm_grader", "llm-judge", "llm_judge"] }, "prompt": { "anyOf": [ @@ -18951,10 +16904,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -18983,9 +16933,7 @@ "maximum": 2 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -19045,9 +16993,7 @@ } } }, - "required": [ - "type" - ], + 
"required": ["type"], "additionalProperties": false }, { @@ -19063,10 +17009,7 @@ "maximum": 1 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -19083,10 +17026,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false }, { @@ -19103,18 +17043,13 @@ "type": "string" } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false } ] } }, - "required": [ - "type", - "aggregator" - ], + "required": ["type", "aggregator"], "additionalProperties": false }, { @@ -19144,20 +17079,11 @@ }, "type": { "type": "string", - "enum": [ - "tool-trajectory", - "tool_trajectory" - ] + "enum": ["tool-trajectory", "tool_trajectory"] }, "mode": { "type": "string", - "enum": [ - "any_order", - "in_order", - "exact", - "subset", - "superset" - ] + "enum": ["any_order", "in_order", "exact", "subset", "superset"] }, "minimums": { "type": "object", @@ -19198,12 +17124,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -19217,12 +17138,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -19233,9 +17149,7 @@ ] } }, - "required": [ - "tool" - ], + "required": ["tool"], "additionalProperties": false } }, @@ -19243,12 +17157,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -19262,12 +17171,7 @@ "anyOf": [ { "type": "string", - "enum": [ - "exact", - "ignore", - "subset", - "superset" - ] + "enum": ["exact", "ignore", "subset", "superset"] }, { "type": "array", @@ -19278,10 +17182,7 @@ ] } }, - "required": [ - "type", - "mode" - ], + "required": ["type", 
"mode"], "additionalProperties": false }, { @@ -19311,10 +17212,7 @@ }, "type": { "type": "string", - "enum": [ - "field-accuracy", - "field_accuracy" - ] + "enum": ["field-accuracy", "field_accuracy"] }, "fields": { "type": "array", @@ -19326,11 +17224,7 @@ }, "match": { "type": "string", - "enum": [ - "exact", - "numeric_tolerance", - "date" - ] + "enum": ["exact", "numeric_tolerance", "date"] }, "required": { "type": "boolean" @@ -19352,26 +17246,17 @@ } } }, - "required": [ - "path", - "match" - ], + "required": ["path", "match"], "additionalProperties": false }, "minItems": 1 }, "aggregation": { "type": "string", - "enum": [ - "weighted_average", - "all_or_nothing" - ] + "enum": ["weighted_average", "all_or_nothing"] } }, - "required": [ - "type", - "fields" - ], + "required": ["type", "fields"], "additionalProperties": false }, { @@ -19408,10 +17293,7 @@ "minimum": 0 } }, - "required": [ - "type", - "threshold" - ], + "required": ["type", "threshold"], "additionalProperties": false }, { @@ -19448,10 +17330,7 @@ "minimum": 0 } }, - "required": [ - "type", - "budget" - ], + "required": ["type", "budget"], "additionalProperties": false }, { @@ -19481,10 +17360,7 @@ }, "type": { "type": "string", - "enum": [ - "token-usage", - "token_usage" - ] + "enum": ["token-usage", "token_usage"] }, "max_total": { "type": "number", @@ -19499,9 +17375,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -19531,10 +17405,7 @@ }, "type": { "type": "string", - "enum": [ - "execution-metrics", - "execution_metrics" - ] + "enum": ["execution-metrics", "execution_metrics"] }, "max_tool_calls": { "type": "number", @@ -19566,9 +17437,7 @@ "minimum": 0 } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -19604,10 +17473,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -19643,10 +17509,7 @@ 
"type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -19676,15 +17539,10 @@ }, "type": { "type": "string", - "enum": [ - "is-json", - "is_json" - ] + "enum": ["is-json", "is_json"] } }, - "required": [ - "type" - ], + "required": ["type"], "additionalProperties": false }, { @@ -19720,10 +17578,7 @@ "type": "string" } }, - "required": [ - "type", - "value" - ], + "required": ["type", "value"], "additionalProperties": false }, { @@ -19804,10 +17659,7 @@ "minLength": 1 } }, - "required": [ - "score_range", - "outcome" - ], + "required": ["score_range", "outcome"], "additionalProperties": false } } @@ -19817,10 +17669,7 @@ "minItems": 1 } }, - "required": [ - "type", - "criteria" - ], + "required": ["type", "criteria"], "additionalProperties": false } ] @@ -19836,10 +17685,7 @@ }, "isolation": { "type": "string", - "enum": [ - "shared", - "per_test" - ] + "enum": ["shared", "per_test"] }, "repos": { "type": "array", @@ -19863,10 +17709,7 @@ "format": "uri" } }, - "required": [ - "type", - "url" - ], + "required": ["type", "url"], "additionalProperties": false }, { @@ -19880,10 +17723,7 @@ "type": "string" } }, - "required": [ - "type", - "path" - ], + "required": ["type", "path"], "additionalProperties": false } ] @@ -19896,10 +17736,7 @@ }, "resolve": { "type": "string", - "enum": [ - "remote", - "local" - ] + "enum": ["remote", "local"] }, "ancestor": { "type": "integer", @@ -19928,10 +17765,7 @@ "additionalProperties": false } }, - "required": [ - "path", - "source" - ], + "required": ["path", "source"], "additionalProperties": false } }, @@ -19967,11 +17801,7 @@ }, "reset": { "type": "string", - "enum": [ - "none", - "fast", - "strict" - ] + "enum": ["none", "fast", "strict"] } }, "additionalProperties": false @@ -20002,11 +17832,7 @@ }, "reset": { "type": "string", - "enum": [ - "none", - "fast", - "strict" - ] + "enum": ["none", "fast", "strict"] } }, "additionalProperties": false @@ 
-20037,11 +17863,7 @@ }, "reset": { "type": "string", - "enum": [ - "none", - "fast", - "strict" - ] + "enum": ["none", "fast", "strict"] } }, "additionalProperties": false @@ -20072,11 +17894,7 @@ }, "reset": { "type": "string", - "enum": [ - "none", - "fast", - "strict" - ] + "enum": ["none", "fast", "strict"] } }, "additionalProperties": false @@ -20086,11 +17904,7 @@ }, "mode": { "type": "string", - "enum": [ - "pooled", - "temp", - "static" - ] + "enum": ["pooled", "temp", "static"] }, "path": { "type": "string" @@ -20104,9 +17918,7 @@ ] } }, - "required": [ - "tests" - ], + "required": ["tests"], "additionalProperties": false } } From 1fc65514ce6e00df3f534a75cbea7cc86251d4e5 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Wed, 25 Mar 2026 02:50:32 +0000 Subject: [PATCH 10/11] fix(cli): use process.exit for threshold gate exit code (#698) process.exitCode was being reset by the cmd-ts handler wrapper. Return thresholdFailed from runEvalCommand and call process.exit(1) in the handler instead. 
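For illustration only, a minimal sketch of the pattern described above: the runner reports the gate outcome and the caller decides the exit code. The names `checkThreshold`, `handle`, and `RunResult` are hypothetical and are not the code in this patch; the `exit` callback is injected so the sketch can run without terminating the process.

```typescript
// Hypothetical sketch: these names are illustrative, not the project's API.
interface RunResult {
  readonly thresholdFailed: boolean;
}

// Pure check: report the outcome instead of mutating process state.
function checkThreshold(mean: number, threshold?: number): RunResult {
  if (threshold === undefined) {
    // No threshold configured, so the gate is inactive.
    return { thresholdFailed: false };
  }
  if (threshold < 0 || threshold > 1) {
    throw new Error('--threshold must be between 0 and 1');
  }
  // A mean equal to the threshold passes; only a strictly lower mean fails.
  return { thresholdFailed: mean < threshold };
}

// The caller owns the exit decision. In a real CLI handler `exit` would
// presumably be process.exit; here it is a parameter (an assumption).
function handle(result: RunResult, exit: (code: number) => void): void {
  if (result.thresholdFailed) {
    exit(1);
  }
}
```

Keeping the check free of side effects means the exit path can be exercised in tests by passing a recording callback in place of a real exit.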
Co-Authored-By: Claude Opus 4.6 --- apps/cli/src/commands/eval/commands/run.ts | 5 ++++- apps/cli/src/commands/eval/run-eval.ts | 8 +++++--- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/apps/cli/src/commands/eval/commands/run.ts b/apps/cli/src/commands/eval/commands/run.ts index 713366e7b..5df5ee42b 100644 --- a/apps/cli/src/commands/eval/commands/run.ts +++ b/apps/cli/src/commands/eval/commands/run.ts @@ -224,6 +224,9 @@ export const evalRunCommand = command({ outputMessages: args.outputMessages, threshold: args.threshold, }; - await runEvalCommand({ testFiles: resolvedPaths, rawOptions }); + const result = await runEvalCommand({ testFiles: resolvedPaths, rawOptions }); + if (result?.thresholdFailed) { + process.exit(1); + } }, }); diff --git a/apps/cli/src/commands/eval/run-eval.ts b/apps/cli/src/commands/eval/run-eval.ts index 8dc114969..98d5670fc 100644 --- a/apps/cli/src/commands/eval/run-eval.ts +++ b/apps/cli/src/commands/eval/run-eval.ts @@ -754,6 +754,8 @@ export interface RunEvalResult { readonly outputPath: string; readonly testFiles: readonly string[]; readonly target?: string; + /** True when --threshold is set and mean score is below the threshold */ + readonly thresholdFailed?: boolean; } export async function runEvalCommand( @@ -1171,12 +1173,11 @@ export async function runEvalCommand( console.log(formatEvaluationSummary(summary)); // Threshold quality gate check + let thresholdFailed = false; if (resolvedThreshold !== undefined) { const thresholdResult = formatThresholdSummary(summary.mean, resolvedThreshold); console.log(`\n${thresholdResult.message}`); - if (!thresholdResult.passed) { - process.exitCode = 1; - } + thresholdFailed = !thresholdResult.passed; } // Print matrix summary when multiple targets were evaluated @@ -1273,6 +1274,7 @@ export async function runEvalCommand( outputPath, testFiles: resolvedTestFiles, target: options.target, + thresholdFailed, }; } finally { unsubscribeCodexLogs(); From 
5ea4e4bfc616a3cab7ba3dfb3330b16822bdfeed Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Wed, 25 Mar 2026 02:58:13 +0000 Subject: [PATCH 11/11] docs: add --threshold documentation and CLI validation (#698) - Add CLI range validation (0-1) for --threshold flag - Document threshold in running-evals.mdx, eval-files.mdx, and SKILL.md - Remove temporary plan files before merge Co-Authored-By: Claude Opus 4.6 --- apps/cli/src/commands/eval/run-eval.ts | 3 + .../content/docs/evaluation/eval-files.mdx | 2 +- .../content/docs/evaluation/running-evals.mdx | 27 + .../plans/2026-03-25-threshold-flag-design.md | 76 --- docs/plans/2026-03-25-threshold-flag-plan.md | 562 ------------------ .../skills/agentv-eval-writer/SKILL.md | 15 +- 6 files changed, 45 insertions(+), 640 deletions(-) delete mode 100644 docs/plans/2026-03-25-threshold-flag-design.md delete mode 100644 docs/plans/2026-03-25-threshold-flag-plan.md diff --git a/apps/cli/src/commands/eval/run-eval.ts b/apps/cli/src/commands/eval/run-eval.ts index 98d5670fc..ac3a84cd9 100644 --- a/apps/cli/src/commands/eval/run-eval.ts +++ b/apps/cli/src/commands/eval/run-eval.ts @@ -1014,6 +1014,9 @@ export async function runEvalCommand( // Resolve suite-level threshold: CLI --threshold takes precedence over YAML execution.threshold const yamlThreshold = firstMeta?.threshold; const resolvedThreshold = options.threshold ?? 
yamlThreshold; + if (resolvedThreshold !== undefined && (resolvedThreshold < 0 || resolvedThreshold > 1)) { + throw new Error('--threshold must be between 0 and 1'); + } // Build the output writer (deferred until after threshold is resolved so JUnit // writer can use the resolved threshold for per-test pass/fail decisions) diff --git a/apps/web/src/content/docs/evaluation/eval-files.mdx b/apps/web/src/content/docs/evaluation/eval-files.mdx index 281614053..41c03eb97 100644 --- a/apps/web/src/content/docs/evaluation/eval-files.mdx +++ b/apps/web/src/content/docs/evaluation/eval-files.mdx @@ -34,7 +34,7 @@ tests: |-------|-------------| | `description` | Human-readable description of the evaluation | | `dataset` | Optional dataset identifier | -| `execution` | Default execution config (`target`, `fail_on_error`, etc.) | +| `execution` | Default execution config (`target`, `fail_on_error`, `threshold`, etc.) | | `workspace` | Suite-level workspace config — inline object or string path to an [external workspace file](/guides/workspace-pool/#external-workspace-config) | | `tests` | Array of individual tests, or a string path to an external file | | `assertions` | Suite-level evaluators appended to each test unless `execution.skip_defaults: true` is set on the test | diff --git a/apps/web/src/content/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/evaluation/running-evals.mdx index 7e221bbc6..5c502aa19 100644 --- a/apps/web/src/content/docs/evaluation/running-evals.mdx +++ b/apps/web/src/content/docs/evaluation/running-evals.mdx @@ -229,6 +229,33 @@ execution: When halted, remaining tests are recorded with `failureReasonCode: 'error_threshold_exceeded'`. With concurrency > 1, a few additional tests may complete before halting takes effect. +### Suite-Level Quality Threshold + +Set a minimum mean score for the eval suite. If the mean quality score falls below the threshold, the CLI exits with code 1 — useful for CI/CD quality gates. 
+ +**CLI flag:** + +```bash +agentv eval evals/ --threshold 0.8 +``` + +**YAML config:** + +```yaml +execution: + threshold: 0.8 +``` + +The CLI `--threshold` flag overrides the YAML value. The threshold is a number between 0 and 1. Mean score is computed from quality results only (execution errors are excluded). + +When active, a summary line is printed after the eval results: + +``` +Suite score: 0.85 (threshold: 0.80) — PASS +``` + +The threshold also controls JUnit XML pass/fail: tests with scores below the threshold are marked as `<failure>` in JUnit output. When no threshold is set, JUnit defaults to 0.5. + ## Validate Before Running Check eval files for schema errors without executing: diff --git a/docs/plans/2026-03-25-threshold-flag-design.md b/docs/plans/2026-03-25-threshold-flag-design.md deleted file mode 100644 index 29c6b5e74..000000000 --- a/docs/plans/2026-03-25-threshold-flag-design.md +++ /dev/null @@ -1,76 +0,0 @@ -# Design: `--threshold` flag for suite-level quality gates - -**Issue:** #698 -**Date:** 2026-03-25 - -## Objective - -Add a `--threshold` CLI flag to `agentv eval` that fails (exit 1) if the mean score across all tests falls below the specified threshold. This enables CI/CD quality gating without needing `agentv compare --baseline`. - -## CLI Flag - -- `--threshold ` on `agentv eval run` (0–1 scale) -- Optional — if omitted, no threshold check (current behavior preserved) -- Overrides `execution.threshold` from YAML if both set - -## YAML Config - -Add `threshold` to the `execution` block in eval YAML files: - -```yaml -execution: - threshold: 0.8 -``` - -Both `threshold` and `execution.threshold` accepted (snake_case wire format convention). - -## Score Evaluation - -After all tests complete: - -1. Compute mean score from quality results only (excluding `execution_error` tests — same as existing `calculateEvaluationSummary()`) -2. If mean score < threshold → exit code 1 -3.
Execution errors fail independently via existing `fail_on_error` mechanism (separate concern) -4. If no quality results exist (all execution errors), threshold check is skipped - -## Output - -When threshold is active, append a summary line after the existing result summary: - -``` -Suite score: 0.53 (threshold: 0.60) — FAIL -``` - -or: - -``` -Suite score: 0.85 (threshold: 0.60) — PASS -``` - -## JUnit Integration - -The JUnit writer uses the threshold for per-test pass/fail: - -- If threshold is set: `score < threshold` → `<failure>` element -- If threshold is not set: `score < 0.5` (current hardcoded behavior preserved) - -## Exit Code - -- Exit 0: mean score >= threshold (or no threshold set) -- Exit 1: mean score < threshold -- Execution errors handled separately by `fail_on_error` - -## Files to Modify - -1. `packages/core/src/evaluation/validation/eval-file.schema.ts` — add `threshold` to ExecutionSchema -2. `apps/cli/src/commands/eval/commands/run.ts` — add `--threshold` CLI flag -3. `apps/cli/src/commands/eval/run-eval.ts` — pass threshold through, check after results -4. `apps/cli/src/commands/eval/statistics.ts` — add threshold summary formatting -5. `apps/cli/src/commands/eval/junit-writer.ts` — use threshold for pass/fail -6. Tests for new behavior - -## Non-Goals - -- Per-test threshold override (use `required` for that) -- Replacement for `agentv compare` regression gating -- Severity levels (#334) diff --git a/docs/plans/2026-03-25-threshold-flag-plan.md b/docs/plans/2026-03-25-threshold-flag-plan.md deleted file mode 100644 index 57ba2eb53..000000000 --- a/docs/plans/2026-03-25-threshold-flag-plan.md +++ /dev/null @@ -1,562 +0,0 @@ -# `--threshold` Flag Implementation Plan - -> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. - -**Goal:** Add a `--threshold` CLI flag and `execution.threshold` YAML field to `agentv eval` that exits 1 when mean quality score falls below the threshold.
- -**Architecture:** The threshold value flows from CLI flag or YAML config through the existing options pipeline. After all tests complete, the summary is checked against the threshold. JUnit writer also uses the threshold for per-test pass/fail. - -**Tech Stack:** TypeScript, cmd-ts (CLI parsing), Zod (schema validation), Vitest (testing) - ---- - -### Task 1: Add `extractThreshold` to core config-loader - -**Files:** -- Modify: `packages/core/src/evaluation/loaders/config-loader.ts:287` (after `extractTotalBudgetUsd`) -- Test: `packages/core/test/evaluation/loaders/config-loader.test.ts` - -**Step 1: Write the failing tests** - -Add to `packages/core/test/evaluation/loaders/config-loader.test.ts` after the `extractFailOnError` describe block: - -```typescript -describe('extractThreshold', () => { - it('returns undefined when no execution block', () => { - const suite: JsonObject = { tests: [] }; - expect(extractThreshold(suite)).toBeUndefined(); - }); - - it('returns undefined when threshold not set', () => { - const suite: JsonObject = { execution: { target: 'default' } }; - expect(extractThreshold(suite)).toBeUndefined(); - }); - - it('parses valid threshold', () => { - const suite: JsonObject = { execution: { threshold: 0.8 } }; - expect(extractThreshold(suite)).toBe(0.8); - }); - - it('accepts 0 as threshold', () => { - const suite: JsonObject = { execution: { threshold: 0 } }; - expect(extractThreshold(suite)).toBe(0); - }); - - it('accepts 1 as threshold', () => { - const suite: JsonObject = { execution: { threshold: 1 } }; - expect(extractThreshold(suite)).toBe(1); - }); - - it('returns undefined for negative threshold', () => { - const suite: JsonObject = { execution: { threshold: -0.1 } }; - expect(extractThreshold(suite)).toBeUndefined(); - }); - - it('returns undefined for threshold > 1', () => { - const suite: JsonObject = { execution: { threshold: 1.5 } }; - expect(extractThreshold(suite)).toBeUndefined(); - }); - - it('returns undefined for 
non-number threshold', () => { - const suite: JsonObject = { execution: { threshold: 'high' } }; - expect(extractThreshold(suite)).toBeUndefined(); - }); -}); -``` - -Also add `extractThreshold` to the import at the top of the test file. - -**Step 2: Run tests to verify they fail** - -Run: `bun test packages/core/test/evaluation/loaders/config-loader.test.ts` -Expected: FAIL — `extractThreshold` not found - -**Step 3: Implement `extractThreshold`** - -Add to `packages/core/src/evaluation/loaders/config-loader.ts` after `extractTotalBudgetUsd` (after line ~308): - -```typescript -/** - * Extract `execution.threshold` from parsed eval suite. - * Accepts a number in [0, 1] range. - * Returns undefined when not specified. - */ -export function extractThreshold(suite: JsonObject): number | undefined { - const execution = suite.execution; - if (!execution || typeof execution !== 'object' || Array.isArray(execution)) { - return undefined; - } - - const executionObj = execution as Record; - const raw = executionObj.threshold; - - if (raw === undefined || raw === null) { - return undefined; - } - - if (typeof raw === 'number' && raw >= 0 && raw <= 1) { - return raw; - } - - logWarning( - `Invalid execution.threshold: ${raw}. Must be a number between 0 and 1. 
Ignoring.`, - ); - return undefined; -} -``` - -**Step 4: Run tests to verify they pass** - -Run: `bun test packages/core/test/evaluation/loaders/config-loader.test.ts` -Expected: PASS - -**Step 5: Commit** - -```bash -git add packages/core/src/evaluation/loaders/config-loader.ts packages/core/test/evaluation/loaders/config-loader.test.ts -git commit -m "feat(core): add extractThreshold for execution.threshold YAML field (#698)" -``` - ---- - -### Task 2: Wire `extractThreshold` through YAML parser and schema - -**Files:** -- Modify: `packages/core/src/evaluation/yaml-parser.ts:12` (imports), `:58` (re-exports), `:204` (loadTestSuite) -- Modify: `packages/core/src/evaluation/yaml-parser.ts:168` (EvalSuiteResult type) -- Modify: `packages/core/src/evaluation/validation/eval-file.schema.ts:317` (ExecutionSchema) - -**Step 1: Add `threshold` to ExecutionSchema in eval-file.schema.ts** - -In `packages/core/src/evaluation/validation/eval-file.schema.ts`, add to the `ExecutionSchema` object (after `failOnError` at line 330): - -```typescript - threshold: z.number().min(0).max(1).optional(), -``` - -**Step 2: Add to EvalSuiteResult type in yaml-parser.ts** - -In `packages/core/src/evaluation/yaml-parser.ts`, add to the `EvalSuiteResult` type (after `failOnError` at line 182): - -```typescript - /** Suite-level quality threshold (0-1) — suite fails if mean score is below */ - readonly threshold?: number; -``` - -**Step 3: Import and re-export `extractThreshold` in yaml-parser.ts** - -Add `extractThreshold` to the import from `./loaders/config-loader.js` (line 12 area) and the re-export block (line 58 area). 
- -**Step 4: Use in `loadTestSuite`** - -In the `loadTestSuite` function (around line 203), extract and return threshold: - -```typescript - const threshold = extractThreshold(parsed); - return { - tests, - trials: extractTrialsConfig(parsed), - targets: extractTargetsFromSuite(parsed), - workers: extractWorkersFromSuite(parsed), - cacheConfig: extractCacheConfig(parsed), - totalBudgetUsd: extractTotalBudgetUsd(parsed), - ...(metadata !== undefined && { metadata }), - ...(failOnError !== undefined && { failOnError }), - ...(threshold !== undefined && { threshold }), - }; -``` - -**Step 5: Regenerate the JSON schema** - -Run: `bun run generate:schema` - -**Step 6: Run core tests** - -Run: `bun test packages/core/test/evaluation/loaders/config-loader.test.ts` -Expected: PASS - -**Step 7: Commit** - -```bash -git add packages/core/src/evaluation/validation/eval-file.schema.ts packages/core/src/evaluation/yaml-parser.ts -git commit -m "feat(core): wire extractThreshold through YAML parser and schema (#698)" -``` - ---- - -### Task 3: Add `--threshold` CLI flag and pass through to run-eval - -**Files:** -- Modify: `apps/cli/src/commands/eval/commands/run.ts` (add CLI flag) -- Modify: `apps/cli/src/commands/eval/run-eval.ts` (NormalizedOptions, normalizeOptions, handler return) - -**Step 1: Add CLI flag to run.ts** - -In `apps/cli/src/commands/eval/commands/run.ts`, add after the `model` option (around line 171): - -```typescript - threshold: option({ - type: optional(number), - long: 'threshold', - description: 'Suite-level quality gate: exit 1 if mean score falls below this value (0-1)', - }), -``` - -And add `threshold: args.threshold` to the `rawOptions` object in the handler (around line 219). 
- -**Step 2: Add to NormalizedOptions in run-eval.ts** - -In `apps/cli/src/commands/eval/run-eval.ts`, add to the `NormalizedOptions` interface: - -```typescript - readonly threshold?: number; -``` - -**Step 3: Add to normalizeOptions** - -In the `normalizeOptions` function, add threshold resolution (CLI > YAML): - -```typescript - // Resolve threshold: CLI --threshold > YAML execution.threshold - const cliThreshold = normalizeOptionalNumber(rawOptions.threshold); -``` - -And in the return statement: - -```typescript - threshold: cliThreshold, -``` - -**Step 4: Wire YAML threshold into normalized options** - -In `runEvalCommand`, after `prepareEvalFile` returns, merge the YAML threshold if CLI didn't set one. In the loop over eval files (around the `prepareEvalFile` call), capture `suite.threshold` and pass it through. - -The cleanest approach: read the YAML threshold in `prepareEvalFile` and return it alongside the other fields. Then in the main `runEvalCommand`, resolve CLI vs YAML threshold. 
- -Add `threshold` to the `prepareEvalFile` return type (alongside `failOnError`): - -```typescript - readonly threshold?: number; -``` - -And in `prepareEvalFile`, add after `failOnError: suite.failOnError`: - -```typescript - threshold: suite.threshold, -``` - -**Step 5: Commit** - -```bash -git add apps/cli/src/commands/eval/commands/run.ts apps/cli/src/commands/eval/run-eval.ts -git commit -m "feat(cli): add --threshold flag and wire through options pipeline (#698)" -``` - ---- - -### Task 4: Add threshold check and summary output after eval completes - -**Files:** -- Modify: `apps/cli/src/commands/eval/run-eval.ts` (after summary calculation ~line 1152) -- Modify: `apps/cli/src/commands/eval/statistics.ts` (add `formatThresholdSummary`) -- Test: `apps/cli/test/commands/eval/threshold.test.ts` (new) - -**Step 1: Write failing tests** - -Create `apps/cli/test/commands/eval/threshold.test.ts`: - -```typescript -import { describe, expect, it } from 'bun:test'; - -import type { EvaluationResult } from '@agentv/core'; - -import { formatThresholdSummary } from '../../../src/commands/eval/statistics.js'; - -function makeResult(overrides: Partial = {}): EvaluationResult { - return { - timestamp: '2024-01-01T00:00:00Z', - testId: 'test-1', - score: 1.0, - assertions: [{ text: 'criterion-1', passed: true }], - output: [{ role: 'assistant' as const, content: 'answer' }], - target: 'default', - ...overrides, - }; -} - -describe('formatThresholdSummary', () => { - it('returns PASS when mean score meets threshold', () => { - const result = formatThresholdSummary(0.85, 0.6); - expect(result.passed).toBe(true); - expect(result.message).toContain('0.85'); - expect(result.message).toContain('0.60'); - expect(result.message).toContain('PASS'); - }); - - it('returns FAIL when mean score is below threshold', () => { - const result = formatThresholdSummary(0.53, 0.6); - expect(result.passed).toBe(false); - expect(result.message).toContain('0.53'); - 
expect(result.message).toContain('0.60'); - expect(result.message).toContain('FAIL'); - }); - - it('returns PASS when mean score exactly equals threshold', () => { - const result = formatThresholdSummary(0.6, 0.6); - expect(result.passed).toBe(true); - }); - - it('returns PASS for threshold 0 with any score', () => { - const result = formatThresholdSummary(0, 0); - expect(result.passed).toBe(true); - }); -}); -``` - -**Step 2: Run tests to verify they fail** - -Run: `bun test apps/cli/test/commands/eval/threshold.test.ts` -Expected: FAIL — `formatThresholdSummary` not found - -**Step 3: Implement `formatThresholdSummary` in statistics.ts** - -Add to `apps/cli/src/commands/eval/statistics.ts`: - -```typescript -/** - * Format a threshold check summary line. - * Returns whether the threshold was met and the formatted message. - */ -export function formatThresholdSummary( - meanScore: number, - threshold: number, -): { passed: boolean; message: string } { - const passed = meanScore >= threshold; - const verdict = passed ? 'PASS' : 'FAIL'; - const message = `Suite score: ${meanScore.toFixed(2)} (threshold: ${threshold.toFixed(2)}) — ${verdict}`; - return { passed, message }; -} -``` - -**Step 4: Run tests to verify they pass** - -Run: `bun test apps/cli/test/commands/eval/threshold.test.ts` -Expected: PASS - -**Step 5: Wire the threshold check into run-eval.ts** - -In `apps/cli/src/commands/eval/run-eval.ts`, after the summary is printed (around line 1153), add: - -```typescript - // Threshold quality gate check - const resolvedThreshold = options.threshold ?? yamlThreshold; - if (resolvedThreshold !== undefined) { - const { formatThresholdSummary } = await import('./statistics.js'); - const thresholdResult = formatThresholdSummary(summary.mean, resolvedThreshold); - console.log(`\n${thresholdResult.message}`); - if (!thresholdResult.passed) { - process.exitCode = 1; - } - } -``` - -Note: `yamlThreshold` needs to be captured from the `prepareEvalFile` results. 
If multiple eval files are run, use the first non-undefined threshold (or the CLI value). - -Import `formatThresholdSummary` statically at the top (preferred over dynamic import since it's in the same package): - -```typescript -import { - calculateEvaluationSummary, - formatEvaluationSummary, - formatMatrixSummary, - formatThresholdSummary, -} from './statistics.js'; -``` - -**Step 6: Commit** - -```bash -git add apps/cli/src/commands/eval/statistics.ts apps/cli/src/commands/eval/run-eval.ts apps/cli/test/commands/eval/threshold.test.ts -git commit -m "feat(cli): add threshold check with summary output after eval (#698)" -``` - ---- - -### Task 5: JUnit writer uses threshold for per-test pass/fail - -**Files:** -- Modify: `apps/cli/src/commands/eval/junit-writer.ts` -- Modify: `apps/cli/test/commands/eval/output-writers.test.ts` (add tests) - -**Step 1: Write failing tests** - -Add to `apps/cli/test/commands/eval/output-writers.test.ts` in the JUnit describe block: - -```typescript - it('uses custom threshold for pass/fail when provided', async () => { - const filePath = path.join(testDir, `junit-threshold-${Date.now()}.xml`); - const writer = await JunitWriter.open(filePath, { threshold: 0.8 }); - - await writer.append(makeResult({ testId: 'high', score: 0.9 })); - await writer.append(makeResult({ testId: 'mid', score: 0.6 })); - await writer.close(); - - const xml = await readFile(filePath, 'utf8'); - expect(xml).not.toContain(' { - const filePath = path.join(testDir, `junit-default-${Date.now()}.xml`); - const writer = await JunitWriter.open(filePath); - - await writer.append(makeResult({ testId: 'pass', score: 0.6 })); - await writer.append(makeResult({ testId: 'fail', score: 0.3 })); - await writer.close(); - - const xml = await readFile(filePath, 'utf8'); - expect(xml).not.toContain(' { - await mkdir(path.dirname(filePath), { recursive: true }); - return new JunitWriter(filePath, options); - } -``` - -Then replace all `r.score < 0.5` with `r.score < 
this.threshold` in the `close()` method. - -**Step 4: Pass threshold to JunitWriter in output-writer.ts** - -In `apps/cli/src/commands/eval/output-writer.ts`, where JunitWriter is created, pass the threshold. Check how output writers are created and thread the threshold through. - -**Step 5: Run tests to verify they pass** - -Run: `bun test apps/cli/test/commands/eval/output-writers.test.ts` -Expected: PASS - -**Step 6: Commit** - -```bash -git add apps/cli/src/commands/eval/junit-writer.ts apps/cli/src/commands/eval/output-writer.ts apps/cli/test/commands/eval/output-writers.test.ts -git commit -m "feat(cli): JUnit writer uses --threshold for per-test pass/fail (#698)" -``` - ---- - -### Task 6: Add `threshold` to Zod schema and regenerate JSON schema - -**Files:** -- Modify: `packages/core/src/evaluation/validation/eval-file.schema.ts` (already done in Task 2) -- Run: `bun run generate:schema` - -**Step 1: Verify threshold is in ExecutionSchema** - -Read `packages/core/src/evaluation/validation/eval-file.schema.ts` and confirm `threshold` was added in Task 2. 
- -**Step 2: Regenerate JSON schema** - -Run: `bun run generate:schema` - -**Step 3: Run validate:examples to check existing YAML files still pass** - -Run: `bun run validate:examples` -Expected: PASS (threshold is optional, so existing files are unaffected) - -**Step 4: Commit if schema file changed** - -```bash -git add packages/core/ -git commit -m "chore: regenerate eval-schema.json with threshold field (#698)" -``` - ---- - -### Task 7: Run full test suite and verify - -**Step 1: Run all tests** - -Run: `bun run test` -Expected: PASS (except any pre-existing known failures) - -**Step 2: Run typecheck** - -Run: `bun run typecheck` -Expected: PASS - -**Step 3: Run lint** - -Run: `bun run lint` -Expected: PASS - -**Step 4: Run build** - -Run: `bun run build` -Expected: PASS - ---- - -### Task 8: Manual red/green UAT - -**Step 1: Red — verify no threshold behavior on main** - -Run an eval without --threshold: - -```bash -bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id summary-1 -``` - -Confirm: no "Suite score" line in output, exit code is 0. - -**Step 2: Green — verify --threshold works** - -Run with a threshold that should PASS: - -```bash -bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id summary-1 --threshold 0.3 -``` - -Confirm: "Suite score: X.XX (threshold: 0.30) — PASS" printed, exit code 0. - -Run with a threshold that should FAIL: - -```bash -bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id summary-1 --threshold 0.99 -``` - -Confirm: "Suite score: X.XX (threshold: 0.99) — FAIL" printed, exit code 1. - -**Step 3: Verify JUnit output uses threshold** - -```bash -bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id summary-1 --threshold 0.9 -o /tmp/test-threshold.xml -``` - -Inspect the XML: tests with score < 0.9 should have `<failure>` elements.
diff --git a/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md index efc818f3c..7a6f2c3f5 100644 --- a/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md @@ -520,11 +520,24 @@ execution: When halted, remaining tests get `executionStatus: 'execution_error'` with `failureReasonCode: 'error_threshold_exceeded'`. +## Suite-Level Quality Threshold + +Set a minimum mean score for the eval suite. If the mean quality score falls below the threshold, the CLI exits with code 1 — useful for CI/CD quality gates. + +```yaml +execution: + threshold: 0.8 +``` + +CLI flag `--threshold 0.8` overrides the YAML value. Must be a number between 0 and 1. Mean score is computed from quality results only (execution errors excluded). + +The threshold also controls JUnit XML pass/fail: tests with scores below the threshold are marked as `<failure>`. When no threshold is set, JUnit defaults to 0.5. + ## CLI Commands ```bash # Run evaluation (requires API keys) -agentv eval [--test-id ] [--target ] [--dry-run] +agentv eval [--test-id ] [--target ] [--dry-run] [--threshold <0-1>] # Run with OTLP JSON file (importable by OTel backends) agentv eval --otel-file traces/eval.otlp.json