diff --git a/AGENTS.md b/AGENTS.md index e0a6b1aa8..cb022cb1b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -258,3 +258,185 @@ When making changes to functionality: 2. **Skill files** (`plugins/agentv-dev/skills/agentv-eval-builder/`): Update the AI-focused reference card if the change affects YAML schema, evaluator types, or CLI commands. Keep concise — link to docs site for details. 3. **Examples** (`examples/`): Update any example code, scripts, or eval YAML files that exercise the changed functionality. Examples are both documentation and integration tests. + +4. **README.md**: Keep minimal. Links point to agentv.dev. + +## Evaluator Type System + +Evaluator types use **kebab-case** everywhere (matching promptfoo convention): + +- **YAML config:** `type: llm-grader`, `type: is-json`, `type: execution-metrics` +- **Internal TypeScript:** `EvaluatorKind = 'llm-grader' | 'is-json' | ...` +- **Output `scores[].type`:** `"llm-grader"`, `"is-json"` +- **Registry keys:** `registry.register('llm-grader', ...)` + +**Source of truth:** `EVALUATOR_KIND_VALUES` array in `packages/core/src/evaluation/types.ts` + +**Backward compatibility:** Snake_case is accepted in YAML (`llm_judge` → `llm-grader`) via `normalizeEvaluatorType()` in `evaluator-parser.ts`. Single-word types (`contains`, `equals`, `regex`, `latency`, `cost`) have no separator and are unchanged. + +**Two type definitions exist:** +- `EvaluatorKind` in `packages/core/src/evaluation/types.ts` — internal, canonical +- `AssertionType` in `packages/eval/src/assertion.ts` — SDK-facing, must stay in sync + +## Git Workflow + +### Commit Convention + +Follow conventional commits: `type(scope): description` + +Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore` + +### Issue Workflow + +When working on a GitHub issue, **ALWAYS** follow this workflow: + +1. **Claim the issue** — prevents other agents from duplicating work: + ```bash + # Load AGENT_ID from .env; if not set, ask the user or default to - + # Harness = the coding tool (claude-code, opencode, codex-cli, cursor, etc.) + # Model = the LLM (opus, sonnet, o3, etc.) + # Examples: "claude-code-opus", "opencode-sonnet", "cursor-o3", "codex-cli-o3" + # In this local dev environment, default to "devbox2-codex" unless the user specifies another AGENT_ID. + # Do NOT use hostname or machine name. + source .env 2>/dev/null + if [ -z "$AGENT_ID" ]; then + echo "AGENT_ID is not set. Ask the user for an agent identifier, or default to devbox2-codex in this environment (otherwise use -)." + fi + + # Check if already claimed + gh issue view --json labels --jq '.labels[].name' | grep -q "in-progress" && echo "SKIP — already claimed" && exit 1 + + # Claim it — label + project roadmap status + gh issue edit --add-label "in-progress" + + # Update project roadmap: set status to "In Progress" and stamp Agent ID + ITEM_ID=$(gh project item-list 1 --owner EntityProcess --format json | jq -r '.items[] | select(.content.number == and .content.repository == "agentv") | .id') + if [ -n "$ITEM_ID" ]; then + gh project item-edit --project-id PVT_kwDOAIbbRc4BSmjF --id "$ITEM_ID" --field-id PVTSSF_lADOAIbbRc4BSmjFzhAFomw --single-select-option-id 47fc9ee4 + gh project item-edit --project-id PVT_kwDOAIbbRc4BSmjF --id "$ITEM_ID" --field-id PVTF_lADOAIbbRc4BSmjFzhAHSnk --text "$AGENT_ID" + fi + ``` + If the issue has the `in-progress` label, **do not work on it** — pick a different issue. + +2. **Create a worktree** with a feature branch: + ```bash + git worktree add agentv.worktrees/ -b /- + cd agentv.worktrees/ + bun install + cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env + # Example: git worktree add agentv.worktrees/feat/42-add-new-embedder -b feat/42-add-new-embedder + ``` + +3. **Implement the changes** and commit following the commit convention + +4. **Push the branch and create a Pull Request**: + ```bash + git push -u origin + gh pr create --title "(scope): description" --body "Closes #" + ``` + +5. **Before merging**, ensure: + - **E2E verification completed** (see "Completing Work — E2E Checklist") + - CI pipeline passes (all checks green) + - Code has been reviewed if required + - No merge conflicts with `main` + +The `in-progress` label stays on the issue until the PR is merged and the issue is closed. Do not remove it manually. + +**IMPORTANT:** Never push directly to `main`. Always use branches and PRs. + +### Tracker Conventions + +- The roadmap project is the source of truth for prioritization. +- Issues in the roadmap are prioritized; issues outside it are not. +- `bug` marks defects. +- Issues without `bug` are non-bug work by default. +- `in-progress` marks an issue as claimed by an agent — do not start work on it. +- `core`, `wui`, and `tui` are area labels. +- Keep issue bodies focused on the handoff contract: objective, design latitude, acceptance signals, non-goals, and related links. +- Do not put priority metadata in issue bodies. + +### Pull Requests + +**Always use squash merge** when merging PRs to main. This keeps the commit history clean with one commit per feature/fix. + +```bash +# Using GitHub CLI to squash merge a PR +gh pr merge --squash --delete-branch + +# Or with auto-merge enabled +gh pr merge --squash --auto +``` + +Do NOT use regular merge or rebase merge, as these create noisy commit history with intermediate commits. + +### After Squash Merge + +Once a PR is squash-merged, its source branch diverges from main. **Do NOT** try to push additional commits from that branch—you will get merge conflicts. + +For follow-up fixes: +```bash +git checkout main +git pull origin main +git checkout -b fix/ +# Apply fixes on the fresh branch +``` + +### Plans and Worktrees + +#### Plans + +Design documents and implementation plans are stored in `docs/plans/` inside the worktree (not the main repo). Save plans to the worktree so they are committed on the feature branch and visible in the draft PR. + +**Path warning:** When working in a worktree, use paths relative to the worktree root (e.g., `docs/plans/plan.md`). Do NOT prefix with the worktree directory from the main repo (e.g., `agentv.worktrees/feat/xxx/docs/plans/plan.md`) — this creates accidental nested directories inside the worktree. + +Plans are temporary working materials. **Before merging the PR**, delete the plan file and incorporate any user-relevant details into the official documentation. + +#### Git Worktrees + +Use the sibling `../agentv.worktrees/` directory for all AgentV worktrees. This overrides any generic skill or default preference for `.worktrees/` or `worktrees/` inside the repository. Do not create new AgentV worktrees inside the repository root. + +After creating a worktree, always run setup: +```bash +bun install # worktrees do NOT share node_modules +cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env # required for e2e tests and LLM operations +``` +Both steps are required before running builds, tests, or evals in the worktree. + +## Version Management + +This project uses a simple release script for version bumping. The git commit history serves as the changelog. + +### Releasing a new version + +Run the release script for a version bump: + +```bash +bun run release # patch bump (default) +bun run release minor # minor bump +bun run release major # major bump +``` + +The script will: +1. Validate you're on the `main` branch with no uncommitted changes +2. Pull latest changes from origin +3. Bump version in all package.json files +4. Commit the version bump +5. Create and push a git tag + +Recommended publish flow: +```bash +bun run publish:next # publish current version to npm `next` +bun run promote:latest # promote same version to npm `latest` +bun run tag:next 2.18.0 +bun run promote:latest 2.18.0 +``` + +## Package Publishing +- Core package (`packages/core/`) - Core evaluation engine and grading logic (published as `@agentv/core`) +- CLI package (`apps/cli/`) is published as `agentv` on npm +- Uses tsup with `noExternal: ["@agentv/core"]` to bundle workspace dependencies +- Install command: `bun install -g agentv` (preferred) or `npm install -g agentv` + +## Python Scripts +When running Python scripts, always use: `uv run ` diff --git a/apps/cli/src/commands/eval/commands/run.ts b/apps/cli/src/commands/eval/commands/run.ts index e680301f3..5df5ee42b 100644 --- a/apps/cli/src/commands/eval/commands/run.ts +++ b/apps/cli/src/commands/eval/commands/run.ts @@ -175,6 +175,11 @@ export const evalRunCommand = command({ description: 'Number of trailing messages to include in results output (default: 1, or "all")', }), + threshold: option({ + type: optional(number), + long: 'threshold', + description: 'Suite-level quality gate: exit 1 if mean score falls below this value (0-1)', + }), }, handler: async (args) => { // Launch interactive wizard when no eval paths and stdin is a TTY @@ -217,7 +222,11 @@ export const evalRunCommand = command({ graderTarget: args.graderTarget, model: args.model, outputMessages: args.outputMessages, + threshold: args.threshold, }; - await runEvalCommand({ testFiles: resolvedPaths, rawOptions }); + const result = await runEvalCommand({ testFiles: resolvedPaths, rawOptions }); + if (result?.thresholdFailed) { + process.exit(1); + } }, }); diff --git a/apps/cli/src/commands/eval/junit-writer.ts b/apps/cli/src/commands/eval/junit-writer.ts index f3bfb7f18..514b24585 100644 --- a/apps/cli/src/commands/eval/junit-writer.ts +++ b/apps/cli/src/commands/eval/junit-writer.ts @@ -3,6 +3,10 @@ import path from 'node:path'; import type { EvaluationResult } from '@agentv/core'; +export interface JunitWriterOptions { + readonly threshold?: number; +} + export function escapeXml(str: string): string { return str .replace(/&/g, '&') @@ -15,15 +19,17 @@ export function escapeXml(str: string): string { export class JunitWriter { private readonly filePath: string; private readonly results: EvaluationResult[] = []; + private readonly threshold: number; private closed = false; - private constructor(filePath: string) { + private constructor(filePath: string, options?: JunitWriterOptions) { this.filePath = filePath; + this.threshold = options?.threshold ?? 0.5; } - static async open(filePath: string): Promise { + static async open(filePath: string, options?: JunitWriterOptions): Promise { await mkdir(path.dirname(filePath), { recursive: true }); - return new JunitWriter(filePath); + return new JunitWriter(filePath, options); } async append(result: EvaluationResult): Promise { @@ -52,7 +58,7 @@ export class JunitWriter { const suiteXmls: string[] = []; for (const [suiteName, results] of grouped) { - const failures = results.filter((r) => r.score < 0.5).length; + const failures = results.filter((r) => r.score < this.threshold).length; const errors = results.filter((r) => r.error !== undefined).length; const testCases = results.map((r) => { @@ -61,7 +67,7 @@ export class JunitWriter { let inner = ''; if (r.error) { inner = `\n ${escapeXml(r.error)}\n `; - } else if (r.score < 0.5) { + } else if (r.score < this.threshold) { const message = `score=${r.score.toFixed(3)}`; const failedAssertions = r.assertions.filter((a) => !a.passed); const detail = [ @@ -84,7 +90,7 @@ export class JunitWriter { } const totalTests = this.results.length; - const totalFailures = this.results.filter((r) => r.score < 0.5).length; + const totalFailures = this.results.filter((r) => r.score < this.threshold).length; const totalErrors = this.results.filter((r) => r.error !== undefined).length; const xml = `\n\n${suiteXmls.join('\n')}\n\n`; diff --git a/apps/cli/src/commands/eval/output-writer.ts b/apps/cli/src/commands/eval/output-writer.ts index acaf757fe..e4d2cebd8 100644 --- a/apps/cli/src/commands/eval/output-writer.ts +++ b/apps/cli/src/commands/eval/output-writer.ts @@ -15,6 +15,10 @@ export interface OutputWriter { close(): Promise; } +export interface WriterOptions { + readonly threshold?: number; +} + export async function createOutputWriter( filePath: string, format: OutputFormat, @@ -35,7 +39,10 @@ export async function createOutputWriter( const SUPPORTED_EXTENSIONS = new Set(['.jsonl', '.json', '.xml', '.yaml', '.yml', '.html', '.htm']); -export function createWriterFromPath(filePath: string): Promise { +export function createWriterFromPath( + filePath: string, + options?: WriterOptions, +): Promise { const ext = path.extname(filePath).toLowerCase(); switch (ext) { case '.jsonl': @@ -43,7 +50,7 @@ export function createWriterFromPath(filePath: string): Promise { case '.json': return JsonWriter.open(filePath); case '.xml': - return JunitWriter.open(filePath); + return JunitWriter.open(filePath, { threshold: options?.threshold }); case '.yaml': case '.yml': return YamlWriter.open(filePath); @@ -57,8 +64,11 @@ export function createWriterFromPath(filePath: string): Promise { } } -export async function createMultiWriter(filePaths: readonly string[]): Promise { - const writers = await Promise.all(filePaths.map((fp) => createWriterFromPath(fp))); +export async function createMultiWriter( + filePaths: readonly string[], + options?: WriterOptions, +): Promise { + const writers = await Promise.all(filePaths.map((fp) => createWriterFromPath(fp, options))); return { async append(result: EvaluationResult): Promise { await Promise.all(writers.map((w) => w.append(result))); diff --git a/apps/cli/src/commands/eval/run-eval.ts b/apps/cli/src/commands/eval/run-eval.ts index 2d486eab2..ac3a84cd9 100644 --- a/apps/cli/src/commands/eval/run-eval.ts +++ b/apps/cli/src/commands/eval/run-eval.ts @@ -45,6 +45,7 @@ import { calculateEvaluationSummary, formatEvaluationSummary, formatMatrixSummary, + formatThresholdSummary, } from './statistics.js'; import { type TargetSelection, selectMultipleTargets, selectTarget } from './targets.js'; @@ -86,6 +87,7 @@ interface NormalizedOptions { readonly graderTarget?: string; readonly model?: string; readonly outputMessages: number | 'all'; + readonly threshold?: number; } function normalizeBoolean(value: unknown): boolean { @@ -301,6 +303,7 @@ function normalizeOptions( graderTarget: normalizeString(rawOptions.graderTarget), model: normalizeString(rawOptions.model), outputMessages: normalizeOutputMessages(normalizeString(rawOptions.outputMessages)), + threshold: normalizeOptionalNumber(rawOptions.threshold), } satisfies NormalizedOptions; } @@ -430,6 +433,7 @@ async function prepareFileMetadata(params: { readonly yamlCachePath?: string; readonly totalBudgetUsd?: number; readonly failOnError?: FailOnError; + readonly threshold?: number; }> { const { testFilePath, repoRoot, cwd, options } = params; @@ -515,6 +519,7 @@ async function prepareFileMetadata(params: { yamlCachePath: suite.cacheConfig?.cachePath, totalBudgetUsd: suite.totalBudgetUsd, failOnError: suite.failOnError, + threshold: suite.threshold, }; } @@ -749,6 +754,8 @@ export interface RunEvalResult { readonly outputPath: string; readonly testFiles: readonly string[]; readonly target?: string; + /** True when --threshold is set and mean score is below the threshold */ + readonly thresholdFailed?: boolean; } export async function runEvalCommand( @@ -901,12 +908,9 @@ export async function runEvalCommand( extraOutputPaths.length > 0 ? [outputPath, ...extraOutputPaths] : [outputPath]; const uniqueReportedOutputPaths = [...new Set(reportedOutputPaths)]; - let outputWriter: OutputWriter; if (uniqueOutputPaths.length === 1) { - outputWriter = await createOutputWriter(primaryWritePath, options.format); console.log(`Output path: ${outputPath}`); } else { - outputWriter = await createMultiWriter(uniqueOutputPaths); console.log('Output paths:'); for (const p of uniqueReportedOutputPaths) { console.log(` ${p}`); @@ -951,6 +955,7 @@ export async function runEvalCommand( readonly yamlCachePath?: string; readonly totalBudgetUsd?: number; readonly failOnError?: FailOnError; + readonly threshold?: number; } >(); // Separate TypeScript/JS eval files from YAML files. @@ -1006,6 +1011,24 @@ export async function runEvalCommand( console.log(`Response cache: enabled${yamlCachePath ? ` (${yamlCachePath})` : ''}`); } + // Resolve suite-level threshold: CLI --threshold takes precedence over YAML execution.threshold + const yamlThreshold = firstMeta?.threshold; + const resolvedThreshold = options.threshold ?? yamlThreshold; + if (resolvedThreshold !== undefined && (resolvedThreshold < 0 || resolvedThreshold > 1)) { + throw new Error('--threshold must be between 0 and 1'); + } + + // Build the output writer (deferred until after threshold is resolved so JUnit + // writer can use the resolved threshold for per-test pass/fail decisions) + const writerOptions = + resolvedThreshold !== undefined ? { threshold: resolvedThreshold } : undefined; + let outputWriter: OutputWriter; + if (uniqueOutputPaths.length === 1) { + outputWriter = await createOutputWriter(primaryWritePath, options.format); + } else { + outputWriter = await createMultiWriter(uniqueOutputPaths, writerOptions); + } + // Detect matrix mode: multiple targets for any file const isMatrixMode = Array.from(fileMetadata.values()).some((meta) => meta.selections.length > 1); @@ -1152,6 +1175,14 @@ export async function runEvalCommand( const summary = calculateEvaluationSummary(allResults); console.log(formatEvaluationSummary(summary)); + // Threshold quality gate check + let thresholdFailed = false; + if (resolvedThreshold !== undefined) { + const thresholdResult = formatThresholdSummary(summary.mean, resolvedThreshold); + console.log(`\n${thresholdResult.message}`); + thresholdFailed = !thresholdResult.passed; + } + // Print matrix summary when multiple targets were evaluated if (isMatrixMode && allResults.length > 0) { console.log(formatMatrixSummary(allResults)); @@ -1246,6 +1277,7 @@ export async function runEvalCommand( outputPath, testFiles: resolvedTestFiles, target: options.target, + thresholdFailed, }; } finally { unsubscribeCodexLogs(); diff --git a/apps/cli/src/commands/eval/statistics.ts b/apps/cli/src/commands/eval/statistics.ts index e47a65791..910052d24 100644 --- a/apps/cli/src/commands/eval/statistics.ts +++ b/apps/cli/src/commands/eval/statistics.ts @@ -334,3 +334,17 @@ export function formatMatrixSummary(results: readonly EvaluationResult[]): strin return lines.join('\n'); } + +/** + * Format a threshold check summary line. + * Returns whether the threshold was met and the formatted message. + */ +export function formatThresholdSummary( + meanScore: number, + threshold: number, +): { passed: boolean; message: string } { + const passed = meanScore >= threshold; + const verdict = passed ? 'PASS' : 'FAIL'; + const message = `Suite score: ${meanScore.toFixed(2)} (threshold: ${threshold.toFixed(2)}) — ${verdict}`; + return { passed, message }; +} diff --git a/apps/cli/test/commands/eval/output-writers.test.ts b/apps/cli/test/commands/eval/output-writers.test.ts index 8c1ea67fb..75ff80da2 100644 --- a/apps/cli/test/commands/eval/output-writers.test.ts +++ b/apps/cli/test/commands/eval/output-writers.test.ts @@ -162,6 +162,32 @@ describe('JunitWriter', () => { 'Cannot write to closed JUnit writer', ); }); + + it('uses custom threshold for pass/fail when provided', async () => { + const filePath = path.join(testDir, `junit-threshold-${Date.now()}.xml`); + const writer = await JunitWriter.open(filePath, { threshold: 0.8 }); + + await writer.append(makeResult({ testId: 'high', score: 0.9 })); + await writer.append(makeResult({ testId: 'mid', score: 0.6 })); + await writer.close(); + + const xml = await readFile(filePath, 'utf8'); + expect(xml).not.toContain(' { + const filePath = path.join(testDir, `junit-default-${Date.now()}.xml`); + const writer = await JunitWriter.open(filePath); + + await writer.append(makeResult({ testId: 'pass', score: 0.6 })); + await writer.append(makeResult({ testId: 'fail', score: 0.3 })); + await writer.close(); + + const xml = await readFile(filePath, 'utf8'); + expect(xml).not.toContain(' { diff --git a/apps/cli/test/commands/eval/threshold.test.ts b/apps/cli/test/commands/eval/threshold.test.ts new file mode 100644 index 000000000..65c059167 --- /dev/null +++ b/apps/cli/test/commands/eval/threshold.test.ts @@ -0,0 +1,31 @@ +import { describe, expect, it } from 'bun:test'; + +import { formatThresholdSummary } from '../../../src/commands/eval/statistics.js'; + +describe('formatThresholdSummary', () => { + it('returns PASS when mean score meets threshold', () => { + const result = formatThresholdSummary(0.85, 0.6); + expect(result.passed).toBe(true); + expect(result.message).toContain('0.85'); + expect(result.message).toContain('0.60'); + expect(result.message).toContain('PASS'); + }); + + it('returns FAIL when mean score is below threshold', () => { + const result = formatThresholdSummary(0.53, 0.6); + expect(result.passed).toBe(false); + expect(result.message).toContain('0.53'); + expect(result.message).toContain('0.60'); + expect(result.message).toContain('FAIL'); + }); + + it('returns PASS when mean score exactly equals threshold', () => { + const result = formatThresholdSummary(0.6, 0.6); + expect(result.passed).toBe(true); + }); + + it('returns PASS for threshold 0 with any score', () => { + const result = formatThresholdSummary(0, 0); + expect(result.passed).toBe(true); + }); +}); diff --git a/apps/web/src/content/docs/evaluation/eval-files.mdx b/apps/web/src/content/docs/evaluation/eval-files.mdx index 281614053..41c03eb97 100644 --- a/apps/web/src/content/docs/evaluation/eval-files.mdx +++ b/apps/web/src/content/docs/evaluation/eval-files.mdx @@ -34,7 +34,7 @@ tests: |-------|-------------| | `description` | Human-readable description of the evaluation | | `dataset` | Optional dataset identifier | -| `execution` | Default execution config (`target`, `fail_on_error`, etc.) | +| `execution` | Default execution config (`target`, `fail_on_error`, `threshold`, etc.) | | `workspace` | Suite-level workspace config — inline object or string path to an [external workspace file](/guides/workspace-pool/#external-workspace-config) | | `tests` | Array of individual tests, or a string path to an external file | | `assertions` | Suite-level evaluators appended to each test unless `execution.skip_defaults: true` is set on the test | diff --git a/apps/web/src/content/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/evaluation/running-evals.mdx index 7e221bbc6..5c502aa19 100644 --- a/apps/web/src/content/docs/evaluation/running-evals.mdx +++ b/apps/web/src/content/docs/evaluation/running-evals.mdx @@ -229,6 +229,33 @@ execution: When halted, remaining tests are recorded with `failureReasonCode: 'error_threshold_exceeded'`. With concurrency > 1, a few additional tests may complete before halting takes effect. +### Suite-Level Quality Threshold + +Set a minimum mean score for the eval suite. If the mean quality score falls below the threshold, the CLI exits with code 1 — useful for CI/CD quality gates. + +**CLI flag:** + +```bash +agentv eval evals/ --threshold 0.8 +``` + +**YAML config:** + +```yaml +execution: + threshold: 0.8 +``` + +The CLI `--threshold` flag overrides the YAML value. The threshold is a number between 0 and 1. Mean score is computed from quality results only (execution errors are excluded). + +When active, a summary line is printed after the eval results: + +``` +Suite score: 0.85 (threshold: 0.80) — PASS +``` + +The threshold also controls JUnit XML pass/fail: tests with scores below the threshold are marked as `` in JUnit output. When no threshold is set, JUnit defaults to 0.5. + ## Validate Before Running Check eval files for schema errors without executing: diff --git a/packages/core/src/evaluation/loaders/config-loader.ts b/packages/core/src/evaluation/loaders/config-loader.ts index 4835dcbd2..54505cddc 100644 --- a/packages/core/src/evaluation/loaders/config-loader.ts +++ b/packages/core/src/evaluation/loaders/config-loader.ts @@ -333,6 +333,32 @@ export function extractFailOnError(suite: JsonObject): FailOnError | undefined { return undefined; } +/** + * Extract `execution.threshold` from parsed eval suite. + * Accepts a number in [0, 1] range. + * Returns undefined when not specified. + */ +export function extractThreshold(suite: JsonObject): number | undefined { + const execution = suite.execution; + if (!execution || typeof execution !== 'object' || Array.isArray(execution)) { + return undefined; + } + + const executionObj = execution as Record; + const raw = executionObj.threshold; + + if (raw === undefined || raw === null) { + return undefined; + } + + if (typeof raw === 'number' && raw >= 0 && raw <= 1) { + return raw; + } + + logWarning(`Invalid execution.threshold: ${raw}. Must be a number between 0 and 1. Ignoring.`); + return undefined; +} + export function parseExecutionDefaults( raw: unknown, configPath: string, diff --git a/packages/core/src/evaluation/validation/eval-file.schema.ts b/packages/core/src/evaluation/validation/eval-file.schema.ts index 9ebf758a9..084eb3466 100644 --- a/packages/core/src/evaluation/validation/eval-file.schema.ts +++ b/packages/core/src/evaluation/validation/eval-file.schema.ts @@ -328,6 +328,7 @@ const ExecutionSchema = z.object({ totalBudgetUsd: z.number().min(0).optional(), fail_on_error: FailOnErrorSchema.optional(), failOnError: FailOnErrorSchema.optional(), + threshold: z.number().min(0).max(1).optional(), }); // --------------------------------------------------------------------------- diff --git a/packages/core/src/evaluation/yaml-parser.ts b/packages/core/src/evaluation/yaml-parser.ts index 119cadb47..e8004bbc9 100644 --- a/packages/core/src/evaluation/yaml-parser.ts +++ b/packages/core/src/evaluation/yaml-parser.ts @@ -13,6 +13,7 @@ import { extractTargetFromSuite, extractTargetsFromSuite, extractTargetsFromTestCase, + extractThreshold, extractTotalBudgetUsd, extractTrialsConfig, extractWorkersFromSuite, @@ -59,6 +60,7 @@ export { extractTargetFromSuite, extractTargetsFromSuite, extractTargetsFromTestCase, + extractThreshold, extractTrialsConfig, extractWorkersFromSuite, loadConfig, @@ -180,6 +182,8 @@ export type EvalSuiteResult = { readonly totalBudgetUsd?: number; /** Execution error tolerance: true or false */ readonly failOnError?: import('./types.js').FailOnError; + /** Suite-level quality threshold (0-1) — suite fails if mean score is below */ + readonly threshold?: number; }; /** @@ -201,6 +205,7 @@ export async function loadTestSuite( const { tests, parsed } = await loadTestsFromYaml(evalFilePath, repoRoot, options); const metadata = parseMetadata(parsed); const failOnError = extractFailOnError(parsed); + const threshold = extractThreshold(parsed); return { tests, trials: extractTrialsConfig(parsed), @@ -210,6 +215,7 @@ export async function loadTestSuite( totalBudgetUsd: extractTotalBudgetUsd(parsed), ...(metadata !== undefined && { metadata }), ...(failOnError !== undefined && { failOnError }), + ...(threshold !== undefined && { threshold }), }; } diff --git a/packages/core/test/evaluation/loaders/config-loader.test.ts b/packages/core/test/evaluation/loaders/config-loader.test.ts index 27dd52c1e..ac68e0eb9 100644 --- a/packages/core/test/evaluation/loaders/config-loader.test.ts +++ b/packages/core/test/evaluation/loaders/config-loader.test.ts @@ -5,6 +5,7 @@ import { extractTargetFromSuite, extractTargetsFromSuite, extractTargetsFromTestCase, + extractThreshold, extractTotalBudgetUsd, extractTrialsConfig, parseExecutionDefaults, @@ -302,6 +303,48 @@ describe('extractFailOnError', () => { }); }); +describe('extractThreshold', () => { + it('returns undefined when no execution block', () => { + const suite: JsonObject = { tests: [] }; + expect(extractThreshold(suite)).toBeUndefined(); + }); + + it('returns undefined when threshold not set', () => { + const suite: JsonObject = { execution: { target: 'default' } }; + expect(extractThreshold(suite)).toBeUndefined(); + }); + + it('parses valid threshold', () => { + const suite: JsonObject = { execution: { threshold: 0.8 } }; + expect(extractThreshold(suite)).toBe(0.8); + }); + + it('accepts 0 as threshold', () => { + const suite: JsonObject = { execution: { threshold: 0 } }; + expect(extractThreshold(suite)).toBe(0); + }); + + it('accepts 1 as threshold', () => { + const suite: JsonObject = { execution: { threshold: 1 } }; + expect(extractThreshold(suite)).toBe(1); + }); + + it('returns undefined for negative threshold', () => { + const suite: JsonObject = { execution: { threshold: -0.1 } }; + expect(extractThreshold(suite)).toBeUndefined(); + }); + + it('returns undefined for threshold > 1', () => { + const suite: JsonObject = { execution: { threshold: 1.5 } }; + expect(extractThreshold(suite)).toBeUndefined(); + }); + + it('returns undefined for non-number threshold', () => { + const suite: JsonObject = { execution: { threshold: 'high' } }; + expect(extractThreshold(suite)).toBeUndefined(); + }); +}); + describe('parseExecutionDefaults', () => { it('returns undefined when no execution block', () => { expect(parseExecutionDefaults(undefined, '/test/config.yaml')).toBeUndefined(); diff --git a/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md index efc818f3c..7a6f2c3f5 100644 --- a/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md @@ -520,11 +520,24 @@ execution: When halted, remaining tests get `executionStatus: 'execution_error'` with `failureReasonCode: 'error_threshold_exceeded'`. +## Suite-Level Quality Threshold + +Set a minimum mean score for the eval suite. If the mean quality score falls below the threshold, the CLI exits with code 1 — useful for CI/CD quality gates. + +```yaml +execution: + threshold: 0.8 +``` + +CLI flag `--threshold 0.8` overrides the YAML value. Must be a number between 0 and 1. Mean score is computed from quality results only (execution errors excluded). + +The threshold also controls JUnit XML pass/fail: tests with scores below the threshold are marked as ``. When no threshold is set, JUnit defaults to 0.5. + ## CLI Commands ```bash # Run evaluation (requires API keys) -agentv eval [--test-id ] [--target ] [--dry-run] +agentv eval [--test-id ] [--target ] [--dry-run] [--threshold <0-1>] # Run with OTLP JSON file (importable by OTel backends) agentv eval --otel-file traces/eval.otlp.json diff --git a/plugins/agentv-dev/skills/agentv-eval-writer/references/eval-schema.json b/plugins/agentv-dev/skills/agentv-eval-writer/references/eval-schema.json index a7be8362b..9827ee04c 100644 --- a/plugins/agentv-dev/skills/agentv-eval-writer/references/eval-schema.json +++ b/plugins/agentv-dev/skills/agentv-eval-writer/references/eval-schema.json @@ -6136,6 +6136,11 @@ }, "failOnError": { "type": "boolean" + }, + "threshold": { + "type": "number", + "minimum": 0, + "maximum": 1 } }, "additionalProperties": false @@ -12442,6 +12447,11 @@ }, "failOnError": { "type": "boolean" + }, + "threshold": { + "type": "number", + "minimum": 0, + "maximum": 1 } }, "additionalProperties": false @@ -15700,6 +15710,11 @@ }, "failOnError": { "type": "boolean" + }, + "threshold": { + "type": "number", + "minimum": 0, + "maximum": 1 } }, "additionalProperties": false