feat(cli): --threshold flag for suite-level quality gates#760

Merged: christso merged 11 commits into main from feat/threshold-flag on Mar 25, 2026

@christso
Collaborator

Summary

  • Add --threshold <0-1> CLI flag to agentv eval that exits with code 1 if mean quality score falls below the threshold
  • Add execution.threshold YAML config field (CLI flag takes precedence)
  • JUnit XML writer uses the resolved threshold for per-test pass/fail (defaults to 0.5 when unset)
  • Execution errors are excluded from the mean score computation and do not affect the threshold gate
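
As a rough illustration of the gate described in the summary, here is a minimal sketch. The `EvalResult` shape, the `checkThreshold` name, and the behaviour when no test produced a score are assumptions for illustration only; they are not taken from the agentv code.

```typescript
// Hypothetical result shape; the real agentv result type is not shown in this PR.
interface EvalResult {
  score: number; // quality score in [0, 1]
  error?: string; // set when the test hit an execution error
}

interface GateOutcome {
  mean: number | undefined; // undefined when no test produced a score
  passed: boolean;
}

// Average only the results that ran without an execution error,
// then compare the mean against the threshold.
function checkThreshold(results: EvalResult[], threshold: number): GateOutcome {
  const scored = results.filter((r) => r.error === undefined);
  if (scored.length === 0) {
    // Assumption: with nothing to average, the gate does not fail.
    return { mean: undefined, passed: true };
  }
  const mean = scored.reduce((sum, r) => sum + r.score, 0) / scored.length;
  // The gate fails only when the mean falls below the threshold.
  return { mean, passed: mean >= threshold };
}
```

With scores of 1 and 0.5 plus one errored test, the mean is 0.75, so a threshold of 0.7 passes and a threshold of 0.8 fails; the errored test does not pull the mean down.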

Closes #698

Changes

Core (packages/core/):

  • extractThreshold in config-loader (follows extractFailOnError pattern)
  • threshold field in ExecutionSchema (Zod validation: 0-1)
  • Wired through yaml-parser.ts into EvalSuiteResult
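
For readers unfamiliar with the extractor pattern, the sketch below shows one way a threshold could be pulled from a parsed suite and range-checked. It is a plain TypeScript stand-in written without Zod; the signature and the choice to throw on invalid values are assumptions, and the real `extractThreshold` may behave differently (for example, by warning and returning `undefined`).

```typescript
// Hypothetical stand-in for extractThreshold; the real signature is not shown in this PR.
function extractThreshold(suite: unknown): number | undefined {
  if (typeof suite !== "object" || suite === null) return undefined;
  const execution = (suite as Record<string, unknown>).execution;
  if (typeof execution !== "object" || execution === null) return undefined;

  const raw = (execution as Record<string, unknown>).threshold;
  if (raw === undefined) return undefined; // field not set in the YAML

  // Mirror the documented 0-1 range check.
  if (typeof raw !== "number" || Number.isNaN(raw) || raw < 0 || raw > 1) {
    // Assumption: invalid values are rejected with an error.
    throw new Error(
      `execution.threshold must be a number between 0 and 1, got ${String(raw)}`,
    );
  }
  return raw;
}
```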

CLI (apps/cli/):

  • --threshold option in run.ts command definition
  • Threshold resolution (CLI > YAML) and range validation in run-eval.ts
  • formatThresholdSummary() in statistics.ts
  • JUnit writer accepts configurable threshold via JunitWriterOptions
  • process.exit(1) when threshold fails (avoids cmd-ts resetting process.exitCode)
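
The resolution order and the JUnit default can be sketched as follows. The function names and the `>=` comparison for per-test pass/fail are assumptions; only the CLI-over-YAML precedence, the 0-1 range, and the 0.5 default come from this PR description.

```typescript
// Default used by the JUnit writer when no threshold is resolved (per the PR summary).
const DEFAULT_JUNIT_THRESHOLD = 0.5;

// Hypothetical helper: the CLI value wins over the YAML value.
// `??` is used so that an explicit CLI value of 0 still takes precedence.
function resolveThreshold(
  cliValue: number | undefined,
  yamlValue: number | undefined,
): number | undefined {
  const resolved = cliValue ?? yamlValue;
  if (
    resolved !== undefined &&
    (Number.isNaN(resolved) || resolved < 0 || resolved > 1)
  ) {
    throw new Error(`--threshold must be between 0 and 1, got ${resolved}`);
  }
  return resolved;
}

// Hypothetical per-test check for the JUnit writer.
// Assumption: a score equal to the threshold counts as a pass.
function junitTestPassed(score: number, threshold?: number): boolean {
  return score >= (threshold ?? DEFAULT_JUNIT_THRESHOLD);
}
```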

Docs & Skills:

  • running-evals.mdx: Suite-Level Quality Threshold section
  • eval-files.mdx: threshold in execution field table
  • SKILL.md: threshold section + CLI flag in command reference

Test Plan

  • Unit tests: extractThreshold (8 cases), formatThresholdSummary (4 cases), JUnit threshold (2 cases)
  • All 1562 tests pass, lint/typecheck/build clean, 48/48 examples validate
  • Manual UAT: no-threshold (exit 0), threshold PASS (exit 0), threshold FAIL (exit 1)

🤖 Generated with Claude Code

christso and others added 11 commits March 25, 2026 02:14
Design document for suite-level quality gate threshold flag
that fails CI when mean eval score drops below a specified value.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 1604663b3709
8-task TDD plan covering core extractor, YAML schema, CLI flag,
threshold check, JUnit integration, and manual UAT.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 6cfbff7718f7
PR #757 moved content from CLAUDE.md to AGENTS.md but accidentally
dropped several sections: Evaluator Type System, Git Workflow (issue
claiming, PRs, worktrees), Version Management, Package Publishing,
and Python Scripts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 1ed266d094ed
…698)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add threshold to ExecutionSchema in Zod, wire extractThreshold through
yaml-parser.ts (import, re-export, EvalSuiteResult type, loadTestSuite),
and regenerate eval-schema.json.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
process.exitCode was being reset by the cmd-ts handler wrapper.
Return thresholdFailed from runEvalCommand and call process.exit(1)
in the handler instead.
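
The pattern in this commit message can be sketched as below: the command returns a flag, and the caller decides the exit code. The `EvalRunOutcome` type, `finishCommand`, and the injected `exit` callback are illustrative assumptions; in the real handler the callback would presumably be `process.exit`.

```typescript
// Hypothetical outcome type; only the thresholdFailed flag is named in the commit message.
interface EvalRunOutcome {
  thresholdFailed: boolean;
}

// Map the outcome to an exit code instead of mutating process.exitCode,
// which the commit message says was being reset by the handler wrapper.
function exitCodeFor(outcome: EvalRunOutcome): 0 | 1 {
  return outcome.thresholdFailed ? 1 : 0;
}

// The exit function is injected so the sketch can be exercised without
// terminating the process.
function finishCommand(
  outcome: EvalRunOutcome,
  exit: (code: number) => void,
): void {
  const code = exitCodeFor(outcome);
  if (code !== 0) {
    exit(code);
  }
}

// In a handler this might be used as (assumption):
//   finishCommand(await runEvalCommand(args), (code) => process.exit(code));
```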

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add CLI range validation (0-1) for --threshold flag
- Document threshold in running-evals.mdx, eval-files.mdx, and SKILL.md
- Remove temporary plan files before merge

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cloudflare-workers-and-pages

Deploying agentv with Cloudflare Pages

Latest commit: 5ea4e4b
Status: ✅  Deploy successful!
Preview URL: https://da03569e.agentv.pages.dev
Branch Preview URL: https://feat-threshold-flag.agentv.pages.dev

@christso christso merged commit cfa1402 into main Mar 25, 2026
1 check passed
@christso christso deleted the feat/threshold-flag branch March 25, 2026 03:06
