diff --git a/AGENTS.md b/AGENTS.md
index 2b66d37665..75049fef30 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -118,22 +118,70 @@ Many packages use `.js` files with JSDoc `@typedef` for type definitions (e.g.,
 ## AI Eval Suite
 
-The `evals/` directory contains a Promptfoo-based evaluation suite for validating AI tool call quality.
+The `evals/` directory contains a Promptfoo-based evaluation suite with three levels of evaluation.
+
+### Level 1: Deterministic Evals (tool selection + argument accuracy)
 
 | Command | What it does | Cost |
 |---------|-------------|------|
 | `pnpm --filter @superdoc-testing/evals run eval` | Run deterministic evals (reading + argument tests) | ~$0.30 |
 | `pnpm --filter @superdoc-testing/evals run eval:reading` | Run reading tool tests only | ~$0.15 |
-| `pnpm --filter @superdoc-testing/evals run eval:gdpval` | Run GDPval benchmark (Model+SuperDoc vs Model-Only) | ~$1-2 |
 | `pnpm --filter @superdoc-testing/evals run eval:view` | Open Promptfoo web UI with results | Free |
 | `pnpm --filter @superdoc-testing/evals run baseline:save