feat: Pass@k Trial Strategy for LLM Non-Determinism

## Problem

LLM outputs are inherently non-deterministic: the same eval case can score differently from one run to the next. AgentV has target-level retries (for transient failures) but no eval-level trial strategy to handle stochastic outputs.

## Proposed Solution

Add a `trials` configuration option to the execution block:

```yaml
execution:
  trials:
    count: 3                    # Run 3 times
    strategy: pass_at_k         # pass_at_k | mean | confidence_interval
    cost_limit_usd: 5.00        # Auto-skip remaining trials if budget exceeded
```
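
A minimal sketch of how this block might be typed and validated after YAML parsing. The names `TrialsConfig` and `parseTrialsConfig`, and the defaults (one trial, `pass_at_k`), are assumptions for illustration and not AgentV's actual schema.

```typescript
// Hypothetical shape for the proposed `execution.trials` block.
type TrialStrategy = 'pass_at_k' | 'mean' | 'confidence_interval';

interface TrialsConfig {
  count: number;
  strategy: TrialStrategy;
  costLimitUsd?: number; // maps to `cost_limit_usd`; undefined means no limit
}

function isTrialStrategy(value: unknown): value is TrialStrategy {
  return value === 'pass_at_k' || value === 'mean' || value === 'confidence_interval';
}

// Validates the raw value produced by the YAML parser.
// Defaults (count: 1, strategy: pass_at_k) are assumptions, not part of the proposal.
function parseTrialsConfig(raw: unknown): TrialsConfig {
  if (raw === undefined || raw === null) {
    return { count: 1, strategy: 'pass_at_k' };
  }
  if (typeof raw !== 'object') {
    throw new Error('execution.trials must be a mapping');
  }
  const obj = raw as Record<string, unknown>;

  const count = obj.count === undefined ? 1 : obj.count;
  if (typeof count !== 'number' || !Number.isInteger(count) || count < 1) {
    throw new Error('execution.trials.count must be a positive integer');
  }

  const strategy = obj.strategy === undefined ? 'pass_at_k' : obj.strategy;
  if (!isTrialStrategy(strategy)) {
    throw new Error('execution.trials.strategy must be pass_at_k, mean, or confidence_interval');
  }

  let costLimitUsd: number | undefined;
  if (obj.cost_limit_usd !== undefined) {
    const limit = obj.cost_limit_usd;
    if (typeof limit !== 'number' || !(limit > 0)) {
      throw new Error('execution.trials.cost_limit_usd must be a positive number');
    }
    costLimitUsd = limit;
  }

  return { count, strategy, costLimitUsd };
}
```

Rejecting a non-integer `count` or an unknown `strategy` at parse time keeps the trial loop free of defensive checks.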

### Strategies

| Strategy | Result Calculation | When to Use |
|----------|-------------------|-------------|
| **`pass_at_k`** | `result.passed = trials.some(t => t.verdict === 'pass')` | Binary pass/fail, tolerant of occasional failures |
| **`mean`** | `result.score = mean(trials.map(t => t.score))` | Continuous scores, averages out variance |
| **`confidence_interval`** | `result = { score: mean, ci95: [low, high] }` | Statistical rigor, reports uncertainty |
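
The two simpler rows of the table can be sketched as follows. The `TrialResult` type and the function names `passAtK` and `meanScore` are hypothetical; the verdict values are the ones used elsewhere in this issue.

```typescript
type Verdict = 'pass' | 'borderline' | 'fail';

// Hypothetical per-trial record, mirroring the fields in the proposed output.
interface TrialResult {
  attempt: number;
  score: number;
  verdict: Verdict;
}

// pass_at_k: the case passes if at least one trial has verdict 'pass'.
function passAtK(trials: TrialResult[]): {
  passed: boolean;
  passedAttempts: number;
  totalAttempts: number;
} {
  const passedAttempts = trials.filter((t) => t.verdict === 'pass').length;
  return { passed: passedAttempts > 0, passedAttempts, totalAttempts: trials.length };
}

// mean: arithmetic mean of the trial scores.
function meanScore(trials: TrialResult[]): number {
  if (trials.length === 0) {
    throw new Error('at least one trial is required');
  }
  return trials.reduce((sum, t) => sum + t.score, 0) / trials.length;
}
```

Note that `pass_at_k` counts only `pass` verdicts, so a run of `borderline` trials with no `pass` does not pass under this sketch.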

### Output Format

**Current output** (single trial):
```json
{
  "eval_id": "case-1",
  "score": 0.8,
  "verdict": "pass"
}
```

**With trials** (pass@k example):
```json
{
  "eval_id": "case-1",
  "score": 0.8,
  "verdict": "pass",
  "trials": [
    {"attempt": 0, "score": 0.6, "verdict": "borderline"},
    {"attempt": 1, "score": 0.9, "verdict": "pass"},
    {"attempt": 2, "score": 0.7, "verdict": "borderline"}
  ],
  "aggregation": {
    "strategy": "pass_at_k",
    "passed_attempts": 1,
    "total_attempts": 3
  }
}
```

## Implementation Notes

### Where to Look

- **Orchestrator**: `packages/core/src/evaluation/orchestrator.ts` — `runEvalCase` function
- **YAML parser**: `packages/core/src/evaluation/yaml-parser.ts` — add `trials` schema
- **Result types**: `packages/core/src/evaluation/types.ts` — extend `EvaluationResult`

### Key Changes

1. **Schema**: Add `trials` to `ExecutionConfig` in `yaml-parser.ts`
2. **Loop**: Wrap `runEvalCase` in trial loop in `orchestrator.ts`
3. **Aggregation**: Add function to compute strategy-specific results
4. **Cost tracking**: Cumulative spend check before each trial
5. **Output**: Extend `EvaluationResult` with `trials` array
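
Steps 2 and 4 above can be sketched as a loop around a single-run callback. Here `runOnce` stands in for one `runEvalCase` call; its signature, the `TrialOutcome` and `TrialRun` types, and the `runTrials` name are assumptions, since the real orchestrator types are not shown in this issue.

```typescript
// Hypothetical result of one trial; `costUsd` is present only if the provider reports it.
interface TrialOutcome {
  score: number;
  verdict: 'pass' | 'borderline' | 'fail';
  costUsd?: number;
}

interface TrialRun {
  trials: Array<TrialOutcome & { attempt: number }>;
  costLimited: boolean; // true when at least one trial was skipped for budget reasons
}

async function runTrials(
  runOnce: () => Promise<TrialOutcome>,
  count: number,
  costLimitUsd?: number,
): Promise<TrialRun> {
  const trials: TrialRun['trials'] = [];
  let cumulativeCost = 0;
  let warned = false;

  for (let attempt = 0; attempt < count; attempt++) {
    // Cumulative spend check before each trial.
    if (costLimitUsd !== undefined && cumulativeCost >= costLimitUsd) {
      return { trials, costLimited: true };
    }

    const outcome = await runOnce();
    trials.push({ attempt, ...outcome });

    if (outcome.costUsd !== undefined) {
      cumulativeCost += outcome.costUsd;
    } else if (costLimitUsd !== undefined && !warned) {
      // Warn once rather than on every trial.
      console.warn('Provider does not report costUsd; cost_limit_usd cannot be enforced');
      warned = true;
    }
  }
  return { trials, costLimited: false };
}
```

Because the check runs before each trial, the trial that crosses the limit still completes and is included in the results; only later trials are skipped.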

### Cost Handling

**Industry pattern**: Cost is optional. Frameworks (DeepEval, LangWatch, RunLedger) do not calculate cost from token usage; they report it only when the provider supplies it.

For `cost_limit_usd`:
- **If provider reports `costUsd`**: Track cumulative spend, skip remaining trials when limit exceeded
- **If provider does NOT report `costUsd`**: Warn user, continue trials (cost limit cannot be enforced)

```typescript
// `costUsd` comes from the trial's ProviderResponse; it is undefined when the provider does not report cost.
if (costUsd !== undefined) {
  cumulativeCost += costUsd;
  if (cumulativeCost >= costLimit) {
    // Skip remaining trials
    return { status: 'cost_limited', /* ... */ };
  }
} else {
  console.warn('Provider does not report costUsd; cost_limit_usd cannot be enforced');
}
```

**Note**: AgentV does not calculate cost from token usage. Providers (CLI, built-in) are responsible for reporting `costUsd` in `ProviderResponse`.

### Edge Cases

- **Cost limit exceeded**: Skip remaining trials, mark result as `cost_limited`
- **All trials fail**: `pass_at_k` should return `verdict: "fail"` (not `"borderline"`)
- **Single trial**: If `count: 1`, match current behavior (no aggregation)
- **Confidence interval**: Use t-distribution for small samples (n < 30)
- **Provider doesn't report cost**: Warn and continue (cost limit unenforceable)
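
The confidence-interval case can be sketched as below. It uses the sample standard deviation with a two-sided 95% Student's t critical value for n < 30 and the normal approximation (1.96) otherwise. The function name `confidenceInterval95` and the hard-coded lookup table are assumptions for illustration; a statistics library could supply the critical values instead.

```typescript
// Two-sided 95% critical values of Student's t for 1 to 28 degrees of freedom.
const T_CRITICAL_95: number[] = [
  12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
  2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
  2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048,
];

function confidenceInterval95(scores: number[]): { score: number; ci95: [number, number] } {
  const n = scores.length;
  if (n < 2) {
    throw new Error('a confidence interval needs at least two trials');
  }
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  // Sample variance (divides by n - 1).
  const variance = scores.reduce((a, s) => a + (s - mean) ** 2, 0) / (n - 1);
  const stdErr = Math.sqrt(variance / n);
  // n - 1 degrees of freedom; the table is indexed from df = 1.
  const critical = n < 30 ? T_CRITICAL_95[n - 2] : 1.96;
  const margin = critical * stdErr;
  return { score: mean, ci95: [mean - margin, mean + margin] };
}
```

With only three trials the interval is wide: scores of 0.6, 0.9, and 0.7 give a mean of about 0.733 and an interval of roughly 0.35 to 1.11. The sketch does not clamp the bounds to the score range, and a `count` of 1 has no defined interval, so it throws.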

## Evidence

- **nem035/agentevals**: Production-proven implementation with `--trials` flag
- **OpenCode-Bench**: Uses three isolated episodes for statistical reliability

## Effort Estimate

~1 week

## References

- Research: [Pass@k Trial Strategy](https://github.com/agentevals/research/blob/main/research/agentv/README.md#gap-1-passk-trial-strategy)
- nem035/agentevals: https://github.com/nem035/agentevals
- OpenCode-Bench: https://github.com/OpenCode-Bench/OpenCode-Bench

🤖 Generated from [AgentEvals Research](https://github.com/agentevals/research)
