feat: Pass@k Trial Strategy for LLM Non-Determinism #214

@christso

Problem

LLM outputs are inherently non-deterministic: repeated runs of the same eval case can produce different scores. AgentV has target-level retries (for transient failures) but no eval-level trial strategy to handle stochastic outputs.

Proposed Solution

Add a trials configuration option to the execution block:

execution:
  trials:
    count: 3                    # Run 3 times
    strategy: pass_at_k         # pass_at_k | mean | confidence_interval
    cost_limit_usd: 5.00        # Auto-skip remaining trials if budget exceeded
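
A minimal sketch of how the trials block might be validated once the YAML has been parsed into a plain object. The names TrialsConfig and parseTrialsConfig are hypothetical, and the defaults (count of 1, strategy pass_at_k when omitted) are assumptions that this issue does not specify.

```typescript
// Hypothetical types and helper for illustration; not AgentV's actual API.
type TrialStrategy = 'pass_at_k' | 'mean' | 'confidence_interval';

interface TrialsConfig {
  count: number;
  strategy: TrialStrategy;
  costLimitUsd?: number;
}

const STRATEGIES: readonly string[] = ['pass_at_k', 'mean', 'confidence_interval'];

function parseTrialsConfig(raw: unknown): TrialsConfig {
  // Assumption: a missing trials block means a single trial (current behavior).
  if (raw === undefined || raw === null) {
    return { count: 1, strategy: 'pass_at_k' };
  }
  if (typeof raw !== 'object') {
    throw new Error('execution.trials must be a mapping');
  }
  const obj = raw as Record<string, unknown>;

  const count = obj.count ?? 1;
  if (typeof count !== 'number' || !Number.isInteger(count) || count < 1) {
    throw new Error('execution.trials.count must be a positive integer');
  }

  // Assumption: default strategy when none is given.
  const strategy = obj.strategy ?? 'pass_at_k';
  if (typeof strategy !== 'string' || !STRATEGIES.includes(strategy)) {
    throw new Error(`execution.trials.strategy must be one of: ${STRATEGIES.join(', ')}`);
  }

  const costLimit = obj.cost_limit_usd;
  if (costLimit !== undefined && (typeof costLimit !== 'number' || costLimit <= 0)) {
    throw new Error('execution.trials.cost_limit_usd must be a positive number');
  }

  return {
    count,
    strategy: strategy as TrialStrategy,
    costLimitUsd: costLimit as number | undefined,
  };
}
```

Rejecting invalid values at parse time keeps the trial loop free of defensive checks; whether AgentV's existing parser throws or collects errors is not stated here.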

Strategies

| Strategy | Result Calculation | When to Use |
|---|---|---|
| pass_at_k | result.passed = trials.some(t => t.verdict === 'pass') | Binary pass/fail, tolerant of occasional failures |
| mean | result.score = mean(trials.map(t => t.score)) | Continuous scores, averages out variance |
| confidence_interval | result = { score: mean, ci95: [low, high] } | Statistical rigor, reports uncertainty |
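
The pass_at_k and mean rows above could be sketched as a single aggregation function. The type names and the aggregateTrials helper are hypothetical; the sketch deliberately leaves the top-level score for pass_at_k undefined, since this issue does not say how it should be derived.

```typescript
// Hypothetical shapes for illustration; the real types in types.ts may differ.
type Verdict = 'pass' | 'borderline' | 'fail';

interface TrialResult {
  attempt: number;
  score: number;
  verdict: Verdict;
}

type Aggregated =
  | { strategy: 'pass_at_k'; passed: boolean; passedAttempts: number; totalAttempts: number }
  | { strategy: 'mean'; score: number; totalAttempts: number };

function aggregateTrials(trials: TrialResult[], strategy: 'pass_at_k' | 'mean'): Aggregated {
  if (trials.length === 0) {
    throw new Error('aggregateTrials requires at least one trial');
  }
  if (strategy === 'pass_at_k') {
    // Passed if at least one trial has verdict 'pass'.
    const passedAttempts = trials.filter((t) => t.verdict === 'pass').length;
    return {
      strategy,
      passed: passedAttempts > 0,
      passedAttempts,
      totalAttempts: trials.length,
    };
  }
  // mean: arithmetic average of the per-trial scores.
  const total = trials.reduce((sum, t) => sum + t.score, 0);
  return { strategy, score: total / trials.length, totalAttempts: trials.length };
}
```

A discriminated union on strategy lets callers narrow the result type without optional fields; that is a design choice of this sketch, not a requirement of the proposal.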

Output Format

Current output (single trial):

{
  "eval_id": "case-1",
  "score": 0.8,
  "verdict": "pass"
}

With trials (pass@k example):

{
  "eval_id": "case-1",
  "score": 0.8,
  "verdict": "pass",
  "trials": [
    {"attempt": 0, "score": 0.6, "verdict": "borderline"},
    {"attempt": 1, "score": 0.9, "verdict": "pass"},
    {"attempt": 2, "score": 0.7, "verdict": "borderline"}
  ],
  "aggregation": {
    "strategy": "pass_at_k",
    "passed_attempts": 1,
    "total_attempts": 3
  }
}

Implementation Notes

Where to Look

  • Orchestrator: packages/core/src/evaluation/orchestrator.ts, runEvalCase function
  • YAML parser: packages/core/src/evaluation/yaml-parser.ts — add trials schema
  • Result types: packages/core/src/evaluation/types.ts — extend EvaluationResult

Key Changes

  1. Schema: Add trials to ExecutionConfig in yaml-parser.ts
  2. Loop: Wrap runEvalCase in trial loop in orchestrator.ts
  3. Aggregation: Add function to compute strategy-specific results
  4. Cost tracking: Cumulative spend check before each trial
  5. Output: Extend EvaluationResult with trials array

Cost Handling

Industry pattern: cost is optional. Frameworks (DeepEval, LangWatch, RunLedger) do not calculate cost from token usage; they report it only when the provider supplies it.

For cost_limit_usd:

  • If the provider reports costUsd: track cumulative spend and skip remaining trials once the limit is reached
  • If the provider does NOT report costUsd: warn the user and continue trials (the cost limit cannot be enforced)

if (costUsd !== undefined) {
  // Provider reported a cost for this trial: add it to the running total
  cumulativeCost += costUsd;
  if (cumulativeCost >= costLimit) {
    // Skip remaining trials
    return { status: 'cost_limited', ... };
  }
} else {
  // No cost reported: the limit cannot be enforced, so warn and keep going
  console.warn('Provider does not report costUsd; cost_limit_usd cannot be enforced');
}

Note: AgentV does not calculate cost from token usage. Providers (CLI, built-in) are responsible for reporting costUsd in ProviderResponse.
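
To show how the cost check might sit inside the trial loop, here is a sketch that takes the single-trial runner as a parameter. runTrials, TrialOutcome, and TrialRunSummary are hypothetical names; in AgentV the runOnce callback would presumably wrap runEvalCase, but that wiring is an assumption.

```typescript
// Hypothetical shapes for illustration; AgentV's real types may differ.
interface TrialOutcome {
  score: number;
  verdict: 'pass' | 'borderline' | 'fail';
  costUsd?: number; // present only when the provider reports it
}

interface TrialRunSummary {
  trials: TrialOutcome[];
  costLimited: boolean;
  totalCostUsd: number;
}

async function runTrials(
  runOnce: (attempt: number) => Promise<TrialOutcome>,
  count: number,
  costLimitUsd?: number,
  warn: (message: string) => void = console.warn,
): Promise<TrialRunSummary> {
  const trials: TrialOutcome[] = [];
  let totalCostUsd = 0;
  let warned = false;

  for (let attempt = 0; attempt < count; attempt++) {
    // Check cumulative spend before starting each trial.
    if (costLimitUsd !== undefined && totalCostUsd >= costLimitUsd) {
      return { trials, costLimited: true, totalCostUsd };
    }
    const outcome = await runOnce(attempt);
    trials.push(outcome);
    if (outcome.costUsd !== undefined) {
      totalCostUsd += outcome.costUsd;
    } else if (costLimitUsd !== undefined && !warned) {
      // Warn once rather than on every trial.
      warn('Provider does not report costUsd; cost_limit_usd cannot be enforced');
      warned = true;
    }
  }
  return { trials, costLimited: false, totalCostUsd };
}
```

In this sketch a run is marked cost-limited only when at least one trial was actually skipped; if the final trial pushes spend over the limit, nothing remains to skip and costLimited stays false.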

Edge Cases

  • Cost limit exceeded: Skip remaining trials, mark result as cost_limited
  • All trials fail: pass_at_k should return verdict: "fail" (not "borderline")
  • Single trial: If count: 1, preserve current behavior (no aggregation)
  • Confidence interval: Use t-distribution for small samples (n < 30)
  • Provider doesn't report cost: Warn and continue (cost limit unenforceable)
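
For the confidence-interval edge case, a 95% interval based on the t-distribution could be computed as below. The function name confidenceInterval95 is hypothetical, the critical values are the standard two-sided 95% table entries for 1 to 29 degrees of freedom, and falling back to 1.96 for larger samples is an assumption of this sketch.

```typescript
// Two-sided 95% critical values of Student's t for df = 1..29.
const T_CRIT_95: number[] = [
  12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
  2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
  2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045,
];

function confidenceInterval95(scores: number[]): { score: number; ci95: [number, number] } {
  const n = scores.length;
  if (n < 2) {
    throw new Error('confidence_interval needs at least two trials');
  }
  const mean = scores.reduce((sum, v) => sum + v, 0) / n;
  // Sample variance with Bessel's correction (divide by n - 1).
  const variance = scores.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / (n - 1);
  const stdErr = Math.sqrt(variance / n);
  const df = n - 1;
  // Assumption: use the normal approximation once df exceeds the table.
  const t = df <= T_CRIT_95.length ? T_CRIT_95[df - 1] : 1.96;
  const margin = t * stdErr;
  return { score: mean, ci95: [mean - margin, mean + margin] };
}
```

With only three trials the interval is wide and can extend past the valid score range; this sketch does not clamp the bounds, and whether to clamp is left open.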

Evidence

  • nem035/agentevals: Production-proven implementation with --trials flag
  • OpenCode-Bench: Uses three isolated episodes for statistical reliability

Effort Estimate

~1 week

References

🤖 Generated from AgentEvals Research
