feat(eval): suite-level total cost budget guardrail (core runtime) #369

@christso

Canonical Plan

This issue body is the source of truth for implementation.

Objective

We add a run-wide budget guardrail for matrix/trial evaluations to prevent unbounded spend.

Why This Matters

  • Matrix + trials scale cost multiplicatively (targets × tests × trials).
  • If we only bound cost per trial, a run may not be stopped until late, after it has already accumulated unexpected total cost.
  • CI and scheduled benchmark runs need a deterministic total-cost ceiling.
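
To make the multiplicative scaling concrete, here is a minimal sketch of a worst-case spend estimate. The `RunShape` type and `worstCaseSpendUsd` helper are hypothetical names used for illustration only; `perTrialCapUsd` stands in for the existing per-trial `cost_limit_usd`.

```typescript
// Hypothetical helper for illustration; not part of the codebase.
interface RunShape {
  targets: number;
  tests: number;
  trials: number;
  perTrialCapUsd: number; // stands in for the existing per-trial cost_limit_usd
}

// Upper bound on spend when every trial hits its own per-trial cap.
function worstCaseSpendUsd(shape: RunShape): number {
  const totalTrials = shape.targets * shape.tests * shape.trials;
  return totalTrials * shape.perTrialCapUsd;
}

const estimate = worstCaseSpendUsd({ targets: 3, tests: 50, trials: 5, perTrialCapUsd: 0.5 });
console.log(estimate); // 375 (750 trials at up to $0.50 each)
```

Even a modest per-trial cap compounds quickly, which is why a per-trial bound alone does not give a predictable run-wide ceiling.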

Why This Is Core (Not External)

  • External tooling can only detect overspend after results are written; it cannot reliably stop provider calls in-flight.
  • Budget enforcement is an execution-control primitive, so it belongs in the orchestrator where dispatch decisions are made.
  • We need consistent behavior across providers/targets during the run, not post-hoc filtering.
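
The dispatch-time gate described above could look roughly like the sketch below. `SuiteBudget` and `simulateRun` are hypothetical names, not the actual orchestrator API, and the loop is a synchronous simulation over known trial costs rather than real provider calls.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not the real orchestrator.
class SuiteBudget {
  private capUsd: number | undefined;
  private spentUsd = 0;

  constructor(capUsd?: number) {
    this.capUsd = capUsd;
  }

  // No cap configured means dispatch is always allowed.
  canDispatch(): boolean {
    return this.capUsd === undefined || this.spentUsd < this.capUsd;
  }

  record(costUsd: number): void {
    this.spentUsd += costUsd;
  }

  get spent(): number {
    return this.spentUsd;
  }
}

function simulateRun(trialCostsUsd: number[], capUsd?: number) {
  const budget = new SuiteBudget(capUsd);
  let dispatched = 0;
  for (const cost of trialCostsUsd) {
    // Gate before dispatch: once the cap is reached, no further trials start.
    if (!budget.canDispatch()) break;
    budget.record(cost); // in a real run this would follow the awaited provider call
    dispatched++;
  }
  return {
    dispatched,
    skipped: trialCostsUsd.length - dispatched,
    spentUsd: budget.spent,
    budgetExceeded: dispatched < trialCostsUsd.length,
  };
}

console.log(simulateRun([1, 1, 1, 1], 2));
// { dispatched: 2, skipped: 2, spentUsd: 2, budgetExceeded: true }
```

Because cost is only known after a trial completes, a gate like this can overshoot the cap by the cost of whatever was already in flight. With concurrent dispatch, that overshoot grows with the number of in-flight trials.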

Architecture Boundary

Core runtime change (execution-control primitive).

Deliverable Location

  • Core implementation: packages/core/src/evaluation/
  • Expected touchpoints:
    • packages/core/src/evaluation/types.ts
    • packages/core/src/evaluation/loaders/config-loader.ts
    • packages/core/src/evaluation/orchestrator.ts
  • Tests: packages/core/test/evaluation/
  • Docs updates: apps/web/src/content/docs/

Design Latitude

The implementer may choose the schema/CLI shape and the details of budget-breach behavior, as long as the behavior is non-breaking and explicit.

Acceptance Signals

  • We support an optional suite-level total budget cap.
  • We enforce budget accounting across targets/tests/trials.
  • We surface budget-triggered early termination in outputs/status.
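
One way the optional cap might be loaded without breaking existing configs is sketched below. The field name `total_budget_usd` and the `parseSuiteBudget` function are assumptions for illustration; the issue deliberately leaves the schema shape open.

```typescript
// Hypothetical field name and loader; the real schema is left to the implementer.
interface SuiteBudgetConfig {
  totalBudgetUsd?: number;
}

function parseSuiteBudget(raw: Record<string, unknown>): SuiteBudgetConfig {
  const value = raw["total_budget_usd"];
  // Absent field: no cap, so existing configs behave exactly as before.
  if (value === undefined) return {};
  // Present but invalid: fail loudly rather than silently running uncapped.
  if (typeof value !== "number" || !Number.isFinite(value) || value <= 0) {
    throw new Error(`total_budget_usd must be a positive finite number, got ${String(value)}`);
  }
  return { totalBudgetUsd: value };
}

console.log(parseSuiteBudget({}));                      // {}
console.log(parseSuiteBudget({ total_budget_usd: 5 })); // { totalBudgetUsd: 5 }
```

Rejecting invalid values at load time keeps the behavior explicit: a typo in the budget cannot quietly disable the guardrail.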

Non-Goals

  • We do not replace existing per-trial cost_limit_usd behavior.
  • We avoid provider-specific budgeting semantics in core unless unavoidable.
