Canonical Plan
This issue body is the source of truth for implementation.
Objective
We add a run-wide budget guardrail for matrix/trial evaluations to prevent unbounded spend.
Why This Matters
- Matrix + trials scale cost multiplicatively (targets × tests × trials).
- If we bound cost only per trial, a run can accumulate unexpected total cost before it is interrupted late.
- CI and scheduled benchmark runs need a deterministic total-cost ceiling.
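To make the multiplicative scaling concrete, here is a minimal sketch of the worst-case spend when only a per-trial limit applies. The helper name `worstCaseRunCostUsd` and the numbers are illustrative assumptions, not part of the existing codebase.

```typescript
// Hypothetical helper: upper bound on run cost when each trial is capped
// individually and nothing caps the run as a whole.
function worstCaseRunCostUsd(
  targets: number,
  tests: number,
  trials: number,
  perTrialLimitUsd: number,
): number {
  return targets * tests * trials * perTrialLimitUsd;
}

// Illustrative numbers only: 4 targets, 50 tests, 5 trials, $0.50 per-trial cap.
const worstCase = worstCaseRunCostUsd(4, 50, 5, 0.5);
console.log(worstCase); // 500
```

A per-trial limit that looks small can therefore still allow a large total, which is the gap a run-wide cap is meant to close.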
Why This Is Core (Not External)
- External tooling can only detect overspend after results are written; it cannot reliably stop provider calls in-flight.
- Budget enforcement is an execution-control primitive, so it belongs in the orchestrator where dispatch decisions are made.
- We need consistent behavior across providers/targets during the run, not post-hoc filtering.
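The points above can be sketched as a budget check that runs before each dispatch. This is a sketch under stated assumptions, not the orchestrator's actual API: `RunBudget`, `Trial`, and `runTrials` are hypothetical names, and trials are modeled as synchronous functions returning their cost in USD to keep the example short (real provider calls would be asynchronous).

```typescript
// Hypothetical run-wide budget tracker. An undefined cap means "no budget".
class RunBudget {
  private spentUsd = 0;
  private readonly capUsd: number | undefined;

  constructor(capUsd?: number) {
    this.capUsd = capUsd;
  }

  // True while recorded spend is still below the cap (or no cap is set).
  canDispatch(): boolean {
    return this.capUsd === undefined || this.spentUsd < this.capUsd;
  }

  // Add the cost of a finished trial to the running total.
  record(costUsd: number): void {
    this.spentUsd += costUsd;
  }

  get totalSpentUsd(): number {
    return this.spentUsd;
  }
}

// Hypothetical trial shape: run() returns the trial's cost in USD.
type Trial = { id: string; run: () => number };

function runTrials(trials: Trial[], budget: RunBudget) {
  const completed: string[] = [];
  const skipped: string[] = [];
  for (const trial of trials) {
    // The gate sits at the dispatch decision, before any provider call.
    if (!budget.canDispatch()) {
      skipped.push(trial.id);
      continue;
    }
    budget.record(trial.run());
    completed.push(trial.id);
  }
  return { completed, skipped, budgetExceeded: skipped.length > 0 };
}

// Usage: a $1.00 cap with three trials costing $0.50 each.
const demoBudget = new RunBudget(1.0);
const demoResult = runTrials(
  [
    { id: "a", run: () => 0.5 },
    { id: "b", run: () => 0.5 },
    { id: "c", run: () => 0.5 },
  ],
  demoBudget,
);
console.log(demoResult.completed, demoResult.skipped); // [ 'a', 'b' ] [ 'c' ]
```

Because a trial's cost is only known after it finishes, a check of this kind can overshoot the cap by up to one trial's cost (more under concurrent dispatch). Whether that is acceptable is one of the breach-behavior details left open by this issue.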
Architecture Boundary
Core runtime change (execution-control primitive).
Deliverable Location
- Core implementation: packages/core/src/evaluation/
- Expected touchpoints:
  - packages/core/src/evaluation/types.ts
  - packages/core/src/evaluation/loaders/config-loader.ts
  - packages/core/src/evaluation/orchestrator.ts
- Tests: packages/core/test/evaluation/
- Docs updates: apps/web/src/content/docs/
Design Latitude
We choose schema/CLI shape and budget-breach behavior details, as long as behavior is non-breaking and explicit.
Acceptance Signals
- We support an optional suite-level total budget cap.
- We enforce budget accounting across targets/tests/trials.
- We surface budget-triggered early termination in outputs/status.
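One possible shape for the optional cap and the surfaced status is sketched below. Every name here is an assumption made for illustration: the field `total_budget_usd`, the status value `budget_exceeded`, and the helpers `loadBudgetConfig` and `summarizeRun` do not exist in the codebase, and the schema is explicitly left to the implementer.

```typescript
// Hypothetical config fragment. Omitting the field keeps current behavior,
// which is what makes the change non-breaking.
interface SuiteBudgetConfig {
  total_budget_usd?: number;
}

// Validate the raw config value explicitly instead of silently ignoring bad input.
function loadBudgetConfig(raw: { total_budget_usd?: unknown }): SuiteBudgetConfig {
  const value = raw.total_budget_usd;
  if (value === undefined || value === null) return {};
  if (typeof value !== "number" || !Number.isFinite(value) || value <= 0) {
    throw new Error(`total_budget_usd must be a positive number, got ${String(value)}`);
  }
  return { total_budget_usd: value };
}

// Hypothetical run-level status that makes early termination visible.
type RunStatus = "completed" | "budget_exceeded";

interface RunSummary {
  status: RunStatus;
  total_cost_usd: number;
  skipped_trials: number;
  total_budget_usd?: number;
}

function summarizeRun(
  config: SuiteBudgetConfig,
  totalCostUsd: number,
  skippedTrials: number,
): RunSummary {
  const summary: RunSummary = {
    // Assumption: any trial skipped for budget reasons marks the run as exceeded.
    status: skippedTrials > 0 ? "budget_exceeded" : "completed",
    total_cost_usd: totalCostUsd,
    skipped_trials: skippedTrials,
  };
  if (config.total_budget_usd !== undefined) {
    summary.total_budget_usd = config.total_budget_usd;
  }
  return summary;
}

// Usage: a $25 cap, $25.40 spent, 12 trials never dispatched.
const demoConfig = loadBudgetConfig({ total_budget_usd: 25 });
console.log(summarizeRun(demoConfig, 25.4, 12).status); // budget_exceeded
```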
Non-Goals
- We do not replace existing per-trial cost_limit_usd behavior.
- We avoid provider-specific budgeting semantics in core unless unavoidable.