Summary
Add threshold as a new aggregation strategy for the composite evaluator, implementing "at least N% of child evaluators must pass" logic.
Motivation
promptfoo's assert-set groups assertions with a threshold — e.g., "at least 50% of these assertions must pass for the test case to pass." AgentEvals' composite evaluator currently supports weighted_average, code_judge, and llm_judge aggregation but lacks this mode. This gap was surfaced during the promptfoo integration assessment.
Research reference: integration-assessment-promptfoo-braintrust.md
Proposed EVAL.yaml Syntax
evaluators:
  - type: composite
    name: quality_gate
    aggregation: threshold
    threshold: 0.5  # At least 50% of child evaluators must pass
    evaluators:
      - type: field_accuracy
        mode: contains
        value: "refund"
      - type: execution_metrics
        latency_ms: 5000
      - type: llm_judge
        prompt: "Is the response professional and helpful?"
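To make the shape of this config concrete, here is a minimal Python sketch that validates a composite entry using threshold aggregation. It is an illustration only: the function name validate_threshold_config, the dict mirror of the YAML, and the validation rules (threshold in the range 0.0 to 1.0, at least one child evaluator) are assumptions, not part of the proposal or of AgentEvals' actual code.

```python
from typing import Any

# Python-dict equivalent of the proposed YAML composite entry.
QUALITY_GATE: dict[str, Any] = {
    "type": "composite",
    "name": "quality_gate",
    "aggregation": "threshold",
    "threshold": 0.5,
    "evaluators": [
        {"type": "field_accuracy", "mode": "contains", "value": "refund"},
        {"type": "execution_metrics", "latency_ms": 5000},
        {"type": "llm_judge", "prompt": "Is the response professional and helpful?"},
    ],
}


def validate_threshold_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the entry looks valid.

    Assumed rules (not specified in the proposal): threshold is a number in
    [0.0, 1.0] and at least one child evaluator is present.
    """
    errors: list[str] = []
    if config.get("aggregation") != "threshold":
        errors.append("aggregation must be 'threshold'")
    threshold = config.get("threshold")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        errors.append("threshold must be a number")
    elif not 0.0 <= threshold <= 1.0:
        errors.append("threshold must be between 0.0 and 1.0")
    children = config.get("evaluators")
    if not isinstance(children, list) or not children:
        errors.append("evaluators must be a non-empty list")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in an EVAL.yaml entry at once. Whether the real implementation should do this is an open design choice.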
Behavior
- Each child evaluator produces a verdict (pass/borderline/fail)
- threshold counts the proportion of passing children: pass_ratio = passing_count / total_count
- Composite passes if pass_ratio >= threshold
- Composite score is the pass_ratio value (0.0 to 1.0)
- Details field lists each child's verdict
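The behavior listed above can be sketched in a few lines of Python. This is a hedged illustration, not the AgentEvals implementation: the names CompositeResult and aggregate_threshold are hypothetical. The proposal does not say how a borderline verdict is counted, so the sketch assumes only an exact "pass" counts as passing. It also assumes that an empty child list is an error.

```python
from dataclasses import dataclass


@dataclass
class CompositeResult:
    passed: bool        # True when pass_ratio >= threshold
    score: float        # the pass_ratio value, 0.0 to 1.0
    details: list[str]  # one "name: verdict" entry per child


def aggregate_threshold(verdicts: dict[str, str], threshold: float) -> CompositeResult:
    """Aggregate child verdicts ("pass", "borderline", "fail") by pass ratio.

    Assumption: "borderline" does not count as passing.
    """
    if not verdicts:
        # Assumption: zero children is a configuration error, not a pass.
        raise ValueError("threshold aggregation needs at least one child verdict")
    passing_count = sum(1 for v in verdicts.values() if v == "pass")
    pass_ratio = passing_count / len(verdicts)
    details = [f"{name}: {verdict}" for name, verdict in verdicts.items()]
    return CompositeResult(
        passed=pass_ratio >= threshold,
        score=pass_ratio,
        details=details,
    )
```

With the quality_gate example and threshold 0.5, two passing children out of three give a pass_ratio of about 0.67, so the composite passes. With one pass, one borderline, and one fail, the ratio is about 0.33 and the composite fails under the assumption above.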
Relation to Existing Features
This complements the existing aggregation strategies:
- weighted_average: continuous score blending
- code_judge: custom code decides final verdict
- llm_judge: LLM decides final verdict
- threshold (new): N% of children must pass
See also #235 (assert-set evaluator, closed) — this proposal refines the approach as a composite aggregation strategy rather than a separate evaluator type.
Acceptance Criteria
- composite evaluator accepts aggregation: threshold with threshold: float
Effort Estimate
1-2 days