feat: threshold aggregation strategy for composite evaluator #274

@christso

Summary

Add threshold as a new aggregation strategy for the composite evaluator, implementing "at least N% of child evaluators must pass" logic.

Motivation

promptfoo's assert-set groups assertions with a threshold — e.g., "at least 50% of these assertions must pass for the test case to pass." AgentEvals' composite evaluator currently supports weighted_average, code_judge, and llm_judge aggregation but lacks this mode. This gap was surfaced during the promptfoo integration assessment.

Research reference: integration-assessment-promptfoo-braintrust.md

Proposed EVAL.yaml Syntax

evaluators:
  - type: composite
    name: quality_gate
    aggregation: threshold
    threshold: 0.5  # At least 50% of child evaluators must pass
    evaluators:
      - type: field_accuracy
        mode: contains
        value: "refund"
      - type: execution_metrics
        latency_ms: 5000
      - type: llm_judge
        prompt: "Is the response professional and helpful?"

Behavior

  • Each child evaluator produces a verdict (pass/borderline/fail)
  • The threshold strategy computes the proportion of passing children: pass_ratio = passing_count / total_count
  • Composite passes if pass_ratio >= threshold
  • Composite score is the pass_ratio value (0.0 to 1.0)
  • Details field lists each child's verdict
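The behavior above can be sketched in a few lines of Python. This is only an illustrative sketch, not the AgentEvals implementation: the names `ChildResult`, `CompositeResult`, and `aggregate_threshold` are hypothetical, and two points the proposal leaves open are filled in as labeled assumptions (a borderline verdict does not count as passing, and a composite with zero children gets a pass_ratio of 0.0).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChildResult:
    """Hypothetical container for one child evaluator's outcome."""
    name: str
    verdict: str  # "pass", "borderline", or "fail"


@dataclass
class CompositeResult:
    """Hypothetical container for the composite outcome."""
    verdict: str
    score: float
    details: List[str] = field(default_factory=list)


def aggregate_threshold(children: List[ChildResult], threshold: float) -> CompositeResult:
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")

    total_count = len(children)
    # Assumption: only an explicit "pass" counts; "borderline" is treated as not passing.
    passing_count = sum(1 for child in children if child.verdict == "pass")
    # Assumption: with zero children the ratio is defined as 0.0 to avoid dividing by zero.
    pass_ratio = passing_count / total_count if total_count else 0.0

    verdict = "pass" if pass_ratio >= threshold else "fail"
    details = [f"{child.name}: {child.verdict}" for child in children]
    return CompositeResult(verdict=verdict, score=pass_ratio, details=details)


# The quality_gate example: 2 of 3 children pass, so pass_ratio is 2/3 and 2/3 >= 0.5.
result = aggregate_threshold(
    [
        ChildResult("field_accuracy", "pass"),
        ChildResult("execution_metrics", "fail"),
        ChildResult("llm_judge", "pass"),
    ],
    threshold=0.5,
)
print(result.verdict, round(result.score, 3), result.details)
```

How borderline verdicts and empty child lists should actually be handled is a design decision for the implementation; the sketch only makes one possible choice explicit.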

Relation to Existing Features

This complements the existing aggregation strategies:

  • weighted_average — continuous score blending
  • code_judge — custom code decides final verdict
  • llm_judge — LLM decides final verdict
  • threshold (new) — N% of children must pass

See also #235 (assert-set evaluator, closed) — this proposal refines the approach as a composite aggregation strategy rather than a separate evaluator type.

Acceptance Criteria

  • composite evaluator accepts aggregation: threshold with threshold: float
  • Pass/fail based on proportion of passing children
  • Score equals the pass ratio
  • Details lists each child evaluator's verdict
  • EVAL.yaml schema updated
  • Unit tests for edge cases (0 children, all pass, all fail, exact threshold)
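The edge cases named in the last criterion could be captured as a small table-driven check. The sketch below is in Python for illustration only; `threshold_verdict` is a hypothetical stand-in for the real aggregation code, and the expected result for zero children (score 0.0, verdict fail) is an assumption, since the proposal does not specify it.

```python
def threshold_verdict(verdicts, threshold):
    """Hypothetical helper: returns (verdict, score) for a list of child verdict strings."""
    passing = sum(1 for v in verdicts if v == "pass")
    # Assumption: an empty child list scores 0.0 instead of raising ZeroDivisionError.
    score = passing / len(verdicts) if verdicts else 0.0
    return ("pass" if score >= threshold else "fail", score)


EDGE_CASES = [
    # (description, child verdicts, threshold, expected verdict, expected score)
    ("zero children", [], 0.5, "fail", 0.0),
    ("all pass", ["pass"] * 3, 0.5, "pass", 1.0),
    ("all fail", ["fail"] * 3, 0.5, "fail", 0.0),
    ("exact threshold", ["pass", "fail"], 0.5, "pass", 0.5),
    ("just below threshold", ["pass", "fail", "fail"], 0.5, "fail", 1 / 3),
]


def run_edge_cases():
    """Checks every case and returns how many were verified."""
    for description, verdicts, threshold, expected_verdict, expected_score in EDGE_CASES:
        verdict, score = threshold_verdict(verdicts, threshold)
        assert verdict == expected_verdict, description
        assert abs(score - expected_score) < 1e-9, description
    return len(EDGE_CASES)


print(run_edge_cases(), "edge cases verified")
```

The "exact threshold" case matters because the comparison is pass_ratio >= threshold, so a ratio equal to the threshold should pass.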

Effort Estimate

1-2 days

Metadata

Labels

enhancement: New feature or request
