feat: threshold aggregation strategy for composite evaluator

## Summary

Add `threshold` as a new aggregation strategy for the `composite` evaluator, implementing "at least N% of child evaluators must pass" logic.

## Motivation

promptfoo's `assert-set` groups assertions with a threshold — e.g., "at least 50% of these assertions must pass for the test case to pass." AgentEvals' `composite` evaluator currently supports `weighted_average`, `code_judge`, and `llm_judge` aggregation but lacks this mode. This gap was surfaced during the promptfoo integration assessment.

**Research reference**: [integration-assessment-promptfoo-braintrust.md](https://github.com/agentevals/agentevals-research/blob/main/research/proposals/integration-assessment-promptfoo-braintrust.md)

## Proposed EVAL.yaml Syntax

```yaml
evaluators:
  - type: composite
    name: quality_gate
    aggregation: threshold
    threshold: 0.5  # At least 50% of child evaluators must pass
    evaluators:
      - type: field_accuracy
        mode: contains
        value: "refund"
      - type: execution_metrics
        latency_ms: 5000
      - type: llm_judge
        prompt: "Is the response professional and helpful?"
```

## Behavior

- Each child evaluator produces a verdict (pass/borderline/fail)
- `threshold` counts the proportion of passing children: `pass_ratio = passing_count / total_count`
- Composite passes if `pass_ratio >= threshold`
- Composite score is the `pass_ratio` value (0.0 to 1.0)
- Details field lists each child's verdict

## Relation to Existing Features

This complements the existing aggregation strategies:
- `weighted_average` — continuous score blending
- `code_judge` — custom code decides final verdict
- `llm_judge` — LLM decides final verdict
- **`threshold`** (new) — N% of children must pass

See also #235 (assert-set evaluator, closed) — this proposal refines the approach as a composite aggregation strategy rather than a separate evaluator type.

## Acceptance Criteria

- [ ] `composite` evaluator accepts `aggregation: threshold` with `threshold: float`
- [ ] Pass/fail based on proportion of passing children
- [ ] Score equals the pass ratio
- [ ] Details lists each child evaluator's verdict
- [ ] EVAL.yaml schema updated
- [ ] Unit tests for edge cases (0 children, all pass, all fail, exact threshold)

## Effort Estimate

1-2 days

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: threshold aggregation strategy for composite evaluator #274

Summary

Motivation

Proposed EVAL.yaml Syntax

Behavior

Relation to Existing Features

Acceptance Criteria

Effort Estimate

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

feat: threshold aggregation strategy for composite evaluator #274

Description

Summary

Motivation

Proposed EVAL.yaml Syntax

Behavior

Relation to Existing Features

Acceptance Criteria

Effort Estimate

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions