Improve LLM-as-judge calibration and reliability #337

@ScuttleBot

Description

Goal

LLM-as-judge scoring is currently inconsistent across runs and shows systematic biases. This issue tracks work to improve reliability and reduce those biases.

Known Issues

  • Judge output sometimes fails to parse (free-form or malformed responses)
  • Verbosity bias: longer responses score higher regardless of quality
  • Position bias: first option in comparisons scores higher
  • Self-preference: models prefer their own writing style
  • Inconsistent scores across runs
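
The parse-failure issue above suggests a fallback parser rather than a single `json.loads` call. A minimal sketch (the verdict schema and function name are assumptions, not this repo's actual interface):

```python
import json
import re


def parse_judge_output(raw: str):
    """Try to recover a JSON verdict from a judge response.

    First attempts a strict parse, then falls back to scanning for an
    embedded {...} object (e.g. inside a ```json fence). Returns None
    when no verdict can be recovered, so callers can retry or escalate.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fall back: grab the outermost brace-delimited span, if any.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None
```

Returning `None` instead of raising keeps a single malformed judge response from aborting a whole eval batch.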

Improvements

Calibration

  1. Maintain gold-standard examples with known scores
  2. Run calibration check before each eval batch
  3. Detect and flag judge drift
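
The three steps above could be combined into a pre-batch check along these lines (function name, tolerance value, and the judge-as-callable interface are all illustrative assumptions):

```python
def check_calibration(judge, gold_examples, tolerance=0.5):
    """Score gold-standard examples and flag judge drift.

    `judge` is any callable returning a numeric score for an input;
    `gold_examples` pairs inputs with their known reference scores.
    Drift is flagged when mean absolute deviation exceeds `tolerance`.
    """
    deviations = [abs(judge(item) - gold) for item, gold in gold_examples]
    mean_dev = sum(deviations) / len(deviations)
    return {"mean_deviation": mean_dev, "drifted": mean_dev > tolerance}
```

Running this before each eval batch gives a cheap go/no-go signal; a drifted judge halts the batch instead of silently contaminating scores.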

Bias Mitigation

  1. Position invariance: Randomize the order in which candidate responses are presented
  2. Length normalization: Score independently of response length so concise answers aren't penalized
  3. Multi-run agreement: Require >75% agreement across repeated runs, otherwise escalate for review
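
Position randomization and the agreement threshold can be sketched together for pairwise comparisons. Here `judge(first, second)` is assumed to return `"first"` or `"second"`; verdicts are mapped back to stable labels before counting, so position bias averages out:

```python
import random
from collections import Counter


def judge_pairwise(judge, a, b, runs=4, threshold=0.75, rng=None):
    """Pairwise judging with randomized order and an agreement gate.

    Each run flips a coin for presentation order, then maps the
    positional verdict back to "a"/"b". Returns the majority label
    when its vote share exceeds `threshold`, else "escalate".
    """
    rng = rng or random.Random()
    votes = []
    for _ in range(runs):
        if rng.random() < 0.5:
            verdict = judge(a, b)
            votes.append("a" if verdict == "first" else "b")
        else:
            verdict = judge(b, a)  # swapped presentation order
            votes.append("b" if verdict == "first" else "a")
    label, count = Counter(votes).most_common(1)[0]
    return label if count / runs > threshold else "escalate"
```

The "escalate" return value is a placeholder for whatever review path the harness adopts (re-run with a stronger judge, or human review).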

Structured Evaluation

  1. Multi-aspect rubrics: Separate rubrics for different quality dimensions
  2. Criterion-by-criterion: Judge each criterion independently, not holistically
  3. Evidence requirements: Judge must cite specific evidence for scores
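
These three ideas compose naturally: iterate over a rubric, judge each criterion independently, and reject any score that arrives without cited evidence. The rubric dimensions and verdict fields below are illustrative assumptions:

```python
RUBRIC = {  # illustrative dimensions, not this repo's actual rubric
    "correctness": "Are all factual claims accurate?",
    "completeness": "Does the answer address every part of the question?",
    "clarity": "Is the answer easy to follow?",
}


def judge_with_rubric(judge, answer, rubric=RUBRIC):
    """Score each criterion independently, requiring cited evidence.

    `judge(answer, question)` is assumed to return a dict like
    {"score": int, "evidence": str}; verdicts with no evidence are
    rejected rather than silently accepted.
    """
    results = {}
    for criterion, question in rubric.items():
        verdict = judge(answer, question)
        if not verdict.get("evidence"):
            raise ValueError(f"no evidence cited for {criterion!r}")
        results[criterion] = verdict
    return results
```

Judging criterion-by-criterion keeps a strong impression on one dimension (e.g. fluency) from bleeding into the others, which is the usual failure mode of holistic scoring.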

Implementation

  1. Add calibration set to benchmark infrastructure
  2. Implement position randomization in judge prompts
  3. Add agreement threshold checking
  4. Structure judge output to require evidence

Success Criteria

  • Judge scores should be reproducible (>90% agreement across runs)
  • Calibration drift should be detectable
  • Bias patterns should be measurably reduced
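
The >90% reproducibility target needs a concrete metric. One simple option is mean exact-match agreement over all pairs of runs (a sketch; the real check might allow a tolerance band instead of exact equality):

```python
from itertools import combinations


def pairwise_agreement(runs):
    """Mean fraction of per-item score matches across all run pairs.

    `runs` is a list of score lists, one per eval run, aligned by item.
    Returns 1.0 for fewer than two runs (nothing to disagree with).
    """
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    matches = sum(
        sum(x == y for x, y in zip(r1, r2)) / len(r1)
        for r1, r2 in pairs
    )
    return matches / len(pairs)
```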

References

  • Research on LLM-as-judge biases
  • GPQA judge calibration approach
