Goal
LLM-as-judge scoring is currently inconsistent across runs and skewed by known biases. The goal is to improve reliability and measurably reduce bias.
Known Issues
- Parse failures: judge output is sometimes malformed and cannot be parsed into a score
- Verbosity bias: longer responses score higher regardless of quality
- Position bias: first option in comparisons scores higher
- Self-preference: models prefer their own writing style
- Inconsistent scores across runs
Improvements
Calibration
- Maintain gold-standard examples with known scores
- Run calibration check before each eval batch
- Detect and flag judge drift
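The calibration steps above could be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual interface: the gold examples, the `judge` callable, and the drift threshold are all hypothetical placeholders.

```python
from statistics import mean

# Gold-standard examples with known human-assigned scores (hypothetical data).
GOLD_SET = [
    {"response": "Paris is the capital of France.", "expected": 5},
    {"response": "The capital of France is Berlin.", "expected": 1},
]

DRIFT_THRESHOLD = 1.0  # max tolerated mean absolute error on the gold set (assumed value)

def check_calibration(judge, gold_set=GOLD_SET, threshold=DRIFT_THRESHOLD):
    """Run the judge on the gold set before an eval batch and flag drift.

    `judge` is any callable mapping a response string to a numeric score;
    the real signature depends on the benchmark infrastructure.
    """
    errors = [abs(judge(ex["response"]) - ex["expected"]) for ex in gold_set]
    mae = mean(errors)
    return {"mae": mae, "drifted": mae > threshold}
```

A batch would only proceed when `drifted` is false; a flagged result means the judge no longer reproduces the known scores.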
Bias Mitigation
- Position invariance: Randomize presentation order
- Length normalization: Score content, not length; don't reward verbosity or penalize concise answers
- Multi-run agreement: Require >75% agreement across runs, or escalate to human review
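One way to make pairwise comparisons position-invariant is to randomize the initial order and then judge both orders, accepting a winner only when it is stable under the swap. The `judge(first, second)` interface below is a hypothetical sketch, not a defined API.

```python
import random

def judge_pair(judge, a, b, rng=random.Random(0)):
    """Mitigate position bias by judging a pair in both orders.

    `judge(first, second)` is assumed to return "first" or "second".
    A winner counts only if it survives swapping the presentation order.
    """
    # Randomize which response is presented first on the initial call.
    if rng.random() < 0.5:
        a, b = b, a
    verdict_ab = judge(a, b)
    verdict_ba = judge(b, a)
    winner_ab = a if verdict_ab == "first" else b
    winner_ba = b if verdict_ba == "first" else a
    if winner_ab == winner_ba:
        return winner_ab
    return None  # inconsistent under position swap -> escalate
```

A judge that always prefers whatever appears first will return `None` here, which is exactly the signal that position bias is driving the verdict.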
Structured Evaluation
- Multi-aspect rubrics: Separate rubrics for different quality dimensions
- Criterion-by-criterion: Judge each criterion independently, not holistically
- Evidence requirements: Judge must cite specific evidence for scores
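The structured-evaluation requirements above suggest a verdict shape like the following: one record per rubric criterion, each carrying a verbatim evidence quote that must actually appear in the judged response. The rubric dimensions and field names are illustrative assumptions.

```python
from dataclasses import dataclass

RUBRIC = ["accuracy", "completeness", "clarity"]  # example quality dimensions

@dataclass
class CriterionVerdict:
    criterion: str
    score: int    # e.g. 1-5 on this criterion alone
    evidence: str # verbatim quote from the judged response

def validate_verdicts(verdicts, response):
    """Reject a judgment unless every rubric criterion is scored
    independently and each score cites evidence found in the response."""
    scored = {v.criterion for v in verdicts}
    if scored != set(RUBRIC):
        return False  # holistic or partial judgments are rejected
    return all(v.evidence and v.evidence in response for v in verdicts)
```

Requiring the evidence string to be a substring of the response is a cheap check that catches fabricated justifications without any extra model calls.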
Implementation
- Add calibration set to benchmark infrastructure
- Implement position randomization in judge prompts
- Add agreement threshold checking
- Structure judge output to require evidence
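The agreement-threshold check in the implementation list might look like this: collect the score from several judge runs, return the majority score when it clears the threshold, and otherwise signal escalation. The 75% figure comes from the plan; everything else is a sketch.

```python
from collections import Counter

AGREEMENT_THRESHOLD = 0.75  # from the plan: require >75% agreement

def check_agreement(scores, threshold=AGREEMENT_THRESHOLD):
    """Return the majority score if enough runs agree, else None
    (signalling escalation to human review)."""
    if not scores:
        return None
    label, count = Counter(scores).most_common(1)[0]
    if count / len(scores) > threshold:
        return label
    return None
```

Note the strict inequality: four runs splitting 3-1 sit exactly at 75% and still escalate.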
Success Criteria
- Judge scores should be reproducible (>90% agreement across runs)
- Calibration drift should be detectable
- Bias patterns should be measurably reduced
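The reproducibility criterion needs a concrete metric; one simple choice, assumed here rather than specified by the plan, is the fraction of item-level score pairs that match across repeated runs.

```python
from itertools import combinations

def reproducibility(runs):
    """Fraction of item-level score pairs that match across runs.

    `runs` is a list of score lists, one list per run, aligned by item.
    The success criterion above targets > 0.9 on this metric.
    """
    pairs = matches = 0
    for run_a, run_b in combinations(runs, 2):
        for score_a, score_b in zip(run_a, run_b):
            pairs += 1
            matches += (score_a == score_b)
    return matches / pairs if pairs else 1.0
```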
References
- Research on LLM-as-judge biases
- GPQA judge calibration approach