Improve LLM-as-judge calibration and reliability #337

@ScuttleBot

Description

Goal

LLM-as-judge scoring is currently inconsistent across runs and shows systematic biases. This issue tracks work to improve reliability and reduce those biases.

Known Issues

  • Judge output sometimes fails to parse (free-form or malformed responses)
  • Verbosity bias: longer responses score higher regardless of quality
  • Position bias: first option in comparisons scores higher
  • Self-preference: models prefer their own writing style
  • Inconsistent scores across runs
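
The parse-failure issue above suggests a fallback parser rather than a single `json.loads` call. A minimal sketch (the verdict schema and function name are assumptions, not this repo's actual interface):

```python
import json
import re


def parse_judge_output(raw: str):
    """Try to recover a JSON verdict from a judge response.

    First attempts a strict parse, then falls back to scanning for an
    embedded {...} object (e.g. inside a ```json fence). Returns None
    when no verdict can be recovered, so callers can retry or escalate.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fall back: grab the outermost brace-delimited span, if any.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None
```

Returning `None` instead of raising keeps a single malformed judge response from aborting a whole eval batch.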

Improvements

Calibration

  1. Maintain gold-standard examples with known scores
  2. Run calibration check before each eval batch
  3. Detect and flag judge drift
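
The three steps above could be combined into a pre-batch check along these lines (function name, tolerance value, and the judge-as-callable interface are all illustrative assumptions):

```python
def check_calibration(judge, gold_examples, tolerance=0.5):
    """Score gold-standard examples and flag judge drift.

    `judge` is any callable returning a numeric score for an input;
    `gold_examples` pairs inputs with their known reference scores.
    Drift is flagged when mean absolute deviation exceeds `tolerance`.
    """
    deviations = [abs(judge(item) - gold) for item, gold in gold_examples]
    mean_dev = sum(deviations) / len(deviations)
    return {"mean_deviation": mean_dev, "drifted": mean_dev > tolerance}
```

Running this before each eval batch gives a cheap go/no-go signal; a drifted judge halts the batch instead of silently contaminating scores.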

Bias Mitigation

  1. Position invariance: Randomize the order in which candidate responses are presented
  2. Length normalization: Score independently of response length so concise answers aren't penalized
  3. Multi-run agreement: Require >75% agreement across repeated runs, otherwise escalate for review
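
Position randomization and the agreement threshold can be sketched together for pairwise comparisons. Here `judge(first, second)` is assumed to return `"first"` or `"second"`; verdicts are mapped back to stable labels before counting, so position bias averages out:

```python
import random
from collections import Counter


def judge_pairwise(judge, a, b, runs=4, threshold=0.75, rng=None):
    """Pairwise judging with randomized order and an agreement gate.

    Each run flips a coin for presentation order, then maps the
    positional verdict back to "a"/"b". Returns the majority label
    when its vote share exceeds `threshold`, else "escalate".
    """
    rng = rng or random.Random()
    votes = []
    for _ in range(runs):
        if rng.random() < 0.5:
            verdict = judge(a, b)
            votes.append("a" if verdict == "first" else "b")
        else:
            verdict = judge(b, a)  # swapped presentation order
            votes.append("b" if verdict == "first" else "a")
    label, count = Counter(votes).most_common(1)[0]
    return label if count / runs > threshold else "escalate"
```

The "escalate" return value is a placeholder for whatever review path the harness adopts (re-run with a stronger judge, or human review).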

Structured Evaluation

  1. Multi-aspect rubrics: Separate rubrics for different quality dimensions
  2. Criterion-by-criterion: Judge each criterion independently, not holistically
  3. Evidence requirements: Judge must cite specific evidence for scores
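
These three ideas compose naturally: iterate over a rubric, judge each criterion independently, and reject any score that arrives without cited evidence. The rubric dimensions and verdict fields below are illustrative assumptions:

```python
RUBRIC = {  # illustrative dimensions, not this repo's actual rubric
    "correctness": "Are all factual claims accurate?",
    "completeness": "Does the answer address every part of the question?",
    "clarity": "Is the answer easy to follow?",
}


def judge_with_rubric(judge, answer, rubric=RUBRIC):
    """Score each criterion independently, requiring cited evidence.

    `judge(answer, question)` is assumed to return a dict like
    {"score": int, "evidence": str}; verdicts with no evidence are
    rejected rather than silently accepted.
    """
    results = {}
    for criterion, question in rubric.items():
        verdict = judge(answer, question)
        if not verdict.get("evidence"):
            raise ValueError(f"no evidence cited for {criterion!r}")
        results[criterion] = verdict
    return results
```

Judging criterion-by-criterion keeps a strong impression on one dimension (e.g. fluency) from bleeding into the others, which is the usual failure mode of holistic scoring.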

Implementation

  1. Add calibration set to benchmark infrastructure
  2. Implement position randomization in judge prompts
  3. Add agreement threshold checking
  4. Structure judge output to require evidence

Success Criteria

  • Judge scores should be reproducible (>90% agreement across runs)
  • Calibration drift should be detectable
  • Bias patterns should be measurably reduced
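
The >90% reproducibility target needs a concrete metric. One simple option is mean exact-match agreement over all pairs of runs (a sketch; the real check might allow a tolerance band instead of exact equality):

```python
from itertools import combinations


def pairwise_agreement(runs):
    """Mean fraction of per-item score matches across all run pairs.

    `runs` is a list of score lists, one per eval run, aligned by item.
    Returns 1.0 for fewer than two runs (nothing to disagree with).
    """
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    matches = sum(
        sum(x == y for x, y in zip(r1, r2)) / len(r1)
        for r1, r2 in pairs
    )
    return matches / len(pairs)
```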

References

  • Research on LLM-as-judge biases
  • GPQA judge calibration approach
