Adding structured output #41
Conversation
Pull request overview
This PR implements a structured output system for LLM-based judge evaluation, replacing fragile string parsing with type-safe Pydantic models. The changes enable reliable extraction of answers and reasoning from LLMs when evaluating conversations against clinical rubrics.
Key Changes:
- Added structured output support using Pydantic models across all LLM providers (Claude, OpenAI, Gemini, Llama)
- Implemented `QuestionNavigator` class to handle rubric question flow logic separately from judge evaluation
- Refactored judge evaluation to use structured responses instead of string parsing
- Added comprehensive test suite for question navigation functionality
Reviewed changes
Copilot reviewed 22 out of 24 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| llm_clients/llm_interface.py | Added abstract method for structured response generation |
| llm_clients/claude_llm.py, openai_llm.py, gemini_llm.py | Implemented structured output using LangChain's with_structured_output() |
| llm_clients/llama_llm.py | Added limited structured output support with NotImplementedError fallback |
| judge/response_models.py | New file defining Pydantic models for structured LLM responses |
| judge/question_navigator.py | New file extracting rubric parsing and navigation logic from judge |
| judge/llm_judge.py | Refactored to use structured output and question navigator |
| tests/test_question_navigator.py | New comprehensive test suite for question navigation |
| pyproject.toml | Added pytest dependency |
| docs/structured-output.md, docs/evaluating.MD | New and updated documentation |
| data/question_prompt.txt | New simplified question prompt template |
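For illustration, a minimal sketch of what the provider-side change might look like with LangChain's `with_structured_output()`. The class body, method signature, and model name below are assumptions, not the repository's exact code:

```python
from pydantic import BaseModel
from langchain_anthropic import ChatAnthropic


class ClaudeLLM:
    """Hypothetical judge-capable client; not the repository's actual class body."""

    def __init__(self, model: str = "claude-sonnet-4-5"):  # model name is a placeholder
        self._llm = ChatAnthropic(model=model)

    def generate_structured_response(self, prompt: str, schema: type[BaseModel]) -> BaseModel:
        # with_structured_output() binds the Pydantic schema so the provider
        # returns a validated model instance instead of free-form text.
        structured_llm = self._llm.with_structured_output(schema)
        return structured_llm.invoke(prompt)
```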
Comments suppressed due to low confidence (2)
judge/llm_judge.py:1
- Method name changed from `evaluate_conversation` to `evaluate_conversation_question_flow`, but the old method name is still referenced in the runner. This suggests the method might have been renamed but not all call sites were updated.
"""LLM Judge for evaluating conversations based on rubrics."""
judge/score.py:1
- In the removed code, the line was `pd.DataFrame(results, columns=["filename"] + DIMENSIONS)`, but now it's `pd.DataFrame(results, columns=columns)` where `columns = ["filename", "run_id"] + list(results[0].keys())`. This could cause issues if `results` is empty, as `results[0]` would raise an IndexError.
#!/usr/bin/env python3
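For illustration, a minimal sketch of the guard that comment is pointing at; the wrapper function name `results_to_dataframe` is hypothetical and not code from this PR:

```python
import pandas as pd


def results_to_dataframe(results: list[dict]) -> pd.DataFrame:
    # Guard the empty case before touching results[0], which would raise IndexError.
    if not results:
        return pd.DataFrame(columns=["filename", "run_id"])
    columns = ["filename", "run_id"] + list(results[0].keys())
    return pd.DataFrame(results, columns=columns)
```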
Split LLMInterface into base and judge-capable interfaces following the Interface Segregation Principle. This provides type safety and clearer contracts for structured output capabilities.

Changes:
- Add JudgeLLM interface extending LLMInterface with structured output
- Update Claude/OpenAI/Gemini to implement JudgeLLM
- Simplify LlamaLLM by removing unsupported structured output method
- Add factory methods: create_judge_llm() and supports_structured_output()
- Add runtime validation in LLMJudge._create_evaluator()
- Export JudgeLLM from llm_clients package

Benefits:
- Type safety: Compile-time guarantees about structured output support
- Clear contracts: Interfaces explicitly declare capabilities
- Better errors: Early detection of unsupported operations with helpful messages
- Cleaner code: LlamaLLM no longer has unused methods raising NotImplementedError
- Backwards compatible: All existing code continues to work

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
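For reference, a rough sketch of the interface split described in that commit. `JudgeLLM` and `generate_structured_response` are named in this PR; the base method name `generate_response` is an assumption:

```python
from abc import ABC, abstractmethod
from pydantic import BaseModel


class LLMInterface(ABC):
    """Base contract: plain-text generation only (method name assumed)."""

    @abstractmethod
    def generate_response(self, prompt: str) -> str: ...


class JudgeLLM(LLMInterface):
    """Judge-capable contract: adds structured (Pydantic) output."""

    @abstractmethod
    def generate_structured_response(
        self, prompt: str, schema: type[BaseModel]
    ) -> BaseModel: ...
```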
Resolved merge conflicts in 4 files:
- generate_conversations/conversation_simulator.py: Added import time
- judge/llm_judge.py: Integrated structured output with JudgeLLM validation
- judge/question_navigator.py: Improved comment documentation
- judge/score.py: Applied PEP 8 line-wrapping improvements

Key changes from main branch:
- Comprehensive test suite with fixtures and mocks
- CI/CD workflow improvements
- Claude Code commands and configuration
- Human validation conversation samples
- Pre-commit hooks and code quality tools

Structured output branch features preserved:
- JudgeLLM interface for structured output validation
- QuestionResponse model for type-safe responses
- Reasoning length parameter support
- Constants for rating categories (BEST_PRACTICE, NEUTRAL, DAMAGING)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed 21 failing tests by addressing two issues:

1. MockLLM now implements JudgeLLM interface
   - Changed inheritance from LLMInterface to JudgeLLM
   - Implemented generate_structured_response method
   - Supports JSON parsing, Pydantic example data, and type-based defaults
   - Fixes 19 integration tests that required structured output support

2. Reasoning truncation now applied consistently
   - Added reasoning_length=100 parameter to _add_severity_reason calls
   - Ensures reasoning text is truncated to 100 chars in dimension scoring
   - Fixes 2 unit tests for high/medium risk reasoning truncation

All 386 tests now pass with 73% code coverage.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
judge/llm_judge.py (outdated)

    answer_data,
    high_risk_reasons,
    medium_risk_reasons,
    reasoning_length=100,
Why force the reasoning length to 100 here? I think I'd have expected it to default to None and take the whole string.... (also line 772?)
ugh thanks, I thought that was only for the mock test. let me revert to none
    class JudgeLLM(LLMInterface):
I like this separation.
…mentation

Updated test expectations to verify that reasoning text is preserved in full rather than being truncated. This aligns with the current implementation where reasoning_length=None allows complete reasoning to be included in evaluations.

Changes:
- test_reasoning_truncation_in_high_risk: Now expects 200 chars (full)
- test_reasoning_truncation_in_medium_risk: Now expects 200 chars (full)
- Updated _log_final_results default reasoning_length to None
- Updated _add_severity_reason calls to use reasoning_length=None

All 386 tests pass with 73% coverage.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This is meant to be merged after #29
[Claude generated below]
Implementation Details
Prompt Changes
The question prompt (data/question_prompt.txt) was simplified:
Before:
After:
The explicit format instructions are removed because the LLM understands the structure through the Pydantic model schema.
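For illustration, a minimal sketch of how the Pydantic model itself can carry the format contract. The field descriptions here are assumptions; the actual model lives in judge/response_models.py:

```python
from pydantic import BaseModel, Field


class QuestionResponse(BaseModel):
    """Structured reply for a single rubric question."""

    answer: str = Field(description="The selected option, copied verbatim from the rubric")
    reasoning: str = Field(description="Brief justification for the selected option")
```

When a model like this is passed to `with_structured_output()`, the field names, types, and descriptions are delivered to the provider as a schema, so the prompt text no longer needs to spell out the expected output format.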
Removed Code
The following parsing methods were removed from `llm_judge.py`:
- `_extract_answer()` - no longer needed, the answer comes from `response.answer`
- `_extract_reasoning()` - no longer needed, the reasoning comes from `response.reasoning`

Answer Matching
A new helper method was added to validate answers:
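The helper's body isn't shown here; a hedged sketch of what such validation might look like (the name, signature, and case-insensitive matching rule are assumptions):

```python
from typing import Optional


def _match_answer(self, answer: str, valid_options: list[str]) -> Optional[str]:
    """Return the rubric option matching the LLM's answer, or None if no match."""
    normalized = answer.strip().lower()
    for option in valid_options:
        if option.strip().lower() == normalized:
            return option
    return None
```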
This ensures that even with structured output, we validate the LLM chose a valid option from the rubric.