Conversation

Collaborator

@sator-labs sator-labs commented Dec 2, 2025

Summary

Updated all 439 tests to work with the refactored API that uses data structures instead of file paths. This completes the file caching simplification initiative by eliminating 116 lines of complex caching code.

Changes

Core API Refactoring

  • Old API: Passed file paths (conversation_file_paths, rubric_folder, rubric_file)
  • New API: Pre-loaded data structures (ConversationData objects, RubricConfig objects); see the sketch below
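
A minimal sketch of the intended shapes. The loader methods are the ones named elsewhere in this PR; the field names are illustrative assumptions, not the exact definitions in judge/rubric_config.py.

from dataclasses import dataclass, field


@dataclass
class ConversationData:
    """One conversation, read from disk once at startup (illustrative fields)."""

    content: str                                   # full conversation text
    metadata: dict = field(default_factory=dict)   # e.g. {"filename": "conv1.txt"}

    @classmethod
    async def load(cls, path: str) -> "ConversationData":
        # Assumed loader; the PR shows ConversationData.load("path/to/conversation.txt").
        ...


@dataclass
class RubricConfig:
    """Parsed rubric plus prompt template, loaded once and shared by all judges (illustrative fields)."""

    rubric: dict                   # parsed rubric rows
    question_prompt_template: str  # prompt template text

    @classmethod
    async def load(cls, rubric_folder: str) -> "RubricConfig":
        # Assumed loader; the PR shows RubricConfig.load(rubric_folder="data").
        ...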

Test Updates

Unit Tests (352 tests) ✅

  • test_llm_judge.py (31 tests) - Updated to use rubric_config_factory fixture
  • test_runner_extra_params.py (6 tests) - Updated to pass ConversationData objects
  • test_judge_extra_params.py (11 tests) - Converted to async and fixed API calls
  • test_question_navigator.py (18 tests) - Created async navigator fixture loading from production rubric
  • All other unit tests (286 tests) - Already compatible

Integration Tests (42 tests) ✅

  • test_evaluation_runner.py - Fixed all 42 tests:
    • Replaced conversation_file_paths with conversations (List[ConversationData])
    • Added rubric_config parameter (RubricConfig object)
    • Updated LLMJudge instantiation to accept rubric_config instead of file paths
    • Fixed _create_evaluation_jobs() helper function tests
    • Updated conversation loading to use ConversationData.load() and load_conversations()

Files Modified

  • judge/rubric_config.py - New module with data structures and loaders
  • judge/llm_judge.py - Updated to accept RubricConfig
  • judge/runner.py - Updated to accept conversations and rubric_config
  • judge/question_navigator.py - Updated to accept parsed data dictionaries
  • judge/utils.py - Minor comment formatting fixes
  • tests/conftest.py - Added rubric_config_factory fixture
  • All test files - Updated to use new API

Files Deleted

  • judge/file_cache.py - Eliminated 116 lines of async caching infrastructure

Benefits

  • Simpler codebase: Removed complex file caching with double-checked locking patterns
  • Better architecture: Clear separation between I/O (at startup) and logic (during evaluation)
  • Easier testing: Mock data structures instead of file systems
  • Better performance: All files read once in parallel at startup, zero I/O during evaluation (see the loading sketch after this list)
  • More explicit: API clearly shows what data is needed
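
A rough sketch of the startup loading: the entry point gathers the loaders before any evaluation begins. The loader names come from this PR; their exact signatures and the load_conversations argument are assumptions.

import asyncio

from judge.rubric_config import RubricConfig, load_conversations


async def load_inputs(rubric_folder: str, conversation_folder: str):
    # Read the rubric/template and every conversation file exactly once,
    # concurrently, so no file I/O happens during evaluation.
    rubric_config, conversations = await asyncio.gather(
        RubricConfig.load(rubric_folder=rubric_folder),
        load_conversations(conversation_folder),  # assumed to return List[ConversationData]
    )
    return rubric_config, conversations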

sator-labs and others added 2 commits December 1, 2025 16:13
Resolved conflicts in:
- judge/llm_judge.py: Merged async file handling with constants imports
- judge/runner.py: Kept refactored worker-based implementation
- uv.lock: Included both aiofiles and freezegun dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

with open(conversation_path, "r", encoding="utf-8") as f:
    return f.read()
async with aiofiles.open(conversation_path, "r", encoding="utf-8") as f:
Collaborator


Is there any chance you want to cache the conversation files too? They'll be less thrash-y than the rubric and template, but if used across multiple judges and instances, could still be inefficient? (And they shouldn't be super-huge in memory.)

Collaborator Author


yeah I realized that we should read the convos once, and have them in memory

that would require an API change, because the judge would take the content of the file instead of the file path, but it makes more sense to me? thoughts?

@emily-vanark
Collaborator

The code looks mostly okay to me, but if I check out the branch and do uv run pytest, I get 22 failures:

================================================================================== short test summary info ===================================================================================
FAILED tests/integration/test_evaluation_runner.py::TestBatchEvaluateWithIndividualJudges::test_batch_evaluate_multiple_conversations - TypeError: batch_evaluate_with_individual_judges() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestBatchEvaluateWithIndividualJudges::test_batch_evaluate_with_limit - TypeError: batch_evaluate_with_individual_judges() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestBatchEvaluateWithIndividualJudges::test_batch_evaluate_extracts_run_id - TypeError: batch_evaluate_with_individual_judges() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestBatchEvaluateWithIndividualJudges::test_batch_evaluate_handles_evaluation_errors - TypeError: batch_evaluate_with_individual_judges() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestJudgeConversations::test_judge_conversations_basic_workflow - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestJudgeConversations::test_judge_conversations_custom_output_folder - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestJudgeConversations::test_judge_conversations_with_limit - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestJudgeConversations::test_judge_conversations_folder_not_found - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestJudgeConversations::test_judge_conversations_no_txt_files - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestJudgeConversations::test_judge_conversations_verbose_output - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestJudgeConversations::test_judge_conversations_creates_timestamped_folder - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestJudgeConversations::test_judge_conversations_no_save_aggregated - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestConversationFileLoading::test_load_conversation_with_unicode - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestConversationFileLoading::test_load_conversation_multiline_messages - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestEvaluationResultFormat::test_tsv_output_format_validation - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestEvaluationResultFormat::test_results_csv_contains_all_fields - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestEvaluationResultFormat::test_metadata_extraction_from_filenames - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestErrorHandlingAndEdgeCases::test_empty_conversation_file - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestErrorHandlingAndEdgeCases::test_malformed_conversation_format - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestErrorHandlingAndEdgeCases::test_special_characters_in_folder_path - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestErrorHandlingAndEdgeCases::test_very_long_conversation - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?
FAILED tests/integration/test_evaluation_runner.py::TestErrorHandlingAndEdgeCases::test_concurrent_file_writing - TypeError: judge_conversations() got an unexpected keyword argument 'judge_model'. Did you mean 'judge_models'?

@emily-vanark
Collaborator

Also, I'd like to try... running it. But I'm a bit confused about what a judge command would look like to have multiple judges run multiple instances in this brave new world. If you can give me an example of that command, I'm happy to give it a try and make sure it actually... works... for me.

sator-labs and others added 2 commits December 11, 2025 17:07
Resolved conflicts in judge.py and judge/runner.py by using the regular
__init__ constructor with judge_model_extra_params instead of the async
create() factory method. This ensures compatibility with both the async
I/O improvements and the new temperature CLI settings feature.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@sator-labs
Collaborator Author

Also, I'd like to try... running it. But I'm a bit confused about what a judge command would look like to have multiple judges run multiple instances in this brave new world. If you can give me an example of that command, I'm happy to give it a try and make sure it actually... works... for me.

sorry, this was due to the fact that there were too many PRs behind it. It should now work, and the test errors have been fixed

@sator-labs
Collaborator Author

For Internal Code

If you have any scripts or notebooks using the old API:

Before:

from judge import judge_conversations
from judge.llm_judge import LLMJudge

# Old way - passing paths
judge = LLMJudge(
    judge_model="claude-3-7-sonnet",
    rubric_folder="data",
)
await judge.evaluate_conversation_question_flow(
    conversation_file="path/to/conversation.txt",
    output_folder="output/",
)

After:

from judge import judge_conversations
from judge.llm_judge import LLMJudge
from judge.rubric_config import RubricConfig, ConversationData

# New way - pre-load data
rubric_config = await RubricConfig.load(rubric_folder="data")
conversation = await ConversationData.load("path/to/conversation.txt")

judge = LLMJudge(
    judge_model="claude-3-7-sonnet",
    rubric_config=rubric_config,
)
await judge.evaluate_conversation_question_flow(
    conversation=conversation,
    output_folder="output/",
)
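
For batch runs through the runner, the shape is roughly as follows. Treat this as a sketch rather than the exact signature: the module path and parameter names are pieced together from the refactored tests (conversations, rubric_config, judge_models).

import asyncio

from judge.rubric_config import RubricConfig, load_conversations
from judge.runner import batch_evaluate_with_individual_judges


async def main() -> None:
    # Hypothetical batch invocation; parameter names follow the refactored tests.
    rubric_config = await RubricConfig.load(rubric_folder="data")
    conversations = await load_conversations("conversations/test")

    await batch_evaluate_with_individual_judges(
        conversations=conversations,                          # List[ConversationData]
        rubric_config=rubric_config,                          # RubricConfig
        judge_models={"claude-3-7-sonnet": 3, "gpt-4o": 2},   # model -> instance count
        output_folder="output/",
    )


asyncio.run(main())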

Updated all 439 tests to work with the refactored API that uses data structures
instead of file paths. This completes the file caching simplification from
the previous commits.

Changes:
- Updated 31 tests in test_llm_judge.py to use rubric_config_factory fixture
- Updated 6 tests in test_runner_extra_params.py to pass ConversationData objects
- Updated 11 tests in test_judge_extra_params.py - converted to async and fixed API calls
- Updated 18 tests in test_question_navigator.py - created async navigator fixture
- Fixed 42 integration tests in test_evaluation_runner.py to use new API:
  * Replaced conversation_file_paths with conversations (List[ConversationData])
  * Added rubric_config parameter (RubricConfig object)
  * Updated LLMJudge instantiation to accept rubric_config instead of file paths
  * Fixed _create_evaluation_jobs helper function tests
  * Updated conversation loading to use ConversationData.load() and load_conversations()

All 439 tests now passing with the new data-structure-based API.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@sator-labs sator-labs requested a review from Copilot December 12, 2025 21:44
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors the test suite to work with the new API that uses pre-loaded data structures (ConversationData and RubricConfig objects) instead of file paths. The changes eliminate 116 lines of async file caching code while maintaining all 439 tests and improving the overall architecture by separating I/O operations from evaluation logic.

Key Changes:

  • Replaced file path parameters with pre-loaded data structures throughout the codebase
  • Updated all 439 tests to use rubric_config_factory fixture and ConversationData objects
  • Added new rubric_config.py module with RubricConfig, ConversationData, and load_conversations
  • Removed judge/file_cache.py (116 lines of caching infrastructure)

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 1 comment.

Summary per file:

  • tests/unit/judge/test_runner_extra_params.py - Updated 6 tests to create ConversationData objects and use rubric_config_factory
  • tests/unit/judge/test_llm_judge.py - Converted 31 tests to async and updated to use rubric_config_factory
  • tests/unit/judge/test_judge_extra_params.py - Converted 11 tests to async and updated API calls
  • tests/test_question_navigator.py - Updated 18 tests to use async navigator fixture loading from production rubric
  • tests/integration/test_evaluation_runner.py - Fixed all 42 integration tests to use new data structure API
  • tests/conftest.py - Added rubric_config_factory async fixture
  • test/code/test_async_cache.py - New test file referencing deleted file_cache module
  • pyproject.toml - Added aiofiles>=25.1.0 dependency
  • judge/utils.py - Minor comment formatting improvements
  • judge/runner.py - Updated to accept ConversationData and RubricConfig parameters
  • judge/rubric_config.py - New module with data structures and async loaders
  • judge/question_navigator.py - Updated to accept parsed data dictionaries
  • judge/llm_judge.py - Updated to accept RubricConfig instead of file paths
  • judge.py - Updated CLI to load data structures at startup


Comment on lines 1 to 104
#!/usr/bin/env python3
"""Test async file I/O and caching implementation."""

import asyncio

from judge.file_cache import get_cache_stats
from judge.llm_judge import LLMJudge
from judge.question_navigator import QuestionNavigator
from judge.runner import _create_evaluation_jobs


async def test_cache_module():
    """Test that cache module initializes correctly."""
    stats = get_cache_stats()
    assert stats["text_files_cached"] == 0
    assert stats["dataframes_cached"] == 0
    print("✅ Cache module initializes correctly")


async def test_question_navigator_async_factory():
    """Test QuestionNavigator.create() async factory method."""
    nav = await QuestionNavigator.create("data/rubric.tsv")
    assert nav is not None
    assert len(nav.question_order) > 0

    # Check that rubric was cached
    stats = get_cache_stats()
    assert stats["dataframes_cached"] == 1
    print(
        f"✅ QuestionNavigator.create() works (loaded {len(nav.question_order)} questions)"
    )


async def test_llm_judge_async_factory():
    """Test LLMJudge.create() async factory method."""
    judge = await LLMJudge.create(judge_model="claude-3-5-sonnet-20241022")
    assert judge is not None

    # Check that files were cached
    stats = get_cache_stats()
    assert stats["text_files_cached"] >= 1  # At least the question prompt template
    assert stats["dataframes_cached"] >= 1  # The rubric
    print(f"✅ LLMJudge.create() works (cache: {stats})")


def test_job_creation():
    """Test that job creation works with multiple judges."""
    conversation_files = [
        "conversations/test/conv1.txt",
        "conversations/test/conv2.txt",
    ]
    judge_models = {"claude-3-5-sonnet-20241022": 3, "gpt-4o": 2}
    output_folder = "test_output"

    jobs = _create_evaluation_jobs(conversation_files, judge_models, output_folder)

    expected_jobs = len(conversation_files) * sum(judge_models.values())
    assert len(jobs) == expected_jobs, f"Expected {expected_jobs} jobs, got {len(jobs)}"

    # Verify job structure
    for job in jobs:
        assert len(job) == 4  # (conversation_file, judge_model, instance, output_folder)
        assert job[1] in judge_models
        assert job[2] >= 1 and job[2] <= judge_models[job[1]]
        assert job[3] == output_folder

    print(f"✅ Job creation works ({len(jobs)} jobs created)")

    # Count jobs per model
    from collections import Counter

    job_counts = Counter(job[1] for job in jobs)
    expected_counts = {
        model: count * len(conversation_files) for model, count in judge_models.items()
    }
    assert dict(job_counts) == expected_counts
    print(f"✅ Jobs distributed correctly: {dict(job_counts)}")


async def main():
    """Run all tests."""
    print("=" * 60)
    print("Testing Async File I/O and Caching Implementation")
    print("=" * 60)

    # Test cache module
    await test_cache_module()

    # Test async factories
    await test_question_navigator_async_factory()
    await test_llm_judge_async_factory()

    # Test job creation (synchronous)
    test_job_creation()

    print("=" * 60)
    print("✅ All tests passed!")
    print("=" * 60)


if __name__ == "__main__":
    asyncio.run(main())

Copilot AI Dec 12, 2025


This import references judge.file_cache, which was deleted as part of this PR. The test file appears to be testing the old caching implementation that no longer exists. This file should either be deleted or completely rewritten to test the new data loading approach.

sator-labs and others added 7 commits December 12, 2025 13:54
Deleted test/code/test_async_cache.py which tested the old file caching
architecture that was removed during the refactoring to use RubricConfig
and data structures instead of file paths.

The file tested:
- judge.file_cache.get_cache_stats() (module deleted)
- QuestionNavigator.create() async factory (removed)
- LLMJudge.create() async factory (removed)
- File caching behavior (no longer exists)

These features were intentionally removed as part of the refactoring to
pre-load all data at startup and pass data structures through the API.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add --verbose-workers CLI flag to enable detailed logging of worker
lifecycle and concurrency behavior during judge evaluation.

Changes:
- Add --verbose-workers flag to judge.py CLI
- Thread verbose_workers parameter through runner functions
- Add worker lifecycle logging: start, processing, completion, finish
- Add worker pool summary showing total workers and concurrency mode
- Fix bug: use conversation.metadata['filename'] instead of non-existent file_path

This allows verification that --per-judge flag correctly spawns workers
per judge model (e.g., 3 judges × 5 workers = 15 concurrent workers).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
)

# Load all files in parallel
rubric_df_task = asyncio.to_thread(pd.read_csv, str(rubric_path), sep=sep)
Collaborator


I don't understand why we want to do this async / in parallel - there's only one file in rubric_path, yes? And it's not a super large one? And we're only reading it once?

return dimensions

@staticmethod
def _parse_rubric(
Collaborator


I didn't review this super closely, because I assumed it was the same code pulled from the former location into this one-time loader. If that's wrong, let me know and I'll review more closely.

Collaborator

@emily-vanark emily-vanark left a comment


LGTM!

@sator-labs sator-labs marked this pull request as ready for review January 6, 2026 11:41
@sator-labs sator-labs merged commit f7439cc into main Jan 6, 2026
@sator-labs sator-labs deleted the async_io branch January 6, 2026 11:41