Async io #45
Resolved conflicts in:
- judge/llm_judge.py: Merged async file handling with constants imports
- judge/runner.py: Kept refactored worker-based implementation
- uv.lock: Included both aiofiles and freezegun dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
judge/llm_judge.py
Outdated
```diff
-        with open(conversation_path, "r", encoding="utf-8") as f:
-            return f.read()
+        async with aiofiles.open(conversation_path, "r", encoding="utf-8") as f:
```
Is there any chance you want to cache the conversation files too? They'll be less thrash-y than the rubric and template, but if used across multiple judges and instances, could still be inefficient? (And they shouldn't be super-huge in memory.)
yeah, I realized that we should read the convos once and keep them in memory.
That would require an API change, because the judge would take the content of the file instead of the file path, but it makes more sense to me? thoughts?
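A rough sketch of the API change being discussed (all names here are hypothetical, not the PR's actual classes): files are read exactly once up front, and the judge receives conversation content rather than a path.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    name: str      # kept for reporting, e.g. the original filename
    content: str   # full conversation text, read once at startup


def load_conversations(paths):
    """Read each conversation file a single time so every judge
    instance can share the in-memory copy."""
    convos = []
    for p in paths:
        with open(p, "r", encoding="utf-8") as f:
            convos.append(Conversation(name=p, content=f.read()))
    return convos


class Judge:
    """Takes conversation content directly instead of a file path."""

    def evaluate(self, conversation: Conversation) -> int:
        # placeholder scoring; a real judge would call an LLM here
        return len(conversation.content.split())
```

Multiple judges and instances can then iterate over the same `Conversation` list without re-reading anything from disk.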
The code looks mostly okay to me, but if I check out the branch and to

Also, I'd like to try... running it. But I'm a bit confused about what a judge command would look like to have multiple judges run multiple instances in this brave new world. If you can give me an example of that command, I'm happy to give it a try and make sure it actually... works... for me.
Resolved conflicts in judge.py and judge/runner.py by using the regular __init__ constructor with judge_model_extra_params instead of the async create() factory method. This ensures compatibility with both the async I/O improvements and the new temperature CLI settings feature.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sorry, this was due to the fact that there were too many PRs behind it. It should work now, and the test errors have been fixed.
For internal code:

Old way - passing paths:
```python
judge = LLMJudge(
```

New way - pre-load data:
```python
rubric_config = await RubricConfig.load(rubric_folder="data")
judge = LLMJudge(
```
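A self-contained version of that before/after, with stub classes standing in for the real ones (the actual `LLMJudge`/`RubricConfig` signatures are only partially visible in this thread, so the parameter names here are assumptions):

```python
import asyncio


class RubricConfig:
    """Stub: the real class presumably parses rubric files in load()."""

    def __init__(self, rubric_folder: str):
        self.rubric_folder = rubric_folder

    @classmethod
    async def load(cls, rubric_folder: str) -> "RubricConfig":
        # the real loader would read rubric/template files here
        return cls(rubric_folder)


class LLMJudge:
    """Stub: the new-style judge receives a pre-loaded RubricConfig."""

    def __init__(self, judge_model: str, rubric_config: RubricConfig):
        self.judge_model = judge_model
        self.rubric_config = rubric_config


async def build_judge() -> LLMJudge:
    # New way: load data once, then hand the structure to the judge
    rubric_config = await RubricConfig.load(rubric_folder="data")
    return LLMJudge(judge_model="judge-model-name", rubric_config=rubric_config)


judge = asyncio.run(build_judge())
```

The key difference from the old way is that no file path ever reaches the judge; all I/O happens before construction.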
Updated all 439 tests to work with the refactored API that uses data structures instead of file paths. This completes the file caching simplification from the previous commits. Changes: - Updated 31 tests in test_llm_judge.py to use rubric_config_factory fixture - Updated 6 tests in test_runner_extra_params.py to pass ConversationData objects - Updated 11 tests in test_judge_extra_params.py - converted to async and fixed API calls - Updated 18 tests in test_question_navigator.py - created async navigator fixture - Fixed 42 integration tests in test_evaluation_runner.py to use new API: * Replaced conversation_file_paths with conversations (List[ConversationData]) * Added rubric_config parameter (RubricConfig object) * Updated LLMJudge instantiation to accept rubric_config instead of file paths * Fixed _create_evaluation_jobs helper function tests * Updated conversation loading to use ConversationData.load() and load_conversations() All 439 tests now passing with the new data-structure-based API. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Pull request overview
This PR refactors the test suite to work with the new API that uses pre-loaded data structures (ConversationData and RubricConfig objects) instead of file paths. The changes eliminate 116 lines of async file caching code while maintaining all 439 tests and improving the overall architecture by separating I/O operations from evaluation logic.
Key Changes:
- Replaced file path parameters with pre-loaded data structures throughout the codebase
- Updated all 439 tests to use the `rubric_config_factory` fixture and `ConversationData` objects
- Added new `rubric_config.py` module with `RubricConfig`, `ConversationData`, and `load_conversations`
- Removed `judge/file_cache.py` (116 lines of caching infrastructure)
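The class and function names above come from the PR, but their exact fields aren't shown in this review. A minimal sketch of what such structures might look like (the field layout is an assumption; only `metadata["filename"]` is mentioned elsewhere in the thread):

```python
from dataclasses import dataclass, field


@dataclass
class ConversationData:
    content: str
    metadata: dict = field(default_factory=dict)  # e.g. {"filename": ...}

    @classmethod
    def load(cls, path: str) -> "ConversationData":
        with open(path, "r", encoding="utf-8") as f:
            return cls(content=f.read(), metadata={"filename": path})


@dataclass
class RubricConfig:
    rubric: dict          # parsed rubric rows, keyed by question id
    prompt_template: str  # question prompt template text


def load_conversations(paths):
    """Load every conversation once, up front."""
    return [ConversationData.load(p) for p in paths]
```

Since all loading happens here, the rest of the pipeline can be pure evaluation logic with no file I/O, which is what makes the caching layer unnecessary.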
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `tests/unit/judge/test_runner_extra_params.py` | Updated 6 tests to create ConversationData objects and use rubric_config_factory |
| `tests/unit/judge/test_llm_judge.py` | Converted 31 tests to async and updated to use rubric_config_factory |
| `tests/unit/judge/test_judge_extra_params.py` | Converted 11 tests to async and updated API calls |
| `tests/test_question_navigator.py` | Updated 18 tests to use async navigator fixture loading from production rubric |
| `tests/integration/test_evaluation_runner.py` | Fixed all 42 integration tests to use new data structure API |
| `tests/conftest.py` | Added rubric_config_factory async fixture |
| `test/code/test_async_cache.py` | New test file referencing deleted file_cache module |
| `pyproject.toml` | Added aiofiles>=25.1.0 dependency |
| `judge/utils.py` | Minor comment formatting improvements |
| `judge/runner.py` | Updated to accept ConversationData and RubricConfig parameters |
| `judge/rubric_config.py` | New module with data structures and async loaders |
| `judge/question_navigator.py` | Updated to accept parsed data dictionaries |
| `judge/llm_judge.py` | Updated to accept RubricConfig instead of file paths |
| `judge.py` | Updated CLI to load data structures at startup |
test/code/test_async_cache.py
Outdated
```python
#!/usr/bin/env python3
"""Test async file I/O and caching implementation."""

import asyncio

from judge.file_cache import get_cache_stats
from judge.llm_judge import LLMJudge
from judge.question_navigator import QuestionNavigator
from judge.runner import _create_evaluation_jobs


async def test_cache_module():
    """Test that cache module initializes correctly."""
    stats = get_cache_stats()
    assert stats["text_files_cached"] == 0
    assert stats["dataframes_cached"] == 0
    print("✅ Cache module initializes correctly")


async def test_question_navigator_async_factory():
    """Test QuestionNavigator.create() async factory method."""
    nav = await QuestionNavigator.create("data/rubric.tsv")
    assert nav is not None
    assert len(nav.question_order) > 0

    # Check that rubric was cached
    stats = get_cache_stats()
    assert stats["dataframes_cached"] == 1
    print(
        f"✅ QuestionNavigator.create() works (loaded {len(nav.question_order)} questions)"
    )


async def test_llm_judge_async_factory():
    """Test LLMJudge.create() async factory method."""
    judge = await LLMJudge.create(judge_model="claude-3-5-sonnet-20241022")
    assert judge is not None

    # Check that files were cached
    stats = get_cache_stats()
    assert stats["text_files_cached"] >= 1  # At least the question prompt template
    assert stats["dataframes_cached"] >= 1  # The rubric
    print(f"✅ LLMJudge.create() works (cache: {stats})")


def test_job_creation():
    """Test that job creation works with multiple judges."""
    conversation_files = [
        "conversations/test/conv1.txt",
        "conversations/test/conv2.txt",
    ]
    judge_models = {"claude-3-5-sonnet-20241022": 3, "gpt-4o": 2}
    output_folder = "test_output"

    jobs = _create_evaluation_jobs(conversation_files, judge_models, output_folder)

    expected_jobs = len(conversation_files) * sum(judge_models.values())
    assert len(jobs) == expected_jobs, f"Expected {expected_jobs} jobs, got {len(jobs)}"

    # Verify job structure
    for job in jobs:
        assert (
            len(job) == 4
        )  # (conversation_file, judge_model, instance, output_folder)
        assert job[1] in judge_models
        assert job[2] >= 1 and job[2] <= judge_models[job[1]]
        assert job[3] == output_folder

    print(f"✅ Job creation works ({len(jobs)} jobs created)")

    # Count jobs per model
    from collections import Counter

    job_counts = Counter(job[1] for job in jobs)
    expected_counts = {
        model: count * len(conversation_files) for model, count in judge_models.items()
    }
    assert dict(job_counts) == expected_counts
    print(f"✅ Jobs distributed correctly: {dict(job_counts)}")


async def main():
    """Run all tests."""
    print("=" * 60)
    print("Testing Async File I/O and Caching Implementation")
    print("=" * 60)

    # Test cache module
    await test_cache_module()

    # Test async factories
    await test_question_navigator_async_factory()
    await test_llm_judge_async_factory()

    # Test job creation (synchronous)
    test_job_creation()

    print("=" * 60)
    print("✅ All tests passed!")
    print("=" * 60)


if __name__ == "__main__":
    asyncio.run(main())
```
Copilot AI · Dec 12, 2025
This import references judge.file_cache, which was deleted as part of this PR. The test file appears to be testing the old caching implementation that no longer exists. This file should either be deleted or completely rewritten to test the new data loading approach.
Deleted test/code/test_async_cache.py, which tested the old file caching architecture that was removed during the refactoring to use RubricConfig and data structures instead of file paths.

The file tested:
- judge.file_cache.get_cache_stats() (module deleted)
- QuestionNavigator.create() async factory (removed)
- LLMJudge.create() async factory (removed)
- File caching behavior (no longer exists)

These features were intentionally removed as part of the refactoring to pre-load all data at startup and pass data structures through the API.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add --verbose-workers CLI flag to enable detailed logging of worker lifecycle and concurrency behavior during judge evaluation.

Changes:
- Add --verbose-workers flag to judge.py CLI
- Thread verbose_workers parameter through runner functions
- Add worker lifecycle logging: start, processing, completion, finish
- Add worker pool summary showing total workers and concurrency mode
- Fix bug: use conversation.metadata['filename'] instead of non-existent file_path

This allows verification that --per-judge flag correctly spawns workers per judge model (e.g., 3 judges × 5 workers = 15 concurrent workers).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
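The worker math in that commit (e.g., 3 judges × 5 workers = 15 concurrent workers) can be sketched as a per-judge pool draining a shared job queue. The names and structure below are illustrative, not the PR's actual `judge/runner.py` code:

```python
import asyncio


async def run_jobs(jobs, judge_models, workers_per_judge, verbose=False):
    """Spawn workers_per_judge workers per judge model, all draining one queue."""
    queue: asyncio.Queue = asyncio.Queue()
    for job in jobs:
        queue.put_nowait(job)

    results = []

    async def worker(name):
        if verbose:
            print(f"[{name}] start")
        while True:
            try:
                job = queue.get_nowait()
            except asyncio.QueueEmpty:
                break
            # placeholder for the actual judge call
            results.append((name, job))
            await asyncio.sleep(0)  # yield so other workers get a turn
        if verbose:
            print(f"[{name}] finish")

    # one worker pool per judge model: len(judge_models) * workers_per_judge tasks
    await asyncio.gather(*(
        worker(f"{model}-w{i + 1}")
        for model in judge_models
        for i in range(workers_per_judge)
    ))
    return results
```

With `judge_models=["a", "b", "c"]` and `workers_per_judge=5`, fifteen worker tasks run concurrently over the shared queue.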
```python
        )

        # Load all files in parallel
        rubric_df_task = asyncio.to_thread(pd.read_csv, str(rubric_path), sep=sep)
```
I don't understand why we want to do this async / in parallel - there's only one file in rubric_path, yes? And it's not a super large one? And we're only reading it once?
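For context on the pattern being questioned: `asyncio.to_thread` runs a blocking call in a worker thread so the event loop stays responsive, and `asyncio.gather` overlaps several such loads. As the comment notes, that pays off when loading many files concurrently; for a single small rubric read once, a plain synchronous read is simpler. A minimal sketch of the general pattern (not the PR's code):

```python
import asyncio


def read_file(path: str) -> str:
    # ordinary blocking read
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


async def load_all(paths):
    # each read runs in a worker thread; gather overlaps them
    return await asyncio.gather(
        *(asyncio.to_thread(read_file, p) for p in paths)
    )
```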
```python
        return dimensions

    @staticmethod
    def _parse_rubric(
```
I didn't review this super closely, because I assumed it was the same code pulled from the former location into this one-time loader. If that's wrong, let me know and I'll review more closely.
emily-vanark
left a comment
LGTM!
Summary
Updated all 439 tests to work with the refactored API that uses data structures instead of file paths. This completes the file caching simplification initiative by eliminating 116 lines of complex caching code.
Changes
Core API Refactoring
- Removed file path parameters (`conversation_file_paths`, `rubric_folder`, `rubric_file`)
- Replaced them with pre-loaded data structures (`ConversationData` objects, `RubricConfig` objects)

Test Updates

Unit Tests (352 tests) ✅
- Updated to use the `rubric_config_factory` fixture and `ConversationData` objects

Integration Tests (42 tests) ✅
- Replaced `conversation_file_paths` → `conversations` (`List[ConversationData]`)
- Added `rubric_config` parameter (`RubricConfig` object)
- Updated `LLMJudge` instantiation to accept `rubric_config` instead of file paths
- Fixed `_create_evaluation_jobs()` helper function tests
- Updated conversation loading to use `ConversationData.load()` and `load_conversations()`

Files Modified
- `judge/rubric_config.py` - New module with data structures and loaders
- `judge/llm_judge.py` - Updated to accept RubricConfig
- `judge/runner.py` - Updated to accept conversations and rubric_config
- `judge/question_navigator.py` - Updated to accept parsed data dictionaries
- `judge/utils.py` - Minor comment formatting fixes
- `tests/conftest.py` - Added `rubric_config_factory` fixture

Files Deleted
- `judge/file_cache.py` - Eliminated 116 lines of async caching infrastructure

Benefits