Use critics defined in SDK directly #95
Conversation
- Remove CriticAdapter class entirely from benchmarks
- Update critics.py to re-export SDK critic components directly
- Add helper functions for common operations:
  - convert_history_to_events(): Convert history to LLMConvertibleEvent
  - extract_git_patch(): Extract patch from EvalOutput
  - evaluate_output(): Evaluate EvalOutput with critic
- Update iterative.py to use SDK critics directly
- Update SDK submodule with organized critic implementations

Benefits:
- Simpler architecture: no adapter layer
- Direct access to SDK critic methods
- Better type safety with direct SDK types
- Easier to maintain and extend

Co-authored-by: openhands <openhands@all-hands.dev>
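For orientation, here is a minimal sketch of what a re-exporting critics.py plus the three helpers described above could look like. The import path openhands.sdk.critic is taken from later commits in this thread, but the helper bodies and the critic's evaluate() call are assumptions, not the actual implementation.

```python
# benchmarks/utils/critics.py -- illustrative sketch only; helper bodies and
# the critic `evaluate()` call are assumptions, not the real implementation.
from openhands.sdk.critic import CriticBase, PassCritic  # re-export SDK critics directly

__all__ = [
    "CriticBase",
    "PassCritic",
    "convert_history_to_events",
    "extract_git_patch",
    "evaluate_output",
]


def convert_history_to_events(output):
    """Convert an EvalOutput's history into a list of LLMConvertibleEvent objects."""
    return list(output.history)  # assumption: history already holds SDK events


def extract_git_patch(output):
    """Pull the git patch (if any) out of an EvalOutput."""
    return getattr(output, "git_patch", None)


def evaluate_output(output, critic):
    """Evaluate a single EvalOutput with an SDK critic (convenience wrapper)."""
    events = convert_history_to_events(output)
    # assumption: SDK critics expose an evaluate(events, git_patch=...) method
    return critic.evaluate(events, git_patch=extract_git_patch(output))
```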
@OpenHands please fix the failing actions on PR #95 at branch

I'm on it! xingyaoww can track my progress at all-hands.dev
- Replace MagicMock event objects with proper MessageEvent instances in GAIA tests
- Remove deprecated critic_name parameter from EvalMetadata (replaced by critic field)
- Add necessary imports for MessageEvent, Message, and TextContent

This fixes the validation error where EvalOutput expected proper Event objects but received MagicMock objects, which don't match the expected event types.

Co-authored-by: openhands <openhands@all-hands.dev>
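A hedged sketch of the fixture change this commit describes; the import paths and the MessageEvent constructor fields (source, llm_message) are assumptions about the SDK, not verified signatures.

```python
# Illustrative only: build a real MessageEvent instead of a MagicMock so that
# EvalOutput's event validation passes. Import paths and field names are assumed.
from openhands.sdk.event import MessageEvent
from openhands.sdk.llm import Message, TextContent

history = [
    MessageEvent(
        source="agent",
        llm_message=Message(role="assistant", content=[TextContent(text="done")]),
    )
]
```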
* initial commit
* npc-agent communication
* added npc chat tool
* dockerized implementation
* evaluate single task and results
* evaluate all tasks with dedicated docker container
* output files gpt-4o
* evaluation framework integration
* code changes with common evaluation framework
* output files
* code changes for downloading workspace and utils files
* modified logic for downloading utils and workspace files
* Delete benchmarks/openagentsafety/run_infer.sh
* added cost metrics and updated pyproject.toml
* changes to use jinja for agent prompting
* Added openagentsafety readme
* updated common readme
* Delete benchmarks/openagentsafety/backup directory
* Delete benchmarks/openagentsafety/outputs directory
* Fix README to preserve GAIA benchmark and add OpenAgentSafety
  - Merged README structure from main branch that includes GAIA benchmark
  - Added OpenAgentSafety to the benchmark table without overwriting existing content
  - Maintained clean structure with proper links to individual benchmark directories
  - Fixed submodule path references to match main branch structure
  Co-authored-by: openhands <openhands@all-hands.dev>
* corrected pyproject.toml
* revert deletion of uv.lock
* uv.lock
* Fix pre-commit check failures
  - Add pyright[nodejs] dependency to fix nodeenv issues
  - Fix type annotations in npc_tool.py and run_infer.py
  - Fix NPCTool to inherit from ToolDefinition with proper create() method
  - Fix ToolExecutor __call__ method signature to match base class
  - Fix NumpyEncoder.default() method parameter name to match JSONEncoder
  - Fix generate_instruction function signature with proper union type
  - Remove invalid completed_instances parameter from get_dataset call
  - Fix Metrics type mismatches by passing actual Metrics objects
  - Add type: ignore comments for OpenAI client message compatibility
  Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: juanmichelini <juan@juan.com.uy>
After merging main branch, update openagentsafety/run_infer.py to use the new critic API:
- Import create_critic from benchmarks.utils.critic_factory
- Create critic instance using create_critic(args)
- Pass critic=critic instead of critic_name=args.critic to EvalMetadata

This aligns with the changes made in PR #95 to use SDK critics directly.

Co-authored-by: openhands <openhands@all-hands.dev>
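A sketch of the run_infer.py change described above; everything except the critic argument itself is a placeholder, and a later commit in this PR moves the import from critic_factory to benchmarks.utils.critics.

```python
# Sketch of the described change; other EvalMetadata fields are placeholders,
# and EvalMetadata's own import path is not shown in this thread.
from benchmarks.utils.critic_factory import create_critic  # later moved to benchmarks.utils.critics

critic = create_critic(args)  # build an SDK critic instance from CLI args

metadata = EvalMetadata(
    # ...existing fields (model, dataset, output paths, ...)...
    critic=critic,  # previously: critic_name=args.critic
)
```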
✅ Successfully Fixed Failing GitHub Actions on PR #95

All failing checks on PR #95 ("Use critics defined in SDK directly") have been resolved!

Summary of Changes

Branch:

What Was Fixed

The PR originally had a failing pre-commit check because the new

Changes made to

This aligns the new OpenAgentSafety benchmark with the PR's goal of using SDK critics directly instead of the deprecated adapter pattern.

CI Status ✅

All checks passing:

PR Status

PR #95 is now ready to merge:
@OpenHands please fix the failing actions on PR #95 at branch

I'm on it! xingyaoww can track my progress at all-hands.dev
The critic field is now required in EvalMetadata. Updated test fixtures to provide PassCritic instances for all test cases.

Changes:
- Added PassCritic import to test_workspace_cleanup.py and test_metrics.py
- Added critic=PassCritic() to all EvalMetadata instances in tests
- All tests pass and pre-commit checks are clean

Co-authored-by: openhands <openhands@all-hands.dev>
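A minimal sketch of the fixture update, assuming EvalMetadata's other required fields can be elided; PassCritic is imported from openhands.sdk.critic as stated in a later commit.

```python
# Test-fixture sketch: EvalMetadata now requires a critic instance.
# EvalMetadata is the benchmarks' metadata model; its import path and other
# required fields are not shown in this thread, so they are elided here.
import pytest

from openhands.sdk.critic import PassCritic


@pytest.fixture
def eval_metadata():
    return EvalMetadata(
        # ...other required fields elided...
        critic=PassCritic(),  # simple default critic used in tests
    )
```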
All benchmark modules were importing from the non-existent critic_factory module. Updated to import from the critics module instead, which contains the create_critic function.

Changes:
- benchmarks/swe_bench/run_infer.py: critic_factory -> critics
- benchmarks/openagentsafety/run_infer.py: critic_factory -> critics
- benchmarks/gaia/run_infer.py: critic_factory -> critics

This fixes the import errors that prevented benchmarks from being discovered and tested.

Co-authored-by: openhands <openhands@all-hands.dev>
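The corrected import described above, with the broken one shown for contrast:

```python
# Before (module does not exist):
# from benchmarks.utils.critic_factory import create_critic

# After:
from benchmarks.utils.critics import create_critic
```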
The EVALUATION_SCRIPT in test_keyboard_interrupt.py was failing to create an EvalMetadata instance because the critic parameter is now required. Added PassCritic import and critic parameter to fix the tests.

Changes:
- Added import for PassCritic from openhands.sdk.critic
- Added critic=PassCritic() to EvalMetadata initialization

This fixes the keyboard interrupt tests that were failing because the subprocess couldn't start properly.

Co-authored-by: openhands <openhands@all-hands.dev>
✅ Successfully Fixed All Failing CI Checks on PR #95

All GitHub Actions checks are now passing on PR #95 (branch

Summary of Changes

I fixed the failing CI checks by addressing two critical issues:

1. Fixed Import Errors (Commit
50-instance eval gives 40 resolved (80%) - seems promising
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like

Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings.
I finished a full iterative eval with Sonnet 4.5; the comparable result (373/500 = 74.6%) convinces me this PR is correct: https://docs.google.com/spreadsheets/d/1wOUdFCMyY6Nt0AIqF705KN4JKOWgeI4wUGUP60krXXs/edit?gid=0#gid=0
* Remove CriticAdapter and use SDK critics directly
  - Remove CriticAdapter class entirely from benchmarks
  - Update critics.py to re-export SDK critic components directly
  - Add helper functions for common operations:
    - convert_history_to_events(): Convert history to LLMConvertibleEvent
    - extract_git_patch(): Extract patch from EvalOutput
    - evaluate_output(): Evaluate EvalOutput with critic
  - Update iterative.py to use SDK critics directly
  - Update SDK submodule with organized critic implementations
  Benefits:
  - Simpler architecture: no adapter layer
  - Direct access to SDK critic methods
  - Better type safety with direct SDK types
  - Easier to maintain and extend
  Co-authored-by: openhands <openhands@all-hands.dev>
* remove util
* bump sha
* rename BaseCritic to CriticBase
* bump
* bump sdk
* bump
* add critic impl for eval
* Fix test_metrics.py to use proper MessageEvent instead of MagicMock
  - Replace MagicMock event objects with proper MessageEvent instances in GAIA tests
  - Remove deprecated critic_name parameter from EvalMetadata (replaced by critic field)
  - Add necessary imports for MessageEvent, Message, and TextContent
  This fixes the validation error where EvalOutput expected proper Event objects but received MagicMock objects, which don't match the expected event types.
  Co-authored-by: openhands <openhands@all-hands.dev>
* bump commit
* Open agent safety (#91)
  * initial commit
  * npc-agent communication
  * added npc chat tool
  * dockerized implementation
  * evaluate single task and results
  * evaluate all tasks with dedicated docker container
  * output files gpt-4o
  * evaluation framework integration
  * code changes with common evaluation framework
  * output files
  * code changes for downloading workspace and utils files
  * modified logic for downloading utils and workspace files
  * Delete benchmarks/openagentsafety/run_infer.sh
  * added cost metrics and updated pyproject.toml
  * changes to use jinja for agent prompting
  * Added openagentsafety readme
  * updated common readme
  * Delete benchmarks/openagentsafety/backup directory
  * Delete benchmarks/openagentsafety/outputs directory
  * Fix README to preserve GAIA benchmark and add OpenAgentSafety
    - Merged README structure from main branch that includes GAIA benchmark
    - Added OpenAgentSafety to the benchmark table without overwriting existing content
    - Maintained clean structure with proper links to individual benchmark directories
    - Fixed submodule path references to match main branch structure
    Co-authored-by: openhands <openhands@all-hands.dev>
  * corrected pyproject.toml
  * revert deletion of uv.lock
  * uv.lock
  * Fix pre-commit check failures
    - Add pyright[nodejs] dependency to fix nodeenv issues
    - Fix type annotations in npc_tool.py and run_infer.py
    - Fix NPCTool to inherit from ToolDefinition with proper create() method
    - Fix ToolExecutor __call__ method signature to match base class
    - Fix NumpyEncoder.default() method parameter name to match JSONEncoder
    - Fix generate_instruction function signature with proper union type
    - Remove invalid completed_instances parameter from get_dataset call
    - Fix Metrics type mismatches by passing actual Metrics objects
    - Add type: ignore comments for OpenAI client message compatibility
    Co-authored-by: openhands <openhands@all-hands.dev>
  ---------
  Co-authored-by: openhands <openhands@all-hands.dev>
  Co-authored-by: juanmichelini <juan@juan.com.uy>
* Fix openagentsafety to use critic field instead of critic_name
  After merging main branch, update openagentsafety/run_infer.py to use the new critic API:
  - Import create_critic from benchmarks.utils.critic_factory
  - Create critic instance using create_critic(args)
  - Pass critic=critic instead of critic_name=args.critic to EvalMetadata
  This aligns with the changes made in PR #95 to use SDK critics directly.
  Co-authored-by: openhands <openhands@all-hands.dev>
* simplify
* simplify
* make critic required (since we have a reasonable default)
* fix imports
* simplify
* Fix tests: add required critic parameter to EvalMetadata
  The critic field is now required in EvalMetadata. Updated test fixtures to provide PassCritic instances for all test cases.
  Changes:
  - Added PassCritic import to test_workspace_cleanup.py and test_metrics.py
  - Added critic=PassCritic() to all EvalMetadata instances in tests
  - All tests pass and pre-commit checks are clean
  Co-authored-by: openhands <openhands@all-hands.dev>
* Fix imports: change critic_factory to critics
  All benchmark modules were importing from the non-existent critic_factory module. Updated to import from the critics module instead, which contains the create_critic function.
  Changes:
  - benchmarks/swe_bench/run_infer.py: critic_factory -> critics
  - benchmarks/openagentsafety/run_infer.py: critic_factory -> critics
  - benchmarks/gaia/run_infer.py: critic_factory -> critics
  This fixes the import errors that prevented benchmarks from being discovered and tested.
  Co-authored-by: openhands <openhands@all-hands.dev>
* Fix keyboard interrupt tests: add required critic parameter
  The EVALUATION_SCRIPT in test_keyboard_interrupt.py was failing to create an EvalMetadata instance because the critic parameter is now required. Added PassCritic import and critic parameter to fix the tests.
  Changes:
  - Added import for PassCritic from openhands.sdk.critic
  - Added critic=PassCritic() to EvalMetadata initialization
  This fixes the keyboard interrupt tests that were failing because the subprocess couldn't start properly.
  Co-authored-by: openhands <openhands@all-hands.dev>
* pre-import tools for serialization/deserialization
* include short sha for eval
* improve logging

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Madhavi Gulavani <77915603+MadhaviSG@users.noreply.github.com>
Co-authored-by: juanmichelini <juan@juan.com.uy>
Summary
This PR removes the CriticAdapter abstraction layer and uses SDK critics directly throughout the benchmarks codebase.

Changes
Removed

- CriticAdapter class (no longer needed)
- create_critic() function (use CriticRegistry.create_critic() instead)

Added

- convert_history_to_events(): Convert EvalOutput history to LLMConvertibleEvent list
- extract_git_patch(): Extract git patch from EvalOutput
- evaluate_output(): Evaluate EvalOutput with a critic (convenience function)

Updated

- critics.py to re-export SDK components directly
- iterative.py to use SDK critics directly

Benefits
Migration
Code using the old adapter is automatically compatible through re-exports:
Before:
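The original snippet appears to have been lost in extraction; below is a plausible reconstruction of the old adapter-based usage, with the CriticAdapter interface assumed rather than taken from the real code.

```python
# Hypothetical pre-PR usage; CriticAdapter's exact interface is assumed.
from benchmarks.utils.critics import CriticAdapter

adapter = CriticAdapter(critic_name="pass")  # adapter owned critic construction
result = adapter.evaluate(eval_output)       # and hid the SDK critic underneath
```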
After (Option 1 - Direct SDK):
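Again, the original block is missing; here is a sketch of direct SDK usage under the assumption that critics expose an evaluate-style method over converted events and the git patch.

```python
# Sketch: call the SDK critic directly (method name/arguments are assumptions).
from openhands.sdk.critic import PassCritic

from benchmarks.utils.critics import convert_history_to_events, extract_git_patch

critic = PassCritic()
events = convert_history_to_events(eval_output)  # eval_output: an EvalOutput from the run
result = critic.evaluate(events, git_patch=extract_git_patch(eval_output))
```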
After (Option 2 - Helper Function):
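And a sketch of the helper-function route, using the evaluate_output() convenience wrapper added by this PR (its exact signature is assumed).

```python
# Sketch: let the helper do the event conversion and patch extraction.
from openhands.sdk.critic import PassCritic

from benchmarks.utils.critics import evaluate_output

result = evaluate_output(eval_output, critic=PassCritic())
```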
Testing
All tests pass:
Related PR
Companion PR in SDK repo: OpenHands/software-agent-sdk#1171