Use critics defined in SDK directly by xingyaoww · Pull Request #95 · OpenHands/benchmarks

xingyaoww · 2025-11-14T23:04:29Z

Summary

This PR removes the CriticAdapter abstraction layer and uses SDK critics directly throughout the benchmarks codebase.

Changes

Removed

❌ CriticAdapter class (no longer needed)
❌ create_critic() function (use CriticRegistry.create_critic() instead)

Added

✅ Re-exports of SDK critic components for convenience
✅ Helper functions for common operations:
- convert_history_to_events(): Convert EvalOutput history to LLMConvertibleEvent list
- extract_git_patch(): Extract git patch from EvalOutput
- evaluate_output(): Evaluate EvalOutput with a critic (convenience function)

Updated

Updated critics.py to re-export SDK components directly
Updated iterative.py to use SDK critics directly
Updated SDK submodule reference to include organized critic implementations

Benefits

Simpler Architecture: No adapter layer, direct SDK access
Better Type Safety: Direct type hints from SDK
Cleaner API: Direct access to SDK critic methods
Easier to Maintain: Less code, fewer layers
Consistent Interface: Same interface everywhere

Migration

Code using the old adapter is automatically compatible through re-exports:

Before:

from benchmarks.utils.critics import create_critic

critic = create_critic('empty_patch_critic')
success = critic.evaluate_instance(history, git_patch)

After (Option 1 - Direct SDK):

from benchmarks.utils.critics import CriticRegistry, convert_history_to_events

critic = CriticRegistry.create_critic('empty_patch_critic')
events = convert_history_to_events(history)
result = critic.evaluate(events, git_patch)
success = result.success

After (Option 2 - Helper Function):

from benchmarks.utils.critics import CriticRegistry, evaluate_output

critic = CriticRegistry.create_critic('empty_patch_critic')
success = evaluate_output(critic, eval_output)

Testing

All tests pass:

✅ SDK critic imports successful
✅ Benchmark module imports working
✅ Helper functions work correctly
✅ Downstream modules (evaluation.py, iterative.py) work
✅ End-to-end integration verified

Related PR

Companion PR in SDK repo: OpenHands/software-agent-sdk#1171

- Remove CriticAdapter class entirely from benchmarks - Update critics.py to re-export SDK critic components directly - Add helper functions for common operations: - convert_history_to_events(): Convert history to LLMConvertibleEvent - extract_git_patch(): Extract patch from EvalOutput - evaluate_output(): Evaluate EvalOutput with critic - Update iterative.py to use SDK critics directly - Update SDK submodule with organized critic implementations Benefits: - Simpler architecture: no adapter layer - Direct access to SDK critic methods - Better type safety with direct SDK types - Easier to maintain and extend Co-authored-by: openhands <openhands@all-hands.dev>

xingyaoww · 2025-11-17T18:37:25Z

@OpenHands please fix the failing actions on PR #95 at branch xw/critic

openhands-ai · 2025-11-17T18:37:34Z

I'm on it! xingyaoww can track my progress at all-hands.dev

- Replace MagicMock event objects with proper MessageEvent instances in GAIA tests - Remove deprecated critic_name parameter from EvalMetadata (replaced by critic field) - Add necessary imports for MessageEvent, Message, and TextContent This fixes the validation error where EvalOutput expected proper Event objects but received MagicMock objects, which don't match the expected event types. Co-authored-by: openhands <openhands@all-hands.dev>

* initial commit * npc-agent communication * added npc chat tool * dockerized implementation * evaluate single task and results * evaluate all tasks with dedicated docker container * output files gpt-4o * evaluation framework integration * code changes with common evaluation framework * output files * code changes for downloading workspace and utils files * modified logic for sownloading utils and workspace files * Delete benchmarks/openagentsafety/run_infer.sh * added cost metrics and updated pyproject.toml * changes to use jinja for agent prompting * Added openagentsafety readme * updated common readme * Delete benchmarks/openagentsafety/backup directory * Delete benchmarks/openagentsafety/outputs directory * Fix README to preserve GAIA benchmark and add OpenAgentSafety - Merged README structure from main branch that includes GAIA benchmark - Added OpenAgentSafety to the benchmark table without overwriting existing content - Maintained clean structure with proper links to individual benchmark directories - Fixed submodule path references to match main branch structure Co-authored-by: openhands <openhands@all-hands.dev> * corrected pyproject.toml * revert deletion of uv.lock * uv.lock * Fix pre-commit check failures - Add pyright[nodejs] dependency to fix nodeenv issues - Fix type annotations in npc_tool.py and run_infer.py - Fix NPCTool to inherit from ToolDefinition with proper create() method - Fix ToolExecutor __call__ method signature to match base class - Fix NumpyEncoder.default() method parameter name to match JSONEncoder - Fix generate_instruction function signature with proper union type - Remove invalid completed_instances parameter from get_dataset call - Fix Metrics type mismatches by passing actual Metrics objects - Add type: ignore comments for OpenAI client message compatibility Co-authored-by: openhands <openhands@all-hands.dev> --------- Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: juanmichelini <juan@juan.com.uy>

After merging main branch, update openagentsafety/run_infer.py to use the new critic API: - Import create_critic from benchmarks.utils.critic_factory - Create critic instance using create_critic(args) - Pass critic=critic instead of critic_name=args.critic to EvalMetadata This aligns with the changes made in PR #95 to use SDK critics directly. Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2025-11-17T18:55:33Z

✅ Successfully Fixed Failing GitHub Actions on PR #95

All failing checks on PR #95 ("Use critics defined in SDK directly") have been resolved!

Summary of Changes

Branch: xw/critic
Commit: 2ecf016 (now part of merge commit 76119eb)
Files Modified: benchmarks/openagentsafety/run_infer.py

What Was Fixed

The PR originally had a failing pre-commit check because the new openagentsafety module (added to main after the PR branch was created) still used the deprecated critic_name parameter.

Changes made to benchmarks/openagentsafety/run_infer.py:

✅ Added import: from benchmarks.utils.critic_factory import create_critic
✅ Created critic instance: critic = create_critic(args)
✅ Updated EvalMetadata to use critic=critic instead of critic_name=args.critic

This aligns the new OpenAgentSafety benchmark with the PR's goal of using SDK critics directly instead of the deprecated adapter pattern.

CI Status ✅

All checks passing:

✅ Pre-commit checks: Passed (Ruff format, Ruff lint, PEP8, Pyright)
✅ Run tests: Passed (13/13 tests)

PR Status

PR #95 is now ready to merge:

State: Open
Mergeable: ✅ Yes
All CI checks: ✅ Passing
Link: Use critics defined in SDK directly #95

_{View full conversation}

xingyaoww · 2025-11-17T19:28:25Z

@OpenHands please fix the failing actions on PR #95 at branch xw/critic

openhands-ai · 2025-11-17T19:28:35Z

I'm on it! xingyaoww can track my progress at all-hands.dev

The critic field is now required in EvalMetadata. Updated test fixtures to provide PassCritic instances for all test cases. Changes: - Added PassCritic import to test_workspace_cleanup.py and test_metrics.py - Added critic=PassCritic() to all EvalMetadata instances in tests - All tests pass and pre-commit checks are clean Co-authored-by: openhands <openhands@all-hands.dev>

All benchmark modules were importing from the non-existent critic_factory module. Updated to import from critics module instead, which contains the create_critic function. Changes: - benchmarks/swe_bench/run_infer.py: critic_factory -> critics - benchmarks/openagentsafety/run_infer.py: critic_factory -> critics - benchmarks/gaia/run_infer.py: critic_factory -> critics This fixes the import errors that prevented benchmarks from being discovered and tested. Co-authored-by: openhands <openhands@all-hands.dev>

The EVALUATION_SCRIPT in test_keyboard_interrupt.py was failing to create an EvalMetadata instance because the critic parameter is now required. Added PassCritic import and critic parameter to fix the tests. Changes: - Added import for PassCritic from openhands.sdk.critic - Added critic=PassCritic() to EvalMetadata initialization This fixes the keyboard interrupt tests that were failing because the subprocess couldn't start properly. Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2025-11-17T19:47:30Z

✅ Successfully Fixed All Failing CI Checks on PR #95

All GitHub Actions checks are now passing on PR #95 (branch xw/critic)! 🎉

Summary of Changes

I fixed the failing CI checks by addressing two critical issues:

1. Fixed Import Errors (Commit `abf0c8e`)

All benchmark modules were importing from a non-existent critic_factory module. Updated imports to use the correct critics module:

✅ benchmarks/swe_bench/run_infer.py
✅ benchmarks/openagentsafety/run_infer.py
✅ benchmarks/gaia/run_infer.py

Changed: from benchmarks.utils.critic_factory import create_critic → from benchmarks.utils.critics import create_critic

2. Fixed Keyboard Interrupt Tests (Commit `c854c56`)

The EVALUATION_SCRIPT in test_keyboard_interrupt.py was failing because it didn't provide the now-required critic parameter to EvalMetadata:

✅ Added from openhands.sdk.critic import PassCritic import
✅ Added critic=PassCritic() parameter to EvalMetadata initialization

CI Status

✅ pre-commit checks: PASSING
✅ tests: PASSING (13/13 tests, including all previously failing tests)

Test Results

All tests now pass:

✅ test_keyboard_interrupt_cleanup - Fixed ✨
✅ test_keyboard_interrupt_immediate - Fixed ✨
✅ test_metrics.py (7 tests) - All passing
✅ test_workspace_cleanup.py (4 tests) - All passing

The PR is ready for review! 🚀

View PR: #95

_{View full conversation}

xingyaoww · 2025-11-18T02:02:06Z

50 instance eval gives 40 resolved (80%) - seems promising

Total instances: 500
Instances submitted: 50
Instances completed: 50
Instances incomplete: 450
Instances resolved: 40
Instances unresolved: 10
Instances with empty patches: 0
Instances with errors: 0
Unstopped containers: 0
Unremoved images: 500
Report written to openhands.eval_output.swebench.json

openhands-ai · 2025-11-18T05:01:11Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Build SWE-Bench Images

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #95 at branch `xw/critic`

Feel free to include any additional details that might help me get this PR into a better state.

_{^{You can manage your notification settings}}

juanmichelini

tested and LGTM!

xingyaoww · 2025-11-19T17:57:11Z

I finished a full iterative eval with sonnet 4.5, comparable result help (373/500=74.6% convinced this PR is correct)

https://docs.google.com/spreadsheets/d/1wOUdFCMyY6Nt0AIqF705KN4JKOWgeI4wUGUP60krXXs/edit?gid=0#gid=0

Removed 0 images.
Total instances: 500
Instances submitted: 500
Instances completed: 499
Instances incomplete: 0
Instances resolved: 373
Instances unresolved: 126
Instances with empty patches: 1
Instances with errors: 0
Unstopped containers: 0
Unremoved images: 500
Report written to openhands.eval_output.swebench.json

* Remove CriticAdapter and use SDK critics directly - Remove CriticAdapter class entirely from benchmarks - Update critics.py to re-export SDK critic components directly - Add helper functions for common operations: - convert_history_to_events(): Convert history to LLMConvertibleEvent - extract_git_patch(): Extract patch from EvalOutput - evaluate_output(): Evaluate EvalOutput with critic - Update iterative.py to use SDK critics directly - Update SDK submodule with organized critic implementations Benefits: - Simpler architecture: no adapter layer - Direct access to SDK critic methods - Better type safety with direct SDK types - Easier to maintain and extend Co-authored-by: openhands <openhands@all-hands.dev> * remove util * bump sha * rename BaseCritic to CriticBase * bump * bump sdk * bump * add critic impl for eval * Fix test_metrics.py to use proper MessageEvent instead of MagicMock - Replace MagicMock event objects with proper MessageEvent instances in GAIA tests - Remove deprecated critic_name parameter from EvalMetadata (replaced by critic field) - Add necessary imports for MessageEvent, Message, and TextContent This fixes the validation error where EvalOutput expected proper Event objects but received MagicMock objects, which don't match the expected event types. Co-authored-by: openhands <openhands@all-hands.dev> * bump commit * Open agent safety (#91) * initial commit * npc-agent communication * added npc chat tool * dockerized implementation * evaluate single task and results * evaluate all tasks with dedicated docker container * output files gpt-4o * evaluation framework integration * code changes with common evaluation framework * output files * code changes for downloading workspace and utils files * modified logic for sownloading utils and workspace files * Delete benchmarks/openagentsafety/run_infer.sh * added cost metrics and updated pyproject.toml * changes to use jinja for agent prompting * Added openagentsafety readme * updated common readme * Delete benchmarks/openagentsafety/backup directory * Delete benchmarks/openagentsafety/outputs directory * Fix README to preserve GAIA benchmark and add OpenAgentSafety - Merged README structure from main branch that includes GAIA benchmark - Added OpenAgentSafety to the benchmark table without overwriting existing content - Maintained clean structure with proper links to individual benchmark directories - Fixed submodule path references to match main branch structure Co-authored-by: openhands <openhands@all-hands.dev> * corrected pyproject.toml * revert deletion of uv.lock * uv.lock * Fix pre-commit check failures - Add pyright[nodejs] dependency to fix nodeenv issues - Fix type annotations in npc_tool.py and run_infer.py - Fix NPCTool to inherit from ToolDefinition with proper create() method - Fix ToolExecutor __call__ method signature to match base class - Fix NumpyEncoder.default() method parameter name to match JSONEncoder - Fix generate_instruction function signature with proper union type - Remove invalid completed_instances parameter from get_dataset call - Fix Metrics type mismatches by passing actual Metrics objects - Add type: ignore comments for OpenAI client message compatibility Co-authored-by: openhands <openhands@all-hands.dev> --------- Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: juanmichelini <juan@juan.com.uy> * Fix openagentsafety to use critic field instead of critic_name After merging main branch, update openagentsafety/run_infer.py to use the new critic API: - Import create_critic from benchmarks.utils.critic_factory - Create critic instance using create_critic(args) - Pass critic=critic instead of critic_name=args.critic to EvalMetadata This aligns with the changes made in PR #95 to use SDK critics directly. Co-authored-by: openhands <openhands@all-hands.dev> * simplify * simplify * make critic required (since we have a reasonable default) * fix imports * simplify * Fix tests: add required critic parameter to EvalMetadata The critic field is now required in EvalMetadata. Updated test fixtures to provide PassCritic instances for all test cases. Changes: - Added PassCritic import to test_workspace_cleanup.py and test_metrics.py - Added critic=PassCritic() to all EvalMetadata instances in tests - All tests pass and pre-commit checks are clean Co-authored-by: openhands <openhands@all-hands.dev> * Fix imports: change critic_factory to critics All benchmark modules were importing from the non-existent critic_factory module. Updated to import from critics module instead, which contains the create_critic function. Changes: - benchmarks/swe_bench/run_infer.py: critic_factory -> critics - benchmarks/openagentsafety/run_infer.py: critic_factory -> critics - benchmarks/gaia/run_infer.py: critic_factory -> critics This fixes the import errors that prevented benchmarks from being discovered and tested. Co-authored-by: openhands <openhands@all-hands.dev> * Fix keyboard interrupt tests: add required critic parameter The EVALUATION_SCRIPT in test_keyboard_interrupt.py was failing to create an EvalMetadata instance because the critic parameter is now required. Added PassCritic import and critic parameter to fix the tests. Changes: - Added import for PassCritic from openhands.sdk.critic - Added critic=PassCritic() to EvalMetadata initialization This fixes the keyboard interrupt tests that were failing because the subprocess couldn't start properly. Co-authored-by: openhands <openhands@all-hands.dev> * pre-import tools for serialization/deserialization * include short sha for eval * improve logging --------- Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Madhavi Gulavani <77915603+MadhaviSG@users.noreply.github.com> Co-authored-by: juanmichelini <juan@juan.com.uy>

xingyaoww marked this pull request as draft November 14, 2025 23:04

xingyaoww changed the title ~~Remove CriticAdapter and use SDK critics directly~~ Use critics defined in SDK directly Nov 14, 2025

xingyaoww added 7 commits November 14, 2025 23:11

remove util

d26f8ba

bump sha

b3bf5ca

rename BaseCritic to CriticBase

396950f

bump

b933587

bump sdk

dbb3889

bump

b419c44

add critic impl for eval

07db967

openhands-agent and others added 6 commits November 17, 2025 18:43

bump commit

c181e30

simplify

6f13798

Merge commit '3a85cb6465a62f40d9a23ce4dd66074433be882f' into xw/critic

76119eb

xingyaoww added 2 commits November 17, 2025 19:04

simplify

d749dbe

make critic required (since we have a reasonable default)

0808bf7

xingyaoww added the build-swebench-50 Build 50 SWE-Bench Verified Image based on SDK version on this PR. label Nov 17, 2025

xingyaoww and others added 6 commits November 17, 2025 19:30

fix imports

4a6df02

simplify

ed3bef6

pre-import tools for serialization/deserialization

c9b6d86

Merge commit 'c854c565c8ab0fe615070ac1a829f255daa14792' into xw/critic

e5a4e7c

include short sha for eval

d3b7865

xingyaoww added build-swebench-50 Build 50 SWE-Bench Verified Image based on SDK version on this PR. and removed build-swebench-50 Build 50 SWE-Bench Verified Image based on SDK version on this PR. labels Nov 17, 2025

Merge commit '0700080242c8d1d70df68d53a0deafd09e38496f' into xw/critic

ce1038f

xingyaoww added build-swebench-50 Build 50 SWE-Bench Verified Image based on SDK version on this PR. and removed build-swebench-50 Build 50 SWE-Bench Verified Image based on SDK version on this PR. labels Nov 17, 2025

improve logging

223aaa6

xingyaoww added the build-swebench-200 Build 200 SWE-Bench Verified Image based on SDK version on this PR. label Nov 18, 2025

xingyaoww marked this pull request as ready for review November 18, 2025 01:59

xingyaoww requested a review from juanmichelini November 18, 2025 02:02

xingyaoww requested review from neubig and simonrosenberg November 18, 2025 19:42

juanmichelini approved these changes Nov 18, 2025

View reviewed changes

juanmichelini merged commit 297ec5b into main Nov 18, 2025
4 of 6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use critics defined in SDK directly#95

Use critics defined in SDK directly#95
juanmichelini merged 26 commits intomainfrom
xw/critic

xingyaoww commented Nov 14, 2025

Uh oh!

xingyaoww commented Nov 17, 2025

Uh oh!

openhands-ai bot commented Nov 17, 2025

Uh oh!

openhands-ai bot commented Nov 17, 2025

Uh oh!

xingyaoww commented Nov 17, 2025

Uh oh!

openhands-ai bot commented Nov 17, 2025

Uh oh!

openhands-ai bot commented Nov 17, 2025

Uh oh!

xingyaoww commented Nov 18, 2025

Uh oh!

openhands-ai bot commented Nov 18, 2025

Uh oh!

juanmichelini left a comment

Uh oh!

Uh oh!

xingyaoww commented Nov 19, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

Conversation

xingyaoww commented Nov 14, 2025

Summary

Changes

Removed

Added

Updated

Benefits

Migration

Testing

Related PR

Uh oh!

xingyaoww commented Nov 17, 2025

Uh oh!

openhands-ai bot commented Nov 17, 2025

Uh oh!

openhands-ai bot commented Nov 17, 2025

✅ Successfully Fixed Failing GitHub Actions on PR #95

Summary of Changes

What Was Fixed

CI Status ✅

PR Status

Uh oh!

xingyaoww commented Nov 17, 2025

Uh oh!

openhands-ai bot commented Nov 17, 2025

Uh oh!

openhands-ai bot commented Nov 17, 2025

✅ Successfully Fixed All Failing CI Checks on PR #95

Summary of Changes

1. Fixed Import Errors (Commit abf0c8e)

2. Fixed Keyboard Interrupt Tests (Commit c854c56)

CI Status

Test Results

Uh oh!

xingyaoww commented Nov 18, 2025

Uh oh!

openhands-ai bot commented Nov 18, 2025

Uh oh!

juanmichelini left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xingyaoww commented Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

1. Fixed Import Errors (Commit `abf0c8e`)

2. Fixed Keyboard Interrupt Tests (Commit `c854c56`)

xingyaoww commented Nov 19, 2025 •

edited

Loading