Skip to content

Use critics defined in SDK directly#95

Merged
juanmichelini merged 26 commits intomainfrom
xw/critic
Nov 18, 2025
Merged

Use critics defined in SDK directly#95
juanmichelini merged 26 commits intomainfrom
xw/critic

Conversation

@xingyaoww
Copy link
Contributor

Summary

This PR removes the CriticAdapter abstraction layer and uses SDK critics directly throughout the benchmarks codebase.

Changes

Removed

  • CriticAdapter class (no longer needed)
  • create_critic() function (use CriticRegistry.create_critic() instead)

Added

  • ✅ Re-exports of SDK critic components for convenience
  • ✅ Helper functions for common operations:
    • convert_history_to_events(): Convert EvalOutput history to LLMConvertibleEvent list
    • extract_git_patch(): Extract git patch from EvalOutput
    • evaluate_output(): Evaluate EvalOutput with a critic (convenience function)

Updated

  • Updated critics.py to re-export SDK components directly
  • Updated iterative.py to use SDK critics directly
  • Updated SDK submodule reference to include organized critic implementations

Benefits

  • Simpler Architecture: No adapter layer, direct SDK access
  • Better Type Safety: Direct type hints from SDK
  • Cleaner API: Direct access to SDK critic methods
  • Easier to Maintain: Less code, fewer layers
  • Consistent Interface: Same interface everywhere

Migration

Code using the old adapter is automatically compatible through re-exports:

Before:

from benchmarks.utils.critics import create_critic

critic = create_critic('empty_patch_critic')
success = critic.evaluate_instance(history, git_patch)

After (Option 1 - Direct SDK):

from benchmarks.utils.critics import CriticRegistry, convert_history_to_events

critic = CriticRegistry.create_critic('empty_patch_critic')
events = convert_history_to_events(history)
result = critic.evaluate(events, git_patch)
success = result.success

After (Option 2 - Helper Function):

from benchmarks.utils.critics import CriticRegistry, evaluate_output

critic = CriticRegistry.create_critic('empty_patch_critic')
success = evaluate_output(critic, eval_output)

Testing

All tests pass:

  • ✅ SDK critic imports successful
  • ✅ Benchmark module imports working
  • ✅ Helper functions work correctly
  • ✅ Downstream modules (evaluation.py, iterative.py) work
  • ✅ End-to-end integration verified

Related PR

Companion PR in SDK repo: OpenHands/software-agent-sdk#1171

- Remove CriticAdapter class entirely from benchmarks
- Update critics.py to re-export SDK critic components directly
- Add helper functions for common operations:
  - convert_history_to_events(): Convert history to LLMConvertibleEvent
  - extract_git_patch(): Extract patch from EvalOutput
  - evaluate_output(): Evaluate EvalOutput with critic
- Update iterative.py to use SDK critics directly
- Update SDK submodule with organized critic implementations

Benefits:
- Simpler architecture: no adapter layer
- Direct access to SDK critic methods
- Better type safety with direct SDK types
- Easier to maintain and extend

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww xingyaoww marked this pull request as draft November 14, 2025 23:04
@xingyaoww xingyaoww changed the title Remove CriticAdapter and use SDK critics directly Use critics defined in SDK directly Nov 14, 2025
@xingyaoww
Copy link
Contributor Author

@OpenHands please fix the failing actions on PR #95 at branch xw/critic

@openhands-ai
Copy link

openhands-ai bot commented Nov 17, 2025

I'm on it! xingyaoww can track my progress at all-hands.dev

openhands-agent and others added 6 commits November 17, 2025 18:43
- Replace MagicMock event objects with proper MessageEvent instances in GAIA tests
- Remove deprecated critic_name parameter from EvalMetadata (replaced by critic field)
- Add necessary imports for MessageEvent, Message, and TextContent

This fixes the validation error where EvalOutput expected proper Event objects
but received MagicMock objects, which don't match the expected event types.

Co-authored-by: openhands <openhands@all-hands.dev>
* initial commit

* npc-agent communication

* added npc chat tool

* dockerized implementation

* evaluate single task and results

* evaluate all tasks with dedicated docker container

* output files gpt-4o

* evaluation framework integration

* code changes with common evaluation framework

* output files

* code changes for downloading workspace and utils files

* modified logic for sownloading utils and workspace files

* Delete benchmarks/openagentsafety/run_infer.sh

* added cost metrics and updated pyproject.toml

* changes to use jinja for agent prompting

* Added openagentsafety readme

* updated common readme

* Delete benchmarks/openagentsafety/backup directory

* Delete benchmarks/openagentsafety/outputs directory

* Fix README to preserve GAIA benchmark and add OpenAgentSafety

- Merged README structure from main branch that includes GAIA benchmark
- Added OpenAgentSafety to the benchmark table without overwriting existing content
- Maintained clean structure with proper links to individual benchmark directories
- Fixed submodule path references to match main branch structure

Co-authored-by: openhands <openhands@all-hands.dev>

* corrected pyproject.toml

* revert deletion of uv.lock

* uv.lock

* Fix pre-commit check failures

- Add pyright[nodejs] dependency to fix nodeenv issues
- Fix type annotations in npc_tool.py and run_infer.py
- Fix NPCTool to inherit from ToolDefinition with proper create() method
- Fix ToolExecutor __call__ method signature to match base class
- Fix NumpyEncoder.default() method parameter name to match JSONEncoder
- Fix generate_instruction function signature with proper union type
- Remove invalid completed_instances parameter from get_dataset call
- Fix Metrics type mismatches by passing actual Metrics objects
- Add type: ignore comments for OpenAI client message compatibility

Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: juanmichelini <juan@juan.com.uy>
After merging main branch, update openagentsafety/run_infer.py to use the new
critic API:
- Import create_critic from benchmarks.utils.critic_factory
- Create critic instance using create_critic(args)
- Pass critic=critic instead of critic_name=args.critic to EvalMetadata

This aligns with the changes made in PR #95 to use SDK critics directly.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Nov 17, 2025

✅ Successfully Fixed Failing GitHub Actions on PR #95

All failing checks on PR #95 ("Use critics defined in SDK directly") have been resolved!

Summary of Changes

Branch: xw/critic
Commit: 2ecf016 (now part of merge commit 76119eb)
Files Modified: benchmarks/openagentsafety/run_infer.py

What Was Fixed

The PR originally had a failing pre-commit check because the new openagentsafety module (added to main after the PR branch was created) still used the deprecated critic_name parameter.

Changes made to benchmarks/openagentsafety/run_infer.py:

  1. ✅ Added import: from benchmarks.utils.critic_factory import create_critic
  2. ✅ Created critic instance: critic = create_critic(args)
  3. ✅ Updated EvalMetadata to use critic=critic instead of critic_name=args.critic

This aligns the new OpenAgentSafety benchmark with the PR's goal of using SDK critics directly instead of the deprecated adapter pattern.

CI Status ✅

All checks passing:

  • Pre-commit checks: Passed (Ruff format, Ruff lint, PEP8, Pyright)
  • Run tests: Passed (13/13 tests)

PR Status

PR #95 is now ready to merge:

View full conversation

@xingyaoww
Copy link
Contributor Author

@OpenHands please fix the failing actions on PR #95 at branch xw/critic

@openhands-ai
Copy link

openhands-ai bot commented Nov 17, 2025

I'm on it! xingyaoww can track my progress at all-hands.dev

@xingyaoww xingyaoww added the build-swebench-50 Build 50 SWE-Bench Verified Image based on SDK version on this PR. label Nov 17, 2025
xingyaoww and others added 6 commits November 17, 2025 19:30
The critic field is now required in EvalMetadata. Updated test fixtures
to provide PassCritic instances for all test cases.

Changes:
- Added PassCritic import to test_workspace_cleanup.py and test_metrics.py
- Added critic=PassCritic() to all EvalMetadata instances in tests
- All tests pass and pre-commit checks are clean

Co-authored-by: openhands <openhands@all-hands.dev>
All benchmark modules were importing from the non-existent
critic_factory module. Updated to import from critics module instead,
which contains the create_critic function.

Changes:
- benchmarks/swe_bench/run_infer.py: critic_factory -> critics
- benchmarks/openagentsafety/run_infer.py: critic_factory -> critics
- benchmarks/gaia/run_infer.py: critic_factory -> critics

This fixes the import errors that prevented benchmarks from being
discovered and tested.

Co-authored-by: openhands <openhands@all-hands.dev>
The EVALUATION_SCRIPT in test_keyboard_interrupt.py was failing to
create an EvalMetadata instance because the critic parameter is now
required. Added PassCritic import and critic parameter to fix the tests.

Changes:
- Added import for PassCritic from openhands.sdk.critic
- Added critic=PassCritic() to EvalMetadata initialization

This fixes the keyboard interrupt tests that were failing because the
subprocess couldn't start properly.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Nov 17, 2025

✅ Successfully Fixed All Failing CI Checks on PR #95

All GitHub Actions checks are now passing on PR #95 (branch xw/critic)! 🎉

Summary of Changes

I fixed the failing CI checks by addressing two critical issues:

1. Fixed Import Errors (Commit abf0c8e)

All benchmark modules were importing from a non-existent critic_factory module. Updated imports to use the correct critics module:

  • benchmarks/swe_bench/run_infer.py
  • benchmarks/openagentsafety/run_infer.py
  • benchmarks/gaia/run_infer.py

Changed: from benchmarks.utils.critic_factory import create_criticfrom benchmarks.utils.critics import create_critic

2. Fixed Keyboard Interrupt Tests (Commit c854c56)

The EVALUATION_SCRIPT in test_keyboard_interrupt.py was failing because it didn't provide the now-required critic parameter to EvalMetadata:

  • ✅ Added from openhands.sdk.critic import PassCritic import
  • ✅ Added critic=PassCritic() parameter to EvalMetadata initialization

CI Status

  • pre-commit checks: PASSING
  • tests: PASSING (13/13 tests, including all previously failing tests)

Test Results

All tests now pass:

  • test_keyboard_interrupt_cleanup - Fixed ✨
  • test_keyboard_interrupt_immediate - Fixed ✨
  • test_metrics.py (7 tests) - All passing
  • test_workspace_cleanup.py (4 tests) - All passing

The PR is ready for review! 🚀

View PR: #95

View full conversation

@xingyaoww xingyaoww added build-swebench-50 Build 50 SWE-Bench Verified Image based on SDK version on this PR. and removed build-swebench-50 Build 50 SWE-Bench Verified Image based on SDK version on this PR. labels Nov 17, 2025
@xingyaoww xingyaoww added build-swebench-50 Build 50 SWE-Bench Verified Image based on SDK version on this PR. and removed build-swebench-50 Build 50 SWE-Bench Verified Image based on SDK version on this PR. labels Nov 17, 2025
@xingyaoww xingyaoww added the build-swebench-200 Build 200 SWE-Bench Verified Image based on SDK version on this PR. label Nov 18, 2025
@xingyaoww xingyaoww marked this pull request as ready for review November 18, 2025 01:59
@xingyaoww
Copy link
Contributor Author

50 instance eval gives 40 resolved (80%) - seems promising

Total instances: 500
Instances submitted: 50
Instances completed: 50
Instances incomplete: 450
Instances resolved: 40
Instances unresolved: 10
Instances with empty patches: 0
Instances with errors: 0
Unstopped containers: 0
Unremoved images: 500
Report written to openhands.eval_output.swebench.json

@xingyaoww xingyaoww added build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. and removed build-swebench-50 Build 50 SWE-Bench Verified Image based on SDK version on this PR. build-swebench-200 Build 200 SWE-Bench Verified Image based on SDK version on this PR. labels Nov 18, 2025
@openhands-ai
Copy link

openhands-ai bot commented Nov 18, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Build SWE-Bench Images

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #95 at branch `xw/critic`

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings

Copy link
Collaborator

@juanmichelini juanmichelini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested and LGTM!

@juanmichelini juanmichelini merged commit 297ec5b into main Nov 18, 2025
4 of 6 checks passed
@xingyaoww
Copy link
Contributor Author

xingyaoww commented Nov 19, 2025

I finished a full iterative eval with sonnet 4.5, comparable result help (373/500=74.6% convinced this PR is correct)

https://docs.google.com/spreadsheets/d/1wOUdFCMyY6Nt0AIqF705KN4JKOWgeI4wUGUP60krXXs/edit?gid=0#gid=0

Removed 0 images.
Total instances: 500
Instances submitted: 500
Instances completed: 499
Instances incomplete: 0
Instances resolved: 373
Instances unresolved: 126
Instances with empty patches: 1
Instances with errors: 0
Unstopped containers: 0
Unremoved images: 500
Report written to openhands.eval_output.swebench.json

juanmichelini added a commit that referenced this pull request Dec 17, 2025
* Remove CriticAdapter and use SDK critics directly

- Remove CriticAdapter class entirely from benchmarks
- Update critics.py to re-export SDK critic components directly
- Add helper functions for common operations:
  - convert_history_to_events(): Convert history to LLMConvertibleEvent
  - extract_git_patch(): Extract patch from EvalOutput
  - evaluate_output(): Evaluate EvalOutput with critic
- Update iterative.py to use SDK critics directly
- Update SDK submodule with organized critic implementations

Benefits:
- Simpler architecture: no adapter layer
- Direct access to SDK critic methods
- Better type safety with direct SDK types
- Easier to maintain and extend

Co-authored-by: openhands <openhands@all-hands.dev>

* remove util

* bump sha

* rename BaseCritic to CriticBase

* bump

* bump sdk

* bump

* add critic impl for eval

* Fix test_metrics.py to use proper MessageEvent instead of MagicMock

- Replace MagicMock event objects with proper MessageEvent instances in GAIA tests
- Remove deprecated critic_name parameter from EvalMetadata (replaced by critic field)
- Add necessary imports for MessageEvent, Message, and TextContent

This fixes the validation error where EvalOutput expected proper Event objects
but received MagicMock objects, which don't match the expected event types.

Co-authored-by: openhands <openhands@all-hands.dev>

* bump commit

* Open agent safety (#91)

* initial commit

* npc-agent communication

* added npc chat tool

* dockerized implementation

* evaluate single task and results

* evaluate all tasks with dedicated docker container

* output files gpt-4o

* evaluation framework integration

* code changes with common evaluation framework

* output files

* code changes for downloading workspace and utils files

* modified logic for sownloading utils and workspace files

* Delete benchmarks/openagentsafety/run_infer.sh

* added cost metrics and updated pyproject.toml

* changes to use jinja for agent prompting

* Added openagentsafety readme

* updated common readme

* Delete benchmarks/openagentsafety/backup directory

* Delete benchmarks/openagentsafety/outputs directory

* Fix README to preserve GAIA benchmark and add OpenAgentSafety

- Merged README structure from main branch that includes GAIA benchmark
- Added OpenAgentSafety to the benchmark table without overwriting existing content
- Maintained clean structure with proper links to individual benchmark directories
- Fixed submodule path references to match main branch structure

Co-authored-by: openhands <openhands@all-hands.dev>

* corrected pyproject.toml

* revert deletion of uv.lock

* uv.lock

* Fix pre-commit check failures

- Add pyright[nodejs] dependency to fix nodeenv issues
- Fix type annotations in npc_tool.py and run_infer.py
- Fix NPCTool to inherit from ToolDefinition with proper create() method
- Fix ToolExecutor __call__ method signature to match base class
- Fix NumpyEncoder.default() method parameter name to match JSONEncoder
- Fix generate_instruction function signature with proper union type
- Remove invalid completed_instances parameter from get_dataset call
- Fix Metrics type mismatches by passing actual Metrics objects
- Add type: ignore comments for OpenAI client message compatibility

Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: juanmichelini <juan@juan.com.uy>

* Fix openagentsafety to use critic field instead of critic_name

After merging main branch, update openagentsafety/run_infer.py to use the new
critic API:
- Import create_critic from benchmarks.utils.critic_factory
- Create critic instance using create_critic(args)
- Pass critic=critic instead of critic_name=args.critic to EvalMetadata

This aligns with the changes made in PR #95 to use SDK critics directly.

Co-authored-by: openhands <openhands@all-hands.dev>

* simplify

* simplify

* make critic required (since we have a reasonable default)

* fix imports

* simplify

* Fix tests: add required critic parameter to EvalMetadata

The critic field is now required in EvalMetadata. Updated test fixtures
to provide PassCritic instances for all test cases.

Changes:
- Added PassCritic import to test_workspace_cleanup.py and test_metrics.py
- Added critic=PassCritic() to all EvalMetadata instances in tests
- All tests pass and pre-commit checks are clean

Co-authored-by: openhands <openhands@all-hands.dev>

* Fix imports: change critic_factory to critics

All benchmark modules were importing from the non-existent
critic_factory module. Updated to import from critics module instead,
which contains the create_critic function.

Changes:
- benchmarks/swe_bench/run_infer.py: critic_factory -> critics
- benchmarks/openagentsafety/run_infer.py: critic_factory -> critics
- benchmarks/gaia/run_infer.py: critic_factory -> critics

This fixes the import errors that prevented benchmarks from being
discovered and tested.

Co-authored-by: openhands <openhands@all-hands.dev>

* Fix keyboard interrupt tests: add required critic parameter

The EVALUATION_SCRIPT in test_keyboard_interrupt.py was failing to
create an EvalMetadata instance because the critic parameter is now
required. Added PassCritic import and critic parameter to fix the tests.

Changes:
- Added import for PassCritic from openhands.sdk.critic
- Added critic=PassCritic() to EvalMetadata initialization

This fixes the keyboard interrupt tests that were failing because the
subprocess couldn't start properly.

Co-authored-by: openhands <openhands@all-hands.dev>

* pre-import tools for serialization/deserialization

* include short sha for eval

* improve logging

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Madhavi Gulavani <77915603+MadhaviSG@users.noreply.github.com>
Co-authored-by: juanmichelini <juan@juan.com.uy>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

Comments