Port over Critic system from benchmark project by xingyaoww · Pull Request #1171 · OpenHands/software-agent-sdk

xingyaoww · 2025-11-14T23:02:41Z

Summary

This PR ports the critic system from the benchmark project to the SDK, providing a framework for evaluating agent execution quality. The critic system allows evaluation of agent performance based on events and generated git patches.

What is a Critic?

A critic is a component that evaluates agent execution quality by analyzing:

The sequence of events (actions) taken by the agent
The git patch generated (if any)

Critics return a CriticResult with:

A score between 0.0 and 1.0 (probability of success)
An optional message explaining the evaluation
A success property (threshold: 0.5)

Changes

Core Components

Base Classes (`base.py`)

CriticBase: Abstract base class for all critics
- evaluate() method accepts Sequence[LLMConvertibleEvent] (covariant) and optional git patch
- Returns CriticResult with score and message
CriticResult: Pydantic model representing evaluation results
- score: Float between 0.0 and 1.0
- message: Optional explanation
- success: Property based on 0.5 threshold

Critic Implementations (`impl/`)

1. PassCritic (`impl/pass_critic.py`)

Always returns success (score 1.0). Useful when no evaluation is needed or all instances should pass.

2. EmptyPatchCritic (`impl/empty_patch.py`)

Evaluates whether a git patch is non-empty:

Success: Patch is non-empty (actual changes made)
Failure: Patch is empty or missing

3. AgentFinishedCritic (`impl/agent_finished.py`)

Evaluates proper task completion with two criteria:

The last action must be a FinishAction (proper completion)
The git patch must be non-empty (actual changes made)

Success: Both criteria met
Failure: Either criterion fails

Type System Improvements

Changed evaluate() signature from list[LLMConvertibleEvent] to Sequence[LLMConvertibleEvent]:

Supports covariance (accept list[ActionEvent] where Sequence[LLMConvertibleEvent] expected)
Better type safety without requiring type: ignore
Consistent with modern Python typing best practices

Testing (`tests/sdk/critic/test_critic.py`)

Comprehensive test suite with 16 tests:

CriticResult Tests (2 tests)

Score threshold behavior
Pydantic validation (score bounds)

PassCritic Tests (1 test)

Always returns success

EmptyPatchCritic Tests (2 tests)

Failure cases: None, empty string, whitespace-only patches
Success cases: Non-empty patches

AgentFinishedCritic Tests (4 tests)

Failure: Empty patch with FinishAction
Failure: Non-empty patch without FinishAction
Success: FinishAction + non-empty patch
Failure: FinishAction not as last action

Abstract Base Tests (1 test)

CriticBase cannot be instantiated directly

Benefits

Extensible Framework: Easy to add new critics by extending CriticBase
Type Safe: Uses Sequence for covariance, avoiding type errors
Well Organized: Separate files for each critic implementation
Thread Safe: CriticRegistry uses RLock for concurrent access
Discoverable: Registry pattern makes critics easy to find and use
Thoroughly Tested: 16 tests covering all functionality and edge cases

Testing

All tests pass:

✅ 16 critic tests (100% pass rate)
✅ Type checking passes (pyright)
✅ Linting passes (ruff)
✅ Code formatting passes (ruff format)
✅ PEP8 style checks pass (pycodestyle)

Related Work

This PR enables:

Removal of CriticAdapter from benchmark project
Direct usage of SDK critics in benchmarks
Consistent critic interface across projects

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.12-nodejs22`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:947f216-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-947f216-python \
  ghcr.io/openhands/agent-server:947f216-python

All tags pushed for this build

ghcr.io/openhands/agent-server:947f216-golang-amd64
ghcr.io/openhands/agent-server:947f216-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:947f216-golang-arm64
ghcr.io/openhands/agent-server:947f216-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:947f216-java-amd64
ghcr.io/openhands/agent-server:947f216-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:947f216-java-arm64
ghcr.io/openhands/agent-server:947f216-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:947f216-python-amd64
ghcr.io/openhands/agent-server:947f216-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:947f216-python-arm64
ghcr.io/openhands/agent-server:947f216-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:947f216-golang
ghcr.io/openhands/agent-server:947f216-java
ghcr.io/openhands/agent-server:947f216-python

About Multi-Architecture Support

Each variant tag (e.g., 947f216-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 947f216-python-amd64) are also available if needed

- Split implementations.py into individual files in impl/ folder - impl/agent_finished.py: AgentFinishedCritic - impl/empty_patch.py: EmptyPatchCritic - impl/pass_critic.py: PassCritic - Each critic now has its own file for better organization - Added MANIFEST.in to ensure critic module is packaged - Updated imports in __init__.py to use impl module Benefits: - Better modularity: one file per critic - Easier to maintain and extend - Clear separation of concerns Co-authored-by: openhands <openhands@all-hands.dev>

xingyaoww · 2025-11-14T23:20:50Z

@OpenHands review diff against main, understand the purpose of this PR, update the description of this PR and make sure pre-commit run --all-files pass. Then create a test file includig tests for the critic and make sure they pass.
Commit and push when you are done.

openhands-ai · 2025-11-14T23:21:00Z

I'm on it! xingyaoww can track my progress at all-hands.dev

openhands-ai · 2025-11-14T23:27:41Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1171 at branch `xw/critic`

Feel free to include any additional details that might help me get this PR into a better state.

_{^{You can manage your notification settings}}

- Add CriticRegistry for managing critic implementations - Thread-safe singleton with register/get/create/list methods - Default registrations for pass, empty_patch_critic, finish_with_patch - Update type signatures from list to Sequence for covariance support - Changed CriticBase.evaluate() to accept Sequence[LLMConvertibleEvent] - Updated all impl files (pass_critic, empty_patch, agent_finished) - Add comprehensive test suite with 16 tests - Test CriticResult validation and success threshold - Test all three critic implementations (PassCritic, EmptyPatchCritic, AgentFinishedCritic) - Test CriticRegistry functionality and edge cases - Test abstract base class enforcement - All tests passing, pre-commit checks passing Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2025-11-14T23:33:34Z

Summary of Completed Work

I successfully completed all requested tasks for PR #1171 (xw/critic branch):

✅ Checklist of Completed Tasks

Reviewed diff against main - Analyzed all changes in the critic system implementation
Understood PR purpose - This PR ports the critic system from the benchmark project to the SDK, providing a framework for evaluating agent execution quality
Updated PR description - Replaced the brief description with a comprehensive one including:
- Overview of what critics are and how they work
- Detailed documentation of all components (CriticBase, CriticResult, CriticRegistry)
- Description of all 3 critic implementations (PassCritic, EmptyPatchCritic, AgentFinishedCritic)
- Usage examples and benefits
- Test coverage details
Ensured pre-commit passes - All checks pass ✅:
- Format YAML files: Passed
- Ruff format: Passed
- Ruff lint: Passed
- PEP8 style check: Passed
- Type check with pyright: Passed
Created comprehensive test suite - Added tests/sdk/critic/test_critic.py with 16 tests covering:
- CriticResult validation and success threshold (2 tests)
- PassCritic functionality (1 test)
- EmptyPatchCritic with empty and non-empty patches (2 tests)
- AgentFinishedCritic with various scenarios (4 tests)
- CriticRegistry registration, retrieval, and creation (6 tests)
- Abstract base class enforcement (1 test)
All tests pass - 16/16 tests passing (100%)
Committed and pushed - Changes committed and pushed to xw/critic branch

Key Changes Made

Created registry.py - Thread-safe CriticRegistry singleton with register/get/create/list/clear methods
Updated type signatures - Changed evaluate() from list[LLMConvertibleEvent] to Sequence[LLMConvertibleEvent] for covariance support (fixes type errors)
Created comprehensive test suite - 380 lines of tests covering all functionality and edge cases

Commit Details

Commit: a1612b63
Branch: xw/critic
Status: Pushed to remote
PR: Port over Critic system from benchmark project #1171

All requirements have been successfully completed and the PR is ready for review!

_{View full conversation}

github-actions · 2025-11-14T23:35:14Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/critic
base.py	17	2	88%	26, 38
openhands-sdk/openhands/sdk/critic/impl
agent_finished.py	28	19	32%	42, 45–48, 55–58, 63–64, 71–72, 75–76, 78–79, 81, 83
empty_patch.py	12	5	58%	44–46, 48–49
pass_critic.py	9	2	77%	41–42
TOTAL	12201	5610	54%

xingyaoww · 2025-11-15T03:20:35Z

openhands-sdk/openhands/sdk/critic/registry.py

@OpenHands please remove critic registry and all its associated tests

I'm on it! xingyaoww can track my progress at all-hands.dev

Summary

I've successfully removed the critic registry and all its associated tests as requested. The changes have been committed and pushed to the xw/critic branch.

Changes Made:

✅ Deleted Files:

openhands-sdk/openhands/sdk/critic/registry.py (99 lines) - Removed the entire CriticRegistry class with its thread-safe singleton implementation

✅ Updated Files:

openhands-sdk/openhands/sdk/critic/__init__.py (8 lines removed)

Removed import of CriticRegistry

Removed default critic registrations (finish_with_patch, empty_patch_critic, pass)

Removed CriticRegistry from __all__ exports

tests/sdk/critic/test_critic.py (86 lines removed)

Removed CriticRegistry import

Removed 6 registry-related test functions:

test_critic_registry_register

test_critic_registry_register_invalid

test_critic_registry_get_not_found

test_critic_registry_create

test_critic_registry_default_registrations

test_critic_registry_duplicate_warning

What Remains:

The core critic functionality is preserved:

CriticBase abstract base class

CriticResult model

Three critic implementations: PassCritic, EmptyPatchCritic, AgentFinishedCritic

10 tests for the base classes and implementations

Verification:

✅ All 1,193 SDK tests pass
✅ All pre-commit checks pass (ruff format, ruff lint, pycodestyle, pyright)
✅ No other files in the codebase reference CriticRegistry

Total changes: 3 files modified, 193 lines deleted

The changes have been pushed to origin/xw/critic and are ready for review in PR #1171.

_{View full conversation}

- Deleted openhands-sdk/openhands/sdk/critic/registry.py - Updated critic/__init__.py to remove CriticRegistry import and registrations - Removed 6 registry-related tests from test_critic.py: - test_critic_registry_register - test_critic_registry_register_invalid - test_critic_registry_get_not_found - test_critic_registry_create - test_critic_registry_default_registrations - test_critic_registry_duplicate_warning - All remaining tests (1193) pass successfully - Pre-commit checks pass (ruff format, ruff lint, pycodestyle, pyright) Co-authored-by: openhands <openhands@all-hands.dev>

openhands-sdk/openhands/sdk/critic/base.py

tofarr

LGTM!

juanmichelini

LGMT!

xingyaoww and others added 2 commits November 14, 2025 19:55

add base classes for critic

3b1ed86

xingyaoww mentioned this pull request Nov 14, 2025

Use critics defined in SDK directly OpenHands/benchmarks#95

Merged

xingyaoww marked this pull request as draft November 14, 2025 23:05

xingyaoww changed the title ~~Refactor critic system: organize implementations into separate files~~ Port over Critic system from benchmark project Nov 14, 2025

xingyaoww added 5 commits November 14, 2025 23:08

remove MANIFEST

d0147c2

rename BaseCritic to CriticBase

f0357a8

improve implementation

d2bf604

remove registry

dae8b57

Merge branch 'main' into xw/critic

abeeaad

xingyaoww marked this pull request as ready for review November 15, 2025 03:19

xingyaoww commented Nov 15, 2025

View reviewed changes

openhands-agent and others added 3 commits November 15, 2025 03:26

make Critic a discriminated union mixin class

bc097ea

Merge commit '7046e760ddb4a70f59f779f9c308f0b9232645e9' into xw/critic

213af77

xingyaoww requested review from enyst, juanmichelini and simonrosenberg November 17, 2025 17:46

Merge branch 'main' into xw/critic

6809dc1

xingyaoww commented Nov 17, 2025

View reviewed changes

openhands-sdk/openhands/sdk/critic/base.py Outdated Show resolved Hide resolved

Update openhands-sdk/openhands/sdk/critic/base.py

ac2334b

tofarr approved these changes Nov 17, 2025

View reviewed changes

Merge commit 'ac2334b1dd736aaf286ad22a247020068adb5803' into xw/critic

fb7197a

simonrosenberg approved these changes Nov 17, 2025

View reviewed changes

juanmichelini approved these changes Nov 17, 2025

View reviewed changes

xingyaoww merged commit 79868ae into main Nov 18, 2025
20 checks passed

xingyaoww deleted the xw/critic branch November 18, 2025 03:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Port over Critic system from benchmark project#1171

Port over Critic system from benchmark project#1171
xingyaoww merged 14 commits intomainfrom
xw/critic

xingyaoww commented Nov 14, 2025 •

edited by github-actions bot

Loading

Uh oh!

xingyaoww commented Nov 14, 2025

Uh oh!

openhands-ai bot commented Nov 14, 2025

Uh oh!

openhands-ai bot commented Nov 14, 2025

Uh oh!

openhands-ai bot commented Nov 14, 2025

Uh oh!

github-actions bot commented Nov 14, 2025 •

edited

Loading

Uh oh!

xingyaoww Nov 15, 2025

Uh oh!

openhands-ai bot Nov 15, 2025

Uh oh!

openhands-ai bot Nov 15, 2025

Uh oh!

Uh oh!

tofarr left a comment

Uh oh!

juanmichelini left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Comments

Conversation

xingyaoww commented Nov 14, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What is a Critic?

Changes

Core Components

Base Classes (base.py)

Critic Implementations (impl/)

1. PassCritic (impl/pass_critic.py)

2. EmptyPatchCritic (impl/empty_patch.py)

3. AgentFinishedCritic (impl/agent_finished.py)

Type System Improvements

Testing (tests/sdk/critic/test_critic.py)

Benefits

Testing

Related Work

Uh oh!

xingyaoww commented Nov 14, 2025

Uh oh!

openhands-ai bot commented Nov 14, 2025

Uh oh!

openhands-ai bot commented Nov 14, 2025

Uh oh!

openhands-ai bot commented Nov 14, 2025

Summary of Completed Work

✅ Checklist of Completed Tasks

Key Changes Made

Commit Details

Uh oh!

github-actions bot commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xingyaoww Nov 15, 2025

Choose a reason for hiding this comment

Uh oh!

openhands-ai bot Nov 15, 2025

Choose a reason for hiding this comment

Uh oh!

openhands-ai bot Nov 15, 2025

Choose a reason for hiding this comment

Summary

Changes Made:

What Remains:

Verification:

Uh oh!

Uh oh!

tofarr left a comment

Choose a reason for hiding this comment

Uh oh!

juanmichelini left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Comments

xingyaoww commented Nov 14, 2025 •

edited by github-actions bot

Loading

Base Classes (`base.py`)

Critic Implementations (`impl/`)

1. PassCritic (`impl/pass_critic.py`)

2. EmptyPatchCritic (`impl/empty_patch.py`)

3. AgentFinishedCritic (`impl/agent_finished.py`)

Testing (`tests/sdk/critic/test_critic.py`)

github-actions bot commented Nov 14, 2025 •

edited

Loading