
Open agent safety #91

Merged
neubig merged 25 commits into OpenHands:main from MadhaviSG:open-agent-safety
Nov 17, 2025

Conversation

Contributor

@MadhaviSG MadhaviSG commented Nov 13, 2025

Overview

Adds the OpenAgentSafety benchmark for evaluating AI agent safety in workplace scenarios with NPC (Non-Player Character) interactions.

Key Changes

OpenAgentSafety Benchmark

  • NPC interaction tool (chat_with_npc) for realistic workplace conversations (see the sketch after this list)
  • Jinja2 prompt templating for customizable agent behavior
  • Dockerized evaluation environment with TheAgentCompany integration
  • Cost metrics tracking and enhanced evaluation utilities
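The tool's source is not quoted in this thread, so the following is only a minimal sketch of what a chat_with_npc call could look like. The HTTP endpoint, response schema, and helper signature are assumptions for illustration, not the PR's actual code; only the NPC API key requirement comes from the Setup Requirements section below.

import os

import requests  # assumed HTTP client, not confirmed by the PR

def chat_with_npc(npc_name: str, message: str, base_url: str = "http://localhost:8000") -> str:
    """Hypothetical sketch: send a message to a named NPC and return its reply."""
    resp = requests.post(
        f"{base_url}/npc/{npc_name}/chat",  # hypothetical endpoint
        json={"message": message},
        headers={"Authorization": f"Bearer {os.environ['NPC_API_KEY']}"},  # key per Setup Requirements
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["reply"]  # hypothetical response schema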

Usage

uv run openagentsafety-infer .llm_config/gpt-4o-mini.json \
  --dataset mgulavani/openagentsafety_full_updated_v3 \
  --split train \
  --output-dir ./results
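The contents of .llm_config/gpt-4o-mini.json are not shown in this PR; a plausible minimal shape (field names are assumptions, check the repository's example configs) would be:

{
  "model": "gpt-4o-mini",
  "api_key": "<your key>",
  "base_url": "https://api.openai.com/v1"
}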

Output Files

Evaluation generates output files for each task:

  • eval_*.json - Final evaluation results and scores
  • traj_*.json - Complete agent trajectory and actions taken
  • state_*.json - Final environment state after task completion
  • error_*.json - Error logs if task fails

Plus summary files:

  • output.jsonl - Consolidated results across all tasks
  • output_critic.jsonl - Critic evaluation results (if enabled)
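Since output.jsonl holds one JSON record per task, a few lines of Python suffice to aggregate results. The per-record field names below ("instance_id", "test_result") are assumptions for illustration, not confirmed by this PR; check the benchmark's actual schema.

import json

# Read one JSON object per line from the consolidated results file.
with open("results/output.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} tasks evaluated")
for rec in records:
    # Field names are assumed, not taken from the PR.
    print(rec.get("instance_id"), rec.get("test_result"))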

Setup Requirements

  • Docker environment (~30 GB of disk space)
  • TheAgentCompany services
  • LLM config in .llm_config/ directory
  • NPC API key

Files Added

  • benchmarks/openagentsafety/ - Complete benchmark implementation
  • benchmarks/openagentsafety/custom_tools/npc_tool.py - NPC communication
  • benchmarks/openagentsafety/prompts/default.j2 - Jinja2 templates
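default.j2 itself is not reproduced in this thread; a minimal Jinja2 template of the kind described (variable names are illustrative assumptions) might look like:

You are an AI assistant working at a company.
Task: {{ task_description }}
{% if npc_names %}
You can talk to coworkers with the chat_with_npc tool: {{ npc_names | join(", ") }}.
{% endif %}
Complete the task while following company policy.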

@MadhaviSG MadhaviSG marked this pull request as draft November 13, 2025 05:59
- Merged README structure from main branch that includes GAIA benchmark
- Added OpenAgentSafety to the benchmark table without overwriting existing content
- Maintained clean structure with proper links to individual benchmark directories
- Fixed submodule path references to match main branch structure

Co-authored-by: openhands <openhands@all-hands.dev>
@MadhaviSG MadhaviSG marked this pull request as ready for review November 13, 2025 06:07
"toml",
"tqdm",
"openhands-sdk",
"openhands-tools",
Contributor

Is this intended?

Contributor Author

No, I think this file got messed up during conflict resolution. Let me take a look.

pyproject.toml Outdated
]
=======
lines-after-imports = 2
>>>>>>> 0ee2f25 (code changes with common evaluation framework)
Contributor

seems like a merge conflict here

Contributor Author


thanks for catching, will fix it

@neubig neubig requested review from neubig and xingyaoww November 13, 2025 20:56
Contributor

@neubig neubig left a comment

LGTM for now, thank you!

@juanmichelini
Collaborator

@OpenHands can you look at the pre-commit checks and make the minimal changes needed to pass them? Do not change anything else.


openhands-ai bot commented Nov 17, 2025

I'm on it! juanmichelini can track my progress at all-hands.dev

- Add pyright[nodejs] dependency to fix nodeenv issues
- Fix type annotations in npc_tool.py and run_infer.py
- Fix NPCTool to inherit from ToolDefinition with proper create() method
- Fix ToolExecutor __call__ method signature to match base class
- Fix NumpyEncoder.default() method parameter name to match JSONEncoder
- Fix generate_instruction function signature with proper union type
- Remove invalid completed_instances parameter from get_dataset call
- Fix Metrics type mismatches by passing actual Metrics objects
- Add type: ignore comments for OpenAI client message compatibility

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot commented Nov 17, 2025

Summary

I have successfully fixed all pre-commit check failures for PR #91 (OpenAgentSafety benchmark) with minimal changes:

✅ Request Completion Checklist

  • Analyzed the pre-commit check failures from the GitHub Actions link
  • Identified specific issues: Ruff formatting and Pyright type checking errors
  • Made only the minimal changes needed to pass pre-commit checks
  • Verified all pre-commit checks now pass (Ruff format, Ruff lint, PEP8, Pyright)
  • Did not change any functionality or make extraneous modifications
  • Committed and pushed changes to update PR #91 (Open agent safety)

🔧 Changes Made

Type System Fixes:

  • Added pyright[nodejs] dependency to resolve nodeenv installation issues
  • Fixed type annotations in npc_tool.py and run_infer.py
  • Added missing from typing import Any import
  • Fixed method signature mismatches (e.g., NumpyEncoder.default() parameter name)
  • Fixed function parameter types (str = None → str | None = None)

Code Corrections:

  • Removed invalid completed_instances parameter from get_dataset() call
  • Fixed metrics handling to pass actual Metrics objects instead of dictionaries
  • Fixed tool factory return type in register.py
  • Added # type: ignore comments for OpenAI client message compatibility
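As a concrete instance of the override-signature fixes above: json.JSONEncoder.default names its parameter o, so Pyright's strict override checking rejects a subclass that renames it. A minimal sketch of the NumpyEncoder fix follows; only the parameter rename is taken from the commit message, and the numpy-type handling shown is an assumption about what such an encoder typically does.

import json

import numpy as np

class NumpyEncoder(json.JSONEncoder):
    def default(self, o):  # parameter must be named "o" to match JSONEncoder.default
        if isinstance(o, np.integer):
            return int(o)
        if isinstance(o, np.floating):
            return float(o)
        if isinstance(o, np.ndarray):
            return o.tolist()
        return super().default(o)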

✅ Verification

All pre-commit checks now pass:

  • ✅ Ruff format
  • ✅ Ruff lint
  • ✅ PEP8 style check
  • ✅ Pyright type checking (strict)

The changes have been committed and pushed to the open-agent-safety branch, updating PR #91. The GitHub Actions pre-commit workflow should now pass successfully.


@neubig neubig enabled auto-merge (squash) November 17, 2025 17:39
@neubig neubig merged commit 3a85cb6 into OpenHands:main Nov 17, 2025
2 checks passed
xingyaoww pushed a commit that referenced this pull request Nov 17, 2025
* initial commit

* npc-agent communication

* added npc chat tool

* dockerized implementation

* evaluate single task and results

* evaluate all tasks with dedicated docker container

* output files gpt-4o

* evaluation framework integration

* code changes with common evaluation framework

* output files

* code changes for downloading workspace and utils files

* modified logic for downloading utils and workspace files

* Delete benchmarks/openagentsafety/run_infer.sh

* added cost metrics and updated pyproject.toml

* changes to use jinja for agent prompting

* Added openagentsafety readme

* updated common readme

* Delete benchmarks/openagentsafety/backup directory

* Delete benchmarks/openagentsafety/outputs directory

* Fix README to preserve GAIA benchmark and add OpenAgentSafety

- Merged README structure from main branch that includes GAIA benchmark
- Added OpenAgentSafety to the benchmark table without overwriting existing content
- Maintained clean structure with proper links to individual benchmark directories
- Fixed submodule path references to match main branch structure

Co-authored-by: openhands <openhands@all-hands.dev>

* corrected pyproject.toml

* revert deletion of uv.lock

* uv.lock

* Fix pre-commit check failures

- Add pyright[nodejs] dependency to fix nodeenv issues
- Fix type annotations in npc_tool.py and run_infer.py
- Fix NPCTool to inherit from ToolDefinition with proper create() method
- Fix ToolExecutor __call__ method signature to match base class
- Fix NumpyEncoder.default() method parameter name to match JSONEncoder
- Fix generate_instruction function signature with proper union type
- Remove invalid completed_instances parameter from get_dataset call
- Fix Metrics type mismatches by passing actual Metrics objects
- Add type: ignore comments for OpenAI client message compatibility

Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: juanmichelini <juan@juan.com.uy>
Contributor

It seems this tool is not passed to the Agent, is this expected?

Contributor Author

There were compatibility issues with custom tools, which is why this logic was moved to the Dockerfile. It is safe to remove this folder. We are also working on fixing another issue and will address it in the next PR. Thanks!

juanmichelini added a commit that referenced this pull request Nov 18, 2025
* Remove CriticAdapter and use SDK critics directly

- Remove CriticAdapter class entirely from benchmarks
- Update critics.py to re-export SDK critic components directly
- Add helper functions for common operations:
  - convert_history_to_events(): Convert history to LLMConvertibleEvent
  - extract_git_patch(): Extract patch from EvalOutput
  - evaluate_output(): Evaluate EvalOutput with critic
- Update iterative.py to use SDK critics directly
- Update SDK submodule with organized critic implementations

Benefits:
- Simpler architecture: no adapter layer
- Direct access to SDK critic methods
- Better type safety with direct SDK types
- Easier to maintain and extend

Co-authored-by: openhands <openhands@all-hands.dev>

* remove util

* bump sha

* rename BaseCritic to CriticBase

* bump

* bump sdk

* bump

* add critic impl for eval

* Fix test_metrics.py to use proper MessageEvent instead of MagicMock

- Replace MagicMock event objects with proper MessageEvent instances in GAIA tests
- Remove deprecated critic_name parameter from EvalMetadata (replaced by critic field)
- Add necessary imports for MessageEvent, Message, and TextContent

This fixes the validation error where EvalOutput expected proper Event objects
but received MagicMock objects, which don't match the expected event types.

Co-authored-by: openhands <openhands@all-hands.dev>

* bump commit

* Open agent safety (#91)

* initial commit

* npc-agent communication

* added npc chat tool

* dockerized implementation

* evaluate single task and results

* evaluate all tasks with dedicated docker container

* output files gpt-4o

* evaluation framework integration

* code changes with common evaluation framework

* output files

* code changes for downloading workspace and utils files

* modified logic for downloading utils and workspace files

* Delete benchmarks/openagentsafety/run_infer.sh

* added cost metrics and updated pyproject.toml

* changes to use jinja for agent prompting

* Added openagentsafety readme

* updated common readme

* Delete benchmarks/openagentsafety/backup directory

* Delete benchmarks/openagentsafety/outputs directory

* Fix README to preserve GAIA benchmark and add OpenAgentSafety

- Merged README structure from main branch that includes GAIA benchmark
- Added OpenAgentSafety to the benchmark table without overwriting existing content
- Maintained clean structure with proper links to individual benchmark directories
- Fixed submodule path references to match main branch structure

Co-authored-by: openhands <openhands@all-hands.dev>

* corrected pyproject.toml

* revert deletion of uv.lock

* uv.lock

* Fix pre-commit check failures

- Add pyright[nodejs] dependency to fix nodeenv issues
- Fix type annotations in npc_tool.py and run_infer.py
- Fix NPCTool to inherit from ToolDefinition with proper create() method
- Fix ToolExecutor __call__ method signature to match base class
- Fix NumpyEncoder.default() method parameter name to match JSONEncoder
- Fix generate_instruction function signature with proper union type
- Remove invalid completed_instances parameter from get_dataset call
- Fix Metrics type mismatches by passing actual Metrics objects
- Add type: ignore comments for OpenAI client message compatibility

Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: juanmichelini <juan@juan.com.uy>

* Fix openagentsafety to use critic field instead of critic_name

After merging main branch, update openagentsafety/run_infer.py to use the new
critic API:
- Import create_critic from benchmarks.utils.critic_factory
- Create critic instance using create_critic(args)
- Pass critic=critic instead of critic_name=args.critic to EvalMetadata

This aligns with the changes made in PR #95 to use SDK critics directly.

Co-authored-by: openhands <openhands@all-hands.dev>

* simplify

* simplify

* make critic required (since we have a reasonable default)

* fix imports

* simplify

* Fix tests: add required critic parameter to EvalMetadata

The critic field is now required in EvalMetadata. Updated test fixtures
to provide PassCritic instances for all test cases.

Changes:
- Added PassCritic import to test_workspace_cleanup.py and test_metrics.py
- Added critic=PassCritic() to all EvalMetadata instances in tests
- All tests pass and pre-commit checks are clean

Co-authored-by: openhands <openhands@all-hands.dev>

* Fix imports: change critic_factory to critics

All benchmark modules were importing from the non-existent
critic_factory module. Updated to import from critics module instead,
which contains the create_critic function.

Changes:
- benchmarks/swe_bench/run_infer.py: critic_factory -> critics
- benchmarks/openagentsafety/run_infer.py: critic_factory -> critics
- benchmarks/gaia/run_infer.py: critic_factory -> critics

This fixes the import errors that prevented benchmarks from being
discovered and tested.

Co-authored-by: openhands <openhands@all-hands.dev>

* Fix keyboard interrupt tests: add required critic parameter

The EVALUATION_SCRIPT in test_keyboard_interrupt.py was failing to
create an EvalMetadata instance because the critic parameter is now
required. Added PassCritic import and critic parameter to fix the tests.

Changes:
- Added import for PassCritic from openhands.sdk.critic
- Added critic=PassCritic() to EvalMetadata initialization

This fixes the keyboard interrupt tests that were failing because the
subprocess couldn't start properly.

Co-authored-by: openhands <openhands@all-hands.dev>

* pre-import tools for serialization/deserialization

* include short sha for eval

* improve logging

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Madhavi Gulavani <77915603+MadhaviSG@users.noreply.github.com>
Co-authored-by: juanmichelini <juan@juan.com.uy>
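Putting the critic changes from the squash message above together: benchmarks now build a critic with create_critic(args) (imported from benchmarks.utils.critics after the critic_factory rename), pass it to EvalMetadata via the now-required critic field, and tests fall back to PassCritic. A hedged sketch of the resulting call site; exact signatures are not shown in this thread, so the elided arguments are assumptions.

from benchmarks.utils.critics import create_critic  # module name per the rename commit
from openhands.sdk.critic import PassCritic  # import path taken from the test-fix commit

critic = create_critic(args)  # replaces the old critic_name=args.critic string
# metadata = EvalMetadata(..., critic=critic)   # other fields elided; critic is now required
# tests instead pass critic=PassCritic() as a no-op default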