
Open agent safety #91

Merged
neubig merged 25 commits into OpenHands:main from MadhaviSG:open-agent-safety
Nov 17, 2025

Conversation

Contributor

@MadhaviSG MadhaviSG commented Nov 13, 2025

Overview

Adds the OpenAgentSafety benchmark for evaluating AI agent safety in workplace scenarios with NPC (Non-Player Character) interactions.

Key Changes

OpenAgentSafety Benchmark

  • NPC interaction tool (chat_with_npc) for realistic workplace conversations (see the sketch after this list)
  • Jinja2 prompt templating for customizable agent behavior
  • Dockerized evaluation environment with TheAgentCompany integration
  • Cost metrics tracking and enhanced evaluation utilities
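The tool's source is not quoted in this thread, so the following is only a minimal sketch of what a chat_with_npc call could look like. The HTTP endpoint, response schema, and helper signature are assumptions for illustration, not the PR's actual code; only the NPC API key requirement comes from the Setup Requirements section below.

import os

import requests  # assumed HTTP client, not confirmed by the PR

def chat_with_npc(npc_name: str, message: str, base_url: str = "http://localhost:8000") -> str:
    """Hypothetical sketch: send a message to a named NPC and return its reply."""
    resp = requests.post(
        f"{base_url}/npc/{npc_name}/chat",  # hypothetical endpoint
        json={"message": message},
        headers={"Authorization": f"Bearer {os.environ['NPC_API_KEY']}"},  # key per Setup Requirements
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["reply"]  # hypothetical response schema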

Usage

uv run openagentsafety-infer .llm_config/gpt-4o-mini.json \
  --dataset mgulavani/openagentsafety_full_updated_v3 \
  --split train \
  --output-dir ./results
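The contents of .llm_config/gpt-4o-mini.json are not shown in this PR; a plausible minimal shape (field names are assumptions, check the repository's example configs) would be:

{
  "model": "gpt-4o-mini",
  "api_key": "<your key>",
  "base_url": "https://api.openai.com/v1"
}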

Output Files

Evaluation generates output files for each task:

  • eval_*.json - Final evaluation results and scores
  • traj_*.json - Complete agent trajectory and actions taken
  • state_*.json - Final environment state after task completion
  • error_*.json - Error logs if task fails

Plus summary files:

  • output.jsonl - Consolidated results across all tasks
  • output_critic.jsonl - Critic evaluation results (if enabled)
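Since output.jsonl holds one JSON record per task, a few lines of Python suffice to aggregate results. The per-record field names below ("instance_id", "test_result") are assumptions for illustration, not confirmed by this PR; check the benchmark's actual schema.

import json

# Read one JSON object per line from the consolidated results file.
with open("results/output.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} tasks evaluated")
for rec in records:
    # Field names are assumed, not taken from the PR.
    print(rec.get("instance_id"), rec.get("test_result"))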

Setup Requirements

  • Docker environment (~30 GB of disk space)
  • TheAgentCompany services
  • LLM config in .llm_config/ directory
  • NPC API key

Files Added

  • benchmarks/openagentsafety/ - Complete benchmark implementation
  • benchmarks/openagentsafety/custom_tools/npc_tool.py - NPC communication
  • benchmarks/openagentsafety/prompts/default.j2 - Jinja2 templates
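default.j2 itself is not reproduced in this thread; a minimal Jinja2 template of the kind described (variable names are illustrative assumptions) might look like:

You are an AI assistant working at a company.
Task: {{ task_description }}
{% if npc_names %}
You can talk to coworkers with the chat_with_npc tool: {{ npc_names | join(", ") }}.
{% endif %}
Complete the task while following company policy.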

@MadhaviSG MadhaviSG marked this pull request as draft November 13, 2025 05:59
- Merged README structure from main branch that includes GAIA benchmark
- Added OpenAgentSafety to the benchmark table without overwriting existing content
- Maintained clean structure with proper links to individual benchmark directories
- Fixed submodule path references to match main branch structure

Co-authored-by: openhands <openhands@all-hands.dev>
@MadhaviSG MadhaviSG marked this pull request as ready for review November 13, 2025 06:07
"toml",
"tqdm",
"openhands-sdk",
"openhands-tools",
Contributor

Is this intended?

Contributor Author

No, I think this file got messed up during conflict resolution. Let me take a look.

pyproject.toml Outdated
]
=======
lines-after-imports = 2
>>>>>>> 0ee2f25 (code changes with common evaluation framework)
Contributor

seems like a merge conflict here

Contributor Author


thanks for catching, will fix it

@neubig neubig requested review from neubig and xingyaoww November 13, 2025 20:56
Contributor

@neubig neubig left a comment

LGTM for now, thank you!

@juanmichelini
Collaborator

@OpenHands can you look at the pre-commit checks and make the minimal changes needed to pass them? Do not change anything else.


openhands-ai bot commented Nov 17, 2025

I'm on it! juanmichelini can track my progress at all-hands.dev

- Add pyright[nodejs] dependency to fix nodeenv issues
- Fix type annotations in npc_tool.py and run_infer.py
- Fix NPCTool to inherit from ToolDefinition with proper create() method
- Fix ToolExecutor __call__ method signature to match base class
- Fix NumpyEncoder.default() method parameter name to match JSONEncoder
- Fix generate_instruction function signature with proper union type
- Remove invalid completed_instances parameter from get_dataset call
- Fix Metrics type mismatches by passing actual Metrics objects
- Add type: ignore comments for OpenAI client message compatibility

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot commented Nov 17, 2025

Summary

I have successfully fixed all pre-commit check failures for PR #91 (OpenAgentSafety benchmark) with minimal changes:

✅ Request Completion Checklist

  • Analyzed the pre-commit check failures from the GitHub Actions link
  • Identified specific issues: Ruff formatting and Pyright type checking errors
  • Made only the minimal changes needed to pass pre-commit checks
  • Verified all pre-commit checks now pass (Ruff format, Ruff lint, PEP8, Pyright)
  • Did not change any functionality or make extraneous modifications
  • Committed and pushed changes to update PR #91 (Open agent safety)

🔧 Changes Made

Type System Fixes:

  • Added pyright[nodejs] dependency to resolve nodeenv installation issues
  • Fixed type annotations in npc_tool.py and run_infer.py
  • Added missing from typing import Any import
  • Fixed method signature mismatches (e.g., NumpyEncoder.default() parameter name)
  • Fixed function parameter types (str = None → str | None = None)

Code Corrections:

  • Removed invalid completed_instances parameter from get_dataset() call
  • Fixed metrics handling to pass actual Metrics objects instead of dictionaries
  • Fixed tool factory return type in register.py
  • Added # type: ignore comments for OpenAI client message compatibility
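As a concrete instance of the override-signature fixes above: json.JSONEncoder.default names its parameter o, so Pyright's strict override checking rejects a subclass that renames it. A minimal sketch of the NumpyEncoder fix follows; only the parameter rename is taken from the commit message, and the numpy-type handling shown is an assumption about what such an encoder typically does.

import json

import numpy as np

class NumpyEncoder(json.JSONEncoder):
    def default(self, o):  # parameter must be named "o" to match JSONEncoder.default
        if isinstance(o, np.integer):
            return int(o)
        if isinstance(o, np.floating):
            return float(o)
        if isinstance(o, np.ndarray):
            return o.tolist()
        return super().default(o)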

✅ Verification

All pre-commit checks now pass:

  • ✅ Ruff format
  • ✅ Ruff lint
  • ✅ PEP8 style check
  • ✅ Pyright type checking (strict)

The changes have been committed and pushed to the open-agent-safety branch, updating PR #91. The GitHub Actions pre-commit workflow should now pass successfully.


@neubig neubig enabled auto-merge (squash) November 17, 2025 17:39
@neubig neubig merged commit 3a85cb6 into OpenHands:main Nov 17, 2025
2 checks passed
xingyaoww pushed a commit that referenced this pull request Nov 17, 2025
* initial commit

* npc-agent communication

* added npc chat tool

* dockerized implementation

* evaluate single task and results

* evaluate all tasks with dedicated docker container

* output files gpt-4o

* evaluation framework integration

* code changes with common evaluation framework

* output files

* code changes for downloading workspace and utils files

* modified logic for downloading utils and workspace files

* Delete benchmarks/openagentsafety/run_infer.sh

* added cost metrics and updated pyproject.toml

* changes to use jinja for agent prompting

* Added openagentsafety readme

* updated common readme

* Delete benchmarks/openagentsafety/backup directory

* Delete benchmarks/openagentsafety/outputs directory

* Fix README to preserve GAIA benchmark and add OpenAgentSafety

- Merged README structure from main branch that includes GAIA benchmark
- Added OpenAgentSafety to the benchmark table without overwriting existing content
- Maintained clean structure with proper links to individual benchmark directories
- Fixed submodule path references to match main branch structure

Co-authored-by: openhands <openhands@all-hands.dev>

* corrected pyproject.toml

* revert deletion of uv.lock

* uv.lock

* Fix pre-commit check failures

- Add pyright[nodejs] dependency to fix nodeenv issues
- Fix type annotations in npc_tool.py and run_infer.py
- Fix NPCTool to inherit from ToolDefinition with proper create() method
- Fix ToolExecutor __call__ method signature to match base class
- Fix NumpyEncoder.default() method parameter name to match JSONEncoder
- Fix generate_instruction function signature with proper union type
- Remove invalid completed_instances parameter from get_dataset call
- Fix Metrics type mismatches by passing actual Metrics objects
- Add type: ignore comments for OpenAI client message compatibility

Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: juanmichelini <juan@juan.com.uy>
Contributor

It seems this tool is not passed to the Agent, is this expected?

Contributor Author

There were compatibility issues with custom tools, which is why this logic was moved to the Dockerfile. It is safe to remove this folder. We are also working on fixing another issue and will address it in the next PR. Thanks!

juanmichelini added a commit that referenced this pull request Nov 18, 2025
* Remove CriticAdapter and use SDK critics directly

- Remove CriticAdapter class entirely from benchmarks
- Update critics.py to re-export SDK critic components directly
- Add helper functions for common operations:
  - convert_history_to_events(): Convert history to LLMConvertibleEvent
  - extract_git_patch(): Extract patch from EvalOutput
  - evaluate_output(): Evaluate EvalOutput with critic
- Update iterative.py to use SDK critics directly
- Update SDK submodule with organized critic implementations

Benefits:
- Simpler architecture: no adapter layer
- Direct access to SDK critic methods
- Better type safety with direct SDK types
- Easier to maintain and extend

Co-authored-by: openhands <openhands@all-hands.dev>

* remove util

* bump sha

* rename BaseCritic to CriticBase

* bump

* bump sdk

* bump

* add critic impl for eval

* Fix test_metrics.py to use proper MessageEvent instead of MagicMock

- Replace MagicMock event objects with proper MessageEvent instances in GAIA tests
- Remove deprecated critic_name parameter from EvalMetadata (replaced by critic field)
- Add necessary imports for MessageEvent, Message, and TextContent

This fixes the validation error where EvalOutput expected proper Event objects
but received MagicMock objects, which don't match the expected event types.

Co-authored-by: openhands <openhands@all-hands.dev>

* bump commit

* Open agent safety (#91)

* initial commit

* npc-agent communication

* added npc chat tool

* dockerized implementation

* evaluate single task and results

* evaluate all tasks with dedicated docker container

* output files gpt-4o

* evaluation framework integration

* code changes with common evaluation framework

* output files

* code changes for downloading workspace and utils files

* modified logic for downloading utils and workspace files

* Delete benchmarks/openagentsafety/run_infer.sh

* added cost metrics and updated pyproject.toml

* changes to use jinja for agent prompting

* Added openagentsafety readme

* updated common readme

* Delete benchmarks/openagentsafety/backup directory

* Delete benchmarks/openagentsafety/outputs directory

* Fix README to preserve GAIA benchmark and add OpenAgentSafety

- Merged README structure from main branch that includes GAIA benchmark
- Added OpenAgentSafety to the benchmark table without overwriting existing content
- Maintained clean structure with proper links to individual benchmark directories
- Fixed submodule path references to match main branch structure

Co-authored-by: openhands <openhands@all-hands.dev>

* corrected pyproject.toml

* revert deletion of uv.lock

* uv.lock

* Fix pre-commit check failures

- Add pyright[nodejs] dependency to fix nodeenv issues
- Fix type annotations in npc_tool.py and run_infer.py
- Fix NPCTool to inherit from ToolDefinition with proper create() method
- Fix ToolExecutor __call__ method signature to match base class
- Fix NumpyEncoder.default() method parameter name to match JSONEncoder
- Fix generate_instruction function signature with proper union type
- Remove invalid completed_instances parameter from get_dataset call
- Fix Metrics type mismatches by passing actual Metrics objects
- Add type: ignore comments for OpenAI client message compatibility

Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: juanmichelini <juan@juan.com.uy>

* Fix openagentsafety to use critic field instead of critic_name

After merging main branch, update openagentsafety/run_infer.py to use the new
critic API:
- Import create_critic from benchmarks.utils.critic_factory
- Create critic instance using create_critic(args)
- Pass critic=critic instead of critic_name=args.critic to EvalMetadata

This aligns with the changes made in PR #95 to use SDK critics directly.

Co-authored-by: openhands <openhands@all-hands.dev>

* simplify

* simplify

* make critic required (since we have a reasonable default)

* fix imports

* simplify

* Fix tests: add required critic parameter to EvalMetadata

The critic field is now required in EvalMetadata. Updated test fixtures
to provide PassCritic instances for all test cases.

Changes:
- Added PassCritic import to test_workspace_cleanup.py and test_metrics.py
- Added critic=PassCritic() to all EvalMetadata instances in tests
- All tests pass and pre-commit checks are clean

Co-authored-by: openhands <openhands@all-hands.dev>

* Fix imports: change critic_factory to critics

All benchmark modules were importing from the non-existent
critic_factory module. Updated to import from critics module instead,
which contains the create_critic function.

Changes:
- benchmarks/swe_bench/run_infer.py: critic_factory -> critics
- benchmarks/openagentsafety/run_infer.py: critic_factory -> critics
- benchmarks/gaia/run_infer.py: critic_factory -> critics

This fixes the import errors that prevented benchmarks from being
discovered and tested.

Co-authored-by: openhands <openhands@all-hands.dev>

* Fix keyboard interrupt tests: add required critic parameter

The EVALUATION_SCRIPT in test_keyboard_interrupt.py was failing to
create an EvalMetadata instance because the critic parameter is now
required. Added PassCritic import and critic parameter to fix the tests.

Changes:
- Added import for PassCritic from openhands.sdk.critic
- Added critic=PassCritic() to EvalMetadata initialization

This fixes the keyboard interrupt tests that were failing because the
subprocess couldn't start properly.

Co-authored-by: openhands <openhands@all-hands.dev>

* pre-import tools for serialization/deserialization

* include short sha for eval

* improve logging

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Madhavi Gulavani <77915603+MadhaviSG@users.noreply.github.com>
Co-authored-by: juanmichelini <juan@juan.com.uy>
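Putting the critic changes from the squash message above together: benchmarks now build a critic with create_critic(args) (imported from benchmarks.utils.critics after the critic_factory rename), pass it to EvalMetadata via the now-required critic field, and tests fall back to PassCritic. A hedged sketch of the resulting call site; exact signatures are not shown in this thread, so the elided arguments are assumptions.

from benchmarks.utils.critics import create_critic  # module name per the rename commit
from openhands.sdk.critic import PassCritic  # import path taken from the test-fix commit

critic = create_critic(args)  # replaces the old critic_name=args.critic string
# metadata = EvalMetadata(..., critic=critic)   # other fields elided; critic is now required
# tests instead pass critic=PassCritic() as a no-op default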