Conversation
- Merged README structure from main branch that includes GAIA benchmark
- Added OpenAgentSafety to the benchmark table without overwriting existing content
- Maintained clean structure with proper links to individual benchmark directories
- Fixed submodule path references to match main branch structure

Co-authored-by: openhands <openhands@all-hands.dev>
| "toml", | ||
| "tqdm", | ||
| "openhands-sdk", | ||
| "openhands-tools", |
No, I think this file got messed up while resolving conflicts. Let me take a look.
pyproject.toml (Outdated)
```
]
=======
lines-after-imports = 2
>>>>>>> 0ee2f25 (code changes with common evaluation framework)
```
Seems like a merge conflict here.
Thanks for catching that, will fix it.
@OpenHands can you look at the pre-commit checks and make the minimal changes needed to pass them? Do not change anything else.
I'm on it! juanmichelini can track my progress at all-hands.dev
- Add pyright[nodejs] dependency to fix nodeenv issues
- Fix type annotations in npc_tool.py and run_infer.py
- Fix NPCTool to inherit from ToolDefinition with proper create() method
- Fix ToolExecutor __call__ method signature to match base class
- Fix NumpyEncoder.default() method parameter name to match JSONEncoder
- Fix generate_instruction function signature with proper union type
- Remove invalid completed_instances parameter from get_dataset call
- Fix Metrics type mismatches by passing actual Metrics objects
- Add type: ignore comments for OpenAI client message compatibility

Co-authored-by: openhands <openhands@all-hands.dev>
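One of the fixes above renames the NumpyEncoder.default() parameter to match json.JSONEncoder, which is what strict type checkers such as pyright expect for overrides. A minimal sketch of that pattern, assuming the encoder's job is to serialize numpy values (the numpy handling itself is an assumption; only the signature fix comes from the commit message):

```python
import json

import numpy as np


class NumpyEncoder(json.JSONEncoder):
    """JSON encoder that handles numpy scalars and arrays."""

    # json.JSONEncoder.default names its parameter "o", so the override must
    # use the same name to satisfy the type checker's override rules.
    def default(self, o):
        if isinstance(o, np.integer):
            return int(o)
        if isinstance(o, np.floating):
            return float(o)
        if isinstance(o, np.ndarray):
            return o.tolist()
        return super().default(o)


print(json.dumps({"score": np.float64(0.5)}, cls=NumpyEncoder))  # {"score": 0.5}
```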
Summary

I have successfully fixed all pre-commit check failures for PR #91 (OpenAgentSafety benchmark) with minimal changes:

✅ Request Completion Checklist
🔧 Changes Made

Type System Fixes:
Code Corrections:
✅ Verification

All pre-commit checks now pass:
The changes have been committed and pushed to the PR branch.
* initial commit
* npc-agent communication
* added npc chat tool
* dockerized implementation
* evaluate single task and results
* evaluate all tasks with dedicated docker container
* output files gpt-4o
* evaluation framework integration
* code changes with common evaluation framework
* output files
* code changes for downloading workspace and utils files
* modified logic for sownloading utils and workspace files
* Delete benchmarks/openagentsafety/run_infer.sh
* added cost metrics and updated pyproject.toml
* changes to use jinja for agent prompting
* Added openagentsafety readme
* updated common readme
* Delete benchmarks/openagentsafety/backup directory
* Delete benchmarks/openagentsafety/outputs directory
* Fix README to preserve GAIA benchmark and add OpenAgentSafety
  - Merged README structure from main branch that includes GAIA benchmark
  - Added OpenAgentSafety to the benchmark table without overwriting existing content
  - Maintained clean structure with proper links to individual benchmark directories
  - Fixed submodule path references to match main branch structure
  Co-authored-by: openhands <openhands@all-hands.dev>
* corrected pyproject.toml
* revert deletion of uv.lock
* uv.lock
* Fix pre-commit check failures
  - Add pyright[nodejs] dependency to fix nodeenv issues
  - Fix type annotations in npc_tool.py and run_infer.py
  - Fix NPCTool to inherit from ToolDefinition with proper create() method
  - Fix ToolExecutor __call__ method signature to match base class
  - Fix NumpyEncoder.default() method parameter name to match JSONEncoder
  - Fix generate_instruction function signature with proper union type
  - Remove invalid completed_instances parameter from get_dataset call
  - Fix Metrics type mismatches by passing actual Metrics objects
  - Add type: ignore comments for OpenAI client message compatibility
  Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: juanmichelini <juan@juan.com.uy>
It seems this tool is not passed to the Agent. Is this expected?
There were compatibility issues with custom tools, which is why this logic was moved to the Dockerfile. It is safe to remove this folder. We are also working on fixing another issue and will do that as part of the next PR. Thanks!
* Remove CriticAdapter and use SDK critics directly
  - Remove CriticAdapter class entirely from benchmarks
  - Update critics.py to re-export SDK critic components directly
  - Add helper functions for common operations:
    - convert_history_to_events(): Convert history to LLMConvertibleEvent
    - extract_git_patch(): Extract patch from EvalOutput
    - evaluate_output(): Evaluate EvalOutput with critic
  - Update iterative.py to use SDK critics directly
  - Update SDK submodule with organized critic implementations
  Benefits:
  - Simpler architecture: no adapter layer
  - Direct access to SDK critic methods
  - Better type safety with direct SDK types
  - Easier to maintain and extend
  Co-authored-by: openhands <openhands@all-hands.dev>
* remove util
* bump sha
* rename BaseCritic to CriticBase
* bump
* bump sdk
* bump
* add critic impl for eval
* Fix test_metrics.py to use proper MessageEvent instead of MagicMock
  - Replace MagicMock event objects with proper MessageEvent instances in GAIA tests
  - Remove deprecated critic_name parameter from EvalMetadata (replaced by critic field)
  - Add necessary imports for MessageEvent, Message, and TextContent
  This fixes the validation error where EvalOutput expected proper Event objects but received MagicMock objects, which don't match the expected event types.
  Co-authored-by: openhands <openhands@all-hands.dev>
* bump commit
* Open agent safety (#91) (squashed commits from PR #91, as listed above)
* Fix openagentsafety to use critic field instead of critic_name
  After merging main branch, update openagentsafety/run_infer.py to use the new critic API:
  - Import create_critic from benchmarks.utils.critic_factory
  - Create critic instance using create_critic(args)
  - Pass critic=critic instead of critic_name=args.critic to EvalMetadata
  This aligns with the changes made in PR #95 to use SDK critics directly.
  Co-authored-by: openhands <openhands@all-hands.dev>
* simplify
* simplify
* make critic required (since we have a reasonable default)
* fix imports
* simplify
* Fix tests: add required critic parameter to EvalMetadata
  The critic field is now required in EvalMetadata. Updated test fixtures to provide PassCritic instances for all test cases.
  Changes:
  - Added PassCritic import to test_workspace_cleanup.py and test_metrics.py
  - Added critic=PassCritic() to all EvalMetadata instances in tests
  - All tests pass and pre-commit checks are clean
  Co-authored-by: openhands <openhands@all-hands.dev>
* Fix imports: change critic_factory to critics
  All benchmark modules were importing from the non-existent critic_factory module. Updated to import from critics module instead, which contains the create_critic function.
  Changes:
  - benchmarks/swe_bench/run_infer.py: critic_factory -> critics
  - benchmarks/openagentsafety/run_infer.py: critic_factory -> critics
  - benchmarks/gaia/run_infer.py: critic_factory -> critics
  This fixes the import errors that prevented benchmarks from being discovered and tested.
  Co-authored-by: openhands <openhands@all-hands.dev>
* Fix keyboard interrupt tests: add required critic parameter
  The EVALUATION_SCRIPT in test_keyboard_interrupt.py was failing to create an EvalMetadata instance because the critic parameter is now required. Added PassCritic import and critic parameter to fix the tests.
  Changes:
  - Added import for PassCritic from openhands.sdk.critic
  - Added critic=PassCritic() to EvalMetadata initialization
  This fixes the keyboard interrupt tests that were failing because the subprocess couldn't start properly.
  Co-authored-by: openhands <openhands@all-hands.dev>
* pre-import tools for serialization/deserialization
* include short sha for eval
* improve logging

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Madhavi Gulavani <77915603+MadhaviSG@users.noreply.github.com>
Co-authored-by: juanmichelini <juan@juan.com.uy>
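The commits above describe moving from a critic_name string to a required critic object on EvalMetadata. A minimal sketch of that shape, using simplified stand-ins rather than the real SDK types (PassCritic and EvalMetadata below are illustrative stubs, not the actual openhands.sdk classes):

```python
from dataclasses import dataclass


class PassCritic:
    """Stand-in for a critic that accepts every output."""

    def evaluate(self, output: object) -> bool:
        return True


@dataclass
class EvalMetadata:
    """Illustrative stub: the critic is now a required object, not a name string."""

    benchmark: str
    critic: PassCritic


# Old style (removed): EvalMetadata(benchmark="gaia", critic_name="pass")
# New style: construct a critic (e.g. via create_critic(args)) and pass the instance.
metadata = EvalMetadata(benchmark="gaia", critic=PassCritic())
print(type(metadata.critic).__name__)  # PassCritic
```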
Overview
Adds the OpenAgentSafety benchmark for evaluating AI agent safety in workplace scenarios with NPC (Non-Player Character) interactions.
Key Changes
OpenAgentSafety Benchmark
- `chat_with_npc` tool for realistic workplace conversations

Usage
Output Files
Evaluation generates output files for each task:
- `eval_*.json` - Final evaluation results and scores
- `traj_*.json` - Complete agent trajectory and actions taken
- `state_*.json` - Final environment state after task completion
- `error_*.json` - Error logs if a task fails

Plus summary files:
- `output.jsonl` - Consolidated results across all tasks (see the tally sketch below)
- `output_critic.jsonl` - Critic evaluation results (if enabled)
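As a hedged illustration of working with the consolidated file (the record fields and output path are assumptions; only the file name comes from the list above), a small script could tally results from `output.jsonl` like this:

```python
import json
from pathlib import Path

# Hypothetical results location; adjust to wherever run_infer writes its outputs.
results_file = Path("outputs/openagentsafety/output.jsonl")

total = 0
errors = 0
with results_file.open() as f:
    for line in f:
        record = json.loads(line)  # one JSON object per task
        total += 1
        # "error" is an assumed field name, used purely for illustration.
        if record.get("error"):
            errors += 1

print(f"{total} tasks evaluated, {errors} with errors")
```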
Setup Requirements

- LLM configuration in the `.llm_config/` directory

Files Added
- `benchmarks/openagentsafety/` - Complete benchmark implementation
- `benchmarks/openagentsafety/custom_tools/npc_tool.py` - NPC communication
- `benchmarks/openagentsafety/prompts/default.j2` - Jinja2 templates (see the rendering sketch below)
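For illustration only (the template variable shown is hypothetical and not taken from default.j2), prompts built from a Jinja2 template are typically rendered along these lines:

```python
from jinja2 import Environment, FileSystemLoader

# Load templates from the benchmark's prompts directory.
env = Environment(loader=FileSystemLoader("benchmarks/openagentsafety/prompts"))
template = env.get_template("default.j2")

# "task_description" is an assumed variable name used only for this sketch.
instruction = template.render(task_description="Review the expense report in /workspace")
print(instruction)
```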