
(feat) NeMo-Evaluator Integration #412

Open

e-dobrowolska wants to merge 2 commits into OpenHands:main from e-dobrowolska:nemo-evaluator

Conversation

@e-dobrowolska

Changes introduced in the PR:

1. Packaging Improvements

  • Modified pyproject.toml to add new CLI entrypoints and the nemo-evaluator dependency, and to include package data files.
  • Modified benchmarks/utils/version.py to gracefully handle SDK SHA extraction, falling back to "unknown" when git submodule information is unavailable (see the sketch after this list). This prevents build failures in packaging scenarios where git metadata isn't present.
  • Moved imports inside functions to avoid errors from missing git metadata or package-structure discovery when the package is installed via pip without cloning the full repository.
  • Added a fallback for loading data files (especially .jp2 images), so they can be located and loaded even when the package is installed via pip and the repository wasn't cloned.
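
A minimal sketch of the SHA-extraction fallback, assuming a helper along these lines in version.py; the function name and submodule path are illustrative, not the actual implementation:

```python
import subprocess


def get_sdk_sha(sdk_path: str = "vendor/software-agent-sdk") -> str:
    """Return the SDK submodule's git SHA, or 'unknown' if git metadata is missing."""
    try:
        return (
            subprocess.check_output(
                ["git", "rev-parse", "HEAD"],
                cwd=sdk_path,
                stderr=subprocess.DEVNULL,
            )
            .decode()
            .strip()
        )
    except (subprocess.CalledProcessError, OSError):
        # pip-installed package without the cloned repo: no git metadata available.
        return "unknown"
```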

2. Unified Benchmark Execution

  • Created benchmarks/scripts/run_benchmark.py as a single entrypoint to run both inference and evaluation for any benchmark in the repository.
  • Created benchmarks/scripts/generate_llm_config.py script to create LLM config JSON files from command-line arguments.
  • Made sure all benchmark scripts load LLM configs in a unified way, with an option to set API keys through environment variables (see the sketch after this list).
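
An illustrative sketch of generating and loading an LLM config JSON; the argument names, config schema, and LLM_API_KEY variable are assumptions, not the exact CLI:

```python
import argparse
import json
import os
from pathlib import Path


def generate_llm_config() -> None:
    """Write an LLM config JSON file from command-line arguments."""
    parser = argparse.ArgumentParser(description="Generate an LLM config JSON file.")
    parser.add_argument("--model", required=True)
    parser.add_argument("--base-url", default=None)
    parser.add_argument("--api-key-env", default="LLM_API_KEY",
                        help="Environment variable holding the API key.")
    parser.add_argument("--output", default="llm_config.json")
    args = parser.parse_args()

    config = {
        "model": args.model,
        "base_url": args.base_url,
        # Store the env var name rather than the secret itself.
        "api_key_env": args.api_key_env,
    }
    Path(args.output).write_text(json.dumps(config, indent=2))


def load_llm_config(path: str) -> dict:
    """Load a config and resolve the API key from the environment."""
    config = json.loads(Path(path).read_text())
    config["api_key"] = os.environ.get(config.get("api_key_env", "LLM_API_KEY"), "")
    return config
```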

3. Docker Image Handling

  • Added IMAGE_TAG_PREFIX and EVAL_AGENT_IMAGE environment variable support to enable custom image naming and use of pre-built images from custom registries.
  • All benchmark harnesses now have an option to check for image availability locally, and to use pre-built images.
  • Created benchmarks/utils/image_utils.py with image_exists() to check for pre-built images locally.
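
A rough equivalent of the image_exists() check, shelling out to the Docker CLI; the real helper may differ in details:

```python
import subprocess


def image_exists(image: str) -> bool:
    """Return True if the Docker image is already present locally."""
    result = subprocess.run(
        ["docker", "image", "inspect", image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```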

4. New Evaluation Parameters

  • Added --conversation-timeout parameter to all benchmark inference scripts to avoid timeout errors on long evaluation runs.
  • Added --skip-failed-samples flag to control whether evaluation continues past individual sample failures or stops immediately on the first error.
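
A sketch of how the two new parameters might be wired into an inference script's argument parser; the default timeout value is illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--conversation-timeout",
    type=int,
    default=3600,  # seconds; the actual default may differ
    help="Maximum time allowed for a single agent conversation.",
)
parser.add_argument(
    "--skip-failed-samples",
    action="store_true",
    help="Continue past individual sample failures instead of stopping on the first error.",
)
args = parser.parse_args()
```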

5. UV Installation Handling

  • If execution with uv fails due to it not being installed, the system will now gracefully retry using the standard Python interpreter.
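
An illustrative version of the retry logic: prefer uv, and fall back to the current interpreter only when the uv binary is missing:

```python
import subprocess
import sys


def run_script(script: str, *args: str) -> int:
    try:
        # Preferred path: run the script through uv.
        return subprocess.run(["uv", "run", script, *args], check=True).returncode
    except FileNotFoundError:
        # uv is not installed; retry with the standard Python interpreter.
        return subprocess.run([sys.executable, script, *args], check=True).returncode
```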

6. Report Format Standardization

  • Standardized output report format across all benchmarks with consistent JSON structure including benchmark_name, metadata, results, and metrics fields.
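
The shape of the standardized report, shown as a Python dict; only the four top-level keys come from this PR, the nested field names and values are placeholders:

```python
report = {
    "benchmark_name": "gaia",
    "metadata": {"model": "...", "run_id": "..."},
    "results": [
        {"instance_id": "...", "success": True},
    ],
    "metrics": {"accuracy": 0.0},
}
```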

7. Per-Benchmark Changes & Bugfixes

  • GAIA Critic Configuration: Changed GAIA to use "pass" critic instead of "finish_with_patch" since it's a question-answering benchmark, not code generation.
  • OpenAgentSafety Pandas NaN Handling: Fixed syntax/logic errors in pd.isna() checks to prevent crashes.
  • Error Raising: Updated error handling to ensure all caught exceptions are now re-raised after logging, preventing silent failures and making issues visible during execution.
  • MultiSWEBench Docker Builds: Fixed a bug in benchmarks.multiswebench.build_images where the image-build step tried to load/prepare the HF dataset and crashed with TypeError: Couldn't cast array of type ... to ... (which then surfaced as datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset).
  • MultiSWEBench Dataset Name Bug: Replaced dataset_path.startswith("ByteDance-Seed/Multi-SWE-bench") with "multi-swe-bench" in dataset_path.lower() -- otherwise MultiSWEBench's run_infer.py would fail to detect that the bytedance-research/Multi-SWE-Bench dataset needs to be downloaded.
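
The dataset-name check before and after, per the description above:

```python
# Before: missed the bytedance-research/Multi-SWE-Bench variant.
needs_download = dataset_path.startswith("ByteDance-Seed/Multi-SWE-bench")

# After: case-insensitive substring match covers both dataset names.
needs_download = "multi-swe-bench" in dataset_path.lower()
```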

8. NeMo Evaluator Integration

NeMo Evaluator requires a specific directory structure and configuration files to recognize and run benchmark frameworks.

What changed:

  • Created nemo_evaluator/ directory with the following structure:
    • nemo_evaluator/openhands_benchmarks/__init__.py
    • nemo_evaluator/openhands_benchmarks/framework.yml
    • nemo_evaluator/openhands_benchmarks/output.py

How it works now:

  • The NeMo Evaluator framework can now properly discover and load OpenHands benchmarks as a registered evaluation framework
  • The framework.yml defines the benchmark configuration for NeMo Evaluator
  • The output.py module handles output formatting compatible with NeMo Evaluator's reporting system
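
A hypothetical sketch of the kind of adaptation output.py performs; the real module's function names and NeMo Evaluator's expected schema are not reproduced here:

```python
from typing import Any


def to_nemo_metrics(report: dict[str, Any]) -> dict[str, float]:
    """Flatten the standardized OpenHands report into a name -> value mapping."""
    metrics: dict[str, float] = {}
    benchmark = report.get("benchmark_name", "openhands")
    for name, value in report.get("metrics", {}).items():
        if isinstance(value, (int, float)):
            metrics[f"{benchmark}/{name}"] = float(value)
    return metrics
```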
