
(feat) NeMo-Evaluator Integration #412

Open

e-dobrowolska wants to merge 2 commits into OpenHands:main from e-dobrowolska:nemo-evaluator

Conversation

@e-dobrowolska

Changes introduced in the PR:

1. Packaging Improvements

  • Modified pyproject.toml to add new CLI entrypoints and the nemo-evaluator dependency, and to include package data files.
  • Modified benchmarks/utils/version.py to gracefully handle SDK SHA extraction, falling back to "unknown" when git submodule information is unavailable (see the sketch after this list). This prevents build failures in packaging scenarios where git metadata isn't present.
  • Moved imports inside functions to avoid errors from missing git metadata or package-structure discovery when the package is installed via pip without cloning the full repository.
  • Added a fallback for loading data files (especially .jp2 images), so they can be located and loaded even when the package is installed via pip and the repository wasn't cloned.
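
A minimal sketch of the SHA-extraction fallback, assuming a helper along these lines in version.py; the function name and submodule path are illustrative, not the actual implementation:

```python
import subprocess


def get_sdk_sha(sdk_path: str = "vendor/software-agent-sdk") -> str:
    """Return the SDK submodule's git SHA, or 'unknown' if git metadata is missing."""
    try:
        return (
            subprocess.check_output(
                ["git", "rev-parse", "HEAD"],
                cwd=sdk_path,
                stderr=subprocess.DEVNULL,
            )
            .decode()
            .strip()
        )
    except (subprocess.CalledProcessError, OSError):
        # pip-installed package without the cloned repo: no git metadata available.
        return "unknown"
```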

2. Unified Benchmark Execution

  • Created benchmarks/scripts/run_benchmark.py as a single entrypoint to run both inference and evaluation for any benchmark in the repository.
  • Created benchmarks/scripts/generate_llm_config.py script to create LLM config JSON files from command-line arguments.
  • Made sure all benchmark scripts load LLM configs in a unified way, with an option to set API keys through environment variables (see the sketch after this list).
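
An illustrative sketch of generating and loading an LLM config JSON; the argument names, config schema, and LLM_API_KEY variable are assumptions, not the exact CLI:

```python
import argparse
import json
import os
from pathlib import Path


def generate_llm_config() -> None:
    """Write an LLM config JSON file from command-line arguments."""
    parser = argparse.ArgumentParser(description="Generate an LLM config JSON file.")
    parser.add_argument("--model", required=True)
    parser.add_argument("--base-url", default=None)
    parser.add_argument("--api-key-env", default="LLM_API_KEY",
                        help="Environment variable holding the API key.")
    parser.add_argument("--output", default="llm_config.json")
    args = parser.parse_args()

    config = {
        "model": args.model,
        "base_url": args.base_url,
        # Store the env var name rather than the secret itself.
        "api_key_env": args.api_key_env,
    }
    Path(args.output).write_text(json.dumps(config, indent=2))


def load_llm_config(path: str) -> dict:
    """Load a config and resolve the API key from the environment."""
    config = json.loads(Path(path).read_text())
    config["api_key"] = os.environ.get(config.get("api_key_env", "LLM_API_KEY"), "")
    return config
```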

3. Docker Image Handling

  • Added IMAGE_TAG_PREFIX and EVAL_AGENT_IMAGE environment variable support to enable custom image naming and use of pre-built images from custom registries.
  • All benchmark harnesses now have an option to check for image availability locally, and to use pre-built images.
  • Created benchmarks/utils/image_utils.py with image_exists() to check for pre-built images locally.
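
A rough equivalent of the image_exists() check, shelling out to the Docker CLI; the real helper may differ in details:

```python
import subprocess


def image_exists(image: str) -> bool:
    """Return True if the Docker image is already present locally."""
    result = subprocess.run(
        ["docker", "image", "inspect", image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```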

4. New Evaluation Parameters

  • Added --conversation-timeout parameter to all benchmark inference scripts to avoid timeout errors on long evaluation runs.
  • Added --skip-failed-samples flag to control whether evaluation continues past individual sample failures or stops immediately on the first error.
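
A sketch of how the two new parameters might be wired into an inference script's argument parser; the default timeout value is illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--conversation-timeout",
    type=int,
    default=3600,  # seconds; the actual default may differ
    help="Maximum time allowed for a single agent conversation.",
)
parser.add_argument(
    "--skip-failed-samples",
    action="store_true",
    help="Continue past individual sample failures instead of stopping on the first error.",
)
args = parser.parse_args()
```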

5. UV Installation Handling

  • If execution with uv fails due to it not being installed, the system will now gracefully retry using the standard Python interpreter.
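
An illustrative version of the retry logic: prefer uv, and fall back to the current interpreter only when the uv binary is missing:

```python
import subprocess
import sys


def run_script(script: str, *args: str) -> int:
    try:
        # Preferred path: run the script through uv.
        return subprocess.run(["uv", "run", script, *args], check=True).returncode
    except FileNotFoundError:
        # uv is not installed; retry with the standard Python interpreter.
        return subprocess.run([sys.executable, script, *args], check=True).returncode
```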

6. Report Format Standardization

  • Standardized output report format across all benchmarks with consistent JSON structure including benchmark_name, metadata, results, and metrics fields.
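
The shape of the standardized report, shown as a Python dict; only the four top-level keys come from this PR, the nested field names and values are placeholders:

```python
report = {
    "benchmark_name": "gaia",
    "metadata": {"model": "...", "run_id": "..."},
    "results": [
        {"instance_id": "...", "success": True},
    ],
    "metrics": {"accuracy": 0.0},
}
```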

7. Per-Benchmark Changes & Bugfixes

  • GAIA Critic Configuration: Changed GAIA to use "pass" critic instead of "finish_with_patch" since it's a question-answering benchmark, not code generation.
  • OpenAgentSafety Pandas NaN Handling: Fixed syntax/logic errors in pd.isna() checks to prevent crashes.
  • Error Raising: Updated error handling to ensure all caught exceptions are now re-raised after logging, preventing silent failures and making issues visible during execution.
  • MultiSWEBench Docker Builds: Fixed a bug in benchmarks.multiswebench.build_images where the image-build step tried to load/prepare the HF dataset and crashed with TypeError: Couldn't cast array of type ... to ... (which then surfaced as datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset).
  • MultiSWEBench Dataset Name Bug: Replaced dataset_path.startswith("ByteDance-Seed/Multi-SWE-bench") with "multi-swe-bench" in dataset_path.lower() -- otherwise MultiSWEBench's run_infer.py would fail to detect that the bytedance-research/Multi-SWE-Bench dataset needs to be downloaded.
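
The dataset-name check before and after, per the description above:

```python
# Before: missed the bytedance-research/Multi-SWE-Bench variant.
needs_download = dataset_path.startswith("ByteDance-Seed/Multi-SWE-bench")

# After: case-insensitive substring match covers both dataset names.
needs_download = "multi-swe-bench" in dataset_path.lower()
```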

8. NeMo Evaluator Integration

NeMo Evaluator requires a specific directory structure and configuration files to recognize and run benchmark frameworks.

What changed:

  • Created nemo_evaluator/ directory with the following structure:
    • nemo_evaluator/openhands_benchmarks/__init__.py
    • nemo_evaluator/openhands_benchmarks/framework.yml
    • nemo_evaluator/openhands_benchmarks/output.py

How it works now:

  • The NeMo Evaluator framework can now properly discover and load OpenHands benchmarks as a registered evaluation framework
  • The framework.yml defines the benchmark configuration for NeMo Evaluator
  • The output.py module handles output formatting compatible with NeMo Evaluator's reporting system
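
A hypothetical sketch of the kind of adaptation output.py performs; the real module's function names and NeMo Evaluator's expected schema are not reproduced here:

```python
from typing import Any


def to_nemo_metrics(report: dict[str, Any]) -> dict[str, float]:
    """Flatten the standardized OpenHands report into a name -> value mapping."""
    metrics: dict[str, float] = {}
    benchmark = report.get("benchmark_name", "openhands")
    for name, value in report.get("metrics", {}).items():
        if isinstance(value, (int, float)):
            metrics[f"{benchmark}/{name}"] = float(value)
    return metrics
```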
