(feat) NeMo-Evaluator Integration #412
Open
e-dobrowolska wants to merge 2 commits into OpenHands:main from
Changes introduced in the PR:
1. Packaging Improvements
- Updated `pyproject.toml` to add new CLI entrypoints and the `nemo-evaluator` dependency, and to include package files.
- Updated `benchmarks/utils/version.py` to gracefully handle SDK SHA extraction, falling back to `"unknown"` when git submodule information is unavailable. This prevents build failures in packaging scenarios where git metadata isn't present.
- Included package data files (e.g. `.jp2` images) so they can be found and loaded even when the package is installed with pip and the repo wasn't cloned.
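The fallback behavior in `benchmarks/utils/version.py` amounts to something like the following sketch (the function name and exact git invocation are assumptions, not the actual code):

```python
import subprocess


def get_sdk_sha(repo_dir: str = ".") -> str:
    """Best-effort SHA lookup; returns "unknown" when git metadata is absent."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Packaged installs (pip sdist/wheel) ship without a .git directory,
        # so degrade gracefully instead of failing the build.
        return "unknown"
```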
2. Unified Benchmark Execution
- Added `benchmarks/scripts/run_benchmark.py` as a single entrypoint to run both inference and evaluation for any benchmark in the repository.
- Added a `benchmarks/scripts/generate_llm_config.py` script to create LLM config JSON files from command-line arguments.
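A minimal sketch of what a generator like `generate_llm_config.py` does; the flag names and config keys below are illustrative assumptions, not the script's actual interface:

```python
import argparse
import json


def main() -> None:
    parser = argparse.ArgumentParser(description="Write an LLM config JSON file.")
    parser.add_argument("--model", required=True)
    parser.add_argument("--base-url", default=None)
    parser.add_argument("--api-key", default=None)
    parser.add_argument("--output", default="llm_config.json")
    args = parser.parse_args()

    config = {
        "model": args.model,
        "base_url": args.base_url,
        "api_key": args.api_key,
    }
    with open(args.output, "w") as f:
        json.dump(config, f, indent=2)


if __name__ == "__main__":
    main()
```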
3. Docker Image Handling
- Added `IMAGE_TAG_PREFIX` and `EVAL_AGENT_IMAGE` environment variable support to enable custom image naming and the use of pre-built images from custom registries.
- Added `benchmarks/utils/image_utils.py` with `image_exists()` to check for pre-built images locally.
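One plausible shape for `image_exists()`, assuming it shells out to the Docker CLI (the real implementation in `benchmarks/utils/image_utils.py` may use a Docker client library instead):

```python
import subprocess


def image_exists(image: str) -> bool:
    """Return True if the Docker image is already present locally."""
    result = subprocess.run(
        ["docker", "image", "inspect", image],
        capture_output=True,  # suppress inspect output; only the exit code matters
    )
    return result.returncode == 0
```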
4. New Evaluation Parameters
- Added a `--conversation-timeout` parameter to all benchmark inference scripts to avoid timeout errors on long evaluation runs.
- Added a `--skip-failed-samples` flag to control whether evaluation continues past individual sample failures or stops immediately on the first error.
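Illustrative wiring for the two new options; the default timeout value is an assumption:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--conversation-timeout",
    type=int,
    default=3600,  # assumed default, in seconds
    help="Per-conversation timeout, to avoid failures on long evaluation runs.",
)
parser.add_argument(
    "--skip-failed-samples",
    action="store_true",
    help="Continue past individual sample failures instead of stopping on the first error.",
)
```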
5. UV Installation Handling
- When `uv` fails because it is not installed, the system now gracefully retries using the standard Python interpreter.
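The retry amounts to something like this sketch (the helper name and command layout are assumptions):

```python
import subprocess
import sys


def run_script(script: str, *args: str) -> int:
    """Prefer `uv run`; fall back to the current interpreter if uv is missing."""
    try:
        return subprocess.run(["uv", "run", script, *args]).returncode
    except FileNotFoundError:
        # uv is not installed; retry with the standard Python interpreter.
        return subprocess.run([sys.executable, script, *args]).returncode
```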
6. Report Format Standardization
- All benchmark reports now share a standard format with `benchmark_name`, `metadata`, `results`, and `metrics` fields.
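For illustration, a report in the standardized format might look like this; only the four top-level keys come from the PR, the values are hypothetical:

```python
report = {
    "benchmark_name": "swe_bench",  # hypothetical value
    "metadata": {"model": "my-model", "run_id": "2024-01-01"},
    "results": [{"instance_id": "astropy__astropy-12907", "resolved": True}],
    "metrics": {"resolve_rate": 0.42},
}
```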
"pass"critic instead of"finish_with_patch"since it's a question-answering benchmark, not code generation.pd.isna()checks to prevent crashes.benchmarks.multiswebench.build_imageswhere building images tried to load/prepare the HF dataset and crashed withTypeError: Couldn't cast array of type ... to ...(which then resulted indatasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset).dataset_path.startswith("ByteDance-Seed/Multi-SWE-bench"with"multi-swe-bench" in dataset_path.lower()-- otherwise MultiSWEBench'srun_infer.pywould fail to detect that thebytedance-research/Multi-SWE-Benchdataset needs to be downloaded.8. NeMo Evaluator Integration
8. NeMo Evaluator Integration
NeMo Evaluator requires a specific directory structure and configuration files to recognize and run benchmark frameworks.
What changed:
Added a `nemo_evaluator/` directory with the following structure:
- `nemo_evaluator/openhands_benchmarks/__init__.py`
- `nemo_evaluator/openhands_benchmarks/framework.yml`
- `nemo_evaluator/openhands_benchmarks/output.py`

How it works now:
- `framework.yml` defines the benchmark configuration for NeMo Evaluator.
- The `output.py` module handles output formatting compatible with NeMo Evaluator's reporting system (see the sketch below).
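A rough sketch of what such an output adapter can look like; the `parse_output` name, its signature, and the returned shape are assumptions for illustration, not NeMo Evaluator's documented interface:

```python
import json
from pathlib import Path


def parse_output(output_dir: str) -> dict:
    """Map a benchmark report onto a flat result dict.

    Hypothetical adapter: the real output.py must match whatever hook
    nemo-evaluator's framework.yml declares.
    """
    report = json.loads((Path(output_dir) / "report.json").read_text())
    return {
        "benchmark": report["benchmark_name"],
        "metrics": report["metrics"],
    }
```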