Objective
Add an agentv import huggingface CLI command to import datasets from HuggingFace Hub directly into agentv eval format.
Motivation
SWE-bench and similar benchmarks publish their datasets on HuggingFace (e.g., SWE-bench/SWE-bench_Verified, princeton-nlp/SWE-bench). Currently users must manually convert these to agentv YAML/JSONL format. A built-in importer would remove this friction and make agentv a first-class tool for benchmark evaluation.
Design
This should be a CLI wrapper (not a core feature), following the "Lightweight Core, Plugin Extensibility" principle.
Proposed interface
agentv import huggingface --repo SWE-bench/SWE-bench_Verified --split test --limit 10 --output evals/swebench/
Mapping (SWE-bench → agentv)
| SWE-bench Field |
AgentV Field |
instance_id |
tests[].id |
problem_statement |
tests[].input[0].content |
repo + base_commit |
workspace.docker.image (convention-based) |
FAIL_TO_PASS |
tests[].assertions[].command (code-grader) |
difficulty |
tests[].metadata.difficulty |
Implementation approach
- Python script using
uv run (per repo convention for Python scripts)
- Uses
datasets library to load from HuggingFace
- Outputs
.EVAL.yaml files with Docker workspace configs
- Template-based: support different dataset schemas via mapping configs
Acceptance Criteria
Non-goals
- Not adding HuggingFace as a core dependency
- Not supporting all possible HuggingFace dataset formats (start with SWE-bench)
Objective
Add an
agentv import huggingfaceCLI command to import datasets from HuggingFace Hub directly into agentv eval format.Motivation
SWE-bench and similar benchmarks publish their datasets on HuggingFace (e.g.,
SWE-bench/SWE-bench_Verified,princeton-nlp/SWE-bench). Currently users must manually convert these to agentv YAML/JSONL format. A built-in importer would remove this friction and make agentv a first-class tool for benchmark evaluation.Design
This should be a CLI wrapper (not a core feature), following the "Lightweight Core, Plugin Extensibility" principle.
Proposed interface
agentv import huggingface --repo SWE-bench/SWE-bench_Verified --split test --limit 10 --output evals/swebench/Mapping (SWE-bench → agentv)
instance_idtests[].idproblem_statementtests[].input[0].contentrepo+base_commitworkspace.docker.image(convention-based)FAIL_TO_PASStests[].assertions[].command(code-grader)difficultytests[].metadata.difficultyImplementation approach
uv run(per repo convention for Python scripts)datasetslibrary to load from HuggingFace.EVAL.yamlfiles with Docker workspace configsAcceptance Criteria
agentv import huggingface --repo <name>produces valid eval YAML files--limit,--split,--outputflagsagentv validateNon-goals