feat: HuggingFace dataset import command

## Objective

Add an `agentv import huggingface` CLI command to import datasets from HuggingFace Hub directly into agentv eval format.

## Motivation

SWE-bench and similar benchmarks publish their datasets on HuggingFace (e.g., `SWE-bench/SWE-bench_Verified`, `princeton-nlp/SWE-bench`). Currently users must manually convert these to agentv YAML/JSONL format. A built-in importer would remove this friction and make agentv a first-class tool for benchmark evaluation.

## Design

This should be a **CLI wrapper** (not a core feature), following the "Lightweight Core, Plugin Extensibility" principle.

### Proposed interface
```bash
agentv import huggingface --repo SWE-bench/SWE-bench_Verified --split test --limit 10 --output evals/swebench/
```

### Mapping (SWE-bench → agentv)
| SWE-bench Field | AgentV Field |
|-----------------|-------------|
| `instance_id` | `tests[].id` |
| `problem_statement` | `tests[].input[0].content` |
| `repo` + `base_commit` | `workspace.docker.image` (convention-based) |
| `FAIL_TO_PASS` | `tests[].assertions[].command` (code-grader) |
| `difficulty` | `tests[].metadata.difficulty` |

### Implementation approach
- Python script using `uv run` (per repo convention for Python scripts)
- Uses `datasets` library to load from HuggingFace
- Outputs `.EVAL.yaml` files with Docker workspace configs
- Template-based: support different dataset schemas via mapping configs

## Acceptance Criteria

- [ ] `agentv import huggingface --repo <name>` produces valid eval YAML files
- [ ] Works with SWE-bench Verified as the primary test case
- [ ] Supports `--limit`, `--split`, `--output` flags
- [ ] Generated evals pass `agentv validate`
- [ ] Documentation updated

## Non-goals
- Not adding HuggingFace as a core dependency
- Not supporting all possible HuggingFace dataset formats (start with SWE-bench)

SWE-bench Field	AgentV Field
`instance_id`	`tests[].id`
`problem_statement`	`tests[].input[0].content`
`repo` + `base_commit`	`workspace.docker.image` (convention-based)
`FAIL_TO_PASS`	`tests[].assertions[].command` (code-grader)
`difficulty`	`tests[].metadata.difficulty`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: HuggingFace dataset import command #978

Objective

Motivation

Design

Proposed interface

Mapping (SWE-bench → agentv)

Implementation approach

Acceptance Criteria

Non-goals

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

feat: HuggingFace dataset import command #978

Description

Objective

Motivation

Design

Proposed interface

Mapping (SWE-bench → agentv)

Implementation approach

Acceptance Criteria

Non-goals

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions