feat: Docker workspace execution environments for coding benchmarks

## Objective

Add Docker-based execution environments as a workspace type in AgentV, enabling reproducible evaluation of coding agents against real-world repositories (SWE-bench, Aider benchmarks, etc.).

## Context

From [SWE-bench competitive analysis](https://github.com/agentevals/agentevals-research/pull/55): Docker isolation is the industry standard for coding agent benchmarks. SWE-bench uses a 3-layer Docker image architecture (base → env → instance) to provide reproducible, isolated evaluation environments for 72 repositories across 8 languages. Without Docker workspace support, AgentV cannot credibly host or run coding agent benchmarks.

AgentV's existing `workspace_template` with git repos provides lightweight isolation, but it cannot:
- Install language-specific dependencies (conda envs, pip, npm, cargo, etc.)
- Guarantee reproducible builds across machines
- Run untrusted agent-generated code safely
- Leverage pre-built image registries for fast evaluation startup

## Design decisions (settled)

1. **Docker workspaces are a new workspace type** alongside existing git-based workspace templates, not a replacement. This follows the "Lightweight Core, Plugin Extensibility" principle.

2. **The agent runs on the host, the grader runs inside the container.** Flow:
   - AgentV sends the prompt to the agent target (e.g., Claude, Codex) on the host
   - The agent produces output (a patch/diff or file changes)
   - AgentV creates a Docker container from the specified image
   - AgentV copies the agent's output (patch file) into the container
   - The code-grader script runs **inside the container** (has access to the repo, deps, test suite)
   - The grader writes its JSON result to stdout, and AgentV collects it
   - Container is destroyed

3. **Implementation location**: New Docker workspace provider in `packages/core/src/evaluation/` — follows the same pattern as the existing workspace pool/template system. The provider should be lazy-loaded (don't require Docker as a dependency for non-Docker evals).

4. **YAML schema**: New `workspace.docker` field:
   ```yaml
   workspace:
     docker:
       image: swebench/sweb.eval.x86_64.django__django-15180
       timeout: 1800        # seconds, default 1800
       memory: 4g           # optional memory limit
       cpus: 2              # optional CPU limit
   ```
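As a rough illustration of the validation this schema implies, here is a dependency-free TypeScript sketch. It is a stand-in for the Zod schema mentioned under "Minimum viable scope", not the actual AgentV schema. The names `DockerWorkspaceConfig` and `parseDockerWorkspace` are hypothetical, and the accepted `memory` format (digits plus an optional `b`/`k`/`m`/`g` unit) is an assumption.

```typescript
// Hypothetical parsed shape of `workspace.docker`; field names follow the YAML above.
interface DockerWorkspaceConfig {
  image: string;
  timeout: number; // seconds
  memory?: string; // e.g. "4g"
  cpus?: number;
}

const DEFAULT_TIMEOUT_SECONDS = 1800;
// Assumed memory format: digits followed by an optional b/k/m/g unit.
const MEMORY_PATTERN = /^\d+[bkmg]?$/i;

function parseDockerWorkspace(raw: unknown): DockerWorkspaceConfig {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("workspace.docker must be a mapping");
  }
  const fields = raw as {
    image?: unknown;
    timeout?: unknown;
    memory?: unknown;
    cpus?: unknown;
  };

  const image = fields.image;
  if (typeof image !== "string" || image.trim() === "") {
    throw new Error("workspace.docker.image must be a non-empty string");
  }

  // Fall back to the documented default when no timeout is given.
  const timeout = fields.timeout === undefined ? DEFAULT_TIMEOUT_SECONDS : fields.timeout;
  if (typeof timeout !== "number" || !Number.isFinite(timeout) || timeout <= 0) {
    throw new Error("workspace.docker.timeout must be a positive number of seconds");
  }

  let memory: string | undefined;
  if (fields.memory !== undefined) {
    const rawMemory = fields.memory;
    if (typeof rawMemory !== "string" || !MEMORY_PATTERN.test(rawMemory)) {
      throw new Error("workspace.docker.memory must look like '512m' or '4g'");
    }
    memory = rawMemory;
  }

  let cpus: number | undefined;
  if (fields.cpus !== undefined) {
    const rawCpus = fields.cpus;
    if (typeof rawCpus !== "number" || !Number.isFinite(rawCpus) || rawCpus <= 0) {
      throw new Error("workspace.docker.cpus must be a positive number");
    }
    cpus = rawCpus;
  }

  return { image, timeout, memory, cpus };
}
```

Only `image` is required in this sketch; `timeout` falls back to 1800 seconds and the resource limits stay unset when omitted.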

### Execution flow (detailed)

```
1. agentv eval starts
2. For each test:
   a. Send prompt to agent target → receive agent output (patch/files)
   b. docker pull <image> (if not cached)
   c. docker create --memory=4g --cpus=2 <image>
   d. docker start <container>
      - docker exec only works on a running container, so the container's command must keep it alive
   e. docker cp <agent_patch> <container>:/tmp/patch.diff
   f. docker exec <container> /bin/bash -c "<grader_command>"
      - grader reads patch, applies it, runs tests, returns JSON to stdout
   g. Parse grader JSON output (score, assertions)
   h. docker rm -f <container>
3. Aggregate results, write JSONL
```
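The per-test Docker steps can be sketched as a pure function that returns argument lists, which keeps the sequence unit-testable without a Docker daemon. This is a sketch under stated assumptions: `DockerRunPlan` and `buildDockerSteps` are hypothetical names, and overriding the container command with `sleep infinity` is an assumption made so that the container stays running for `docker exec` (an image with its own `ENTRYPOINT` may need different handling).

```typescript
// Hypothetical inputs for one test run; none of these names come from AgentV.
interface DockerRunPlan {
  image: string;
  containerName: string;
  patchPath: string; // host path of the agent's patch
  graderCommand: string; // shell command executed inside the container
  memory?: string;
  cpus?: number;
}

// Returns argv arrays (without the leading "docker") in execution order,
// plus the cleanup command that should always run afterwards.
function buildDockerSteps(plan: DockerRunPlan): { run: string[][]; cleanup: string[] } {
  const limits: string[] = [];
  if (plan.memory !== undefined) limits.push(`--memory=${plan.memory}`);
  if (plan.cpus !== undefined) limits.push(`--cpus=${plan.cpus}`);

  return {
    run: [
      // A real provider might skip the pull when the image is already cached.
      ["pull", plan.image],
      // Assumption: `sleep infinity` keeps the container alive until it is removed.
      ["create", "--name", plan.containerName, ...limits, plan.image, "sleep", "infinity"],
      // `docker exec` requires a running container, so start it explicitly.
      ["start", plan.containerName],
      ["cp", plan.patchPath, `${plan.containerName}:/tmp/patch.diff`],
      ["exec", plan.containerName, "/bin/bash", "-c", plan.graderCommand],
    ],
    cleanup: ["rm", "-f", plan.containerName],
  };
}
```

Passing each argv array to a process spawner, rather than joining it into a shell string, avoids quoting problems when the grader command itself contains quotes or `&&`.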

### Example EVAL.yaml

```yaml
workspace:
  docker:
    image: swebench/sweb.eval.x86_64.django__django-15180

tests:
  - id: django__django-15180
    input:
      - role: user
        content: |
          <problem_statement>
    assertions:
      - type: code-grader
        value: ./graders/swe-bench-grader.ts
        config:
          test_command: "cd /testbed && python -m pytest tests/template_tests/ -x"
          fail_to_pass: ["test_cached_loader_invalidation"]
          pass_to_pass: ["test_cached_loader_basic"]
```

The `swe-bench-grader.ts` uses `@agentv/eval`'s `defineCodeGrader`, receives the agent's patch via `config`, applies it inside the container with `git apply`, runs the test command, parses pytest output, and checks FAIL_TO_PASS / PASS_TO_PASS transitions.
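The FAIL_TO_PASS / PASS_TO_PASS check at the end of that pipeline could look roughly like the sketch below. It deliberately leaves out `defineCodeGrader` and the pytest output parsing; `TestStatus`, `TransitionResult`, and `checkTransitions` are hypothetical names, and the binary 0/1 score is an assumption rather than AgentV's scoring contract.

```typescript
type TestStatus = "passed" | "failed";

interface TransitionResult {
  score: number; // 1 when every listed test passed, otherwise 0 (assumed binary scoring)
  failed: string[]; // listed tests that did not pass
}

// `statuses` maps test names to their outcome after the patch was applied.
function checkTransitions(
  statuses: Record<string, TestStatus>,
  failToPass: string[],
  passToPass: string[],
): TransitionResult {
  const required = [...failToPass, ...passToPass];
  // A test that is missing from the results counts as not passed.
  const failed = required.filter((name) => statuses[name] !== "passed");
  return { score: failed.length === 0 ? 1 : 0, failed };
}
```

With the example config above, the instance would count as resolved only if both `test_cached_loader_invalidation` and `test_cached_loader_basic` pass after the patch is applied.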

## Minimum viable scope

- `workspace.docker.image` field in EVAL.yaml schema (Zod validation)
- Docker workspace provider in `packages/core/` that handles pull → create → start → cp → exec → rm
- Timeout: enforce `workspace.docker.timeout` around `docker exec`, and run `docker rm -f` on expiry
- Resource limits: memory, cpus passed to `docker create`
- Works with existing `code-grader` evaluator — grader runs inside container
- Container cleanup on success, failure, and timeout
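
One way to get cleanup on success, failure, and timeout from a single code path is to race the grader run against a timer and remove the container in a `finally` block. The sketch below is a minimal illustration, not the provider's actual design; `runWithTimeout` is a hypothetical helper, and the `run` and `cleanup` callbacks stand in for `docker exec` and `docker rm -f`.

```typescript
// Runs `run`, rejects if it takes longer than `timeoutMs`, and always calls `cleanup`.
async function runWithTimeout<T>(
  run: () => Promise<T>,
  cleanup: () => Promise<void>,
  timeoutMs: number,
): Promise<T> {
  let onExpiry: () => void = () => {};
  // The executor runs synchronously, so `onExpiry` is set before the timer starts.
  const expiry = new Promise<never>((_, reject) => {
    onExpiry = () => reject(new Error(`timed out after ${timeoutMs} ms`));
  });
  const timer = setTimeout(() => onExpiry(), timeoutMs);

  try {
    // Whichever settles first wins: the grader result or the timeout error.
    return await Promise.race([run(), expiry]);
  } finally {
    clearTimeout(timer);
    // Stand-in for `docker rm -f <container>`; runs on success, failure, and timeout.
    await cleanup();
  }
}
```

Because force-removing the container also ends a grader that is still running inside it, the timeout path needs no separate kill step in this sketch.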

## Stretch goals (not required for MVP)

- Layered image caching (base → env → instance) like SWE-bench
- `agentv workspace prepare` command to pre-pull/build Docker images
- Dockerfile-based workspace definitions (build from Dockerfile, not just pull)
- Docker Compose support for multi-container setups

## Acceptance signals

- [ ] A coding eval using a Docker workspace runs end-to-end with a real agent target
- [ ] Container is destroyed after evaluation (no state leakage)
- [ ] Timeout kills the container if evaluation exceeds limit
- [ ] Existing git-based workspace templates continue working unchanged
- [ ] At least one working example: SWE-bench instance evaluated through AgentV
- [ ] `bun run test` passes with new Docker workspace tests (unit tests can mock Docker)

## Non-goals

- Building a Docker image registry or hosting infrastructure
- Replacing SWE-bench's harness — AgentV should consume their images, not rebuild them
- Docker-in-Docker support
- Cloud execution backends (separate concern, tracked separately)
- Running the agent inside the container — agents run on the host

## Related

- [SWE-bench competitive analysis](https://github.com/agentevals/agentevals-research/pull/55)
- [Workspace patterns research](https://github.com/agentevals/agentevals-research/blob/main/research/findings/workspace-patterns-swe-bench.md)
- #563 (Studio hardening — will benefit from richer workspace metadata)
- #966 (Public leaderboard — depends on this for running SWE-bench)
