Objective
Add Docker-based execution environments as a workspace type in AgentV, enabling reproducible evaluation of coding agents against real-world repositories (SWE-bench, Aider benchmarks, etc.).
Context
From SWE-bench competitive analysis: Docker isolation is the industry standard for coding agent benchmarks. SWE-bench uses a 3-layer Docker image architecture (base → env → instance) to provide reproducible, isolated evaluation environments for 72 repositories across 8 languages. Without Docker workspace support, AgentV cannot credibly host or run coding agent benchmarks.
AgentV's existing workspace_template with git repos provides lightweight isolation, but it cannot:
- Install language-specific dependencies (conda envs, pip, npm, cargo, etc.)
- Guarantee reproducible builds across machines
- Run untrusted agent-generated code safely
- Leverage pre-built image registries for fast evaluation startup
Design decisions (settled)
- Docker workspaces are a new workspace type alongside existing git-based workspace templates, not a replacement. Follow "Lightweight Core, Plugin Extensibility".
- The agent runs on the host; the grader runs inside the container. Flow:
  1. AgentV sends the prompt to the agent target (e.g., Claude, Codex) on the host
  2. The agent produces output (a patch/diff or file changes)
  3. AgentV creates a Docker container from the specified image
  4. AgentV copies the agent's output (patch file) into the container
  5. The code-grader script runs inside the container (has access to the repo, deps, test suite)
  6. The grader writes a JSON result to stdout, and AgentV collects it
  7. The container is destroyed
- Implementation location: new Docker workspace provider in packages/core/src/evaluation/, following the same pattern as the existing workspace pool/template system. The provider should be lazy-loaded so that non-Docker evals do not require Docker as a dependency.
- YAML schema: new workspace.docker field:
    workspace:
      docker:
        image: swebench/sweb.eval.x86_64.django__django-15180
        timeout: 1800 # seconds, default 1800
        memory: 4g # optional memory limit
        cpus: 2 # optional CPU limit
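As a rough illustration of the validation this field needs, the sketch below checks the four keys shown above and applies the 1800-second default. The issue calls for Zod; this version is deliberately dependency-free so that it runs on its own. The names DockerWorkspaceConfig and parseDockerWorkspace are hypothetical, not existing AgentV APIs, and the memory check is intentionally conservative.

```typescript
// Hypothetical shape of the `workspace.docker` block; field names follow the YAML above.
interface DockerWorkspaceConfig {
  image: string;
  timeout: number; // seconds
  memory?: string; // e.g. "4g"
  cpus?: number;
}

const DEFAULT_TIMEOUT_SECONDS = 1800;

// Validate an untyped object (as parsed from YAML) and apply defaults.
function parseDockerWorkspace(raw: unknown): DockerWorkspaceConfig {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("workspace.docker must be an object");
  }
  const r = raw as Record<string, unknown>;

  const image = r["image"];
  if (typeof image !== "string" || image.trim() === "") {
    throw new Error("workspace.docker.image must be a non-empty string");
  }

  const rawTimeout = r["timeout"];
  const timeout = rawTimeout === undefined ? DEFAULT_TIMEOUT_SECONDS : rawTimeout;
  if (typeof timeout !== "number" || !Number.isFinite(timeout) || timeout <= 0) {
    throw new Error("workspace.docker.timeout must be a positive number of seconds");
  }

  const config: DockerWorkspaceConfig = { image, timeout };

  const rawMemory = r["memory"];
  if (rawMemory !== undefined) {
    // Conservative check: a positive integer with an optional b/k/m/g suffix.
    if (typeof rawMemory !== "string" || !/^[1-9]\d*[bkmg]?$/i.test(rawMemory)) {
      throw new Error('workspace.docker.memory must look like "512m" or "4g"');
    }
    config.memory = rawMemory;
  }

  const rawCpus = r["cpus"];
  if (rawCpus !== undefined) {
    if (typeof rawCpus !== "number" || !Number.isFinite(rawCpus) || rawCpus <= 0) {
      throw new Error("workspace.docker.cpus must be a positive number");
    }
    config.cpus = rawCpus;
  }

  return config;
}
```

Failing at parse time keeps a malformed EVAL.yaml from reaching docker create, where the error would be harder to attribute to a specific field.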
Execution flow (detailed)
    1. agentv eval starts
    2. For each test:
       a. Send prompt to agent target → receive agent output (patch/files)
       b. docker pull <image> (if not cached)
       c. docker create --memory=4g --cpus=2 <image>
       d. docker cp <agent_patch> <container>:/tmp/patch.diff
       e. docker exec <container> /bin/bash -c "<grader_command>"
          - note: docker exec requires a running container, so the container must first be started (docker start) with a command that keeps it alive
          - grader reads patch, applies it, runs tests, returns JSON to stdout
       f. Parse grader JSON output (score, assertions)
       g. docker rm -f <container>
    3. Aggregate results, write JSONL
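The per-test steps above can be sketched as a single function whose Docker calls go through an injected runner, which is also what lets unit tests mock Docker. This is a minimal sketch under stated assumptions, not the planned provider: DockerRunner, ContainerJob, GraderResult, and gradeInContainer are hypothetical names, the runner is assumed to throw on a non-zero exit, and the container is kept alive with sleep infinity, which only works if the image has no ENTRYPOINT that swallows that command.

```typescript
// Runs `docker <args>`, returns stdout, and throws on a non-zero exit.
// Injected so tests can substitute a fake.
type DockerRunner = (args: string[]) => string;

interface ContainerJob {
  image: string;
  patchPath: string; // host path to the agent's patch
  graderCommand: string; // shell command executed inside the container
  memory?: string;
  cpus?: number;
}

interface GraderResult {
  score: number;
  assertions: unknown[];
}

function gradeInContainer(docker: DockerRunner, job: ContainerJob): GraderResult {
  // Pull only when the image is not already present locally.
  try {
    docker(["image", "inspect", job.image]);
  } catch {
    docker(["pull", job.image]);
  }

  const createArgs = ["create"];
  if (job.memory !== undefined) createArgs.push(`--memory=${job.memory}`);
  if (job.cpus !== undefined) createArgs.push(`--cpus=${job.cpus}`);
  // Assumption: override the image command so the container stays up for `docker exec`.
  createArgs.push(job.image, "sleep", "infinity");
  const container = docker(createArgs).trim(); // `docker create` prints the container ID

  try {
    docker(["cp", job.patchPath, `${container}:/tmp/patch.diff`]);
    docker(["start", container]);
    const stdout = docker(["exec", container, "/bin/bash", "-c", job.graderCommand]);
    const parsed = JSON.parse(stdout) as Partial<GraderResult>;
    if (typeof parsed.score !== "number" || !Array.isArray(parsed.assertions)) {
      throw new Error("grader output must contain a numeric score and an assertions array");
    }
    return { score: parsed.score, assertions: parsed.assertions };
  } finally {
    // Runs whether grading succeeded or threw, so no container is left behind.
    docker(["rm", "-f", container]);
  }
}
```

A real runner could be as small as (args) => execFileSync("docker", args, { encoding: "utf8" }) from node:child_process; that call also accepts a timeout option in milliseconds, which is one way to enforce the configured limit.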
Example EVAL.yaml
    workspace:
      docker:
        image: swebench/sweb.eval.x86_64.django__django-15180
    tests:
      - id: django__django-15180
        input:
          - role: user
            content: |
              <problem_statement>
        assertions:
          - type: code-grader
            value: ./graders/swe-bench-grader.ts
            config:
              test_command: "cd /testbed && python -m pytest tests/template_tests/ -x"
              fail_to_pass: ["test_cached_loader_invalidation"]
              pass_to_pass: ["test_cached_loader_basic"]
The swe-bench-grader.ts uses @agentv/eval's defineCodeGrader, receives the agent's patch via config, applies it inside the container with git apply, runs the test command, parses pytest output, and checks FAIL_TO_PASS / PASS_TO_PASS transitions.
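The transition check at the end of that pipeline can be illustrated in isolation. The sketch below assumes the pytest output has already been parsed into a map from test name to status; checkTransitions and its types are hypothetical names, not part of @agentv/eval. A test missing from the results is treated as not passing.

```typescript
type TestStatus = "passed" | "failed";

interface TransitionReport {
  resolved: boolean;
  failToPassMissing: string[]; // expected to pass after the patch, but did not
  passToPassBroken: string[]; // passed before the patch, must still pass
}

// The instance counts as resolved only if every listed test passes after the patch.
function checkTransitions(
  statusAfterPatch: Map<string, TestStatus>,
  failToPass: string[],
  passToPass: string[],
): TransitionReport {
  // Absent tests are treated the same as failures.
  const notPassing = (name: string): boolean => statusAfterPatch.get(name) !== "passed";
  const failToPassMissing = failToPass.filter(notPassing);
  const passToPassBroken = passToPass.filter(notPassing);
  return {
    resolved: failToPassMissing.length === 0 && passToPassBroken.length === 0,
    failToPassMissing,
    passToPassBroken,
  };
}
```

Returning the two lists separately lets the grader report which side failed: a fix that did not land, or a regression in previously passing tests.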
Minimum viable scope
- workspace.docker.image field in EVAL.yaml schema (Zod validation)
- Docker workspace provider in packages/core/ that handles pull → create → cp → exec → rm
- Timeout: docker exec with timeout, docker rm -f on expiry
- Resource limits: memory, cpus passed to docker create
- Works with existing code-grader evaluator; grader runs inside container
- Container cleanup on success, failure, and timeout
Stretch goals (not required for MVP)
- Layered image caching (base → env → instance) like SWE-bench
- agentv workspace prepare command to pre-pull/build Docker images
- Dockerfile-based workspace definitions (build from Dockerfile, not just pull)
- Docker Compose support for multi-container setups
Acceptance signals
- bun run test passes with new Docker workspace tests (unit tests can mock Docker)
Non-goals
- Building a Docker image registry or hosting infrastructure
- Replacing SWE-bench's harness: AgentV should consume their images, not rebuild them
- Docker-in-Docker support
- Cloud execution backends (separate concern, tracked separately)
- Running the agent inside the container; agents run on the host
Related