feat: Docker workspace execution environments for coding benchmarks #965

@christso

Objective

Add Docker-based execution environments as a workspace type in AgentV, enabling reproducible evaluation of coding agents against real-world repositories (SWE-bench, Aider benchmarks, etc.).

Context

From SWE-bench competitive analysis: Docker isolation is the industry standard for coding agent benchmarks. SWE-bench uses a 3-layer Docker image architecture (base → env → instance) to provide reproducible, isolated evaluation environments for 72 repositories across 8 languages. Without Docker workspace support, AgentV cannot credibly host or run coding agent benchmarks.

AgentV's existing workspace_template with git repos provides lightweight isolation, but it cannot:

  • Install language-specific dependencies (conda envs, pip, npm, cargo, etc.)
  • Guarantee reproducible builds across machines
  • Run untrusted agent-generated code safely
  • Leverage pre-built image registries for fast evaluation startup

Design decisions (settled)

  1. Docker workspaces are a new workspace type alongside existing git-based workspace templates, not a replacement. This follows the "Lightweight Core, Plugin Extensibility" principle.

  2. The agent runs on the host; the grader runs inside the container. Flow:

    • AgentV sends the prompt to the agent target (e.g., Claude, Codex) on the host
    • The agent produces output (a patch/diff or file changes)
    • AgentV creates and starts a Docker container from the specified image
    • AgentV copies the agent's output (patch file) into the container
    • The code-grader script runs inside the container, where it has access to the repo, deps, and test suite
    • The grader writes its JSON result to stdout, and AgentV collects it
    • The container is destroyed

  3. Implementation location: New Docker workspace provider in packages/core/src/evaluation/ — follows the same pattern as the existing workspace pool/template system. The provider should be lazy-loaded (don't require Docker as a dependency for non-Docker evals).

  4. YAML schema: New workspace.docker field:

    workspace:
      docker:
        image: swebench/sweb.eval.x86_64.django__django-15180
        timeout: 1800        # seconds, default 1800
        memory: 4g           # optional memory limit
        cpus: 2              # optional CPU limit
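
The fields above could be normalized along the following lines. This is a dependency-free sketch for illustration only: the MVP scope calls for Zod validation, and the names `DockerWorkspaceInput`, `DockerWorkspaceConfig`, and `normalizeDockerWorkspace` are hypothetical, not part of AgentV. The memory check is a simplified version of Docker's size syntax (a positive integer with an optional b/k/m/g suffix).

```typescript
// Hypothetical types and helper; not AgentV's actual API.
interface DockerWorkspaceInput {
  image: string;
  timeout?: number; // seconds
  memory?: string;  // e.g. "4g", "512m"
  cpus?: number;
}

interface DockerWorkspaceConfig {
  image: string;
  timeoutSeconds: number;
  memory?: string;
  cpus?: number;
}

// Default taken from the schema comment: "seconds, default 1800".
const DEFAULT_TIMEOUT_SECONDS = 1800;

function normalizeDockerWorkspace(input: DockerWorkspaceInput): DockerWorkspaceConfig {
  if (typeof input.image !== "string" || input.image.trim() === "") {
    throw new Error("workspace.docker.image must be a non-empty string");
  }
  const timeoutSeconds = input.timeout ?? DEFAULT_TIMEOUT_SECONDS;
  if (!Number.isFinite(timeoutSeconds) || timeoutSeconds <= 0) {
    throw new Error("workspace.docker.timeout must be a positive number of seconds");
  }
  // Simplified check: positive integer with an optional b/k/m/g unit suffix.
  if (input.memory !== undefined && !/^[1-9]\d*[bkmg]?$/i.test(input.memory)) {
    throw new Error(`workspace.docker.memory is not a valid size: ${input.memory}`);
  }
  if (input.cpus !== undefined && !(Number.isFinite(input.cpus) && input.cpus > 0)) {
    throw new Error("workspace.docker.cpus must be a positive number");
  }
  return { image: input.image, timeoutSeconds, memory: input.memory, cpus: input.cpus };
}
```

Only `image` is required here; `memory` and `cpus` stay undefined when omitted so that no limit flag is passed to Docker.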

Execution flow (detailed)

1. agentv eval starts
2. For each test:
   a. Send prompt to agent target → receive agent output (patch/files)
   b. docker pull <image> (if not cached)
   c. docker create --memory=4g --cpus=2 <image>
   d. docker start <container>
      - docker exec requires a running container, so the container's command must keep it alive
   e. docker cp <agent_patch> <container>:/tmp/patch.diff
   f. docker exec <container> /bin/bash -c "<grader_command>"
      - grader reads patch, applies it, runs tests, returns JSON to stdout
   g. Parse grader JSON output (score, assertions)
   h. docker rm -f <container>
3. Aggregate results, write JSONL
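
As a rough illustration of the per-test container lifecycle, the sketch below builds the argument list for each docker CLI call in order. `buildDockerLifecycle`, `LifecycleOptions`, and the `containerName` field are hypothetical names, not AgentV code. Two assumptions are worth calling out: the sketch adds an explicit `docker start` step and a `sleep infinity` keep-alive command, because `docker exec` only works on a running container, and it always pulls rather than checking the local image cache first.

```typescript
// Hypothetical helper; returns argv arrays rather than running anything.
interface LifecycleOptions {
  image: string;
  containerName: string; // assumed: a unique name chosen per test
  patchPath: string;     // host path to the agent's patch file
  graderCommand: string;
  memory?: string;
  cpus?: number;
}

function buildDockerLifecycle(opts: LifecycleOptions): { run: string[][]; cleanup: string[] } {
  const limits: string[] = [];
  if (opts.memory) limits.push(`--memory=${opts.memory}`);
  if (opts.cpus !== undefined) limits.push(`--cpus=${opts.cpus}`);
  return {
    run: [
      // Always pulls in this sketch; a real provider might check the local cache first.
      ["docker", "pull", opts.image],
      // "sleep infinity" is an assumed keep-alive so the container stays up for exec.
      ["docker", "create", "--name", opts.containerName, ...limits, opts.image, "sleep", "infinity"],
      ["docker", "start", opts.containerName],
      ["docker", "cp", opts.patchPath, `${opts.containerName}:/tmp/patch.diff`],
      ["docker", "exec", opts.containerName, "/bin/bash", "-c", opts.graderCommand],
    ],
    // Kept separate so the caller can run it on success, failure, and timeout.
    cleanup: ["docker", "rm", "-f", opts.containerName],
  };
}
```

Returning plain argument arrays keeps the sequence easy to unit test with a mocked Docker, and passing them to a process spawner as argv (rather than as one shell string) avoids quoting problems in the grader command.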

Example EVAL.yaml

workspace:
  docker:
    image: swebench/sweb.eval.x86_64.django__django-15180

tests:
  - id: django__django-15180
    input:
      - role: user
        content: |
          <problem_statement>
    assertions:
      - type: code-grader
        value: ./graders/swe-bench-grader.ts
        config:
          test_command: "cd /testbed && python -m pytest tests/template_tests/ -x"
          fail_to_pass: ["test_cached_loader_invalidation"]
          pass_to_pass: ["test_cached_loader_basic"]

The swe-bench-grader.ts grader uses defineCodeGrader from @agentv/eval. It receives the agent's patch via config, applies it inside the container with git apply, runs the test command, parses the pytest output, and checks the FAIL_TO_PASS / PASS_TO_PASS transitions.
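
The transition check at the end of that pipeline might look roughly like this. It is a sketch under assumptions: `checkTransitions` and its types are hypothetical, the set of passed test names is assumed to have been extracted from the pytest output already, and the all-or-nothing score is one possible scoring choice rather than something this issue specifies. It deliberately does not use defineCodeGrader, whose signature is not shown here.

```typescript
// Hypothetical helper; mirrors the fail_to_pass / pass_to_pass config keys.
interface TransitionConfig {
  fail_to_pass: string[];
  pass_to_pass: string[];
}

interface TransitionResult {
  score: number;     // 1 when every listed test passed, otherwise 0 (assumed scoring)
  missing: string[]; // listed tests that did not pass after the patch
}

function checkTransitions(config: TransitionConfig, passedTests: Set<string>): TransitionResult {
  // Both lists must pass after the patch: fail_to_pass tests show the fix works,
  // pass_to_pass tests show nothing else regressed.
  const required = [...config.fail_to_pass, ...config.pass_to_pass];
  const missing = required.filter((name) => !passedTests.has(name));
  return { score: missing.length === 0 ? 1 : 0, missing };
}
```

With the example config, a run in which only test_cached_loader_basic passes would yield a score of 0 and report test_cached_loader_invalidation as missing.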

Minimum viable scope

  • workspace.docker.image field in EVAL.yaml schema (Zod validation)
  • Docker workspace provider in packages/core/ that handles pull → create → start → cp → exec → rm
  • Timeout: docker exec with timeout, docker rm -f on expiry
  • Resource limits: memory, cpus passed to docker create
  • Works with existing code-grader evaluator — grader runs inside container
  • Container cleanup on success, failure, and timeout
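
The timeout and cleanup requirements above can be combined in one wrapper. The following is a minimal sketch, assuming the provider exposes the grader run and the container removal as async callbacks; `runWithTimeout` is a hypothetical name. In a real provider, `cleanup` would run `docker rm -f`, which also stops a grader that is still executing.

```typescript
// Hypothetical helper: race the work against a timer and always clean up.
async function runWithTimeout<T>(
  work: () => Promise<T>,
  cleanup: () => Promise<void>,
  timeoutMs: number,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const expiry = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs} ms`)), timeoutMs);
  });
  try {
    // Whichever settles first wins; a late result from work() is ignored.
    return await Promise.race([work(), expiry]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
    // Runs on success, failure, and timeout.
    await cleanup();
  }
}
```

Putting the cleanup in `finally` means a single code path covers all three outcomes listed in the scope, which makes leaked containers less likely than with per-branch cleanup calls.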

Stretch goals (not required for MVP)

  • Layered image caching (base → env → instance) like SWE-bench
  • agentv workspace prepare command to pre-pull/build Docker images
  • Dockerfile-based workspace definitions (build from Dockerfile, not just pull)
  • Docker Compose support for multi-container setups

Acceptance signals

  • A coding eval using a Docker workspace runs end-to-end with a real agent target
  • Container is destroyed after evaluation (no state leakage)
  • The timeout kills the container if evaluation exceeds the limit
  • Existing git-based workspace templates continue working unchanged
  • At least one working example: SWE-bench instance evaluated through AgentV
  • bun run test passes with new Docker workspace tests (unit tests can mock Docker)

Non-goals

  • Building a Docker image registry or hosting infrastructure
  • Replacing SWE-bench's harness — AgentV should consume their images, not rebuild them
  • Docker-in-Docker support
  • Cloud execution backends (separate concern, tracked separately)
  • Running the agent inside the container — agents run on the host
