feat: SWE-bench submission format export #979

@christso

Objective

Add an export adapter that converts agentv results to the SWE-bench submission format, enabling participation in the SWE-bench leaderboard.

Motivation

AgentV's Docker workspace feature (#971) makes it capable of running SWE-bench evaluations. However, the official SWE-bench leaderboard only accepts submissions in its own format (all_preds.jsonl + metadata.yaml + trajs/). A built-in exporter bridges this gap.

Design

Following the "Lightweight Core, Plugin Extensibility" principle, this should be a CLI command or export adapter, not a core feature.

Proposed interface

agentv export --format swe-bench <results-dir> --output <submission-dir>

Output structure

submission/
├── all_preds.jsonl          # {instance_id, model_name_or_path, model_patch}
├── metadata.yaml            # Agent scaffold description
├── README.md                # Auto-generated from config
├── trajs/                   # Agent trajectories
│   └── <instance_id>.json   # From agentv trace data
└── results.json             # Converted from index.jsonl
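
As a rough illustration of the trajs/ part of this layout, the sketch below writes one trajectory file per instance. The function name `write_trajectory` and the shape of the `trace` argument are hypothetical; the real exporter would take whatever agentv's trace data actually looks like.

```python
import json
from pathlib import Path
from typing import Any


def write_trajectory(submission_dir: Path, instance_id: str, trace: Any) -> Path:
    """Write trajs/<instance_id>.json under the submission directory.

    `trace` stands in for agentv trace data; any JSON-serializable
    value works for the purposes of this sketch.
    """
    trajs_dir = submission_dir / "trajs"
    # Create submission/ and submission/trajs/ if they do not exist yet.
    trajs_dir.mkdir(parents=True, exist_ok=True)
    traj_path = trajs_dir / f"{instance_id}.json"
    traj_path.write_text(json.dumps(trace, indent=2), encoding="utf-8")
    return traj_path
```

Keeping one file per instance means a partially failed export can be resumed without rewriting trajectories that already exist.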

Mapping (agentv → SWE-bench)

AgentV field            SWE-bench field
test_id                 instance_id
unified_diff / output   model_patch
target                  model_name_or_path
trace_summary           trajs/<instance_id>.json
score, scores[]         results.json
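
A minimal sketch of how the first three rows of this mapping could produce all_preds.jsonl. It assumes each agentv result is available as a dict keyed by the field names above, and that `unified_diff` is preferred over `output` when both are present. Both are assumptions, and `to_prediction` / `write_all_preds` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Any, Iterable


def to_prediction(record: dict[str, Any]) -> dict[str, str]:
    """Map one agentv result record to a SWE-bench prediction entry."""
    # Assumption: fall back to `output` only when `unified_diff` is missing or empty.
    patch = record.get("unified_diff") or record.get("output") or ""
    return {
        "instance_id": record["test_id"],
        "model_name_or_path": record["target"],
        "model_patch": patch,
    }


def write_all_preds(records: Iterable[dict[str, Any]], submission_dir: Path) -> Path:
    """Write one JSON object per line to <submission_dir>/all_preds.jsonl."""
    submission_dir.mkdir(parents=True, exist_ok=True)
    out_path = submission_dir / "all_preds.jsonl"
    with out_path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(to_prediction(record)) + "\n")
    return out_path
```

Whether an empty patch should be written or the instance skipped is left open here; the sketch writes an empty string.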

Acceptance Criteria

  • agentv export --format swe-bench produces a valid SWE-bench submission directory
  • Generates all_preds.jsonl with correct instance_id → model_patch mapping
  • Generates trajectory files from trace data
  • Auto-generates metadata.yaml from agentv config
  • Output passes SWE-bench leaderboard submission validation
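
Some of these criteria could be checked locally before attempting a real submission. The sketch below is a hypothetical sanity check, not the official SWE-bench validation: it only verifies the pieces named in this issue (the three keys in all_preds.jsonl, one trajectory per instance, and the presence of metadata.yaml).

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"instance_id", "model_name_or_path", "model_patch"}


def check_submission(submission_dir: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    preds_path = submission_dir / "all_preds.jsonl"
    if not preds_path.is_file():
        return ["missing all_preds.jsonl"]

    problems: list[str] = []
    seen: set[str] = set()
    lines = preds_path.read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {line_no}: not valid JSON")
            continue
        if not isinstance(entry, dict):
            problems.append(f"line {line_no}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"line {line_no}: missing keys {sorted(missing)}")
            continue
        instance_id = entry["instance_id"]
        if instance_id in seen:
            problems.append(f"line {line_no}: duplicate instance_id {instance_id}")
        seen.add(instance_id)
        # Each prediction should have a matching trajectory file.
        if not (submission_dir / "trajs" / f"{instance_id}.json").is_file():
            problems.append(f"line {line_no}: no trajectory for {instance_id}")

    if not (submission_dir / "metadata.yaml").is_file():
        problems.append("missing metadata.yaml")
    return problems
```

A check like this could run at the end of the export command and fail with a non-zero exit code when the returned list is non-empty.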

Non-goals

  • Auto-submitting to leaderboard (just generate the format)

Metadata

Labels: enhancement
Status: Done