6 changes: 6 additions & 0 deletions benchmarks/swesmith/.env.example
@@ -0,0 +1,6 @@
# SSH key for private repo access (optional, only needed for non-standard key paths)
# If not set, default keys in ~/.ssh/ are used automatically.
# GITHUB_USER_SSH_KEY=/home/user/.ssh/id_ed25519_github

# GitHub token (optional, increases GitHub API rate limit from 60 to 5000 req/hour)
# GITHUB_TOKEN=
192 changes: 183 additions & 9 deletions benchmarks/swesmith/README.md
@@ -1,19 +1,23 @@
# SWE-Smith Benchmark - building Docker images
# SWE-Smith Benchmark Evaluation

This directory contains the implementation for building custom agent server Docker images for SWE-Smith. The primary purpose is to use GitHub workflows to build these images quickly and to use them for training LLMs as SWE agents.
This directory contains the implementation for running SWE-Smith evaluation using OpenHands agents.

## Overview

SWE-Smith is a benchmark for training and evaluating AI agents on synthetically generated software engineering tasks. Task instances are created by injecting bugs into real repositories and validating them against test suites.

## Dataset

- **Source**: [Paper](https://arxiv.org/abs/2504.21798)
- **Dataset**:
- `SWE-bench/SWE-smith-py` - Full dataset
- **Source**: [SWE-Smith Paper](https://arxiv.org/abs/2504.21798)
- **Dataset**: `SWE-bench/SWE-smith-py`
- **Splits**: `train`
- Local task instance files (`.json` / `.jsonl`) generated via SWE-Smith are also supported; see the loading sketch below.
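
For a quick sanity check before building images, either source can be loaded with the `datasets` library. A minimal sketch (the `instance_id` field is assumed to be present, as referenced elsewhere in this README):

```python
from datasets import load_dataset

# Load the full SWE-Smith Python dataset from HuggingFace
ds = load_dataset("SWE-bench/SWE-smith-py", split="train")

# Or load a local task instance file instead:
# ds = load_dataset("json", data_files="/path/to/task_instances.json", split="train")

print(len(ds), "task instances")
print(ds[0]["instance_id"])  # e.g. "encode__httpx.ae1b9f66.lm_modify__abc123"
```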

## Usage

### Build Docker Images
### Step 1: Build Docker Images

You need to build Docker images for the SWE-Smith instances. Each instance requires a specific environment setup based on the repository and issue. **Note that this will consume at least 150-200GB of disk space. Consider setting `--n-limit` to a smaller value if required.**
Before running inference, you need to build Docker images for the SWE-Smith instances. Each instance requires a specific environment setup. Disk usage depends on the number and size of task instances — the full dataset can consume 150-200GB, but smaller local instance files will use proportionally less.

```bash
uv run python -m benchmarks.swesmith.build_images \
@@ -23,7 +27,177 @@ uv run python -m benchmarks.swesmith.build_images \
--target source-minimal
```

### Running rollouts
For local task instance files:

```bash
uv run python -m benchmarks.swesmith.build_images \
--dataset /path/to/task_instances.json \
--split train \
--image ghcr.io/openhands/eval-agent-server \
--target source-minimal \
--n-limit 10
```

### Step 2: Run Inference

```bash
uv run swesmith-infer path/to/llm_config.json \
--dataset /path/to/task_instances.json \
--workspace docker \
--max-iterations 75 \
--num-workers 4
```

**Selecting specific instances:**

```bash
# Create instances.txt with one instance ID per line
echo "encode__httpx.ae1b9f66.lm_modify__abc123" > instances.txt

uv run swesmith-infer path/to/llm_config.json \
--dataset /path/to/task_instances.json \
--select instances.txt \
--workspace docker
```

### Configuration Options

| Argument | Description | Default |
|----------|-------------|---------|
| `--dataset` | HuggingFace dataset name or local file path | `SWE-bench/SWE-smith-py` |
| `--split` | Dataset split | `train` |
| `--workspace` | Workspace type | `docker` |
| `--num-workers` | Parallel workers | `4` |
| `--max-iterations` | Max agent turns per instance | `500` |
| `--n-limit` | Limit number of instances | all |
| `--select` | Text file with instance IDs (one per line) | - |
| `--max-attempts` | Retry attempts with critic | `3` |
| `--critic` | `pass` / `finish_with_patch` / `empty_patch_critic` | `finish_with_patch` |
| `--prompt-path` | Jinja2 prompt template | `prompts/default.j2` |
| `--note` | Note appended to output directory name | - |

### Private Repositories

For private repos, an SSH key must be accessible. The lookup order is:

1. `GITHUB_USER_SSH_KEY` environment variable (path to key file)
2. `~/.ssh/id_rsa`, `id_ecdsa`, `id_ecdsa_sk`, `id_ed25519`, `id_ed25519_sk` (first match)

```bash
# Only needed if your key has a non-standard name
export GITHUB_USER_SSH_KEY=~/.ssh/my_custom_key
```
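
The lookup behaves roughly like the sketch below. This is illustrative only; `find_ssh_key` is a hypothetical helper, not the actual implementation:

```python
import os
from pathlib import Path

# Default key names checked in ~/.ssh/, in order (first existing file wins)
DEFAULT_KEY_NAMES = ("id_rsa", "id_ecdsa", "id_ecdsa_sk", "id_ed25519", "id_ed25519_sk")


def find_ssh_key() -> Path | None:
    # 1. Explicit override via the GITHUB_USER_SSH_KEY environment variable
    env_key = os.environ.get("GITHUB_USER_SSH_KEY")
    if env_key:
        return Path(env_key).expanduser()
    # 2. First matching default key in ~/.ssh/
    for name in DEFAULT_KEY_NAMES:
        candidate = Path.home() / ".ssh" / name
        if candidate.exists():
            return candidate
    return None
```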

### Environment Variables

Environment variables can be set directly or via a `.env` file in the project root.

All environment variables prefixed with `OPENHANDS_` are forwarded into the Docker container with the prefix stripped. For example, `OPENHANDS_ANTHROPIC_API_KEY` becomes `ANTHROPIC_API_KEY` inside the container. This is how you pass LLM API keys and other credentials to the agent.

```bash
export OPENHANDS_ANTHROPIC_API_KEY=sk-xxx
export OPENHANDS_OPENAI_API_KEY=sk-xxx
export OPENHANDS_GOOGLE_APPLICATION_CREDENTIALS='{"type":"service_account",...}'
```
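
Conceptually, the forwarding amounts to stripping the prefix before the variables are injected into the container. A rough sketch (the helper name is hypothetical, not the actual implementation):

```python
import os


def forwarded_container_env(prefix: str = "OPENHANDS_") -> dict[str, str]:
    # OPENHANDS_ANTHROPIC_API_KEY -> ANTHROPIC_API_KEY, and so on
    return {
        name[len(prefix):]: value
        for name, value in os.environ.items()
        if name.startswith(prefix) and name != prefix
    }
```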

| Variable | Description |
|----------|-------------|
| `OPENHANDS_*` | Forwarded into the container with prefix stripped (LLM keys, credentials, etc.) |
| `GITHUB_USER_SSH_KEY` | Path to SSH key for private repos |
| `SKIP_BUILD` | Set to `1` to skip Docker image building during inference (default: `1`) |

## Evaluation

After running inference, evaluate the generated patches:

```bash
uv run swesmith-eval output.jsonl \
--run-id my_eval \
--dataset /path/to/task_instances.json
```

**Advanced options:**

```bash
# Faster evaluation using only fail-to-pass tests
uv run swesmith-eval output.jsonl \
--run-id my_eval \
--dataset /path/to/task_instances.json \
--f2p-only

# Re-evaluate failed/errored instances
uv run swesmith-eval output.jsonl \
--run-id my_eval \
--dataset /path/to/task_instances.json \
--redo-existing

# Only regenerate the report from existing evaluation logs
uv run swesmith-eval output.jsonl \
--run-id my_eval \
--dataset /path/to/task_instances.json \
--report-only
```

## Output Structure

```
eval_outputs/
└── <dataset>-<split>/
└── <model>/
├── output.jsonl # Main results
├── output.critic_attempt_N.jsonl # Per-attempt results
├── output.swesmith.jsonl # SWE-Smith format predictions
├── output.report.json # Evaluation report (SWE-Smith format)
├── cost_report.jsonl # Token usage and cost
└── conversations/ # Per-instance conversation logs
└── <instance_id>.tar.gz
```

**Inference result** (`output.jsonl`, one entry per line):

```json
{
"instance_id": "encode__httpx.ae1b9f66.lm_modify__abc123",
"attempt": 1,
"test_result": {
"git_patch": "diff --git a/file.py b/file.py\n..."
},
"instruction": "...",
"history": [],
"metrics": {},
"error": null
}
```
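
Since the file is JSONL, results can be scanned line by line, for example to spot errored instances or empty patches (an illustrative snippet using only the fields shown above):

```python
import json

with open("output.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        patch = entry["test_result"]["git_patch"]
        # Flag instances that errored out or produced no patch
        if entry["error"] is not None or not patch.strip():
            print("needs attention:", entry["instance_id"])
```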

**Evaluation report** (`output.report.json`) follows the SWE-Smith report format:

```json
{
"resolved": 5,
"unresolved": 3,
"total": 8,
"ids_resolved": ["instance_1", "..."],
"ids_unresolved": ["instance_3", "..."]
}
```
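
The resolve rate follows directly from these counts. A small sketch for the example report above:

```python
import json
from pathlib import Path

report = json.loads(Path("output.report.json").read_text())
rate = report["resolved"] / report["total"]  # 5 / 8 = 0.625 for the report above
print(f"Resolve rate: {rate:.1%} ({report['resolved']}/{report['total']})")
```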

## Custom Repository Profiles

To add a custom repository, define a profile class in `profiles.py`:

```python
@dataclass
class MyRepoBcd12345(PythonProfile):
owner: str = "github-org"
repo: str = "my-repo"
commit: str = "bcd1234567890"
org_gh: str = "org-swesmith"
```

Profiles are auto-registered on import. For Go repositories, inherit from `GoProfile` instead.
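
For example, a Go profile would mirror the Python one above (illustrative; the `GoProfile` fields are assumed to match `PythonProfile`):

```python
@dataclass
class MyGoRepoAbc98765(GoProfile):
    owner: str = "github-org"
    repo: str = "my-go-repo"
    commit: str = "abc9876543210"
    org_gh: str = "org-swesmith"
```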

Note that custom repository profiles are not fully supported for SWE-Smith yet, since the primary purpose of this directory is the fast and smooth creation of Docker images.

## References

- [SWE-Smith Paper](https://arxiv.org/abs/2504.21798)
- [SWE-Smith GitHub](https://github.com/SWE-bench/SWE-smith)
- [SWE-Smith Dataset on HuggingFace](https://huggingface.co/datasets/SWE-bench/SWE-smith)
15 changes: 15 additions & 0 deletions benchmarks/swesmith/config.py
@@ -0,0 +1,15 @@
"""
SWE-Smith benchmark configuration.
"""

# Inference defaults (used by run_infer.py)
INFER_DEFAULTS = {
"dataset": "SWE-bench/SWE-smith-py",
"split": "train",
"num_workers": 4,
}

# Evaluation defaults (used by eval_infer.py)
EVAL_DEFAULTS = {
"workers": 4,
}
28 changes: 28 additions & 0 deletions benchmarks/swesmith/constants.py
@@ -0,0 +1,28 @@
"""
SWE-Smith hyperparameters and constant values.
"""

from typing import Final, Literal


# Build target type (matches openhands.agent_server.docker.build.TargetType)
TargetType = Literal["binary", "binary-minimal", "source", "source-minimal"]
BUILD_TARGET_SOURCE_MINIMAL: Final[TargetType] = "source-minimal"
BUILD_TARGET_BINARY: Final[TargetType] = "binary"
DEFAULT_BUILD_TARGET: Final[TargetType] = BUILD_TARGET_SOURCE_MINIMAL

# Runtime
DEFAULT_RUNTIME_API_URL: Final[str] = "https://runtime.eval.all-hands.dev"
DEFAULT_REMOTE_RUNTIME_STARTUP_TIMEOUT: Final[int] = 600

# Git
GIT_USER_EMAIL: Final[str] = "evaluation@openhands.dev"
GIT_USER_NAME: Final[str] = "OpenHands Evaluation"
GIT_COMMIT_MESSAGE: Final[str] = "patch"

# Patch Processing
SETUP_FILES_TO_REMOVE: Final[tuple[str, ...]] = (
"pyproject.toml",
"tox.ini",
"setup.py",
)