From e6361a41b97f3c35251c8e75f77b80c19c381b54 Mon Sep 17 00:00:00 2001 From: Bogdan-Alexandru Stoica Date: Thu, 20 Nov 2025 15:02:48 -0600 Subject: [PATCH 01/14] refactor: reorganizing and improving the step-by-step guidelines for adding new artifacts to ArtEvalBench --- benchmarks/arteval_bench/README.md | 153 +++++++++++++++++------------ benchmarks/arteval_bench/WHY.md | 43 ++++++++ 2 files changed, 134 insertions(+), 62 deletions(-) create mode 100644 benchmarks/arteval_bench/WHY.md diff --git a/benchmarks/arteval_bench/README.md b/benchmarks/arteval_bench/README.md index f6711ca6..014922e1 100644 --- a/benchmarks/arteval_bench/README.md +++ b/benchmarks/arteval_bench/README.md @@ -1,71 +1,100 @@ # ArtEvalBench -`ArtEvalBench` is a benchmark for evaluating AI agents that support the Artifact Evaluation (AE) process by auditing research prototypes (artifacts) that accompany research papers, as part of the peer-review process. Artifact evaluation involves reconstructing a reference environment from (partial) specifications, building and configuring complex codebases with often implicit assumptions, preparing datasets and third-party benchmarks whose availability may change over time, orchestrating multi-stage experiments under controlled resource and time budgets, and validating that observed results fall within acceptable tolerance bounds relative to those reported in the paper. Despite the intricacy of the process, we believe AI agents can be trained to support reviewers in evaluating artifacts that accompany research papers by automating most of these stages. - -Want to find out more or contribute? Jump to the [contributor's guide](#contributors-guide). - -## Goals and Objectives - -Artifact evaluation has become a standard component of the peer-review process across a wide range of conferences in Computer Science, especially in Systems and related areas. 
Despite this progress however, the practical work of provisioning operational environments, resolving dependencies, building artifacts, preparing benchmarks, running experiments, and checking results remains brittle and time-consuming. To alleviate this burden, we envision an automated artifact evaluation AI assistant that executes repeatable steps under (human) reviewer supervision. This "AE assistant" would target artifact mechanics (e.g., code compilation, dataset/benchmark preparation, experiment orchestration, and output validation) alongside code auditing (e.g., does the artifact implementation match the paper prose? are results closely matching those in the paper?). The agent's output can then inform more a complex methodological assessment, design trade-off analysis, and results interpretation that reviewers need to perform to complete the AE process. - -Concretely, given an artifact (code, documentation, experiment framework), a complete installation & operation guide, and the paper itself, the AE assistant: - -1. provisions the reference environment; - -2. builds/installs a particular version of the artifact using the specified toolchain; - -3. retrieves and prepares datasets or other third-party targets; - -4. orchestrates experiments with explicit configuration, time and resource budgets; and - -5. generates a human-readable report that summarizes the outcome of each step, indicating any blockers (e.g., install missing dependencies) and how it managed to overcome them. - -The goal is to reduce reviewer effort on mechanical tasks so attention can shift to scientific auditing. - -## Background - -#### » The artifact evaluation process - -Most conferences award badges to incentivize high-quality artifacts that support the paper's claims by asking authors to participate in a multi-stage evaluation process where reviewers attempt to download, install, and operate the artifacts themselves. 
The following summarizes the widely used criteria for each badge: - -* Artifact Available. This badge indicates that the artifact itself (code, documentation, scripts, benchmarks, etc.) is publicly accessible with a persistent identifier (e.g., DOI, commit ID) on an (ideally, long-term) archival repository (e.g., Zenodo, Github). Availability does not imply the artifact can compile, build, or is functionally correct. It only confirms that the materials needed to verify key claims, reproduce experimental results, and reuse the tool itself are open-sourced. - -* Artifact Functional. This badge indicates that the artifact installs/builds in a reference environment and runs at least a subset of the documented experiments. It confirms that dependencies and configurations are explicitly recorded, and outputs, at least for said subset of experiments, are consistent with the paper's prose. - -* Results Reproduced. This badge indicates that a third party can re-execute all necessary experiments to obtain results consistent with the paper, with a reasonable degree of tolerance (e.g., within relative error bounds, confidence intervals, or rank-ordering equivalence). On top of re-obtaining results that support the paper's claims, reproducibility further requires verifiable provenance (e.g., SW/HW environment characteristics, configuration parameters, experiment logs) and principled handling of non-determinism (e.g., repeated trials, fixed initial states, or variance analysis). - -Further reading and a detailed description of criteria for each badge can be found [here](https://sysartifacts.github.io/eurosys2026/badges) and [here](https://sysartifacts.github.io/evaluator-guide.html). - -#### » What makes AE challenging in practice? 
- -Reproducibility and reusability can be obstructed by multiple factors including, but not limited to: (i) environment drift (e.g., legacy libraries no longer available, drivers mismatch in newer OS versions); (ii) undocumented or implicit build assumptions (e.g., hard-coded compiler flags, directory paths, IPs, or reliance on OS-wide libraries that differ across distributions); (iii) brittle preprocessing of third-party benchmarks or datasets (e.g., broken download URL, non-deterministic compilation steps that silently invalidate subsequent stages); and (iv) unspecified results tolerance bounds that complicate validation for non-deterministic experiments (e.g., performance claims without clarifying what constitutes an acceptable deviation when running within a similar SW/HW setup). - -Overcoming such challenges require persistence and careful bookkeeping, precisely where an automated AE assistant can provide leverage. +`ArtEvalBench` is a benchmark for evaluating AI agents that support the Artifact Evaluation (AE) process by auditing research prototypes (artifacts) that accompany research papers, as part of the peer-review process ([why artifact evaluation?](WHY.md)). Artifact evaluation involves reconstructing a reference environment from (partial) specifications, building and configuring complex codebases with often implicit assumptions, preparing datasets and third-party benchmarks whose availability may change over time, orchestrating multi-stage experiments under controlled resource and time budgets, and validating that observed results fall within acceptable tolerance bounds relative to those reported in the paper. Despite the intricacy of the process, we believe AI agents can be trained to support reviewers in evaluating artifacts that accompany research papers by automating most of these stages. 
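To make the last of these validation steps concrete, a minimal tolerance check can compare relative error against a threshold. The sketch below is illustrative only (the 5% `rel_tol` default and the throughput numbers are assumptions, not values prescribed by the benchmark):

```python
def within_tolerance(observed: float, reported: float, rel_tol: float = 0.05) -> bool:
    """Check that a reproduced result stays within rel_tol of the reported one."""
    if reported == 0.0:
        # Fall back to an absolute comparison when the reported value is zero.
        return abs(observed) <= rel_tol
    return abs(observed - reported) / abs(reported) <= rel_tol

# e.g., a reproduced throughput of 9,700 ops/s against a reported 10,000 ops/s
print(within_tolerance(9700, 10000))  # True: 3% relative error is within 5%
```

Real artifacts may instead use confidence intervals or rank-ordering equivalence, as discussed below; a relative-error bound is simply the most common starting point.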
## Contributor's guide #### » Overview and high-level structure -To train and improve AE agents in a principled way we introduce `ArtEvalBench`, a curated collection of artifacts accompanying peer-reviewed papers. To ensure a fair comparison we include artifacts that have been already evaluated in an official AE process and awarded all three badges by the committee. Each entry includes the original artifact (instructions, code, scripts, datasets/benchmarks, etc.), the original paper, and a collection of "oracle" scripts that define objective checkpoints at four canonical stages: environment setup, build/install, benchmark preparation, and experiment execution. +To train and improve AE agents in a principled way, we introduce `ArtEvalBench`, a curated collection of artifacts accompanying peer-reviewed papers. To ensure a fair comparison, we include artifacts that have already been evaluated in an official AE process and awarded all three badges by the committee. Each entry includes the original artifact (instructions, code, scripts, datasets/benchmarks, etc.), the original paper, and a collection of "oracle" scripts that define objective checkpoints at four canonical stages: environment setup, build/install, benchmark preparation, and experiment execution. `ArtEvalBench` is designed to evaluate agents on capability (which stages they complete), efficiency (wall-clock time and intervention count), and fidelity (how closely reproduced results match those reported). -To check those capabilities, each artifact includes four oracle scripts that encode minimal, verifiable success criteria for each of the four stages. The oracles are invoked non-interactively and must be idempotent. Conceptually, these for stages correspond to: - -1. 
Environment Setup: verifies presence and versions of required tools, libraries, or other dependencies; confirms hardware availability when applicable; and checks that configurations are portable rather than hardcoded or tied to a specific machine.
-2. Build/Install: confirms a complete build (or install) operation from a specified version, with expected binaries/modules present; running tests, when available, or simple validation commands like invoking `--help` or equivalent.
-3. Benchmark Preparation: asserts that datasets/benchmarks are present and checksums match; verifies that necessary third-party tools compile and the artifact's instrumentation/monitoring hooks are enabled, if applicable.
-4. Experiment Runs: executes each experiment according to the authors' guidelines; checks that the artifact produces the expected metrics, logs, files, figures, etc.; provides an initial assessment relative to specified tolerance bounds.
+To check those capabilities, each artifact includes four oracle scripts that encode minimal, verifiable success criteria for each of the four stages. The oracles are invoked non-interactively and must be idempotent. Conceptually, these four stages correspond to:

-For a typical example, check out the [agent evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval/) of [WASABI](data/benchmark/sosp24_wasabi/wasabi/).

+1. **Environment setup.** Verifies presence and versions of required tools, libraries, or other dependencies; confirms hardware availability when applicable; and checks that configurations are portable rather than hardcoded or tied to a specific machine.
+2. **Build (and install) the artifact.** Confirms a complete build (or install) operation from a specified version, with the expected binaries/modules present; runs tests, when available, or simple validation commands such as invoking `--help` or equivalent.
+3. 
**Benchmark preparation.** Asserts that datasets/benchmarks are present and checksums match; verifies that necessary third-party tools compile and the artifact's instrumentation/monitoring hooks are enabled, if applicable.
+4. **Experiment runs.** Executes each experiment according to the authors' guidelines; checks that the artifact produces the expected metrics, logs, files, figures, etc.; provides an initial assessment relative to specified tolerance bounds.

#### » Adding a new artifact

Adding to the benchmark requires users to include a new entry into the `ArtEvalBench` [schema file](data/benchmark/arteval_tasks.jsonl), where:
+- `artifact_id` is a unique identifier for the artifact;
+- `artifact_dir` is the artifact directory within `data/benchmark/`;
+- `artifact_readme` is the path to the artifact's README file that contains the step-by-step guide for preparing, installing, and running experiments;
+- `artifact_url` is the URL to the original artifact;
+- `evaluator` is the path to the evaluator's `main.py` entrypoint;
+- `expected_score` is the total expected score for this artifact, which defaults to 4 because the agent is evaluated on successfully completing the four canonical AE stages (note: users are encouraged not to change this value unless they opt for another universal metric for artifact evaluation).
+
+It also requires users to extend the artifact they plan to add with a self-contained evaluator in an `_agent_eval/` directory. This evaluator encodes *minimal*, objective success criteria for the four canonical AE stages and is what the benchmark actually calls.
+
+Using WASABI's [agent evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval/) as a template, users will therefore need to extend the artifact with:
+
+1. An `_agent_eval/` package which contains all benchmark-specific code and does *not* modify your original artifact logic.
+
+2. 
One oracle module per stage, implemented in four distinct Python files, each checking one of the four canonical stages of artifact evaluation. A typical oracle module looks as follows (simplified):
+   ```python
+   # _agent_eval/env_setup.py
+   import subprocess
+   from pathlib import Path
+
+   def check() -> bool:
+       # Example: verify virtualenv exists
+       if not Path("venv").exists():
+           print("Missing venv/ directory")
+           return False
+
+       # Example: verify Python version inside the venv
+       proc = subprocess.run(
+           ["venv/bin/python", "--version"],
+           capture_output=True,
+           text=True,
+       )
+       print(proc.stdout.strip())
+       return proc.returncode == 0 and proc.stdout.startswith("Python 3.10")
+   ```
+   Also, note that each oracle should be:
+   - Non-interactive, meaning it does not expect input or prompt interactions.
+   - Idempotent, meaning it is safe to run multiple times without side-effects.
+   - Self-reporting, meaning it returns `True` or `False` based on the validation outcome and prints a brief diagnostic message.
+
+3. A single `main.py` orchestrator, the entrypoint used by ArtEvalBench, which invokes the four oracle modules, runs them in order, and returns an overall score (an integer between 0 and 4):
+   ```python
+   # _agent_eval/main.py
+   from . import env_setup, build_install, prep_benchmark, run_experiments
+
+   def main() -> int:
+       score = 0
+       stages = [
+           ("env_setup", env_setup.check),
+           ("build_install", build_install.check),
+           ("prep_benchmark", prep_benchmark.check),
+           ("run_experiments", run_experiments.check),
+       ]
+
+       for name, check in stages:
+           try:
+               ok = bool(check())
+           except Exception as e:
+               print(f"[{name}] FAILED with exception: {e}")
+               ok = False
+
+           if ok:
+               print(f"[{name}] PASSED")
+               score += 1
+           else:
+               print(f"[{name}] FAILED")
+
+       print(f"FINAL_SCORE {score}/4")
+       return score
+
+   if __name__ == "__main__":
+       raise SystemExit(main())
+   ```
+
+   Note that the `ArtEvalBench` framework will invoke `main.py` to run the oracles in order, compute the agent's score for this particular artifact, and store it into a JSON file that aggregates these outcomes for the entire benchmark.

-1. Create a stand-alone directory in `./data/benchmark` and copying all artifact files including the README file.
-2. Implement oracles for evaluating the AI agent. This feature should follow the same structure as Wasabi's [evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval/), where each oracle is implemented in a separate Python source file and orchestrated by a `main.py` whose `main()` method returns a single integer, the overal score (0..4) the agent achieved.
-3. Create an entry into the [task journal](data/benchmark/arteval_tasks.jsonl) and populate the appropriate fields.

## Benchmark Setup

#### » Install dependencies

To install the benchmark, simply run the `install.sh` script to set up the environment:
 ```sh
 ./install.sh
 ```

- This operaiton will:
- * Install Python 3.12 virtual environment
- * Clone and install SWE-agent
- * Install required Python packages (pytest, pytest-cov)
- * Clone course repositories (6.5840-golabs-2024, xv6-labs-2024, etc.)
+ This operation will: + - Install Python 3.12 virtual environment + - Clone and install SWE-agent + - Install required Python packages (pytest, pytest-cov) + - Clone course repositories (6.5840-golabs-2024, xv6-labs-2024, etc.) #### » Run the benchmark @@ -104,8 +133,8 @@ To run the benchmark: #### » Supported Agents The benchmark supports multiple AI agents: -* **Claude Code**: Anthropic's code assistant -* **Mini SWE Agent**: The compact version of [SWE-agent](https://github.com/SWE-agent) assistant -* **OpenHands**: Open-source coding agent +- **Claude Code**: Anthropic's code assistant +- **Mini SWE Agent**: The compact version of [SWE-agent](https://github.com/SWE-agent) assistant +- **OpenHands**: Open-source coding agent -To add your own agent to the benchmark, see [add_agents.md](add_agents.md). +To add your own agent to the benchmark, see [add_agents.md](add_agents.md). \ No newline at end of file diff --git a/benchmarks/arteval_bench/WHY.md b/benchmarks/arteval_bench/WHY.md new file mode 100644 index 00000000..2c639b00 --- /dev/null +++ b/benchmarks/arteval_bench/WHY.md @@ -0,0 +1,43 @@ +# Why Artifact Evaluation as an AI Training Task? + +`ArtEvalBenc`h` treats the artifact evaluation (AE) process as a training ground for AI agents to help form core [system intelligence capabilites](https://www.sigops.org/2025/defining-system-intelligence/). During AE, reviewers must reconstruct a target environment from incomplete specifications, build and configure complex software stacks with many implicit assumptions, prepare datasets and external benchmarks whose availability can change over time, run multi-stage experiments under strict resource and time constraints, and verify that reproduced results stay within acceptable margins of those reported in the paper. This makes AE a rich, realistic testbed for AI: agents must reason across all these steps, yet we believe they can be trained to reliably assist reviewers by automating most of this process. 
+
+Want to find out more or contribute? Take a look at our [contributor's guide](README.md).
+
+## Goals and Objectives
+
+Artifact evaluation has become a standard component of the peer-review process across a wide range of conferences in Computer Science, especially in Systems and related areas. Despite this progress, however, the practical work of provisioning operational environments, resolving dependencies, building artifacts, preparing benchmarks, running experiments, and checking results remains brittle and time-consuming. To alleviate this burden, we envision an automated artifact evaluation AI assistant that executes repeatable steps under (human) reviewer supervision. This "AE assistant" would target artifact mechanics (e.g., code compilation, dataset/benchmark preparation, experiment orchestration, and output validation) alongside code auditing (e.g., does the artifact implementation match the paper prose? are results closely matching those in the paper?). The agent's output can then inform the more complex methodological assessment, design trade-off analysis, and results interpretation that reviewers need to perform to complete the AE process.
+
+Concretely, given an artifact (code, documentation, experiment framework), a complete installation & operation guide, and the paper itself, the AE assistant:
+
+1. provisions the reference environment;
+
+2. builds/installs a particular version of the artifact using the specified toolchain;
+
+3. retrieves and prepares datasets or other third-party targets;
+
+4. orchestrates experiments with explicit configuration, time and resource budgets; and
+
+5. generates a human-readable report that summarizes the outcome of each step, indicating any blockers (e.g., missing dependencies) and how it managed to overcome them.
+
+The goal is to reduce reviewer effort on mechanical tasks so attention can shift to scientific auditing.
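As a rough illustration, step 5 can boil down to a structured per-stage log rendered as a short, human-readable summary (a hypothetical sketch; the stage names, fields, and the sample blocker are ours, not a format the assistant is required to use):

```python
import json

# Hypothetical outcomes an AE assistant might record for steps 1-4 (illustrative values).
report = [
    {"stage": "provision environment", "status": "ok", "blockers": []},
    {"stage": "build artifact", "status": "ok", "blockers": ["installed missing dependency: cmake"]},
    {"stage": "prepare datasets", "status": "ok", "blockers": []},
    {"stage": "run experiments", "status": "ok", "blockers": []},
]

# Step 5: a human-readable summary, one line per stage, noting resolved blockers.
for entry in report:
    note = f" (resolved: {'; '.join(entry['blockers'])})" if entry["blockers"] else ""
    print(f"{entry['stage']}: {entry['status']}{note}")

# A machine-readable copy can be kept alongside for auditing.
summary_json = json.dumps(report, indent=2)
```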
+ +## Background + +#### » The artifact evaluation process + +Most conferences award badges to incentivize high-quality artifacts that support the paper's claims by asking authors to participate in a multi-stage evaluation process where reviewers attempt to download, install, and operate the artifacts themselves. The following summarizes the widely used criteria for each badge: + +* Artifact Available. This badge indicates that the artifact itself (code, documentation, scripts, benchmarks, etc.) is publicly accessible with a persistent identifier (e.g., DOI, commit ID) on an (ideally, long-term) archival repository (e.g., Zenodo, Github). Availability does not imply the artifact can compile, build, or is functionally correct. It only confirms that the materials needed to verify key claims, reproduce experimental results, and reuse the tool itself are open-sourced. + +* Artifact Functional. This badge indicates that the artifact installs/builds in a reference environment and runs at least a subset of the documented experiments. It confirms that dependencies and configurations are explicitly recorded, and outputs, at least for said subset of experiments, are consistent with the paper's prose. + +* Results Reproduced. This badge indicates that a third party can re-execute all necessary experiments to obtain results consistent with the paper, with a reasonable degree of tolerance (e.g., within relative error bounds, confidence intervals, or rank-ordering equivalence). On top of re-obtaining results that support the paper's claims, reproducibility further requires verifiable provenance (e.g., SW/HW environment characteristics, configuration parameters, experiment logs) and principled handling of non-determinism (e.g., repeated trials, fixed initial states, or variance analysis). 
+ +Further reading and a detailed description of criteria for each badge can be found [here](https://sysartifacts.github.io/eurosys2026/badges) and [here](https://sysartifacts.github.io/evaluator-guide.html). + +#### » What makes AE challenging in practice? + +Reproducibility and reusability can be obstructed by multiple factors including, but not limited to: (i) environment drift (e.g., legacy libraries no longer available, drivers mismatch in newer OS versions); (ii) undocumented or implicit build assumptions (e.g., hard-coded compiler flags, directory paths, IPs, or reliance on OS-wide libraries that differ across distributions); (iii) brittle preprocessing of third-party benchmarks or datasets (e.g., broken download URL, non-deterministic compilation steps that silently invalidate subsequent stages); and (iv) unspecified results tolerance bounds that complicate validation for non-deterministic experiments (e.g., performance claims without clarifying what constitutes an acceptable deviation when running within a similar SW/HW setup). + +Overcoming such challenges require persistence and careful bookkeeping, precisely where an automated AE assistant can provide leverage. 
\ No newline at end of file From 2523106d261e3401ba46e1980de8f6da1140d4ef Mon Sep 17 00:00:00 2001 From: Bogdan-Alexandru Stoica Date: Thu, 20 Nov 2025 15:08:59 -0600 Subject: [PATCH 02/14] fix: add a brief explanation for 'docker_env' schema field --- benchmarks/arteval_bench/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/benchmarks/arteval_bench/README.md b/benchmarks/arteval_bench/README.md index 014922e1..be3fbf94 100644 --- a/benchmarks/arteval_bench/README.md +++ b/benchmarks/arteval_bench/README.md @@ -26,6 +26,7 @@ Adding to the benchmark requires users to include a new entry into `ArtEvalBench - `artifact_url` the URL to the original artifact; - `evaluator` is a path to the evaluator's `main.py` entrypoint; - `expected_score` is the total expected score for this artifact, which defaults to 4 as the agent is evaluated on it succesfully completing the four canonical AE stages ([!NOTE] Users are encouraged not to change this value, unless they opt for another universal metric for artifact evaluation). +- `docker_evn` (optional) points to a Docker image on Docker Hub. It also requires users to extend the artifact they plan to add with a self-contained evaluator in an `_agent_eval/` directory. This evaluator encodes *minimal*, objective success criteria for the four canonical AE stages and is what the benchmark actually calls. 
From e0059f3070646cca7a7b354e88649ccdf704ff6c Mon Sep 17 00:00:00 2001
From: Bogdan-Alexandru Stoica
Date: Thu, 20 Nov 2025 15:09:50 -0600
Subject: [PATCH 03/14] refactor: update the new schema naming convention

---
 .../arteval_bench/data/benchmark/arteval_tasks.jsonl | 2 +-
 benchmarks/arteval_bench/src/main.py | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl b/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl
index d274d5bb..df67b992 100644
--- a/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl
+++ b/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl
@@ -1 +1 @@
-{"task_id": "sosp24_wasabi", "task_file": "data/benchmark/sosp24_wasabi/wasabi/README.md", "repo_name": "sosp24_wasabi", "test_method": "data/benchmark/sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "test_results": "", "difficulty": "easy", "repo_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae"}
\ No newline at end of file
+{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "data/benchmark/sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "data/benchmark/sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docker_env": ""}
\ No newline at end of file
diff --git a/benchmarks/arteval_bench/src/main.py b/benchmarks/arteval_bench/src/main.py
index b4d40b70..07f0c23d 100644
--- a/benchmarks/arteval_bench/src/main.py
+++ b/benchmarks/arteval_bench/src/main.py
@@ -31,10 +31,10 @@ def main(file_path, model, agent, save_path):
             continue
 
         deployment = item.get('docker_env', None)
-        project_path = f"./data/benchmark/{item.get('repo_name', None)}"
-        task_file = item.get('task_file', None)
-        task_id = item.get('task_id', None)
-        test_method = item.get('test_method', None)
+        project_path = f"./data/benchmark/{item.get('artifact_dir', None)}"
+        task_file = 
item.get('artifact_readme', None)
+        task_id = item.get('artifact_id', None)
+        test_method = item.get('evaluator', None)
 
         task = get_task(task_file)

From c72230a209fc2169f6c14b1dd1f0a92f8d31777e Mon Sep 17 00:00:00 2001
From: Bogdan Alexandru Stoica
Date: Sat, 22 Nov 2025 12:39:44 -0600
Subject: [PATCH 04/14] fix: apply suggestions to WHY.md

Co-authored-by: Tarek Elsayed <60650661+tareknaser@users.noreply.github.com>
---
 benchmarks/arteval_bench/WHY.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/benchmarks/arteval_bench/WHY.md b/benchmarks/arteval_bench/WHY.md
index 2c639b00..27657fe1 100644
--- a/benchmarks/arteval_bench/WHY.md
+++ b/benchmarks/arteval_bench/WHY.md
@@ -1,6 +1,6 @@
 # Why Artifact Evaluation as an AI Training Task?
 
-`ArtEvalBenc`h` treats the artifact evaluation (AE) process as a training ground for AI agents to help form core [system intelligence capabilites](https://www.sigops.org/2025/defining-system-intelligence/). During AE, reviewers must reconstruct a target environment from incomplete specifications, build and configure complex software stacks with many implicit assumptions, prepare datasets and external benchmarks whose availability can change over time, run multi-stage experiments under strict resource and time constraints, and verify that reproduced results stay within acceptable margins of those reported in the paper. This makes AE a rich, realistic testbed for AI: agents must reason across all these steps, yet we believe they can be trained to reliably assist reviewers by automating most of this process.
+`ArtEvalBench` treats the artifact evaluation (AE) process as a training ground for AI agents to help form core [system intelligence capabilities](https://www.sigops.org/2025/defining-system-intelligence/). 
During AE, reviewers must reconstruct a target environment from incomplete specifications, build and configure complex software stacks with many implicit assumptions, prepare datasets and external benchmarks whose availability can change over time, run multi-stage experiments under strict resource and time constraints, and verify that reproduced results stay within acceptable margins of those reported in the paper. This makes AE a rich, realistic testbed for AI: agents must reason across all these steps, yet we believe they can be trained to reliably assist reviewers by automating most of this process.

From 77b6e7fcab4c4e4b93ba2ae722915cb059b2734f Mon Sep 17 00:00:00 2001
From: Bogdan Alexandru Stoica
Date: Sat, 22 Nov 2025 12:40:37 -0600
Subject: [PATCH 05/14] fix: a few typos in WHY.md

Co-authored-by: Tarek Elsayed <60650661+tareknaser@users.noreply.github.com>
---
 benchmarks/arteval_bench/WHY.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/benchmarks/arteval_bench/WHY.md b/benchmarks/arteval_bench/WHY.md
index 27657fe1..ebff367d 100644
--- a/benchmarks/arteval_bench/WHY.md
+++ b/benchmarks/arteval_bench/WHY.md
@@ -2,7 +2,7 @@
 
 `ArtEvalBench` treats the artifact evaluation (AE) process as a training ground for AI agents to help form core [system intelligence capabilities](https://www.sigops.org/2025/defining-system-intelligence/). 
This makes AE a rich, realistic testbed for AI: agents must reason across all these steps, yet we believe they can be trained to reliably assist reviewers by automating most of this process. -Want to find out more or contribute? Take a look at our [contributor's guide](README.md). +Want to find out more or contribute? Take a look at our [contributor's guide](README.md#contributors-guide). ## Goals and Objectives From c207e5904232a99aeee211a3e242b2c5e246a969 Mon Sep 17 00:00:00 2001 From: Bogdan Alexandru Stoica Date: Sun, 23 Nov 2025 00:58:10 -0600 Subject: [PATCH 06/14] refactor: minor formatting and style improvements --- benchmarks/arteval_bench/WHY.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/benchmarks/arteval_bench/WHY.md b/benchmarks/arteval_bench/WHY.md index ebff367d..f787c888 100644 --- a/benchmarks/arteval_bench/WHY.md +++ b/benchmarks/arteval_bench/WHY.md @@ -38,6 +38,14 @@ Further reading and a detailed description of criteria for each badge can be fou #### » What makes AE challenging in practice? -Reproducibility and reusability can be obstructed by multiple factors including, but not limited to: (i) environment drift (e.g., legacy libraries no longer available, drivers mismatch in newer OS versions); (ii) undocumented or implicit build assumptions (e.g., hard-coded compiler flags, directory paths, IPs, or reliance on OS-wide libraries that differ across distributions); (iii) brittle preprocessing of third-party benchmarks or datasets (e.g., broken download URL, non-deterministic compilation steps that silently invalidate subsequent stages); and (iv) unspecified results tolerance bounds that complicate validation for non-deterministic experiments (e.g., performance claims without clarifying what constitutes an acceptable deviation when running within a similar SW/HW setup). 
+Reproducibility and reusability can be obstructed by multiple factors including, but not limited to:

-Overcoming such challenges require persistence and careful bookkeeping, precisely where an automated AE assistant can provide leverage.
\ No newline at end of file

+1. environment drift (e.g., legacy libraries no longer available, drivers mismatch in newer OS versions);
+
+2. undocumented or implicit build assumptions (e.g., hard-coded compiler flags, directory paths, IPs, or reliance on OS-wide libraries that differ across distributions);
+
+3. brittle preprocessing of third-party benchmarks or datasets (e.g., broken download URL, non-deterministic compilation steps that silently invalidate subsequent stages); and
+
+4. unspecified results tolerance bounds that complicate validation for non-deterministic experiments (e.g., performance claims without clarifying what constitutes an acceptable deviation when running within a similar SW/HW setup).
+
+Overcoming such challenges requires persistence and careful bookkeeping, precisely where an automated AE assistant can provide leverage. 
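Factor 3 above, for example, is commonly countered by pinning and verifying dataset checksums before anything downstream consumes the files. The snippet below is a minimal sketch (the file name and contents are illustrative stand-ins for a real downloaded dataset):

```python
import hashlib
from pathlib import Path

def verify_checksum(path: Path, expected_sha256: str) -> bool:
    """Guard against silently changed or corrupted third-party downloads."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == expected_sha256

# Illustrative usage with a small local stand-in for a downloaded dataset.
dataset = Path("dataset.bin")
dataset.write_bytes(b"benchmark data")
pinned = hashlib.sha256(b"benchmark data").hexdigest()
print(verify_checksum(dataset, pinned))  # True: the bytes match the pinned digest
```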
From db6a9473b48143128e056115cbf6288c91e49019 Mon Sep 17 00:00:00 2001 From: Bogdan Alexandru Stoica Date: Sun, 23 Nov 2025 01:15:41 -0600 Subject: [PATCH 07/14] fix: remove obsolete dependency installation instructions from README --- benchmarks/arteval_bench/README.md | 15 +-------------- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/benchmarks/arteval_bench/README.md b/benchmarks/arteval_bench/README.md index be3fbf94..200e61c2 100644 --- a/benchmarks/arteval_bench/README.md +++ b/benchmarks/arteval_bench/README.md @@ -99,19 +99,6 @@ Using WASABI's [agent evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval ## Benchmark Setup -#### » Install dependencies - -To install the benchmark, simply run the `install.sh` script to set up the environment: - ```sh - ./install.sh - ``` - - This operation will: - - Install Python 3.12 virtual environment - - Clone and install SWE-agent - - Install required Python packages (pytest, pytest-cov) - - Clone course repositories (6.5840-golabs-2024, xv6-labs-2024, etc.) - #### » Run the benchmark To run the benchmark: @@ -138,4 +125,4 @@ The benchmark supports multiple AI agents: - **Mini SWE Agent**: The compact version of [SWE-agent](https://github.com/SWE-agent) assistant - **OpenHands**: Open-source coding agent -To add your own agent to the benchmark, see [add_agents.md](add_agents.md). \ No newline at end of file +To add your own agent to the benchmark, see [add_agents.md](add_agents.md). 
From 8b05a67beaf8825b352cd0a0f195b4af30dae11d Mon Sep 17 00:00:00 2001
From: Bogdan Alexandru Stoica
Date: Sun, 23 Nov 2025 01:23:32 -0600
Subject: refactor: rework the first paragraph and fix minor text rendering issues

---
 benchmarks/arteval_bench/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/benchmarks/arteval_bench/README.md b/benchmarks/arteval_bench/README.md
index 200e61c2..9faf9a25 100644
--- a/benchmarks/arteval_bench/README.md
+++ b/benchmarks/arteval_bench/README.md
@@ -1,6 +1,6 @@
 # ArtEvalBench

-`ArtEvalBench` is a benchmark for evaluating AI agents that support the Artifact Evaluation (AE) process by auditing research prototypes (artifacts) that accompany research papers, as part of the peer-review process ([why artifact evaluation?](WHY.md)). Artifact evaluation involves reconstructing a reference environment from (partial) specifications, building and configuring complex codebases with often implicit assumptions, preparing datasets and third-party benchmarks whose availability may change over time, orchestrating multi-stage experiments under controlled resource and time budgets, and validating that observed results fall within acceptable tolerance bounds relative to those reported in the paper. Despite the intricacy of the process, we believe AI agents can be trained to support reviewers in evaluating artifacts that accompany research papers by automating most of these stages.
+`ArtEvalBench` is a benchmark for evaluating AI agents against Artifact Evaluation (AE) tasks ([why artifact evaluation?](WHY.md)). We believe that, despite the complexity of the AE process, AI agents can be successfully trained to automatically evaluate artifacts that accompany research papers. 
## Contributor's guide

@@ -25,7 +25,7 @@ Adding to the benchmark requires users to include a new entry into `ArtEvalBench
- `artifact_readme` is the path to the artifact's README file that contains the step-by-step guide for preparing, installing, and running experiments;
- `artifact_url` the URL to the original artifact;
- `evaluator` is a path to the evaluator's `main.py` entrypoint;
-- `expected_score` is the total expected score for this artifact, which defaults to 4 as the agent is evaluated on it succesfully completing the four canonical AE stages ([!NOTE] Users are encouraged not to change this value, unless they opt for another universal metric for artifact evaluation).
+- `expected_score` is the total expected score for this artifact, which defaults to 4 as the agent is evaluated on successfully completing the four canonical AE stages (!!NOTE!! We encourage users not to change this value, unless they opt for another universal metric for artifact evaluation).
- `docker_evn` (optional) points to a Docker image on Docker Hub.

It also requires users to extend the artifact they plan to add with a self-contained evaluator in an `_agent_eval/` directory. This evaluator encodes *minimal*, objective success criteria for the four canonical AE stages and is what the benchmark actually calls. 
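The task-entry schema described above is simple enough to sanity-check before dispatching an agent. The loader below is a hypothetical sketch (the function name, blank-line handling, and error messages are ours); only the field names come from the `arteval_tasks.jsonl` description in this README.

```python
import json

# Required field names, taken from the benchmark schema described above.
REQUIRED_FIELDS = ("artifact_readme", "artifact_url", "evaluator", "expected_score")

def load_tasks(path):
    """Parse a tasks JSONL file (one JSON object per line) and verify
    each entry carries the required schema fields."""
    tasks = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines between entries
            entry = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in entry]
            if missing:
                raise ValueError(f"{path}:{lineno}: missing fields {missing}")
            tasks.append(entry)
    return tasks
```

A check like this would reject a malformed entry up front instead of failing mid-evaluation; warning when `expected_score` deviates from the default of 4 could be added in the same pass, per the note above.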
From 07f76c09f9ee414c0f459215842572ffe17eb469 Mon Sep 17 00:00:00 2001 From: Bogdan-Alexandru Stoica Date: Mon, 24 Nov 2025 12:33:34 -0600 Subject: [PATCH 09/14] fix: clean-up repository, remove unnecessary or unused scripts --- benchmarks/arteval_bench/go-python.Dockerfile | 36 ------------ benchmarks/arteval_bench/install.sh | 57 ------------------- benchmarks/arteval_bench/requirements.txt | 5 -- 3 files changed, 98 deletions(-) delete mode 100644 benchmarks/arteval_bench/go-python.Dockerfile delete mode 100755 benchmarks/arteval_bench/install.sh delete mode 100644 benchmarks/arteval_bench/requirements.txt diff --git a/benchmarks/arteval_bench/go-python.Dockerfile b/benchmarks/arteval_bench/go-python.Dockerfile deleted file mode 100644 index af386424..00000000 --- a/benchmarks/arteval_bench/go-python.Dockerfile +++ /dev/null @@ -1,36 +0,0 @@ -FROM python:3.12.6 - -ARG DEBIAN_FRONTEND=noninteractive -ENV TZ=Etc/UTC - -WORKDIR / -ADD . . - -# SWE-ReX will always attempt to install its server into your docker container -# however, this takes a couple of seconds. If we already provide it in the image, -# this is much faster. 
-RUN pip install pipx -RUN pipx install swe-rex -RUN pipx ensurepath - -RUN pip install flake8 - -ENV GOLANG_VERSION=1.22.3 - -RUN apt-get update && apt-get install -y wget tar git build-essential \ - && wget https://go.dev/dl/go${GOLANG_VERSION}.linux-amd64.tar.gz \ - && tar -C /usr/local -xzf go${GOLANG_VERSION}.linux-amd64.tar.gz \ - && rm go${GOLANG_VERSION}.linux-amd64.tar.gz \ - && apt-get clean && rm -rf /var/lib/apt/lists/* - -ENV PATH="/usr/local/go/bin:${PATH}" - -RUN python --version && go version - -SHELL ["/bin/bash", "-c"] -# This is where pipx installs things -ENV PATH="$PATH:/root/.local/bin/" - -RUN python --version && go version - -CMD ["bash"] diff --git a/benchmarks/arteval_bench/install.sh b/benchmarks/arteval_bench/install.sh deleted file mode 100755 index ce58fe96..00000000 --- a/benchmarks/arteval_bench/install.sh +++ /dev/null @@ -1,57 +0,0 @@ -#!/bin/bash - -set -e # Exit immediately on error. - -docker --version -python3.12 -m venv .venv -# python3 -m venv .venvdoc -source .venv/bin/activate - -if [ ! -d "SWE-agent" ]; then - echo "==> Install SWE-agent and its dependencies..." - git clone https://github.com/SWE-agent/SWE-agent.git - cd SWE-agent - git checkout 0c27f286303a939aa868ad2003bc4b6776771791 - pip install --editable . - sweagent --help - cd .. -else - echo "==> SWE-agent repository already exists, skipping clone." -fi - -pip install -r requirements.txt -pip install pytest -pip install pytest-cov -deactivate - -echo "==> Setting up SystemCourseProject environment..." -cd data/benchmark/projects -if [ -d "test-repo" ]; then - echo "==> test-repo already exists, skipping clone." -else - echo "==> Cloning test-repo... " - git clone https://github.com/SWE-agent/test-repo.git -fi - -if [ -d "6.5840-golabs-2024" ]; then - echo "==> 6.5840-golabs-2024 already exists, skipping clone." -else - echo "==> Cloning 6.5840-golabs-2024..." 
- git clone git://g.csail.mit.edu/6.5840-golabs-2024
-fi
-
-if [ -d "xv6-labs-2024" ]; then
- echo "==> xv6-labs-2024 already exists, skipping clone."
-else
- echo "==> Cloning xv6-labs-2024..."
- git clone git://g.csail.mit.edu/xv6-labs-2024
-fi
-
-if [ -d "6.5840-golabs-2025" ]; then
- echo "==> 6.5840-golabs-2025 already exists, skipping clone."
-else
- echo "==> Cloning 6.5840-golabs-2025..."
- git clone git://g.csail.mit.edu/6.5840-golabs-2025
-fi
-
-echo "==> SystemCourseProject environment is set up successfully."
diff --git a/benchmarks/arteval_bench/requirements.txt b/benchmarks/arteval_bench/requirements.txt
deleted file mode 100644
index f5e49c23..00000000
--- a/benchmarks/arteval_bench/requirements.txt
+++ /dev/null
@@ -1,5 +0,0 @@
-sentence-transformers==4.0.1
-scikit-learn==1.6.1
-requests
-azure-identity
-litellm==1.77.5 \ No newline at end of file

From 16c35b0c87a851d252afb3e2bc71e01903540ae9 Mon Sep 17 00:00:00 2001
From: Bogdan-Alexandru Stoica
Date: Mon, 24 Nov 2025 14:45:23 -0600
Subject: fix: update Dockerfile and remove unused scripts

---
 benchmarks/arteval_bench/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/benchmarks/arteval_bench/Dockerfile b/benchmarks/arteval_bench/Dockerfile
index 3e1536e7..49ac4d5c 100644
--- a/benchmarks/arteval_bench/Dockerfile
+++ b/benchmarks/arteval_bench/Dockerfile
@@ -9,6 +9,6 @@ RUN apt-get update && apt-get install -y \
 python3-pip \
 python3-venv

-RUN chmod +x install.sh test.sh && ./install.sh
+RUN chmod +x test.sh

 ENTRYPOINT ["./test.sh"]

From f73cb6a7f2c5272d2a9a1e4b2cfa8e3ba2287113 Mon Sep 17 00:00:00 2001
From: Bogdan-Alexandru Stoica
Date: Mon, 24 Nov 2025 15:17:17 -0600
Subject: feature: add updated Docker image and environment bootstrap scripts

---
 benchmarks/arteval_bench/Dockerfile | 2 +-
 benchmarks/arteval_bench/install.sh | 19 +++++++++++++++++++
 benchmarks/arteval_bench/requirements.txt | 5 +++++
 benchmarks/arteval_bench/test.sh | 2 +-
 4
files changed, 26 insertions(+), 2 deletions(-) create mode 100755 benchmarks/arteval_bench/install.sh create mode 100644 benchmarks/arteval_bench/requirements.txt diff --git a/benchmarks/arteval_bench/Dockerfile b/benchmarks/arteval_bench/Dockerfile index 49ac4d5c..3e1536e7 100644 --- a/benchmarks/arteval_bench/Dockerfile +++ b/benchmarks/arteval_bench/Dockerfile @@ -9,6 +9,6 @@ RUN apt-get update && apt-get install -y \ python3-pip \ python3-venv -RUN chmod +x test.sh +RUN chmod +x install.sh test.sh && ./install.sh ENTRYPOINT ["./test.sh"] diff --git a/benchmarks/arteval_bench/install.sh b/benchmarks/arteval_bench/install.sh new file mode 100755 index 00000000..5eaa9831 --- /dev/null +++ b/benchmarks/arteval_bench/install.sh @@ -0,0 +1,19 @@ +#!/bin/bash + +set -e # Exit immediately on error. + +# if .venv does not exist, create it +if [ -d ".venv" ]; then + echo "==> .venv already exists, skipping creation." +else + echo "==> Creating .venv directory..." + + python3 -m venv .venv + source .venv/bin/activate + pip install -r requirements.txt + pip install pytest + pip install pytest-cov + deactivate +fi + +echo "==> ArtEvalBench environment is set up successfully." diff --git a/benchmarks/arteval_bench/requirements.txt b/benchmarks/arteval_bench/requirements.txt new file mode 100644 index 00000000..f5e49c23 --- /dev/null +++ b/benchmarks/arteval_bench/requirements.txt @@ -0,0 +1,5 @@ +sentence-transformers==4.0.1 +scikit-learn==1.6.1 +requests +azure-identity +litellm==1.77.5 \ No newline at end of file diff --git a/benchmarks/arteval_bench/test.sh b/benchmarks/arteval_bench/test.sh index 00820da4..9317b593 100755 --- a/benchmarks/arteval_bench/test.sh +++ b/benchmarks/arteval_bench/test.sh @@ -2,7 +2,7 @@ set -e # Exit immediately on error. 
-source envexamplebench/bin/activate +source .venv/bin/activate pytest --version pytest deactivate From 44c37be8e56b5635c20540a953c00b9a1cbb5a7b Mon Sep 17 00:00:00 2001 From: Bogdan-Alexandru Stoica Date: Mon, 1 Dec 2025 12:14:50 -0600 Subject: [PATCH 12/14] fix: few tweaks re installation and setup --- benchmarks/arteval_bench/Dockerfile | 32 +++++++++++++++++++++++------ benchmarks/arteval_bench/install.sh | 16 ++++++++++++--- 2 files changed, 39 insertions(+), 9 deletions(-) diff --git a/benchmarks/arteval_bench/Dockerfile b/benchmarks/arteval_bench/Dockerfile index 3e1536e7..9bc1b1a6 100644 --- a/benchmarks/arteval_bench/Dockerfile +++ b/benchmarks/arteval_bench/Dockerfile @@ -1,14 +1,34 @@ FROM ubuntu:24.04 - -WORKDIR /usr/src + +ARG DEBIAN_FRONTEND=noninteractive + +USER root + +WORKDIR / COPY . . -RUN apt-get update && apt-get install -y \ + +RUN rm -rf /var/lib/apt/lists/* \ + && apt-get update -o Acquire::Retries=5 \ + && apt-get install -y --no-install-recommends \ build-essential \ git \ wget \ python3-pip \ - python3-venv + python3-venv \ + pipx \ + && rm -rf /var/lib/apt/lists/* + +# SWE-ReX will always attempt to install its server into your docker container +# however, this takes a couple of seconds. If we already provide it in the image, +# this is much faster. +RUN pipx install swe-rex +RUN pipx ensurepath + +ENV PATH="/root/.local/bin:${PATH}" +ENV PATH="/usr/local/go/bin:${PATH}" + +SHELL ["/bin/bash", "-c"] RUN chmod +x install.sh test.sh && ./install.sh - -ENTRYPOINT ["./test.sh"] + +CMD ["bash"] \ No newline at end of file diff --git a/benchmarks/arteval_bench/install.sh b/benchmarks/arteval_bench/install.sh index 5eaa9831..c1060607 100755 --- a/benchmarks/arteval_bench/install.sh +++ b/benchmarks/arteval_bench/install.sh @@ -10,9 +10,19 @@ else python3 -m venv .venv source .venv/bin/activate - pip install -r requirements.txt - pip install pytest - pip install pytest-cov + + if [ ! 
-d "SWE-agent" ]; then + echo "==> Install SWE-agent and its dependencies..." + git clone https://github.com/SWE-agent/SWE-agent.git + cd SWE-agent + git checkout 0c27f286303a939aa868ad2003bc4b6776771791 + pip install --editable . + sweagent --help + cd .. + else + echo "==> SWE-agent repository already exists, skipping clone." + fi + deactivate fi From 2a8d374c606ec6734b3701e58a87b2013e8108cf Mon Sep 17 00:00:00 2001 From: Bogdan-Alexandru Stoica Date: Mon, 1 Dec 2025 12:20:17 -0600 Subject: [PATCH 13/14] refactor: add default Docker image, rewrite agent prompt, and remove ureachable code --- .../arteval_bench/src/run_eval_in_env.py | 67 ++++++------------- benchmarks/arteval_bench/src/utils.py | 9 +-- 2 files changed, 24 insertions(+), 52 deletions(-) diff --git a/benchmarks/arteval_bench/src/run_eval_in_env.py b/benchmarks/arteval_bench/src/run_eval_in_env.py index 2c814443..635b09d7 100644 --- a/benchmarks/arteval_bench/src/run_eval_in_env.py +++ b/benchmarks/arteval_bench/src/run_eval_in_env.py @@ -6,33 +6,12 @@ sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '../../../'))) -from swerex.deployment.docker import DockerDeployment +from swerex.deployment.docker import DockerDeploymentConfig from swerex.runtime.abstract import BashAction, Command, CreateBashSessionRequest, UploadRequest from sdk.logger import logger -def get_task(file_path): - """Get agent task from a file""" - task = (f"You are an experienced software engineer.\n" - + f"You are asked to follow the step-by-step instructions in README.md below to set-up," - + f"install, compile, and reproduce the results of Wasabi" - + f"Note that you are in a docker env with root access. If sudo is needed," - + f"please remove sudo command in the install file." - + f"Note that you can ignore branch siwitch instructions in the README as you are already" - + f"in the correct branch. So do not use git branch at all." 
- + f"\nBelow is the README of the artifact:\n\n") - - try: - with open(file_path, encoding='utf-8') as f: - lines = f.readlines() - task = task + "\n".join(lines) - except Exception as e: - logger.info(f'Error extracting task from {file_path}: {e}') - - return task - - def write_to_file(file_path, content): """Write content to a file.""" with open(file_path, 'w') as f: @@ -44,6 +23,11 @@ async def run_eval_in_env(deployment, project_path, task_id, task, model, agent_ await deployment.start() runtime = deployment.runtime + if hasattr(runtime, "_config"): + logger.info(f"Current RemoteRuntime timeout: {runtime._config.timeout}s") + runtime._config.timeout = 1800.0 + logger.info(f"Overriding RemoteRuntime timeout to {runtime._config.timeout}s") + # Issue a few one-off commands, similar to `subprocess.run()` logger.info(await runtime.execute(Command(command=['echo', 'Hello, world!']))) @@ -64,9 +48,12 @@ async def run_eval_in_env(deployment, project_path, task_id, task, model, agent_ ) ) logger.info('Project files uploaded.') - logger.info(await runtime.run_in_session(BashAction(command='ls /repo'))) - logger.info(await runtime.run_in_session(BashAction(command='cd /repo'))) - logger.info(await runtime.run_in_session(BashAction(command='ls'))) + run_results = await runtime.run_in_session(BashAction(command='cd /repo')) + logger.info(run_results) + run_results = await runtime.run_in_session(BashAction(command='pwd')) + logger.info(f'Current directory: {run_results}') + run_results = await runtime.run_in_session(BashAction(command='ls')) + logger.info(f'Current directory contents: {run_results}') logger.info('Uploading agent runner script...') logger.info( @@ -80,32 +67,16 @@ async def run_eval_in_env(deployment, project_path, task_id, task, model, agent_ logger.info(await runtime.run_in_session(BashAction(command='ls /agent/runner.sh'))) logger.info('Agent runner script uploaded.') - # logger.info("Test Python and Go environment...") - # logger.info(await 
runtime.run_in_session(BashAction(command='export PATH=/usr/local/go/bin:${PATH}'))) - # logger.info(await runtime.run_in_session(BashAction(command='export HOME=/tmp'))) - # logger.info(await runtime.run_in_session(BashAction(command='go version'))) - # logger.info(await runtime.run_in_session(BashAction(command='pip install pytest'))) - # logger.info(await runtime.run_in_session(BashAction(command="pytest -v"))) - logger.info('Setup the agent running environment...') logger.info(await runtime.run_in_session(BashAction(command='chmod +x /agent/runner.sh /agent/install.sh'))) logger.info(await runtime.run_in_session(BashAction(command='cat /agent/runner.sh'))) logger.info(await runtime.run_in_session(BashAction(command='/agent/install.sh'))) logger.info('Running runner script...') - run_results = await runtime.run_in_session(BashAction(command='pwd && ls && ls /agent')) - logger.info(f'Current directory: {run_results}') - run_results = await runtime.run_in_session(BashAction(command=f'/agent/runner.sh "{model}" "{task}"')) + run_results = await runtime.run_in_session(BashAction(command=f'/agent/runner.sh "{model}" "{task}"', timeout=1200.0)) logger.info(f"agent's run results: {run_results}") logger.info('Runner script finished.') - # logger.info('Copying outputs to save path...') - # a = await runtime.run_in_session(BashAction(command='cat agent_trajectory.json')) - # output_file = os.path.join(save_path, f'{task_id}_agent_trajectory.json') - # os.makedirs(os.path.dirname(output_file), exist_ok=True) - # write_to_file(output_file, a.output if hasattr(a, 'output') else str(a)) - # logger.info(f'Output saved to: {output_file}') - try: test_output = await runtime.run_in_session(BashAction(command=test_method)) logger.info(test_output) @@ -132,14 +103,20 @@ async def run_eval_in_env(deployment, project_path, task_id, task, model, agent_ def run_eval(deployment, project_path, task_id, task, model, agent_path, test_method, save_path): - deployment = ( - 
DockerDeployment(image=deployment) if deployment else DockerDeployment(image='xuafeng/swe-go-python:latest') + image = deployment or 'bastoica/ae-agent-ubuntu24.04:latest' + + config = DockerDeploymentConfig( + image=image, + startup_timeout=1200.0, ) + deployment_obj = config.get_deployment() + return asyncio.run( - run_eval_in_env(deployment, project_path, task_id, task, model, agent_path, test_method, save_path) + run_eval_in_env(deployment_obj, project_path, task_id, task, model, agent_path, test_method, save_path) ) + def test(): task = 'The java is not installed. Can you please setup it? Note: you are in a docker with root permission. DO NOT use sudo.' project_path = '../data/benchmark/projects/test-repo' diff --git a/benchmarks/arteval_bench/src/utils.py b/benchmarks/arteval_bench/src/utils.py index 4c1ad616..4d7730e4 100644 --- a/benchmarks/arteval_bench/src/utils.py +++ b/benchmarks/arteval_bench/src/utils.py @@ -15,11 +15,6 @@ def get_task(file_path): + f" instructions to set up, install, compile, and reproduce the results in" + f" that code repository. You have root access inside a Docker image, which" + f" means you can directly proceed with executing the steps in the README" - + f" without asking for approval or confirmation. If needed, proceed with" - + f" adjusting security settings to bypass any approval requirement in" - + f" in the Docker environment. Also, note that current agent doesn't support" - + f" interactive mode. When you want to run commands in docker, please do not" - + f" use `docker exec -it bash` since it will enter the interacrtive" - + f" mode. Instead, you can use `docker exec bash -c `" - + f" to run the command directly.") + + f" without asking for approval or confirmation. 
Once you reach the end"
+                + f" of the README you must exit the Docker image gracefully.")
    return task \ No newline at end of file

From 6409d2fe3dfc6d7b7fb69f4872b92b5b91b36e0e Mon Sep 17 00:00:00 2001
From: Bogdan-Alexandru Stoica
Date: Mon, 1 Dec 2025 12:21:43 -0600
Subject: fix: patch the benchmark schema file and improve Wasabi's README instructions

---
 .../arteval_bench/data/benchmark/arteval_tasks.jsonl | 2 +-
 .../arteval_bench/data/benchmark/env_setup_examples.jsonl | 3 ---
 .../data/benchmark/sosp24_wasabi/wasabi/README.md | 8 +++++---
 3 files changed, 6 insertions(+), 7 deletions(-)
 delete mode 100644 benchmarks/arteval_bench/data/benchmark/env_setup_examples.jsonl

diff --git a/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl b/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl
index df67b992..59b50f79 100644
--- a/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl
+++ b/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl
@@ -1 +1 @@
-{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "data/benchmark/sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "data/benchmark/sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docer_env": ""} \ No newline at end of file
+{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} \ No newline at end of file
diff --git a/benchmarks/arteval_bench/data/benchmark/env_setup_examples.jsonl b/benchmarks/arteval_bench/data/benchmark/env_setup_examples.jsonl
deleted file mode 100644
index 7a228a81..00000000
--- a/benchmarks/arteval_bench/data/benchmark/env_setup_examples.jsonl
+++ /dev/null
@@ -1,3 +0,0 @@
-{"task_id": "example_1", "task_name": "problems/test-repo-problems/1.md", "task": "set up the java environment", "repo_name": "projects/test-repo", "repo_url": "https://github.com/SWE-agent/test-repo.git", "test_method": "java -version", "test_results": "", "difficulty": "easy", "docker_env": "xuafeng/swe-go-python:latest"}
-{"task_id": "example_2", "task_name": "problems/test-repo-problems/2.md", "task": "set up the rust environment", "repo_name": "projects/test-repo", "repo_url": "https://github.com/SWE-agent/test-repo.git", "test_method": "rustc --version", "test_results": "", "difficulty": "easy", "docker_env": "xuafeng/swe-go-python:latest"}
-{"task_id": "example_3", "task_name": "problems/test-repo-problems/3.md", "task": "set up the nodejs environment", "repo_name": "projects/test-repo", "repo_url": "https://github.com/SWE-agent/test-repo.git", "test_method": "node -v", "test_results": "", "difficulty": "easy", "docker_env": "xuafeng/swe-go-python:latest"} \ No newline at end of file
diff --git a/benchmarks/arteval_bench/data/benchmark/sosp24_wasabi/wasabi/README.md b/benchmarks/arteval_bench/data/benchmark/sosp24_wasabi/wasabi/README.md
index 633b5c7a..050700df 100644
--- a/benchmarks/arteval_bench/data/benchmark/sosp24_wasabi/wasabi/README.md
+++ b/benchmarks/arteval_bench/data/benchmark/sosp24_wasabi/wasabi/README.md
@@ -4,11 +4,13 @@ The testing component of WASABI triggers retry bugs by using a combination of st

## 2. Getting Started

-To get started, users should create a new directory structure, clone this repository, work on the `main` branch of the repository, configure and install dependencies, by following these steps:
+To get started, users should create a new directory structure, clone this repository, work on the `main` branch of the repository, and configure and install dependencies.
+
+Start by checking that you have 'root' access to the system and installing `sudo` via `apt-get install`. Then, go through the following three steps:

1. 
If not already in place, create the appropriate directory structure:

-Note that your current working directory where the `README.md` is located id `~/sosp24_wasabi/benchmarks/wasabi`
+Note that your current working directory, where the `README.md` is located, is `~/sosp24_wasabi/wasabi`
```bash
mkdir -p ~/sosp24_wasabi/benchmarks
cd ~/sosp24_wasabi/
```
@@ -29,7 +31,7 @@ The working directory structure should look similar to the one below:
├── src/
└── utils/
```
-The `wasabi` directory contains the codebase of WASABI, while the `bugfinding` directory is where users can add applications that they want to use WASABI to find retry bugs.
+The `wasabi` directory contains the codebase of WASABI, while the `benchmarks` directory is where users can add applications that they want to use WASABI to find retry bugs.

2. Set up the `WASABI_ROOT_DIR` environment variable:
```