Refactor benchmark packaging/runtime: uv workspace, import cleanup, and docker unification by Acture · Pull Request #139 · sys-intelligence/system-intelligence-benchmark

Acture · 2026-03-03T10:44:53Z

Description

This PR restructures benchmark development and runtime around a uv workspace + multi-package model, removes sys.path-based import hacks, and unifies local/docker installation paths.

The goal is to make SDK + benchmark packaging reusable and predictable, while keeping benchmark scripts and Docker images aligned with the same dependency contract.

Changes

- Introduced a uv workspace at repo root and defined benchmark packages as workspace members.
- Added/updated pyproject.toml for all 8 benchmark packages so each benchmark is an installable package depending on system-intelligence-sdk.
- Kept SDK as a reusable package and updated packaging docs/CI for uv-based build flow.
- Removed sys.path.append/insert patterns from benchmark code/tests/docs and switched to package-safe imports/relative imports where applicable.
- Simplified benchmark install.sh / run.sh scripts:
  - removed parent-project invocation style (--project ../..)
  - standardized to local uv sync / uv run usage per benchmark directory
  - preserved special handling only where needed (e.g., sregym_core env).
- Reworked benchmark Dockerfiles to:
  - build wheels from workspace packages
  - install SDK wheel + benchmark wheel in image
  - avoid shell activate pitfalls by using --python .venv/bin/python
  - include clearer diagnostics for missing wheel artifacts
  - include required system tools for git-based dependencies.
- Added root .dockerignore to exclude local env/cache artifacts from image build context.
- Updated ArtEval dependency compatibility:
  - pinned sweagent to v1.1.0 tag
  - set arteval-bench requires-python to >=3.11 to match dependency constraints.

Testing

- Parsed all root/benchmark pyproject.toml files via Python tomllib.
- Ran shell syntax checks (bash -n) for modified benchmark install/run scripts.
- Validated Docker build flow iteratively with:
  - docker build --no-cache -t arteval_bench -f benchmarks/arteval_bench/Dockerfile .
- Confirmed branch commit structure after rebase/squash cleanup.

Checklist

- Tests pass locally (syntax/config-level checks and Docker build-path validation)
- Code follows project style guidelines
- Documentation updated (packaging/structure docs and workflow updates)

Copilot

Pull request overview

This PR refactors the repository into a uv workspace with the SDK and benchmarks as installable packages, removing sys.path-based import hacks and aligning local + Docker install/run flows around a consistent packaging contract.

Changes:

- Introduces a root uv workspace and packages the SDK (system-intelligence-sdk) plus benchmarks as workspace members.
- Updates benchmarks to use package-safe/relative imports where applicable and standardizes install.sh/run.sh around uv sync + uv run.
- Reworks multiple benchmark Dockerfiles to build/install wheels from the workspace and adds packaging/structure documentation + SDK packaging CI workflow.

Reviewed changes

Copilot reviewed 51 out of 52 changed files in this pull request and generated 11 comments.

File	Description
`sdk/__init__.py`	Adds SDK package `__version__` resolution via package metadata.
`sdk/README.md`	Documents SDK install/build commands with `uv`.
`pyproject.toml`	Defines `system-intelligence-sdk`, `uv_build` backend, and the root `uv` workspace + benchmark registry.
`doc/sdk_packaging.md`	Adds SDK packaging/build guidance.
`doc/project_structure.md`	Introduces canonical repo structure and boundary rules.
`doc/porting_benchmark.md`	Removes `sys.path` hacks from porting guidance.
`doc/creating_benchmark.md`	Removes `sys.path` hacks from benchmark creation guidance.
`benchmarks/toposense_bench/src/main.py`	Removes `sys.path` modification for SDK imports.
`benchmarks/toposense_bench/run.sh`	Switches execution to `uv run` and validates `.venv` presence.
`benchmarks/toposense_bench/pyproject.toml`	Adds benchmark package metadata and SDK dependency via workspace source.
`benchmarks/toposense_bench/install.sh`	Migrates install flow to `uv venv` + `uv sync`.
`benchmarks/sysmobench/tests/test_sysmobench.py`	Removes `sys.path` insertion for core imports.
`benchmarks/sysmobench/tests/test_sdk.py`	Removes `sys.path` insertion for SDK imports.
`benchmarks/sysmobench/src/main.py`	Converts local imports to relative imports and removes `sys.path` setup.
`benchmarks/sysmobench/src/executor.py`	Converts evaluator import to relative import.
`benchmarks/sysmobench/run.sh`	Uses `uv run` with module execution (`-m`).
`benchmarks/sysmobench/pyproject.toml`	Adds benchmark package metadata and SDK dependency via workspace source.
`benchmarks/sysmobench/install.sh`	Migrates to `uv` for env creation/sync + keeps editable install for `sysmobench_core`.
`benchmarks/sysmobench/Dockerfile`	Builds wheels in a builder stage and installs SDK + benchmark wheels in runtime image.
`benchmarks/sregym/src/main.py`	Removes `sys.path` usage; loads `sregym_core` entry via `importlib`.
`benchmarks/sregym/run.sh`	Runs via `uv run` using `sregym_core` venv python and sets `PYTHONPATH`.
`benchmarks/sregym/pyproject.toml`	Adds benchmark package metadata and SDK dependency via workspace source.
`benchmarks/sregym/install.sh`	Updates final dependency install step to `uv sync` for `sregym_core` venv.
`benchmarks/sregym/Dockerfile`	Builder-stage wheel build + runtime wheel install approach.
`benchmarks/example_bench/src/main.py`	Removes `sys.path` modification for SDK imports.
`benchmarks/example_bench/run.sh`	Uses `uv run` and validates `.venv` presence.
`benchmarks/example_bench/pyproject.toml`	Adds benchmark package metadata and SDK dependency via workspace source.
`benchmarks/example_bench/install.sh`	Migrates install flow to `uv venv` + `uv sync`.
`benchmarks/example_bench/Dockerfile`	Builder-stage wheel build + runtime wheel install approach.
`benchmarks/courselab_bench/pyproject.toml`	Switches build backend to `uv_build` and adds SDK dependency via workspace source.
`benchmarks/courseexam_bench/pyproject.toml`	Switches build backend to `uv_build` and adds SDK dependency via workspace source.
`benchmarks/cache_algo_bench/src/main.py`	Removes `sys.path` modification for SDK imports.
`benchmarks/cache_algo_bench/src/cache_simulator/cache/Cache.py`	Replaces local path hack with a relative import for `My`.
`benchmarks/cache_algo_bench/run.sh`	Uses `uv run` and validates `.venv` presence.
`benchmarks/cache_algo_bench/pyproject.toml`	Adds benchmark package metadata and SDK dependency via workspace source.
`benchmarks/cache_algo_bench/install.sh`	Migrates install flow to `uv venv` + `uv sync`.
`benchmarks/cache_algo_bench/Dockerfile`	Builder-stage wheel build + runtime wheel install approach.
`benchmarks/arteval_bench/src/evaluator/__init__.py`	Adds package marker for evaluator subpackage.
`benchmarks/arteval_bench/src/core/utils.py`	Removes `sys.path` manipulation.
`benchmarks/arteval_bench/src/core/run_eval_sweagent.py`	Removes `sys.path` manipulation and updates imports.
`benchmarks/arteval_bench/src/core/run_eval_in_env.py`	Removes `sys.path` manipulation.
`benchmarks/arteval_bench/src/core/main_patch.py`	Removes `sys.path` manipulation.
`benchmarks/arteval_bench/src/core/main.py`	Removes `sys.path` manipulation.
`benchmarks/arteval_bench/src/__init__.py`	Adds benchmark package marker.
`benchmarks/arteval_bench/run.sh`	Uses `uv run` and validates `.venv` presence; updates invoked entry script.
`benchmarks/arteval_bench/pyproject.toml`	Adds benchmark package metadata + sweagent pin and SDK dependency via workspace source.
`benchmarks/arteval_bench/install.sh`	Migrates install flow to `uv venv` + `uv sync`.
`benchmarks/arteval_bench/Dockerfile`	Builder-stage wheel build + runtime wheel install; adds diagnostic checks.
`README.md`	Links to new structure and SDK packaging docs.
`.gitignore`	Ignores build artifacts (`build/`, `dist/`, `*.egg-info/`).
`.github/workflows/sdk-package.yml`	Adds CI workflow to build and validate SDK distributions using `uv`.
`.dockerignore`	Excludes venv/cache/build artifacts and outputs from Docker build context.

Comments suppressed due to low confidence (2)

benchmarks/arteval_bench/src/core/utils.py:16

Typo in the generated task string: "rached" should be "reached".

        + f" without asking for approval or confirmation. Once you rached the end"
        + f" of the README you must exit the Docker image gracefully.")

benchmarks/sysmobench/tests/test_sysmobench.py:6

SYSMOBENCH_CORE is now unused after removing the sys.path insertion. Consider removing it (or using it in a more explicit install/validation check) to avoid confusing future readers about how tla_eval is discovered.

Copilot · 2026-03-03T10:57:06Z

benchmarks/arteval_bench/src/core/run_eval_sweagent.py


-sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '../../../')))
-
 from patch_evaluator import pacth_eval


run_eval_sweagent.py is inside the core package, but it imports a sibling module with from patch_evaluator import .... This relies on running the file as a script (or having src/core on sys.path) and can break once the benchmark is used as an installed package. Switch to an explicit relative import from the same package to make this workspace/packaging refactor fully sys.path-hack-free.

Suggested change

-from patch_evaluator import pacth_eval
+from .patch_evaluator import pacth_eval
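
To illustrate why the sibling import matters once the code is used as a package, the sketch below builds a throwaway package in a temporary directory and imports its `run_eval` module both ways. The package names `demo_abs` and `demo_rel` and the helper `import_outcome` are hypothetical; the temporary directory is put on `sys.path` only to stand in for site-packages after an install.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_outcome(pkg_name: str, import_line: str) -> str:
    """Create pkg_name/{__init__,patch_evaluator,run_eval}.py and import run_eval."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / pkg_name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "patch_evaluator.py").write_text("def pacth_eval():\n    return 'ok'\n")
        (pkg / "run_eval.py").write_text(import_line + "\n")

        sys.path.insert(0, tmp)  # stands in for site-packages after installation
        importlib.invalidate_caches()
        try:
            importlib.import_module(f"{pkg_name}.run_eval")
            return "imported"
        except ModuleNotFoundError as exc:
            return f"failed: {exc.name}"
        finally:
            # Undo the path change and drop the demo modules again.
            sys.path.remove(tmp)
            stale = [m for m in sys.modules if m == pkg_name or m.startswith(pkg_name + ".")]
            for name in stale:
                del sys.modules[name]


absolute = import_outcome("demo_abs", "from patch_evaluator import pacth_eval")
relative = import_outcome("demo_rel", "from .patch_evaluator import pacth_eval")
```

Under these assumptions the absolute form only resolves when the package directory itself is on `sys.path`, which is exactly the situation the removed `sys.path.append` used to create.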

Copilot · 2026-03-03T10:57:06Z

benchmarks/arteval_bench/pyproject.toml

+requires-python = ">=3.9"
+dependencies = [
+  "system-intelligence-sdk>=0.1.0",
+  "requests",
+  "azure-identity",
+  "sweagent @ git+https://github.com/SWE-agent/SWE-agent.git@v1.1.0",
+]


PR description says arteval-bench was updated to require Python >=3.11 due to dependency constraints, but this pyproject.toml still declares requires-python = ">=3.9". Align the metadata with the actual dependency requirements (or adjust dependencies) to avoid installs that succeed initially but fail at runtime.

Copilot · 2026-03-03T10:57:06Z

benchmarks/toposense_bench/pyproject.toml

+dependencies = [
+  "system-intelligence-sdk>=0.1.0",
+  "datasets>=2.14.0",
+  "huggingface_hub>=0.16.0",
+  "pandas>=1.5.0",
+  "tqdm",
+  "loguru",
+]
+
+[project.optional-dependencies]
+dev = [
+  "pytest>=8.0.0",
+  "ruff>=0.6.0",
+]
+
+[build-system]
+requires = ["uv_build>=0.10.4,<0.11.0"]
+build-backend = "uv_build"
+
+[tool.uv.build-backend]
+module-name = "src"
+module-root = ""
+
+[tool.uv.sources]
+system-intelligence-sdk = { workspace = true }


This benchmark depends on system-intelligence-sdk>=0.1.0, but the local workspace override is expressed only via [tool.uv.sources] (ignored by pip). Any pip install -e . flow will try to resolve system-intelligence-sdk from PyPI and likely fail. If pip-based installs must remain supported, declare an explicit path/VCS URL dependency for the SDK; otherwise, update CI/docs to require uv for installs/tests.

Copilot · 2026-03-03T10:57:07Z

benchmarks/courseexam_bench/pyproject.toml

+dependencies = [
+    "inspect-ai>=0.3.0",
+    "system-intelligence-sdk>=0.1.0",
+]
 license = { text = "MIT" }
 authors = [{ name = "Tarek Elsayed", email = "tareknaser360@gmail.com" }]

 [project.optional-dependencies]
 dev = ["pytest>=7.0.0", "pytest-asyncio>=0.21.0"]

 [build-system]
-requires = ["hatchling"]
-build-backend = "hatchling.build"
+requires = ["uv_build>=0.10.4,<0.11.0"]
+build-backend = "uv_build"

-[tool.hatch.build.targets.wheel]
-packages = ["courseexam"]
+[tool.uv.sources]
+system-intelligence-sdk = { workspace = true }
+
+[tool.uv.build-backend]
+module-name = "courseexam"
+module-root = ""


This project now depends on system-intelligence-sdk>=0.1.0, but the local workspace override is only declared via [tool.uv.sources] (ignored by pip). Any pip-based install/test flow will attempt to fetch system-intelligence-sdk from PyPI and likely fail. Consider adding an explicit path/VCS URL dependency for pip, or ensure CI/docs are updated to use uv exclusively.

Copilot · 2026-03-03T10:57:07Z

benchmarks/courselab_bench/pyproject.toml

+dependencies = [
+    "inspect-ai>=0.3.0",
+    "system-intelligence-sdk>=0.1.0",
+]
 description = "A benchmark for evaluating AI agents on systems programming labs"
 readme = "README.md"
 license = { text = "MIT" }
 authors = [{ name = "Tarek Elsayed", email = "tareknaser360@gmail.com" }]

 [build-system]
-requires = ["hatchling"]
-build-backend = "hatchling.build"
+requires = ["uv_build>=0.10.4,<0.11.0"]
+build-backend = "uv_build"

-[tool.hatch.build.targets.wheel]
-packages = ["courselab"]
+[tool.uv.sources]
+system-intelligence-sdk = { workspace = true }
+
+[tool.uv.build-backend]
+module-name = "courselab"
+module-root = ""


This project now depends on system-intelligence-sdk>=0.1.0, but the local workspace override is only declared via [tool.uv.sources] (ignored by pip). Any pip-based install/test flow will attempt to fetch system-intelligence-sdk from PyPI and likely fail. Consider adding an explicit path/VCS URL dependency for pip, or ensure CI/docs are updated to use uv exclusively.

Copilot · 2026-03-03T10:57:07Z

benchmarks/example_bench/install.sh

+    echo "==> uv not found. Installing uv..."
+    curl -LsSf https://astral.sh/uv/install.sh | sh
+    export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"


The install script auto-installs uv by piping a remote shell script from the network into sh. That pattern is a supply-chain risk and also makes installs non-reproducible in locked-down environments. Prefer documenting a manual uv installation step (or at least prompting for confirmation / verifying a pinned installer checksum) instead of executing a remote script automatically.

Suggested change

-    echo "==> uv not found. Installing uv..."
-    curl -LsSf https://astral.sh/uv/install.sh | sh
-    export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
+    echo "==> uv not found."
+    echo "==> Please install uv manually before running this script."
+    echo "==> See https://astral.sh/uv/ for installation instructions."
+    exit 1
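
If automatic installation is kept, the "pinned installer checksum" alternative mentioned above could be sketched as follows. This shows only the verification step; the function name is hypothetical, and the pinned digest would have to be recorded by the maintainers for a specific installer version, since no such value appears in this PR.

```python
import hashlib
import hmac


def installer_matches_pin(installer_bytes: bytes, pinned_sha256: str) -> bool:
    """Return True only if the downloaded installer hashes to the pinned digest.

    `pinned_sha256` is a hex digest chosen and committed by the maintainers
    (an assumption of this sketch, not a value published in this PR).
    """
    digest = hashlib.sha256(installer_bytes).hexdigest()
    # compare_digest performs a constant-time comparison of the two strings.
    return hmac.compare_digest(digest, pinned_sha256.strip().lower())


# Illustration with a stand-in payload instead of a real download.
payload = b"#!/bin/sh\necho placeholder installer\n"
pin = hashlib.sha256(payload).hexdigest()
print(installer_matches_pin(payload, pin))               # unchanged payload
print(installer_matches_pin(payload + b"# edit\n", pin))  # modified payload
```

The script would be executed only after this check passes, so a changed upstream installer causes a visible failure instead of silently running new code.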

Copilot · 2026-03-03T10:57:08Z

benchmarks/toposense_bench/install.sh

+if ! command -v uv >/dev/null 2>&1; then
+    echo "==> uv not found. Installing uv..."
+    curl -LsSf https://astral.sh/uv/install.sh | sh
+    export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
+fi


The install script auto-installs uv by piping a remote shell script from the network into sh. That pattern is a supply-chain risk and also makes installs non-reproducible in locked-down environments. Prefer documenting a manual uv installation step (or at least prompting for confirmation / verifying a pinned installer checksum) instead of executing a remote script automatically.

Copilot · 2026-03-03T10:57:08Z

benchmarks/arteval_bench/install.sh

+    echo "==> uv not found. Installing uv..."
+    curl -LsSf https://astral.sh/uv/install.sh | sh
+    export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"


The install script auto-installs uv by piping a remote shell script from the network into sh. That pattern is a supply-chain risk and also makes installs non-reproducible in locked-down environments. Prefer documenting a manual uv installation step (or at least prompting for confirmation / verifying a pinned installer checksum) instead of executing a remote script automatically.

Suggested change

-    echo "==> uv not found. Installing uv..."
-    curl -LsSf https://astral.sh/uv/install.sh | sh
-    export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
+    echo "==> Error: 'uv' command not found."
+    echo "Please install 'uv' manually before running this script."
+    echo "For installation instructions, see: https://docs.astral.sh/uv/getting-started/installation/"
+    exit 1

Copilot · 2026-03-03T10:57:08Z

doc/sdk_packaging.md

+uv build --package system-intelligence-sdk --wheel --sdist
+uv run python -m twine check dist/system_intelligence_sdk-*
+```


The guide uses uv run python -m twine check ..., but twine is not listed in the root dev extra (and uv run won’t automatically fetch it). Either switch the doc to uvx twine check ... (as the workflow does) or add twine to the appropriate optional dependency group so the command works as written.

Copilot · 2026-03-03T10:57:09Z

benchmarks/arteval_bench/pyproject.toml

+  "system-intelligence-sdk>=0.1.0",
+  "requests",
+  "azure-identity",
+  "sweagent @ git+https://github.com/SWE-agent/SWE-agent.git@v1.1.0",


The sweagent dependency is brought in via a Git URL pinned only to a tag (v1.1.0), which is mutable and can be retargeted to arbitrary commits. If the SWE-agent repository or its release tags are compromised, future installs/builds could transparently pull and execute attacker-controlled code in environments that hold API keys or other secrets. Prefer pinning this dependency to an immutable commit SHA (or a verified release artifact) so that the exact code version being executed cannot be changed without explicitly updating this configuration.
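
A lightweight check for this could flag Git URL requirements that are not pinned to a full commit SHA. The sketch below is a simplified heuristic: it looks for a 40-character hexadecimal ref at the end of the URL and ignores environment markers; the function name and the all-`a` SHA in the example are placeholders, not real values.

```python
import re

# A ref counts as a commit pin here only if it is a full 40-hex-digit SHA,
# optionally followed by a URL fragment such as "#subdirectory=...".
_COMMIT_PIN = re.compile(r"@[0-9a-fA-F]{40}(?:#.*)?$")


def is_pinned_to_commit(requirement: str) -> bool:
    """Return True if a 'name @ git+URL@ref' requirement ends in a full commit SHA."""
    requirement = requirement.strip()
    return "git+" in requirement and bool(_COMMIT_PIN.search(requirement))


tag_pinned = "sweagent @ git+https://github.com/SWE-agent/SWE-agent.git@v1.1.0"
# Placeholder SHA for illustration only; it does not refer to a real commit.
sha_pinned = "sweagent @ git+https://github.com/SWE-agent/SWE-agent.git@" + "a" * 40

print(is_pinned_to_commit(tag_pinned))
print(is_pinned_to_commit(sha_pinned))
```

Run over every `dependencies` entry, a check like this would report the tag-pinned sweagent requirement quoted above while accepting a SHA-pinned one.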

xuafeng · 2026-03-05T23:28:16Z

@Acture Thanks a lot for helping refine the code. Can you please fix the CI issues? Thanks.

xuafeng · 2026-03-05T23:29:49Z

@tareknaser Hi Tarek, Xinyu proposed some changes that affect the courselab/exam benchmarks. Please take a quick look to see if they make sense to you.

Acture added 5 commits March 3, 2026 18:31

feat(packaging): adopt uv workspace with benchmark subpackages

d448f7a

refactor(imports): remove sys.path hacks across benchmarks

1c17b67

chore(scripts): simplify benchmark install and run commands

156aed7

chore(docker): install workspace-built sdk and benchmark wheels

87adc18

fix(arteval): require python>=3.11 to match sweagent v1.1.0

b918e49

Copilot AI review requested due to automatic review settings March 3, 2026 10:44

Copilot started reviewing on behalf of Acture March 3, 2026 10:45

fix benchmark runtime bootstrapping and entrypoint compatibility

3127aba

Acture added 4 commits March 6, 2026 12:04

fix: align benchmark tests with uv workspace

0f128d0

fix: harden arteval benchmark runtime wiring

a13a41c

fix: drop unsupported LiteLLM params

9d10d6b

chore: remove local proxy helper

64e2ef1

