Refactor benchmark packaging/runtime: uv workspace, import cleanup, and docker unification #139
Acture wants to merge 10 commits into sys-intelligence:main
Pull request overview
This PR refactors the repository into a uv workspace with the SDK and benchmarks as installable packages, removing sys.path-based import hacks and aligning local + Docker install/run flows around a consistent packaging contract.
Changes:
- Introduces a root uv workspace and packages the SDK (system-intelligence-sdk) plus benchmarks as workspace members.
- Updates benchmarks to use package-safe/relative imports where applicable and standardizes install.sh/run.sh around uv sync + uv run.
- Reworks multiple benchmark Dockerfiles to build/install wheels from the workspace and adds packaging/structure documentation + SDK packaging CI workflow.
Reviewed changes
Copilot reviewed 51 out of 52 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/__init__.py | Adds SDK package __version__ resolution via package metadata. |
| sdk/README.md | Documents SDK install/build commands with uv. |
| pyproject.toml | Defines system-intelligence-sdk, uv_build backend, and the root uv workspace + benchmark registry. |
| doc/sdk_packaging.md | Adds SDK packaging/build guidance. |
| doc/project_structure.md | Introduces canonical repo structure and boundary rules. |
| doc/porting_benchmark.md | Removes sys.path hacks from porting guidance. |
| doc/creating_benchmark.md | Removes sys.path hacks from benchmark creation guidance. |
| benchmarks/toposense_bench/src/main.py | Removes sys.path modification for SDK imports. |
| benchmarks/toposense_bench/run.sh | Switches execution to uv run and validates .venv presence. |
| benchmarks/toposense_bench/pyproject.toml | Adds benchmark package metadata and SDK dependency via workspace source. |
| benchmarks/toposense_bench/install.sh | Migrates install flow to uv venv + uv sync. |
| benchmarks/sysmobench/tests/test_sysmobench.py | Removes sys.path insertion for core imports. |
| benchmarks/sysmobench/tests/test_sdk.py | Removes sys.path insertion for SDK imports. |
| benchmarks/sysmobench/src/main.py | Converts local imports to relative imports and removes sys.path setup. |
| benchmarks/sysmobench/src/executor.py | Converts evaluator import to relative import. |
| benchmarks/sysmobench/run.sh | Uses uv run with module execution (-m). |
| benchmarks/sysmobench/pyproject.toml | Adds benchmark package metadata and SDK dependency via workspace source. |
| benchmarks/sysmobench/install.sh | Migrates to uv for env creation/sync + keeps editable install for sysmobench_core. |
| benchmarks/sysmobench/Dockerfile | Builds wheels in a builder stage and installs SDK + benchmark wheels in runtime image. |
| benchmarks/sregym/src/main.py | Removes sys.path usage; loads sregym_core entry via importlib. |
| benchmarks/sregym/run.sh | Runs via uv run using sregym_core venv python and sets PYTHONPATH. |
| benchmarks/sregym/pyproject.toml | Adds benchmark package metadata and SDK dependency via workspace source. |
| benchmarks/sregym/install.sh | Updates final dependency install step to uv sync for sregym_core venv. |
| benchmarks/sregym/Dockerfile | Builder-stage wheel build + runtime wheel install approach. |
| benchmarks/example_bench/src/main.py | Removes sys.path modification for SDK imports. |
| benchmarks/example_bench/run.sh | Uses uv run and validates .venv presence. |
| benchmarks/example_bench/pyproject.toml | Adds benchmark package metadata and SDK dependency via workspace source. |
| benchmarks/example_bench/install.sh | Migrates install flow to uv venv + uv sync. |
| benchmarks/example_bench/Dockerfile | Builder-stage wheel build + runtime wheel install approach. |
| benchmarks/courselab_bench/pyproject.toml | Switches build backend to uv_build and adds SDK dependency via workspace source. |
| benchmarks/courseexam_bench/pyproject.toml | Switches build backend to uv_build and adds SDK dependency via workspace source. |
| benchmarks/cache_algo_bench/src/main.py | Removes sys.path modification for SDK imports. |
| benchmarks/cache_algo_bench/src/cache_simulator/cache/Cache.py | Replaces local path hack with a relative import for My. |
| benchmarks/cache_algo_bench/run.sh | Uses uv run and validates .venv presence. |
| benchmarks/cache_algo_bench/pyproject.toml | Adds benchmark package metadata and SDK dependency via workspace source. |
| benchmarks/cache_algo_bench/install.sh | Migrates install flow to uv venv + uv sync. |
| benchmarks/cache_algo_bench/Dockerfile | Builder-stage wheel build + runtime wheel install approach. |
| benchmarks/arteval_bench/src/evaluator/__init__.py | Adds package marker for evaluator subpackage. |
| benchmarks/arteval_bench/src/core/utils.py | Removes sys.path manipulation. |
| benchmarks/arteval_bench/src/core/run_eval_sweagent.py | Removes sys.path manipulation and updates imports. |
| benchmarks/arteval_bench/src/core/run_eval_in_env.py | Removes sys.path manipulation. |
| benchmarks/arteval_bench/src/core/main_patch.py | Removes sys.path manipulation. |
| benchmarks/arteval_bench/src/core/main.py | Removes sys.path manipulation. |
| benchmarks/arteval_bench/src/__init__.py | Adds benchmark package marker. |
| benchmarks/arteval_bench/run.sh | Uses uv run and validates .venv presence; updates invoked entry script. |
| benchmarks/arteval_bench/pyproject.toml | Adds benchmark package metadata + sweagent pin and SDK dependency via workspace source. |
| benchmarks/arteval_bench/install.sh | Migrates install flow to uv venv + uv sync. |
| benchmarks/arteval_bench/Dockerfile | Builder-stage wheel build + runtime wheel install; adds diagnostic checks. |
| README.md | Links to new structure and SDK packaging docs. |
| .gitignore | Ignores build artifacts (build/, dist/, *.egg-info/). |
| .github/workflows/sdk-package.yml | Adds CI workflow to build and validate SDK distributions using uv. |
| .dockerignore | Excludes venv/cache/build artifacts and outputs from Docker build context. |
Comments suppressed due to low confidence (2)
benchmarks/arteval_bench/src/core/utils.py:16
- Typo in the generated task string: "rached" should be "reached".
+ f" without asking for approval or confirmation. Once you rached the end"
+ f" of the README you must exit the Docker image gracefully.")
benchmarks/sysmobench/tests/test_sysmobench.py:6
SYSMOBENCH_CORE is now unused after removing the sys.path insertion. Consider removing it (or using it in a more explicit install/validation check) to avoid confusing future readers about how tla_eval is discovered.
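
One way to make the "explicit install/validation check" mentioned above concrete is to ask the import system whether a module is discoverable before the tests rely on it. This is only a sketch: `require_module` is a hypothetical helper, and it handles top-level module names only.

```python
import importlib.util


def require_module(name: str) -> None:
    """Raise a clear error if top-level module `name` is not importable."""
    if importlib.util.find_spec(name) is None:
        raise ModuleNotFoundError(
            f"{name!r} is not importable; install it into the active environment first"
        )
```

A test module could call `require_module("tla_eval")` at the top, so a missing install fails with an actionable message rather than relying on an unused path constant.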
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '../../../')))

from patch_evaluator import pacth_eval

run_eval_sweagent.py is inside the core package, but it imports a sibling module with from patch_evaluator import .... This relies on running the file as a script (or having src/core on sys.path) and can break once the benchmark is used as an installed package. Switch to an explicit relative import from the same package to make this workspace/packaging refactor fully sys.path-hack-free.
Suggested change
- from patch_evaluator import pacth_eval
+ from .patch_evaluator import pacth_eval
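
To illustrate why the bare sibling import breaks once the code is imported as a package member, the sketch below builds a throwaway package in a temporary directory and imports one of its modules by its dotted name. All names here (`demo_core`, `demo_patch_helper`, `runner`) are hypothetical stand-ins, not the repository's modules. The temporary `sys.path` entry only simulates a site-packages directory holding the installed package.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path, import_line: str) -> None:
    """Create root/demo_core with a helper module and a runner that imports it."""
    pkg = root / "demo_core"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "demo_patch_helper.py").write_text("def evaluate():\n    return 'ok'\n")
    (pkg / "runner.py").write_text(f"{import_line}\nresult = evaluate()\n")


def try_import(import_line: str) -> str:
    """Import demo_core.runner as a package member and report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp), import_line)
        # Stand-in for site-packages: only the package's parent dir is importable.
        sys.path.insert(0, tmp)
        try:
            return importlib.import_module("demo_core.runner").result
        except ModuleNotFoundError as exc:
            return f"failed: {exc.name}"
        finally:
            sys.path.remove(tmp)
            for name in [n for n in sys.modules if n.startswith("demo_core")]:
                del sys.modules[name]
            importlib.invalidate_caches()


print(try_import("from demo_patch_helper import evaluate"))   # failed: demo_patch_helper
print(try_import("from .demo_patch_helper import evaluate"))  # ok
```

The bare import only works when the module's own directory happens to be on `sys.path` (for example when the file is run as a script), which is the dependency the review comment asks to remove.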
requires-python = ">=3.9"
dependencies = [
    "system-intelligence-sdk>=0.1.0",
    "requests",
    "azure-identity",
    "sweagent @ git+https://github.com/SWE-agent/SWE-agent.git@v1.1.0",
]

PR description says arteval-bench was updated to require Python >=3.11 due to dependency constraints, but this pyproject.toml still declares requires-python = ">=3.9". Align the metadata with the actual dependency requirements (or adjust dependencies) to avoid installs that succeed initially but fail at runtime.
dependencies = [
    "system-intelligence-sdk>=0.1.0",
    "datasets>=2.14.0",
    "huggingface_hub>=0.16.0",
    "pandas>=1.5.0",
    "tqdm",
    "loguru",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "ruff>=0.6.0",
]

[build-system]
requires = ["uv_build>=0.10.4,<0.11.0"]
build-backend = "uv_build"

[tool.uv.build-backend]
module-name = "src"
module-root = ""

[tool.uv.sources]
system-intelligence-sdk = { workspace = true }

This benchmark depends on system-intelligence-sdk>=0.1.0, but the local workspace override is expressed only via [tool.uv.sources] (ignored by pip). Any pip install -e . flow will try to resolve system-intelligence-sdk from PyPI and likely fail. If pip-based installs must remain supported, declare an explicit path/VCS URL dependency for the SDK; otherwise, update CI/docs to require uv for installs/tests.
  dependencies = [
      "inspect-ai>=0.3.0",
+     "system-intelligence-sdk>=0.1.0",
  ]
  license = { text = "MIT" }
  authors = [{ name = "Tarek Elsayed", email = "tareknaser360@gmail.com" }]

  [project.optional-dependencies]
  dev = ["pytest>=7.0.0", "pytest-asyncio>=0.21.0"]

  [build-system]
- requires = ["hatchling"]
- build-backend = "hatchling.build"
+ requires = ["uv_build>=0.10.4,<0.11.0"]
+ build-backend = "uv_build"

- [tool.hatch.build.targets.wheel]
- packages = ["courseexam"]
+ [tool.uv.sources]
+ system-intelligence-sdk = { workspace = true }

+ [tool.uv.build-backend]
+ module-name = "courseexam"
+ module-root = ""

This project now depends on system-intelligence-sdk>=0.1.0, but the local workspace override is only declared via [tool.uv.sources] (ignored by pip). Any pip-based install/test flow will attempt to fetch system-intelligence-sdk from PyPI and likely fail. Consider adding an explicit path/VCS URL dependency for pip, or ensure CI/docs are updated to use uv exclusively.
  dependencies = [
      "inspect-ai>=0.3.0",
+     "system-intelligence-sdk>=0.1.0",
  ]
  description = "A benchmark for evaluating AI agents on systems programming labs"
  readme = "README.md"
  license = { text = "MIT" }
  authors = [{ name = "Tarek Elsayed", email = "tareknaser360@gmail.com" }]

  [build-system]
- requires = ["hatchling"]
- build-backend = "hatchling.build"
+ requires = ["uv_build>=0.10.4,<0.11.0"]
+ build-backend = "uv_build"

- [tool.hatch.build.targets.wheel]
- packages = ["courselab"]
+ [tool.uv.sources]
+ system-intelligence-sdk = { workspace = true }

+ [tool.uv.build-backend]
+ module-name = "courselab"
+ module-root = ""

This project now depends on system-intelligence-sdk>=0.1.0, but the local workspace override is only declared via [tool.uv.sources] (ignored by pip). Any pip-based install/test flow will attempt to fetch system-intelligence-sdk from PyPI and likely fail. Consider adding an explicit path/VCS URL dependency for pip, or ensure CI/docs are updated to use uv exclusively.
echo "==> uv not found. Installing uv..."
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"

The install script auto-installs uv by piping a remote shell script from the network into sh. That pattern is a supply-chain risk and also makes installs non-reproducible in locked-down environments. Prefer documenting a manual uv installation step (or at least prompting for confirmation / verifying a pinned installer checksum) instead of executing a remote script automatically.
Suggested change
- echo "==> uv not found. Installing uv..."
- curl -LsSf https://astral.sh/uv/install.sh | sh
- export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
+ echo "==> uv not found."
+ echo "==> Please install uv manually before running this script."
+ echo "==> See https://astral.sh/uv/ for installation instructions."
+ exit 1
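
For the alternative the comment mentions, verifying a pinned installer checksum, the core step is comparing the downloaded bytes against a digest recorded in the repository before anything is executed. A minimal sketch, with hypothetical function names and no network access (the caller is assumed to have saved the installer to disk first):

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()


def verify_installer(data: bytes, pinned_sha256: str) -> bool:
    """True only if the downloaded bytes match the pinned digest."""
    return hmac.compare_digest(sha256_hex(data), pinned_sha256.strip().lower())
```

An install script would read the saved installer, call `verify_installer` with the digest committed alongside the script, and refuse to run the installer on a mismatch. Pinning a digest also means the pin must be updated whenever the upstream installer changes, which is the trade-off against the simpler "install uv manually" suggestion.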
if ! command -v uv >/dev/null 2>&1; then
    echo "==> uv not found. Installing uv..."
    curl -LsSf https://astral.sh/uv/install.sh | sh
    export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
fi

The install script auto-installs uv by piping a remote shell script from the network into sh. That pattern is a supply-chain risk and also makes installs non-reproducible in locked-down environments. Prefer documenting a manual uv installation step (or at least prompting for confirmation / verifying a pinned installer checksum) instead of executing a remote script automatically.
echo "==> uv not found. Installing uv..."
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"

The install script auto-installs uv by piping a remote shell script from the network into sh. That pattern is a supply-chain risk and also makes installs non-reproducible in locked-down environments. Prefer documenting a manual uv installation step (or at least prompting for confirmation / verifying a pinned installer checksum) instead of executing a remote script automatically.
Suggested change
- echo "==> uv not found. Installing uv..."
- curl -LsSf https://astral.sh/uv/install.sh | sh
- export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
+ echo "==> Error: 'uv' command not found."
+ echo "Please install 'uv' manually before running this script."
+ echo "For installation instructions, see: https://docs.astral.sh/uv/getting-started/installation/"
+ exit 1
uv build --package system-intelligence-sdk --wheel --sdist
uv run python -m twine check dist/system_intelligence_sdk-*

The guide uses uv run python -m twine check ..., but twine is not listed in the root dev extra (and uv run won’t automatically fetch it). Either switch the doc to uvx twine check ... (as the workflow does) or add twine to the appropriate optional dependency group so the command works as written.
"system-intelligence-sdk>=0.1.0",
"requests",
"azure-identity",
"sweagent @ git+https://github.com/SWE-agent/SWE-agent.git@v1.1.0",

The sweagent dependency is brought in via a Git URL pinned only to a tag (v1.1.0), which is mutable and can be retargeted to arbitrary commits. If the SWE-agent repository or its release tags are compromised, future installs/builds could transparently pull and execute attacker-controlled code in environments that hold API keys or other secrets. Prefer pinning this dependency to an immutable commit SHA (or a verified release artifact) so that the exact code version being executed cannot be changed without explicitly updating this configuration.
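
Whether a Git dependency is pinned to an immutable ref can be checked with a small lint. The sketch below treats only a full 40-character hexadecimal commit SHA as immutable. The function names are hypothetical, and the URL parsing is intentionally simplistic: it assumes the `name @ git+https://...@ref` form shown above and does not handle user@host URLs or `#fragment` suffixes.

```python
import re
from typing import Optional

FULL_SHA = re.compile(r"[0-9a-f]{40}")


def git_ref(requirement: str) -> Optional[str]:
    """Return the ref after the last '@' of a 'name @ git+URL@ref' requirement."""
    _, sep, url = requirement.partition(" @ ")
    if not sep or not url.startswith("git+"):
        return None  # not a Git direct reference
    _, at, ref = url.rpartition("@")
    return ref if at else None


def is_immutable_pin(requirement: str) -> bool:
    """True only when the Git ref is a full commit SHA rather than a tag or branch."""
    ref = git_ref(requirement)
    return ref is not None and FULL_SHA.fullmatch(ref) is not None


print(is_immutable_pin("sweagent @ git+https://github.com/SWE-agent/SWE-agent.git@v1.1.0"))  # False
```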
@Acture Thanks a lot for helping refine the code. Can you please fix the CI issues? Thanks.
@tareknaser Hi Tarek, Xinyu proposed some changes which affect the courselab/exam. Please take a quick look to see if it makes sense to you.
Description
This PR restructures benchmark development and runtime around a uv workspace + multi-package model, removes sys.path-based import hacks, and unifies local/docker installation paths.
The goal is to make SDK + benchmark packaging reusable and predictable, while keeping benchmark scripts and Docker images aligned with the same dependency contract.
Changes
- […] uv workspace at repo root and defined benchmark packages as workspace members.
- […] pyproject.toml for all 8 benchmark packages so each benchmark is an installable package depending on system-intelligence-sdk.
- […] sys.path.append/insert patterns from benchmark code/tests/docs and switched to package-safe imports/relative imports where applicable.
- […] install.sh/run.sh scripts:
  - […] (--project ../..)
  - […] uv sync/uv run usage per benchmark directory
  - […] sregym_core env).
  - […] activate pitfalls by using --python .venv/bin/python
- […] .dockerignore to exclude local env/cache artifacts from image build context.
- […] sweagent to v1.1.0 tag
- […] arteval-bench requires-python to >=3.11 to match dependency constraints.
Testing
- […] pyproject.toml files via Python tomllib.
- […] (bash -n) for modified benchmark install/run scripts.
- docker build --no-cache -t arteval_bench -f benchmarks/arteval_bench/Dockerfile .
Checklist