Refactor benchmark packaging/runtime: uv workspace, import cleanup, and docker unification #139

Open
Acture wants to merge 10 commits into sys-intelligence:main from Acture:reogranize_repo

Conversation

Acture commented Mar 3, 2026

Description

This PR restructures benchmark development and runtime around a uv workspace + multi-package model, removes sys.path-based import hacks, and unifies local/docker installation paths.

The goal is to make SDK + benchmark packaging reusable and predictable, while keeping benchmark scripts and Docker images aligned with the same dependency contract.

Changes

  • Introduced a uv workspace at repo root and defined benchmark packages as workspace members.
  • Added/updated pyproject.toml for all 8 benchmark packages so each benchmark is an installable package depending on system-intelligence-sdk.
  • Kept SDK as a reusable package and updated packaging docs/CI for uv-based build flow.
  • Removed sys.path.append/insert patterns from benchmark code/tests/docs and switched to package-safe imports/relative imports where applicable.
  • Simplified benchmark install.sh / run.sh scripts:
    • removed parent-project invocation style (--project ../..)
    • standardized to local uv sync / uv run usage per benchmark directory
    • preserved special handling only where needed (e.g., sregym_core env).
  • Reworked benchmark Dockerfiles to:
    • build wheels from workspace packages
    • install SDK wheel + benchmark wheel in image
    • avoid shell activate pitfalls by using --python .venv/bin/python
    • include clearer diagnostics for missing wheel artifacts
    • include required system tools for git-based dependencies.
  • Added root .dockerignore to exclude local env/cache artifacts from image build context.
  • Updated ArtEval dependency compatibility:
    • pinned sweagent to v1.1.0 tag
    • set arteval-bench requires-python to >=3.11 to match dependency constraints.

Testing

  • Parsed all root/benchmark pyproject.toml files via Python tomllib.
  • Ran shell syntax checks (bash -n) for modified benchmark install/run scripts.
  • Validated Docker build flow iteratively with:
    • docker build --no-cache -t arteval_bench -f benchmarks/arteval_bench/Dockerfile .
  • Confirmed branch commit structure after rebase/squash cleanup.

Checklist

  • Tests pass locally (syntax/config-level checks and Docker build-path validation)
  • Code follows project style guidelines
  • Documentation updated (packaging/structure docs and workflow updates)

Copilot AI review requested due to automatic review settings March 3, 2026 10:44
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the repository into a uv workspace with the SDK and benchmarks as installable packages, removing sys.path-based import hacks and aligning local + Docker install/run flows around a consistent packaging contract.

Changes:

  • Introduces a root uv workspace and packages the SDK (system-intelligence-sdk) plus benchmarks as workspace members.
  • Updates benchmarks to use package-safe/relative imports where applicable and standardizes install.sh/run.sh around uv sync + uv run.
  • Reworks multiple benchmark Dockerfiles to build/install wheels from the workspace and adds packaging/structure documentation + SDK packaging CI workflow.

Reviewed changes

Copilot reviewed 51 out of 52 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| sdk/__init__.py | Adds SDK package __version__ resolution via package metadata. |
| sdk/README.md | Documents SDK install/build commands with uv. |
| pyproject.toml | Defines system-intelligence-sdk, uv_build backend, and the root uv workspace + benchmark registry. |
| doc/sdk_packaging.md | Adds SDK packaging/build guidance. |
| doc/project_structure.md | Introduces canonical repo structure and boundary rules. |
| doc/porting_benchmark.md | Removes sys.path hacks from porting guidance. |
| doc/creating_benchmark.md | Removes sys.path hacks from benchmark creation guidance. |
| benchmarks/toposense_bench/src/main.py | Removes sys.path modification for SDK imports. |
| benchmarks/toposense_bench/run.sh | Switches execution to uv run and validates .venv presence. |
| benchmarks/toposense_bench/pyproject.toml | Adds benchmark package metadata and SDK dependency via workspace source. |
| benchmarks/toposense_bench/install.sh | Migrates install flow to uv venv + uv sync. |
| benchmarks/sysmobench/tests/test_sysmobench.py | Removes sys.path insertion for core imports. |
| benchmarks/sysmobench/tests/test_sdk.py | Removes sys.path insertion for SDK imports. |
| benchmarks/sysmobench/src/main.py | Converts local imports to relative imports and removes sys.path setup. |
| benchmarks/sysmobench/src/executor.py | Converts evaluator import to relative import. |
| benchmarks/sysmobench/run.sh | Uses uv run with module execution (-m). |
| benchmarks/sysmobench/pyproject.toml | Adds benchmark package metadata and SDK dependency via workspace source. |
| benchmarks/sysmobench/install.sh | Migrates to uv for env creation/sync + keeps editable install for sysmobench_core. |
| benchmarks/sysmobench/Dockerfile | Builds wheels in a builder stage and installs SDK + benchmark wheels in runtime image. |
| benchmarks/sregym/src/main.py | Removes sys.path usage; loads sregym_core entry via importlib. |
| benchmarks/sregym/run.sh | Runs via uv run using sregym_core venv python and sets PYTHONPATH. |
| benchmarks/sregym/pyproject.toml | Adds benchmark package metadata and SDK dependency via workspace source. |
| benchmarks/sregym/install.sh | Updates final dependency install step to uv sync for sregym_core venv. |
| benchmarks/sregym/Dockerfile | Builder-stage wheel build + runtime wheel install approach. |
| benchmarks/example_bench/src/main.py | Removes sys.path modification for SDK imports. |
| benchmarks/example_bench/run.sh | Uses uv run and validates .venv presence. |
| benchmarks/example_bench/pyproject.toml | Adds benchmark package metadata and SDK dependency via workspace source. |
| benchmarks/example_bench/install.sh | Migrates install flow to uv venv + uv sync. |
| benchmarks/example_bench/Dockerfile | Builder-stage wheel build + runtime wheel install approach. |
| benchmarks/courselab_bench/pyproject.toml | Switches build backend to uv_build and adds SDK dependency via workspace source. |
| benchmarks/courseexam_bench/pyproject.toml | Switches build backend to uv_build and adds SDK dependency via workspace source. |
| benchmarks/cache_algo_bench/src/main.py | Removes sys.path modification for SDK imports. |
| benchmarks/cache_algo_bench/src/cache_simulator/cache/Cache.py | Replaces local path hack with a relative import for My. |
| benchmarks/cache_algo_bench/run.sh | Uses uv run and validates .venv presence. |
| benchmarks/cache_algo_bench/pyproject.toml | Adds benchmark package metadata and SDK dependency via workspace source. |
| benchmarks/cache_algo_bench/install.sh | Migrates install flow to uv venv + uv sync. |
| benchmarks/cache_algo_bench/Dockerfile | Builder-stage wheel build + runtime wheel install approach. |
| benchmarks/arteval_bench/src/evaluator/__init__.py | Adds package marker for evaluator subpackage. |
| benchmarks/arteval_bench/src/core/utils.py | Removes sys.path manipulation. |
| benchmarks/arteval_bench/src/core/run_eval_sweagent.py | Removes sys.path manipulation and updates imports. |
| benchmarks/arteval_bench/src/core/run_eval_in_env.py | Removes sys.path manipulation. |
| benchmarks/arteval_bench/src/core/main_patch.py | Removes sys.path manipulation. |
| benchmarks/arteval_bench/src/core/main.py | Removes sys.path manipulation. |
| benchmarks/arteval_bench/src/__init__.py | Adds benchmark package marker. |
| benchmarks/arteval_bench/run.sh | Uses uv run and validates .venv presence; updates invoked entry script. |
| benchmarks/arteval_bench/pyproject.toml | Adds benchmark package metadata + sweagent pin and SDK dependency via workspace source. |
| benchmarks/arteval_bench/install.sh | Migrates install flow to uv venv + uv sync. |
| benchmarks/arteval_bench/Dockerfile | Builder-stage wheel build + runtime wheel install; adds diagnostic checks. |
| README.md | Links to new structure and SDK packaging docs. |
| .gitignore | Ignores build artifacts (build/, dist/, *.egg-info/). |
| .github/workflows/sdk-package.yml | Adds CI workflow to build and validate SDK distributions using uv. |
| .dockerignore | Excludes venv/cache/build artifacts and outputs from Docker build context. |
Comments suppressed due to low confidence (2)

benchmarks/arteval_bench/src/core/utils.py:16

  • Typo in the generated task string: "rached" should be "reached".
        + f" without asking for approval or confirmation. Once you rached the end"
        + f" of the README you must exit the Docker image gracefully.")

benchmarks/sysmobench/tests/test_sysmobench.py:6

  • SYSMOBENCH_CORE is now unused after removing the sys.path insertion. Consider removing it (or using it in a more explicit install/validation check) to avoid confusing future readers about how tla_eval is discovered.

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '../../../')))

from patch_evaluator import pacth_eval

Copilot AI Mar 3, 2026

run_eval_sweagent.py is inside the core package, but it imports a sibling module with from patch_evaluator import .... This relies on running the file as a script (or having src/core on sys.path) and can break once the benchmark is used as an installed package. Switch to an explicit relative import from the same package to make this workspace/packaging refactor fully sys.path-hack-free.

Suggested change
- from patch_evaluator import pacth_eval
+ from .patch_evaluator import pacth_eval

Comment on lines +5 to +11
requires-python = ">=3.9"
dependencies = [
"system-intelligence-sdk>=0.1.0",
"requests",
"azure-identity",
"sweagent @ git+https://github.com/SWE-agent/SWE-agent.git@v1.1.0",
]

Copilot AI Mar 3, 2026

PR description says arteval-bench was updated to require Python >=3.11 due to dependency constraints, but this pyproject.toml still declares requires-python = ">=3.9". Align the metadata with the actual dependency requirements (or adjust dependencies) to avoid installs that succeed initially but fail at runtime.

Comment on lines +6 to +30
dependencies = [
"system-intelligence-sdk>=0.1.0",
"datasets>=2.14.0",
"huggingface_hub>=0.16.0",
"pandas>=1.5.0",
"tqdm",
"loguru",
]

[project.optional-dependencies]
dev = [
"pytest>=8.0.0",
"ruff>=0.6.0",
]

[build-system]
requires = ["uv_build>=0.10.4,<0.11.0"]
build-backend = "uv_build"

[tool.uv.build-backend]
module-name = "src"
module-root = ""

[tool.uv.sources]
system-intelligence-sdk = { workspace = true }

Copilot AI Mar 3, 2026

This benchmark depends on system-intelligence-sdk>=0.1.0, but the local workspace override is expressed only via [tool.uv.sources] (ignored by pip). Any pip install -e . flow will try to resolve system-intelligence-sdk from PyPI and likely fail. If pip-based installs must remain supported, declare an explicit path/VCS URL dependency for the SDK; otherwise, update CI/docs to require uv for installs/tests.

Comment on lines +6 to +25
dependencies = [
"inspect-ai>=0.3.0",
"system-intelligence-sdk>=0.1.0",
]
license = { text = "MIT" }
authors = [{ name = "Tarek Elsayed", email = "tareknaser360@gmail.com" }]

[project.optional-dependencies]
dev = ["pytest>=7.0.0", "pytest-asyncio>=0.21.0"]

  [build-system]
- requires = ["hatchling"]
- build-backend = "hatchling.build"
+ requires = ["uv_build>=0.10.4,<0.11.0"]
+ build-backend = "uv_build"

- [tool.hatch.build.targets.wheel]
- packages = ["courseexam"]
+ [tool.uv.sources]
+ system-intelligence-sdk = { workspace = true }

+ [tool.uv.build-backend]
+ module-name = "courseexam"
+ module-root = ""

Copilot AI Mar 3, 2026

This project now depends on system-intelligence-sdk>=0.1.0, but the local workspace override is only declared via [tool.uv.sources] (ignored by pip). Any pip-based install/test flow will attempt to fetch system-intelligence-sdk from PyPI and likely fail. Consider adding an explicit path/VCS URL dependency for pip, or ensure CI/docs are updated to use uv exclusively.

Comment on lines +5 to +23
dependencies = [
"inspect-ai>=0.3.0",
"system-intelligence-sdk>=0.1.0",
]
description = "A benchmark for evaluating AI agents on systems programming labs"
readme = "README.md"
license = { text = "MIT" }
authors = [{ name = "Tarek Elsayed", email = "tareknaser360@gmail.com" }]

  [build-system]
- requires = ["hatchling"]
- build-backend = "hatchling.build"
+ requires = ["uv_build>=0.10.4,<0.11.0"]
+ build-backend = "uv_build"

- [tool.hatch.build.targets.wheel]
- packages = ["courselab"]
+ [tool.uv.sources]
+ system-intelligence-sdk = { workspace = true }

+ [tool.uv.build-backend]
+ module-name = "courselab"
+ module-root = ""

Copilot AI Mar 3, 2026

This project now depends on system-intelligence-sdk>=0.1.0, but the local workspace override is only declared via [tool.uv.sources] (ignored by pip). Any pip-based install/test flow will attempt to fetch system-intelligence-sdk from PyPI and likely fail. Consider adding an explicit path/VCS URL dependency for pip, or ensure CI/docs are updated to use uv exclusively.

Comment on lines +6 to +8
echo "==> uv not found. Installing uv..."
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"

Copilot AI Mar 3, 2026

The install script auto-installs uv by piping a remote shell script from the network into sh. That pattern is a supply-chain risk and also makes installs non-reproducible in locked-down environments. Prefer documenting a manual uv installation step (or at least prompting for confirmation / verifying a pinned installer checksum) instead of executing a remote script automatically.

Suggested change
- echo "==> uv not found. Installing uv..."
- curl -LsSf https://astral.sh/uv/install.sh | sh
- export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
+ echo "==> uv not found."
+ echo "==> Please install uv manually before running this script."
+ echo "==> See https://astral.sh/uv/ for installation instructions."
+ exit 1

Comment on lines +5 to +9
if ! command -v uv >/dev/null 2>&1; then
echo "==> uv not found. Installing uv..."
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
fi

Copilot AI Mar 3, 2026

The install script auto-installs uv by piping a remote shell script from the network into sh. That pattern is a supply-chain risk and also makes installs non-reproducible in locked-down environments. Prefer documenting a manual uv installation step (or at least prompting for confirmation / verifying a pinned installer checksum) instead of executing a remote script automatically.
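
The fail-fast alternative described here can also be expressed in Python, for example in a setup helper that runs before any `uv` command. This is only a sketch of the idea; `require_tool` is a hypothetical name and the message wording is an assumption.

```python
import shutil
import sys


def require_tool(name: str) -> str:
    """Return the absolute path of an executable on PATH, or exit with a hint."""
    path = shutil.which(name)
    if path is None:
        # Exit with a message instead of downloading and running an installer.
        sys.exit(
            f"==> '{name}' not found. "
            "Please install it manually before running this script."
        )
    return path
```

Calling `require_tool("uv")` at the top of a script stops with a non-zero exit status when uv is missing, leaving the installation decision to the user.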

Comment on lines +6 to +8
echo "==> uv not found. Installing uv..."
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"

Copilot AI Mar 3, 2026

The install script auto-installs uv by piping a remote shell script from the network into sh. That pattern is a supply-chain risk and also makes installs non-reproducible in locked-down environments. Prefer documenting a manual uv installation step (or at least prompting for confirmation / verifying a pinned installer checksum) instead of executing a remote script automatically.

Suggested change
- echo "==> uv not found. Installing uv..."
- curl -LsSf https://astral.sh/uv/install.sh | sh
- export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
+ echo "==> Error: 'uv' command not found."
+ echo "Please install 'uv' manually before running this script."
+ echo "For installation instructions, see: https://docs.astral.sh/uv/getting-started/installation/"
+ exit 1

Comment on lines +13 to +15
uv build --package system-intelligence-sdk --wheel --sdist
uv run python -m twine check dist/system_intelligence_sdk-*
```

Copilot AI Mar 3, 2026

The guide uses uv run python -m twine check ..., but twine is not listed in the root dev extra (and uv run won’t automatically fetch it). Either switch the doc to uvx twine check ... (as the workflow does) or add twine to the appropriate optional dependency group so the command works as written.

"system-intelligence-sdk>=0.1.0",
"requests",
"azure-identity",
"sweagent @ git+https://github.com/SWE-agent/SWE-agent.git@v1.1.0",

Copilot AI Mar 3, 2026

The sweagent dependency is brought in via a Git URL pinned only to a tag (v1.1.0), which is mutable and can be retargeted to arbitrary commits. If the SWE-agent repository or its release tags are compromised, future installs/builds could transparently pull and execute attacker-controlled code in environments that hold API keys or other secrets. Prefer pinning this dependency to an immutable commit SHA (or a verified release artifact) so that the exact code version being executed cannot be changed without explicitly updating this configuration.
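
A lint-style check for this could look like the sketch below. It assumes the `name @ git+URL@ref` requirement form used above and treats only a full 40-character hexadecimal ref (a SHA-1 commit ID) as immutable; `is_pinned_to_commit` is a hypothetical helper, not part of any packaging tool.

```python
import re

_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_to_commit(requirement: str) -> bool:
    """True if the ref after the last '@' is a full 40-hex commit SHA."""
    _, sep, ref = requirement.rpartition("@")
    return bool(sep) and _FULL_SHA.fullmatch(ref.strip()) is not None


print(
    is_pinned_to_commit(
        "sweagent @ git+https://github.com/SWE-agent/SWE-agent.git@v1.1.0"
    )
)  # False: v1.1.0 is a tag, not a commit SHA
```

Abbreviated SHAs are rejected on purpose, since a short prefix can become ambiguous as a repository grows.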

Collaborator

xuafeng commented Mar 5, 2026

@Acture Thanks a lot for helping refine the code. Can you please fix the CI issues? Thanks.

Collaborator

xuafeng commented Mar 5, 2026

@tareknaser Hi Tarek, Xinyu proposed some changes which affect the courselab/exam. Please take a quick look to see if they make sense to you.
