Prepare GSE274058 reference release artifacts by hutaobo · Pull Request #1 · hutaobo/SpatialPerturb

hutaobo · 2026-04-28T15:26:46Z

Summary

package and document the GSE274058 reference-side run for GitHub + RTD publication
add release packaging and A100 comparison scripts plus provenance-aware rerun support
stabilize the current SpatialPerturb benchmark/io/signature stack so tests and Sphinx builds pass

Validation

pytest -q
python -m sphinx -b html docs .tmp_sphinx_html
python scripts/package_gse274058_reference_release.py
python scripts/compare_gse274058_reference_runs.py --baseline-dir reports/gse274058_reference_run --candidate-dir reports/gse274058_reference_run --output-json artifacts/gse274058_reference_release/self_compare.json

Notes

public RTD latest and final GitHub release should still wait for merge plus the authoritative A100 rerun under /data/taobo.hu/SpatialPerturb
the current docs page points at the expected releases/latest/download/... asset URLs so it will light up once the post-merge release is created

sourcery-ai

Sorry @hutaobo, your pull request is larger than the review limit of 150000 diff characters

Copilot

Pull request overview

Packages and documents the GSE274058 (reference-side) release artifacts while stabilizing the SpatialPerturb benchmark / I/O / signature stack so pytest + Sphinx builds pass.

Changes:

Introduces a cohesive spatialperturb package API (schema, I/O, preprocessing, graphs, analysis tools, benchmarks, plotting, CLI).
Adds end-to-end benchmark + CLI + A100 workflow scripts plus release packaging/comparison utilities.
Adds broad unit/smoke test coverage and publishes rendered docs pages + precomputed reference-result summaries.

Reviewed changes

Copilot reviewed 62 out of 64 changed files in this pull request and generated 9 comments.

File	Description
tests/test_tools.py	Exercises core analysis + plotting helpers on demo fixtures.
tests/test_smoke.py	CLI + dataset lifecycle smoke tests (datasets/benchmarks/prepare-xenium/reference benchmark).
tests/test_signatures.py	Tests signature/program building, scoring, and aggregation helpers.
tests/test_schema.py	Tests schema default filling + validation.
tests/test_pp.py	Tests perturbation assignment + QC tables.
tests/test_io.py	Tests `from_tables`, Xenium/Stereo-seq directory readers, ROI/cell-group annotations, and 10x H5 reader.
tests/test_gr.py	Tests spatial graph construction and neighbor collection behavior.
tests/test_datasets.py	Tests dataset preparation parsing for GSE241115 breast CROP-seq (tar + sparse + control inputs).
tests/test_benchmarks.py	Tests core benchmark reporting and reference projection benchmark outputs.
tests/test_a100_workflow_scripts.py	Validates A100 workflow helper scripts via import+in-memory fixtures.
tests/conftest.py	Adds demo dataset fixtures and ensures `src/` importability in tests.
src/spatialperturb/tl.py	Core analysis tools: intrinsic/neighbor DE, LR scoring, concordance, and power curve.
src/spatialperturb/signatures.py	Program/signature derivation, scoring, neighbor scoring, aggregation, reference program building.
src/spatialperturb/schema.py	Defines/validates required AnnData schema + provenance metadata.
src/spatialperturb/resources/convert_seurat_to_tables.R	R helper to export Seurat objects into tables/MTX for ingestion.
src/spatialperturb/reports.py	Paper-figure rendering helper and manifest writing for benchmark outputs.
src/spatialperturb/py.typed	Marks the package as typed for type checkers.
src/spatialperturb/pp.py	Perturbation assignment from barcode features + QC summaries.
src/spatialperturb/pl.py	Plotting utilities for benchmark figures.
src/spatialperturb/io.py	AnnData construction and readers for Xenium/Stereo-seq-style exports and 10x H5 matrices; ROI/cell-group annotation.
src/spatialperturb/gr.py	Spatial graph construction and neighbor-edge collection.
src/spatialperturb/cli.py	Typer-based CLI exposing dataset lifecycle and benchmark workflows.
src/spatialperturb/benchmarks.py	Benchmark orchestration, report generation, manifests, and reference projection pipeline.
src/spatialperturb/_utils.py	Shared utilities: matrix extraction, BH correction, log2fc, dict merge.
src/spatialperturb/__init__.py	Public API surface + module re-exports; bumps version to 0.3.0.
src/SpatialPerturb/signatures.py	Removes legacy duplicate module.
src/SpatialPerturb/cli.py	Removes legacy duplicate module.
src/SpatialPerturb/__init__.py	Removes legacy duplicate module.
scripts/run_gse274058_reference.py	Script to run and summarize the GSE274058 reference-side analysis.
scripts/run_breast_reference_projection.py	A100-friendly breast Xenium reference projection workflow runner.
scripts/package_gse274058_reference_release.py	Packages reference run outputs into release assets + docs-ready summaries.
scripts/interpret_breast_reference_projection.py	Produces a biological interpretation markdown + “top programs” tables.
scripts/compare_gse274058_reference_runs.py	Compares two run directories and optionally replaces baseline.
scripts/a100_sync_xenium_minimal.ps1	Syncs minimal Xenium inputs to the remote A100 host.
scripts/a100_setup_env.sh	Creates/updates an A100 venv and writes environment status JSON.
scripts/a100_run_breast_reference_projection.sh	End-to-end A100 runner (git checkout, env setup, run workflow, monitor).
scripts/a100_monitor_status.py	Generates JSON/Markdown status snapshots for the A100 run directory.
scripts/a100_monitor_breast_reference_projection.sh	Convenience wrapper to run the monitor once or in watch mode.
pyproject.toml	Updates package metadata/version and dependencies (adds statsmodels/h5py, extras, script entrypoint).
mkdocs.yml	Adds nav entries for new workflow/benchmark/reference-result docs pages.
docs/workflow.md	Documents the standardized workflow and example code snippets.
docs/results/gse274058_reference/valid_perturbations.tsv	Published reference-run table artifact.
docs/results/gse274058_reference/valid_perturbations.md	Rendered markdown table for docs inclusion.
docs/results/gse274058_reference/top_hits.tsv	Published reference-run table artifact.
docs/results/gse274058_reference/top_hits.md	Rendered markdown table for docs inclusion.
docs/results/gse274058_reference/target_gene_sanity.md	Rendered target-gene sanity table for docs inclusion.
docs/results/gse274058_reference/qc_summary.md	Rendered QC summary snippet for docs inclusion.
docs/results/gse274058_reference/program_summary.tsv	Published reference-run table artifact.
docs/results/gse274058_reference/program_summary.md	Rendered markdown table for docs inclusion.
docs/results/gse274058_reference/overview.md	Rendered provenance overview snippet for docs inclusion.
docs/results/gse274058_reference/improvement.md	Rendered “how to improve” snippet for docs inclusion.
docs/results/gse274058_reference/dataset_summary.json	Published reference-run provenance JSON artifact.
docs/results/gse274058_reference/a100_status.md	Rendered A100 confirmation status snippet for docs inclusion.
docs/requirements.txt	Adjusts doc build requirements.
docs/paper-repro.md	Documents paper-grade CLI/Python reproduction workflow.
docs/index.md	Updates docs landing page and Sphinx toctree.
docs/gse274058-reference-results.md	Adds a docs page publishing the GSE274058 reference run assets + summaries.
docs/benchmarks.md	Documents benchmark tracks and expected outputs.
docs/api.md	Adds Sphinx autodoc API reference pages.
README.md	Updates README to match new API/CLI/benchmark workflows.
CITATION.cff	Updates version and repository URL.
.gitignore	Ignores caches, reports, artifacts, and temp Sphinx build output.

Copilot · 2026-04-28T18:56:32Z

+        neighbors = []
+        for neighbor_idx, distance in zip(row.indices, row.data, strict=False):
+            if not include_self and neighbor_idx == idx:


zip(..., strict=False) requires Python 3.10+. With requires-python = ">=3.9" in pyproject.toml, this will raise TypeError: zip() takes no keyword arguments on Python 3.9. Either bump the minimum supported Python version to >=3.10 or remove the strict= usage here (and elsewhere).

Copilot · 2026-04-28T18:56:32Z

+    x_positions = [0.08, 0.24, 0.40, 0.58, 0.76, 0.92]
+    for idx, ((title, body), x) in enumerate(zip(steps, x_positions, strict=False)):
+        ax.text(


zip(..., strict=False) requires Python 3.10+. With requires-python >=3.9, this call raises TypeError: zip() takes no keyword arguments when it runs on Python 3.9. Either bump minimum Python to >=3.10 or remove the strict= argument here.

Copilot · 2026-04-28T18:56:33Z

+        if not isinstance(group_values, tuple):
+            group_values = (group_values,)
+        group_label = " | ".join(f"{column}={value}" for column, value in zip(group_cols, group_values, strict=False))
+        n_cells = int(len(frame))
+        means = frame.loc[:, scores.columns].mean(axis=0)


zip(..., strict=False) is only available in Python 3.10+. Since pyproject.toml declares requires-python >=3.9, this will crash on Python 3.9. Either bump the minimum Python version to >=3.10 or replace zip(..., strict=False) with plain zip(...) and validate lengths explicitly if needed.
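
A minimal sketch of the second option, written to run on Python 3.9. Because strict=False is also the default behaviour of zip, dropping the keyword does not change results; the explicit length check only matters where a mismatch should be an error. The helper name zip_checked and the sample values are hypothetical and are not part of the PR.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def zip_checked(left: Sequence[A], right: Sequence[B]) -> Iterator[Tuple[A, B]]:
    """Pair two sequences, raising if their lengths differ (no strict= needed)."""
    if len(left) != len(right):
        raise ValueError(f"Length mismatch: {len(left)} != {len(right)}")
    # Plain zip() behaves exactly like zip(..., strict=False) on Python 3.10+.
    return zip(left, right)


# Illustrative values only; the real column names come from the caller.
group_cols = ["perturbation", "cell_type"]
group_values = ("GENE1", "epithelial")
group_label = " | ".join(
    f"{column}={value}" for column, value in zip_checked(group_cols, group_values)
)
print(group_label)  # perturbation=GENE1 | cell_type=epithelial
```

If silent truncation is acceptable at a given call site, plain zip(...) alone is enough there.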

Copilot · 2026-04-28T18:56:33Z

+            scores = pd.DataFrame(raw_scores, index=adata.obs_names.astype(str))
+    scores.index = adata.obs_names.astype(str)
+    scores.columns = scores.columns.astype(str)


When a user passes a DataFrame for score_key, scores.index = adata.obs_names.astype(str) overwrites its index and can silently misalign rows if the DataFrame isn’t already in the exact same order as adata.obs_names. Prefer scores = scores.reindex(adata.obs_names.astype(str)) (and error if labels are missing/extra) instead of blindly replacing the index.
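
One way the suggested label-based alignment could look, as a sketch. The function name align_scores and the toy data are assumptions for illustration; obs_names stands in for adata.obs_names.

```python
import pandas as pd


def align_scores(scores: pd.DataFrame, obs_names) -> pd.DataFrame:
    """Align scores to obs_names by label, failing on missing or extra labels."""
    expected = pd.Index(obs_names).astype(str)
    scores = scores.copy()
    scores.index = scores.index.astype(str)
    missing = expected.difference(scores.index)
    extra = scores.index.difference(expected)
    if len(missing) or len(extra):
        raise ValueError(
            f"Score index mismatch: {len(missing)} missing, {len(extra)} extra labels"
        )
    # reindex reorders rows by label instead of relabelling them by position.
    return scores.reindex(expected)


obs_names = ["cell_a", "cell_b", "cell_c"]
shuffled = pd.DataFrame(
    {"program_1": [3.0, 1.0, 2.0]}, index=["cell_c", "cell_a", "cell_b"]
)
aligned = align_scores(shuffled, obs_names)
print(aligned["program_1"].tolist())  # [1.0, 2.0, 3.0]
```

With the index overwrite from the diff, the same input would keep the values in the order 3.0, 1.0, 2.0 and attach them to the wrong cells.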

Copilot · 2026-04-28T18:56:33Z

+    target_rows = intrinsic_de.loc[
+        intrinsic_de.apply(lambda row: str(row["perturbation"]) == str(row["gene"]), axis=1),
+        ["perturbation", "gene", "log2fc", "fdr", "mean_case", "mean_control"],
+    ].copy()


intrinsic_de.apply(lambda row: ..., axis=1) is row-wise and can be noticeably slow on larger DE tables. This filter can be vectorized (e.g., compare the perturbation and gene columns directly after casting to string) to keep packaging fast and avoid unnecessary CPU time.
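
A sketch of the vectorized form of this filter. The column names come from the diff above; the rows are made-up illustrative values, not results from the reference run.

```python
import pandas as pd

# Toy DE table with the same columns the packaging script selects.
intrinsic_de = pd.DataFrame(
    {
        "perturbation": ["GENE1", "GENE1", "GENE2"],
        "gene": ["GENE1", "GENE2", "GENE2"],
        "log2fc": [-1.2, 0.1, -0.8],
        "fdr": [0.001, 0.9, 0.01],
        "mean_case": [0.5, 1.0, 0.7],
        "mean_control": [1.2, 0.9, 1.2],
    }
)

# Compare the two columns element-wise after casting to string,
# instead of calling a Python lambda once per row.
is_target = intrinsic_de["perturbation"].astype(str) == intrinsic_de["gene"].astype(str)
target_rows = intrinsic_de.loc[
    is_target,
    ["perturbation", "gene", "log2fc", "fdr", "mean_case", "mean_control"],
].copy()
print(target_rows["gene"].tolist())  # ['GENE1', 'GENE2']
```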

Copilot · 2026-04-28T18:56:34Z

+        raise ValueError("cell_group_path must contain at least 'cell_id' and 'group' columns.")
+    mapping = dict(zip(table["cell_id"].astype(str), table["group"].astype(str), strict=False))
+    matched = 0
+    cell_types = []
+    for cell_id, current in zip(adata.obs_names.astype(str), adata.obs["cell_type"].astype(str), strict=False):


zip(..., strict=False) is Python 3.10+ only. With requires-python >=3.9, calling this code on Python 3.9 will raise TypeError: zip() takes no keyword arguments. Either bump minimum Python to >=3.10 or replace these zip(..., strict=False) calls with plain zip(...) and validate lengths explicitly.

Copilot · 2026-04-28T18:56:34Z

+    if dst == dst.anchor:
+        raise ValueError(f"Refusing to replace filesystem root: {dst}")
+    if dst.exists():
+        shutil.rmtree(dst)
+    shutil.copytree(src, dst)


The root-safety check doesn’t work: dst.anchor is a string, so dst == dst.anchor is always False (e.g. Path('/') != '/'). With --replace-baseline, this can allow shutil.rmtree(dst) to delete a filesystem root if a bad path is provided. Compare against Path(dst.anchor) (or use dst.parent == dst) and consider an additional sentinel check before deletion.
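
A small sketch of a check that does catch the root, using the dst.parent == dst idea mentioned above. The helper names are hypothetical, and the sketch stops before any deletion so it is safe to run.

```python
from pathlib import Path


def is_filesystem_root(path: Path) -> bool:
    """Return True when the resolved path is its own parent, i.e. a root."""
    resolved = path.resolve()
    return resolved.parent == resolved


def check_replace_target(dst: Path) -> None:
    """Raise before any destructive call if dst is a filesystem root."""
    if is_filesystem_root(dst):
        raise ValueError(f"Refusing to replace filesystem root: {dst}")


root = Path(Path.cwd().anchor)
print(root == root.anchor)       # False: a Path never compares equal to a str
print(is_filesystem_root(root))  # True
print(is_filesystem_root(Path.cwd() / "reports" / "gse274058_reference_run"))  # False
```

Resolving first also covers inputs such as a relative path that happens to point at the root.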

Copilot · 2026-04-28T18:56:34Z

+        "scipy",
+        "seaborn",
+        "sklearn",
+        "spatialperturb",
+        "statsmodels",


The dependency-version probe uses importlib.metadata.version('sklearn'), but the installed distribution is typically named scikit-learn (while sklearn is just the import namespace). This makes the report incorrectly show scikit-learn as not-installed. Consider querying scikit-learn here (and optionally still label it as sklearn in the output).
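
A sketch of a probe that maps import names to distribution names before querying importlib.metadata. The mapping, the function name probe_version, and the "not-installed" label are assumptions modelled on the comment above, not the script's actual code.

```python
from importlib import metadata

# Import name -> distribution name; only names that differ need an entry.
DISTRIBUTION_NAMES = {"sklearn": "scikit-learn"}


def probe_version(import_name: str) -> str:
    """Return the installed version for an import name, or 'not-installed'."""
    dist_name = DISTRIBUTION_NAMES.get(import_name, import_name)
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "not-installed"


# The report stays keyed by import name, so existing output labels are unchanged.
report = {name: probe_version(name) for name in ["scipy", "seaborn", "sklearn", "statsmodels"]}
print(report)
```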

Copilot · 2026-04-28T18:56:35Z

+        manifest = {
+            "benchmark": "breast_reference_projection",
+            "dataset": dataset_name,
+            "generated_at": datetime.now(timezone.utc).isoformat(),


run_reference_projection_benchmark() always writes manifest['benchmark'] = 'breast_reference_projection', but _BENCHMARK_CATALOG also advertises a separate reference_projection benchmark. This makes the manifest misleading for non-breast runs and complicates downstream tooling that keys off the benchmark name. Consider using reference_projection here (or selecting based on config / reference_datasets).
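
A sketch of passing the benchmark name in rather than hard-coding it. KNOWN_BENCHMARKS is a hypothetical stand-in for the names in _BENCHMARK_CATALOG, and build_manifest and its benchmark parameter are illustrative; the real function's signature is not shown in this PR view.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the names advertised by _BENCHMARK_CATALOG.
KNOWN_BENCHMARKS = ("reference_projection", "breast_reference_projection")


def build_manifest(dataset_name: str, benchmark: str = "reference_projection") -> dict:
    """Build a manifest whose benchmark name is chosen by the caller."""
    if benchmark not in KNOWN_BENCHMARKS:
        raise ValueError(f"Unknown benchmark: {benchmark!r}")
    return {
        "benchmark": benchmark,
        "dataset": dataset_name,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }


manifest = build_manifest("GSE274058")
print(manifest["benchmark"])  # reference_projection
```

Breast-specific runs would then pass benchmark="breast_reference_projection" explicitly, so downstream tooling can rely on the name.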

hutaobo added 2 commits April 10, 2026 15:56

align package layout and metadata

e41839f

Prepare GSE274058 reference release artifacts

18a07c1

sourcery-ai Bot reviewed Apr 28, 2026

hutaobo added 3 commits April 28, 2026 20:43

Add A100 breast reference workflow

84c536a

Record A100 GSE274058 rerun confirmation

11dc68c

Normalize GSE274058 variable names on import

6cda2e8

hutaobo marked this pull request as ready for review April 28, 2026 18:50

Copilot AI review requested due to automatic review settings April 28, 2026 18:50

hutaobo merged commit 754fb4d into master Apr 28, 2026
1 check passed

Copilot started reviewing on behalf of hutaobo April 28, 2026 18:51

Copilot AI reviewed Apr 28, 2026

hutaobo merged 5 commits into master from codex/package-foundation-fixes
