Prepare GSE274058 reference release artifacts#1

Merged
hutaobo merged 5 commits into master from codex/package-foundation-fixes
Apr 28, 2026

@hutaobo (Owner) commented Apr 28, 2026

Summary

  • package and document the GSE274058 reference-side run for GitHub + RTD publication
  • add release packaging and A100 comparison scripts plus provenance-aware rerun support
  • stabilize the current SpatialPerturb benchmark/io/signature stack so tests and Sphinx builds pass

Validation

  • pytest -q
  • python -m sphinx -b html docs .tmp_sphinx_html
  • python scripts/package_gse274058_reference_release.py
  • python scripts/compare_gse274058_reference_runs.py --baseline-dir reports/gse274058_reference_run --candidate-dir reports/gse274058_reference_run --output-json artifacts/gse274058_reference_release/self_compare.json

Notes

  • public RTD latest and final GitHub release should still wait for merge plus the authoritative A100 rerun under /data/taobo.hu/SpatialPerturb
  • the current docs page points at the expected releases/latest/download/... asset URLs, so it will light up once the post-merge release is created

@sourcery-ai sourcery-ai Bot left a comment

Sorry @hutaobo, your pull request is larger than the review limit of 150000 diff characters

@hutaobo hutaobo marked this pull request as ready for review April 28, 2026 18:50
Copilot AI review requested due to automatic review settings April 28, 2026 18:50
@hutaobo hutaobo merged commit 754fb4d into master Apr 28, 2026
1 check passed

Copilot AI left a comment

Pull request overview

Packages and documents the GSE274058 (reference-side) release artifacts while stabilizing the SpatialPerturb benchmark / I/O / signature stack so pytest + Sphinx builds pass.

Changes:

  • Introduces a cohesive spatialperturb package API (schema, I/O, preprocessing, graphs, analysis tools, benchmarks, plotting, CLI).
  • Adds end-to-end benchmark + CLI + A100 workflow scripts plus release packaging/comparison utilities.
  • Adds broad unit/smoke test coverage and publishes rendered docs pages + precomputed reference-result summaries.

Reviewed changes

Copilot reviewed 62 out of 64 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| tests/test_tools.py | Exercises core analysis + plotting helpers on demo fixtures. |
| tests/test_smoke.py | CLI + dataset lifecycle smoke tests (datasets/benchmarks/prepare-xenium/reference benchmark). |
| tests/test_signatures.py | Tests signature/program building, scoring, and aggregation helpers. |
| tests/test_schema.py | Tests schema default filling + validation. |
| tests/test_pp.py | Tests perturbation assignment + QC tables. |
| tests/test_io.py | Tests from_tables, Xenium/Stereo-seq directory readers, ROI/cell-group annotations, and 10x H5 reader. |
| tests/test_gr.py | Tests spatial graph construction and neighbor collection behavior. |
| tests/test_datasets.py | Tests dataset preparation parsing for GSE241115 breast CROP-seq (tar + sparse + control inputs). |
| tests/test_benchmarks.py | Tests core benchmark reporting and reference projection benchmark outputs. |
| tests/test_a100_workflow_scripts.py | Validates A100 workflow helper scripts via import+in-memory fixtures. |
| tests/conftest.py | Adds demo dataset fixtures and ensures src/ importability in tests. |
| src/spatialperturb/tl.py | Core analysis tools: intrinsic/neighbor DE, LR scoring, concordance, and power curve. |
| src/spatialperturb/signatures.py | Program/signature derivation, scoring, neighbor scoring, aggregation, reference program building. |
| src/spatialperturb/schema.py | Defines/validates required AnnData schema + provenance metadata. |
| src/spatialperturb/resources/convert_seurat_to_tables.R | R helper to export Seurat objects into tables/MTX for ingestion. |
| src/spatialperturb/reports.py | Paper-figure rendering helper and manifest writing for benchmark outputs. |
| src/spatialperturb/py.typed | Marks the package as typed for type checkers. |
| src/spatialperturb/pp.py | Perturbation assignment from barcode features + QC summaries. |
| src/spatialperturb/pl.py | Plotting utilities for benchmark figures. |
| src/spatialperturb/io.py | AnnData construction and readers for Xenium/Stereo-seq-style exports and 10x H5 matrices; ROI/cell-group annotation. |
| src/spatialperturb/gr.py | Spatial graph construction and neighbor-edge collection. |
| src/spatialperturb/cli.py | Typer-based CLI exposing dataset lifecycle and benchmark workflows. |
| src/spatialperturb/benchmarks.py | Benchmark orchestration, report generation, manifests, and reference projection pipeline. |
| src/spatialperturb/_utils.py | Shared utilities: matrix extraction, BH correction, log2fc, dict merge. |
| src/spatialperturb/__init__.py | Public API surface + module re-exports; bumps version to 0.3.0. |
| src/SpatialPerturb/signatures.py | Removes legacy duplicate module. |
| src/SpatialPerturb/cli.py | Removes legacy duplicate module. |
| src/SpatialPerturb/__init__.py | Removes legacy duplicate module. |
| scripts/run_gse274058_reference.py | Script to run and summarize the GSE274058 reference-side analysis. |
| scripts/run_breast_reference_projection.py | A100-friendly breast Xenium reference projection workflow runner. |
| scripts/package_gse274058_reference_release.py | Packages reference run outputs into release assets + docs-ready summaries. |
| scripts/interpret_breast_reference_projection.py | Produces a biological interpretation markdown + “top programs” tables. |
| scripts/compare_gse274058_reference_runs.py | Compares two run directories and optionally replaces baseline. |
| scripts/a100_sync_xenium_minimal.ps1 | Syncs minimal Xenium inputs to the remote A100 host. |
| scripts/a100_setup_env.sh | Creates/updates an A100 venv and writes environment status JSON. |
| scripts/a100_run_breast_reference_projection.sh | End-to-end A100 runner (git checkout, env setup, run workflow, monitor). |
| scripts/a100_monitor_status.py | Generates JSON/Markdown status snapshots for the A100 run directory. |
| scripts/a100_monitor_breast_reference_projection.sh | Convenience wrapper to run the monitor once or in watch mode. |
| pyproject.toml | Updates package metadata/version and dependencies (adds statsmodels/h5py, extras, script entrypoint). |
| mkdocs.yml | Adds nav entries for new workflow/benchmark/reference-result docs pages. |
| docs/workflow.md | Documents the standardized workflow and example code snippets. |
| docs/results/gse274058_reference/valid_perturbations.tsv | Published reference-run table artifact. |
| docs/results/gse274058_reference/valid_perturbations.md | Rendered markdown table for docs inclusion. |
| docs/results/gse274058_reference/top_hits.tsv | Published reference-run table artifact. |
| docs/results/gse274058_reference/top_hits.md | Rendered markdown table for docs inclusion. |
| docs/results/gse274058_reference/target_gene_sanity.md | Rendered target-gene sanity table for docs inclusion. |
| docs/results/gse274058_reference/qc_summary.md | Rendered QC summary snippet for docs inclusion. |
| docs/results/gse274058_reference/program_summary.tsv | Published reference-run table artifact. |
| docs/results/gse274058_reference/program_summary.md | Rendered markdown table for docs inclusion. |
| docs/results/gse274058_reference/overview.md | Rendered provenance overview snippet for docs inclusion. |
| docs/results/gse274058_reference/improvement.md | Rendered “how to improve” snippet for docs inclusion. |
| docs/results/gse274058_reference/dataset_summary.json | Published reference-run provenance JSON artifact. |
| docs/results/gse274058_reference/a100_status.md | Rendered A100 confirmation status snippet for docs inclusion. |
| docs/requirements.txt | Adjusts doc build requirements. |
| docs/paper-repro.md | Documents paper-grade CLI/Python reproduction workflow. |
| docs/index.md | Updates docs landing page and Sphinx toctree. |
| docs/gse274058-reference-results.md | Adds a docs page publishing the GSE274058 reference run assets + summaries. |
| docs/benchmarks.md | Documents benchmark tracks and expected outputs. |
| docs/api.md | Adds Sphinx autodoc API reference pages. |
| README.md | Updates README to match new API/CLI/benchmark workflows. |
| CITATION.cff | Updates version and repository URL. |
| .gitignore | Ignores caches, reports, artifacts, and temp Sphinx build output. |

Comment thread src/spatialperturb/gr.py
Comment on lines +101 to +103
neighbors = []
for neighbor_idx, distance in zip(row.indices, row.data, strict=False):
    if not include_self and neighbor_idx == idx:

Copilot AI Apr 28, 2026

zip(..., strict=False) requires Python 3.10+. With requires-python = ">=3.9" in pyproject.toml, this will raise TypeError: zip() takes no keyword arguments on Python 3.9. Either bump the minimum supported Python version to >=3.10 or remove the strict= usage here (and elsewhere).
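
Because strict=False is the default truncating behavior, a plain zip(...) is a drop-in replacement on Python 3.9. Where a length mismatch should be an error, the check can be made explicit. The sketch below is illustrative only: `zip_checked` is a hypothetical helper, not something defined in spatialperturb.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def zip_checked(left: Sequence[A], right: Sequence[B]) -> Iterator[Tuple[A, B]]:
    """Pair two sequences on Python 3.9+, raising if their lengths differ.

    Hypothetical helper: behaves like zip(..., strict=True) without the keyword.
    """
    if len(left) != len(right):
        raise ValueError(f"length mismatch: {len(left)} != {len(right)}")
    # Plain zip() has no keyword arguments, so it works on every supported version.
    return zip(left, right)


pairs = list(zip_checked([3, 7], [0.5, 1.25]))
```

If truncation is acceptable, as strict=False implies, dropping the keyword alone is enough and no helper is needed.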

Copilot uses AI. Check for mistakes.
Comment on lines +26 to +28
x_positions = [0.08, 0.24, 0.40, 0.58, 0.76, 0.92]
for idx, ((title, body), x) in enumerate(zip(steps, x_positions, strict=False)):
    ax.text(

Copilot AI Apr 28, 2026

zip(..., strict=False) requires Python 3.10+. With requires-python >=3.9, this call will raise TypeError on Python 3.9 as soon as the code runs. Either bump minimum Python to >=3.10 or remove the strict= argument here.

Comment on lines +137 to +141
if not isinstance(group_values, tuple):
    group_values = (group_values,)
group_label = " | ".join(f"{column}={value}" for column, value in zip(group_cols, group_values, strict=False))
n_cells = int(len(frame))
means = frame.loc[:, scores.columns].mean(axis=0)

Copilot AI Apr 28, 2026

zip(..., strict=False) is only available in Python 3.10+. Since pyproject.toml declares requires-python >=3.9, this will crash on Python 3.9. Either bump the minimum Python version to >=3.10 or replace zip(..., strict=False) with plain zip(...) and validate lengths explicitly if needed.

Comment on lines +90 to +92
scores = pd.DataFrame(raw_scores, index=adata.obs_names.astype(str))
scores.index = adata.obs_names.astype(str)
scores.columns = scores.columns.astype(str)

Copilot AI Apr 28, 2026

When a user passes a DataFrame for score_key, scores.index = adata.obs_names.astype(str) overwrites its index and can silently misalign rows if the DataFrame isn’t already in the exact same order as adata.obs_names. Prefer scores = scores.reindex(adata.obs_names.astype(str)) (and error if labels are missing/extra) instead of blindly replacing the index.
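
The reindex-and-validate approach could look roughly like the sketch below. `align_scores` is a hypothetical helper and the cell labels are made up for illustration; it is not the code in signatures.py.

```python
import pandas as pd


def align_scores(scores: pd.DataFrame, obs_names) -> pd.DataFrame:
    """Reorder `scores` to match `obs_names`, failing on missing or extra labels."""
    expected = pd.Index(obs_names).astype(str)
    scores = scores.copy()
    scores.index = scores.index.astype(str)
    # Report label mismatches instead of silently overwriting the index.
    missing = expected.difference(scores.index)
    extra = scores.index.difference(expected)
    if len(missing) or len(extra):
        raise ValueError(f"index mismatch: missing={list(missing)}, extra={list(extra)}")
    return scores.reindex(expected)


# Rows arrive in a different order than the (made-up) observation names.
shuffled = pd.DataFrame({"program_a": [0.3, 0.1, 0.2]}, index=["c3", "c1", "c2"])
aligned = align_scores(shuffled, ["c1", "c2", "c3"])
```

Assigning the index directly would have labeled the value 0.3 as belonging to c1; reindexing keeps each value attached to its own label.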

Comment on lines +175 to +178
target_rows = intrinsic_de.loc[
    intrinsic_de.apply(lambda row: str(row["perturbation"]) == str(row["gene"]), axis=1),
    ["perturbation", "gene", "log2fc", "fdr", "mean_case", "mean_control"],
].copy()

Copilot AI Apr 28, 2026

intrinsic_de.apply(lambda row: ..., axis=1) is row-wise and can be noticeably slow on larger DE tables. This filter can be vectorized (e.g., compare the perturbation and gene columns directly after casting to string) to keep packaging fast and avoid unnecessary CPU time.
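
A vectorized version of the filter might look like the following sketch. The table contents are invented purely to make the example runnable, and only a subset of the columns from the excerpt is shown.

```python
import pandas as pd

# Toy stand-in for the DE table; gene names and values are illustrative only.
intrinsic_de = pd.DataFrame(
    {
        "perturbation": ["GENE_A", "GENE_A", "GENE_B"],
        "gene": ["GENE_A", "GENE_C", "GENE_B"],
        "log2fc": [-1.5, 0.2, -0.8],
    }
)

# Compare the two columns element-wise instead of calling a lambda per row.
is_target = intrinsic_de["perturbation"].astype(str) == intrinsic_de["gene"].astype(str)
target_rows = intrinsic_de.loc[is_target, ["perturbation", "gene", "log2fc"]].copy()
```

The boolean mask selects the same rows as the row-wise apply, but the comparison runs over whole columns rather than once per row in Python.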

Comment thread src/spatialperturb/io.py
Comment on lines +252 to +256
    raise ValueError("cell_group_path must contain at least 'cell_id' and 'group' columns.")
mapping = dict(zip(table["cell_id"].astype(str), table["group"].astype(str), strict=False))
matched = 0
cell_types = []
for cell_id, current in zip(adata.obs_names.astype(str), adata.obs["cell_type"].astype(str), strict=False):

Copilot AI Apr 28, 2026

zip(..., strict=False) is Python 3.10+ only. With requires-python >=3.9, calling this code on Python 3.9 will raise TypeError: zip() takes no keyword arguments. Either bump minimum Python to >=3.10 or replace these zip(..., strict=False) calls with plain zip(...) and validate lengths explicitly.

Comment on lines +76 to +80
if dst == dst.anchor:
    raise ValueError(f"Refusing to replace filesystem root: {dst}")
if dst.exists():
    shutil.rmtree(dst)
shutil.copytree(src, dst)

Copilot AI Apr 28, 2026

The root-safety check doesn’t work: dst.anchor is a string, so dst == dst.anchor is always False (e.g. Path('/') != '/'). With --replace-baseline, this can allow shutil.rmtree(dst) to delete a filesystem root if a bad path is provided. Compare against Path(dst.anchor) (or use dst.parent == dst) and consider an additional sentinel check before deletion.
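
A Path-to-Path comparison along the lines suggested above could be sketched as follows. `is_filesystem_root` is a hypothetical name and the "reports" directory is only an example path; the snippet is a minimal illustration, not the script's actual guard.

```python
from pathlib import Path


def is_filesystem_root(path: Path) -> bool:
    """Return True when `path` resolves to a filesystem root such as '/'."""
    resolved = path.resolve()
    # A root is its own parent; comparing Path to Path avoids the str mismatch.
    return resolved.parent == resolved


root = Path(Path.cwd().anchor)
# Comparing a Path with its str anchor is always False, which is the reported bug.
broken_check = root == root.anchor
working_check = is_filesystem_root(root)
```

Resolving first also catches inputs such as a relative path that happens to point at the root.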

Comment on lines +127 to +131
"scipy",
"seaborn",
"sklearn",
"spatialperturb",
"statsmodels",

Copilot AI Apr 28, 2026

The dependency-version probe uses importlib.metadata.version('sklearn'), but the installed distribution is typically named scikit-learn (while sklearn is just the import namespace). This makes the report incorrectly show scikit-learn as not-installed. Consider querying scikit-learn here (and optionally still label it as sklearn in the output).
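
One possible shape for such a probe is sketched below, using only `importlib.metadata.version` and `PackageNotFoundError` from the standard library. The mapping and the `probe_version` name are assumptions for illustration; only the sklearn to scikit-learn pair comes from the comment above.

```python
from importlib import metadata

# Import names whose distribution name differs; extend as needed (assumption).
DISTRIBUTION_NAMES = {"sklearn": "scikit-learn"}


def probe_version(import_name: str) -> str:
    """Return the installed version for `import_name`, or 'not-installed'."""
    dist_name = DISTRIBUTION_NAMES.get(import_name, import_name)
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "not-installed"


# The report stays keyed by import name, while the lookup uses the distribution name.
report = {name: probe_version(name) for name in ("sklearn", "hypothetical-missing-package")}
```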

Comment on lines +533 to +536
manifest = {
    "benchmark": "breast_reference_projection",
    "dataset": dataset_name,
    "generated_at": datetime.now(timezone.utc).isoformat(),

Copilot AI Apr 28, 2026

run_reference_projection_benchmark() always writes manifest['benchmark'] = 'breast_reference_projection', but _BENCHMARK_CATALOG also advertises a separate reference_projection benchmark. This makes the manifest misleading for non-breast runs and complicates downstream tooling that keys off the benchmark name. Consider using reference_projection here (or selecting based on config / reference_datasets).
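
Passing the benchmark name in from the caller is one way to avoid the hardcoded value. In the sketch below, `build_manifest` and "example_dataset" are hypothetical; only the three manifest keys are taken from the excerpt above, and how the real function would pick the name (config or reference_datasets) is left open.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def build_manifest(benchmark: str, dataset_name: str) -> Dict[str, Any]:
    """Build a manifest header whose benchmark name is supplied by the caller."""
    return {
        "benchmark": benchmark,
        "dataset": dataset_name,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }


# A non-breast run records the generic benchmark name instead of a breast-specific one.
manifest = build_manifest("reference_projection", "example_dataset")
```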
