From b051462d3fbffbb8a948bd8c2c5042872c2b0069 Mon Sep 17 00:00:00 2001 From: d33bs Date: Tue, 2 Dec 2025 09:42:16 -0700 Subject: [PATCH 01/13] expand tutorials --- docs/source/index.md | 1 + docs/source/software_engineering.md | 122 ++++++++++++++++++ docs/source/tutorial.md | 34 +++-- .../cellprofiler_sqlite_to_parquet.md | 107 +++++++++++++++ 4 files changed, 251 insertions(+), 13 deletions(-) create mode 100644 docs/source/software_engineering.md create mode 100644 docs/source/tutorials/cellprofiler_sqlite_to_parquet.md diff --git a/docs/source/index.md b/docs/source/index.md index f7bb0da7..492ab310 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -20,4 +20,5 @@ contributing Code of Conduct architecture python-api +software_engineering ``` diff --git a/docs/source/software_engineering.md b/docs/source/software_engineering.md new file mode 100644 index 00000000..14146f0e --- /dev/null +++ b/docs/source/software_engineering.md @@ -0,0 +1,122 @@ +# Software Engineering Guide + +This page is for engineers and power users who want to tune CytoTable beyond the narrative tutorials. It focuses on performance, reliability, and integration patterns. + +## Performance and scaling + +- **Chunk size (`chunk_size`)**: Larger chunks reduce overhead but increase peak memory. Start at 30k (default in examples), adjust down for memory-constrained environments, up for fast disks/large RAM. +- **Threads (DuckDB)**: We set `PRAGMA threads` based on `cytotable.constants.MAX_THREADS`. Override via env var `CYTOTABLE_MAX_THREADS` to align with container CPU limits. +- **I/O locality**: For remote SQLite/NPZ, always set `local_cache_dir` to a stable, non-tmpfs path. Reuse the cache across runs to avoid redundant downloads. + +Example: tuned convert with explicit threads and chunk size + +```python +import os +import cytotable + +os.environ["CYTOTABLE_MAX_THREADS"] = "4" + +cytotable.convert( + source_path="s3://my-bucket/plate.sqlite", + source_datatype="sqlite", + dest_path="./out/plate", + dest_datatype="parquet", + preset="cellprofiler_sqlite", + local_cache_dir="./cache/sqlite", + chunk_size=50000, # larger chunks, more RAM, faster on beefy nodes + no_sign_request=True, +) +``` + +## Cloud paths and auth + +- **Unsigned/public S3**: use `no_sign_request=True`. This keeps DuckDB + cloudpathlib using unsigned clients consistently. +- **Signed/private S3**: rely on ambient AWS creds or pass `profile_name`, `aws_access_key_id`, `aws_secret_access_key`, `aws_session_token`. These kwargs flow into cloudpathlib’s client via `_build_path`. +- **GCS/Azure**: supported through cloudpathlib; pass provider-specific kwargs the same way you would construct the CloudPath client. + +Signed S3 example with a specific profile + +```python +import cytotable + +cytotable.convert( + source_path="s3://my-private-bucket/exports/plate.sqlite", + source_datatype="sqlite", + dest_path="./out/private-plate", + dest_datatype="parquet", + preset="cellprofiler_sqlite", + local_cache_dir="./cache/private", + profile_name="science-prod", +) +``` + +## Data layout and presets + +- Prefer presets when available (for example, `cellprofiler_sqlite_cpg0016_jump`, `cellprofiler_csv`) because they set table names and page keys. For custom layouts, pass `targets=[...]` and `page_keys={...}` to `convert`. +- Multi-plate runs: point `source_path` to a parent directory; CytoTable will glob and group per-table. Keep per-run `dest_path` directories to avoid mixing outputs. 
+ +Custom layout example with explicit targets and page keys + +```python +import cytotable + +cytotable.convert( + source_path="/data/plates/", + source_datatype="sqlite", + dest_path="./out/plates", + dest_datatype="parquet", + targets=["cells", "cytoplasm", "nuclei"], # which tables to include + page_keys={"cells": "ImageNumber", "cytoplasm": "ImageNumber", "nuclei": "ImageNumber"}, + add_tablenumber=True, + chunk_size=20000, +) +``` + +## Reliability tips + +- **Stable cache**: If you see “unable to open database file” on cloud SQLite, ensure `local_cache_dir` is set and writable. DuckDB reads from the cached path. +- **Disk space**: Parquet output size ~10–30% of CSV; SQLite is denser. Ensure the cache volume can hold both the source and outputs simultaneously. +- **Restartability**: `dest_path` is overwritten per run; use unique destination directories for incremental runs to avoid partial-output confusion. + +## Testing and CI entry points + +- Unit tests live under `tests/`; sample datasets are in `tests/data/`. Add targeted fixtures when introducing new formats/presets. +- For quick smoke tests, run `python -m pytest tests/test_convert_threaded.py -k convert` and a docs build `sphinx-build docs/source docs/build` to ensure examples render. +- Keep new presets documented in `docs/source/overview.md` and mention edge cases (auth, cache, table naming). + +Smoke-test commands + +```bash +python -m pytest tests/test_convert_threaded.py -k convert +sphinx-build docs/source docs/build +``` + +## Embedding CytoTable in pipelines + +- **Python API**: `cytotable.convert(...)` is synchronous; wrap in your workflow engine (Airflow, Prefect, Nextflow via Python) as a task step. +- **CLI wrapper**: not bundled; if you add one, surface the same flags as `convert` and mirror logging levels. +- **Logging**: uses the standard logging system. Set `CYTOTABLE_LOG_LEVEL=INFO` (or `DEBUG`) in container/CI to capture more detail during runs. + +Simple function you can call from any orchestrator (Airflow task, Nextflow Python, shell) + +```python +import cytotable + +def run_cytotable(source, dest, cache): + return cytotable.convert( + source_path=source, + source_datatype="sqlite", + dest_path=dest, + dest_datatype="parquet", + preset="cellprofiler_sqlite", + local_cache_dir=cache, + chunk_size=30000, + ) + +if __name__ == "__main__": + run_cytotable( + "s3://my-bucket/plate.sqlite", + "./out/plate", + "./cache/sqlite", + ) +``` diff --git a/docs/source/tutorial.md b/docs/source/tutorial.md index 6a30eb3f..43600eca 100644 --- a/docs/source/tutorial.md +++ b/docs/source/tutorial.md @@ -1,24 +1,33 @@ -# Tutorial +# Tutorials and How-to Guides -This page covers brief tutorials and notes on how to use CytoTable. +Start here if you are new to CytoTable. We’ve split material by audience: -## CellProfiler CSV Output to Parquet +- **Image analysts (no engineering background required):** follow the narrative tutorials below. They include downloadable data, exact commands, and what to expect. +- **Engineers / power users:** see the Software Engineering Guide for tuning and integration details, or use the quick recipe below. -[CellProfiler](https://cellprofiler.org/) pipelines or projects may produce various CSV-based compartment output (for example, "Cells.csv", "Cytoplasm.csv", etc.). -CytoTable converts this data to Parquet from local or object-storage based locations. 
+```{toctree} +--- +maxdepth: 2 +caption: Narrative tutorials (start here) +--- +tutorials/cellprofiler_sqlite_to_parquet +software_engineering +``` + +## Quick how-to: CellProfiler CSV to Parquet (recipe) + +This short recipe is for people comfortable with Python/CLI and parallels our older tutorial. If you prefer a guided, narrative walkthrough with downloadable inputs and expected outputs, use the tutorial above. -Files with similar names nested within sub-folders will be concatenated by default (appended to the end of each data file) together and used to create a single Parquet file per compartment. -For example: if we have `folder/subfolder_a/cells.csv` and `folder/subfolder_b/cells.csv`, using `convert(source_path="folder", ...)` will result in `folder.cells.parquet` (unless `concat=False`). +[CellProfiler](https://cellprofiler.org/) exports compartment CSVs (for example, "Cells.csv", "Cytoplasm.csv"). CytoTable converts this data to Parquet from local or object-storage locations. -Note: The `dest_path` parameter (`convert(dest_path="")`) will be used for intermediary data work and must be a new file or directory path. -This path will result directory output on `join=False` and a single file output on `join=True`. +Files with similar names nested within sub-folders are concatenated by default (for example, `folder/sub_a/cells.csv` and `folder/sub_b/cells.csv` become a single `folder.cells.parquet` unless `concat=False`). -For example, see below: +The `dest_path` parameter is used for intermediary work and must be a new file or directory path. It will be a directory when `join=False` and a single file when `join=True`. ```python from cytotable import convert -# using a local path with cellprofiler csv presets +# Local CSVs with CellProfiler preset convert( source_path="./tests/data/cellprofiler/ExampleHuman", source_datatype="csv", @@ -27,8 +36,7 @@ convert( preset="cellprofiler_csv", ) -# using an s3-compatible path with no signature for client -# and cellprofiler csv presets +# S3 CSVs (unsigned) with CellProfiler preset convert( source_path="s3://s3path", source_datatype="csv", diff --git a/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md b/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md new file mode 100644 index 00000000..ebe51529 --- /dev/null +++ b/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md @@ -0,0 +1,107 @@ +# Tutorial: CellProfiler SQLite on S3 to Parquet + +A narrative, start-to-finish walkthrough for image analysts who want a working Parquet export from a CellProfiler SQLite file stored in the cloud. + +## What you will accomplish + +- Pull a CellProfiler SQLite file directly from S3 (unsigned/public) and convert each compartment table to Parquet. +- Keep a persistent local cache so the download is reused and avoids “file vanished” errors on temp disks. +- Verify the outputs quickly (file names and row counts) without needing to understand the internals. + +## Inputs and outputs + +- **Input:** A single-plate CellProfiler SQLite file from the open Cell Painting Gallery + `s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite` + No credentials are required (`no_sign_request=True`). +- **Output:** Four Parquet files (Image, Cells, Cytoplasm, Nuclei) in `./outputs/br00126114`. + +## Before you start + +- Install Cytotable (and DuckDB is bundled): + `pip install cytotable` +- Make sure you have enough local disk space (~1–2 GB) for the cached SQLite and Parquet outputs. 
+- If you prefer to download the file first, you can also `aws s3 cp` the same path locally, then set `source_path` to the local file and drop `no_sign_request`. + +## Step 1: define your paths + +```bash +export SOURCE_PATH="s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite" +export DEST_PATH="./outputs/br00126114" +export CACHE_DIR="./sqlite_s3_cache" +mkdir -p "$DEST_PATH" "$CACHE_DIR" +``` + +## Step 2: run the conversion (minimal Python) + +```python +import os +import cytotable + +# If you used the bash exports above: +SOURCE_PATH = os.environ["SOURCE_PATH"] +DEST_PATH = os.environ["DEST_PATH"] +CACHE_DIR = os.environ["CACHE_DIR"] + +# (Alternatively, set them directly as strings in Python.) + +result = cytotable.convert( + source_path=SOURCE_PATH, + source_datatype="sqlite", + dest_path=DEST_PATH, + dest_datatype="parquet", + # Preset matches common CellProfiler SQLite layout from the Cell Painting Gallery + preset="cellprofiler_sqlite_cpg0016_jump", + # Use a cache directory you control so the downloaded SQLite is reusable + local_cache_dir=CACHE_DIR, + # This dataset is public; unsigned requests avoid credential prompts + no_sign_request=True, + # Reasonable chunking for large tables; adjust up/down if you hit memory limits + chunk_size=30000, +) + +print(result) +``` + +Why these flags matter (in plain language): + +- `local_cache_dir`: keeps the downloaded SQLite file somewhere predictable so DuckDB can open it reliably. +- `preset`: selects the right table names and page keys for this dataset. +- `chunk_size`: processes data in pieces so you don’t need excessive RAM. +- `no_sign_request`: needed because the sample bucket is public and unsigned. + +## Step 3: check that the outputs look right + +You should see four Parquet files in the destination directory: + +```bash +ls "$DEST_PATH" +# Image.parquet Cells.parquet Cytoplasm.parquet Nuclei.parquet +``` + +Quick sanity-check on row counts (uses DuckDB for convenience): + +```python +import duckdb, os + +dest = os.environ["DEST_PATH"] +for table in ["Image", "Cells", "Cytoplasm", "Nuclei"]: + count = duckdb.query( + f"select count(*) from parquet_scan('{dest}/{table}.parquet')" + ).fetchone()[0] + print(f"{table}: {count} rows") +``` + +Counts will vary by plate, but non-zero values confirm the export worked. If you want to peek at schema, run `duckdb.query("describe select * from parquet_scan(...) limit 0")`. + +## Common adjustments for your own data + +- **Local SQLite file:** set `source_path` to the local file, remove `no_sign_request`, keep `local_cache_dir` if you want a stable working copy. +- **Different table names/compartments:** provide `targets=["cells", "cytoplasm", ...]` or use another preset (`preset="cellprofiler_sqlite"`). +- **Multiple plates in one folder:** point `source_path` to the folder; Cytotable will glob and merge matching tables. Keep a per-run `dest_path` to avoid mixing outputs. +- **Tight disk space:** point `local_cache_dir` to a larger external volume or clean it after the run finishes. + +## What success looks like + +- A stable local cache of the SQLite file remains in `CACHE_DIR` (useful for repeated runs). +- Four Parquet files exist in `DEST_PATH` and can be read by DuckDB/Pandas/PyArrow. +- No temporary-file or “unable to open database file” errors occur during the run. 
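+
+As one final optional check, here is a minimal sketch for previewing the outputs with pandas (an assumption: pandas plus a Parquet engine such as pyarrow are installed separately; they are not CytoTable requirements):
+
+```python
+import pathlib
+
+import pandas as pd
+
+# preview each Parquet export under the destination directory
+for path in sorted(pathlib.Path("./outputs/br00126114").glob("*.parquet")):
+    df = pd.read_parquet(path)
+    print(path.name, df.shape)
+```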
From fb03f9c627e803e518ef381538c8472f912f6b32 Mon Sep 17 00:00:00 2001 From: d33bs Date: Tue, 2 Dec 2025 14:55:39 -0700 Subject: [PATCH 02/13] updates to tutorials and language --- ...table_mise_en_place_general_overview.ipynb | 47 ++++++------ docs/source/overview.md | 1 + docs/source/software_engineering.md | 5 ++ docs/source/tutorial.md | 16 +++- .../cellprofiler_sqlite_to_parquet.md | 37 ++++----- .../multi_plate_merge_tablenumber.md | 74 ++++++++++++++++++ .../tutorials/npz_embeddings_to_parquet.md | 75 +++++++++++++++++++ 7 files changed, 207 insertions(+), 48 deletions(-) create mode 100644 docs/source/tutorials/multi_plate_merge_tablenumber.md create mode 100644 docs/source/tutorials/npz_embeddings_to_parquet.md diff --git a/docs/jupyter_execute/examples/cytotable_mise_en_place_general_overview.ipynb b/docs/jupyter_execute/examples/cytotable_mise_en_place_general_overview.ipynb index b04a604d..e6e5655d 100644 --- a/docs/jupyter_execute/examples/cytotable_mise_en_place_general_overview.ipynb +++ b/docs/jupyter_execute/examples/cytotable_mise_en_place_general_overview.ipynb @@ -7,7 +7,7 @@ "source": [ "# CytoTable mise en place (general overview)\n", "\n", - "This notebook includes a quick demonstration of CytoTable to help you understand the basics of using the package and the biological basis of each step.\n", + "This notebook will help you understand the basics of using CytoTable and the biological basis of each step.\n", "We provide a high-level overview of the related concepts to give greater context about where and how the data are changed in order to gain new insights.\n", "\n", "The name of the notebook comes from the french _mise en place_:\n", @@ -89,17 +89,18 @@ "id": "832c700f-63e0-4f22-853c-9bf6d5328a5c", "metadata": {}, "source": [ - "## Phase 1: Cells are stained and images are captured by microscopes\n", + "## Phase 1: Cells are imaged by microscopes, with optional fluorescence staining\n", "\n", "![Image showing cells being stained and captured as images using a microscope.](../_static/cell_to_image.png)\n", "\n", - "__Figure 1.__ _Cells are stained in order to highlight cellular compartments and organelles. Microscopes are used to observe and capture data for later use._\n", + "__Figure 1.__ _A microscope images cells to highlight cell processes. Often, fluorescence dyes paint the cells to mark specific proteins, compartments, and/or organelles._\n", "\n", - "CytoTable uses data created from multiple upstream steps involving images of \n", - "stained biological objects (typically cells).\n", - "Cells are cultured in multi-well plates, perturbed, and then fixed before being stained with a panel of six fluorescent dyes that highlight key cellular compartments and organelles, including the nucleus, nucleoli/RNA, endoplasmic reticulum, mitochondria, actin cytoskeleton, Golgi apparatus, and plasma membrane. 
These multiplexed stains are imaged across fluorescence channels using automated high-content microscopy, producing rich images that capture the morphology of individual cells for downstream analysis ([Bray et al., 2016](https://doi.org/10.1038/nprot.2016.105); [Gustafsdottir et al., 2013](https://doi.org/10.1371/journal.pone.0080999)).\n",
+    "CytoTable processes microscopy-based data that are created from multiple upstream steps.\n",
+    "CytoTable does not require any specific sample preparation, and can work with any microscopy experimental design.\n",
+    "However, most often, CytoTable processes fluorescence microscopy images from the Cell Painting assay.\n",
+    "In the Cell Painting assay, scientists stain cells with a panel of six fluorescent dyes that mark key cellular compartments and organelles, including the nucleus, nucleoli/RNA, endoplasmic reticulum, mitochondria, actin cytoskeleton, Golgi apparatus, and plasma membrane ([Bray et al., 2016](https://doi.org/10.1038/nprot.2016.105); [Gustafsdottir et al., 2013](https://doi.org/10.1371/journal.pone.0080999)). Scientists then use microscopes to image these cells across fluorescence channels, and use image analysis software to produce high-content morphology profiles of individual cells for downstream analysis.\n",
     "\n",
-    "We use the ExampleHuman dataset provided from CellProfiler Examples ([Moffat et al., 2006](https://doi.org/10.1016/j.cell.2006.01.040), [CellProfiler Examples Link](https://github.com/CellProfiler/examples/tree/master/ExampleHuman)) to help describe this process below."
+    "We use the ExampleHuman dataset provided from CellProfiler Examples ([Moffat et al., 2006](https://doi.org/10.1016/j.cell.2006.01.040), [CellProfiler Examples Link](https://github.com/CellProfiler/examples/tree/master/ExampleHuman)) to describe this process below."
   ]
  },
 {
@@ -185,17 +186,17 @@
 "id": "23897ed5-53aa-41a2-a8b2-494498045262",
 "metadata": {},
 "source": [
-    "## Phase 2: Images are segmented to build numeric feature datasets via CellProfiler\n",
+    "## Phase 2: CellProfiler segments cells and measures numeric features\n",
     "\n",
     "![Image showing CellProfiler being used to create image segmentations, measurements, and exporting numeric feature data to a file.](../_static/image_to_features.png)\n",
     "\n",
     "\n",
-    "__Figure 2.__ _CellProfiler is configured to use images and performs segmentation to evaluate numeric representations of cells. This data is captured for later use in tabular file formats such as CSV or SQLite tables._\n",
+    "__Figure 2.__ _CellProfiler takes in microscopy images and performs single-cell segmentation to distinguish cells from background. CellProfiler then measures \"hand-engineered\" computer vision features from every single cell. These data are captured for later use in a CSV table or SQLite database._\n",
     "\n",
-    "After acquisition, the multiplexed images are processed using image-analysis software such as CellProfiler, which segments cells and their compartments into distinct regions of interest. From these segmented images, hundreds to thousands of quantitative features are extracted per cell, capturing properties such as size, shape, intensity, texture, and spatial organization.\n",
+    "After acquisition, scientists process the images using image-analysis software such as CellProfiler. CellProfiler segments single cells and their biological compartments into distinct regions of interest. 
From these segmented cells, CellProfiler extracts hundreds to thousands of quantitative features per cell, capturing properties such as size, shape, intensity, texture, and spatial organization.\n",
     "These high-dimensional feature datasets provide a numerical representation of cell morphology that serves as the foundation for downstream profiling and analysis ([Carpenter et al., 2006](https://doi.org/10.1186/gb-2006-7-10-r100)).\n",
     "\n",
-    "CellProfiler was used in conjunction with the `.cppipe` file to produce the following images and data tables from the ExampleHuman dataset."
+    "We use CellProfiler (with a prespecified configuration `.cppipe` file) to produce the following images and data tables from the ExampleHuman dataset."
   ]
  },
 {
@@ -1266,7 +1267,7 @@
 }
 ],
 "source": [
-    "# show the tables generated from the resulting CSV files\n",
+    "# show the tables generated from the resulting CSV files\n",
     "for profiles in pathlib.Path(source_path).glob(\"*.csv\"):\n",
     "    print(f\"\\nProfiles from CellProfiler: {profiles}\")\n",
     "    display(pd.read_csv(profiles).head())"
 ]
},
@@ -1278,13 +1279,13 @@
 "id": "5f5b7cd6-9511-4349-bacf-e6304a099025",
 "metadata": {},
 "source": [
-    "## Phase 3: Numeric feature datasets from CellProfiler are harmonized by CytoTable\n",
+    "## Phase 3: CytoTable harmonizes the feature datasets that CellProfiler generates\n",
     "\n",
     "![Image showing feature data being read by CytoTable and exported to a CytoTable file.](../_static/features_to_cytotable.png)\n",
     "\n",
-    "The high-dimensional feature tables produced by CellProfiler often vary in format depending on the imaging pipeline, experiment, or storage system. CytoTable standardizes these single-cell morphology datasets by harmonizing outputs into consistent, analysis-ready formats such as Parquet or AnnData. This unification ensures that data from diverse experiments can be readily integrated and processed by downstream profiling tools like Pycytominer or coSMicQC, enabling scalable and reproducible cytomining workflows.\n",
+    "CellProfiler produces high-dimensional feature tables that vary in format depending on the imaging pipeline, experiment, or storage system. Sometimes these feature tables span thousands of columns and hundreds of thousands of rows. CytoTable harmonizes these outputs into consistent, analysis-ready formats such as Parquet or AnnData. This unification ensures that data from diverse experiments can be readily integrated and processed by downstream profiling tools like Pycytominer or coSMicQC, enabling scalable and reproducible bioinformatics workflows.\n",
     "\n",
-    "We use CytoTable below to process the numeric feature data observed above."
+    "We use CytoTable below to process the numeric feature data we generated above."
] }, { @@ -1298,8 +1299,8 @@ "output_type": "stream", "text": [ "example.parquet\n", - "CPU times: user 215 ms, sys: 159 ms, total: 374 ms\n", - "Wall time: 13.1 s\n" + "CPU times: user 239 ms, sys: 167 ms, total: 406 ms\n", + "Wall time: 13.3 s\n" ] } ], @@ -1594,13 +1595,13 @@ { "data": { "text/plain": [ - "\n", + "\n", " created_by: parquet-cpp-arrow version 21.0.0\n", " num_columns: 312\n", " num_rows: 289\n", " num_row_groups: 1\n", " format_version: 2.6\n", - " serialized_size: 87760" + " serialized_size: 87761" ] }, "execution_count": 9, @@ -1623,7 +1624,7 @@ "data": { "text/plain": [ "{b'data-producer': b'https://github.com/cytomining/CytoTable',\n", - " b'data-producer-version': b'1.1.0.post6.dev0+4ddbbe1'}" + " b'data-producer-version': b'1.1.0.post13.dev0+2f51ec3'}" ] }, "execution_count": 10, @@ -1990,7 +1991,7 @@ "Nuclei_Number_Object_Number: int64\n", "-- schema metadata --\n", "data-producer: 'https://github.com/cytomining/CytoTable'\n", - "data-producer-version: '1.1.0.post6.dev0+4ddbbe1'" + "data-producer-version: '1.1.0.post13.dev0+2f51ec3'" ] }, "execution_count": 12, @@ -2020,9 +2021,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.11" + "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/docs/source/overview.md b/docs/source/overview.md index 05eda380..4cd243fa 100644 --- a/docs/source/overview.md +++ b/docs/source/overview.md @@ -113,6 +113,7 @@ Data source compatibility for CytoTable is focused (but not explicitly limited t ```{eval-rst} * **Manual specification:** NPZ data source types may be manually specified by using :code:`convert(..., source_datatype="npz", ...)` (:mod:`convert() `). * **Preset specification:** NPZ data from DeepProfiler may be converted through CytoTable by using the following preset :code:`convert(..., preset="deepprofiler", ...)` (:mod:`convert() `). + * **Not covered:** `.npy` feature dumps or CSV-only outputs; use the CellProfiler CSV/SQLite presets for those formats. ``` #### IN Carta Data Sources diff --git a/docs/source/software_engineering.md b/docs/source/software_engineering.md index 14146f0e..fc88dc87 100644 --- a/docs/source/software_engineering.md +++ b/docs/source/software_engineering.md @@ -54,6 +54,11 @@ cytotable.convert( - Prefer presets when available (for example, `cellprofiler_sqlite_cpg0016_jump`, `cellprofiler_csv`) because they set table names and page keys. For custom layouts, pass `targets=[...]` and `page_keys={...}` to `convert`. - Multi-plate runs: point `source_path` to a parent directory; CytoTable will glob and group per-table. Keep per-run `dest_path` directories to avoid mixing outputs. +- Common variants: + - **Local SQLite:** set `source_path` to the local file, drop `no_sign_request`, keep `local_cache_dir` for stability. + - **Different table names/compartments:** set `targets=[...]` or choose the matching preset. + - **Multiple plates in one folder:** point `source_path` to the folder; use unique `dest_path` per run to avoid mixing outputs. + - **Tight disk space:** set `local_cache_dir` to a larger volume and clean it after the run. Custom layout example with explicit targets and page keys diff --git a/docs/source/tutorial.md b/docs/source/tutorial.md index 43600eca..b73cb5e9 100644 --- a/docs/source/tutorial.md +++ b/docs/source/tutorial.md @@ -1,20 +1,30 @@ -# Tutorials and How-to Guides +# Tutorials Start here if you are new to CytoTable. 
We’ve split material by audience: - **Image analysts (no engineering background required):** follow the narrative tutorials below. They include downloadable data, exact commands, and what to expect. - **Engineers / power users:** see the Software Engineering Guide for tuning and integration details, or use the quick recipe below. +```{admonition} Who this helps (and doesn’t) +- Helps: image analysts who want to get CellProfiler/NPZ outputs into Parquet with minimal coding; people comfortable running a few commands. +- Not ideal: raw image ingestion or pipeline authoring (use CellProfiler/DeepProfiler upstream); workflows needing a GUI-only experience. +- Effort: install, copy/paste a few commands, validate outputs in minutes. +``` + ```{toctree} --- maxdepth: 2 -caption: Narrative tutorials (start here) +caption: Tutorials (start here) --- tutorials/cellprofiler_sqlite_to_parquet +tutorials/npz_embeddings_to_parquet +tutorials/multi_plate_merge_tablenumber software_engineering ``` -## Quick how-to: CellProfiler CSV to Parquet (recipe) +Looking for variations or troubleshooting? See the Software Engineering Guide. + +## Quick recipe: CellProfiler CSV to Parquet This short recipe is for people comfortable with Python/CLI and parallels our older tutorial. If you prefer a guided, narrative walkthrough with downloadable inputs and expected outputs, use the tutorial above. diff --git a/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md b/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md index ebe51529..94bc7d15 100644 --- a/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md +++ b/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md @@ -8,6 +8,21 @@ A narrative, start-to-finish walkthrough for image analysts who want a working P - Keep a persistent local cache so the download is reused and avoids “file vanished” errors on temp disks. - Verify the outputs quickly (file names and row counts) without needing to understand the internals. +```{admonition} If your data looks like this, change... +- Local SQLite instead of S3: set `source_path` to the local `.sqlite` file; remove `no_sign_request`; keep `local_cache_dir`. +- Only certain compartments: add `targets=["cells", "nuclei"]` (case-insensitive). +- Memory constrained: lower `chunk_size` (e.g., 10000) and ensure `CACHE_DIR` has space. +``` + +## Setup (copy-paste) + +```bash +python -m venv .venv +source .venv/bin/activate +pip install --upgrade pip +pip install cytotable +``` + ## Inputs and outputs - **Input:** A single-plate CellProfiler SQLite file from the open Cell Painting Gallery @@ -78,28 +93,6 @@ ls "$DEST_PATH" # Image.parquet Cells.parquet Cytoplasm.parquet Nuclei.parquet ``` -Quick sanity-check on row counts (uses DuckDB for convenience): - -```python -import duckdb, os - -dest = os.environ["DEST_PATH"] -for table in ["Image", "Cells", "Cytoplasm", "Nuclei"]: - count = duckdb.query( - f"select count(*) from parquet_scan('{dest}/{table}.parquet')" - ).fetchone()[0] - print(f"{table}: {count} rows") -``` - -Counts will vary by plate, but non-zero values confirm the export worked. If you want to peek at schema, run `duckdb.query("describe select * from parquet_scan(...) limit 0")`. - -## Common adjustments for your own data - -- **Local SQLite file:** set `source_path` to the local file, remove `no_sign_request`, keep `local_cache_dir` if you want a stable working copy. -- **Different table names/compartments:** provide `targets=["cells", "cytoplasm", ...]` or use another preset (`preset="cellprofiler_sqlite"`). 
- **Multiple plates in one folder:** point `source_path` to the folder; Cytotable will glob and merge matching tables. Keep a per-run `dest_path` to avoid mixing outputs.
-- **Tight disk space:** point `local_cache_dir` to a larger external volume or clean it after the run finishes.
-
 ## What success looks like
 
 - A stable local cache of the SQLite file remains in `CACHE_DIR` (useful for repeated runs).
 - Four Parquet files exist in `DEST_PATH` and can be read by DuckDB/Pandas/PyArrow.
 - No temporary-file or “unable to open database file” errors occur during the run.
diff --git a/docs/source/tutorials/multi_plate_merge_tablenumber.md b/docs/source/tutorials/multi_plate_merge_tablenumber.md
new file mode 100644
index 00000000..e50e5c5b
--- /dev/null
+++ b/docs/source/tutorials/multi_plate_merge_tablenumber.md
@@ -0,0 +1,74 @@
+# Tutorial: Merging multiple plates with TableNumber
+
+Goal: combine multiple CellProfiler SQLite exports (plates) into a single Parquet output while preserving plate identity via `TableNumber`.
+
+## What you will accomplish
+
+- Point CytoTable at a folder of multiple plate exports.
+- Add `TableNumber` so downstream analyses can distinguish rows from different plates.
+- Verify merged outputs.
+
+## Setup (copy-paste)
+
+```bash
+python -m venv .venv
+source .venv/bin/activate
+pip install --upgrade pip
+pip install cytotable
+```
+
+## Inputs and outputs
+
+- **Input:** A folder of CellProfiler SQLite files (example structure):
+  `data/plates/PlateA.sqlite`
+  `data/plates/PlateB.sqlite`
+- **Output:** Parquet files (Image/Cells/Cytoplasm/Nuclei) under `./outputs/multi_plate`, with a `Metadata_TableNumber` column indicating plate.
+
+## Step 1: define your paths
+
+```bash
+export SOURCE_PATH="./data/plates"
+export DEST_PATH="./outputs/multi_plate"
+export CACHE_DIR="./sqlite_cache"
+mkdir -p "$DEST_PATH" "$CACHE_DIR"
+```
+
+## Step 2: run the conversion with `add_tablenumber`
+
+```python
+import os
+import cytotable
+
+source_path = os.environ["SOURCE_PATH"]
+dest_path = os.environ["DEST_PATH"]
+cache_dir = os.environ["CACHE_DIR"]
+
+result = cytotable.convert(
+    source_path=source_path,
+    source_datatype="sqlite",
+    dest_path=dest_path,
+    dest_datatype="parquet",
+    preset="cellprofiler_sqlite",
+    local_cache_dir=cache_dir,
+    add_tablenumber=True,  # key for multi-plate merges
+    chunk_size=30000,
+)
+
+print(result)
+```
+
+Why this matters:
+
+- `add_tablenumber=True` adds `Metadata_TableNumber` so you can filter/group by plate later.
+- Pointing `source_path` to a folder makes CytoTable glob multiple plates.
+- `local_cache_dir` keeps each plate cached locally for reliable DuckDB access.
+
+## Step 3: validate plate separation
+
+You should see one Parquet per compartment (`Cells`, `Cytoplasm`, `Nuclei`, etc.) in `DEST_PATH`. Opening a file with Pandas or PyArrow should show `Metadata_TableNumber` present and non-zero rows. If you processed multiple plates, expect multiple distinct values in that column.
+
+## Scenario callouts (“if your data looks like this...”)
+
+- **Local SQLite files:** set `source_path` to the folder of local `.sqlite` files (no `no_sign_request` needed; that flag only applies to S3 sources).
+- **Only certain compartments:** pass `targets=["cells", "nuclei"]` to limit tables.
+- **Memory constrained:** lower `chunk_size` (e.g., 10000) and ensure `CACHE_DIR` is on a disk with enough space for all plates + parquet output. 
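+
+To validate Step 3 programmatically, here is a minimal sketch (assuming pandas with a Parquet engine such as pyarrow is installed; it checks whichever Parquet files the run produced under `DEST_PATH`):
+
+```python
+import pathlib
+
+import pandas as pd
+
+# confirm plate identity survived the merge in every output table
+for path in sorted(pathlib.Path("./outputs/multi_plate").glob("**/*.parquet")):
+    df = pd.read_parquet(path)
+    # expect one distinct Metadata_TableNumber value per source plate
+    print(path.name, df["Metadata_TableNumber"].nunique(), "distinct table numbers")
+```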
diff --git a/docs/source/tutorials/npz_embeddings_to_parquet.md b/docs/source/tutorials/npz_embeddings_to_parquet.md new file mode 100644 index 00000000..39c4c73e --- /dev/null +++ b/docs/source/tutorials/npz_embeddings_to_parquet.md @@ -0,0 +1,75 @@ +# Tutorial: NPZ embeddings + metadata to Parquet + +A start-to-finish walkthrough for turning NPZ files (for example, DeepProfiler outputs) plus metadata into Parquet. This uses a small example bundled in the repo. + +## What you will accomplish + +- Read NPZ feature files and matching metadata from disk. +- Combine them into Parquet with a preset that aligns common keys. +- Validate the output shape and schema. + +```{admonition} If your data looks like this, change... +- NPZ in a different folder: point `source_path` there; keep `preset="deepprofiler"`. +- Memory constrained: add `chunk_size=10000` to the convert call. +- `.npy` files or plain CSV feature tables: this tutorial/preset does not cover them; use the CellProfiler CSV/SQLite flows instead. +``` + +## Setup (copy-paste) + +```bash +python -m venv .venv +source .venv/bin/activate +pip install --upgrade pip +pip install cytotable +``` + +## Inputs and outputs + +- **Input:** Example NPZ + metadata in this repo: `tests/data/deepprofiler/pycytominer_example` +- **Output:** A Parquet file under `./outputs/deepprofiler_example` + +## Step 1: define your paths + +```bash +export SOURCE_PATH="tests/data/deepprofiler/pycytominer_example" +export DEST_PATH="./outputs/deepprofiler_example" +mkdir -p "$DEST_PATH" +``` + +## Step 2: run the conversion + +```python +import os +import cytotable + +source_path = os.environ["SOURCE_PATH"] +dest_path = os.environ["DEST_PATH"] + +result = cytotable.convert( + source_path=source_path, + source_datatype="npz", + dest_path=dest_path, + dest_datatype="parquet", + preset="deepprofiler", + concat=True, + join=False, +) + +print(result) +``` + +Notes (why these flags matter): + +- `preset="deepprofiler"` aligns NPZ feature arrays with metadata columns. +- `concat=True` merges multiple NPZ shards. +- `join=False` writes per-table Parquet files (the preset produces `all_files.npz` as the logical table). + +## Step 3: validate the output + +You should see `all_files.npz.parquet` in `DEST_PATH`. Opening it with Pandas or PyArrow should show non-zero rows and both feature (`efficientnet_*`) and metadata columns. + +## What success looks like + +- A Parquet file `all_files.npz.parquet` exists in `DEST_PATH`. +- DuckDB/Pandas can read the file; row count is non-zero. +- Feature columns (for example, `efficientnet_*`) and metadata columns (plate/well/site) both appear. 
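+
+As an optional deeper check, a minimal sketch with pyarrow (an assumption: pyarrow is installed separately; the output name `all_files.npz.parquet` comes from the preset as noted above):
+
+```python
+import pyarrow.parquet as pq
+
+# read the converted table and confirm rows plus feature/metadata columns
+table = pq.read_table("./outputs/deepprofiler_example/all_files.npz.parquet")
+
+print(table.num_rows, "rows x", table.num_columns, "columns")
+# feature columns produced from the DeepProfiler embeddings
+print([name for name in table.column_names if name.startswith("efficientnet_")][:5])
+# the remaining columns carry the aligned metadata (plate/well/site)
+print([name for name in table.column_names if not name.startswith("efficientnet_")][:5])
+```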
From 6b9ec12c42ca74e4b2e55015c1187236d3ce5369 Mon Sep 17 00:00:00 2001 From: d33bs Date: Tue, 2 Dec 2025 14:58:22 -0700 Subject: [PATCH 03/13] linting --- .../cytotable_mise_en_place_general_overview.ipynb | 2 +- docs/source/software_engineering.md | 8 +++++++- docs/source/tutorials/cellprofiler_sqlite_to_parquet.md | 6 +++--- docs/source/tutorials/multi_plate_merge_tablenumber.md | 4 ++-- 4 files changed, 13 insertions(+), 7 deletions(-) diff --git a/docs/jupyter_execute/examples/cytotable_mise_en_place_general_overview.ipynb b/docs/jupyter_execute/examples/cytotable_mise_en_place_general_overview.ipynb index e6e5655d..ae893390 100644 --- a/docs/jupyter_execute/examples/cytotable_mise_en_place_general_overview.ipynb +++ b/docs/jupyter_execute/examples/cytotable_mise_en_place_general_overview.ipynb @@ -2026,4 +2026,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/docs/source/software_engineering.md b/docs/source/software_engineering.md index fc88dc87..4ac2ca3e 100644 --- a/docs/source/software_engineering.md +++ b/docs/source/software_engineering.md @@ -71,7 +71,11 @@ cytotable.convert( dest_path="./out/plates", dest_datatype="parquet", targets=["cells", "cytoplasm", "nuclei"], # which tables to include - page_keys={"cells": "ImageNumber", "cytoplasm": "ImageNumber", "nuclei": "ImageNumber"}, + page_keys={ + "cells": "ImageNumber", + "cytoplasm": "ImageNumber", + "nuclei": "ImageNumber", + }, add_tablenumber=True, chunk_size=20000, ) @@ -107,6 +111,7 @@ Simple function you can call from any orchestrator (Airflow task, Nextflow Pytho ```python import cytotable + def run_cytotable(source, dest, cache): return cytotable.convert( source_path=source, @@ -118,6 +123,7 @@ def run_cytotable(source, dest, cache): chunk_size=30000, ) + if __name__ == "__main__": run_cytotable( "s3://my-bucket/plate.sqlite", diff --git a/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md b/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md index 94bc7d15..0ee1f7c2 100644 --- a/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md +++ b/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md @@ -25,14 +25,14 @@ pip install cytotable ## Inputs and outputs -- **Input:** A single-plate CellProfiler SQLite file from the open Cell Painting Gallery - `s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite` +- **Input:** A single-plate CellProfiler SQLite file from the open Cell Painting Gallery + `s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite` No credentials are required (`no_sign_request=True`). - **Output:** Four Parquet files (Image, Cells, Cytoplasm, Nuclei) in `./outputs/br00126114`. ## Before you start -- Install Cytotable (and DuckDB is bundled): +- Install Cytotable (and DuckDB is bundled): `pip install cytotable` - Make sure you have enough local disk space (~1–2 GB) for the cached SQLite and Parquet outputs. - If you prefer to download the file first, you can also `aws s3 cp` the same path locally, then set `source_path` to the local file and drop `no_sign_request`. 
diff --git a/docs/source/tutorials/multi_plate_merge_tablenumber.md b/docs/source/tutorials/multi_plate_merge_tablenumber.md index e50e5c5b..f1408579 100644 --- a/docs/source/tutorials/multi_plate_merge_tablenumber.md +++ b/docs/source/tutorials/multi_plate_merge_tablenumber.md @@ -19,8 +19,8 @@ pip install cytotable ## Inputs and outputs -- **Input:** A folder of CellProfiler SQLite files (example structure): - `data/plates/PlateA.sqlite` +- **Input:** A folder of CellProfiler SQLite files (example structure): + `data/plates/PlateA.sqlite` `data/plates/PlateB.sqlite` - **Output:** Parquet files (Image/Cells/Cytoplasm/Nuclei) under `./outputs/multi_plate`, with a `Metadata_TableNumber` column indicating plate. From bf1c5187b138eae42e8b1c3086b15e65f6a32cde Mon Sep 17 00:00:00 2001 From: d33bs Date: Fri, 5 Dec 2025 13:36:36 -0700 Subject: [PATCH 04/13] updates based on Jenna's suggestions Co-Authored-By: Jenna Tomkinson <107513215+jenna-tomkinson@users.noreply.github.com> --- ...table_mise_en_place_general_overview.ipynb | 6 ++-- ...ytotable_mise_en_place_general_overview.py | 34 +++++++------------ docs/source/tutorial.md | 8 +++-- .../cellprofiler_sqlite_to_parquet.md | 7 ++-- 4 files changed, 24 insertions(+), 31 deletions(-) diff --git a/docs/source/examples/cytotable_mise_en_place_general_overview.ipynb b/docs/source/examples/cytotable_mise_en_place_general_overview.ipynb index ae893390..8cefc3de 100644 --- a/docs/source/examples/cytotable_mise_en_place_general_overview.ipynb +++ b/docs/source/examples/cytotable_mise_en_place_general_overview.ipynb @@ -93,9 +93,9 @@ "\n", "![Image showing cells being stained and captured as images using a microscope.](../_static/cell_to_image.png)\n", "\n", - "__Figure 1.__ _A microscope images cells to highlight cell processes. Often, fluorescence dyes paint the cells to mark specific proteins, compartments, and/or organelles._\n", + "__Figure 1.__ _A microscope images cells to highlight cell processes. Often, fluorescence dyes stain the cells to mark specific proteins, compartments, and/or organelles._\n", "\n", - "CytoTable processes microscopy-based data that are created from multiple upstream steps.\n", + "CytoTable processes microscopy-based data that are created from multiple upstream steps (image analysis).\n", "CytoTable does not require any specific sample preparation, and can work with any microscopy experimental design.\n", "However, most often, CytoTable processes fluorescence microscopy images from the Cell Painting assay.\n", "In the Cell Painting assay, scientists stain cells with a panel of six fluorescent dyes that mark key cellular compartments and organelles, including the nucleus, nucleoli/RNA, endoplasmic reticulum, mitochondria, actin cytoskeleton, Golgi apparatus, and plasma membrane ([Bray et al., 2016](https://doi.org/10.1038/nprot.2016.105); [Gustafsdottir et al., 2013](https://doi.org/10.1371/journal.pone.0080999)). Scientists then use microscopes to image these cells across fluorescence channels, and use image analysis software to produce high-content morphology profiles of individual cells for downstream analysis .\n", @@ -191,7 +191,7 @@ "![Image showing CellProfiler being used to create image segmentations, measurements, and exporting numeric feature data to a file.](../_static/image_to_features.png)\n", "\n", "\n", - "__Figure 2.__ _CellProfiler takes in microscopy images and performs single-cell segmentation to distinguish cells from background. 
CellProfiler then measures \"hand-engineered\" computer vision features from every single cell. These data are captured for later use in a CSV table or SQLite database._\n", + "__Figure 2.__ _CellProfiler takes in microscopy images and performs single-cell segmentation to distinguish cells from background. CellProfiler then measures \"hand-engineered\" computer vision features from every single cell. These data are captured for later use in multiple CSV tables or SQLite database._\n", "\n", "After acquisition, scientists process the images using image-analysis software such as CellProfiler. CellProfiler segments single cells and their biological compartments into distinct regions of interest. From these segmented cells, CellProfiler extracts hundreds to thousands of quantitative features per cell, capturing properties such as size, shape, intensity, texture, and spatial organization.\n", "These high-dimensional feature datasets provide a numerical representation of cell morphology that serves as the foundation for downstream profiling and analysis ([Carpenter et al., 2006](https://doi.org/10.1186/gb-2006-7-10-r100)).\n", diff --git a/docs/source/examples/cytotable_mise_en_place_general_overview.py b/docs/source/examples/cytotable_mise_en_place_general_overview.py index 5e3fcf96..67d5dfe9 100644 --- a/docs/source/examples/cytotable_mise_en_place_general_overview.py +++ b/docs/source/examples/cytotable_mise_en_place_general_overview.py @@ -3,16 +3,15 @@ # jupytext: # text_representation: # extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.2 +# format_name: light +# format_version: '1.5' +# jupytext_version: 1.17.3 # kernelspec: # display_name: Python 3 (ipykernel) # language: python # name: python3 # --- -# %% [markdown] # # CytoTable mise en place (general overview) # # This notebook will help you understand the basics of using CytoTable and the biological basis of each step. @@ -24,7 +23,7 @@ # > refer to organizing and arranging the ingredients ..." # > - [Wikipedia](https://en.wikipedia.org/wiki/Mise_en_place) -# %% +# + import pathlib from collections import Counter @@ -38,31 +37,29 @@ # setup variables for use throughout the notebook source_path = "../../../tests/data/cellprofiler/examplehuman" dest_path = "./example.parquet" +# - -# %% # remove the dest_path if it's present if pathlib.Path(dest_path).is_file(): pathlib.Path(dest_path).unlink() -# %% # show the files we will use as source data with CytoTable list(pathlib.Path(source_path).glob("*")) -# %% [markdown] # ## Phase 1: Cells are imaged by microscopes, with optional fluorescence staining # # ![Image showing cells being stained and captured as images using a microscope.](../_static/cell_to_image.png) # -# __Figure 1.__ _A microscope images cells to highlight cell processes. Often, fluorescence dyes paint the cells to mark specific proteins, compartments, and/or organelles._ +# __Figure 1.__ _A microscope images cells to highlight cell processes. Often, fluorescence dyes stain the cells to mark specific proteins, compartments, and/or organelles._ # -# CytoTable processes microscopy-based data that are created from multiple upstream steps. +# CytoTable processes microscopy-based data that are created from multiple upstream steps (image analysis). # CytoTable does not require any specific sample preparation, and can work with any microscopy experimental design. # However, most often, CytoTable processes fluorescence microscopy images from the Cell Painting assay. 
 # In the Cell Painting assay, scientists stain cells with a panel of six fluorescent dyes that mark key cellular compartments and organelles, including the nucleus, nucleoli/RNA, endoplasmic reticulum, mitochondria, actin cytoskeleton, Golgi apparatus, and plasma membrane ([Bray et al., 2016](https://doi.org/10.1038/nprot.2016.105); [Gustafsdottir et al., 2013](https://doi.org/10.1371/journal.pone.0080999)). Scientists then use microscopes to image these cells across fluorescence channels, and use image analysis software to produce high-content morphology profiles of individual cells for downstream analysis.
 #
 # We use the ExampleHuman dataset provided from CellProfiler Examples ([Moffat et al., 2006](https://doi.org/10.1016/j.cell.2006.01.040), [CellProfiler Examples Link](https://github.com/CellProfiler/examples/tree/master/ExampleHuman)) to describe this process below.
 
-# %%
+# +
 # display the images we will gather features from
 image_name_map = {"d0.tif": "DNA", "d1.tif": "PH3", "d2.tif": "Cells"}
 
@@ -73,34 +70,31 @@
     stain = val
     print(f"\nImage with stain: {stain}")
     display(Image.open(image))
+# -
 
-# %% [markdown]
 # ## Phase 2: CellProfiler segments cells and measures numeric features
 #
 # ![Image showing CellProfiler being used to create image segmentations, measurements, and exporting numeric feature data to a file.](../_static/image_to_features.png)
 #
 #
-# __Figure 2.__ _CellProfiler takes in microscopy images and performs single-cell segmentation to distinguish cells from background. CellProfiler then measures "hand-engineered" computer vision features from every single cell. These data are captured for later use in a CSV table or SQLite database._
+# __Figure 2.__ _CellProfiler takes in microscopy images and performs single-cell segmentation to distinguish cells from background. CellProfiler then measures "hand-engineered" computer vision features from every single cell. These data are captured for later use in multiple CSV tables or a SQLite database._
 #
 # After acquisition, scientists process the images using image-analysis software such as CellProfiler. CellProfiler segments single cells and their biological compartments into distinct regions of interest. From these segmented cells, CellProfiler extracts hundreds to thousands of quantitative features per cell, capturing properties such as size, shape, intensity, texture, and spatial organization.
 # These high-dimensional feature datasets provide a numerical representation of cell morphology that serves as the foundation for downstream profiling and analysis ([Carpenter et al., 2006](https://doi.org/10.1186/gb-2006-7-10-r100)).
 #
 # We use CellProfiler (with a prespecified configuration `.cppipe` file) to produce the following images and data tables from the ExampleHuman dataset. 
-# %% # show the segmentations through an overlay with outlines for image in pathlib.Path(source_path).glob("*Overlay.png"): print(f"Image outlines from segmentation (composite)") print("Color key: (dark blue: nuclei, light blue: cells, yellow: PH3)") display(Image.open(image)) -# %% # show the tables generated from the resulting CSV files for profiles in pathlib.Path(source_path).glob("*.csv"): print(f"\nProfiles from CellProfiler: {profiles}") display(pd.read_csv(profiles).head()) -# %% [markdown] # ## Phase 3: CytoTable harmonizes the feature datasets that CellProfiler generates # # ![Image showing feature data being read by CytoTable and exported to a CytoTable file.](../_static/features_to_cytotable.png) @@ -109,7 +103,7 @@ # # We use CytoTable below to process the numeric feature data we generated above. -# %% +# + # %%time # run cytotable convert @@ -122,25 +116,21 @@ preset="cellprofiler_csv", ) print(pathlib.Path(result).name) +# - -# %% # show the table head using pandas pq.read_table(source=result).to_pandas().head() -# %% # show metadata for the result file pq.read_metadata(result) -# %% # show schema metadata which includes CytoTable information # note: this information will travel with the file. pq.read_schema(result).metadata -# %% # show schema column name summaries print("Column name prefix counts:") dict(Counter(w.split("_", 1)[0] for w in pq.read_schema(result).names)) -# %% # show full schema details pq.read_schema(result) diff --git a/docs/source/tutorial.md b/docs/source/tutorial.md index b73cb5e9..b35910b9 100644 --- a/docs/source/tutorial.md +++ b/docs/source/tutorial.md @@ -6,7 +6,7 @@ Start here if you are new to CytoTable. We’ve split material by audience: - **Engineers / power users:** see the Software Engineering Guide for tuning and integration details, or use the quick recipe below. ```{admonition} Who this helps (and doesn’t) -- Helps: image analysts who want to get CellProfiler/NPZ outputs into Parquet with minimal coding; people comfortable running a few commands. +- Helps: image analysts who want to get CellProfiler/DeepProfiler/InCarta outputs into Parquet with minimal coding; people comfortable running a few commands. - Not ideal: raw image ingestion or pipeline authoring (use CellProfiler/DeepProfiler upstream); workflows needing a GUI-only experience. - Effort: install, copy/paste a few commands, validate outputs in minutes. ``` @@ -26,9 +26,11 @@ Looking for variations or troubleshooting? See the Software Engineering Guide. ## Quick recipe: CellProfiler CSV to Parquet -This short recipe is for people comfortable with Python/CLI and parallels our older tutorial. If you prefer a guided, narrative walkthrough with downloadable inputs and expected outputs, use the tutorial above. +This short recipe is for people comfortable with Python/CLI and parallels our older tutorial. +If you prefer a guided, narrative walkthrough with downloadable inputs and expected outputs, use the tutorial above. -[CellProfiler](https://cellprofiler.org/) exports compartment CSVs (for example, "Cells.csv", "Cytoplasm.csv"). CytoTable converts this data to Parquet from local or object-storage locations. +[CellProfiler](https://cellprofiler.org/) exports compartment CSVs (for example, "Cells.csv", "Cytoplasm.csv"). +CytoTable converts this data to Parquet from local or object-storage locations. 
 
 Files with similar names nested within sub-folders are concatenated by default (for example, `folder/sub_a/cells.csv` and `folder/sub_b/cells.csv` become a single `folder.cells.parquet` unless `concat=False`).
 
-The `dest_path` parameter is used for intermediary work and must be a new file or directory path. It will be a directory when `join=False` and a single file when `join=True`.
+The `dest_path` parameter is used for intermediary work and must be a new file or directory path.
+It will be a directory when `join=False` and a single file when `join=True`.
 
 ```python
 from cytotable import convert
diff --git a/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md b/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md
index 0ee1f7c2..1b6f1ef6 100644
--- a/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md
+++ b/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md
@@ -41,7 +41,7 @@
 
 ```bash
 export SOURCE_PATH="s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite"
-export DEST_PATH="./outputs/br00126114"
+export DEST_PATH="./outputs/br00126114.parquet"
 export CACHE_DIR="./sqlite_s3_cache"
-mkdir -p "$DEST_PATH" "$CACHE_DIR"
+mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR"
 ```
@@ -86,11 +86,12 @@ Why these flags matter (in plain language):
 
 ## Step 3: check that the outputs look right
 
-You should see four Parquet files in the destination directory:
+You should see a Parquet file in the destination directory.
+This Parquet file should include all compartment (nuclei, cytoplasm, cell, etc.) data in addition to metadata about the features.
 
 ```bash
 ls "$DEST_PATH"
-# Image.parquet Cells.parquet Cytoplasm.parquet Nuclei.parquet
+# br00126114.parquet
 ```
 
 ## What success looks like
From a695a01d298512c6db108405f5883d076e1e7794 Mon Sep 17 00:00:00 2001
From: d33bs
Date: Fri, 5 Dec 2025 13:58:06 -0700
Subject: [PATCH 05/13] one sentence per line

---
 docs/source/tutorials/multi_plate_merge_tablenumber.md | 4 +++-
 docs/source/tutorials/npz_embeddings_to_parquet.md     | 6 ++++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/docs/source/tutorials/multi_plate_merge_tablenumber.md b/docs/source/tutorials/multi_plate_merge_tablenumber.md
index f1408579..08b6e670 100644
--- a/docs/source/tutorials/multi_plate_merge_tablenumber.md
+++ b/docs/source/tutorials/multi_plate_merge_tablenumber.md
@@ -65,7 +65,9 @@
 ## Step 3: validate plate separation
 
-You should see one Parquet per compartment (`Cells`, `Cytoplasm`, `Nuclei`, etc.) in `DEST_PATH`. Opening a file with Pandas or PyArrow should show `Metadata_TableNumber` present and non-zero rows. If you processed multiple plates, expect multiple distinct values in that column.
+You should see one Parquet per compartment (`Cells`, `Cytoplasm`, `Nuclei`, etc.) in `DEST_PATH`.
+Opening a file with Pandas or PyArrow should show `Metadata_TableNumber` present and non-zero rows.
+If you processed multiple plates, expect multiple distinct values in that column.
 
 ## Scenario callouts (“if your data looks like this...”)
 
diff --git a/docs/source/tutorials/npz_embeddings_to_parquet.md b/docs/source/tutorials/npz_embeddings_to_parquet.md
index 39c4c73e..9d05ad58 100644
--- a/docs/source/tutorials/npz_embeddings_to_parquet.md
+++ b/docs/source/tutorials/npz_embeddings_to_parquet.md
@@ -1,6 +1,7 @@
 # Tutorial: NPZ embeddings + metadata to Parquet
 
-A start-to-finish walkthrough for turning NPZ files (for example, DeepProfiler outputs) plus metadata into Parquet. This uses a small example bundled in the repo.
+A start-to-finish walkthrough for turning NPZ files (for example, DeepProfiler outputs) plus metadata into Parquet.
+This uses a small example bundled in the repo.
 
 ## What you will accomplish
 
@@ -66,7 +67,8 @@
 ## Step 3: validate the output
 
-You should see `all_files.npz.parquet` in `DEST_PATH`. 
Opening it with Pandas or PyArrow should show non-zero rows and both feature (`efficientnet_*`) and metadata columns. +You should see `all_files.npz.parquet` in `DEST_PATH`. +Opening it with Pandas or PyArrow should show non-zero rows and both feature (`efficientnet_*`) and metadata columns. ## What success looks like From f74617376b3a591b6e88a0ce5727ccfa194bbd96 Mon Sep 17 00:00:00 2001 From: d33bs Date: Fri, 5 Dec 2025 14:02:14 -0700 Subject: [PATCH 06/13] rename page --- docs/source/index.md | 2 +- docs/source/{tutorial.md => tutorials.md} | 0 2 files changed, 1 insertion(+), 1 deletion(-) rename docs/source/{tutorial.md => tutorials.md} (100%) diff --git a/docs/source/index.md b/docs/source/index.md index 492ab310..a7142eb3 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -13,7 +13,7 @@ caption: 'Contents:' maxdepth: 3 --- overview -tutorial +tutorials examples presentations contributing diff --git a/docs/source/tutorial.md b/docs/source/tutorials.md similarity index 100% rename from docs/source/tutorial.md rename to docs/source/tutorials.md From 41e3923498deb9fc85c061e40632542d5b2a9497 Mon Sep 17 00:00:00 2001 From: d33bs Date: Fri, 5 Dec 2025 14:02:30 -0700 Subject: [PATCH 07/13] one sentence per line --- docs/source/tutorials.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md index b35910b9..fb2b7a7d 100644 --- a/docs/source/tutorials.md +++ b/docs/source/tutorials.md @@ -1,6 +1,7 @@ # Tutorials -Start here if you are new to CytoTable. We’ve split material by audience: +Start here if you are new to CytoTable. +We’ve split material by audience: - **Image analysts (no engineering background required):** follow the narrative tutorials below. They include downloadable data, exact commands, and what to expect. - **Engineers / power users:** see the Software Engineering Guide for tuning and integration details, or use the quick recipe below. @@ -22,7 +23,8 @@ tutorials/multi_plate_merge_tablenumber software_engineering ``` -Looking for variations or troubleshooting? See the Software Engineering Guide. +Looking for variations or troubleshooting? +See the Software Engineering Guide. ## Quick recipe: CellProfiler CSV to Parquet @@ -34,7 +36,8 @@ CytoTable converts this data to Parquet from local or object-storage locations. Files with similar names nested within sub-folders are concatenated by default (for example, `folder/sub_a/cells.csv` and `folder/sub_b/cells.csv` become a single `folder.cells.parquet` unless `concat=False`). -The `dest_path` parameter is used for intermediary work and must be a new file or directory path. It will be a directory when `join=False` and a single file when `join=True`. +The `dest_path` parameter is used for intermediary work and must be a new file or directory path. +It will be a directory when `join=False` and a single file when `join=True`. 
```python from cytotable import convert From 93ce1ad1dfe255bec8ee961f3c4c43b2e9c2a8f1 Mon Sep 17 00:00:00 2001 From: d33bs Date: Fri, 5 Dec 2025 14:18:22 -0700 Subject: [PATCH 08/13] various improvements --- docs/source/overview.md | 1 - docs/source/software_engineering.md | 3 ++- docs/source/tutorials.md | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/overview.md b/docs/source/overview.md index 4cd243fa..05eda380 100644 --- a/docs/source/overview.md +++ b/docs/source/overview.md @@ -113,7 +113,6 @@ Data source compatibility for CytoTable is focused (but not explicitly limited t ```{eval-rst} * **Manual specification:** NPZ data source types may be manually specified by using :code:`convert(..., source_datatype="npz", ...)` (:mod:`convert() `). * **Preset specification:** NPZ data from DeepProfiler may be converted through CytoTable by using the following preset :code:`convert(..., preset="deepprofiler", ...)` (:mod:`convert() `). - * **Not covered:** `.npy` feature dumps or CSV-only outputs; use the CellProfiler CSV/SQLite presets for those formats. ``` #### IN Carta Data Sources diff --git a/docs/source/software_engineering.md b/docs/source/software_engineering.md index 4ac2ca3e..5a7679cc 100644 --- a/docs/source/software_engineering.md +++ b/docs/source/software_engineering.md @@ -1,6 +1,7 @@ # Software Engineering Guide -This page is for engineers and power users who want to tune CytoTable beyond the narrative tutorials. It focuses on performance, reliability, and integration patterns. +This page is for engineers and power users who want to tune CytoTable beyond the narrative tutorials. +It focuses on performance, reliability, and integration patterns. ## Performance and scaling diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md index fb2b7a7d..0aed8fa1 100644 --- a/docs/source/tutorials.md +++ b/docs/source/tutorials.md @@ -4,7 +4,7 @@ Start here if you are new to CytoTable. We’ve split material by audience: - **Image analysts (no engineering background required):** follow the narrative tutorials below. They include downloadable data, exact commands, and what to expect. -- **Engineers / power users:** see the Software Engineering Guide for tuning and integration details, or use the quick recipe below. +- **Engineers / power users:** see the [Software Engineering Guide](software_engineering.md) for tuning and integration details, or use the quick recipe below. ```{admonition} Who this helps (and doesn’t) - Helps: image analysts who want to get CellProfiler/DeepProfiler/InCarta outputs into Parquet with minimal coding; people comfortable running a few commands. @@ -24,7 +24,7 @@ software_engineering ``` Looking for variations or troubleshooting? -See the Software Engineering Guide. +See the [Software Engineering Guide](software_engineering.md). 
## Quick recipe: CellProfiler CSV to Parquet From 0ac6f6914e3b097e8dc988dea82ce728e3dd9153 Mon Sep 17 00:00:00 2001 From: d33bs Date: Fri, 5 Dec 2025 14:25:00 -0700 Subject: [PATCH 09/13] add csv guide to cellprofiler tutorial Co-Authored-By: Jenna Tomkinson <107513215+jenna-tomkinson@users.noreply.github.com> --- docs/source/tutorials.md | 2 +- .../cellprofiler_sqlite_to_parquet.md | 101 --------------- .../tutorials/cellprofiler_to_parquet.md | 122 ++++++++++++++++++ 3 files changed, 123 insertions(+), 102 deletions(-) delete mode 100644 docs/source/tutorials/cellprofiler_sqlite_to_parquet.md create mode 100644 docs/source/tutorials/cellprofiler_to_parquet.md diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md index 0aed8fa1..e1ba4805 100644 --- a/docs/source/tutorials.md +++ b/docs/source/tutorials.md @@ -17,7 +17,7 @@ We’ve split material by audience: maxdepth: 2 caption: Tutorials (start here) --- -tutorials/cellprofiler_sqlite_to_parquet +tutorials/cellprofiler_to_parquet tutorials/npz_embeddings_to_parquet tutorials/multi_plate_merge_tablenumber software_engineering diff --git a/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md b/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md deleted file mode 100644 index 1b6f1ef6..00000000 --- a/docs/source/tutorials/cellprofiler_sqlite_to_parquet.md +++ /dev/null @@ -1,101 +0,0 @@ -# Tutorial: CellProfiler SQLite on S3 to Parquet - -A narrative, start-to-finish walkthrough for image analysts who want a working Parquet export from a CellProfiler SQLite file stored in the cloud. - -## What you will accomplish - -- Pull a CellProfiler SQLite file directly from S3 (unsigned/public) and convert each compartment table to Parquet. -- Keep a persistent local cache so the download is reused and avoids “file vanished” errors on temp disks. -- Verify the outputs quickly (file names and row counts) without needing to understand the internals. - -```{admonition} If your data looks like this, change... -- Local SQLite instead of S3: set `source_path` to the local `.sqlite` file; remove `no_sign_request`; keep `local_cache_dir`. -- Only certain compartments: add `targets=["cells", "nuclei"]` (case-insensitive). -- Memory constrained: lower `chunk_size` (e.g., 10000) and ensure `CACHE_DIR` has space. -``` - -## Setup (copy-paste) - -```bash -python -m venv .venv -source .venv/bin/activate -pip install --upgrade pip -pip install cytotable -``` - -## Inputs and outputs - -- **Input:** A single-plate CellProfiler SQLite file from the open Cell Painting Gallery - `s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite` - No credentials are required (`no_sign_request=True`). -- **Output:** Four Parquet files (Image, Cells, Cytoplasm, Nuclei) in `./outputs/br00126114`. - -## Before you start - -- Install Cytotable (and DuckDB is bundled): - `pip install cytotable` -- Make sure you have enough local disk space (~1–2 GB) for the cached SQLite and Parquet outputs. -- If you prefer to download the file first, you can also `aws s3 cp` the same path locally, then set `source_path` to the local file and drop `no_sign_request`. 
- -## Step 1: define your paths - -```bash -export SOURCE_PATH="s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite" -export DEST_PATH="./outputs/br00126114.parquet" -export CACHE_DIR="./sqlite_s3_cache" -mkdir -p "$DEST_PATH" "$CACHE_DIR" -``` - -## Step 2: run the conversion (minimal Python) - -```python -import os -import cytotable - -# If you used the bash exports above: -SOURCE_PATH = os.environ["SOURCE_PATH"] -DEST_PATH = os.environ["DEST_PATH"] -CACHE_DIR = os.environ["CACHE_DIR"] - -# (Alternatively, set them directly as strings in Python.) - -result = cytotable.convert( - source_path=SOURCE_PATH, - source_datatype="sqlite", - dest_path=DEST_PATH, - dest_datatype="parquet", - # Preset matches common CellProfiler SQLite layout from the Cell Painting Gallery - preset="cellprofiler_sqlite_cpg0016_jump", - # Use a cache directory you control so the downloaded SQLite is reusable - local_cache_dir=CACHE_DIR, - # This dataset is public; unsigned requests avoid credential prompts - no_sign_request=True, - # Reasonable chunking for large tables; adjust up/down if you hit memory limits - chunk_size=30000, -) - -print(result) -``` - -Why these flags matter (in plain language): - -- `local_cache_dir`: keeps the downloaded SQLite file somewhere predictable so DuckDB can open it reliably. -- `preset`: selects the right table names and page keys for this dataset. -- `chunk_size`: processes data in pieces so you don’t need excessive RAM. -- `no_sign_request`: needed because the sample bucket is public and unsigned. - -## Step 3: check that the outputs look right - -You should see a Parquet file in the destination directory. -This Parquet file should include all compartment (nuclei, cytoplasm, cell, etc.) data in addition to metadata about the features. - -```bash -ls "$DEST_PATH" -# br00126114.parquet -``` - -## What success looks like - -- A stable local cache of the SQLite file remains in `CACHE_DIR` (useful for repeated runs). -- Four Parquet files exist in `DEST_PATH` and can be read by DuckDB/Pandas/PyArrow. -- No temporary-file or “unable to open database file” errors occur during the run. diff --git a/docs/source/tutorials/cellprofiler_to_parquet.md b/docs/source/tutorials/cellprofiler_to_parquet.md new file mode 100644 index 00000000..a601611e --- /dev/null +++ b/docs/source/tutorials/cellprofiler_to_parquet.md @@ -0,0 +1,122 @@ +# Tutorial: CellProfiler SQLite or CSV to Parquet + +A start-to-finish walkthrough for image analysts who want a working Parquet export from CellProfiler outputs (SQLite or CSV), including public S3 and local data. + +## What you will accomplish + +- Convert CellProfiler outputs to Parquet with a preset that matches common table/column layouts. +- Handle both SQLite (typical Cell Painting Gallery exports) and CSV folder outputs. +- Keep a persistent local cache so downloads are reused and avoid “file vanished” errors on temp disks. +- Verify the outputs quickly (file names and row counts) without needing to understand the internals. + +```{admonition} If your data looks like this, change... +- Local SQLite instead of S3: set `source_path` to the local `.sqlite` file; remove `no_sign_request`; keep `local_cache_dir`. +- CellProfiler CSV folders: point `source_path` to the folder that contains `Cells.csv`, `Cytoplasm.csv`, etc.; set `source_datatype="csv"` and `preset="cellprofiler_csv"`. +- Only certain compartments: add `targets=["cells", "nuclei"]` (case-insensitive). 
+- Memory constrained: lower `chunk_size` (e.g., 10000) and ensure `CACHE_DIR` has space. +``` + +## Setup (copy-paste) + +```bash +python -m venv .venv +source .venv/bin/activate +pip install --upgrade pip +pip install cytotable +``` + +## Inputs and outputs + +- **SQLite example (public S3):** `s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite` + No credentials are required (`no_sign_request=True`). +- **CSV example (local folder):** `./tests/data/cellprofiler/ExampleHuman` which contains `Cells.csv`, `Cytoplasm.csv`, `Nuclei.csv`, etc. +- **Outputs:** Parquet files for each compartment (Image, Cells, Cytoplasm, Nuclei) in `./outputs/...`. + +## Before you start + +- Install Cytotable (and DuckDB is bundled): + `pip install cytotable` +- Make sure you have enough local disk space (~1–2 GB) for the cached SQLite and Parquet outputs. +- If you prefer to download the file first, you can also `aws s3 cp` the same path locally, then set `source_path` to the local file and drop `no_sign_request`. + +## Step 1: choose your input type + +Pick one of the two setups below. + +**SQLite from public S3 (Cell Painting Gallery)** + +```bash +export SOURCE_PATH="s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite" +export SOURCE_DATATYPE="sqlite" +export PRESET="cellprofiler_sqlite_cpg0016_jump" +export DEST_PATH="./outputs/br00126114.parquet" +export CACHE_DIR="./sqlite_s3_cache" +mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR" +``` + +**CellProfiler CSV folder (local or mounted storage)** + +```bash +export SOURCE_PATH="./tests/data/cellprofiler/ExampleHuman" +export SOURCE_DATATYPE="csv" +export PRESET="cellprofiler_csv" +export DEST_PATH="./outputs/examplehuman.parquet" +export CACHE_DIR="./csv_cache" +mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR" +``` + +## Step 2: run the conversion (minimal Python) + +```python +import os +import cytotable + +# If you used the bash exports above: +SOURCE_PATH = os.environ["SOURCE_PATH"] +SOURCE_DATATYPE = os.environ["SOURCE_DATATYPE"] +DEST_PATH = os.environ["DEST_PATH"] +PRESET = os.environ["PRESET"] +CACHE_DIR = os.environ["CACHE_DIR"] + +# (Alternatively, set them directly as strings in Python.) + +result = cytotable.convert( + source_path=SOURCE_PATH, + source_datatype=SOURCE_DATATYPE, + dest_path=DEST_PATH, + dest_datatype="parquet", + preset=PRESET, + local_cache_dir=CACHE_DIR, + # For public S3 (SQLite or CSV) add: + no_sign_request=True, + # Reasonable chunking for large tables; adjust up/down if you hit memory limits + chunk_size=30000, +) + +print(result) +``` + +Why these flags matter (in plain language): + +- `local_cache_dir`: keeps downloaded data somewhere predictable so DuckDB can open it reliably. +- `preset`: selects the right table names and page keys for this dataset (SQLite or CSV). +- `chunk_size`: processes data in pieces so you don’t need excessive RAM. +- `no_sign_request`: needed because the sample bucket is public and unsigned. + +## Step 3: check that the outputs look right + +You should see Parquet files in the destination directory. +If you set `join=True` (handy for the SQLite example), you get a single `.parquet` file containing all compartments. +If you set `join=False` (handy for CSV folders), you get separate Parquet files for each compartment. 
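+
+If you prefer Python, a quick sanity check might look like the following minimal sketch (it assumes `pandas` and `pyarrow` are installed, that the `DEST_PATH` environment variable from Step 1 is still set, and that `join=True` produced a single file; exact column names can vary by preset):
+
+```python
+import os
+
+import pandas as pd
+
+# Read the Parquet output produced by cytotable.convert
+df = pd.read_parquet(os.environ["DEST_PATH"])
+
+# Expect non-zero rows and a mix of metadata and feature columns
+print(df.shape)
+print([col for col in df.columns if col.startswith("Metadata_")][:5])
+```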
+
+```bash
+ls "$DEST_PATH"
+# SQLite example: br00126114.parquet
+# CSV example: examplehuman.parquet/Cells.parquet (and Cytoplasm, Nuclei, Image)
+```
+
+## What success looks like
+
+- A stable local cache of the SQLite file or CSV downloads remains in `CACHE_DIR` (useful for repeated runs).
+- Parquet outputs exist in `DEST_PATH` and can be read by DuckDB/Pandas/PyArrow.
+- No temporary-file or “unable to open database file” errors occur during the run.

From 77f835d70d95c1e0bb506b614e0d838b239a6af9 Mon Sep 17 00:00:00 2001
From: d33bs
Date: Fri, 5 Dec 2025 14:26:50 -0700
Subject: [PATCH 10/13] one sentence per line

---
 docs/source/tutorials/cellprofiler_to_parquet.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/source/tutorials/cellprofiler_to_parquet.md b/docs/source/tutorials/cellprofiler_to_parquet.md
index a601611e..d8fa083c 100644
--- a/docs/source/tutorials/cellprofiler_to_parquet.md
+++ b/docs/source/tutorials/cellprofiler_to_parquet.md
@@ -106,7 +106,8 @@ Why these flags matter (in plain language):
 ## Step 3: check that the outputs look right
 
 You should see Parquet files in the destination directory.
-If you set `join=True` (handy for the SQLite example), you get a single `.parquet` file containing all compartments.
+If you set `join=True` (handy for the SQLite example), you get a single `.
+parquet` file containing all compartments.
 If you set `join=False` (handy for CSV folders), you get separate Parquet files for each compartment.
 

From 66a108c86fb6da486af4fcdeba21b5815d9eab4d Mon Sep 17 00:00:00 2001
From: d33bs
Date: Fri, 5 Dec 2025 14:27:03 -0700
Subject: [PATCH 11/13] Update cellprofiler_to_parquet.md

---
 docs/source/tutorials/cellprofiler_to_parquet.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/source/tutorials/cellprofiler_to_parquet.md b/docs/source/tutorials/cellprofiler_to_parquet.md
index d8fa083c..27195763 100644
--- a/docs/source/tutorials/cellprofiler_to_parquet.md
+++ b/docs/source/tutorials/cellprofiler_to_parquet.md
@@ -106,8 +106,7 @@ Why these flags matter (in plain language):
 ## Step 3: check that the outputs look right
 
 You should see Parquet files in the destination directory.
-If you set `join=True` (handy for the SQLite example), you get a single `.
-parquet` file containing all compartments.
+If you set `join=True` (handy for the SQLite example), you get a single `.parquet` file containing all compartments.
 If you set `join=False` (handy for CSV folders), you get separate Parquet files for each compartment.
 

From f36389ab3e87a08eef445a0ffe6484d0bce8a80c Mon Sep 17 00:00:00 2001
From: d33bs
Date: Fri, 5 Dec 2025 15:24:28 -0700
Subject: [PATCH 12/13] fix link

---
 docs/source/overview.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/overview.md b/docs/source/overview.md
index 05eda380..657cff00 100644
--- a/docs/source/overview.md
+++ b/docs/source/overview.md
@@ -1,7 +1,7 @@
 # Overview
 
 This page provides a brief overview of CytoTable topics.
-For a brief introduction on how to use CytoTable, please see the [tutorial](tutorial.md) page.
+For a brief introduction on how to use CytoTable, please see the [tutorials](tutorials.md) page.
## Presets and Manual Overrides From 3b759c97300916bdb6638c7141e1feacceec0eb5 Mon Sep 17 00:00:00 2001 From: d33bs Date: Fri, 5 Dec 2025 15:44:19 -0700 Subject: [PATCH 13/13] remove irrelevant docs --- docs/source/index.md | 1 - docs/source/software_engineering.md | 134 ------------------ docs/source/tutorials.md | 8 +- .../tutorials/cellprofiler_to_parquet.md | 7 +- .../multi_plate_merge_tablenumber.md | 6 +- .../tutorials/npz_embeddings_to_parquet.md | 8 +- 6 files changed, 12 insertions(+), 152 deletions(-) delete mode 100644 docs/source/software_engineering.md diff --git a/docs/source/index.md b/docs/source/index.md index a7142eb3..38e11936 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -20,5 +20,4 @@ contributing Code of Conduct architecture python-api -software_engineering ``` diff --git a/docs/source/software_engineering.md b/docs/source/software_engineering.md deleted file mode 100644 index 5a7679cc..00000000 --- a/docs/source/software_engineering.md +++ /dev/null @@ -1,134 +0,0 @@ -# Software Engineering Guide - -This page is for engineers and power users who want to tune CytoTable beyond the narrative tutorials. -It focuses on performance, reliability, and integration patterns. - -## Performance and scaling - -- **Chunk size (`chunk_size`)**: Larger chunks reduce overhead but increase peak memory. Start at 30k (default in examples), adjust down for memory-constrained environments, up for fast disks/large RAM. -- **Threads (DuckDB)**: We set `PRAGMA threads` based on `cytotable.constants.MAX_THREADS`. Override via env var `CYTOTABLE_MAX_THREADS` to align with container CPU limits. -- **I/O locality**: For remote SQLite/NPZ, always set `local_cache_dir` to a stable, non-tmpfs path. Reuse the cache across runs to avoid redundant downloads. - -Example: tuned convert with explicit threads and chunk size - -```python -import os -import cytotable - -os.environ["CYTOTABLE_MAX_THREADS"] = "4" - -cytotable.convert( - source_path="s3://my-bucket/plate.sqlite", - source_datatype="sqlite", - dest_path="./out/plate", - dest_datatype="parquet", - preset="cellprofiler_sqlite", - local_cache_dir="./cache/sqlite", - chunk_size=50000, # larger chunks, more RAM, faster on beefy nodes - no_sign_request=True, -) -``` - -## Cloud paths and auth - -- **Unsigned/public S3**: use `no_sign_request=True`. This keeps DuckDB + cloudpathlib using unsigned clients consistently. -- **Signed/private S3**: rely on ambient AWS creds or pass `profile_name`, `aws_access_key_id`, `aws_secret_access_key`, `aws_session_token`. These kwargs flow into cloudpathlib’s client via `_build_path`. -- **GCS/Azure**: supported through cloudpathlib; pass provider-specific kwargs the same way you would construct the CloudPath client. - -Signed S3 example with a specific profile - -```python -import cytotable - -cytotable.convert( - source_path="s3://my-private-bucket/exports/plate.sqlite", - source_datatype="sqlite", - dest_path="./out/private-plate", - dest_datatype="parquet", - preset="cellprofiler_sqlite", - local_cache_dir="./cache/private", - profile_name="science-prod", -) -``` - -## Data layout and presets - -- Prefer presets when available (for example, `cellprofiler_sqlite_cpg0016_jump`, `cellprofiler_csv`) because they set table names and page keys. For custom layouts, pass `targets=[...]` and `page_keys={...}` to `convert`. -- Multi-plate runs: point `source_path` to a parent directory; CytoTable will glob and group per-table. Keep per-run `dest_path` directories to avoid mixing outputs. 
-- Common variants: - - **Local SQLite:** set `source_path` to the local file, drop `no_sign_request`, keep `local_cache_dir` for stability. - - **Different table names/compartments:** set `targets=[...]` or choose the matching preset. - - **Multiple plates in one folder:** point `source_path` to the folder; use unique `dest_path` per run to avoid mixing outputs. - - **Tight disk space:** set `local_cache_dir` to a larger volume and clean it after the run. - -Custom layout example with explicit targets and page keys - -```python -import cytotable - -cytotable.convert( - source_path="/data/plates/", - source_datatype="sqlite", - dest_path="./out/plates", - dest_datatype="parquet", - targets=["cells", "cytoplasm", "nuclei"], # which tables to include - page_keys={ - "cells": "ImageNumber", - "cytoplasm": "ImageNumber", - "nuclei": "ImageNumber", - }, - add_tablenumber=True, - chunk_size=20000, -) -``` - -## Reliability tips - -- **Stable cache**: If you see “unable to open database file” on cloud SQLite, ensure `local_cache_dir` is set and writable. DuckDB reads from the cached path. -- **Disk space**: Parquet output size ~10–30% of CSV; SQLite is denser. Ensure the cache volume can hold both the source and outputs simultaneously. -- **Restartability**: `dest_path` is overwritten per run; use unique destination directories for incremental runs to avoid partial-output confusion. - -## Testing and CI entry points - -- Unit tests live under `tests/`; sample datasets are in `tests/data/`. Add targeted fixtures when introducing new formats/presets. -- For quick smoke tests, run `python -m pytest tests/test_convert_threaded.py -k convert` and a docs build `sphinx-build docs/source docs/build` to ensure examples render. -- Keep new presets documented in `docs/source/overview.md` and mention edge cases (auth, cache, table naming). - -Smoke-test commands - -```bash -python -m pytest tests/test_convert_threaded.py -k convert -sphinx-build docs/source docs/build -``` - -## Embedding CytoTable in pipelines - -- **Python API**: `cytotable.convert(...)` is synchronous; wrap in your workflow engine (Airflow, Prefect, Nextflow via Python) as a task step. -- **CLI wrapper**: not bundled; if you add one, surface the same flags as `convert` and mirror logging levels. -- **Logging**: uses the standard logging system. Set `CYTOTABLE_LOG_LEVEL=INFO` (or `DEBUG`) in container/CI to capture more detail during runs. - -Simple function you can call from any orchestrator (Airflow task, Nextflow Python, shell) - -```python -import cytotable - - -def run_cytotable(source, dest, cache): - return cytotable.convert( - source_path=source, - source_datatype="sqlite", - dest_path=dest, - dest_datatype="parquet", - preset="cellprofiler_sqlite", - local_cache_dir=cache, - chunk_size=30000, - ) - - -if __name__ == "__main__": - run_cytotable( - "s3://my-bucket/plate.sqlite", - "./out/plate", - "./cache/sqlite", - ) -``` diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md index e1ba4805..59968cfa 100644 --- a/docs/source/tutorials.md +++ b/docs/source/tutorials.md @@ -3,8 +3,8 @@ Start here if you are new to CytoTable. We’ve split material by audience: -- **Image analysts (no engineering background required):** follow the narrative tutorials below. They include downloadable data, exact commands, and what to expect. -- **Engineers / power users:** see the [Software Engineering Guide](software_engineering.md) for tuning and integration details, or use the quick recipe below. 
+- **Image analysts (no engineering background required):** follow the narrative tutorials below. They include downloadable data, exact commands, and what to expect. Please also feel free to reference the [example notebooks](examples.md). +- **Engineers / power users:** see any documentation, including the [example notebooks](examples.md), for tuning and integration details, or use the quick recipe below. ```{admonition} Who this helps (and doesn’t) - Helps: image analysts who want to get CellProfiler/DeepProfiler/InCarta outputs into Parquet with minimal coding; people comfortable running a few commands. @@ -20,12 +20,8 @@ caption: Tutorials (start here) tutorials/cellprofiler_to_parquet tutorials/npz_embeddings_to_parquet tutorials/multi_plate_merge_tablenumber -software_engineering ``` -Looking for variations or troubleshooting? -See the [Software Engineering Guide](software_engineering.md). - ## Quick recipe: CellProfiler CSV to Parquet This short recipe is for people comfortable with Python/CLI and parallels our older tutorial. diff --git a/docs/source/tutorials/cellprofiler_to_parquet.md b/docs/source/tutorials/cellprofiler_to_parquet.md index 27195763..c924c3de 100644 --- a/docs/source/tutorials/cellprofiler_to_parquet.md +++ b/docs/source/tutorials/cellprofiler_to_parquet.md @@ -34,7 +34,7 @@ pip install cytotable ## Before you start -- Install Cytotable (and DuckDB is bundled): +- Install Cytotable: `pip install cytotable` - Make sure you have enough local disk space (~1–2 GB) for the cached SQLite and Parquet outputs. - If you prefer to download the file first, you can also `aws s3 cp` the same path locally, then set `source_path` to the local file and drop `no_sign_request`. @@ -98,7 +98,7 @@ print(result) Why these flags matter (in plain language): -- `local_cache_dir`: keeps downloaded data somewhere predictable so DuckDB can open it reliably. +- `local_cache_dir`: keeps downloaded data somewhere predictable. - `preset`: selects the right table names and page keys for this dataset (SQLite or CSV). - `chunk_size`: processes data in pieces so you don’t need excessive RAM. - `no_sign_request`: needed because the sample bucket is public and unsigned. @@ -112,11 +112,10 @@ If you set `join=False` (handy for CSV folders), you get separate Parquet files ```bash ls "$DEST_PATH" # SQLite example: br00126114.parquet -# CSV example: examplehuman.parquet/Cells.parquet (and Cytoplasm, Nuclei, Image) +# CSV example: examplehuman.parquet ``` ## What success looks like - A stable local cache of the SQLite file or CSV downloads remains in `CACHE_DIR` (useful for repeated runs). - Parquet outputs exist in `DEST_PATH` and can be read by DuckDB/Pandas/PyArrow. -- No temporary-file or “unable to open database file” errors occur during the run. diff --git a/docs/source/tutorials/multi_plate_merge_tablenumber.md b/docs/source/tutorials/multi_plate_merge_tablenumber.md index 08b6e670..f5a460a6 100644 --- a/docs/source/tutorials/multi_plate_merge_tablenumber.md +++ b/docs/source/tutorials/multi_plate_merge_tablenumber.md @@ -22,13 +22,13 @@ pip install cytotable - **Input:** A folder of CellProfiler SQLite files (example structure): `data/plates/PlateA.sqlite` `data/plates/PlateB.sqlite` -- **Output:** Parquet files (Image/Cells/Cytoplasm/Nuclei) under `./outputs/multi_plate`, with a `Metadata_TableNumber` column indicating plate. +- **Output:** Parquet file under `./outputs/multi_plate.parquet`, with a `Metadata_TableNumber` column indicating plate. 
 
 ## Step 1: define your paths
 
 ```bash
 export SOURCE_PATH="./data/plates"
-export DEST_PATH="./outputs/multi_plate"
+export DEST_PATH="./outputs/multi_plate.parquet"
 export CACHE_DIR="./sqlite_cache"
-mkdir -p "$DEST_PATH" "$CACHE_DIR"
+mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR"
 ```
@@ -65,7 +65,7 @@ Why this matters:
 
 ## Step 3: validate plate separation
 
-You should see one Parquet per compartment (`Cells`, `Cytoplasm`, `Nuclei`, etc.) in `DEST_PATH`.
+You should see one Parquet file (`multi_plate.parquet`) at `DEST_PATH`.
 Opening a file with Pandas or PyArrow should show `Metadata_TableNumber` present and non-zero rows.
 If you processed multiple plates, expect multiple distinct values in that column.
 
diff --git a/docs/source/tutorials/npz_embeddings_to_parquet.md b/docs/source/tutorials/npz_embeddings_to_parquet.md
index 9d05ad58..28400109 100644
--- a/docs/source/tutorials/npz_embeddings_to_parquet.md
+++ b/docs/source/tutorials/npz_embeddings_to_parquet.md
@@ -27,13 +27,13 @@ pip install cytotable
 ## Inputs and outputs
 
 - **Input:** Example NPZ + metadata in this repo: `tests/data/deepprofiler/pycytominer_example`
-- **Output:** A Parquet file under `./outputs/deepprofiler_example`
+- **Output:** A Parquet file under `./outputs/deepprofiler_example.parquet`
 
 ## Step 1: define your paths
 
 ```bash
 export SOURCE_PATH="tests/data/deepprofiler/pycytominer_example"
-export DEST_PATH="./outputs/deepprofiler_example"
+export DEST_PATH="./outputs/deepprofiler_example.parquet"
-mkdir -p "$DEST_PATH"
+mkdir -p "$(dirname "$DEST_PATH")"
 ```
@@ -67,11 +67,11 @@ Notes (why these flags matter):
 
 ## Step 3: validate the output
 
-You should see `all_files.npz.parquet` in `DEST_PATH`.
+You should see `deepprofiler_example.parquet` at `DEST_PATH`.
 Opening it with Pandas or PyArrow should show non-zero rows and both feature (`efficientnet_*`) and metadata columns.
 
 ## What success looks like
 
-- A Parquet file `all_files.npz.parquet` exists in `DEST_PATH`.
+- A Parquet file `deepprofiler_example.parquet` exists at `DEST_PATH`.
 - DuckDB/Pandas can read the file; row count is non-zero.
 - Feature columns (for example, `efficientnet_*`) and metadata columns (plate/well/site) both appear.
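+
+As a minimal sketch of that check (assuming `pandas` and `pyarrow` are installed and `DEST_PATH` is exported as in Step 1):
+
+```python
+import os
+
+import pandas as pd
+
+df = pd.read_parquet(os.environ["DEST_PATH"])
+
+# Expect non-zero rows plus both efficientnet_* feature columns and metadata columns
+print(df.shape)
+print([col for col in df.columns if col.startswith("efficientnet_")][:5])
+```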