diff --git a/docs/jupyter_execute/examples/cytotable_mise_en_place_general_overview.ipynb b/docs/jupyter_execute/examples/cytotable_mise_en_place_general_overview.ipynb index b04a604d..ae893390 100644 --- a/docs/jupyter_execute/examples/cytotable_mise_en_place_general_overview.ipynb +++ b/docs/jupyter_execute/examples/cytotable_mise_en_place_general_overview.ipynb @@ -7,7 +7,7 @@ "source": [ "# CytoTable mise en place (general overview)\n", "\n", - "This notebook includes a quick demonstration of CytoTable to help you understand the basics of using the package and the biological basis of each step.\n", + "This notebook will help you understand the basics of using CytoTable and the biological basis of each step.\n", "We provide a high-level overview of the related concepts to give greater context about where and how the data are changed in order to gain new insights.\n", "\n", "The name of the notebook comes from the french _mise en place_:\n", @@ -89,17 +89,18 @@ "id": "832c700f-63e0-4f22-853c-9bf6d5328a5c", "metadata": {}, "source": [ - "## Phase 1: Cells are stained and images are captured by microscopes\n", + "## Phase 1: Cells are imaged by microscopes, with optional fluorescence staining\n", "\n", "![Image showing cells being stained and captured as images using a microscope.](../_static/cell_to_image.png)\n", "\n", - "__Figure 1.__ _Cells are stained in order to highlight cellular compartments and organelles. Microscopes are used to observe and capture data for later use._\n", + "__Figure 1.__ _A microscope images cells to highlight cell processes. 
Often, fluorescence dyes paint the cells to mark specific proteins, compartments, and/or organelles._\n", "\n", - "CytoTable uses data created from multiple upstream steps involving images of \n", - "stained biological objects (typically cells).\n", - "Cells are cultured in multi-well plates, perturbed, and then fixed before being stained with a panel of six fluorescent dyes that highlight key cellular compartments and organelles, including the nucleus, nucleoli/RNA, endoplasmic reticulum, mitochondria, actin cytoskeleton, Golgi apparatus, and plasma membrane. These multiplexed stains are imaged across fluorescence channels using automated high-content microscopy, producing rich images that capture the morphology of individual cells for downstream analysis ([Bray et al., 2016](https://doi.org/10.1038/nprot.2016.105); [Gustafsdottir et al., 2013](https://doi.org/10.1371/journal.pone.0080999)).\n", + "CytoTable processes microscopy-based data that are created from multiple upstream steps.\n", + "CytoTable does not require any specific sample preparation, and can work with any microscopy experimental design.\n", + "However, most often, CytoTable processes fluorescence microscopy images from the Cell Painting assay.\n", + "In the Cell Painting assay, scientists stain cells with a panel of six fluorescent dyes that mark key cellular compartments and organelles, including the nucleus, nucleoli/RNA, endoplasmic reticulum, mitochondria, actin cytoskeleton, Golgi apparatus, and plasma membrane ([Bray et al., 2016](https://doi.org/10.1038/nprot.2016.105); [Gustafsdottir et al., 2013](https://doi.org/10.1371/journal.pone.0080999)). 
Scientists then use microscopes to image these cells across fluorescence channels, and use image analysis software to produce high-content morphology profiles of individual cells for downstream analysis .\n", "\n", - "We use the ExampleHuman dataset provided from CellProfiler Examples ([Moffat et al., 2006](https://doi.org/10.1016/j.cell.2006.01.040), [CellProfiler Examples Link](https://github.com/CellProfiler/examples/tree/master/ExampleHuman)) to help describe this process below." + "We use the ExampleHuman dataset provided from CellProfiler Examples ([Moffat et al., 2006](https://doi.org/10.1016/j.cell.2006.01.040), [CellProfiler Examples Link](https://github.com/CellProfiler/examples/tree/master/ExampleHuman)) to describe this process below." ] }, { @@ -185,17 +186,17 @@ "id": "23897ed5-53aa-41a2-a8b2-494498045262", "metadata": {}, "source": [ - "## Phase 2: Images are segmented to build numeric feature datasets via CellProfiler\n", + "## Phase 2: CellProfiler segments cells and measures numeric features\n", "\n", "![Image showing CellProfiler being used to create image segmentations, measurements, and exporting numeric feature data to a file.](../_static/image_to_features.png)\n", "\n", "\n", - "__Figure 2.__ _CellProfiler is configured to use images and performs segmentation to evaluate numeric representations of cells. This data is captured for later use in tabular file formats such as CSV or SQLite tables._\n", + "__Figure 2.__ _CellProfiler takes in microscopy images and performs single-cell segmentation to distinguish cells from background. CellProfiler then measures \"hand-engineered\" computer vision features from every single cell. These data are captured for later use in a CSV table or SQLite database._\n", "\n", - "After acquisition, the multiplexed images are processed using image-analysis software such as CellProfiler, which segments cells and their compartments into distinct regions of interest. 
From these segmented images, hundreds to thousands of quantitative features are extracted per cell, capturing properties such as size, shape, intensity, texture, and spatial organization.\n", + "After acquisition, scientists process the images using image-analysis software such as CellProfiler. CellProfiler segments single cells and their biological compartments into distinct regions of interest. From these segmented cells, CellProfiler extracts hundreds to thousands of quantitative features per cell, capturing properties such as size, shape, intensity, texture, and spatial organization.\n", "These high-dimensional feature datasets provide a numerical representation of cell morphology that serves as the foundation for downstream profiling and analysis ([Carpenter et al., 2006](https://doi.org/10.1186/gb-2006-7-10-r100)).\n", "\n", - "CellProfiler was used in conjunction with the `.cppipe` file to produce the following images and data tables from the ExampleHuman dataset." + "We use CellProfiler (with a prespecified configuration `.cppipe` file) to produce the following images and data tables from the ExampleHuman dataset." 
] }, { @@ -1266,7 +1267,7 @@ } ], "source": [ - "# show the tables generated from the resulting CSV files\n", + "# show the tables generated from the resulting CSV files\n", "for profiles in pathlib.Path(source_path).glob(\"*.csv\"):\n", " print(f\"\\nProfiles from CellProfiler: {profiles}\")\n", " display(pd.read_csv(profiles).head())" @@ -1278,13 +1279,13 @@ "id": "5f5b7cd6-9511-4349-bacf-e6304a099025", "metadata": {}, "source": [ - "## Phase 3: Numeric feature datasets from CellProfiler are harmonized by CytoTable\n", + "## Phase 3: CytoTable harmonizes the feature datasets that CellProfiler generates\n", "\n", "![Image showing feature data being read by CytoTable and exported to a CytoTable file.](../_static/features_to_cytotable.png)\n", "\n", - "The high-dimensional feature tables produced by CellProfiler often vary in format depending on the imaging pipeline, experiment, or storage system. CytoTable standardizes these single-cell morphology datasets by harmonizing outputs into consistent, analysis-ready formats such as Parquet or AnnData. This unification ensures that data from diverse experiments can be readily integrated and processed by downstream profiling tools like Pycytominer or coSMicQC, enabling scalable and reproducible cytomining workflows.\n", + "CellProfiler produces high-dimensional feature tables that vary in format depending on the imaging pipeline, experiment, or storage system. Sometimes these feature tables are thousands of columns and hundreds of thousands of rows. CytoTable harmonizes these outputs into consistent, analysis-ready formats such as Parquet or AnnData. This unification ensures that data from diverse experiments can be readily integrated and processed by downstream profiling tools like Pycytominer or coSMicQC, enabling scalable and reproducible bioinformatics workflows.\n", "\n", - "We use CytoTable below to process the numeric feature data observed above." 
+ "We use CytoTable below to process the numeric feature data we generated above." ] }, { @@ -1298,8 +1299,8 @@ "output_type": "stream", "text": [ "example.parquet\n", - "CPU times: user 215 ms, sys: 159 ms, total: 374 ms\n", - "Wall time: 13.1 s\n" + "CPU times: user 239 ms, sys: 167 ms, total: 406 ms\n", + "Wall time: 13.3 s\n" ] } ], @@ -1594,13 +1595,13 @@ { "data": { "text/plain": [ - "\n", + "\n", " created_by: parquet-cpp-arrow version 21.0.0\n", " num_columns: 312\n", " num_rows: 289\n", " num_row_groups: 1\n", " format_version: 2.6\n", - " serialized_size: 87760" + " serialized_size: 87761" ] }, "execution_count": 9, @@ -1623,7 +1624,7 @@ "data": { "text/plain": [ "{b'data-producer': b'https://github.com/cytomining/CytoTable',\n", - " b'data-producer-version': b'1.1.0.post6.dev0+4ddbbe1'}" + " b'data-producer-version': b'1.1.0.post13.dev0+2f51ec3'}" ] }, "execution_count": 10, @@ -1990,7 +1991,7 @@ "Nuclei_Number_Object_Number: int64\n", "-- schema metadata --\n", "data-producer: 'https://github.com/cytomining/CytoTable'\n", - "data-producer-version: '1.1.0.post6.dev0+4ddbbe1'" + "data-producer-version: '1.1.0.post13.dev0+2f51ec3'" ] }, "execution_count": 12, @@ -2020,7 +2021,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.11" + "version": "3.10.16" } }, "nbformat": 4, diff --git a/docs/source/examples/cytotable_mise_en_place_general_overview.ipynb b/docs/source/examples/cytotable_mise_en_place_general_overview.ipynb index ae893390..8cefc3de 100644 --- a/docs/source/examples/cytotable_mise_en_place_general_overview.ipynb +++ b/docs/source/examples/cytotable_mise_en_place_general_overview.ipynb @@ -93,9 +93,9 @@ "\n", "![Image showing cells being stained and captured as images using a microscope.](../_static/cell_to_image.png)\n", "\n", - "__Figure 1.__ _A microscope images cells to highlight cell processes. 
Often, fluorescence dyes paint the cells to mark specific proteins, compartments, and/or organelles._\n", + "__Figure 1.__ _A microscope images cells to highlight cell processes. Often, fluorescence dyes stain the cells to mark specific proteins, compartments, and/or organelles._\n", "\n", - "CytoTable processes microscopy-based data that are created from multiple upstream steps.\n", + "CytoTable processes microscopy-based data that are created from multiple upstream steps (image analysis).\n", "CytoTable does not require any specific sample preparation, and can work with any microscopy experimental design.\n", "However, most often, CytoTable processes fluorescence microscopy images from the Cell Painting assay.\n", "In the Cell Painting assay, scientists stain cells with a panel of six fluorescent dyes that mark key cellular compartments and organelles, including the nucleus, nucleoli/RNA, endoplasmic reticulum, mitochondria, actin cytoskeleton, Golgi apparatus, and plasma membrane ([Bray et al., 2016](https://doi.org/10.1038/nprot.2016.105); [Gustafsdottir et al., 2013](https://doi.org/10.1371/journal.pone.0080999)). Scientists then use microscopes to image these cells across fluorescence channels, and use image analysis software to produce high-content morphology profiles of individual cells for downstream analysis .\n", @@ -191,7 +191,7 @@ "![Image showing CellProfiler being used to create image segmentations, measurements, and exporting numeric feature data to a file.](../_static/image_to_features.png)\n", "\n", "\n", - "__Figure 2.__ _CellProfiler takes in microscopy images and performs single-cell segmentation to distinguish cells from background. CellProfiler then measures \"hand-engineered\" computer vision features from every single cell. These data are captured for later use in a CSV table or SQLite database._\n", + "__Figure 2.__ _CellProfiler takes in microscopy images and performs single-cell segmentation to distinguish cells from background. 
CellProfiler then measures \"hand-engineered\" computer vision features from every single cell. These data are captured for later use in multiple CSV tables or SQLite database._\n", "\n", "After acquisition, scientists process the images using image-analysis software such as CellProfiler. CellProfiler segments single cells and their biological compartments into distinct regions of interest. From these segmented cells, CellProfiler extracts hundreds to thousands of quantitative features per cell, capturing properties such as size, shape, intensity, texture, and spatial organization.\n", "These high-dimensional feature datasets provide a numerical representation of cell morphology that serves as the foundation for downstream profiling and analysis ([Carpenter et al., 2006](https://doi.org/10.1186/gb-2006-7-10-r100)).\n", diff --git a/docs/source/examples/cytotable_mise_en_place_general_overview.py b/docs/source/examples/cytotable_mise_en_place_general_overview.py index 5e3fcf96..67d5dfe9 100644 --- a/docs/source/examples/cytotable_mise_en_place_general_overview.py +++ b/docs/source/examples/cytotable_mise_en_place_general_overview.py @@ -3,16 +3,15 @@ # jupytext: # text_representation: # extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.2 +# format_name: light +# format_version: '1.5' +# jupytext_version: 1.17.3 # kernelspec: # display_name: Python 3 (ipykernel) # language: python # name: python3 # --- -# %% [markdown] # # CytoTable mise en place (general overview) # # This notebook will help you understand the basics of using CytoTable and the biological basis of each step. @@ -24,7 +23,7 @@ # > refer to organizing and arranging the ingredients ..." 
# > - [Wikipedia](https://en.wikipedia.org/wiki/Mise_en_place) -# %% +# + import pathlib from collections import Counter @@ -38,31 +37,29 @@ # setup variables for use throughout the notebook source_path = "../../../tests/data/cellprofiler/examplehuman" dest_path = "./example.parquet" +# - -# %% # remove the dest_path if it's present if pathlib.Path(dest_path).is_file(): pathlib.Path(dest_path).unlink() -# %% # show the files we will use as source data with CytoTable list(pathlib.Path(source_path).glob("*")) -# %% [markdown] # ## Phase 1: Cells are imaged by microscopes, with optional fluorescence staining # # ![Image showing cells being stained and captured as images using a microscope.](../_static/cell_to_image.png) # -# __Figure 1.__ _A microscope images cells to highlight cell processes. Often, fluorescence dyes paint the cells to mark specific proteins, compartments, and/or organelles._ +# __Figure 1.__ _A microscope images cells to highlight cell processes. Often, fluorescence dyes stain the cells to mark specific proteins, compartments, and/or organelles._ # -# CytoTable processes microscopy-based data that are created from multiple upstream steps. +# CytoTable processes microscopy-based data that are created from multiple upstream steps (image analysis). # CytoTable does not require any specific sample preparation, and can work with any microscopy experimental design. # However, most often, CytoTable processes fluorescence microscopy images from the Cell Painting assay. # In the Cell Painting assay, scientists stain cells with a panel of six fluorescent dyes that mark key cellular compartments and organelles, including the nucleus, nucleoli/RNA, endoplasmic reticulum, mitochondria, actin cytoskeleton, Golgi apparatus, and plasma membrane ([Bray et al., 2016](https://doi.org/10.1038/nprot.2016.105); [Gustafsdottir et al., 2013](https://doi.org/10.1371/journal.pone.0080999)). 
Scientists then use microscopes to image these cells across fluorescence channels, and use image analysis software to produce high-content morphology profiles of individual cells for downstream analysis . # # We use the ExampleHuman dataset provided from CellProfiler Examples ([Moffat et al., 2006](https://doi.org/10.1016/j.cell.2006.01.040), [CellProfiler Examples Link](https://github.com/CellProfiler/examples/tree/master/ExampleHuman)) to describe this process below. -# %% +# + # display the images we will gather features from image_name_map = {"d0.tif": "DNA", "d1.tif": "PH3", "d2.tif": "Cells"} @@ -73,34 +70,31 @@ stain = val print(f"\nImage with stain: {stain}") display(Image.open(image)) +# - -# %% [markdown] # ## Phase 2: CellProfiler segments cells and measures numeric features # # ![Image showing CellProfiler being used to create image segmentations, measurements, and exporting numeric feature data to a file.](../_static/image_to_features.png) # # -# __Figure 2.__ _CellProfiler takes in microscopy images and performs single-cell segmentation to distinguish cells from background. CellProfiler then measures "hand-engineered" computer vision features from every single cell. These data are captured for later use in a CSV table or SQLite database._ +# __Figure 2.__ _CellProfiler takes in microscopy images and performs single-cell segmentation to distinguish cells from background. CellProfiler then measures "hand-engineered" computer vision features from every single cell. These data are captured for later use in multiple CSV tables or SQLite database._ # # After acquisition, scientists process the images using image-analysis software such as CellProfiler. CellProfiler segments single cells and their biological compartments into distinct regions of interest. From these segmented cells, CellProfiler extracts hundreds to thousands of quantitative features per cell, capturing properties such as size, shape, intensity, texture, and spatial organization. 
# These high-dimensional feature datasets provide a numerical representation of cell morphology that serves as the foundation for downstream profiling and analysis ([Carpenter et al., 2006](https://doi.org/10.1186/gb-2006-7-10-r100)). # # We use CellProfiler (with a prespecified configuration `.cppipe` file) to produce the following images and data tables from the ExampleHuman dataset. -# %% # show the segmentations through an overlay with outlines for image in pathlib.Path(source_path).glob("*Overlay.png"): print(f"Image outlines from segmentation (composite)") print("Color key: (dark blue: nuclei, light blue: cells, yellow: PH3)") display(Image.open(image)) -# %% # show the tables generated from the resulting CSV files for profiles in pathlib.Path(source_path).glob("*.csv"): print(f"\nProfiles from CellProfiler: {profiles}") display(pd.read_csv(profiles).head()) -# %% [markdown] # ## Phase 3: CytoTable harmonizes the feature datasets that CellProfiler generates # # ![Image showing feature data being read by CytoTable and exported to a CytoTable file.](../_static/features_to_cytotable.png) @@ -109,7 +103,7 @@ # # We use CytoTable below to process the numeric feature data we generated above. -# %% +# + # %%time # run cytotable convert @@ -122,25 +116,21 @@ preset="cellprofiler_csv", ) print(pathlib.Path(result).name) +# - -# %% # show the table head using pandas pq.read_table(source=result).to_pandas().head() -# %% # show metadata for the result file pq.read_metadata(result) -# %% # show schema metadata which includes CytoTable information # note: this information will travel with the file. 
pq.read_schema(result).metadata -# %% # show schema column name summaries print("Column name prefix counts:") dict(Counter(w.split("_", 1)[0] for w in pq.read_schema(result).names)) -# %% # show full schema details pq.read_schema(result) diff --git a/docs/source/index.md b/docs/source/index.md index f7bb0da7..38e11936 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -13,7 +13,7 @@ caption: 'Contents:' maxdepth: 3 --- overview -tutorial +tutorials examples presentations contributing diff --git a/docs/source/overview.md b/docs/source/overview.md index 05eda380..657cff00 100644 --- a/docs/source/overview.md +++ b/docs/source/overview.md @@ -1,7 +1,7 @@ # Overview This page provides a brief overview of CytoTable topics. -For a brief introduction on how to use CytoTable, please see the [tutorial](tutorial.md) page. +For a brief introduction on how to use CytoTable, please see the [tutorials](tutorials.md) page. ## Presets and Manual Overrides diff --git a/docs/source/tutorial.md b/docs/source/tutorial.md deleted file mode 100644 index 6a30eb3f..00000000 --- a/docs/source/tutorial.md +++ /dev/null @@ -1,41 +0,0 @@ -# Tutorial - -This page covers brief tutorials and notes on how to use CytoTable. - -## CellProfiler CSV Output to Parquet - -[CellProfiler](https://cellprofiler.org/) pipelines or projects may produce various CSV-based compartment output (for example, "Cells.csv", "Cytoplasm.csv", etc.). -CytoTable converts this data to Parquet from local or object-storage based locations. - -Files with similar names nested within sub-folders will be concatenated by default (appended to the end of each data file) together and used to create a single Parquet file per compartment. -For example: if we have `folder/subfolder_a/cells.csv` and `folder/subfolder_b/cells.csv`, using `convert(source_path="folder", ...)` will result in `folder.cells.parquet` (unless `concat=False`). 
- -Note: The `dest_path` parameter (`convert(dest_path="")`) will be used for intermediary data work and must be a new file or directory path. -This path will result directory output on `join=False` and a single file output on `join=True`. - -For example, see below: - -```python -from cytotable import convert - -# using a local path with cellprofiler csv presets -convert( - source_path="./tests/data/cellprofiler/ExampleHuman", - source_datatype="csv", - dest_path="ExampleHuman.parquet", - dest_datatype="parquet", - preset="cellprofiler_csv", -) - -# using an s3-compatible path with no signature for client -# and cellprofiler csv presets -convert( - source_path="s3://s3path", - source_datatype="csv", - dest_path="s3_local_result", - dest_datatype="parquet", - concat=True, - preset="cellprofiler_csv", - no_sign_request=True, -) -``` diff --git a/docs/source/tutorials.md b/docs/source/tutorials.md new file mode 100644 index 00000000..59968cfa --- /dev/null +++ b/docs/source/tutorials.md @@ -0,0 +1,60 @@ +# Tutorials + +Start here if you are new to CytoTable. +We’ve split material by audience: + +- **Image analysts (no engineering background required):** follow the narrative tutorials below. They include downloadable data, exact commands, and what to expect. Please also feel free to reference the [example notebooks](examples.md). +- **Engineers / power users:** see any documentation, including the [example notebooks](examples.md), for tuning and integration details, or use the quick recipe below. + +```{admonition} Who this helps (and doesn’t) +- Helps: image analysts who want to get CellProfiler/DeepProfiler/InCarta outputs into Parquet with minimal coding; people comfortable running a few commands. +- Not ideal: raw image ingestion or pipeline authoring (use CellProfiler/DeepProfiler upstream); workflows needing a GUI-only experience. +- Effort: install, copy/paste a few commands, validate outputs in minutes. 
+``` + +```{toctree} +--- +maxdepth: 2 +caption: Tutorials (start here) +--- +tutorials/cellprofiler_to_parquet +tutorials/npz_embeddings_to_parquet +tutorials/multi_plate_merge_tablenumber +``` + +## Quick recipe: CellProfiler CSV to Parquet + +This short recipe is for people comfortable with Python/CLI and parallels our older tutorial. +If you prefer a guided, narrative walkthrough with downloadable inputs and expected outputs, use the tutorial above. + +[CellProfiler](https://cellprofiler.org/) exports compartment CSVs (for example, "Cells.csv", "Cytoplasm.csv"). +CytoTable converts this data to Parquet from local or object-storage locations. + +Files with similar names nested within sub-folders are concatenated by default (for example, `folder/sub_a/cells.csv` and `folder/sub_b/cells.csv` become a single `folder.cells.parquet` unless `concat=False`). + +The `dest_path` parameter is used for intermediary work and must be a new file or directory path. +It will be a directory when `join=False` and a single file when `join=True`. 
+ +```python +from cytotable import convert + +# Local CSVs with CellProfiler preset +convert( + source_path="./tests/data/cellprofiler/ExampleHuman", + source_datatype="csv", + dest_path="ExampleHuman.parquet", + dest_datatype="parquet", + preset="cellprofiler_csv", +) + +# S3 CSVs (unsigned) with CellProfiler preset +convert( + source_path="s3://s3path", + source_datatype="csv", + dest_path="s3_local_result", + dest_datatype="parquet", + concat=True, + preset="cellprofiler_csv", + no_sign_request=True, +) +``` diff --git a/docs/source/tutorials/cellprofiler_to_parquet.md b/docs/source/tutorials/cellprofiler_to_parquet.md new file mode 100644 index 00000000..c924c3de --- /dev/null +++ b/docs/source/tutorials/cellprofiler_to_parquet.md @@ -0,0 +1,121 @@ +# Tutorial: CellProfiler SQLite or CSV to Parquet + +A start-to-finish walkthrough for image analysts who want a working Parquet export from CellProfiler outputs (SQLite or CSV), including public S3 and local data. + +## What you will accomplish + +- Convert CellProfiler outputs to Parquet with a preset that matches common table/column layouts. +- Handle both SQLite (typical Cell Painting Gallery exports) and CSV folder outputs. +- Keep a persistent local cache so downloads are reused and avoid “file vanished” errors on temp disks. +- Verify the outputs quickly (file names and row counts) without needing to understand the internals. + +```{admonition} If your data looks like this, change... +- Local SQLite instead of S3: set `source_path` to the local `.sqlite` file; remove `no_sign_request`; keep `local_cache_dir`. +- CellProfiler CSV folders: point `source_path` to the folder that contains `Cells.csv`, `Cytoplasm.csv`, etc.; set `source_datatype="csv"` and `preset="cellprofiler_csv"`. +- Only certain compartments: add `targets=["cells", "nuclei"]` (case-insensitive). +- Memory constrained: lower `chunk_size` (e.g., 10000) and ensure `CACHE_DIR` has space. 
+``` + +## Setup (copy-paste) + +```bash +python -m venv .venv +source .venv/bin/activate +pip install --upgrade pip +pip install cytotable +``` + +## Inputs and outputs + +- **SQLite example (public S3):** `s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite` + No credentials are required (`no_sign_request=True`). +- **CSV example (local folder):** `./tests/data/cellprofiler/ExampleHuman` which contains `Cells.csv`, `Cytoplasm.csv`, `Nuclei.csv`, etc. +- **Outputs:** Parquet files for each compartment (Image, Cells, Cytoplasm, Nuclei) in `./outputs/...`. + +## Before you start + +- Install CytoTable: + `pip install cytotable` +- Make sure you have enough local disk space (~1–2 GB) for the cached SQLite and Parquet outputs. +- If you prefer to download the file first, you can also `aws s3 cp` the same path locally, then set `source_path` to the local file and drop `no_sign_request`. + +## Step 1: choose your input type + +Pick one of the two setups below. 
+ +**SQLite from public S3 (Cell Painting Gallery)** + +```bash +export SOURCE_PATH="s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite" +export SOURCE_DATATYPE="sqlite" +export PRESET="cellprofiler_sqlite_cpg0016_jump" +export DEST_PATH="./outputs/br00126114.parquet" +export CACHE_DIR="./sqlite_s3_cache" +mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR" +``` + +**CellProfiler CSV folder (local or mounted storage)** + +```bash +export SOURCE_PATH="./tests/data/cellprofiler/ExampleHuman" +export SOURCE_DATATYPE="csv" +export PRESET="cellprofiler_csv" +export DEST_PATH="./outputs/examplehuman.parquet" +export CACHE_DIR="./csv_cache" +mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR" +``` + +## Step 2: run the conversion (minimal Python) + +```python +import os +import cytotable + +# If you used the bash exports above: +SOURCE_PATH = os.environ["SOURCE_PATH"] +SOURCE_DATATYPE = os.environ["SOURCE_DATATYPE"] +DEST_PATH = os.environ["DEST_PATH"] +PRESET = os.environ["PRESET"] +CACHE_DIR = os.environ["CACHE_DIR"] + +# (Alternatively, set them directly as strings in Python.) + +result = cytotable.convert( + source_path=SOURCE_PATH, + source_datatype=SOURCE_DATATYPE, + dest_path=DEST_PATH, + dest_datatype="parquet", + preset=PRESET, + local_cache_dir=CACHE_DIR, + # For public S3 (SQLite or CSV) add: + no_sign_request=True, + # Reasonable chunking for large tables; adjust up/down if you hit memory limits + chunk_size=30000, +) + +print(result) +``` + +Why these flags matter (in plain language): + +- `local_cache_dir`: keeps downloaded data somewhere predictable. +- `preset`: selects the right table names and page keys for this dataset (SQLite or CSV). +- `chunk_size`: processes data in pieces so you don’t need excessive RAM. +- `no_sign_request`: needed because the sample bucket is public and unsigned. + +## Step 3: check that the outputs look right + +You should see Parquet files in the destination directory. 
+If you set `join=True` (handy for the SQLite example), you get a single `.parquet` file containing all compartments. +If you set `join=False` (handy for CSV folders), you get separate Parquet files for each compartment. + +```bash +ls "$DEST_PATH" +# SQLite example: br00126114.parquet +# CSV example: examplehuman.parquet +``` + +## What success looks like + +- A stable local cache of the SQLite file or CSV downloads remains in `CACHE_DIR` (useful for repeated runs). +- Parquet outputs exist in `DEST_PATH` and can be read by DuckDB/Pandas/PyArrow. diff --git a/docs/source/tutorials/multi_plate_merge_tablenumber.md b/docs/source/tutorials/multi_plate_merge_tablenumber.md new file mode 100644 index 00000000..f5a460a6 --- /dev/null +++ b/docs/source/tutorials/multi_plate_merge_tablenumber.md @@ -0,0 +1,76 @@ +# Tutorial: Merging multiple plates with TableNumber + +Goal: combine multiple CellProfiler SQLite exports (plates) into a single Parquet output while preserving plate identity via `TableNumber`. + +## What you will accomplish + +- Point CytoTable at a folder of multiple plate exports. +- Add `TableNumber` so downstream analyses can distinguish rows from different plates. +- Verify merged outputs. + +## Setup (copy-paste) + +```bash +python -m venv .venv +source .venv/bin/activate +pip install --upgrade pip +pip install cytotable +``` + +## Inputs and outputs + +- **Input:** A folder of CellProfiler SQLite files (example structure): + `data/plates/PlateA.sqlite` + `data/plates/PlateB.sqlite` +- **Output:** Parquet file at `./outputs/multi_plate.parquet`, with a `Metadata_TableNumber` column indicating plate. 
+ +## Step 1: define your paths + +```bash +export SOURCE_PATH="./data/plates" +export DEST_PATH="./outputs/multi_plate.parquet" +export CACHE_DIR="./sqlite_cache" +# create only the parent folder: dest_path itself must be a new path +mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR" +``` + +## Step 2: run the conversion with tablenumber + +```python +import os +import cytotable + +source_path = os.environ["SOURCE_PATH"] +dest_path = os.environ["DEST_PATH"] +cache_dir = os.environ["CACHE_DIR"] + +result = cytotable.convert( + source_path=source_path, + source_datatype="sqlite", + dest_path=dest_path, + dest_datatype="parquet", + preset="cellprofiler_sqlite", + local_cache_dir=cache_dir, + add_tablenumber=True, # key for multi-plate merges + chunk_size=30000, +) + +print(result) +``` + +Why this matters: + +- `add_tablenumber=True` adds `Metadata_TableNumber` so you can filter/group by plate later. +- Pointing `source_path` to a folder makes CytoTable glob multiple plates. +- `local_cache_dir` keeps each plate cached locally for reliable DuckDB access. + +## Step 3: validate plate separation + +You should see one Parquet file (`multi_plate.parquet`) at `DEST_PATH`. +Opening the file with Pandas or PyArrow should show `Metadata_TableNumber` present and non-zero rows. +If you processed multiple plates, expect multiple distinct values in that column. + +## Scenario callouts (“if your data looks like this...”) + +- **Local SQLite files:** set `source_path` to the folder of local `.sqlite` files; `no_sign_request` is not needed. +- **Only certain compartments:** pass `targets=["cells", "nuclei"]` to limit tables. +- **Memory constrained:** lower `chunk_size` (e.g., 10000) and ensure `CACHE_DIR` is on a disk with enough space for all plates + parquet output. 
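The plate-separation check in Step 3 can be sketched in a few lines of Python. This is a minimal illustration, not part of CytoTable: the helper names `rows_per_plate` and `plates_are_separated` and the sample values are hypothetical, and the optional PyArrow read assumes the output path from Step 1 and that `pyarrow` is installed.

```python
from collections import Counter
from pathlib import Path


def rows_per_plate(table_numbers):
    """Count rows for each distinct Metadata_TableNumber value."""
    return dict(Counter(table_numbers))


def plates_are_separated(table_numbers, expected_plates):
    """Return True when at least `expected_plates` distinct values are present."""
    return len(rows_per_plate(table_numbers)) >= expected_plates


# Hypothetical stand-in for a real Metadata_TableNumber column (two plates).
example_column = [101, 101, 202, 202, 202]
print(rows_per_plate(example_column))
print(plates_are_separated(example_column, expected_plates=2))

# Optional check against the tutorial output; skipped when the file is absent.
dest_path = Path("./outputs/multi_plate.parquet")
if dest_path.is_file():
    import pyarrow.parquet as pq  # assumption: pyarrow is installed

    table = pq.read_table(dest_path, columns=["Metadata_TableNumber"])
    print(rows_per_plate(table.column("Metadata_TableNumber").to_pylist()))
```

Counting rows per value, rather than only checking that the column exists, also shows at a glance whether one plate contributed far fewer rows than expected.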
diff --git a/docs/source/tutorials/npz_embeddings_to_parquet.md b/docs/source/tutorials/npz_embeddings_to_parquet.md new file mode 100644 index 00000000..28400109 --- /dev/null +++ b/docs/source/tutorials/npz_embeddings_to_parquet.md @@ -0,0 +1,77 @@ +# Tutorial: NPZ embeddings + metadata to Parquet + +A start-to-finish walkthrough for turning NPZ files (for example, DeepProfiler outputs) plus metadata into Parquet. +This uses a small example bundled in the repo. + +## What you will accomplish + +- Read NPZ feature files and matching metadata from disk. +- Combine them into Parquet with a preset that aligns common keys. +- Validate the output shape and schema. + +```{admonition} If your data looks like this, change... +- NPZ in a different folder: point `source_path` there; keep `preset="deepprofiler"`. +- Memory constrained: add `chunk_size=10000` to the convert call. +- `.npy` files or plain CSV feature tables: this tutorial/preset does not cover them; use the CellProfiler CSV/SQLite flows instead. 
+``` + +## Setup (copy-paste) + +```bash +python -m venv .venv +source .venv/bin/activate +pip install --upgrade pip +pip install cytotable +``` + +## Inputs and outputs + +- **Input:** Example NPZ + metadata in this repo: `tests/data/deepprofiler/pycytominer_example` +- **Output:** A Parquet file under `./outputs/deepprofiler_example.parquet` + +## Step 1: define your paths + +```bash +export SOURCE_PATH="tests/data/deepprofiler/pycytominer_example" +export DEST_PATH="./outputs/deepprofiler_example.parquet" +# create only the parent folder: dest_path itself must be a new path +mkdir -p "$(dirname "$DEST_PATH")" +``` + +## Step 2: run the conversion + +```python +import os +import cytotable + +source_path = os.environ["SOURCE_PATH"] +dest_path = os.environ["DEST_PATH"] + +result = cytotable.convert( + source_path=source_path, + source_datatype="npz", + dest_path=dest_path, + dest_datatype="parquet", + preset="deepprofiler", + concat=True, + join=False, +) + +print(result) +``` + +Notes (why these flags matter): + +- `preset="deepprofiler"` aligns NPZ feature arrays with metadata columns. +- `concat=True` merges multiple NPZ shards. +- `join=False` writes per-table Parquet files (the preset produces `all_files.npz` as the logical table). + +## Step 3: validate the output + +You should see `deepprofiler_example.parquet` in `DEST_PATH`. +Opening it with Pandas or PyArrow should show non-zero rows and both feature (`efficientnet_*`) and metadata columns. + +## What success looks like + +- A Parquet file `deepprofiler_example.parquet` exists in `DEST_PATH`. +- DuckDB/Pandas can read the file; row count is non-zero. +- Feature columns (for example, `efficientnet_*`) and metadata columns (plate/well/site) both appear.
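The schema check described above can be sketched as follows. This is a rough illustration under stated assumptions: `split_columns` is a hypothetical helper, the example column names are made up to mirror the `efficientnet_*` and plate/well/site pattern, and the optional PyArrow read assumes the output is a single file at the Step 1 path.

```python
from pathlib import Path


def split_columns(names, feature_prefix="efficientnet_"):
    """Split column names into (feature, metadata) lists by prefix."""
    features = [n for n in names if n.startswith(feature_prefix)]
    metadata = [n for n in names if not n.startswith(feature_prefix)]
    return features, metadata


# Hypothetical column names; real outputs may name metadata columns differently.
example_names = [
    "Metadata_Plate",
    "Metadata_Well",
    "Metadata_Site",
    "efficientnet_0",
    "efficientnet_1",
]
features, metadata = split_columns(example_names)
print(f"{len(features)} feature columns, {len(metadata)} metadata columns")

# Optional check against the tutorial output; skipped when the file is absent.
dest_path = Path("./outputs/deepprofiler_example.parquet")
if dest_path.is_file():
    import pyarrow.parquet as pq  # assumption: pyarrow is installed

    features, metadata = split_columns(pq.read_schema(dest_path).names)
    print(f"{len(features)} feature columns, {len(metadata)} metadata columns")
```

Reading only the schema keeps the check fast, because no row data is loaded.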