2 changes: 1 addition & 1 deletion documentation/DCP-documentation/AWS_hygiene_scripts.md
@@ -1,6 +1,6 @@
# AWS Hygiene Scripts

See also (AUSPICES)[https://github.com/broadinstitute/AuSPICES] for setting up various hygiene scripts to automatically run in your AWS account.
See also [AUSPICES](https://github.com/broadinstitute/AuSPICES) for setting up various hygiene scripts to automatically run in your AWS account.

## Clean out old alarms

2 changes: 2 additions & 0 deletions documentation/DCP-documentation/_toc.yml
@@ -11,6 +11,8 @@ parts:
- file: config_examples
- file: SQS_QUEUE_information
- file: step_2_submit_jobs
sections:
- file: passing_files_to_DCP
- file: step_3_start_cluster
- file: step_4_monitor
- caption:
72 changes: 72 additions & 0 deletions documentation/DCP-documentation/passing_files_to_DCP.md
@@ -0,0 +1,72 @@
# Passing Files to DCP

Distributed-CellProfiler can be told what files to use through a LoadData.csv, a batch file, or a file list.

## Load Data

![LoadData.csv](images/LoadDataCSV.png)

A LoadData.csv is a CSV that tells CellProfiler how your images should be parsed.
At a minimum, this CSV should contain PathName_{NameOfChannel} and FileName_{NameOfChannel} columns for each of your channels, as well as Metadata_{PieceOfMetadata} for each kind of metadata being used to group your image sets.
It can contain any other metadata you would like to track.
Some users have reported issues with using relative paths in the PathName columns; using absolute paths beginning with `/home/ubuntu/bucket/{relativepath}` may increase your odds of success.
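
As an illustration only (not an official template), a minimal LoadData.csv for a hypothetical two-channel experiment could be written with a short script like the sketch below; the channel names, metadata values, filenames, and bucket paths are placeholders to replace with your own.

```python
import csv

# Hypothetical two-channel example: replace channel names, metadata values,
# filenames, and paths with your own. Paths are absolute under the S3FS mount.
rows = [
    {
        "Metadata_Plate": "Plate1",
        "Metadata_Well": "A01",
        "Metadata_Site": "1",
        "FileName_DNA": "Plate1_A01_s1_w1.tif",
        "PathName_DNA": "/home/ubuntu/bucket/project_folder/images/Plate1",
        "FileName_Phalloidin": "Plate1_A01_s1_w2.tif",
        "PathName_Phalloidin": "/home/ubuntu/bucket/project_folder/images/Plate1",
    },
]

with open("load_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()    # Metadata_*, FileName_*, and PathName_* columns
    writer.writerows(rows)  # one row per image set
```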

### Creating LoadData.csv

You can create this CSV yourself via your favorite scripting language.
We maintain a script called [pe2loaddata](https://github.com/broadinstitute/pe2loaddata) for creating LoadData.csv files from Phenix metadata XML files.

You can also create the LoadData.csv in a local copy of CellProfiler using the standard input modules of Images, Metadata, NamesAndTypes and Groups.
More written and video information about using the input modules can be found [here](https://broad.io/CellProfilerInput).
After loading in your images, use the Export->Image Set Listing command.
You will then need to replace the local paths with the paths where the files can be found in the cloud.
If your files follow the same structure locally and in the cloud, this can be done with a simple find and replace in any text editing software.
(e.g. Find '/Users/eweisbar/Desktop' and replace with '/home/ubuntu/bucket')
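
If you prefer to script the replacement, a minimal sketch (assuming the exported listing is saved as `load_data.csv` and using the example prefixes above) is:

```python
from pathlib import Path

# Placeholders: point at your exported image set listing and your own prefixes.
csv_path = Path("load_data.csv")
local_prefix = "/Users/eweisbar/Desktop"
cloud_prefix = "/home/ubuntu/bucket"

# Rewrite the PathName_* entries in place by swapping the path prefix.
text = csv_path.read_text()
csv_path.write_text(text.replace(local_prefix, cloud_prefix))
```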

### Using LoadData.csv

To use a LoadData.csv with submitJobs, put the path to the LoadData.csv in **data_file:**.

To use a LoadData.csv with run_batch_general.py, enter the name of the LoadData.csv under **#project specific stuff** in `{STEP}name`.
At the bottom of the file, make sure the command for the step you are running either has no arguments or sets `batch=False`.
(e.g. `MakeAnalysisJobs()` or `MakeAnalysisJobs(batch=False)`)
Note that if you do not follow our standard file organization, you will also need to edit `datafilepath` under **#not project specific, unless you deviate from the structure**.
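
For orientation only, the edits described above might look like the sketch below; the exact variable names in run_batch_general.py depend on the version you have (here they follow the `{STEP}name` and `datafilepath` pattern, with `analysis` as a hypothetical step), so check your copy of the script rather than pasting this in.

```python
# --- #project specific stuff (sketch; names may differ in your copy) ---
analysisname = "load_data.csv"  # the {STEP}name entry for the step you are running

# --- #not project specific, unless you deviate from the structure ---
# datafilepath = "projects/MyProject/workspace/load_data"  # hypothetical; only edit if your layout differs

# --- at the bottom of the file ---
# For a LoadData.csv run, call the step with no batch argument (or batch=False):
# MakeAnalysisJobs()  # equivalent to MakeAnalysisJobs(batch=False)
```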

## Batch Files

Batch files are an easy way to transition from running locally to distributed.
A batch file is an `.h5` file created by CellProfiler which captures all the data needed to run your workflow - pipeline and file information are packaged together.
To use a batch file, your data needs to have the same structure in the cloud as on your local machine.

### Creating batch files

To create a batch file, load all your images into a local copy of CellProfiler using the standard input modules of Images, Metadata, NamesAndTypes and Groups.
More written and video information about using the input modules can be found [here](https://broad.io/CellProfilerInput).
Put the `CreateBatchFiles` module at the end of your pipeline and ensure that it is selected.
Add a path mapping and edit the `Local root path` and `Cluster root path`.
Run the CellProfiler pipeline by pressing the `Analyze Images` button; note that it won't actually run your pipeline but will instead create a batch file.
More information on the `CreateBatchFiles` module can be found [here](https://cellprofiler-manual.s3.amazonaws.com/CellProfiler-4.2.4/modules/fileprocessing.html).

### Using batch files

To use a batch file with submitJobs, put the path to the `.h5` file in **data_file:** and **pipeline:**.

To use a batch file with run_batch_general.py, enter the name of the batch file under **#project specific stuff** in `batchpipename{STEP}`.
At the bottom of the file, set `batch=True` in the command for the step you are running.
(e.g. `MakeAnalysisJobs(batch=True)`)
Note that if you do not follow our standard file organization, you will also need to edit `batchpath` under **#not project specific, unless you deviate from the structure**.
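
Again as a sketch only (the variable names follow the `batchpipename{STEP}` and `batchpath` pattern above and may differ in your copy of run_batch_general.py; the `.h5` name is a placeholder):

```python
# --- #project specific stuff (sketch; names may differ in your copy) ---
batchpipenameanalysis = "Batch_data.h5"  # the batchpipename{STEP} entry for the step you are running

# --- #not project specific, unless you deviate from the structure ---
# batchpath = "projects/MyProject/workspace/batchfiles"  # hypothetical; only edit if your layout differs

# --- at the bottom of the file ---
# MakeAnalysisJobs(batch=True)
```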

## File lists

You can also simply pass a list of absolute file paths (not relative paths) with one file per row in `.txt` format.
Note that file lists themselves do not associate metadata with file paths (in contrast to LoadData.csv files, where you can enter any metadata columns you desire).
Therefore, you need to extract the metadata that Distributed-CellProfiler uses for grouping from your file and folder names with the Metadata module in your CellProfiler pipeline.
You can pass additional metadata to CellProfiler by clicking `Add another extraction method`, setting the method to `Import from file`, and setting `Metadata file location` to `Default Input Folder`.

### Creating File Lists

Use any text editing software to create a `.txt` file where each line of the file is a path to a single image that you want to process.
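
If you would rather generate the list with a script, a minimal sketch (the image folder, extension, and output name are placeholders) is:

```python
from pathlib import Path

# Placeholders: point at your image folder and pick your own list name/extension.
image_root = Path("/home/ubuntu/bucket/project_folder/images")

with open("filelist.txt", "w") as f:
    for image in sorted(image_root.rglob("*.tif")):
        f.write(f"{image.resolve()}\n")  # absolute paths, one image per line
```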

> **Reviewer note (Collaborator):** We should mention here that it needs to be abspaths, not relpaths.

### Using File Lists

To use a file list with submitJobs, put the path to the `.txt` file in **data_file:**.
7 changes: 2 additions & 5 deletions documentation/DCP-documentation/step_2_submit_jobs.md
@@ -25,12 +25,9 @@ If using LoadData, make sure your "Base image location" is set to "None".
## Configuring your job file

* **pipeline:** The path to your pipeline file.
* **data_file:** The path to your CSV.
At a minimum, this CSV should contain PathName_{NameOfChannel} and FileName_{NameOfChannel} columns for each of your channels, as well as Metadata_{PieceOfMetadata} for each kind of metadata being used to group your image sets.
You can create this CSV yourself via your favorite scripting language or by using the Images, Metadata, and NamesAndTypes modules in CellProfiler to generate image sets then using the Export->Image Set Listing command.
Some users have reported issues with using relative paths in the PathName columns; using absolute paths beginning with `/home/ubuntu/bucket/{relativepath}` may increase your odds of success.
* **data_file:** The path to your LoadData.csv, batch file, or file list file.
* **input:** The path to your default input directory.
This is not necessary for every pipeline but can be helpful when non-image files are needed in the pipeline (such as a text file containing quality control rules for the FlagImage module).
This is not necessary for every pipeline but can be helpful when non-image files are needed in the pipeline (such as a text file containing quality control rules for the FlagImage module or a metadata file for use with file lists).
DO NOT set this to a large directory, or CellProfiler will try to scan the entire thing before running your pipeline.
* **output:** The top output directory you'd like your files placed in.
* **output_structure:** By default, Distributed-CellProfiler will put your output in subfolders created by hyphenating all your Metadata entries (see below) in order (e.g. if the individual group being processed was `{"Metadata": "Metadata_Plate=Plate1,Metadata_Well=A01"}`, the output would be placed in `output_top_directory/Plate1-A01`.)
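
As a minimal sketch of that naming logic (illustrative only, not the actual Distributed-CellProfiler code):

```python
def default_output_subfolder(metadata: str) -> str:
    """Hyphenate the Metadata values of a job group into a subfolder name."""
    values = [pair.split("=", 1)[1] for pair in metadata.split(",")]
    return "-".join(values)

print(default_output_subfolder("Metadata_Plate=Plate1,Metadata_Well=A01"))  # Plate1-A01
```
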
2 changes: 1 addition & 1 deletion documentation/DCP-documentation/troubleshooting_runs.md
@@ -15,7 +15,7 @@
| Jobs completing (total messages decreasing) much more quickly than expected. | "==OUT, SUCCESS" | No outcome/saved files on s3 | | There is a mismatch in your metadata somewhere. | Check the Metadata_ columns in your LoadData.csv for typos or a mismatch with your jobs file. The most common sources of mismatch are case and zero padding (e.g. A01 vs a01 vs A1). Check for these mismatches and edit the job file accordingly. If you use pe2loaddata to create your CSVs and the plate was imaged multiple times, pay particular attention to the Metadata_Plate column as numbering reflecting this will be automatically passed into the LoadData.csv |
| | Your specified output structure does not match the Metadata passed. |Expected output is seen.| | This is not necessarily an error. If the input grouping is different than the output grouping (e.g. jobs are run by Plate-Well-Site but are all output to a single Plate folder) then this will print in the Cloudwatch log that matches the input structure but actual job progress will print in the Cloudwatch log that matches the output structure. | |
| | Your perinstance logs have an IOError indicating that an .h5 batchfile does not exist | No outcome/saved files on s3 | | No batchfiles exist for your project. | Either you need to create the batch files and make sure that they are in the appropriate directory OR re-start and use MakeAnalysisJobs() instead of MakeAnalysisJobs(mode='batch') in run_batch_general.py |
| | | | Machines made in EC2 and dockers are made in ECS but the dockers are not placed on the machines | There is a mismatch in your DCP config file. | Confirm that the MEMORY matches the MACHINE_TYPE set in your config. |
| | | | Machines made in EC2 and dockers are made in ECS but the dockers are not placed on the machines | There is a mismatch in your DCP config file. | Confirm that the MEMORY matches the MACHINE_TYPE set in your config. Confirm that there are no typos in your DOCKERHUB_TAG set in your config. |
| | Your perinstance logs have an IOError indicating that CellProfiler cannot open your pipeline | | | You have a corrupted pipeline. | Check if you can open your pipeline locally. It may have been corrupted on upload or it may have an error within the pipeline itself. |
| |"== ERR move failed:An error occurred (SlowDown) when calling the PutObject operation (reached max retries: 4): Please reduce your request rate." Error may not show initially and may become more prevalent with time. | | | Too many jobs are finishing too quickly creating a backlog of jobs waiting to upload to S3. | You can 1) check out fewer machines at a time, 2) check out smaller machines and run fewer copies of DCP at the same time, or 3) group jobs in larger groupings (e.g. by Plate instead of Well or Site). If this happens because you have many jobs finishing at the same time (but not finishing very rapidly such that it's not creating an increasing backlog) you can increase SECONDS_TO_START in config.py so there is more separation between jobs finishing.|
| | "/home/ubuntu/bucket: Transport endpoint is not connected" | Cannot be accessed by fleet. | | S3FS has stochastically dropped/failed to connect. | Perform your run without using S3FS by setting DOWNLOAD_FILES = TRUE in your config.py. Note that, depending upon your job and machine setup, you may need to increase the size of your EBS volume to account for the files being downloaded. |
54 changes: 54 additions & 0 deletions example_project/README.md
@@ -0,0 +1,54 @@
Included in this folder are all of the resources for running a complete mini-example of Distributed-CellProfiler.
It includes 3 sample image sets and a CellProfiler pipeline that identifies cells within the images and makes measurements.
It also includes the Distributed-CellProfiler files pre-configured to create a queue of all 3 jobs and spin up a spot fleet of 3 instances, each of which will process a single image set.

## Running example project

### Step 0

Before running this mini-example, you will need to set up your AWS infrastructure as described in our [online documentation](https://distributedscience.github.io/Distributed-CellProfiler/step_0_prep.html).
This includes creating the fleet file that you will use in Step 3.

Upload the example `project_folder` to the top level of your bucket and replace the default config with the example config.
While in the `Distributed-CellProfiler` folder, use the following commands, replacing `yourbucket` with your bucket name:

```bash
# Copy example files to S3
BUCKET=yourbucket
aws s3 sync example_project/project_folder s3://${BUCKET}/project_folder

# Replace the default config with the example config
cp example_project/config.py config.py
```

### Step 1
In config.py you will need to update the following fields specific to your AWS configuration:
```python
# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation
SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh
AWS_BUCKET = 'your-bucket-name'
SOURCE_BUCKET = 'your-bucket-name' # Only differs from AWS_BUCKET with advanced configuration
DESTINATION_BUCKET = 'your-bucket-name' # Only differs from AWS_BUCKET with advanced configuration
```
Then run `python3 run.py setup`

### Step 2
This command points to the job file created for this demonstration and should be run as-is.
`python3 run.py submitJob example_project/files/exampleJob.json`

### Step 3
This command should point to whatever fleet file you created in Step 0, so you may need to update the `exampleFleet.json` file name.
`python3 run.py startCluster files/exampleFleet.json`

### Step 4
This command points to the monitor file that is automatically created with your run and should be run as-is.
`python3 run.py monitor files/FlyExampleSpotFleetRequestId.json`

## Results

While the run is happening, you can watch real-time metrics in your Cloudwatch Dashboard by navigating to the [Cloudwatch console](https://console.aws.amazon.com/cloudwatch).
Note that the metrics update at intervals that may not be helpful with this fast, minimal example.

After the run is done, you should see your CellProfiler output files in S3 at s3://${BUCKET}/project_folder/output in per-image folders.
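
If you prefer to check from Python rather than the S3 console, a minimal sketch with boto3 (assuming boto3 is installed, your AWS CLI profile is configured, and `your-bucket-name` is replaced with the bucket from config.py) is:

```python
import boto3

# Placeholders: use the same profile and bucket name as in config.py.
session = boto3.Session(profile_name="default")
s3 = session.client("s3")

# List the per-image output files written by the example run.
response = s3.list_objects_v2(Bucket="your-bucket-name", Prefix="project_folder/output/")
for obj in response.get("Contents", []):
    print(obj["Key"])
```
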
54 changes: 54 additions & 0 deletions example_project/config.py
@@ -0,0 +1,54 @@
# Constants (User configurable)

APP_NAME = 'FlyExample' # Used to generate derivative names unique to the application.

# DOCKER REGISTRY INFORMATION:
DOCKERHUB_TAG = 'cellprofiler/distributed-cellprofiler:2.0.0_4.2.4'

# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation
SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh
AWS_BUCKET = 'your-bucket-name' # Bucket to use for logging (likely all three buckets the same for this example)
SOURCE_BUCKET = 'your-bucket-name' # Bucket to download files from (likely all three buckets the same for this example)
DESTINATION_BUCKET = 'your-bucket-name' # Bucket to upload files to (likely all three buckets the same for this example)

# EC2 AND ECS INFORMATION:
ECS_CLUSTER = 'default'
CLUSTER_MACHINES = 3
TASKS_PER_MACHINE = 1
MACHINE_TYPE = ['c4.xlarge']
MACHINE_PRICE = 0.10
EBS_VOL_SIZE = 22 # In GB. Minimum allowed is 22.
DOWNLOAD_FILES = 'False'

# DOCKER INSTANCE RUNNING ENVIRONMENT:
DOCKER_CORES = 1 # Number of CellProfiler processes to run inside a docker container
CPU_SHARES = DOCKER_CORES * 1024 # ECS computing units assigned to each docker container (1024 units = 1 core)
MEMORY = 7500 # Memory assigned to the docker container in MB
SECONDS_TO_START = 3*60 # Wait before the next CP process is initiated to avoid memory collisions

# SQS QUEUE INFORMATION:
SQS_QUEUE_NAME = APP_NAME + 'Queue'
SQS_MESSAGE_VISIBILITY = 10*60 # Timeout (secs) for messages in flight (average time to be processed)
SQS_DEAD_LETTER_QUEUE = 'ExampleProject_DeadMessages'

# LOG GROUP INFORMATION:
LOG_GROUP_NAME = APP_NAME

# CLOUDWATCH DASHBOARD CREATION
CREATE_DASHBOARD = 'True' # Create a dashboard in Cloudwatch for run
CLEAN_DASHBOARD = 'True' # Automatically remove dashboard at end of run with Monitor

# REDUNDANCY CHECKS
CHECK_IF_DONE_BOOL = 'False'  # True or False - should it check if there are a certain number of non-empty files and delete the job if yes?
EXPECTED_NUMBER_FILES = 7  # What is the number of files that trigger skipping a job?
MIN_FILE_SIZE_BYTES = 1  # What is the minimal number of bytes an object should be to "count"?
NECESSARY_STRING = ''  # Is there any string that should be in the file name to "count"?

# PLUGINS
USE_PLUGINS = 'False'
UPDATE_PLUGINS = 'False'
PLUGINS_COMMIT = '' # What commit or version tag do you want to check out?
INSTALL_REQUIREMENTS = 'False'
REQUIREMENTS_FILE = '' # Path within the plugins repo to a requirements file
15 changes: 15 additions & 0 deletions example_project/files/exampleJob.json
@@ -0,0 +1,15 @@
{
"_comment1": "Paths in this file are relative to the root of your S3 bucket",
"pipeline": "project_folder/workspace/ExampleFly.cppipe",
"data_file": "project_folder/workspace/load_data.csv",
"input": "project_folder/workspace/",
"output": "project_folder/output",
"output_structure": "Metadata_Position",
"_comment2": "The following groups are tasks, and each will be run in parallel",
"groups": [
{"Metadata": "Metadata_Position=2"},
{"Metadata": "Metadata_Position=76"},
{"Metadata": "Metadata_Position=218"}
]
}
