diff --git a/LICENSE b/LICENSE
index 6e000a0..1d6c29f 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,6 @@
-Distributed-CellProfiler is distributed under the following BSD-style license:
+Distributed-Something is distributed under the following BSD-style license:
-Copyright © 2020 Broad Institute, Inc. All rights reserved.
+Copyright © 2022 Broad Institute, Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
diff --git a/README.md b/README.md
index 3d7df42..246dcc1 100644
--- a/README.md
+++ b/README.md
@@ -1,68 +1,62 @@
# Distributed-Something
-Run encapsulated docker containers that do... something in the Amazon Web Services infrastructure.
-It could be [CellProfiler](https://github.com/CellProfiler/Distributed-CellProfiler) or [Fiji](https://github.com/CellProfiler/Distributed-Fiji) or really whatever you want.
+Run encapsulated docker containers that do... something in the Amazon Web Services (AWS) infrastructure.
+We are interested in scientific image analysis so we have used it for [CellProfiler](https://github.com/CellProfiler/Distributed-CellProfiler), [Fiji](https://github.com/CellProfiler/Distributed-Fiji), and [BioFormats2Raw](https://github.com/CellProfiler/Distributed-OmeZarrMaker).
+You can use it for whatever you want!
[Here's how you adapt this to whatever you need to Distribute](https://github.com/CellProfiler/Distributed-Something/wiki)
-This code is an example of how to use AWS distributed infrastructure for running anything dockerized.
-The configuration of the AWS resources is done using boto3 and the awscli. The worker is written in Python
-and is encapsulated in a docker container. There are four AWS components that are minimally
-needed to run distributed jobs:
+This code is an example of how to use AWS distributed infrastructure for running anything Dockerized.
+The configuration of the AWS resources is done using boto3 and the AWS CLI.
+The worker is written in Python and is encapsulated in a Docker container. There are four AWS components that are minimally needed to run distributed jobs:
1. An SQS queue
2. An ECS cluster
3. An S3 bucket
4. A spot fleet of EC2 instances
-All of them can be managed through the AWS Management Console. However, this code helps to get
-started quickly and run a job autonomously if all the configuration is correct. The code includes
-prepares the infrastructure to run a distributed job. When the job is completed, the code is also
-able to stop resources and clean up components. It also adds logging and alarms via CloudWatch,
-helping the user troubleshoot runs and destroy stuck machines.
+All of them can be managed individually through the AWS Management Console.
+However, this code helps to get started quickly and run a job autonomously if all the configuration is correct.
+The code prepares the infrastructure to run a distributed job.
+When the job is completed, the code is also able to stop resources and clean up components.
+It also adds logging and alarms via CloudWatch, helping the user troubleshoot runs and destroy stuck machines.
## Running the code
### Step 1
-Edit the config.py file with all the relevant information for your job. Then, start creating
-the basic AWS resources by running the following script:
+Edit the `config.py` file with all the relevant information for your job. Then, start creating the basic AWS resources by running the following script:
$ python run.py setup
-This script intializes the resources in AWS. Notice that the docker registry is built separately,
-and you can modify the worker code to build your own. Anytime you modify the worker code, you need
-to update the docker registry using the Makefile script inside the worker directory.
+This script initializes the resources in AWS.
+Notice that the docker registry is built separately and you can modify the worker code to build your own.
+Any time you modify the worker code, you need to update the docker registry using the Makefile script inside the worker directory.
### Step 2
-After the first script runs successfully, the job can now be submitted to with the
-following command:
+After the first script runs successfully, the job can now be submitted with the following command:
$ python run.py submitJob files/exampleJob.json
-Running the script uploads the tasks that are configured in the json file. You have to
-customize the exampleJob.json file with information that make sense for your project.
+Running the script uploads the tasks that are configured in the JSON file.
+You have to customize the `exampleJob.json` file with information that makes sense for your project.
You'll want to figure out which information is generic and which is the information that makes each job unique.
### Step 3
-After submitting the job to the queue, we can add computing power to process all tasks in AWS. This
-code starts a fleet of spot EC2 instances which will run the worker code. The worker code is encapsulated
-in docker containers, and the code uses ECS services to inject them in EC2. All this is automated
-with the following command:
+After submitting the job to the queue, we can add computing power to process all tasks in AWS.
+This code starts a fleet of spot EC2 instances which will run the worker code.
+The worker code is encapsulated in Docker containers, and the code uses ECS services to inject them in EC2.
+All this is automated with the following command:
$ python run.py startCluster files/exampleFleet.json
-After the cluster is ready, the code informs you that everything is setup, and saves the spot fleet identifier
-in a file for further reference.
+After the cluster is ready, the code informs you that everything is set up, and saves the spot fleet identifier in a file for further reference.
### Step 4
When the cluster is up and running, you can monitor progress using the following command:
$ python run.py monitor files/APP_NAMESpotFleetRequestId.json
-The file APP_NAMESpotFleetRequestId.json is created after the cluster is setup in step 3. It is
-
-important to keep this monitor running if you want to automatically shutdown computing resources
-when there are no more tasks in the queue (recommended).
+The file `APP_NAMESpotFleetRequestId.json` is created after the cluster is set up in step 3.
+It is important to keep this monitor running if you want to automatically shut down computing resources when there are no more tasks in the queue (recommended).
See the wiki for more information about each step of the process.
-
+
diff --git a/config.py b/config.py
index 9d3f99b..7d49abc 100644
--- a/config.py
+++ b/config.py
@@ -1,6 +1,6 @@
# Constants (User configurable)
-APP_NAME = 'DistributedSomething' # Used to generate derivative names unique to the application.
+APP_NAME = 'DistributedSomething' # Used to generate derivative names unique to the application.
# DOCKER REGISTRY INFORMATION:
DOCKERHUB_TAG = 'user/distributed-something:sometag'
@@ -20,10 +20,10 @@
EBS_VOL_SIZE = 30 # In GB. Minimum allowed is 22.
# DOCKER INSTANCE RUNNING ENVIRONMENT:
-DOCKER_CORES = 4 # Number of CellProfiler processes to run inside a docker container
+DOCKER_CORES = 4 # Number of software processes to run inside a docker container
CPU_SHARES = DOCKER_CORES * 1024 # ECS computing units assigned to each docker container (1024 units = 1 core)
-MEMORY = 15000 # Memory assigned to the docker container in MB
-SECONDS_TO_START = 3*60 # Wait before the next CP process is initiated to avoid memory collisions
+MEMORY = 15000 # Memory assigned to the docker container in MB
+SECONDS_TO_START = 3*60 # Wait before the next process is initiated to avoid memory collisions
# SQS QUEUE INFORMATION:
SQS_QUEUE_NAME = APP_NAME + 'Queue'
@@ -31,12 +31,12 @@
SQS_DEAD_LETTER_QUEUE = 'arn:aws:sqs:some-region:111111100000:DeadMessages'
# LOG GROUP INFORMATION:
-LOG_GROUP_NAME = APP_NAME
+LOG_GROUP_NAME = APP_NAME
# REDUNDANCY CHECKS
-CHECK_IF_DONE_BOOL = 'False' #True or False- should it check if there are a certain number of non-empty files and delete the job if yes?
-EXPECTED_NUMBER_FILES = 7 #What is the number of files that trigger skipping a job?
-MIN_FILE_SIZE_BYTES = 1 #What is the minimal number of bytes an object should be to "count"?
-NECESSARY_STRING = '' #Is there any string that should be in the file name to "count"?
+CHECK_IF_DONE_BOOL = 'False' # True or False - should it check if there are a certain number of non-empty files and delete the job if yes?
+EXPECTED_NUMBER_FILES = 7 # What is the number of files that trigger skipping a job?
+MIN_FILE_SIZE_BYTES = 1 # What is the minimal number of bytes an object should be to "count"?
+NECESSARY_STRING = '' # Is there any string that should be in the file name to "count"?
# PUT ANYTHING SPECIFIC TO YOUR PROGRAM DOWN HERE
diff --git a/documentation/.github/workflows/deploy.yml b/documentation/.github/workflows/deploy.yml
new file mode 100644
index 0000000..e23d56f
--- /dev/null
+++ b/documentation/.github/workflows/deploy.yml
@@ -0,0 +1,39 @@
+name: deploy-documentation
+
+# Only run this when the master branch changes
+on:
+ push:
+ branches:
+ - main
+ # Only run if edits in DS-documentation
+ paths:
+ - documentation/DS-documentation/**
+
+# This job installs dependencies, builds the book, and pushes it to `gh-pages`
+jobs:
+ deploy-book:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v2
+
+ # Install dependencies
+ - name: Set up Python 3.8
+ uses: actions/setup-python@v2
+ with:
+ python-version: 3.8
+
+ - name: Install dependencies
+ run: |
+ pip install jupyter-book
+
+ # Build the book
+ - name: Build the book
+ run: |
+ jupyter-book build DS-documentation/
+
+ # Push the book's HTML to github-pages
+ - name: GitHub Pages action
+ uses: peaceiris/actions-gh-pages@v3.6.1
+ with:
+ github_token: ${{ secrets.GITHUB_TOKEN }}
+ publish_dir: ./DS-documentation/_build/html
diff --git a/documentation/DS-documentation/SQS_QUEUE_information.md b/documentation/DS-documentation/SQS_QUEUE_information.md
new file mode 100644
index 0000000..74e2a50
--- /dev/null
+++ b/documentation/DS-documentation/SQS_QUEUE_information.md
@@ -0,0 +1,41 @@
+# SQS QUEUE Information
+
+This is in-depth information about the configurable components in SQS QUEUE INFORMATION, a section in [Step 1: Configuration](step_1_configuration.md) of running Distributed-Something.
+
+## SQS_QUEUE_NAME
+
+**SQS_QUEUE_NAME** is the name of the queue where all of your jobs are sent. A queue is exactly what it sounds like - a list of things waiting their turn. Each job represents one complete run through a CellProfiler pipeline, though a job may involve any number of images: e.g. an analysis may require thousands of jobs, each with a single image making one complete CellProfiler run, while making an illumination correction may be a single job that iterates through thousands of images to produce a single output file. You want a name that is descriptive enough to distinguish it from other queues. We usually name our queues based on the project and the step or pipeline goal. An example may be something like Hepatocyte_Differentiation_Illum or Lipid_Droplet_Analysis.
+
+## SQS_DEAD_LETTER_QUEUE
+
+**SQS_DEAD_LETTER_QUEUE** is the name of the queue where all the jobs that failed to run are sent. If everything goes perfectly, this will always remain empty. If jobs that are in the queue fail multiple times (our default is 10), they are moved to the dead-letter queue, which is not used to initiate jobs. The dead-letter queue therefore functions effectively as a log so you can see if any of your jobs failed. It is different from your other queue as machines do not try to pull jobs from it. Pro tip: each member of our team maintains their own dead-letter queue so we don’t have to worry about finding messages if multiple people are running jobs at the same time. We use names like DeadMessages_Erin.
+
+If all of your jobs end up in your dead-letter queue there are many different places you could have a problem. Hopefully, you’ll keep an eye on the logs in your CloudWatch (the part of AWS used for monitoring what all your other AWS services are doing) after starting a run and catch the issue before all of your jobs fail multiple times.
+
+If a single job ends up in your dead-letter queue while the rest of your jobs complete successfully, it is likely that an image is corrupted (a corrupted image is one that has failed to save properly or has been damaged so that it will not open). This is true whether your pipeline processes a single image at a time (such as in analysis runs where you’re interested in cellular measurements on a per-image basis) or whether your pipeline processes many images at a time (such as when making an illumination correction image on a per-plate basis). This is the major reason why we have the dead-letter queue: you certainly don’t want to pay for your cluster to indefinitely attempt to process a corrupted image. Keeping an eye on your CloudWatch logs wouldn’t necessarily help you catch this kind of error because you could have tens or hundreds of successful jobs run before an instance pulls the job for the corrupted image, or the corrupted image could be thousands of images into an illumination correction run, etc.
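+
+As a concrete sketch (this is not code that ships with Distributed-Something, and the queue names are hypothetical), a main queue can be pointed at an existing dead-letter queue with boto3 so that jobs are moved there after 10 failed receives, matching the default described above:
+
+```python
+import json
+
+import boto3
+
+sqs = boto3.client("sqs")
+
+# Assume the dead-letter queue already exists (see Step 0: Prep)
+dlq_url = sqs.get_queue_url(QueueName="DeadMessages_Erin")["QueueUrl"]
+dlq_arn = sqs.get_queue_attributes(
+    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
+)["Attributes"]["QueueArn"]
+
+# Create the main job queue; after 10 failed receives a message moves to the dead-letter queue
+sqs.create_queue(
+    QueueName="Lipid_Droplet_Analysis",
+    Attributes={
+        "RedrivePolicy": json.dumps(
+            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "10"}
+        )
+    },
+)
+```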
+
+## SQS_MESSAGE_VISIBILITY
+
+**SQS_MESSAGE_VISIBILITY** controls how long jobs are hidden after being pulled by a machine to run. Jobs must be visible (i.e. not hidden) in order to be pulled by a Docker and therefore run. In other words, the time you enter in SQS_MESSAGE_VISIBILITY is how long a job is allowed a chance to complete before it is unhidden and made available to be started by a different copy of CellProfiler. It’s quite important to set this time correctly - we typically recommend estimating 1.5X how long the job usually takes to run (or your best guess of that if you’re not sure). To understand why, and the consequences of setting an incorrect time, let’s look more carefully at the SQS queue.
+
+The SQS queue has two categories - “Messages Available” and “Messages In Flight”. Each message is a job and regardless of the category it’s in, the jobs all remain in the same queue. In effect, “In Flight” means currently hiding and “Available” means not currently hiding.
+
+When you submit your Config file to AWS it creates your queue in SQS but that queue starts out empty. When you submit your Jobs file to AWS it puts all of your jobs into the queue under “Messages Available”. When you submit your Fleet file to AWS it 1) creates machines in EC2, 2) ECS puts Docker containers on those instances, and 3) those instances look in “Messages Available” in SQS for jobs to run.
+
+Once a Docker has pulled a job, that job moves from “Available” to “In Flight”. It remains hidden (“In Flight”) for the duration of time set in SQS_MESSAGE_VISIBILITY and then it becomes visible again (“Available”). Jobs are hidden so that multiple machines don’t process the same job at the same time. If the job completes successfully, the Docker tells the queue to delete that message.
+
+If the job completes but it is not successful (e.g. CellProfiler errors), the Docker tells the queue to move the job from “In Flight” to “Available” so another Docker (with a different copy of CellProfiler) can attempt to complete the job.
+
+If the SQS_MESSAGE_VISIBILITY is too short then a job will become unhidden even though it is still currently being (hopefully successfully) run by the Docker that originally picked it up. This means that another Docker may come along and start the same job and you end up paying for unnecessary compute time because both Dockers will continue to run the job until they each finish.
+
+If the SQS_MESSAGE_VISIBILITY is too long then you can end up wasting time and money waiting for the job to become available again after a crash even when the rest of your analysis is done. If anything causes a job to stop mid-run (e.g. CellProfiler crashes, the instance crashes, or the instance is removed by AWS because you are outbid), that job stays hidden until the set time. If a Docker instance goes to the queue and doesn’t find any visible jobs, then it does not try to run any more jobs in that copy of CellProfiler, limiting the effective computing power of that Docker. Therefore, some or all of your instances may hang around doing nothing (but costing money) until the job is visible again. When in doubt, it is better to have your SQS_MESSAGE_VISIBILITY set too long than too short because, while crashes can happen, it is rare that AWS takes small machines from your fleet, though we do notice it happening with larger machines.
+
+There is not an easy way to see if you have selected the appropriate amount of time for your SQS_MESSAGE_VISIBILITY on your first run through a brand new pipeline. To confirm that multiple Dockers didn’t run the same job, after the jobs are complete, you need to manually go through each log in CloudWatch and figure out how many times you got the word “SUCCESS” in each log. (This may be reasonable to do on an illumination correction run where you have a single job per plate, but it’s not so reasonable if running an analysis pipeline on thousands of individual images). To confirm that multiple Dockers are never processing the same job, you can keep an eye on your queue and make sure that you never have more jobs “In Flight” than the number of copies of CellProfiler that you have running; likewise, if your timeout time is too short, it may seem like too few jobs are “In Flight” even though the CPU usage on all your machines is high.
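+
+If you would rather script that check than read each log by hand, a short boto3 sketch along these lines can tally successes per log stream (the log group name and the "SUCCESS" filter string are assumptions - use whatever your LOG_GROUP_NAME is and whatever your worker actually prints on completion):
+
+```python
+from collections import Counter
+
+import boto3
+
+logs = boto3.client("logs")
+
+# Count "SUCCESS" lines per log stream in the job log group
+success_counts = Counter()
+paginator = logs.get_paginator("filter_log_events")
+for page in paginator.paginate(
+    logGroupName="DistributedSomething",  # your LOG_GROUP_NAME
+    filterPattern="SUCCESS",
+):
+    for event in page["events"]:
+        success_counts[event["logStreamName"]] += 1
+
+# Any stream with a count above 1 was likely run by more than one Docker
+for stream, count in sorted(success_counts.items()):
+    print(stream, count)
+```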
+
+Once you have run a pipeline, you can check the execution time (either by noticing how long after you started your jobs the first jobs begin to finish, or by checking the logs of individual jobs and noting the start and end time); you will then have an accurate idea of roughly how long that pipeline needs to execute, and can set your message visibility accordingly. You can even do this on the fly while jobs are currently processing; the updated visibility time won’t affect the jobs already out for processing (i.e. if the time was set to 3 hours and you change it to 1 hour, the jobs already processing will remain hidden for 3 hours or until finished), but any job that begins processing AFTER the change will use the new visibility timeout setting.
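+
+As a sketch of that on-the-fly change (the queue name follows the default APP_NAME + 'Queue' pattern, and the 1-hour value is just an example), the queue-level default can be updated with boto3:
+
+```python
+import boto3
+
+sqs = boto3.client("sqs")
+queue_url = sqs.get_queue_url(QueueName="DistributedSomethingQueue")["QueueUrl"]
+
+# Change the default visibility timeout for the whole queue to 1 hour;
+# jobs already "In Flight" keep the timeout they were checked out with.
+sqs.set_queue_attributes(
+    QueueUrl=queue_url,
+    Attributes={"VisibilityTimeout": "3600"},
+)
+```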
+
+## Example SQS Queue
+
+![Sample_SQS_Queue](images/Sample_SQS_Queue.png)
+
+This is an example of an SQS Queue. You can see that there is one active queue with 64 jobs in it. In this example, we are running a fleet of 32 instances, each with a single Docker, so at this moment (right after starting the fleet), there are 32 jobs "In Flight" and 32 jobs that are still "Available." You can also see that many lab members have their own dead-letter queues which are, fortunately, all currently empty.
diff --git a/documentation/DS-documentation/_config.yml b/documentation/DS-documentation/_config.yml
new file mode 100644
index 0000000..36638eb
--- /dev/null
+++ b/documentation/DS-documentation/_config.yml
@@ -0,0 +1,36 @@
+# Book settings
+# For your DS implementation, you will need to update author, repository:url:, and html:baseurl:
+
+# Learn more at https://jupyterbook.org/customize/config.html
+title: Documentation
+author: Broad Institute
+copyright: "2022"
+#logo: img/logo.svg
+
+# Only build files that are in the ToC
+only_build_toc_files: true
+
+# Force re-execution of notebooks on each build.
+# See https://jupyterbook.org/content/execute.html
+execute:
+ execute_notebooks: force
+
+# Information about where the book exists on the web
+repository:
+ url: https://github.com/distributed-something/documentation
+ branch: main # Which branch of the repository should be used when creating links (optional)
+ path_to_book: DS-documentation
+
+html:
+ #favicon: "img/favicon.ico"
+ baseurl: distributed-something.github.io
+ use_repository_button: true
+ use_issues_button: true
+ use_edit_page_button: true
+ comments:
+ hypothesis: true
+
+parse:
+ myst_enable_extensions:
+ # Only required if you use html
+ - html_image
diff --git a/documentation/DS-documentation/_toc.yml b/documentation/DS-documentation/_toc.yml
new file mode 100644
index 0000000..5035b2f
--- /dev/null
+++ b/documentation/DS-documentation/_toc.yml
@@ -0,0 +1,24 @@
+# Table of contents
+# Learn more at https://jupyterbook.org/customize/toc.html
+
+format: jb-book
+root: overview
+parts:
+- caption: Creating DS
+ chapters:
+ - file: customizing_DS
+ - file: implementing_DS
+ - file: troubleshooting_implementation
+- caption: Running DS
+ chapters:
+ - file: step_0_prep
+ - file: step_1_configuration
+ sections:
+ - file: SQS_QUEUE_information
+ - file: step_2_submit_jobs
+ - file: step_3_start_cluster
+ - file: step_4_monitor
+- caption:
+ chapters:
+ - file: troubleshooting_runs
+ - file: versions
diff --git a/documentation/DS-documentation/customizing_DS.md b/documentation/DS-documentation/customizing_DS.md
new file mode 100644
index 0000000..803309b
--- /dev/null
+++ b/documentation/DS-documentation/customizing_DS.md
@@ -0,0 +1,68 @@
+# Customizing DS
+
+Distributed-Something is a template.
+It is not fully functional software but is intended to serve as an editable source so that you can quickly and easily implement a distributed workflow for your own Dockerized software.
+
+Examples of implementations can be found at [Distributed-CellProfiler](http://github.com/cellprofiler/distributed-cellprofiler), [Distributed-Fiji](http://github.com/cellprofiler/distributed-fiji), and [Distributed-OmeZarrMaker](http://github.com/cellprofiler/distributed-omezarrmaker).
+
+There are many points at which you will need to customize Distributed-Something for your own implementation; these customization points are summarized below.
+Files that do not require any customization are not listed.
+
+## files/
+
+### exampleFleet.json
+
+exampleFleet.json does not need to be changed depending on your implementation of Distributed-Something.
+However, each AWS account running your implementation will need to update the Fleet file with configuration specific to their account as detailed in [Step 3: Start Cluster](step_3_start_cluster.md).
+
+### exampleJob.json
+
+exampleJob.json needs to be entirely customized for your implementation of Distributed-Something.
+When you submit your jobs in [Step 2: Submit Jobs](step_2_submit_jobs.md), Distributed-Something adds a job to your SQS queue for each item in `groups`.
+Each job contains the shared variables common to all jobs, listed in the exampleJob.json above the `groups` key.
+These variables are passed to your worker as the `message` and should include any metadata that may possibly change between runs of your Distributed-Something implementation.
+
+Some common variables used in Job files include:
+- input location
+- output location
+- output structure
+- script/pipeline name
+- flags to pass to your program
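+
+As an illustration only (every key here besides `groups` is a hypothetical example, not something Distributed-Something requires; `Metadata` follows the pattern mentioned in [Step 2: Submit Jobs](step_2_submit_jobs.md)), a job file for an implementation that runs a script per plate might be generated like this:
+
+```python
+import json
+
+# Shared variables go at the top level; each entry in "groups" becomes one SQS job.
+job = {
+    "input_location": "s3://my-bucket/project/inputs",    # hypothetical key
+    "output_location": "s3://my-bucket/project/outputs",  # hypothetical key
+    "script_name": "process_plate.py",                    # hypothetical key
+    "groups": [
+        {"Metadata": "Plate=Plate1"},
+        {"Metadata": "Plate=Plate2"},
+        {"Metadata": "Plate=Plate3"},
+    ],
+}
+
+with open("files/exampleJob.json", "w") as f:
+    json.dump(job, f, indent=4)
+```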
+
+## worker/
+
+### Dockerfile
+
+The Dockerfile is used to create the Distributed-Something Docker.
+You will need to edit the `FROM` to point to your own Docker image.
+
+No further edits to the Dockerfile should be necessary, though advanced users may make additional customizations based on the Docker image they are `FROM`ing.
+Additionally, you may remove the section `# Install S3FS` if your workflow doesn't require mounting an S3 bucket.
+You will still be able to upload and download from an S3 bucket using AWS CLI without mounting it with S3FS.
+
+### generic-worker.py
+
+The majority of code customization for your implementation of Distributed-Something happens in the worker file.
+The `generic-worker.py` code is thoroughly documented with customization details.
+
+### Makefile
+
+Update `user` and `project` to match you and your Distributed-Something implementation, respectively.
+
+### run-worker.sh
+
+You do not need to make any modifications to run-worker.sh.
+You might want to remove `2. MOUNT S3` if your workflow doesn't require mounting an S3 bucket.
+
+## other files
+
+### config.py
+
+`DOCKERHUB_TAG` needs to match the `user` and `project` set in `Makefile`.
+We recommend adjusting `EC2 AND ECS INFORMATION` and `DOCKER INSTANCE RUNNING ENVIRONMENT` variables to reasonable defaults for your Distributed-Something implementation.
+Suggestions for determining optimal parameters can be found in [Implementing Distributed-Something](implementing_DS.md).
+
+If there are any variables you would like to pass to your program as part of configuration, you can add them at the bottom and they will be passed as system variables to the Docker.
+Note that any additional variables added to `config.py` need to also be added to `CONSTANT PATHS IN THE CONTAINER` in `generic-worker.py`.
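+
+For example (the variable name below is made up for illustration), values added at the bottom of `config.py` are passed to the Docker as environment variables, so a rough pattern for wiring one through looks like this:
+
+```python
+# In config.py (MY_CUSTOM_FLAG is a hypothetical example variable)
+MY_CUSTOM_FLAG = '--verbose'
+
+# In generic-worker.py, alongside the other CONSTANT PATHS IN THE CONTAINER entries,
+# the value is expected to arrive as an environment variable of the same name
+import os
+
+MY_CUSTOM_FLAG = os.environ['MY_CUSTOM_FLAG']
+```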
+
+`AWS GENERAL SETTINGS` are specific to your account.
+All other sections contain variables specific to each batch/run of your Distributed-Something implementation and will need to be adjusted for each run.
+More configuration information is available in [Step 1: Configuration](step_1_configuration.md).
diff --git a/documentation/DS-documentation/images/AMIID.jpg b/documentation/DS-documentation/images/AMIID.jpg
new file mode 100644
index 0000000..b7edb8a
Binary files /dev/null and b/documentation/DS-documentation/images/AMIID.jpg differ
diff --git a/documentation/DS-documentation/images/ECS.jpg b/documentation/DS-documentation/images/ECS.jpg
new file mode 100644
index 0000000..abd9056
Binary files /dev/null and b/documentation/DS-documentation/images/ECS.jpg differ
diff --git a/documentation/DS-documentation/images/InstanceID.jpg b/documentation/DS-documentation/images/InstanceID.jpg
new file mode 100644
index 0000000..9258205
Binary files /dev/null and b/documentation/DS-documentation/images/InstanceID.jpg differ
diff --git a/documentation/DS-documentation/images/Launch.jpg b/documentation/DS-documentation/images/Launch.jpg
new file mode 100644
index 0000000..8686685
Binary files /dev/null and b/documentation/DS-documentation/images/Launch.jpg differ
diff --git a/documentation/DS-documentation/images/Network.jpg b/documentation/DS-documentation/images/Network.jpg
new file mode 100644
index 0000000..74ef5fc
Binary files /dev/null and b/documentation/DS-documentation/images/Network.jpg differ
diff --git a/documentation/DS-documentation/images/Snapshot.jpg b/documentation/DS-documentation/images/Snapshot.jpg
new file mode 100644
index 0000000..73a853e
Binary files /dev/null and b/documentation/DS-documentation/images/Snapshot.jpg differ
diff --git a/documentation/DS-documentation/images/sample_DCP_config_1.png b/documentation/DS-documentation/images/sample_DCP_config_1.png
new file mode 100644
index 0000000..4cfc7e1
Binary files /dev/null and b/documentation/DS-documentation/images/sample_DCP_config_1.png differ
diff --git a/documentation/DS-documentation/implementing_DS.md b/documentation/DS-documentation/implementing_DS.md
new file mode 100644
index 0000000..64f0e4b
--- /dev/null
+++ b/documentation/DS-documentation/implementing_DS.md
@@ -0,0 +1,54 @@
+# Implementing DS
+
+## Make your software Docker
+
+Your software will need to be containerized in its own Docker image.
+This Docker image is what you will `FROM` in `Dockerfile` when you create your Distributed-Something Docker image.
+Detailed instructions are out of the scope of this documentation, though we refer you to [Docker's documentation](https://docs.docker.com/get-started/).
+For examples that we use in our Distributed-Something suite, you can refer to the code used to make the [CellProfiler Docker](https://github.com/CellProfiler/distribution/tree/master/docker), [BioFormats2Raw Docker](https://github.com/ome/bioformats2raw-docker), and [Fiji Docker](https://github.com/fiji/dockerfiles).
+
+## Make your Distributed-Something Docker
+
+Once you have made all the alterations to the Distributed-Something code detailed in [Customizing Distributed-Something](customizing_DS.md), you need to make your Distributed-Something Docker image.
+
+You will need a [DockerHub account](https://hub.docker.com).
+Connect to the Docker daemon.
+We find the simplest way to do this is to download and open Docker Desktop.
+You can leave Docker Desktop open in the background while you continue to work at the command line.
+
+
+```
+# Navigate into the Distributed-Something/worker folder
+cd worker
+# Run the make command
+make
+```
+
+While it is generally a good principle to iterate numerical tags, note that you can set the tag to `latest` in both `Makefile` and in `config.py` to simplify troubleshooting (as you don't have to remember to change the tag in either location while potentially testing multiple Docker builds).
+
+## Test requirements for config.py
+
+Once you have created a functional Docker image of your software, it is useful to know exact memory requirements so you can request appropriately sized machines in `config.py`.
+We recommend using CloudWatch Agent on an AWS instance.
+(Standard CloudWatch metrics do not report granular memory usage.)
+
+To test necessary parameters:
+- Create an EC2 instance using an AMI with S3FS already installed.
+- Add an IAM role. This machine must have an instance role with sufficient permissions attached in order to transmit metrics.
+- Connect to your EC2 instance.
+- Install CloudWatch Agent following [AWS documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/download-cloudwatch-agent-commandline.html).
+
+```
+# Start CloudWatch Agent
+sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
+# Install Docker
+sudo apt install docker.io
+# Run
+sudo docker run -it --rm --entrypoint /bin/sh -v ~/bucket:/bucket DOCKER
+# Stop CloudWatch Agent
+sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a stop
+```
+
+View your collected memory metrics in CloudWatch:
+- CloudWatch => Metrics => All Metrics
+- Custom Namespaces => CWAgent => ImageId, InstanceId, InstanceType => YOUR_INSTANCE mem_used_percent
+- Graphed metrics => PERIOD = 1 Minute (or whatever you have set in your CloudWatch config)
diff --git a/documentation/DS-documentation/overview.md b/documentation/DS-documentation/overview.md
new file mode 100644
index 0000000..dbbfd3e
--- /dev/null
+++ b/documentation/DS-documentation/overview.md
@@ -0,0 +1,92 @@
+# What is Distributed-Something?
+
+Distributed-Something is a series of scripts designed to help you run a Dockerized version of your software on [Amazon Web Services](https://aws.amazon.com/) (AWS) using AWS's file storage and computing systems.
+* Data is stored in S3 buckets.
+* Software is run on "Spot Fleets" of computers (or instances) in the cloud.
+
+You will need to customize Distributed-Something for your particular use case.
+See [Customizing Distributed-Something](customizing_DS.md) for customization details.
+
+## Why would I want to use this?
+
+Using AWS allows you to create a flexible, on-demand computing infrastructure where you only have to pay for the resources you use.
+This can give you access to far more computing power than you may have available at your home institution, which is great when you have large datasets to process.
+
+Each piece of the infrastructure has to be added and configured separately, which can be time-consuming and confusing.
+
+Distributed-Something tries to leverage the power of the former, while minimizing the problems of the latter.
+
+## What do I need to have to run this?
+
+Essentially all you need to run Distributed-Something is an AWS account and a terminal program; see our [page on getting set up](step_0_prep.md) for all the specific steps you'll need to take.
+You will also need a Dockerized version of your software.
+
+## What happens in AWS when I run Distributed-Something?
+
+The steps for actually running the Distributed-Something code are outlined in the repository [README](https://github.com/DistributedScience/Distributed-Something/blob/master/README.md), and details of the parameters you set in each step are on their respective Documentation pages ([Step 1: Config](step_1_configuration.md), [Step 2: Jobs](step_2_submit_jobs.md), [Step 3: Fleet](step_3_start_cluster.md), and optional [Step 4: Monitor](step_4_monitor.md)).
+We'll give an overview of what happens in AWS at each step here and explain what AWS does automatically once you have it set up.
+
+**Step 1**:
+In the Config file you set quite a number of specifics that are used by EC2, ECS, SQS, and in making Dockers.
+When you run `$ python3 run.py setup` to execute the Config, it does three major things:
+* Creates task definitions.
+These are found in ECS.
+They define the configuration of the Dockers and include the settings you gave for **CHECK_IF_DONE_BOOL**, **DOCKER_CORES**, **EXPECTED_NUMBER_FILES**, and **MEMORY**.
+* Makes a queue in SQS (it is empty at this point) and sets a dead-letter queue.
+* Makes a service in ECS which defines how many Dockers you want.
+
+**Step 2**:
+In the Job file you set the location of any inputs (e.g. data and batch-specific scripts) and outputs.
+Additionally, you list all of the individual tasks that you want run.
+When you submit the Job file it adds that list of tasks to the queue in SQS (which you made in the previous step).
+Submit jobs with `$ python3 run.py submitJob`.
+
+**Step 3**:
+In the Config file you set the number and size of the EC2 instances you want.
+This information, along with account-specific configuration in the Fleet file, is used to start the fleet with `$ python3 run.py startCluster`.
+
+**After these steps are complete, a number of things happen automatically**:
+* ECS puts Docker containers onto EC2 instances.
+If there is a mismatch within your Config file and the Docker is larger than the instance it will not be placed.
+ECS will keep placing Dockers onto an instance until it is full, so if you accidentally create instances that are too large you may end up with more Dockers placed on it than intended.
+This is also why you may want multiple **ECS_CLUSTER**s so that ECS doesn't blindly place Dockers you intended for one job onto an instance you intended for another job.
+* When a Docker container gets placed it gives the instance it's on its own name.
+* Once an instance has a name, the Docker gives it an alarm that tells it to reboot if it is sitting idle for 15 minutes.
+* The Docker hooks the instance up to the _perinstance logs in CloudWatch.
+* The instances look in SQS for a job.
+Any time they don't have a job they go back to SQS.
+If SQS tells them there are no visible jobs then they shut themselves down.
+* When an instance finishes a job it sends a message to SQS and removes that job from the queue.
+
+## What does this look like?
+
+![Example Distributed-CellProfiler instance configuration](images/sample_DCP_config_1.png)
+
+This is an example of one possible instance configuration, using [Distributed-CellProfiler](http://github.com/cellprofiler/distributed-cellprofiler) for illustration.
+This is one m4.16xlarge EC2 instance (64 CPUs, 250 GB of RAM) with a 165 GB EBS volume mounted on it.
+A spot fleet could contain many such instances.
+It has 16 tasks (individual Docker containers).
+Each Docker container uses 10 GB of hard disk space and is assigned 4 CPUs and 15 GB of RAM (which it does not share with other Docker containers).
+Each container shares its individual resources among 4 copies of CellProfiler.
+Each copy of CellProfiler runs a pipeline on one "job", which can be anything from a single image to an entire 384-well plate or timelapse movie.
+You can optionally stagger the start time of these 4 copies of CellProfiler, ensuring that the most memory- or disk-intensive steps aren't happening simultaneously, decreasing the likelihood of a crash.
+
+Read more about this and other configurations in [Step 1: Configuration](step_1_configuration.md).
+
+## How do I determine my configuration?
+
+To some degree, you determine the best configuration for your needs through trial and error.
+* Looking at the resources your software uses on your local computer when it runs your jobs can give you a sense of roughly how much hard drive and memory space each job requires, which can help you determine your group size and what machines to use.
+* Prices of different machine sizes fluctuate, so the choice of which type of machines to use in your spot fleet is best determined at the time you run it.
+How long a job takes to run and how quickly you need the data may also affect how much you're willing to bid for any given machine.
+* Running a few large Docker containers (as opposed to many small ones) increases the amount of memory all the copies of your software are sharing, decreasing the likelihood you'll run out of memory if you stagger your job start times.
+However, you're also at a greater risk of running out of hard disk space.
+
+Keep an eye on all of the logs the first few times you run any workflow and you'll get a sense of whether your resources are being utilized well or if you need to do more tweaking.
+
+## Can I contribute code to Distributed-Something?
+
+Feel free! We're always looking for ways to improve.
+
+## Who made this?
+
+Distributed-Something is a project from the [Cimini Lab](https://cimini-lab.broadinstitute.org) in the Imaging Platform at the Broad Institute in Cambridge, MA, USA.
+It was initially conceived and implemented for a single use case as [Distributed-CellProfiler](https://github.com/CellProfiler/Distributed-CellProfiler) in what is now the [Carpenter-Singh Lab](https://carpenter-singh-lab.broadinstitute.org).
diff --git a/documentation/DS-documentation/step_0_prep.md b/documentation/DS-documentation/step_0_prep.md
new file mode 100644
index 0000000..4b8fdba
--- /dev/null
+++ b/documentation/DS-documentation/step_0_prep.md
@@ -0,0 +1,97 @@
+# Step 0: Prep
+
+Distributed-Something runs many parallel jobs in EC2 instances that are automatically managed by ECS.
+To get jobs started, a control node to submit jobs and monitor progress is needed.
+This section describes what you need in AWS and in the control node to get started.
+
+## 1. AWS Configuration
+
+The AWS resources involved in running Distributed-Something can be primarily configured using the [AWS Web Console](https://aws.amazon.com/console/).
+The architecture of Distributed-Something is based on the [worker pattern](https://aws.amazon.com/blogs/compute/better-together-amazon-ecs-and-aws-lambda/) for distributed systems.
+We have adapted and simplified that architecture for Distributed-Something.
+
+You need an active account configured to proceed. Log in to your AWS account, and make sure the following list of resources is created:
+
+### 1.1 Access keys
+* Get [security credentials](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) for your account.
+Store your credentials in a safe place that you can access later.
+* You will probably need an SSH key to log in to your EC2 instances (control or worker nodes).
+[Generate an SSH key](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) and store it in a safe place for later use.
+If you'd rather, you can generate a new key pair to use for this during creation of the control node; make sure to `chmod 600` the private key when you download it.
+
+### 1.2 Roles and permissions
+* You can use your default VPC, subnet, and security groups; you should add an inbound SSH connection from your IP address to your security group.
+* [Create an ecsInstanceRole](http://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html) with appropriate permissions (an S3 bucket access policy plus the CloudWatchFullAccess, CloudWatchActionEC2Access, and AmazonEC2ContainerServiceforEC2Role policies, with ec2.amazonaws.com as a Trusted Entity).
+* [Create an aws-ec2-spot-fleet-tagging-role](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet-requests.html) with appropriate permissions (just needs AmazonEC2SpotFleetTaggingRole); ensure that in the "Trust Relationships" tab it says "spotfleet.amazonaws.com" rather than "ec2.amazonaws.com" (edit this if necessary).
+In the current interface, it's easiest to click "Create role", select "EC2" from the main service list, then select "EC2 - Spot Fleet Tagging".
+
+### 1.3 Auxiliary Resources
+* [Create an S3 bucket](http://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html) and upload your data to it.
+* Add permissions to your bucket so that [logs can be exported to it](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3ExportTasksConsole.html) (Step 3, first code block).
+* [Create an SQS](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/CreatingQueue.html) queue for unprocessable messages to be dumped into (aka a DeadLetterQueue).
+
+### 1.4 Primary Resources
+The following five resources are ones you need to interact with constantly while working with Distributed-Something.
+Although at this point you don't need to create anything special there, you can open each console in a separate tab in your browser to keep them handy and monitor DS's behavior.
+* [S3 Console](https://console.aws.amazon.com/s3)
+* [EC2 Console](https://console.aws.amazon.com/ec2/)
+* [ECS Console](https://console.aws.amazon.com/ecs/)
+* [SQS Console](https://console.aws.amazon.com/sqs/)
+* [CloudWatch Console](https://console.aws.amazon.com/cloudwatch/)
+
+### 1.5 Spot Limits
+AWS initially [limits the number of spot instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-limits.html) you can use at one time; you can request more through a process in the linked documentation.
+Depending on your workflow (your scale and how you group your jobs), this may not be necessary.
+
+## 2. The Control Node
+The control node can be your local machine if it is configured properly, or it can also be a small instance in AWS.
+We prefer to have a small EC2 instance dedicated to controlling our Distributed-Something workflows for simplicity of access and configuration.
+To log in to an EC2 machine you need an SSH key that can be generated in the web console.
+Each time you launch an EC2 instance you have to confirm having this key (which is a .pem file).
+This machine is needed only for submitting jobs, and does not have any special computational requirements, so you can use a micro instance to run basic scripts to proceed.
+
+The control node needs the following tools to successfully run Distributed-Something.
+Here we assume you are using the command line in a Linux machine, but you are free to try other operating systems too.
+
+### 2.1 Make your own
+
+#### 2.1.1 Clone this repo
+You will need the scripts in Distributed-Something locally available in your control node.
+```
+ sudo apt-get install git
+ git clone https://github.com/DistributedScience/Distributed-Something.git
+ cd Distributed-Something/
+ git pull
+```
+
+#### 2.1.2 Python 3.8 or higher and pip
+Most scripts are written in Python and support Python 3.8 and 3.9.
+Follow installation instructions for your platform to install Python and, if needed, pip.
+After Python has been installed, you need to install the requirements for Distributed-Something following these steps:
+
+```
+ cd Distributed-Something/files
+ sudo pip install -r requirements.txt
+```
+
+#### 2.1.3 AWS CLI
+The command line interface is the main mode of interaction between the local node and the resources in AWS.
+You need to install [awscli](http://docs.aws.amazon.com/cli/latest/userguide/installing.html) for Distributed-Something to work properly:
+
+```
+ sudo pip install awscli --ignore-installed six
+ sudo pip install --upgrade awscli
+ aws configure
+```
+
+When running the last step, you will need to enter your AWS credentials.
+Make sure to set the region correctly (i.e. a region like us-east-1 or eu-west-2, not an availability zone like eu-west-2a), and set the default output format to json.
+
+#### 2.1.4 s3fs-fuse (optional)
+[s3fs-fuse](https://github.com/s3fs-fuse/s3fs-fuse) allows you to mount your S3 bucket as a pseudo-file system.
+It does not have all the performance of a real file system, but allows you to easily access all the files in your S3 bucket.
+Follow the instructions at the link to mount your bucket.
+
+#### 2.1.5 Create Control Node AMI (optional)
+Once you've set up the other software (and gotten a job running, so you know everything is set up correctly), you can use Amazon's web console to set this up as an Amazon Machine Image, or AMI, to replicate the current state of the hard drive.
+Create future control nodes using this AMI so that you don't need to repeat the above installation.
diff --git a/documentation/DS-documentation/step_1_configuration.md b/documentation/DS-documentation/step_1_configuration.md
new file mode 100644
index 0000000..f481adf
--- /dev/null
+++ b/documentation/DS-documentation/step_1_configuration.md
@@ -0,0 +1,82 @@
+# Step 1: Configuration
+
+The first step in setting up any job is editing the values in the `config.py` file.
+Once the config file is created, simply type `python run.py setup` to set up your resources based on the configurations you've specified.
+
+***
+
+## Components of the config file
+
+* **APP_NAME:** This will be used to tie your clusters, tasks, services, logs, and alarms together.
+It need not be unique, but it should be descriptive enough that you can tell jobs apart if you're running multiple analyses (e.g. "NuclearSegmentation_Drosophila" is better than "CellProfiler").
+
+***
+
+* **DOCKERHUB_TAG:** This is the encapsulated version of your software your analyses will be running.
+
+***
+
+### AWS GENERAL SETTINGS
+These are settings that will allow your instances to be configured correctly and access the resources they need - see [Step 0: Prep](step_0_prep.md) for more information.
+
+***
+
+### EC2 AND ECS INFORMATION
+
+* **ECS_CLUSTER:** Which ECS cluster you'd like the jobs to go into.
+All AWS accounts come with a "default" cluster, but you may add more clusters if you like.
+Distinct clusters for each job are not necessary, but if you're running multiple analyses at once it can help avoid the wrong Docker containers (such as the ones for your "NuclearSegmentation_Drosophila" job) going to the wrong instances (such as the instances that are part of your "NuclearSegmentation_HeLa" spot fleet).
+* **CLUSTER_MACHINES:** How many EC2 instances you want to have in your cluster.
+* **TASKS_PER_MACHINE:** How many Docker containers to place on each machine.
+* **MACHINE_TYPE:** A list of what type(s) of machines your spot fleet should contain.
+* **MACHINE_PRICE:** How much you're willing to pay per hour for each machine launched.
+AWS has a handy [price history tracker](https://console.aws.amazon.com/ec2sp/v1/spot/home) you can use to make a reasonable estimate of how much to bid.
+If your jobs complete quickly and/or you don't need the data immediately you can reduce your bid accordingly; jobs that may take many hours to finish or that you need results from immediately may justify a higher bid.
+* **EBS_VOL_SIZE:** The size of the temporary hard drive associated with each EC2 instance in GB.
+The minimum allowed is 22.
+If you have multiple Dockers running per machine, each Docker will have access to (EBS_VOL_SIZE/TASKS_PER_MACHINE) - 2 GB of space.
+
+***
+
+### DOCKER INSTANCE RUNNING ENVIRONMENT
+* **DOCKER_CORES:** How many copies of your script to run in each Docker container.
+* **CPU_SHARES:** How many CPUs each Docker container may have.
+* **MEMORY:** How much memory each Docker container may have.
+* **SECONDS_TO_START:** The time each Docker core will wait before it starts another copy of your software.
+This can safely be set to 0 for workflows that don't require much memory or execute quickly; for slower and/or more memory-intensive pipelines we advise you to space them out by roughly the length of your most memory-intensive step to make sure your software doesn't crash due to lack of memory.
+
+***
+
+### SQS QUEUE INFORMATION
+
+* **SQS_QUEUE_NAME:** The name of the queue where all of your jobs will be sent.
+* **SQS_MESSAGE_VISIBILITY:** How long each job is hidden from view before being allowed to be tried again.
+We recommend setting this to slightly longer than the average amount of time it takes an individual job to process - if you set it too short, you may waste resources doing the same job multiple times; if you set it too long, your instances may have to wait around a long while to access a job that was sent to an instance that stalled or has since been terminated.
+* **SQS_DEAD_LETTER_QUEUE:** The name of the queue to send jobs to if they fail to process correctly multiple times; this keeps a single bad job (such as one where a single file has been corrupted) from keeping your cluster active indefinitely.
+See [Step 0: Prep](step_0_prep.md) for more information.
+
+***
+
+### LOG GROUP INFORMATION
+
+* **LOG_GROUP_NAME:** The name to give the log group that will monitor the progress of your jobs and allow you to check performance or look for problems after the fact.
+
+***
+
+### REDUNDANCY CHECKS
+
+* **CHECK_IF_DONE_BOOL:** Whether or not to check the output folder before proceeding.
+Case-insensitive.
+If an analysis fails partway through (due to some of the files being in the wrong place, an AWS outage, a machine crash, etc.), setting this to 'True' allows you to resubmit the whole analysis but only reprocess jobs that haven't already been done.
+This saves you from having to try to parse exactly which jobs succeeded vs failed or from having to pay to rerun the entire analysis.
+If your software determines the correct number of files are already in the output folder it will designate that job as completed and move on to the next one.
+If you actually do want to overwrite files that were previously generated (such as when you have improved a pipeline and no longer want the output of the old version), set this to 'False' to process jobs whether or not there are already files in the output folder.
+* **EXPECTED_NUMBER_FILES:** How many files need to be in the output folder in order to mark a job as completed.
+* **MIN_FILE_SIZE_BYTES:** What is the minimal number of bytes an object should be to "count"?
+Useful when trying to detect jobs that may have exported smaller corrupted files vs larger, full-size files.
+* **NECESSARY_STRING:** This allows you to optionally set a string that must be included in your file to count towards the total in EXPECTED_NUMBER_FILES.
+
+***
+
+### YOUR CONFIGURATIONS
+* **VARIABLE:** Add in any additional system variables specific to your program.
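+
+To make the REDUNDANCY CHECKS settings above concrete, here is a rough sketch (not the actual Distributed-Something worker code, and the output path is hypothetical) of the kind of test a worker can apply before running a job:
+
+```python
+import os
+
+def job_already_done(output_folder, expected_number_files,
+                     min_file_size_bytes, necessary_string=""):
+    """Return True if enough qualifying output files already exist."""
+    count = 0
+    for name in os.listdir(output_folder):
+        path = os.path.join(output_folder, name)
+        if not os.path.isfile(path):
+            continue
+        if os.path.getsize(path) < min_file_size_bytes:
+            continue  # too small to "count" (MIN_FILE_SIZE_BYTES)
+        if necessary_string and necessary_string not in name:
+            continue  # missing the required substring (NECESSARY_STRING)
+        count += 1
+    return count >= expected_number_files  # EXPECTED_NUMBER_FILES
+
+# Only consulted when CHECK_IF_DONE_BOOL is 'True'
+if job_already_done("/bucket/project/outputs/Plate1", 7, 1):
+    print("Skipping job - expected outputs already present")
+```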
diff --git a/documentation/DS-documentation/step_2_submit_jobs.md b/documentation/DS-documentation/step_2_submit_jobs.md
new file mode 100644
index 0000000..ef2a64a
--- /dev/null
+++ b/documentation/DS-documentation/step_2_submit_jobs.md
@@ -0,0 +1,14 @@
+# Step 2: Submit Jobs
+
+Distributed-Something works by breaking your workflow into a series of smaller jobs based on the metadata and groupings you've specified in your job file.
+The choice of how to group your jobs is largely dependent on the details of your workflow.
+Once you've decided on a grouping, you're ready to start configuring your job file.
+Once your job file is configured, simply use `python run.py submitJob files/{YourJobFile}.json` to send all the jobs to the SQS queue [specified in your config file](step_1_configuration.md).
+
+## Configuring your job file
+
+All keys (outside of your groups) are shared between all jobs.
+
+* **groups:** The list of all the groups you'd like to process.
+Keys within each job can either be used to define the job (e.g. Metadata, file location) or can be used to pass job-specific variables.
+For large numbers of groups, it may be helpful to create this list separately as a txt file you can then append into the jobs JSON file using your favorite scripting language.
diff --git a/documentation/DS-documentation/step_3_start_cluster.md b/documentation/DS-documentation/step_3_start_cluster.md
new file mode 100644
index 0000000..2fe99a8
--- /dev/null
+++ b/documentation/DS-documentation/step_3_start_cluster.md
@@ -0,0 +1,69 @@
+# Step 3: Start Cluster
+
+After your jobs have been submitted to the queue, it is time to start your cluster.
+Once you have configured your spot fleet request per the instructions below, you may run
+`python run.py startCluster files/{YourFleetFile}.json`
+
+When you enter this command, the following things will happen (in this order):
+
+* Your spot fleet request will be sent to AWS.
+Depending on AWS's capacity and the price that you bid, it can take anywhere from a couple of minutes to several hours for your machines to be ready.
+* Distributed-Something will create the APP_NAMESpotFleetRequestId.json file, which will allow you to [start your progress monitor](step_4_monitor.md).
+This will allow you to walk away and just let things run even if your spot fleet won't be ready for some time.
+
+Once the spot fleet is ready:
+
+* Distributed-Something will create the log groups (if they don't already exist) for your log streams to go in.
+* Distributed-Something will ask AWS to place Docker containers onto the instances in your spot fleet.
+Your job will begin shortly!
+
+***
+## Configuring your spot fleet request
+Definitions of many of these terms and explanations of many of the individual configuration parameters of spot fleets are covered in AWS documentation [here](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet.html) and [here](http://docs.aws.amazon.com/cli/latest/reference/ec2/request-spot-fleet.html).
+You may also configure your spot fleet request through Amazon's web interface and simply download the JSON file at the "review" page to generate the configuration file you want, though we do not recommend this as Distributed-Something assumes a certain fleet request structure and has only been tested on certain Amazon AMIs.
+Looking at the output of this automatically generated spot fleet request can be useful, though, for obtaining values like your VPC's subnet and security groups, as well as the ARN IDs of your roles.
+
+Among the parameters you should/must update:
+
+* **The IamFleetRole, IamInstanceProfile, KeyName, SubnetId, and Groups:** These are account specific and you will configure these based on the [previous setup work that you did](step_0_prep.md).
+Once you've created your first complete spot fleet request, you can save a copy as a local template so that you don't have to look these up every time.
+
+    * The KeyName used here should be the same used in your config file but **without** the `.pem` extension.
+
+* **ImageId and SnapshotId:** These refer to the OS and pre-installed programming that will be used by your spot fleet instances, and are both AWS region specific.
+We use the Amazon ECS-Optimized Amazon Linux AMI; but the Linux 2 AMI also seems to work in our limited testing.
+If there is no template fleet file for your region, or the one here is too out-of-date, see below for instructions on configuring these options yourselves.
+If you have a good working configuration for a region that isn't represented or for a more up-to-date version of the AMI than we've had time to test, please feel free to create a pull request and we'll include it in the repo!
+
+## To run in a region where a spot fleet config isn't available or is out of date:
+
+* Under EC2 -> Instances select "Launch Instance"
+
+![Launch Instance](images/Launch.jpg)
+
+* Search "ECS", then choose the "Amazon ECS-Optimized Amazon Linux AMI"
+
+![Amazon ECS-Optimized Amazon Linux AMI](images/ECS.jpg)
+
+* Select Continue, then select any instance type (we're going to kill this after a few seconds) and click "Next: Configure Instance Details"
+
+* Choose a network and subnet in the region you wish to launch instances in, and then click "Next: Add Storage"
+
+![Configure network and subnet](images/Network.jpg)
+
+* On the "Add Storage" page, note down the Snapshot column for the Root volume - this is your SnapshotId.
+Click "Review and Launch"
+
+![Snapshot column on the Add Storage page](images/Snapshot.jpg)
+
+* Click "Launch", and then select any key pair (again, we'll be killing this in a few seconds)
+
+* Once your instance has launched, click its link from the launch page.
+
+![Instance link on the launch page](images/InstanceID.jpg)
+
+* In the list of information on the instance, find and note its AMI ID - this is your ImageId
+
+![AMI ID in the instance details](images/AMIID.jpg)
+
+* Terminate the instance
diff --git a/documentation/DS-documentation/step_4_monitor.md b/documentation/DS-documentation/step_4_monitor.md
new file mode 100644
index 0000000..6a0966f
--- /dev/null
+++ b/documentation/DS-documentation/step_4_monitor.md
@@ -0,0 +1,51 @@
+# Step 4: Monitor
+
+Your workflow is now submitted.
+Distributed-Something will keep an eye on a few things for you at this point without you having to do anything else.
+
+* Each instance is labeled with your APP_NAME, so that you can easily find your instances if you want to look at the instance metrics on the Running Instances section of the [EC2 web interface](https://console.aws.amazon.com/ec2/v2/home) to monitor performance.
+
+* You can also look at the whole-cluster CPU and memory usage statistics related to your APP_NAME in the [ECS web interface](https://console.aws.amazon.com/ecs/home).
+
+* Each instance will have an alarm placed on it so that if CPU usage dips below 1% for 15 consecutive minutes (almost always the result of a crashed machine), the instance will be automatically terminated and a new one will take its place.
+
+* Each individual job processed will create a log of your software's output, and each Docker container will create a log showing CPU, memory, and disk usage.
+
+If you choose to run the monitor script, Distributed-Something can be even more helpful.
+If you choose to run the monitor script, Distributed-Something can be even more helpful.
+The monitor can be run by entering `python run.py monitor files/APP_NAMESpotFleetRequestId.json`; the JSON file containing all the information Distributed-Something needs will have been created automatically when you sent the instructions to start your cluster in the previous step.
+
+(**Note:** You should run the monitor inside [Screen](https://www.gnu.org/software/screen/), [tmux](https://tmux.github.io/), or another comparable tool to keep a network disconnection from killing your monitor; this is particularly critical the longer your run takes.)
+
+***
+
+## Monitor functions
+
+### While your analysis is running
+
+* Checks your queue once per minute to see how many jobs are currently processing and how many remain to be processed.
+
+* Once per hour, deletes the alarms for any instances that have been terminated in the last 24 hours (because of spot prices rising above your maximum bid, machine crashes, etc.).
+
+### When the number of jobs in your queue goes to 0
+
+* Downscales the ECS service associated with your APP_NAME.
+
+* Deletes all the alarms associated with your spot fleet (for both the currently running and the previously terminated instances).
+
+* Shuts down your spot fleet to keep you from incurring charges after your analysis is over.
+
+* Deletes the queue, service, and task definition created for this analysis.
+
+* Exports all the logs from your analysis to your S3 bucket.
+
+***
+
+## Cheapest mode
+
+You can run the monitor in an optional "cheapest" mode, which will downscale the number of requested machines (but not RUNNING machines) to one, 15 minutes after the monitor is engaged.
+You can engage cheapest mode by adding `True` as a final parameter when starting the monitor, e.g. `python run.py monitor files/APP_NAMESpotFleetRequestId.json True`.
+
+Cheapest mode is cheapest because all but one machine will be removed as soon as they crash and/or run out of jobs to do; this can save you money, particularly for multi-CPU Docker containers running long jobs.
+
+This mode is optional because running this way involves some inherent risks: if machines stall out due to processing errors, they will not be replaced, meaning your job will take longer overall.
+Additionally, if there is limited capacity for your requested configuration when you first start (e.g. you want 200 machines but AWS says it can currently only allocate you 50), more machines will not be added if and when capacity becomes available, as they would be in normal mode.
diff --git a/documentation/DS-documentation/troubleshooting_implementation.md b/documentation/DS-documentation/troubleshooting_implementation.md
new file mode 100644
index 0000000..113b44a
--- /dev/null
+++ b/documentation/DS-documentation/troubleshooting_implementation.md
@@ -0,0 +1,32 @@
+# Troubleshooting Your DS Implementation
+
+## Check the Logs
+Logs are automatically created in CloudWatch as a part of Distributed-Something.
+If you need to troubleshoot your Distributed-Something implementation, it is likely that your error will be logged in CloudWatch.
+`LOG_GROUP_NAME_perInstance` logs contain a log for every EC2 instance that is launched as part of your Spot Fleet request and should be the first place that you look.
+If your instances are successfully able to pull messages from the SQS queue, they will create a log for each job, which can be easier to parse than the full `_perInstance` logs.
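+
+If you'd rather search the logs from a script than click through the CloudWatch console, a boto3 sketch along these lines can pull out matching events (the log group name and filter pattern below are placeholders; substitute the LOG_GROUP_NAME you actually configured):
+
+```python
+import boto3
+
+logs = boto3.client("logs")
+
+# Placeholder log group name; use your LOG_GROUP_NAME plus the suffix you care about.
+response = logs.filter_log_events(
+    logGroupName="MyAppName_perInstance",
+    filterPattern="Error",
+    limit=50,
+)
+for event in response["events"]:
+    print(event["logStreamName"], event["message"])
+```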
+
+## AWS Credential Handling
+Improper credential handling can be a blocking point in accessing many AWS services, because all permissions must be explicitly granted in AWS and best practice is to set least-privilege permissions.
+Distributed-Something is configured to simplify access management, but if you need to change any of the recommended access management, be sure that you have carefully considered the permissions involved.
+
+Some of the required permissions are as follows:
+`run.py` uses the default credentials on your computer to run the following commands.
+`run.py` doesn't pass flags to the various AWS commands, so you need to have the AWS CLI set up with a default account that has these permissions.
+- `setup`: needs SQS permissions to create the queue and ECS permissions to create the cluster and task definition.
+It sends environment variables to the task definition that include either a role or access keys.
+If you pass a key/secret key and a role, it will default to using the key/secret key for the task.
+- `submitJob`: needs SQS permissions to add jobs to the queue.
+- `startCluster`: needs ECS permissions to create the ECS config, S3 permissions to upload the ECS config, EC2 permissions to launch a spot fleet, and CloudWatch permissions to create logs.
+
+Once an EC2 instance is launched, ECS automatically places a Docker container on that instance.
+The fleet file determines the IamFleetRole and the IamInstanceProfile. One or both of those contains the permissions ECS needs to place the Docker container, plus anything that is needed inside the Dockerfile (which may be nothing, or may be something like mounting S3 with s3fs).
+
+No AWS credentials are used to execute `Dockerfile`.
+
+`Dockerfile` enters `run-worker.sh`.
+`run-worker.sh` configures the AWS CLI with environment variables.
+`run-worker.sh` configures the EC2 instance and then launches workers with `generic-worker.py`.
+`generic-worker.py` interacts with the SQS queue, CloudWatch, S3, and whatever else is required by your software.
+
+If you're having a hard time determining what credentials have been passed to your Docker container, you can add `curl 169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` or `aws configure list` to `run-worker.sh` to have it print the credentials that the container is using.
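+
+If your worker code is Python-based, the same check can also be done with a couple of lines of boto3 (a sketch only; it assumes boto3 is available inside your Docker image):
+
+```python
+import boto3
+
+# Prints the account and ARN of whatever identity the container has actually assumed,
+# which is usually enough to tell whether the intended role or keys were picked up.
+identity = boto3.client("sts").get_caller_identity()
+print(identity["Account"], identity["Arn"])
+```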
diff --git a/documentation/DS-documentation/troubleshooting_runs.md b/documentation/DS-documentation/troubleshooting_runs.md
new file mode 100644
index 0000000..ad4a89d
--- /dev/null
+++ b/documentation/DS-documentation/troubleshooting_runs.md
@@ -0,0 +1,11 @@
+# Troubleshooting
+
+We recommend you create a troubleshooting table that describes common failure modes for your implementation of Distributed-Something.
+Shown below are errors common to most implementations of DS.
+The Distributed-CellProfiler documentation has a [thorough example](https://github.com/CellProfiler/Distributed-CellProfiler/wiki/Troubleshooting).
+
+| SQS | CloudWatch | S3 | EC2/ECS | Problem | Solution |
+|---|---|---|---|---|---|
+| | Within a single log, your run command is logging multiple times. | Expected output seen. | | A single job is being processed multiple times. | SQS_MESSAGE_VISIBILITY is set too short. See SQS-QUEUE-INFORMATION for more information. |
+| | Your specified output structure does not match the Metadata passed. | Expected output is seen. | | This is not necessarily an error. If the input grouping is different from the output grouping (e.g. jobs are run by Plate-Well-Site but are all output to a single Plate folder), then this message will print in the CloudWatch log that matches the input structure, but actual job progress will print in the CloudWatch log that matches the output structure. | |
+| | | | Machines are made in EC2 and Docker containers are made in ECS, but the containers are not placed on the machines. | There is a mismatch in your DS config file. | Confirm that the MEMORY matches the MACHINE_TYPE set in your config. |
diff --git a/documentation/DS-documentation/versions.md b/documentation/DS-documentation/versions.md
new file mode 100644
index 0000000..092339b
--- /dev/null
+++ b/documentation/DS-documentation/versions.md
@@ -0,0 +1,10 @@
+# Versions
+
+The most current release can always be found at `distributedscience/distributed-something`.
+The current version is 1.0.0.
+
+---
+
+# Version History
+
+## 1.0.0 - Version as of 20220701
diff --git a/documentation/README.md b/documentation/README.md
new file mode 100644
index 0000000..4eef9a6
--- /dev/null
+++ b/documentation/README.md
@@ -0,0 +1,13 @@
+This folder contains the files to automatically generate documentation for your Distributed-Something distribution using [Jupyter Book](https://jupyterbook.org/en/stable/intro.html).
+This documentation is linked from the main repository README and serves in place of a wiki.
+
+Instructions for customizing your Distributed-Something distribution are in the `Creating DS` section and can be deleted from your repository's documentation once you have completed creating your Distributed-Something distribution.
+
+Documentation for running Distributed-Something is in the `Running DS` section and should be updated to fit your particular Distributed-Something implementation.
+
+This documentation will automatically rebuild any time you push to your repository's `main` branch.
+By default, only pages that have undergone edits will rebuild.
+
+To enable this auto-build, you need to set up a GitHub Action in your repository.
+Read more about [GitHub Actions](https://help.github.com/en/actions) and about [hosting Jupyter Books with GitHub Actions](https://jupyterbook.org/en/stable/publish/gh-pages.html#automatically-host-your-book-with-github-actions).
+The action will auto-build the documentation to a `gh-pages` branch, which will be visible at `githubusername.github.io/yourbookname`.
diff --git a/files/batches.sh b/files/batches.sh
deleted file mode 100644
index 200cc00..0000000
--- a/files/batches.sh
+++ /dev/null
@@ -1,7 +0,0 @@
-# Command to generate batches for a single plate.
-# It generates 384*9 tasks, corresponding to 384 wells with 9 images each.
-# An image is the unit of parallelization in this example.
-#
-# You need to install parallel to run this command.
-
-parallel echo '{\"Metadata\": \"Metadata_Plate={1},Metadata_Well={2}{3},Metadata_Site={4}\"},' ::: Plate1 Plate2 ::: `echo {A..P}` ::: `seq -w 24` ::: `seq -w 9` | sort > batches.txt