Cleanup open issues #142
Merged
Commits (19):
- fbe44eb add file input info to docs (ErinWeisbart)
- 1226cd6 start passing files docs (ErinWeisbart)
- 9898cce fstrings ftw (ErinWeisbart)
- b265f43 support for file lists (untested) (ErinWeisbart)
- 8d0ecfc warning message for filtering csv (ErinWeisbart)
- 8784859 better error message for Cloudwatch grouping mismatch (ErinWeisbart)
- 882aeb7 working minimal example (ErinWeisbart)
- 5e4275e add auto deadletter queue, dashboard (ErinWeisbart)
- 01eedcf remove grouping from file list command (ErinWeisbart)
- 22b8221 Merge branch 'master' into erin_cleanup (bethac07)
- 088eb5b update file list handling, docs (ErinWeisbart)
- 7f51c04 f strings (ErinWeisbart)
- 238cada another expanded comment (ErinWeisbart)
- bbd4ce7 clean get_queue_url (ErinWeisbart)
- 7999389 clean get_queue_url (ErinWeisbart)
- 317e7b9 clean up generate_task_definition after merge of master (ErinWeisbart)
- 8836a8d debug (ErinWeisbart)
- 4073602 debug (ErinWeisbart)
- aca2cfa minor doc edit (ErinWeisbart)
# Passing Files to DCP

Distributed-CellProfiler can be told which files to use through a LoadData.csv, a batch file, or a file list.

## Load Data

LoadData.csv files are CSVs that tell CellProfiler how your images should be parsed.
At a minimum, this CSV should contain a PathName_{NameOfChannel} and a FileName_{NameOfChannel} column for each of your channels, as well as a Metadata_{PieceOfMetadata} column for each kind of metadata used to group your image sets.
It can contain any other metadata you would like to track.
Some users have reported issues with using relative paths in the PathName columns; using absolute paths beginning with `/home/ubuntu/bucket/{relativepath}` may increase your odds of success.
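As a concrete sketch, such a CSV could be generated with Python's standard library; the channel name `DNA`, the `Metadata_Well` grouping column, and the file names below are hypothetical examples, not requirements:

```python
import csv

# Minimal sketch of a LoadData.csv: one channel ("DNA") and one grouping
# column ("Metadata_Well"). All names and paths here are hypothetical.
rows = [
    {"FileName_DNA": "well_A01_dna.tiff",
     "PathName_DNA": "/home/ubuntu/bucket/project_folder/images",
     "Metadata_Well": "A01"},
    {"FileName_DNA": "well_A02_dna.tiff",
     "PathName_DNA": "/home/ubuntu/bucket/project_folder/images",
     "Metadata_Well": "A02"},
]

with open("load_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["FileName_DNA", "PathName_DNA", "Metadata_Well"])
    writer.writeheader()
    writer.writerows(rows)
```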
### Creating LoadData.csv

You can create this CSV yourself with your favorite scripting language.
We also maintain [pe2loaddata](https://github.com/broadinstitute/pe2loaddata), a script for creating a LoadData.csv from Phenix metadata XML files.

You can also create the LoadData.csv in a local copy of CellProfiler using the standard input modules: Images, Metadata, NamesAndTypes, and Groups.
More written and video information about using the input modules can be found [here](https://broad.io/CellProfilerInput).
After loading your images, use the Export->Image Set Listing command.
You will then need to replace the local paths with the paths where the files can be found in the cloud.
If your files keep the same structure, this can be done with a simple find-and-replace in any text-editing software (e.g. find '/Users/eweisbar/Desktop' and replace with '/home/ubuntu/bucket').
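That find-and-replace step can also be scripted; a minimal sketch in Python, using the example prefixes above:

```python
# Swap the local path prefix for the cloud path prefix throughout the CSV.
# The two prefixes are the example values from the text above.
LOCAL_PREFIX = "/Users/eweisbar/Desktop"
CLOUD_PREFIX = "/home/ubuntu/bucket"

with open("load_data.csv") as f:
    contents = f.read()

with open("load_data_cloud.csv", "w") as f:
    f.write(contents.replace(LOCAL_PREFIX, CLOUD_PREFIX))
```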
### Using LoadData.csv

To use a LoadData.csv with submitJobs, put the path to the LoadData.csv in **data_file:**.

To use a LoadData.csv with run_batch_general.py, enter the name of the LoadData.csv under **#project specific stuff** in `{STEP}name`.
At the bottom of the file, make sure the command for the step you are running has no arguments or has `batch=False` (e.g. `MakeAnalysisJobs()` or `MakeAnalysisJobs(batch=False)`).
Note that if you do not follow our standard file organization, you will also need to edit `datafilepath` under **#not project specific, unless you deviate from the structure**.

## Batch Files

Batch files are an easy way to transition from running locally to running distributed.
A batch file is an `.h5` file created by CellProfiler that captures all the data needed to run your workflow: the pipeline and the file information are packaged together.
To use a batch file, your data needs to have the same structure in the cloud as on your local machine.

### Creating batch files

To create a batch file, load all your images into a local copy of CellProfiler using the standard input modules: Images, Metadata, NamesAndTypes, and Groups.
More written and video information about using the input modules can be found [here](https://broad.io/CellProfilerInput).
Put the `CreateBatchFiles` module at the end of your pipeline and ensure that it is selected.
Add a path mapping and edit the `Local root path` and `Cluster root path`.
Run the CellProfiler pipeline by pressing the `Analyze Images` button; note that this won't actually run your pipeline but will instead create a batch file.
More information on the `CreateBatchFiles` module can be found [here](https://cellprofiler-manual.s3.amazonaws.com/CellProfiler-4.2.4/modules/fileprocessing.html).

### Using batch files

To use a batch file with submitJobs, put the path to the `.h5` file in both **data_file:** and **pipeline:**.
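For illustration, the relevant job-file fields might look like the sketch below; the layout follows the example job file included later in this PR, and the `Batch_data.h5` path is a hypothetical example:

```json
{
    "pipeline": "project_folder/workspace/Batch_data.h5",
    "data_file": "project_folder/workspace/Batch_data.h5"
}
```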
To use a batch file with run_batch_general.py, enter the name of the batch file under **#project specific stuff** in `batchpipename{STEP}`.
At the bottom of the file, set `batch=True` in the command for the step you are running (e.g. `MakeAnalysisJobs(batch=True)`).
Note that if you do not follow our standard file organization, you will also need to edit `batchpath` under **#not project specific, unless you deviate from the structure**.

## File lists

You can also simply pass a list of absolute file paths (not relative paths), one file per row, in `.txt` format.
Note that file lists themselves do not associate metadata with file paths (in contrast to LoadData.csv files, where you can enter any metadata columns you desire).
Therefore, the metadata Distributed-CellProfiler uses for grouping must be extracted from file and folder names in the Metadata module of your CellProfiler pipeline.
You can pass additional metadata to CellProfiler by choosing `Add another extraction method`, setting the method to `Import from file`, and setting the Metadata file location to `Default Input Folder`.

### Creating File Lists

Use any text-editing software to create a `.txt` file where each line is the absolute path to a single image that you want to process.
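If you prefer to script it, a minimal sketch in Python (the image directory and `.tiff` extension are hypothetical; adjust for your data):

```python
from pathlib import Path

# Hypothetical folder of images to process; adjust for your data.
IMAGE_DIR = Path("/home/ubuntu/bucket/project_folder/images")

# Write one absolute path per line, as the file-list format requires.
with open("file_list.txt", "w") as f:
    for image in sorted(IMAGE_DIR.glob("*.tiff")):
        f.write(f"{image.resolve()}\n")
```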
### Using File Lists

To use a file list with submitJobs, put the path to the `.txt` file in **data_file:**.
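Again for illustration, the relevant job-file field might look like this sketch (hypothetical path, layout as in the example job file later in this PR):

```json
{
    "data_file": "project_folder/workspace/file_list.txt"
}
```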
---
Included in this folder are all of the resources for running a complete mini-example of Distributed-CellProfiler.
It includes 3 sample image sets and a CellProfiler pipeline that identifies cells within the images and makes measurements.
It also includes the Distributed-CellProfiler files pre-configured to create a queue of all 3 jobs and spin up a spot fleet of 3 instances, each of which will process a single image set.

## Running the example project

### Step 0

Before running this mini-example, you will need to set up your AWS infrastructure as described in our [online documentation](https://distributedscience.github.io/Distributed-CellProfiler/step_0_prep.html).
This includes creating the fleet file that you will use in Step 3.

Upload the `project_folder` folder to the top level of your bucket and swap in the example config.
While in the `Distributed-CellProfiler` folder, use the following commands, replacing `yourbucket` with your bucket name:

```bash
# Copy example files to S3
BUCKET=yourbucket
aws s3 sync example_project/project_folder s3://${BUCKET}/project_folder

# Replace the default config with the example config
cp example_project/config.py config.py
```

### Step 1

In config.py, update the following fields to match your AWS configuration:

```python
# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default'  # The same profile used by your AWS CLI installation
SSH_KEY_NAME = 'your-key-file.pem'  # Expected to be in ~/.ssh
AWS_BUCKET = 'your-bucket-name'
SOURCE_BUCKET = 'your-bucket-name'  # Only differs from AWS_BUCKET with advanced configuration
DESTINATION_BUCKET = 'your-bucket-name'  # Only differs from AWS_BUCKET with advanced configuration
```

Then run `python3 run.py setup`.

### Step 2

This command points to the job file created for this demonstration and should be run as-is:
`python3 run.py submitJob example_project/files/exampleJob.json`

### Step 3

This command should point to the fleet file you created in Step 0, so you may need to update the `exampleFleet.json` file name:
`python3 run.py startCluster files/exampleFleet.json`

### Step 4

This command points to the monitor file that is automatically created with your run and should be run as-is:
`python3 run.py monitor files/FlyExampleSpotFleetRequestId.json`

## Results

While the run is happening, you can watch real-time metrics in your CloudWatch Dashboard by navigating to the [CloudWatch console](https://console.aws.amazon.com/cloudwatch).
Note that the metrics update at intervals that may not be helpful for this fast, minimal example.

After the run is done, you should see your CellProfiler output files in S3 at `s3://${BUCKET}/project_folder/output`, in per-image folders.
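For a quick command-line check of those outputs, a sketch using the AWS CLI (reusing the `BUCKET` variable set in Step 0):

```bash
# List the per-image output folders produced by the run
aws s3 ls s3://${BUCKET}/project_folder/output/
```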
---

`example_project/config.py`, the example config referenced above:
```python
# Constants (User configurable)

APP_NAME = 'FlyExample'  # Used to generate derivative names unique to the application.

# DOCKER REGISTRY INFORMATION:
DOCKERHUB_TAG = 'cellprofiler/distributed-cellprofiler:2.0.0_4.2.4'

# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default'  # The same profile used by your AWS CLI installation
SSH_KEY_NAME = 'your-key-file.pem'  # Expected to be in ~/.ssh
AWS_BUCKET = 'your-bucket-name'  # Bucket to use for logging (likely all three buckets the same for this example)
SOURCE_BUCKET = 'your-bucket-name'  # Bucket to download files from (likely all three buckets the same for this example)
DESTINATION_BUCKET = 'your-bucket-name'  # Bucket to upload files to (likely all three buckets the same for this example)

# EC2 AND ECS INFORMATION:
ECS_CLUSTER = 'default'
CLUSTER_MACHINES = 3
TASKS_PER_MACHINE = 1
MACHINE_TYPE = ['c4.xlarge']
MACHINE_PRICE = 0.10
EBS_VOL_SIZE = 22  # In GB. Minimum allowed is 22.
DOWNLOAD_FILES = 'False'

# DOCKER INSTANCE RUNNING ENVIRONMENT:
DOCKER_CORES = 1  # Number of CellProfiler processes to run inside a docker container
CPU_SHARES = DOCKER_CORES * 1024  # ECS computing units assigned to each docker container (1024 units = 1 core)
MEMORY = 7500  # Memory assigned to the docker container in MB
SECONDS_TO_START = 3 * 60  # Wait before the next CP process is initiated to avoid memory collisions

# SQS QUEUE INFORMATION:
SQS_QUEUE_NAME = APP_NAME + 'Queue'
SQS_MESSAGE_VISIBILITY = 10 * 60  # Timeout (secs) for messages in flight (average time to be processed)
SQS_DEAD_LETTER_QUEUE = 'ExampleProject_DeadMessages'

# LOG GROUP INFORMATION:
LOG_GROUP_NAME = APP_NAME

# CLOUDWATCH DASHBOARD CREATION
CREATE_DASHBOARD = 'True'  # Create a dashboard in Cloudwatch for the run
CLEAN_DASHBOARD = 'True'  # Automatically remove the dashboard at the end of the run with Monitor

# REDUNDANCY CHECKS
CHECK_IF_DONE_BOOL = 'False'  # True or False: should it check for a certain number of non-empty files and delete the job if they exist?
EXPECTED_NUMBER_FILES = 7  # What number of files triggers skipping a job?
MIN_FILE_SIZE_BYTES = 1  # What is the minimum number of bytes an object should be to "count"?
NECESSARY_STRING = ''  # Is there any string that should be in the file name to "count"?

# PLUGINS
USE_PLUGINS = 'False'
UPDATE_PLUGINS = 'False'
PLUGINS_COMMIT = ''  # What commit or version tag do you want to check out?
INSTALL_REQUIREMENTS = 'False'
REQUIREMENTS_FILE = ''  # Path within the plugins repo to a requirements file
```
---

`example_project/files/exampleJob.json`, the job file referenced in Step 2:
```json
{
    "_comment1": "Paths in this file are relative to the root of your S3 bucket",
    "pipeline": "project_folder/workspace/ExampleFly.cppipe",
    "data_file": "project_folder/workspace/load_data.csv",
    "input": "project_folder/workspace/",
    "output": "project_folder/output",
    "output_structure": "Metadata_Position",
    "_comment2": "The following groups are tasks, and each will be run in parallel",
    "groups": [
        {"Metadata": "Metadata_Position=2"},
        {"Metadata": "Metadata_Position=76"},
        {"Metadata": "Metadata_Position=218"}
    ]
}
```
Binary files not shown.
**Review comment:** We should mention here that it needs to be absolute paths, not relative paths.