Commit 47ac57c

Merge pull request #312 from MontgomeryLab/issue-311
Pipeline: auto-documentation improvements
2 parents 1e38098 + ce37eb6 commit 47ac57c

18 files changed

+278
-155
lines changed

START_HERE/run_config.yml

Lines changed: 12 additions & 8 deletions
@@ -20,7 +20,7 @@
 user:
 run_date: ~
 run_time: ~
-paths_config: ./paths.yml
+paths_config: paths.yml
 
 ##-- The label for final outputs --##
 ##-- If none provided, the default of user_tinyrna will be used --##
@@ -310,6 +310,7 @@ dir_name_tiny-count: tiny-count
 dir_name_tiny-deseq: tiny-deseq
 dir_name_tiny-plot: tiny-plot
 dir_name_logs: logs
+dir_name_config: config
 
 
 ######################### AUTOMATICALLY GENERATED CONFIGURATIONS #########################
@@ -320,7 +321,7 @@ dir_name_logs: logs
 #
 ###########################################################################################
 
-version: 1.4.0
+version: 1.5.0
 
 ######--------------------------- DERIVED FROM PATHS FILE ---------------------------######
 #
@@ -332,10 +333,10 @@ run_directory: ~
 tmp_directory: ~
 features_csv: { }
 samples_csv: { }
-paths_file: { }
 gff_files: [ ]
 run_bowtie_build: false
 reference_genome_files: [ ]
+bt_index_files: [ ]
 plot_style_sheet: ~
 adapter_fasta: ~
 ebwt: ~
@@ -356,10 +357,6 @@ in_fq: [ ]
 # output reports
 fastp_report_titles: [ ]
 
-###-- Utilized by bowtie --###
-# bowtie index files
-bt_index_files: [ ]
-
 ##-- Utilized by tiny-deseq.r --##
 # The control for comparison. If unspecified, all comparisons are made
 control_condition:
@@ -383,4 +380,11 @@ run_deseq: True
 ##-- Utilized by tiny-plot --##
 # Filters for class scatter plots
 plot_class_scatter_filter_include: []
-plot_class_scatter_filter_exclude: []
+plot_class_scatter_filter_exclude: []
+
+##-- Used to populate the directory defined in dir_name_config --##
+##-- CWL spec doesn't provide a way to get this info from within the workflow --##
+processed_run_config: {}
+
+##-- This is the paths_config key converted to a CWL file object for handling --##
+paths_file: {}

START_HERE/samples.csv

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-Input Files,Sample/Group Name,Replicate number,Control,Normalization
+Input Files,Sample/Group Name,Replicate Number,Control,Normalization
 ./fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE,
 ./fastq_files/cond1_rep2.fastq.gz,condition1,2,,
 ./fastq_files/cond1_rep3.fastq.gz,condition1,3,,

START_HERE/tinyRNA_TUTORIAL.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ And when you're done, you can close your terminal or use `conda deactivate` to r
 The output you see on your terminal is from `cwltool`, which coordinates the execution of the workflow CWL. The terminal output from individual steps is redirected to a logfile for later reference.
 
 ### File outputs
-When the analysis is complete you'll notice a new folder has appeared whose name contains the date and time of the run. Inside you'll find subdirectories containing the file and terminal outputs for each step, and the processed Run Config file for auto-documentation of the run.
+When the analysis is complete you'll notice a new timestamped folder has appeared. Inside you'll find subdirectories containing the file outputs for each step, and processed copies of your configuration files which serve as auto-documentation of the run. These configuration copies also allow for repeat analyses using the existing file outputs.
 
 ### Bowtie indexes
 Bowtie indexes were built during this run because `paths.yml` didn't define an `ebwt` prefix. Now, you'll see the `ebwt` points to the freshly built indexes in your run directory. This means that indexes won't be rebuilt during any subsequent runs that use this `paths.yml` file. If you need to rebuild your indexes, simply delete the value to the right of `ebwt` in paths.yml

doc/Pipeline.md

Lines changed: 10 additions & 15 deletions
@@ -18,9 +18,10 @@ tiny replot --config processed_run_config.yml
 The `tiny run` command performs a comprehensive analysis of your [input files](../README.md#requirements-for-user-provided-input-files) according to the preferences defined in your [configuration files](Configuration.md).
 
 ## Resuming a Prior Analysis
-The tiny-count and tiny-plot steps offer a wide variety of options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. However, the earlier pipeline steps (fastp, tiny-collapse, and bowtie) handle the largest volume of data and are resource intensive, so you can save time by reusing their outputs for subsequent analyses. One could do so by running the later steps individually (e.g. using commands `tiny-count`, `tiny-deseq.r`, and `tiny-plot`), but assembling their commandline inputs by hand is labor-intensive and prone to spelling mistakes.
+The tiny-count and tiny-plot steps offer many options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. However, the earlier pipeline steps (fastp, tiny-collapse, and bowtie) handle the largest volume of data and are resource intensive, so you can save time by reusing their outputs for subsequent analyses.
+
+The commands `tiny recount` and `tiny replot` allow the workflow to be resumed using outputs from a prior run. The Run Directory for each end-to-end analysis will contain the run's four primary configuration files, and these files can be freely edited to change the resume run's behavior without sacrificing auto-documentation.
 
-The commands `tiny recount` and `tiny replot` seek to solve this problem. As discussed in the [Run Config documentation](Configuration.md#the-processed-run-config), the Run Directory for each end-to-end analysis will contain a processed Run Config, and this is the file that determines the behavior of a resume run.
 
 <figure align="center">
 <figcaption><b>tiny recount</b></figcaption>
@@ -29,25 +30,19 @@ The commands `tiny recount` and `tiny replot` seek to solve this problem. As dis
 <img src="../images/replot.png" width="65%" alt="replot"/>
 </figure>
 
-
-You can modify the behavior of a resume run by changing settings in:
-- The **processed** Run Config
-- The **original** Features Sheet that was used for the end-to-end run (as indicated by `features_csv` in the processed Run Config)
-- The **original** Paths File (as indicated by `paths_config` in the processed Run Config)
-
 ### The Steps
-1. Make and save the desired changes in the files above
-2. In your terminal, `cd` to the Run Directory of the end-to-end run you wish to resume
+1. Make and save changes to the configuration files within the target Run Directory
+2. In your terminal, `cd` to the target Run Directory
 3. Run the desired resume command
 
-### A Note on File Inputs
-File inputs are sourced from the **original** output subdirectories of prior steps in the target Run Directory. For `tiny replot`, this means that files from previous executions of `tiny recount` will **not** be used as inputs; only the original end-to-end outputs are used.
+### Auto-Documentation
+Among the subdirectories produced in your Run Directory after an end-to-end run, you'll find a directory named "config" which holds a copy of the run's four primary configuration files. These files serve as documentation for the run and, unlike those found at the root of the Run Directory, they should not be modified. A timestamped "config" directory is created after each resume run to similarly document the configurations that were used.
 
-### Where to Find Outputs from Resume Runs
+### Resume Run Outputs
 Output subdirectories for resume runs can be found alongside the originals, and will have a timestamp appended to their name to differentiate them.
 
-### Auto-Documentation of Resume Runs
-A new processed Run Config will be saved in the Run Directory at the beginning of each resume run. It will be labelled with the same timestamp used in the resume run's other outputs to differentiate it. It includes the changes to your Paths File and Run Config. A copy of your Features Sheet is saved to the timestamped tiny-count output directory during `tiny recount` runs.
+### Repeated Analyses
+If a `recount` run is performed and a `replot` is performed later in the same Run Directory, then only the outputs of the `recount` run are used for generating the plots. If multiple `recount` runs precede the `replot` then the most recent outputs are used.
 
 ## Parallelization
 Most steps in the pipeline run in parallel to minimize runtimes. This is particularly advantageous for multiprocessor systems like server environments. However, parallelization isn't always beneficial. If your computer doesn't have enough free memory, or if you have a large sample file set and/or reference genome, parallel execution might push your machine to its limits. When this happens you might see memory errors or your computer may become unresponsive. In these cases it makes more sense to run resource intensive steps one at a time, in serial, rather than in parallel. To do so, set `run_parallel: false` in your Run Config. This will affect fastp, tiny-collapse, and bowtie since these steps typically handle the largest volumes of data.
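
Reviewer note on the new "Repeated Analyses" rule in doc/Pipeline.md: the selection it describes (a later `replot` uses the most recent `recount` outputs, falling back to the original outputs when no `recount` has run) can be sketched roughly as below. This is only an illustration of the documented rule, not tinyRNA's code. The function name, the `tiny-count_<timestamp>` naming, and the timestamp format are all assumptions, since the diff only says that a timestamp is appended to the directory name.

```python
from datetime import datetime
from typing import List

# Hypothetical timestamp format; the format tinyRNA actually uses is not shown in this diff
TIMESTAMP_FORMAT = "%Y-%m-%d_%H-%M-%S"


def latest_count_dir(dir_names: List[str], base: str = "tiny-count") -> str:
    """Pick the counting output directory that a later replot would read from.

    Timestamped directories (base + "_" + timestamp) come from resume runs;
    the plain `base` directory holds the original end-to-end outputs.
    """
    prefix = base + "_"
    stamped = []
    for name in dir_names:
        if not name.startswith(prefix):
            continue
        try:
            stamp = datetime.strptime(name[len(prefix):], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # Not a timestamped resume directory
        stamped.append((stamp, name))

    # Most recent recount wins; otherwise fall back to the original outputs
    return max(stamped)[1] if stamped else base
```

Parsing the suffix into a `datetime` rather than comparing strings keeps the ordering correct even if the chosen format would not sort lexicographically.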

images/recount.png

11.5 KB

images/replot.png

8.9 KB

tests/testdata/config_files/run_config_template.yml

Lines changed: 11 additions & 7 deletions
@@ -310,6 +310,7 @@ dir_name_tiny-count: tiny-count
 dir_name_tiny-deseq: tiny-deseq
 dir_name_tiny-plot: tiny-plot
 dir_name_logs: logs
+dir_name_config: config
 
 
 ######################### AUTOMATICALLY GENERATED CONFIGURATIONS #########################
@@ -320,7 +321,7 @@ dir_name_logs: logs
 #
 ###########################################################################################
 
-version: 1.4.0
+version: 1.5.0
 
 ######--------------------------- DERIVED FROM PATHS FILE ---------------------------######
 #
@@ -332,10 +333,10 @@ run_directory: ~
 tmp_directory: ~
 features_csv: { }
 samples_csv: { }
-paths_file: { }
 gff_files: [ ]
 run_bowtie_build: false
 reference_genome_files: [ ]
+bt_index_files: [ ]
 plot_style_sheet: ~
 adapter_fasta: ~
 ebwt: ~
@@ -356,10 +357,6 @@ in_fq: [ ]
 # output reports
 fastp_report_titles: [ ]
 
-###-- Utilized by bowtie --###
-# bowtie index files
-bt_index_files: [ ]
-
 ##-- Utilized by tiny-deseq.r --##
 # The control for comparison. If unspecified, all comparisons are made
 control_condition:
@@ -383,4 +380,11 @@ run_deseq: True
 ##-- Utilized by tiny-plot --##
 # Filters for class scatter plots
 plot_class_scatter_filter_include: []
-plot_class_scatter_filter_exclude: []
+plot_class_scatter_filter_exclude: []
+
+##-- Used to populate the directory defined in dir_name_config --##
+##-- CWL spec doesn't provide a way to get this info from within the workflow --##
+processed_run_config: {}
+
+##-- This is the paths_config key converted to a CWL file object for handling --##
+paths_file: {}

tiny/cwl/workflows/tinyrna_wf.cwl

Lines changed: 14 additions & 0 deletions
@@ -15,6 +15,7 @@ inputs:
 # multi input
 threads: int?
 run_name: string
+processed_run_config: File
 sample_basenames: string[]
 
 # bowtie build
@@ -117,6 +118,7 @@ inputs:
 dir_name_tiny-count: string
 dir_name_tiny-deseq: string
 dir_name_tiny-plot: string
+dir_name_config: string
 
 steps:
 
@@ -281,6 +283,14 @@ steps:
 - sample_avg_scatter_by_dge
 - sample_avg_scatter_by_dge_class
 
+organize_config:
+run: ../tools/make-subdir.cwl
+in:
+dir_files:
+source: [ processed_run_config, paths_file, samples_csv, features_csv, plot_style_sheet ]
+dir_name: dir_name_config
+out: [ subdir ]
+
 organize_bt_indexes:
 run: ../tools/make-subdir.cwl
 when: $(inputs.run_bowtie_build)
@@ -353,6 +363,10 @@ steps:
 outputs:
 
 # Subdirectory outputs
+config_out_dir:
+type: Directory
+outputSource: organize_config/subdir
+
 bt_build_out_dir:
 type: Directory?
 outputSource: organize_bt_indexes/subdir
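
Reviewer note on the new `organize_config` step: the contents of `make-subdir.cwl` are not part of this diff, so the following is only a guess at the effect the step is meant to have, namely gathering the listed configuration files into a directory named by `dir_name_config`. The function name `collect_into_subdir` and the decision to skip unset inputs (such as a `plot_style_sheet` left as `~`) are assumptions made for illustration.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional


def collect_into_subdir(dir_files: Iterable[Optional[str]], dir_name: str, parent: str = ".") -> Path:
    """Copy each provided file into parent/dir_name and return the directory path."""
    subdir = Path(parent) / dir_name
    subdir.mkdir(parents=True, exist_ok=True)

    for file_path in dir_files:
        if file_path is None:
            # Assumption: optional inputs that were never set are simply skipped
            continue
        shutil.copy2(file_path, subdir / Path(file_path).name)

    return subdir
```

Copying rather than moving matches the documentation change in this commit, which describes the "config" directory as holding copies while the editable originals stay at the root of the Run Directory.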

tiny/entry.py

Lines changed: 16 additions & 16 deletions
@@ -189,32 +189,32 @@ def resume(tinyrna_cwl_path: str, config_file: str, step: str) -> None:
 
     """
 
-    # Maps step to Configuration class
-    entry_config = {
+    # Map step to Configuration class
+    resume_config_class = {
         "tiny-count": ResumeCounterConfig,
         "tiny-plot": ResumePlotterConfig
-    }
+    }[step]
 
     print(f"Resuming pipeline execution at the {step} step...")
 
-    # Make appropriate config and workflow for this step; write modified workflow to disk
-    config = entry_config[step](config_file, f"{tinyrna_cwl_path}/workflows/tinyrna_wf.cwl")
-    resume_wf = f"{tinyrna_cwl_path}/workflows/tiny-resume.cwl"
-    config.write_workflow(resume_wf)
+    # The resume workflow is dynamically generated from the run workflow
+    base_workflow = f"{tinyrna_cwl_path}/workflows/tinyrna_wf.cwl" # The workflow to derive from
+    workflow_dyna = f"{tinyrna_cwl_path}/workflows/tiny-resume.cwl" # The dynamically generated workflow to write
+
+    config_object = resume_config_class(config_file, base_workflow)
+    config_object.write_processed_config(config_file)
+    config_object.write_workflow(workflow_dyna)
 
-    if config['run_native']:
-        # We can pass our config object directly without writing to disk first
-        run_cwltool_native(config, resume_wf)
+    if config_object['run_native']:
+        # Can pass the config object directly but still write to disk for autodocumentation
+        run_cwltool_native(config_object, workflow_dyna)
     else:
         # Processed Run Config must be written to disk first
-        resume_conf_file = config.get_outfile_path()
-        config.write_processed_config(resume_conf_file)
-        run_cwltool_subprocess(config, resume_wf)
+        run_cwltool_subprocess(config_object, workflow_dyna)
 
-    if os.path.isfile(resume_wf):
+    if os.path.isfile(workflow_dyna):
         # We don't want the generated workflow to be returned by a call to setup-cwl
-        os.remove(resume_wf)
-
+        os.remove(workflow_dyna)
 
 def run_cwltool_subprocess(config_object: 'ConfigBase', workflow: str, run_directory='.') -> int:
     """Executes the workflow using a command line invocation of cwltool

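Reviewer note on the `{...}[step]` change in `resume()`: indexing the dictionary at its definition means an unsupported step now fails immediately with a `KeyError`, before the "Resuming pipeline execution" message is printed. A minimal, self-contained sketch of that lookup follows. The two classes are empty stand-ins for the real `ResumeCounterConfig` and `ResumePlotterConfig`, and both the helper name `select_resume_config` and the friendlier `ValueError` are additions for illustration that are not present in the commit.

```python
class ResumeCounterConfig:
    """Stand-in for the real class of the same name (its body is not shown in this diff)."""


class ResumePlotterConfig:
    """Stand-in for the real class of the same name (its body is not shown in this diff)."""


def select_resume_config(step: str) -> type:
    """Map a resume step name to its configuration class.

    The commit indexes the dict directly; the ValueError wrapper here is
    an illustrative extra so the caller sees the supported step names.
    """
    resume_config_classes = {
        "tiny-count": ResumeCounterConfig,
        "tiny-plot": ResumePlotterConfig,
    }

    try:
        return resume_config_classes[step]
    except KeyError:
        supported = ", ".join(sorted(resume_config_classes))
        raise ValueError(f"Cannot resume at step '{step}'. Supported steps: {supported}") from None
```

Whether the extra error handling is worthwhile depends on whether `step` can ever come from unvalidated user input; if the CLI parser already restricts it, the direct index used in the commit is sufficient.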
tiny/rna/compatibility.py

Lines changed: 4 additions & 3 deletions
@@ -68,9 +68,10 @@ def add_mapping(doc: CommentedMap, prec_key, key_obj):
 
     # Comments & linebreaks are often (but not always!) attached to
     # the preceding key. Move them down to the new key.
-    inherit_prev = doc.ca.items[prec_key][2]
-    doc.ca.items[key] = [None, None, inherit_prev, None]
-    doc.ca.items[prec_key][2] = None
+    if prec_key in doc.ca.items:
+        inherit_prev = doc.ca.items[prec_key][2]
+        doc.ca.items[key] = [None, None, inherit_prev, None]
+        doc.ca.items[prec_key][2] = None
 
 
 class RunConfigCompatibility:
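
Reviewer note on the guard added in `add_mapping()`: before this change, a preceding key with no attached comment or linebreak would raise a `KeyError` on `doc.ca.items[prec_key]`. The sketch below reproduces the guarded move using a plain dict in place of the `doc.ca.items` mapping, so it runs without the YAML library. The helper name `move_trailing_comment` and the string used as a comment value are hypothetical; the four-slot list layout is copied from the diff.

```python
from typing import Any, Dict, List


def move_trailing_comment(comment_items: Dict[str, List[Any]], prec_key: str, key: str) -> None:
    """Move the comment stored in slot 2 of prec_key's entry onto a newly added key.

    `comment_items` stands in for `doc.ca.items` in the diff. Keys without
    any attached comment have no entry, which is the case the guard handles.
    """
    if prec_key in comment_items:
        inherit_prev = comment_items[prec_key][2]
        comment_items[key] = [None, None, inherit_prev, None]
        comment_items[prec_key][2] = None


# Usage: the comment after dir_name_logs follows the newly inserted key
items = {"dir_name_logs": [None, None, "\n\n### AUTOMATICALLY GENERATED ###", None]}
move_trailing_comment(items, "dir_name_logs", "dir_name_config")

# A preceding key with no comment entry is left alone instead of raising KeyError
move_trailing_comment(items, "run_deseq", "processed_run_config")
```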
