diff --git a/START_HERE/run_config.yml b/START_HERE/run_config.yml
index c905184f..206ce8e0 100644
--- a/START_HERE/run_config.yml
+++ b/START_HERE/run_config.yml
@@ -20,7 +20,7 @@ user:
 run_date: ~
 run_time: ~
-paths_config: ./paths.yml
+paths_config: paths.yml
 
 ##-- The label for final outputs --##
 ##-- If none provided, the default of user_tinyrna will be used --##
@@ -310,6 +310,7 @@ dir_name_tiny-count: tiny-count
 dir_name_tiny-deseq: tiny-deseq
 dir_name_tiny-plot: tiny-plot
 dir_name_logs: logs
+dir_name_config: config
 
 
 ######################### AUTOMATICALLY GENERATED CONFIGURATIONS #########################
@@ -320,7 +321,7 @@ dir_name_logs: logs
 #
 ###########################################################################################
 
-version: 1.4.0
+version: 1.5.0
 
 ######--------------------------- DERIVED FROM PATHS FILE ---------------------------######
 #
@@ -332,10 +333,10 @@ run_directory: ~
 tmp_directory: ~
 features_csv: { }
 samples_csv: { }
-paths_file: { }
 gff_files: [ ]
 run_bowtie_build: false
 reference_genome_files: [ ]
+bt_index_files: [ ]
 plot_style_sheet: ~
 adapter_fasta: ~
 ebwt: ~
@@ -356,10 +357,6 @@ in_fq: [ ]
 # output reports
 fastp_report_titles: [ ]
 
-###-- Utilized by bowtie --###
-# bowtie index files
-bt_index_files: [ ]
-
 ##-- Utilized by tiny-deseq.r --##
 # The control for comparison. If unspecified, all comparisons are made
 control_condition:
@@ -383,4 +380,11 @@ run_deseq: True
 ##-- Utilized by tiny-plot --##
 # Filters for class scatter plots
 plot_class_scatter_filter_include: []
-plot_class_scatter_filter_exclude: []
\ No newline at end of file
+plot_class_scatter_filter_exclude: []
+
+##-- Used to populate the directory defined in dir_name_config --##
+##-- CWL spec doesn't provide a way to get this info from within the workflow --##
+processed_run_config: {}
+
+##-- This is the paths_config key converted to a CWL file object for handling --##
+paths_file: {}
\ No newline at end of file
diff --git a/START_HERE/samples.csv b/START_HERE/samples.csv
index 47caddd8..ee799e65 100755
--- a/START_HERE/samples.csv
+++ b/START_HERE/samples.csv
@@ -1,4 +1,4 @@
-Input Files,Sample/Group Name,Replicate number,Control,Normalization
+Input Files,Sample/Group Name,Replicate Number,Control,Normalization
 ./fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE,
 ./fastq_files/cond1_rep2.fastq.gz,condition1,2,,
 ./fastq_files/cond1_rep3.fastq.gz,condition1,3,,
diff --git a/START_HERE/tinyRNA_TUTORIAL.md b/START_HERE/tinyRNA_TUTORIAL.md
index 3ca92a08..523836b6 100644
--- a/START_HERE/tinyRNA_TUTORIAL.md
+++ b/START_HERE/tinyRNA_TUTORIAL.md
@@ -32,7 +32,7 @@ And when you're done, you can close your terminal or use `conda deactivate` to r
 The output you see on your terminal is from `cwltool`, which coordinates the execution of the workflow CWL. The terminal output from individual steps is redirected to a logfile for later reference.
 
 ### File outputs
-When the analysis is complete you'll notice a new folder has appeared whose name contains the date and time of the run. Inside you'll find subdirectories containing the file and terminal outputs for each step, and the processed Run Config file for auto-documentation of the run.
+When the analysis is complete you'll notice a new timestamped folder has appeared. Inside you'll find subdirectories containing the file outputs for each step, and processed copies of your configuration files which serve as auto-documentation of the run. These configuration copies also allow for repeat analyses using the existing file outputs.
 
 ### Bowtie indexes
 Bowtie indexes were built during this run because `paths.yml` didn't define an `ebwt` prefix. Now, you'll see the `ebwt` points to the freshly built indexes in your run directory. This means that indexes won't be rebuilt during any subsequent runs that use this `paths.yml` file. If you need to rebuild your indexes, simply delete the value to the right of `ebwt` in paths.yml
diff --git a/doc/Pipeline.md b/doc/Pipeline.md
index 1fcdf495..1b5f7ab3 100644
--- a/doc/Pipeline.md
+++ b/doc/Pipeline.md
@@ -18,9 +18,10 @@ tiny replot --config processed_run_config.yml
 The `tiny run` command performs a comprehensive analysis of your [input files](../README.md#requirements-for-user-provided-input-files) according to the preferences defined in your [configuration files](Configuration.md).
 
 ## Resuming a Prior Analysis
-The tiny-count and tiny-plot steps offer a wide variety of options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. However, the earlier pipeline steps (fastp, tiny-collapse, and bowtie) handle the largest volume of data and are resource intensive, so you can save time by reusing their outputs for subsequent analyses. One could do so by running the later steps individually (e.g. using commands `tiny-count`, `tiny-deseq.r`, and `tiny-plot`), but assembling their commandline inputs by hand is labor-intensive and prone to spelling mistakes.
+The tiny-count and tiny-plot steps offer many options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. However, the earlier pipeline steps (fastp, tiny-collapse, and bowtie) handle the largest volume of data and are resource intensive, so you can save time by reusing their outputs for subsequent analyses.
+
+The commands `tiny recount` and `tiny replot` allow the workflow to be resumed using outputs from a prior run. The Run Directory for each end-to-end analysis will contain the run's four primary configuration files, and these files can be freely edited to change the resume run's behavior without sacrificing auto-documentation.
-The commands `tiny recount` and `tiny replot` seek to solve this problem. As discussed in the [Run Config documentation](Configuration.md#the-processed-run-config), the Run Directory for each end-to-end analysis will contain a processed Run Config, and this is the file that determines the behavior of a resume run.
 
 tiny recount
 
@@ -29,25 +30,19 @@ The commands `tiny recount` and `tiny replot` seek to solve this problem. As dis
 replot
 
-
-You can modify the behavior of a resume run by changing settings in:
-- The **processed** Run Config
-- The **original** Features Sheet that was used for the end-to-end run (as indicated by `features_csv` in the processed Run Config)
-- The **original** Paths File (as indicated by `paths_config` in the processed Run Config)
-
 ### The Steps
-1. Make and save the desired changes in the files above
-2. In your terminal, `cd` to the Run Directory of the end-to-end run you wish to resume
+1. Make and save changes to the configuration files within the target Run Directory
+2. In your terminal, `cd` to the target Run Directory
 3. Run the desired resume command
 
-### A Note on File Inputs
-File inputs are sourced from the **original** output subdirectories of prior steps in the target Run Directory. For `tiny replot`, this means that files from previous executions of `tiny recount` will **not** be used as inputs; only the original end-to-end outputs are used.
+### Auto-Documentation
+Among the subdirectories produced in your Run Directory after an end-to-end run, you'll find a directory named "config" which holds a copy of the run's four primary configuration files. These files serve as documentation for the run and, unlike those found at the root of the Run Directory, they should not be modified. A timestamped "config" directory is created after each resume run to similarly document the configurations that were used.
 
-### Where to Find Outputs from Resume Runs
+### Resume Run Outputs
 Output subdirectories for resume runs can be found alongside the originals, and will have a timestamp appended to their name to differentiate them.
 
-### Auto-Documentation of Resume Runs
-A new processed Run Config will be saved in the Run Directory at the beginning of each resume run. It will be labelled with the same timestamp used in the resume run's other outputs to differentiate it. It includes the changes to your Paths File and Run Config. A copy of your Features Sheet is saved to the timestamped tiny-count output directory during `tiny recount` runs.
+### Repeated Analyses
+If a `recount` run is performed and a `replot` is performed later in the same Run Directory, then only the outputs of the `recount` run are used for generating the plots. If multiple `recount` runs precede the `replot` then the most recent outputs are used.
 
 ## Parallelization
 Most steps in the pipeline run in parallel to minimize runtimes. This is particularly advantageous for multiprocessor systems like server environments. However, parallelization isn't always beneficial. If your computer doesn't have enough free memory, or if you have a large sample file set and/or reference genome, parallel execution might push your machine to its limits. When this happens you might see memory errors or your computer may become unresponsive. In these cases it makes more sense to run resource intensive steps one at a time, in serial, rather than in parallel. To do so, set `run_parallel: false` in your Run Config. This will affect fastp, tiny-collapse, and bowtie since these steps typically handle the largest volumes of data.
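A note on the "most recent outputs" behavior described in the Pipeline.md changes above: because the pipeline's timestamps are zero-padded (`YYYY-MM-DD_HH-MM-SS`), lexicographic order matches chronological order, so picking the newest `recount` output directory can reduce to a plain string comparison. A minimal sketch of that idea (the `most_recent` helper is illustrative only and is not part of this changeset):

```python
import re

# Zero-padded timestamp format used for resume-run output directories
timestamp_format = r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}"

def most_recent(dir_names):
    """Pick the most recently timestamped directory name. A name with no
    timestamp (the original end-to-end output) sorts before any resume run."""
    def ts_key(name):
        match = re.search(timestamp_format, name)
        return match.group(0) if match else ""
    return max(dir_names, key=ts_key)
```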
diff --git a/images/recount.png b/images/recount.png
index e5811d11..f81f7910 100644
Binary files a/images/recount.png and b/images/recount.png differ
diff --git a/images/replot.png b/images/replot.png
index 2dcdf4cf..0f26eeff 100644
Binary files a/images/replot.png and b/images/replot.png differ
diff --git a/tests/testdata/config_files/run_config_template.yml b/tests/testdata/config_files/run_config_template.yml
index 74cdd0b8..da96c76c 100644
--- a/tests/testdata/config_files/run_config_template.yml
+++ b/tests/testdata/config_files/run_config_template.yml
@@ -310,6 +310,7 @@ dir_name_tiny-count: tiny-count
 dir_name_tiny-deseq: tiny-deseq
 dir_name_tiny-plot: tiny-plot
 dir_name_logs: logs
+dir_name_config: config
 
 
 ######################### AUTOMATICALLY GENERATED CONFIGURATIONS #########################
@@ -320,7 +321,7 @@ dir_name_logs: logs
 #
 ###########################################################################################
 
-version: 1.4.0
+version: 1.5.0
 
 ######--------------------------- DERIVED FROM PATHS FILE ---------------------------######
 #
@@ -332,10 +333,10 @@ run_directory: ~
 tmp_directory: ~
 features_csv: { }
 samples_csv: { }
-paths_file: { }
 gff_files: [ ]
 run_bowtie_build: false
 reference_genome_files: [ ]
+bt_index_files: [ ]
 plot_style_sheet: ~
 adapter_fasta: ~
 ebwt: ~
@@ -356,10 +357,6 @@ in_fq: [ ]
 # output reports
 fastp_report_titles: [ ]
 
-###-- Utilized by bowtie --###
-# bowtie index files
-bt_index_files: [ ]
-
 ##-- Utilized by tiny-deseq.r --##
 # The control for comparison. If unspecified, all comparisons are made
 control_condition:
@@ -383,4 +380,11 @@ run_deseq: True
 ##-- Utilized by tiny-plot --##
 # Filters for class scatter plots
 plot_class_scatter_filter_include: []
-plot_class_scatter_filter_exclude: []
\ No newline at end of file
+plot_class_scatter_filter_exclude: []
+
+##-- Used to populate the directory defined in dir_name_config --##
+##-- CWL spec doesn't provide a way to get this info from within the workflow --##
+processed_run_config: {}
+
+##-- This is the paths_config key converted to a CWL file object for handling --##
+paths_file: {}
\ No newline at end of file
diff --git a/tiny/cwl/workflows/tinyrna_wf.cwl b/tiny/cwl/workflows/tinyrna_wf.cwl
index 0380bae8..9f6d332d 100644
--- a/tiny/cwl/workflows/tinyrna_wf.cwl
+++ b/tiny/cwl/workflows/tinyrna_wf.cwl
@@ -15,6 +15,7 @@ inputs:
   # multi input
   threads: int?
   run_name: string
+  processed_run_config: File
   sample_basenames: string[]
 
   # bowtie build
@@ -117,6 +118,7 @@ inputs:
   dir_name_tiny-count: string
   dir_name_tiny-deseq: string
   dir_name_tiny-plot: string
+  dir_name_config: string
 
 
 steps:
@@ -281,6 +283,14 @@ steps:
       - sample_avg_scatter_by_dge
       - sample_avg_scatter_by_dge_class
 
+  organize_config:
+    run: ../tools/make-subdir.cwl
+    in:
+      dir_files:
+        source: [ processed_run_config, paths_file, samples_csv, features_csv, plot_style_sheet ]
+      dir_name: dir_name_config
+    out: [ subdir ]
+
   organize_bt_indexes:
     run: ../tools/make-subdir.cwl
     when: $(inputs.run_bowtie_build)
@@ -353,6 +363,10 @@ steps:
 
 outputs:
   # Subdirectory outputs
+  config_out_dir:
+    type: Directory
+    outputSource: organize_config/subdir
+
   bt_build_out_dir:
     type: Directory?
     outputSource: organize_bt_indexes/subdir
diff --git a/tiny/entry.py b/tiny/entry.py
index 6660da41..fcebe933 100644
--- a/tiny/entry.py
+++ b/tiny/entry.py
@@ -189,32 +189,32 @@ def resume(tinyrna_cwl_path: str, config_file: str, step: str) -> None:
     """
 
-    # Maps step to Configuration class
-    entry_config = {
+    # Map step to Configuration class
+    resume_config_class = {
         "tiny-count": ResumeCounterConfig,
         "tiny-plot": ResumePlotterConfig
-    }
+    }[step]
 
     print(f"Resuming pipeline execution at the {step} step...")
 
-    # Make appropriate config and workflow for this step; write modified workflow to disk
-    config = entry_config[step](config_file, f"{tinyrna_cwl_path}/workflows/tinyrna_wf.cwl")
-    resume_wf = f"{tinyrna_cwl_path}/workflows/tiny-resume.cwl"
-    config.write_workflow(resume_wf)
+    # The resume workflow is dynamically generated from the run workflow
+    base_workflow = f"{tinyrna_cwl_path}/workflows/tinyrna_wf.cwl"   # The workflow to derive from
+    workflow_dyna = f"{tinyrna_cwl_path}/workflows/tiny-resume.cwl"  # The dynamically generated workflow to write
+
+    config_object = resume_config_class(config_file, base_workflow)
+    config_object.write_processed_config(config_file)
+    config_object.write_workflow(workflow_dyna)
 
-    if config['run_native']:
-        # We can pass our config object directly without writing to disk first
-        run_cwltool_native(config, resume_wf)
+    if config_object['run_native']:
+        # Can pass the config object directly but still write to disk for autodocumentation
+        run_cwltool_native(config_object, workflow_dyna)
     else:
         # Processed Run Config must be written to disk first
-        resume_conf_file = config.get_outfile_path()
-        config.write_processed_config(resume_conf_file)
-        run_cwltool_subprocess(config, resume_wf)
+        run_cwltool_subprocess(config_object, workflow_dyna)
 
-    if os.path.isfile(resume_wf):
+    if os.path.isfile(workflow_dyna):
         # We don't want the generated workflow to be returned by a call to setup-cwl
-        os.remove(resume_wf)
-
+        os.remove(workflow_dyna)
 
 def run_cwltool_subprocess(config_object: 'ConfigBase', workflow: str, run_directory='.') -> int:
     """Executes the workflow using a command line invocation of cwltool
diff --git a/tiny/rna/compatibility.py b/tiny/rna/compatibility.py
index c1fa9d7b..c3ee2142 100644
--- a/tiny/rna/compatibility.py
+++ b/tiny/rna/compatibility.py
@@ -68,9 +68,10 @@ def add_mapping(doc: CommentedMap, prec_key, key_obj):
 
     # Comments & linebreaks are often (but not always!) attached to
     # the preceding key. Move them down to the new key.
-    inherit_prev = doc.ca.items[prec_key][2]
-    doc.ca.items[key] = [None, None, inherit_prev, None]
-    doc.ca.items[prec_key][2] = None
+    if prec_key in doc.ca.items:
+        inherit_prev = doc.ca.items[prec_key][2]
+        doc.ca.items[key] = [None, None, inherit_prev, None]
+        doc.ca.items[prec_key][2] = None
 
 
 class RunConfigCompatibility:
diff --git a/tiny/rna/configuration.py b/tiny/rna/configuration.py
index b317fcb4..fba50804 100644
--- a/tiny/rna/configuration.py
+++ b/tiny/rna/configuration.py
@@ -3,6 +3,7 @@
 import shutil
 import errno
 import time
+import copy
 import sys
 import csv
 import re
@@ -196,7 +197,13 @@ def get_outfile_path(self, infile: str = None) -> str:
 
     def write_processed_config(self, filename: str = None) -> str:
         """Writes the current configuration to disk"""
-        if filename is None: filename = self.get_outfile_path(self.inf)
+        if filename is None:
+            filename = self.get_outfile_path(self.inf)
+
+        if "processed_run_config" in self:
+            # The CWL specification doesn't provide a way to get this info,
+            # but it's needed for run_directory/config, so we store it here
+            self['processed_run_config'] = self.cwl_file(filename, verify=False)
 
         with open(filename, 'w') as outconf:
             self.yaml.dump(self.config, outconf)
@@ -232,38 +239,30 @@ def __init__(self, config_file: str, validate_gffs=False, skip_setup=False):
         super().__init__(config_file, RunConfigCompatibility)
 
         self.paths = self.load_paths_config()
-        self.absorb_paths_file()
+        self.samples_sheet = self.load_samples_config()
+        self.features_sheet = self.load_features_config()
 
         if skip_setup: return
         self.setup_pipeline()
-        self.setup_file_groups()
         self.setup_ebwt_idx()
-        self.process_samples_sheet()
-        self.process_features_sheet()
         self.setup_step_inputs()
 
         if validate_gffs:
             self.validate_inputs()
 
-    def load_paths_config(self):
+    def load_paths_config(self) -> 'PathsFile':
         """Returns a PathsFile object and updates keys related to the Paths File path"""
 
-        # paths_config: user-specified
-        # Resolve the absolute path so that it remains valid when
-        # the processed Run Config is copied to the Run Directory
-        self['paths_config'] = self.from_here(self['paths_config'])
-
-        # paths_file: automatically generated
-        # CWL file dictionary is used as a workflow input
-        self['paths_file'] = self.cwl_file(self['paths_config'])
-
-        return PathsFile(self['paths_config'])
+        # Resolve relative path to the Paths File and construct
+        resolved = self.from_here(self['paths_config'])
+        paths = PathsFile(resolved)
 
-    def absorb_paths_file(self):
+        # Absorb PathsFile object keys into the configuration
         for key in [*PathsFile.single, *PathsFile.groups]:
-            self[key] = self.paths.as_cwl_file_obj(key)
+            self[key] = paths.as_cwl_file_obj(key)
         for key in PathsFile.prefix:
-            self[key] = self.paths[key]
+            self[key] = paths[key]
+        return paths
 
-    def process_samples_sheet(self):
+    def load_samples_config(self) -> 'SamplesSheet':
         samples_sheet_path = self.paths['samples_csv']
         samples_sheet = SamplesSheet(samples_sheet_path, context="Pipeline Start")
@@ -274,21 +273,13 @@
         self['in_fq'] = [self.cwl_file(fq, verify=False) for fq in samples_sheet.hts_samples]
         self['fastp_report_titles'] = [f"{g}_rep_{r}" for g, r in samples_sheet.groups_reps]
 
-    def process_features_sheet(self) -> List[dict]:
-        """Retrieves GFF Source and Type Filter definitions for use in GFFValidator"""
-        features_sheet_path = self.paths['features_csv']
-        reader = CSVReader(features_sheet_path, "Features Sheet").rows()
+        return samples_sheet
 
-        interests = ("Filter_s", "Filter_t")
-        return [{selector: rule[selector] for selector in interests}
-                for rule in reader]
-
-    def setup_file_groups(self):
-        """Configuration keys that represent lists of files"""
+    def load_features_config(self) -> 'FeaturesSheet':
+        """Retrieves GFF Source and Type Filter definitions for use in GFFValidator"""
 
-        self.set_default_dict({per_file_setting_key: [] for per_file_setting_key in
-                               ['in_fq', 'sample_basenames', 'gff_files', 'fastp_report_titles']
-                               })
+        features_sheet_path = self.paths['features_csv']
+        return FeaturesSheet(features_sheet_path, context="Pipeline Start")
 
     def setup_pipeline(self):
         """Overall settings for the whole pipeline"""
@@ -381,7 +372,7 @@ def validate_inputs(self):
         if gff_files:
             GFFValidator(
                 gff_files,
-                self.process_features_sheet(),
+                self.features_sheet.get_source_type_filters(),
                 self.paths['ebwt'] if not self['run_bowtie_build'] else None,
                 self.paths['reference_genome_files']
             ).validate()
@@ -411,11 +402,15 @@ def verify_bowtie_build_outputs(self):
 
     def save_run_profile(self, config_file_name=None) -> str:
         """Saves Samples Sheet and processed run config to the Run Directory for record keeping"""
-        from importlib.metadata import version
-        self['version'] = version('tinyrna')
+        run_dir = self['run_directory']
+        self.paths.save_run_profile(run_dir)
+        self.samples_sheet.save_run_profile(run_dir)
+        self.features_sheet.save_run_profile(run_dir)
+
+        # The paths_* keys should now point to the copy produced above
+        self['paths_file'] = self.cwl_file(os.path.join(run_dir, self.paths.basename))  # CWL file object
+        self['paths_config'] = self.paths.basename                                      # User-facing value
 
-        samples_sheet_name = os.path.basename(self['samples_csv']['path'])
-        shutil.copyfile(self['samples_csv']['path'], f"{self['run_directory']}/{samples_sheet_name}")
         return self.write_processed_config(config_file_name)
 
     """========== COMMAND LINE =========="""
@@ -478,7 +473,7 @@ class PathsFile(ConfigBase):
     Relative paths are automatically resolved on lookup and list types are
     enforced. While this is convenient, developers should be aware of the
     following caveats:
         - Lookups that return list values do not return the original object; don't
-          append to them. Instead, use the append_to() helper function.
+          expect modifications to stick. If appending, use append_to().
         - Chained assignments can produce unexpected results.
 
     Args:
@@ -494,8 +489,8 @@ class PathsFile(ConfigBase):
     groups = ('reference_genome_files', 'gff_files')
     prefix = ('ebwt', 'run_directory', 'tmp_directory')
 
-    # Parameters that need to be held constant between resume runs for analysis integrity
-    resume_forbidden = ('samples_csv', 'run_directory', 'ebwt', 'reference_genome_files')
+    # Parameters that should be held constant between resume runs
+    resume_forbidden = ('run_directory', 'ebwt', 'reference_genome_files')
 
     def __init__(self, file: str, in_pipeline=False):
         super().__init__(file)
@@ -639,6 +634,35 @@ def append_to(self, key: str, val: Any):
         target.append(val)
         return target
 
+    def save_run_profile(self, run_directory):
+        """Saves a copy of the Paths File to the Run Directory with amended paths.
+        Note the distinction between out_obj[key] and self[key]. The latter performs
+        automatic path resolution, whereas out_obj is essentially just a dict."""
+
+        out_obj = copy.deepcopy(self.config)
+        out_file = os.path.join(run_directory, self.basename)
+
+        adjacent_paths = self.required
+        absolute_paths = [path for path in (*self.single, *self.prefix)
+                          if path not in ("run_directory", *self.required)]
+
+        for adjacent in adjacent_paths:
+            out_obj[adjacent] = os.path.basename(self[adjacent])
+
+        for key in absolute_paths:
+            if not self.is_path_str(self[key]): continue
+            out_obj[key] = os.path.abspath(self[key])
+
+        for key in self.groups:
+            for i, entry in enumerate(self[key]):
+                if self.is_path_dict(entry):
+                    out_obj[key][i]['path'] = os.path.abspath(entry['path'])
+                elif self.is_path_str(entry):
+                    out_obj[key][i] = os.path.abspath(entry)
+
+        with open(out_file, 'w') as f:
+            self.yaml.dump(out_obj, f)
+
 
 class SamplesSheet:
     def __init__(self, file, context):
@@ -802,12 +826,76 @@ def validate_r_safe_sample_groups(sample_groups: Counter):
                 "The following group names are too similar and will cause a namespace collision in R:\n" \
                 + '\n'.join(collisions)
 
+    def save_run_profile(self, run_directory):
+        """Writes a copy of the CSV with absolute paths"""
+
+        outfile = os.path.join(run_directory, self.basename)
+        header = CSVReader.tinyrna_sheet_fields['Samples Sheet'].keys()
+        coldata = zip(self.hts_samples, self.groups_reps, self.normalizations)
+
+        with open(outfile, 'w', newline='') as out_csv:
+            csv_writer = csv.writer(out_csv)
+            csv_writer.writerow(header)
+            for sample, (group, rep), norm in coldata:
+                control = (group == self.control_condition) or ""
+                sample = os.path.abspath(sample)
+                csv_writer.writerow([sample, group, rep, control, norm])
+
     @staticmethod
     def get_sample_basename(filename):
         root, _ = os.path.splitext(filename)
         return os.path.basename(root)
 
 
+class FeaturesSheet:
+    def __init__(self, file, context):
+        self.csv = CSVReader(file, "Features Sheet")
+        self.basename = os.path.basename(file)
+        self.dir = os.path.dirname(file)
+        self.context = context
+        self.file = file
+
+        self.rules = []
+        self.read_csv()
+
+    def read_csv(self):
+        try:
+            rules, hierarchies = [], []
+            for rule in self.csv.rows():
+                rule['nt5end'] = rule['nt5end'].upper().translate({ord('U'): 'T'})  # Convert RNA base to cDNA base
+                rule['Identity'] = (rule.pop('Key'), rule.pop('Value'))             # Create identity tuple
+                rule['Overlap'] = rule['Overlap'].lower()                           # Built later in reference parsers
+                hierarchy = int(rule.pop('Hierarchy'))                              # Convert hierarchy to number
+
+                # Duplicate rules are screened out here
+                # Equality check omits hierarchy value
+                if rule not in rules:
+                    rules.append(rule)
+                    hierarchies.append(hierarchy)
+        except Exception as e:
+            msg = f"Error occurred on line {self.csv.row_num} of {self.basename}"
+            append_to_exception(e, msg)
+            raise
+
+        # Reunite hierarchy values with their rules
+        self.rules = [
+            dict(rule, Hierarchy=hierarchy)
+            for rule, hierarchy in zip(rules, hierarchies)
+        ]
+
+    def get_source_type_filters(self):
+        """Returns only the Source Filter and Type Filter columns"""
+
+        interests = ("Filter_s", "Filter_t")
+        return [{selector: rule[selector] for selector in interests}
+                for rule in self.rules]
+
+    def save_run_profile(self, run_directory):
+        """Copies the Features Sheet to the run directory"""
+
+        outfile = os.path.join(run_directory, self.basename)
+        shutil.copyfile(self.file, outfile)
+
 
 class CSVReader(csv.DictReader):
     """A simple wrapper class for csv.DictReader
@@ -852,7 +940,7 @@ def __init__(self, filename: str, doctype: str = None):
     def rows(self):
         self.replace_excel_ellipses()
 
-        with open(os.path.expanduser(self.tinyrna_file), 'r', encoding='utf-8-sig') as f:
+        with open(os.path.expanduser(self.tinyrna_file), 'r', encoding='utf-8-sig', newline='') as f:
             super().__init__(f, fieldnames=self.tinyrna_fields, delimiter=',')
 
             header = next(self)
diff --git a/tiny/rna/counter/counter.py b/tiny/rna/counter/counter.py
index e451941c..be4c30f4 100644
--- a/tiny/rna/counter/counter.py
+++ b/tiny/rna/counter/counter.py
@@ -4,7 +4,6 @@
 import traceback
 import argparse
 import sys
-import os
 
 from typing import List, Dict
 
@@ -12,11 +11,10 @@
 from tiny.rna.counter.features import Features, FeatureCounter
 from tiny.rna.counter.statistics import MergedStatsManager
 from tiny.rna.counter.hts_parsing import ReferenceFeatures, ReferenceSeqs, ReferenceBase
-from tiny.rna.configuration import PathsFile, SamplesSheet, CSVReader, get_templates
+from tiny.rna.configuration import PathsFile, SamplesSheet, FeaturesSheet, get_templates
 from tiny.rna.util import (
     report_execution_time,
     add_transparent_help,
-    append_to_exception,
     get_timestamp,
     ReadOnlyDict
 )
@@ -131,24 +129,10 @@ def load_config(features_csv: str, in_pipeline: bool) -> List[dict]:
         further digest to produce its rules table.
     """
 
-    sheet = CSVReader(features_csv, "Features Sheet")
-    rules = list()
-
-    try:
-        for rule in sheet.rows():
-            rule['nt5end'] = rule['nt5end'].upper().translate({ord('U'): 'T'})  # Convert RNA base to cDNA base
-            rule['Identity'] = (rule.pop('Key'), rule.pop('Value'))             # Create identity tuple
-            rule['Hierarchy'] = int(rule['Hierarchy'])                          # Convert hierarchy to number
-            rule['Overlap'] = rule['Overlap'].lower()                           # Built later in reference parsers
-
-            # Duplicate rule entries are not allowed
-            if rule not in rules: rules.append(rule)
-    except Exception as e:
-        msg = f"Error occurred on line {sheet.row_num} of {os.path.basename(features_csv)}"
-        append_to_exception(e, msg)
-        raise
+    context = "Pipeline Step" if in_pipeline else "Standalone Run"
+    features = FeaturesSheet(features_csv, context=context)
 
-    return rules
+    return features.rules
 
 
 def load_references(paths: PathsFile, libraries: List[dict], rules: List[dict], prefs) -> ReferenceBase:
diff --git a/tiny/rna/counter/features.py b/tiny/rna/counter/features.py
index bd32f353..e7c9ee61 100644
--- a/tiny/rna/counter/features.py
+++ b/tiny/rna/counter/features.py
@@ -102,11 +102,12 @@ class FeatureSelector:
     rules_table: List[dict]
     inv_ident: Dict[tuple, List[int]]
 
-    def __init__(self, rules: List[dict], **kwargs):
+    def __init__(self, rules: List[dict], **prefs):
         FeatureSelector.rules_table = self.build_selectors(rules)
         FeatureSelector.inv_ident = self.build_inverted_identities(FeatureSelector.rules_table)
         self.warnings = defaultdict(set)
         self.overlap_cache = {}
+        self.prefs = prefs
 
     @classmethod
     def choose(cls, candidates: Set[feature_record_tuple], alignment: dict) -> Mapping[str, set]:
diff --git a/tiny/rna/counter/statistics.py b/tiny/rna/counter/statistics.py
index aaa342dd..90d67eca 100644
--- a/tiny/rna/counter/statistics.py
+++ b/tiny/rna/counter/statistics.py
@@ -699,7 +699,7 @@ def write_alignment_tables(self):
         header = Diagnostics.alignment_columns
         for library_name, table in self.alignment_tables.items():
             outfile = make_filename([self.prefix, library_name, 'alignment_table'], ext='.csv')
-            with open(outfile, 'w') as ao:
+            with open(outfile, 'w', newline='') as ao:
                 csv_writer = csv.writer(ao)
                 csv_writer.writerow(header)
                 csv_writer.writerows(table)
diff --git a/tiny/rna/resume.py b/tiny/rna/resume.py
index 549f0b51..1046edb2 100644
--- a/tiny/rna/resume.py
+++ b/tiny/rna/resume.py
@@ -1,12 +1,13 @@
+import shutil
+import sys
 import os
 import re
-import sys
 
 from ruamel.yaml.comments import CommentedOrderedMap
 from abc import ABC, abstractmethod
 from glob import glob
 
-from tiny.rna.configuration import ConfigBase, PathsFile
+from tiny.rna.configuration import ConfigBase, PathsFile, SamplesSheet, FeaturesSheet
 from tiny.rna.compatibility import RunConfigCompatibility
 from tiny.rna.util import timestamp_format, get_timestamp
@@ -90,11 +91,35 @@ def _create_truncated_workflow(self):
             wf_steps[self.steps[0]]['in'][param] = new_input['var']
 
     def load_paths_config(self):
-        """Returns a PathsFile object and updates keys related to the Paths File path"""
+        """Returns a PathsFile object and updates keys if necessary
+
+        If paths_config is an absolute path then we assume this Run Directory was
+        created under the old auto-documentation approach (in the new approach, it
+        would be adjacent and therefore a basename). In order to allow for multiple
+        resumes on this old Run Directory, we upgrade it to use the new auto-doc
+        approach and save the existing processed Run Config to the /config subdir."""
 
-        self['paths_config'] = self.from_here(self['paths_config'])
-        self['paths_file'] = self.cwl_file(self['paths_config'])
-        return PathsFile(self['paths_config'])
+        paths = PathsFile(self['paths_config'])
+        if os.path.isabs(self['paths_config']):
+            run_dir = os.getcwd()
+            conf_dir = self['dir_name_config']
+
+            try:
+                # Handle existing Run Config
+                os.mkdir(conf_dir)
+                shutil.copyfile(self.inf, os.path.join(conf_dir, self.basename))
+            except FileExistsError:
+                msg = f"Could not resume old-style Run Directory (/{conf_dir} exists)."
+                raise FileExistsError(msg)
+
+            # Handle remaining config files
+            paths.save_run_profile(run_dir)
+            self['paths_config'] = paths.basename
+            self['paths_file'] = self.cwl_file(paths.basename)
+            SamplesSheet(paths['samples_csv'], "Pipeline Start").save_run_profile(run_dir)
+            FeaturesSheet(paths['features_csv'], "Pipeline Start").save_run_profile(run_dir)
+
+        return paths
 
     def assimilate_paths_file(self):
         """Updates the processed workflow with resume-safe Paths File parameters"""
@@ -106,24 +131,23 @@ def assimilate_paths_file(self):
             self[key] = self.paths[key]
 
     def _add_timestamps(self, steps):
-        """Differentiates resume-run output subdirs by adding a timestamp to them"""
+        """Differentiates resume-run output subdirs by appending a timestamp to their names"""
 
         # Rename output directories with timestamp
         for subdir in steps:
            step_dir = "dir_name_" + subdir
-            self[step_dir] = self[step_dir] + "_" + self.dt
+            self[step_dir] = self.append_or_replace_ts(self[step_dir])
 
-        # The logs dir isn't a workflow step but still needs a timestamp
-        self['dir_name_logs'] = self['dir_name_logs'] + "_" + self.dt
+        # The logs dir isn't from a workflow step but still needs a timestamp
+        self['dir_name_logs'] = self.append_or_replace_ts(self['dir_name_logs'])
 
-        # Update run_name output prefix variable for the current date and time
-        self['run_name'] = re.sub(timestamp_format, self.dt, self['run_name'])
+        # Update run_name output prefix with the current date and time
+        self['run_name'] = self.append_or_replace_ts(self['run_name'])
 
-    # Override
-    def get_outfile_path(self, infile: str = None) -> str:
-        if infile is None: infile = self.inf
-        root, ext = os.path.splitext(os.path.basename(infile))
-        return '_'.join(["resume", root, self.dt]) + ext
+    def append_or_replace_ts(self, s):
+        """Appends (or replaces) a timestamp at the end of the string"""
+        optional_timestamp = rf"(_{timestamp_format})|$"
+        return re.sub(optional_timestamp, "_" + self.dt, s, count=1)
 
     def write_workflow(self, workflow_outfile: str) -> None:
         with open(workflow_outfile, "w") as wf:
@@ -134,7 +158,7 @@ class ResumeCounterConfig(ResumeConfig):
     """A class for modifying the workflow and config to resume a run at tiny-count"""
 
     def __init__(self, processed_config, workflow):
-        steps = ["tiny-count", "tiny-deseq", "tiny-plot"]
+        steps = ["tiny-count", "tiny-deseq", "tiny-plot", "config"]
 
         inputs = {
             'aligned_seqs': {'var': "resume_sams", 'type': "File[]"},
@@ -153,26 +177,23 @@ def _rebuild_entry_inputs(self):
         File[] arrays with their corresponding pipeline outputs on disk.
""" - def cwl_file_resume(subdir, file): - try: - return self.cwl_file('/'.join([subdir, file])) - except FileNotFoundError as e: - sys.exit("The following pipeline output could not be found:\n%s" % (e.filename,)) - - resume_file_lists = ['resume_sams', 'resume_fastp_logs', 'resume_collapsed_fas'] - self.set_default_dict({key: [] for key in resume_file_lists}) + bowtie = self['dir_name_bowtie'] + fastp = self['dir_name_fastp'] + collapser = self['dir_name_tiny-collapse'] - for sample in self['sample_basenames']: - self['resume_sams'].append(cwl_file_resume(self['dir_name_bowtie'], sample + '_aligned_seqs.sam')) - self['resume_fastp_logs'].append(cwl_file_resume(self['dir_name_fastp'], sample + '_qc.json')) - self['resume_collapsed_fas'].append(cwl_file_resume(self['dir_name_tiny-collapse'], sample + '_collapsed.fa')) + try: + self['resume_sams'] = list(map(self.cwl_file, glob(bowtie + "/*_aligned_seqs.sam"))) + self['resume_fastp_logs'] = list(map(self.cwl_file, glob(fastp + "/*_qc.json"))) + self['resume_collapsed_fas'] = list(map(self.cwl_file, glob(collapser + "/*_collapsed.fa"))) + except FileNotFoundError as e: + sys.exit("The following pipeline output could not be found:\n%s" % (e.filename,)) class ResumePlotterConfig(ResumeConfig): """A class for modifying the workflow and config to resume a run at tiny-plot""" def __init__(self, processed_config, workflow): - steps = ["tiny-plot"] + steps = ["tiny-plot", "config"] inputs = { 'raw_counts': {'var': "resume_raw", 'type': "File"}, diff --git a/tiny/rna/util.py b/tiny/rna/util.py index 1670ef3b..96b296f5 100644 --- a/tiny/rna/util.py +++ b/tiny/rna/util.py @@ -213,7 +213,7 @@ def sorted_natural(lines, key=None, reverse=False): # For timestamp matching and creation -timestamp_format = re.compile(r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}") +timestamp_format = r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}" def get_timestamp(): return datetime.now().strftime('%Y-%m-%d_%H-%M-%S') diff --git 
a/tiny/templates/compatibility/run_config_compatibility.yml b/tiny/templates/compatibility/run_config_compatibility.yml index 8b13fb8d..e1d9a070 100644 --- a/tiny/templates/compatibility/run_config_compatibility.yml +++ b/tiny/templates/compatibility/run_config_compatibility.yml @@ -5,7 +5,14 @@ # - Adding mappings requires noting the key that should precede the new key # - Renames are evaluated before additions; preceding_key should use the new name if version renames it - +1.5.0: + remove: [] + rename: [] + add: + - preceding_key: dir_name_logs + dir_name_config: config + - preceding_key: plot_class_scatter_filter_exclude + processed_run_config: {} 1.4.0: remove: - counter_all_features diff --git a/tiny/templates/run_config_template.yml b/tiny/templates/run_config_template.yml index 782378fd..90079425 100644 --- a/tiny/templates/run_config_template.yml +++ b/tiny/templates/run_config_template.yml @@ -310,6 +310,7 @@ dir_name_tiny-count: tiny-count dir_name_tiny-deseq: tiny-deseq dir_name_tiny-plot: tiny-plot dir_name_logs: logs +dir_name_config: config ######################### AUTOMATICALLY GENERATED CONFIGURATIONS ######################### @@ -320,7 +321,7 @@ dir_name_logs: logs # ########################################################################################### -version: 1.4.0 +version: 1.5.0 ######--------------------------- DERIVED FROM PATHS FILE ---------------------------###### # @@ -332,10 +333,10 @@ run_directory: ~ tmp_directory: ~ features_csv: { } samples_csv: { } -paths_file: { } gff_files: [ ] run_bowtie_build: false reference_genome_files: [ ] +bt_index_files: [ ] plot_style_sheet: ~ adapter_fasta: ~ ebwt: ~ @@ -356,10 +357,6 @@ in_fq: [ ] # output reports fastp_report_titles: [ ] -###-- Utilized by bowtie --### -# bowtie index files -bt_index_files: [ ] - ##-- Utilized by tiny-deseq.r --## # The control for comparison. 
If unspecified, all comparisons are made control_condition: @@ -383,4 +380,11 @@ run_deseq: True ##-- Utilized by tiny-plot --## # Filters for class scatter plots plot_class_scatter_filter_include: [] -plot_class_scatter_filter_exclude: [] \ No newline at end of file +plot_class_scatter_filter_exclude: [] + +##-- Used to populate the directory defined in dir_name_config --## +##-- CWL spec doesn't provide a way to get this info from within the workflow --## +processed_run_config: {} + +##-- This is the paths_config key converted to a CWL file object for handling --## +paths_file: {} \ No newline at end of file
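A note on the timestamp change in this patch: the new `append_or_replace_ts` helper only works because `timestamp_format` in `tiny/rna/util.py` is now a raw string (interpolatable into a larger pattern) rather than a compiled regex. A standalone sketch of the behavior, with the timestamp passed as an argument in place of the class's `self.dt`:

```python
import re

# Same pattern as tiny/rna/util.py after this change
timestamp_format = r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}"

def append_or_replace_ts(s, dt):
    """Appends a timestamp to s, or replaces the one already at its end.

    The alternation matches an existing "_<timestamp>" suffix if present,
    otherwise the empty string at end-of-input; count=1 substitutes only
    the first match, so exactly one timestamp ever appears in the result.
    """
    optional_timestamp = rf"(_{timestamp_format})|$"
    return re.sub(optional_timestamp, "_" + dt, s, count=1)

# First resume appends; later resumes replace rather than stack timestamps
print(append_or_replace_ts("logs", "2024-05-06_07-08-09"))
print(append_or_replace_ts("logs_2023-01-02_03-04-05", "2024-05-06_07-08-09"))
```

Both calls print `logs_2024-05-06_07-08-09`; the old `self[step_dir] + "_" + self.dt` code would instead have grown the name by one timestamp per resume.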
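Similarly, the `_rebuild_entry_inputs` rewrite switches from constructing per-sample filenames to discovering a prior run's outputs on disk with `glob`. A minimal sketch of that discovery pattern, using a hypothetical `cwl_file` stand-in (the real method lives on the Configuration class) and a throwaway directory:

```python
import os
import tempfile
from glob import glob

def cwl_file(path):
    """Hypothetical stand-in for Configuration.cwl_file: wrap a path as a CWL File object."""
    return {"class": "File", "path": os.path.abspath(path)}

with tempfile.TemporaryDirectory() as run_dir:
    # Simulate a prior run's bowtie output subdirectory
    bowtie = os.path.join(run_dir, "bowtie")
    os.mkdir(bowtie)
    for sample in ("cond1_rep1", "cond1_rep2"):
        open(os.path.join(bowtie, sample + "_aligned_seqs.sam"), "w").close()

    # Discover outputs by pattern instead of reconstructing names per sample
    # (sorted here for a stable result; glob itself returns arbitrary order)
    resume_sams = list(map(cwl_file, sorted(glob(bowtie + "/*_aligned_seqs.sam"))))

print([os.path.basename(f["path"]) for f in resume_sams])
```

One apparent consequence of the glob approach: a missing or empty step directory simply yields an empty list rather than tripping a per-file `FileNotFoundError`, so the patch's `except` branch now covers only failures inside `cwl_file` itself.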