diff --git a/README.md b/README.md index 7a1f3889..be6ee175 100644 --- a/README.md +++ b/README.md @@ -85,7 +85,7 @@ The pipeline requires that you identify: For more information, please see the [configuration file documentation](doc/Configuration.md). The `START_HERE` directory demonstrates a working configuration using these files. You can also get a copy of them by running the command: ```shell -tiny get-template +tiny get-templates ``` diff --git a/START_HERE/paths.yml b/START_HERE/paths.yml index 444ac5c6..1bda5099 100644 --- a/START_HERE/paths.yml +++ b/START_HERE/paths.yml @@ -59,7 +59,7 @@ adapter_fasta: ######--------------------------------- tiny-plot -----------------------------------###### # # Optional: override the styles used by tiny-plot by providing your own .mplstyle sheet -# Run "tiny get-template" in your terminal to get a copy of the current style sheet +# Run "tiny get-templates" in your terminal to get a copy of the current style sheet # ######-------------------------------------------------------------------------------###### diff --git a/START_HERE/run_config.yml b/START_HERE/run_config.yml index d4278638..7dfe47da 100644 --- a/START_HERE/run_config.yml +++ b/START_HERE/run_config.yml @@ -317,7 +317,7 @@ dir_name_plotter: plots # ########################################################################################### -version: 1.2 +version: 1.2.1 ######--------------------------- DERIVED FROM PATHS FILE ---------------------------###### # diff --git a/START_HERE/samples.csv b/START_HERE/samples.csv index 99273e51..6455c9ae 100755 --- a/START_HERE/samples.csv +++ b/START_HERE/samples.csv @@ -1,4 +1,4 @@ -Input FastQ Files,Sample/Group Name,Replicate number,Control,Normalization +FASTQ/SAM Files,Sample/Group Name,Replicate number,Control,Normalization ./fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE, ./fastq_files/cond1_rep2.fastq.gz,condition1,2,, ./fastq_files/cond1_rep3.fastq.gz,condition1,3,, diff --git a/doc/Configuration.md 
b/doc/Configuration.md index 60fb75a7..94c23f66 100644 --- a/doc/Configuration.md +++ b/doc/Configuration.md @@ -9,7 +9,7 @@ The pipeline requires that you identify: The `START_HERE` directory demonstrates a working configuration using these files. You can also get a copy of them (and other optional template files) with: ``` -tiny get-template +tiny get-templates ``` ## Overview @@ -117,7 +117,7 @@ The final output directory name has three components: - The `run_directory` basename defined in your Paths File ## Samples Sheet Details -| _Column:_ | Input FASTQ Files | Sample/Group Name | Replicate Number | Control | Normalization | +| _Column:_ | FASTQ/SAM Files | Sample/Group Name | Replicate Number | Control | Normalization | |-----------:|---------------------|-------------------|------------------|---------|---------------| | _Example:_ | cond1_rep1.fastq.gz | condition1 | 1 | True | RPM | @@ -151,4 +151,4 @@ Rules that match features in the first stage of selection will be used in a seco See [tiny-count's documentation](tiny-count.md#feature-selection) for an explanation of each column. ## Plot Stylesheet Details -Matplotlib uses key-value "rc parameters" to allow for customization of its properties and styles, and one way these parameters can be specified is with a [matplotlibrc file](https://matplotlib.org/3.4.3/tutorials/introductory/customizing.html#a-sample-matplotlibrc-file), which we simply refer to as the Plot Stylesheet. You can obtain a copy of the default stylesheet used by tiny-plot with the command `tiny get-template`. Please keep in mind that tiny-plot overrides these defaults for a few specific elements of certain plots. Feel free to reach out if there is a plot style you wish to override but find you are unable to. 
\ No newline at end of file +Matplotlib uses key-value "rc parameters" to allow for customization of its properties and styles, and one way these parameters can be specified is with a [matplotlibrc file](https://matplotlib.org/3.4.3/tutorials/introductory/customizing.html#a-sample-matplotlibrc-file), which we simply refer to as the Plot Stylesheet. You can obtain a copy of the default stylesheet used by tiny-plot with the command `tiny get-templates`. Please keep in mind that tiny-plot overrides these defaults for a few specific elements of certain plots. Feel free to reach out if there is a plot style you wish to override but find you are unable to. \ No newline at end of file diff --git a/doc/Parameters.md b/doc/Parameters.md index fa9831ae..5bffa4db 100644 --- a/doc/Parameters.md +++ b/doc/Parameters.md @@ -63,22 +63,22 @@ Optional arguments: ## tiny-count -### All Features -| Run Config Key | Commandline Argument | -|------------------------|------------------------| -| counter_all_features: | `--all-features` | +### Get Templates +| Run Config Key | Commandline Argument | +|----------------|----------------------| +| | `--get-templates` | -By default, tiny-count will only evaluate alignments to features which match a `Select for...` & `with value...` of at least one rule in your Features Sheet. It is this matching feature set, and only this set, which is included in `feature_counts.csv` and therefore available for analysis by tiny-deseq.r and tiny-plot. Switching this option "on" will include all features in every input GFF file, regardless of attribute matches, for tiny-count and downstream steps. +Copies the template configuration files required by tiny-count into the current directory. This argument can't be combined with `--paths-file`. All other arguments are ignored when provided, and once the templates have been copied tiny-count exits. 
### Normalize by Hits - | Run Config Key | Commandline Argument | +| Run Config Key | Commandline Argument | |----------------------------|---------------------------| | counter-normalize-by-hits: | `--normalize-by-hits T/F` | By default, tiny-count will divide the number of counts associated with each sequence, twice, before they are assigned to a feature. Each unique sequence's count is determined by tiny-collapse (or a compatible collapsing utility) and is preserved through the alignment process. The original count is divided first by the number of loci that the sequence aligns to, and second by the number of features passing selection at each locus. Switching this option "off" disables the latter normalization step. ### Decollapse - | Run Config Key | Commandline Argument | +| Run Config Key | Commandline Argument | |---------------------|------------------------| | counter_decollapse: | `--decollapse` | @@ -89,24 +89,17 @@ The SAM files produced by the tinyRNA pipeline are collapsed by default; alignme |--------------------|----------------------| | counter_stepvector | `--stepvector` | -A custom Cython implementation of HTSeq's StepVector is used for finding features that overlap each alignment interval. While the core C++ component of the StepVector is the same, we have found that our Cython implementation can result in runtimes up to 50% faster than HTSeq's implementation. This parameter allows you to use HTSeq's StepVector if you wish (for example, if the Cython StepVector is incompatible with your system) - -### Allow Features with Multiple ID Values - | Run Config Key | Commandline Argument | -|------------------------|----------------------| -| counter_allow_multi_id | `--multi-id` | - -By default, an error will be produced if a GFF file contains a feature with multiple comma separated values listed under its ID attribute. 
Switching this option "on" instructs tiny-count to accept these features without error, but only the first listed value is used as the ID. +A custom Cython implementation of HTSeq's StepVector is used for finding features that overlap each alignment interval. While the core C++ component of the StepVector is the same, we have found that our Cython implementation can result in runtimes up to 50% faster than HTSeq's implementation. This parameter allows you to use HTSeq's StepVector if you wish. ### Is Pipeline - | Run Config Key | Commandline Argument | +| Run Config Key | Commandline Argument | |----------------|----------------------| | | `--is-pipeline` | This commandline argument tells tiny-count that it is running as a workflow step rather than a standalone/manual run. Under these conditions tiny-count will look for all input files in the current working directory regardless of the paths defined in the Samples Sheet and Features Sheet. ### Report Diags - | Run Config Key | Commandline Argument | +| Run Config Key | Commandline Argument | |----------------|----------------------| | counter_diags: | `--report-diags` | @@ -114,38 +107,46 @@ Diagnostic information will include intermediate alignment files for each librar ### Full tiny-count Help String ``` -tiny-count -pf PATHS -o OUTPUTPREFIX [-h] [-nh T/F] [-dc] - [-sv {Cython,HTSeq}] [-a] [-p] [-d] +tiny-count (-pf FILE | --get-templates) [-o PREFIX] [-nh T/F] [-dc] + [-sv {Cython,HTSeq}] [-p] [-d] -This submodule assigns feature counts for SAM alignments using a Feature Sheet -ruleset. If you find that you are sourcing all of your input files from a -prior run, we recommend that you instead run `tiny recount` within that run's -directory. 
+tiny-count is a precision counting tool for hierarchical classification and +quantification of small RNA-seq reads Required arguments: - -pf PATHS, --paths-file PATHS - your Paths File - -o OUTPUTPREFIX, --out-prefix OUTPUTPREFIX - output prefix to use for file names + You must either provide a Paths File or request templates for detailing + your configuration. + + -pf FILE, --paths-file FILE + your Paths File (default: None) + --get-templates Copies the template configuration files required by + tiny-count into the current directory. (default: + False) Optional arguments: - -h, --help show this help message and exit + These options can be used in conjunction with the Paths File (-pf) + argument mentioned above. + + -o PREFIX, --out-prefix PREFIX + The output prefix to use for file names. All + occurrences of the substring {timestamp} will be + replaced with the current date and time. (default: + tiny-count_{timestamp}) -nh T/F, --normalize-by-hits T/F If T/true, normalize counts by (selected) overlapping - feature counts. Default: true. + feature counts. (default: T) -dc, --decollapse Create a decollapsed copy of all SAM files listed in your Samples Sheet. This option is ignored for non- - collapsed inputs. + collapsed inputs. (default: False) -sv {Cython,HTSeq}, --stepvector {Cython,HTSeq} Select which StepVector implementation is used to find - features overlapping an interval. - -a, --all-features Represent all features in output counts table, even if - they did not match a Select for / with value. + features overlapping an interval. (default: Cython) -p, --is-pipeline Indicates that tiny-count was invoked as part of a pipeline run and that input files should be sourced as - such. + such. (default: False) -d, --report-diags Produce diagnostic information about - uncounted/eliminated selection elements. + uncounted/eliminated selection elements. 
(default: + False) ``` ## tiny-deseq.r @@ -198,28 +199,28 @@ Optional arguments: ## tiny-plot ### Plot Requests - | Run Config Key | Commandline Argument | +| Run Config Key | Commandline Argument | |----------------|------------------------------| | plot_requests: | `--plots PLOT PLOT PLOT ...` | tiny-plot will only produce the list of plots requested. ### P value - | Run Config Key | Commandline Argument | +| Run Config Key | Commandline Argument | |----------------|----------------------| | plot_pval: | `--p-value VALUE` | Feature expression levels are considered significant if their P value is less than this value, with a default of 0.05. Non-differentially expressed features are plotted as gray points, and in `sample_avg_scatter_by_dge_class`, these points are not colored by feature class. ### Style Sheet - | Run Config Key | Paths File Key | Commandline Argument | +| Run Config Key | Paths File Key | Commandline Argument | |----------------|-------------------|--------------------------| | | plot_style_sheet: | `--style-sheet MPLSTYLE` | The plot style sheet can be used to override the default Matplotlib styles used by tiny-plot. Unlike the other parameters, this option is found in the Paths File. See the [Plot Stylesheet documentation](Configuration.md#plot-stylesheet-details) for more information. ### Vector Scatter - | Run Config Key | Commandline Argument | +| Run Config Key | Commandline Argument | |---------------------|----------------------| | plot_vector_points: | `--vector-scatter` | @@ -227,7 +228,7 @@ The scatter plots produced by tiny-plot have rasterized points by default. This >**Note**: only scatter points are rasterized with this option switched "off"; all other elements are vectorized in every plot type. 
### Bounds for len_dist Charts - | Run Config Key | Commandline Argument | +| Run Config Key | Commandline Argument | |--------------------|------------------------| | plot_len_dist_min: | `--len-dist-min VALUE` | | plot_len_dist_max: | `--len-dist-max VALUE` | @@ -247,8 +248,8 @@ The labels that should be used for special groups in `class_charts` and `sample_ tiny-plot [-rc RAW_COUNTS] [-nc NORM_COUNTS] [-uc RULE_COUNTS] [-ss STAT] [-dge COMPARISON [COMPARISON ...]] [-len 5P_LEN [5P_LEN ...]] [-h] [-o PREFIX] [-pv VALUE] - [-s MPLSTYLE] [-v] [-ldi VALUE] [-lda VALUE] -p PLOT - [PLOT ...] + [-s MPLSTYLE] [-v] [-ldi VALUE] [-lda VALUE] [-una LABEL] + [-unk LABEL] -p PLOT [PLOT ...] This script produces basic static plots for publication as part of the tinyRNA workflow. Input file requirements vary by plot type and you are free to supply diff --git a/doc/Pipeline.md b/doc/Pipeline.md index aa0d8e74..92a617aa 100644 --- a/doc/Pipeline.md +++ b/doc/Pipeline.md @@ -3,7 +3,7 @@ The following commands deal with pipeline operations for carrying out end-to-end ```shell # Retrieving config files -tiny get-template +tiny get-templates tiny setup-cwl # End-to-end analysis diff --git a/doc/tiny-count.md b/doc/tiny-count.md index 2e83ff46..184fab49 100644 --- a/doc/tiny-count.md +++ b/doc/tiny-count.md @@ -4,19 +4,16 @@ For an explanation of tiny-count's parameters in the Run Config and by commandline, see [the parameters documentation](Parameters.md#tiny-count). ## Resuming an End-to-End Analysis -tiny-count offers a variety of options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. However, the earlier pipeline steps are resource and time intensive, so it is inconvenient to rerun an end-to-end analysis to test new selection rules. Using the command `tiny recount`, tinyRNA will run the workflow starting at the tiny-count step using inputs from a prior end-to-end run. 
See the [pipeline resume documentation](Pipeline.md#resuming-a-prior-analysis) for details and prerequesites. +tiny-count offers a variety of options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. Using the command `tiny recount`, tinyRNA will run the workflow starting at the tiny-count step using inputs from a prior end-to-end run to save time. See the [pipeline resume documentation](Pipeline.md#resuming-a-prior-analysis) for details and prerequisites. ## Running as a Standalone Tool -If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command requires that you specify the paths to your Samples Sheet and Features Sheet, and a filename prefix for outputs. [All other arguments are optional](Parameters.md#full-tiny-count-help-string). You will need to make a copy of your Samples Sheet and modify it so that the `Input FASTQ Files` column instead contains paths to the corresponding SAM files from a prior end-to-end run. SAM files from third party sources are also supported, and can be produced from reads collapsed by tiny-collapse or fastx, or from non-collapsed reads. - ->**Important:** reusing the same output filename prefix between standalone runs will result in prior outputs being overwritten. +If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command has [one required argument](Parameters.md#full-tiny-count-help-string): your Paths File. Your Samples Sheet will need to list SAM files rather than FASTQ files in the `FASTQ/SAM Files` column. SAM files from third party sources are also supported, and if they have been produced from reads collapsed by tiny-collapse or fastx, tiny-count will honor the reported read counts. 
#### Using Non-collapsed Sequence Alignments -While third-party SAM files from non-collapsed reads are supported, there are some caveats. These files will result in substantially higher resource usage and runtimes; we strongly recommend collapsing prior to alignment. Additionally, the sequence-related stats produced by tiny-count will no longer represent _unique_ sequences. These stats will instead refer to all sequences with unique QNAMEs (that is, multi-alignment bundles still cary a sequence count of 1.) +While third-party SAM files from non-collapsed reads are supported, there are some caveats. These files will result in substantially higher resource usage and runtimes; we strongly recommend collapsing prior to alignment. Additionally, the sequence-related stats produced by tiny-count will no longer represent _unique_ sequences. These stats will instead refer to all sequences with unique QNAMEs (that is, multi-alignment bundles still carry a sequence count of 1). # Feature Selection -![Feature Selection Diagram](../images/tiny-count_selection.png) We provide a Features Sheet (`features.csv`) in which you can define selection rules to more accurately capture counts for the small RNAs of interest. The parameters for these rules include attributes commonly used in the classification of small RNAs, such as length, strandedness, and 5' nucleotide. @@ -26,6 +23,8 @@ Selection occurs in three stages, with the output of each stage as input to the 1. Features are matched to rules based on their attributes defined in GFF files 2. At each alignment locus, overlapping features are selected based on the overlap requirements of their matched rules. Selected features are sorted by hierarchy value so that smaller values take precedence in the next stage. 3. Finally, features are selected for read assignment based on the small RNA attributes of the alignment locus. Once reads are assigned to a feature, they are excluded from matches with larger hierarchy values. 
+ +![Feature Selection Diagram](../images/tiny-count_selection.png) ## Stage 1: Feature Attribute Parameters | _features.csv columns:_ | Select for... | with value... | Classify as... | Source Filter | Type Filter | diff --git a/setup.py b/setup.py index c85a058c..f1bbc1b1 100644 --- a/setup.py +++ b/setup.py @@ -14,7 +14,7 @@ AUTHOR = 'Kristen Brown, Alex Tate' PLATFORM = 'Unix' REQUIRES_PYTHON = '>=3.9.0' -VERSION = '1.2' +VERSION = '1.2.1' REQUIRED = [] # Required packages are installed via Conda's environment.yml diff --git a/tests/testdata/config_files/features.csv b/tests/testdata/config_files/features.csv new file mode 100755 index 00000000..66977d07 --- /dev/null +++ b/tests/testdata/config_files/features.csv @@ -0,0 +1,7 @@ +Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Strand,5' End Nucleotide,Length,Overlap +Class,mask,,,,1,both,all,all,Partial +Class,miRNA,,,,2,sense,all,16-22,Full +Class,piRNA,5pA,,,2,both,A,24-32,Full +Class,piRNA,5pT,,,2,both,T,24-32,Full +Class,siRNA,,,,2,both,all,15-22,Full +Class,unk,,,,3,both,all,all,Full \ No newline at end of file diff --git a/tests/testdata/config_files/paths.yml b/tests/testdata/config_files/paths.yml new file mode 100644 index 00000000..125e6a76 --- /dev/null +++ b/tests/testdata/config_files/paths.yml @@ -0,0 +1,66 @@ +############################## MAIN INPUT FILES FOR ANALYSIS ############################## +# +# Edit this section to provide the path to your Samples and Features sheets. Relative and +# absolute paths are both allowed. All relative paths are relative to THIS config file. +# +# Directions: +# 1. Fill out the Samples Sheet with files to process + naming scheme. [samples.csv] +# 2. Fill out the Features Sheet with selection rules [features.csv] +# 3. Set samples_csv and features_csv (below) to point to these files +# 4. 
Add annotation files and per-file alias preferences to gff_files +# +######-------------------------------------------------------------------------------###### + +##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --## +samples_csv: samples.csv +features_csv: features.csv + +##-- Each entry: 1. the file, 2. (optional) list of attribute keys for feature aliases --## +gff_files: +- path: "../../../START_HERE/reference_data/ram1.gff3" + alias: [ID] +#- path: +# alias: [ ] + +##-- The final output directory for files produced by the pipeline --# +run_directory: run_directory + +##-- The directory for temporary files. Determined by cwltool if blank. --## +tmp_directory: + +######-------------------------------- BOWTIE-BUILD ---------------------------------###### +# +# To build bowtie indexes: +# 1. Your reference genome file(s) must be listed under reference_genome_files (below) +# 2. ebwt (below) must be empty (nothing after ":") +# +# Once your indexes have been built, this config file will be modified such +# that ebwt points to their location (prefix) within your Run Directory. This +# means that indexes will not be unnecessarily rebuilt on subsequent runs. 
If +# you need them rebuilt, simply set ebwt: '' +# +######-------------------------------------------------------------------------------###### + +##-- The prefix for your bowtie index, include relative path (relative to this config file) --## +##-- If you do not have a bowtie index, change this to ebwt: '' +ebwt: '' + +##-- If you do not have a bowtie index, provide your reference genome file(s) here --## +##-- One file per line, with "- " at the beginning (think: bulleted list) --## +reference_genome_files: +- '../../../START_HERE/reference_data/ram1.fa' + +######----------------------------------- fastp -------------------------------------###### +# Optional: provide a FASTA file containing the specific adapters you wish to trim +######-------------------------------------------------------------------------------###### + +adapter_fasta: + +######--------------------------------- tiny-plot -----------------------------------###### +# +# Optional: override the styles used by tiny-plot by providing your own .mplstyle sheet +# Run "tiny get-templates" in your terminal to get a copy of the current style sheet +# +######-------------------------------------------------------------------------------###### + +plot_style_sheet: diff --git a/tests/testdata/config_files/run_config_template.yml b/tests/testdata/config_files/run_config_template.yml new file mode 100644 index 00000000..b59dee56 --- /dev/null +++ b/tests/testdata/config_files/run_config_template.yml @@ -0,0 +1,370 @@ +######----------------------------- tinyRNA Configuration -----------------------------###### +# +# In this file you can specify your configuration preferences for the workflow and +# each workflow step. +# +# If you want to use DEFAULT settings for the workflow, all you need to do is provide the path +# to your Samples Sheet and Features Sheet in your Paths File, then make sure that the +# 'paths_config' setting below points to your Paths File. +# +# We suggest that you also: +# 1. 
Add a username to identify the person performing runs, if desired for record keeping +# 2. Add a run directory name in your Paths File. If not provided, "run_directory" is used +# 3. Add a run name to label your run directory and run-specific summary reports. +# If not provided, user_tinyrna will be used. +# +# This file will be further processed at run time to generate the appropriate pipeline +# settings for each workflow step. A copy of this processed configuration will be stored +# in your run directory. +# +######-------------------------------------------------------------------------------###### + +user: ~ +run_date: ~ +run_time: ~ +paths_config: paths.yml + +##-- The label for final outputs --## +##-- If none provided, the default of user_tinyrna will be used --## +run_name: test_run_config + +##-- Number of threads to use when a step supports multi-threading --## +##-- For best performance, this should be equal to your computer's processor core count --## +threads: 4 + +##-- Control the amount of information printed to terminal: debug, normal, quiet --## +verbosity: normal + +##-- If True: process fastp, tiny-collapse, and bowtie in parallel per-library --## +run_parallel: true + +##-- (EXPERIMENTAL) If True: execute the pipeline using native cwltool Python --## +run_native: false + +######------------------------- BOWTIE INDEX BUILD OPTIONS --------------------------###### +# +# If you do not already have bowtie indexes, they can be built for you by setting +# run_bowtie_build (above) to true and adding your reference genome file(s) to your +# paths_config file. +# +# We have specified default parameters for small RNA data based on our own "best practices". +# You can change the parameters here. 
+# +######-------------------------------------------------------------------------------###### + + +##-- SA is sampled every 2^offRate BWT chars (default: 5) +offrate: ~ + +##-- Convert Ns in reference to As --## +ntoa: false + +##-- Don't build .3/.4.ebwt (packed reference) portion --## +noref: false + +##-- Number of chars consumed in initial lookup (default: 10) --## +ftabchars: ~ + + +######---------------------TRIMMING AND QUALITY FILTER OPTIONS ----------------------###### +# +# We use the program fastp to perform: adapter trimming (req), quality filtering (on), +# and QC analysis for an output QC report. See https://github.com/OpenGene/fastp for more +# information on the fastp tool. We have limited the options available to those appropriate +# for small RNA sequencing data. If you require an additional option, create an issue on the +# pipeline github: https://github.com/MontgomeryLab/tinyrna +# +# We have specified default parameters for small RNA data based on our own "best practices". +# You can change the parameters here. 
+# +######-------------------------------------------------------------------------------###### + + +##-- Adapter sequence to trim --## +adapter_sequence: 'auto' + +##-- Minimum & maximum accepted lengths after trimming --## +length_required: 15 +length_limit: 35 + +##-- Minimum average score for a read to pass quality filter --## +average_qual: 25 + +##-- Minimum phred score for a base to pass quality filter --## +qualified_quality_phred: 20 + +##-- Maximum % of bases that can be below minimum phred score (above) --## +unqualified_percent_limit: 10 + +##-- Maximum allowed number of N bases --## +n_base_limit: 1 + +##-- Compression level for gzip output --## +compression: 4 + +###-- Unused optional inputs: Remove '#' in front to use --### +##-- Trim poly x tails of a given length --## +# trim_poly_x: false +# poly_x_min_len: 0 + +##-- Trim n bases from the front/tail of a read --## +# trim_front1: 0 +# trim_tail1: 0 + +##-- Is the data phred 64? --## +# fp_phred64: False + +##-- Turn on overrepresentation sampling analysis --## +# overrepresentation_sampling: 0 +# overrepresentation_analysis: false + +##-- If true: don't overwrite the files --## +# dont_overwrite: false + +##-- If true: disable these options --## +# disable_quality_filtering: false +# disable_length_filtering: false +# disable_adapter_trimming: false + + +######--------------------------- READ COLLAPSER OPTIONS ----------------------------###### +# +# We use a custom Python utility for collapsing duplicate reads. +# We recommend using the default (keep all reads, or threshold: 0). +# Sequences <= threshold will not be included in downstream steps. +# Trimming takes place prior to counting/collapsing. +# +# We have specified default parameters for small RNA data based on our own "best practices". +# You can change the parameters here. 
+# +######-------------------------------------------------------------------------------###### + +##-- Trim the specified number of bases from the 5' end of each sequence --## +5p_trim: 0 + +##-- Trim the specified number of bases from the 3' end of each sequence --## +3p_trim: 0 + +##-- Sequences with count <= threshold will be placed in a separate low_counts fasta --## +threshold: 0 + +##-- If True: outputs will be gzip compressed --## +compress: False + + +######-------------------------- BOWTIE ALIGNMENT OPTIONS ---------------------------###### +# +# We use bowtie for read alignment to a genome. +# +# We have specified default parameters for small RNA data based on our own "best practices". +# You can change the parameters here. +# +######-------------------------------------------------------------------------------###### + + +##-- Report end-to-end hits w/ <=v mismatches; ignore qualities --## +end_to_end: 0 + +##-- Report all alignments per read (much slower than low -k) --## +all_aln: True + +##-- Seed for random number generator --## +seed: 0 + +##-- Suppress SAM records for unaligned reads --## +no_unal: True + +##-- Use shared mem for index; many bowtie's can share --## +##-- Note: this requires further configuration of your OS --## +##-- http://bowtie-bio.sourceforge.net/manual.shtml#bowtie-options-shmem --## +shared_memory: False + +###-- Unused option inputs: Remove '#' in front to use --### +##-- Hits are guaranteed best stratum, sorted; ties broken by quality --## +#best: False + +##-- Hits in sub-optimal strata aren't reported (requires best, ^^^^) --## +#strata: False + +##-- Max mismatches in seed (can be 0-3, default: -n 2) --## +#seedmms: 2 + +##-- Seed length for seedmms (default: 28) --## +#seedlen: 28 + +##-- Do not align to reverse-complement reference --## +# norc: False + +##-- Do not align to forward reference --## +# nofw: False + +##-- Input quals are Phred+64 (same as --solexa1.3-quals) --## +# bt_phred64: False + +##-- Report up to 
good alignments per read (default: 1) --## +# k_aln + +##-- Number of bases to trim from 5' or 3' end of reads --## +# trim5: 0 +# trim3: 0 + +##-- Input quals are from GA Pipeline ver. < 1.3 --## +# solexa: false + +##-- Input quals are from GA Pipeline ver. >= 1.3 --## +# solexa13: false + + +######--------------------------- FEATURE COUNTER OPTIONS ---------------------------###### +# +# We use a custom Python utility that utilizes HTSeq's Genomic Array of Sets and GFF reader +# to count small RNA reads. Selection rules are defined in your Features Sheet. +# +######-------------------------------------------------------------------------------###### + + +##-- If True: show all parsed features in the counts csv, regardless of count/identity --## +counter_all_features: False + +##-- If True: counts will be normalized by genomic hits AND selected feature count --## +##-- If False: counts will only be normalized by genomic hits --## +counter_normalize_by_hits: True + +##-- If True: a decollapsed copy of each SAM file will be produced (useful for IGV) --## +counter_decollapse: False + +##-- Select the StepVector implementation that is used. Options: HTSeq or Cython --## +counter_stepvector: 'Cython' + +##-- If True: produce diagnostic logs to indicate what was eliminated and why --## +counter_diags: False + + +######--------------------------- DIFFERENTIAL EXPRESSION ---------------------------###### +# +# Differential expression analysis is performed using the DESeq2 R library. +# +######-------------------------------------------------------------------------------###### + + +##-- If True: produce a principal component analysis plot from the input dataset --## +dge_pca_plot: True + +##-- If True: before analysis, drop features which have a zero count across all samples --## +dge_drop_zero: False + + +######-------------------------------- PLOTTING OPTIONS -----------------------------###### +# +# We use a custom Python script for creating all plots. 
If you wish to use another matplotlib +# stylesheet you can specify that in the Paths File. +# +# We have specified default parameters for small RNA data based on our own "best practices". +# You can change the parameters here. +# +######-------------------------------------------------------------------------------###### + + +##-- Enable plots by uncommenting (removing the '#') for the desired plot type --## +##-- Disable plots by commenting (adding a '#') for the undesired plot type --## +plot_requests: + - 'len_dist' + - 'rule_charts' + - 'class_charts' + - 'replicate_scatter' + - 'sample_avg_scatter_by_dge' + - 'sample_avg_scatter_by_dge_class' + +##-- You can set a custom P value to use in DGE scatter plots. Default: 0.05 --## +plot_pval: ~ + +##-- If True: scatter plot points will be vectorized. If False, only points are raster --## +plot_vector_points: False + +##-- Optionally set the min and/or max lengths for len_dist plots; auto if unset --## +plot_len_dist_min: +plot_len_dist_max: + +##-- Use this label in class plots for counts assigned by rules lacking a classifier --## +plot_unknown_class: "_UNKNOWN_" + +##-- Use this label in class plots for unassigned counts --## +plot_unassigned_class: "_UNASSIGNED_" + + +######----------------------------- OUTPUT DIRECTORIES ------------------------------###### +# +# Outputs for each step are organized into their own subdirectories in your run +# directory. You can set these folder names here. +# +######-------------------------------------------------------------------------------###### + + +dir_name_bt_build: bowtie-build +dir_name_fastp: fastp +dir_name_collapser: collapser +dir_name_bowtie: bowtie +dir_name_counter: counter +dir_name_dge: DGE +dir_name_plotter: plots + + +######################### AUTOMATICALLY GENERATED CONFIGURATIONS ######################### +# +# Do not make any changes to the following sections. 
These options are automatically +# generated using your Paths File, your Samples and Features sheets, and the above +# settings in this file. +# +########################################################################################### + +version: 1.2 + +######--------------------------- DERIVED FROM PATHS FILE ---------------------------###### +# +# The following configuration settings are automatically derived from the Paths File +# +######-------------------------------------------------------------------------------###### + +run_directory: ~ +tmp_directory: ~ +features_csv: { } +samples_csv: { } +paths_file: { } +gff_files: [ ] +run_bowtie_build: false +reference_genome_files: [ ] +plot_style_sheet: ~ +adapter_fasta: ~ +ebwt: ~ + + +######------------------------- DERIVED FROM SAMPLES SHEET --------------------------###### +# +# The following configuration settings are automatically derived from the Samples Sheet +# +######-------------------------------------------------------------------------------###### + +##-- Utilized by fastp, tiny-collapse, and bowtie --## +sample_basenames: [ ] + +##-- Utilized by fastp --## +# input fastq files +in_fq: [ ] +# output reports +fastp_report_titles: [ ] + +###-- Utilized by bowtie --### +# bowtie index files +bt_index_files: [ ] + +##-- Utilized by tiny-deseq.r --## +# The control for comparison. 
If unspecified, all comparisons are made +control_condition: +# If the experiment design yields less than one degree of freedom, tiny-deseq.r is skipped +run_deseq: True + +######------------------------- DERIVED FROM FEATURES SHEET -------------------------###### +# +# The following configuration settings are automatically derived from the Features Sheet +# +######-------------------------------------------------------------------------------###### \ No newline at end of file diff --git a/tests/testdata/config_files/samples.csv b/tests/testdata/config_files/samples.csv new file mode 100755 index 00000000..d96b524e --- /dev/null +++ b/tests/testdata/config_files/samples.csv @@ -0,0 +1,7 @@ +FASTQ/SAM Files,Sample/Group Name,Replicate Number,Control,Normalization +../../../START_HERE/fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE, +../../../START_HERE/fastq_files/cond1_rep2.fastq.gz,condition1,2,, +../../../START_HERE/fastq_files/cond1_rep3.fastq.gz,condition1,3,, +../../../START_HERE/fastq_files/cond2_rep1.fastq.gz,condition2,1,, +../../../START_HERE/fastq_files/cond2_rep2.fastq.gz,condition2,2,, +../../../START_HERE/fastq_files/cond2_rep3.fastq.gz,condition2,3,, \ No newline at end of file diff --git a/tests/testdata/config_files/tinyrna-light.mplstyle b/tests/testdata/config_files/tinyrna-light.mplstyle new file mode 100644 index 00000000..79642acd --- /dev/null +++ b/tests/testdata/config_files/tinyrna-light.mplstyle @@ -0,0 +1,99 @@ +# Matplotlib style sheet for Small RNA plots +# === DOCUMENTATION for this version =============================================================== +# https://matplotlib.org/3.5.2/tutorials/introductory/customizing.html#the-default-matplotlibrc-file + +#### Figure basics #### +savefig.dpi: 300 +figure.autolayout: true +figure.facecolor: None +figure.edgecolor: None + +#### Scatter Styles #### +scatter.marker: . 
+ +#### Tick styles #### +xtick.color: 333333 +xtick.direction: out +ytick.color: 333333 +ytick.direction: out + +xtick.major.size: 5.0 +ytick.major.size: 5.0 +xtick.minor.size: 1.5 +ytick.minor.size: 1.5 +xtick.major.width: 0.4 +ytick.major.width: 0.4 +xtick.minor.width: 0.4 +ytick.minor.width: 0.4 + +xtick.major.pad: 3.5 +ytick.major.pad: 3.5 + +xtick.labelsize: 20 +ytick.labelsize: 20 + +#### Axes basics #### +axes.labelsize: 20 +axes.titlesize: 20 +axes.facecolor: white +axes.edgecolor: 333333 +axes.linewidth: 0.4 +axes.grid: True +axes.labelcolor: 000000 +axes.labelpad: 4.0 +axes.axisbelow: True +axes.autolimit_mode: round_numbers +axes.xmargin: 0.04 +axes.ymargin: 0.04 + +#### Legend #### +legend.fontsize: 20 +legend.framealpha: 0.8 +legend.edgecolor: 0.8 +legend.markerscale: 3.0 +legend.borderpad: 0.4 +legend.labelspacing: 0.5 +legend.handletextpad: 0.8 +legend.borderaxespad: 0.5 +legend.columnspacing: 2.0 + +#### Save options #### +pdf.fonttype: 42 +ps.fonttype: 42 +savefig.pad_inches: 0.1 + +#### Font #### +font.family: sans-serif +font.sans-serif: Arial +font.size: 10 + +#### TeX Font #### +mathtext.default: regular + +#### Line style #### +lines.linewidth: 0.4 +lines.markersize: 7.5 +lines.markeredgewidth: 0 + +#### Color cycle Default #### +#### Default colors used when color isn't specified in code +axes.prop_cycle: cycler('color', ['F1605D', '2980B9', 'FDC010', 'A5D38E', 'ED2891', '09535B', 'A32225', '971BF0', '17D1C9', '82C046', 'F0732A', 'E9D9A3']) + # F1605D : light red + # 2980B9 : blue + # FDC010 : gold-yellow + # A5D38E : light green + # ED2891 : hot pink + # 09535B : dark blue + # A32225 : deep red + # 971BF0 : purple + # 17D1C9 : turquoise + # 82C046 : green + # F0732A : orange + # E9D9A3 : tan + +#### Grid style #### +grid.color: 333333 +grid.linestyle: : +grid.linewidth: 0.4 +grid.alpha: 0.2 + diff --git a/tests/unit_test_helpers.py b/tests/unit_test_helpers.py index affa6b29..3bbc6579 100644 --- a/tests/unit_test_helpers.py +++ 
b/tests/unit_test_helpers.py @@ -49,11 +49,11 @@ def csv_factory(type: str, rows: List[dict], header=()): return csv_string.getvalue() -paths_template_file = os.path.abspath('../tiny/templates/paths.yml') +paths_template_file = os.path.abspath('./testdata/config_files/paths.yml') def make_paths_file(in_pipeline=False, prefs=None): - """IMPORTANT: relative file paths are evaluated relative to /tiny/templates/""" + """IMPORTANT: relative file paths are evaluated relative to /tests/testdata/config_files""" paths_file = paths_template_file config = PathsFile(paths_file, in_pipeline) diff --git a/tests/unit_tests_collapser.py b/tests/unit_tests_collapser.py index 41d032e2..52b3e319 100644 --- a/tests/unit_tests_collapser.py +++ b/tests/unit_tests_collapser.py @@ -94,7 +94,7 @@ def test_seq_counter_full(self): def test_seq_counter_gzip(self): # MIN TEST # Need to patch builtins.open in gzip module scope - with patch('tiny.rna.collapser.gzip.builtins.open', new=mock_open(read_data=self.min_fastq_gz)): + with patch('tiny.rna.util.gzip.builtins.open', new=mock_open(read_data=self.min_fastq_gz)): # Read the mock gzipped single record fastq file gz_min_result = collapser.seq_counter("mockPrefixDNE", collapser.gz_f) self.assertEqual(self.min_counts_dict, gz_min_result) @@ -109,7 +109,7 @@ def test_seq_counter_gzip(self): """ @patch('tiny.rna.collapser.open', new_callable=mock_open()) def test_seq2fasta_gzip(self, mock_open_f): - with patch('tiny.rna.collapser.gzip.builtins.open', new_callable=mock_open) as gz_open: + with patch('tiny.rna.util.gzip.builtins.open', new_callable=mock_open) as gz_open: # MIN TEST collapser.seq2fasta(self.min_counts_dict, "min_gz", gz=True) output = reassemble_gz_w(gz_open.mock_calls) @@ -362,7 +362,7 @@ def test_collapser_command(self): Testing argparse requirements. 
""" @patch('tiny.rna.collapser.os', autospec=True) - @patch('tiny.rna.collapser.gzip.os', autospec=True) + @patch('tiny.rna.util.gzip.os', autospec=True) @patch('sys.stdout', new_callable=StringIO) @patch('sys.stderr', new_callable=StringIO) def test_collapser_args(self, mock_stderr, mock_stdout, os_gz, os_aq): diff --git a/tests/unit_tests_configuration.py b/tests/unit_tests_configuration.py index 5b9b63e2..0d97b443 100644 --- a/tests/unit_tests_configuration.py +++ b/tests/unit_tests_configuration.py @@ -12,7 +12,7 @@ class BowtieIndexesTest(unittest.TestCase): @classmethod def setUpClass(self): - self.root_cfg_dir = os.path.abspath("../tiny/templates") + self.root_cfg_dir = os.path.abspath("./testdata/config_files") self.run_config = self.root_cfg_dir + "/run_config_template.yml" self.paths = self.root_cfg_dir + "/paths.yml" @@ -231,8 +231,8 @@ def test_getitem_group(self): config.groups = ('mock_parameter',) mapping_1 = {'path': "./some/../file", "other_key": "irrelevant"} - mapping_2 = {'path': "../templates/another_file"} - path_string = "../../START_HERE/reference_data/ram1.gff3" + mapping_2 = {'path': "../config_files/another_file"} + path_string = "../../../START_HERE/reference_data/ram1.gff3" config['mock_parameter'] = [mapping_1, mapping_2, path_string, None, ''] diff --git a/tests/unit_tests_entry.py b/tests/unit_tests_entry.py index 49808b0f..b913200f 100644 --- a/tests/unit_tests_entry.py +++ b/tests/unit_tests_entry.py @@ -27,7 +27,7 @@ def setUpClass(self): # For pre-install tests self.cwl_path = '../tiny/cwl' - self.templates_path = '../tiny/templates' + self.templates_path = './testdata/config_files' # For both pre and post install self.config_file = f'{self.templates_path}/run_config_template.yml' @@ -49,13 +49,13 @@ def setUpClass(self): } """ - Testing that get-template copies the correct files to the current directory. + Testing that get-templates copies the correct files to the current directory. 
""" - def test_get_template(self): + def test_get_templates(self): test_functions = [ - helpers.LambdaCapture(lambda: entry.get_template(self.templates_path)), # The pre-install invocation - helpers.ShellCapture("tiny get-template") # The post-install command + helpers.LambdaCapture(lambda: entry.get_templates(self.templates_path)), # The pre-install invocation + helpers.ShellCapture("tiny get-templates") # The post-install command ] template_files = ['run_config_template.yml', 'samples.csv', 'features.csv', 'paths.yml', 'tinyrna-light.mplstyle'] @@ -198,7 +198,7 @@ def get_children(): print('\n\n') print("Captured stderr:\n" + f'"{test.get_stderr()}"') finally: - run_dirs = glob("./testdata/entry_test_*_run_directory") + run_dirs = glob(f"{self.templates_path}/test_run_config_*_run_directory") for dir in run_dirs: shutil.rmtree(dir) diff --git a/tests/unit_tests_plotter.py b/tests/unit_tests_plotter.py index b53c172d..8e12cce5 100644 --- a/tests/unit_tests_plotter.py +++ b/tests/unit_tests_plotter.py @@ -28,14 +28,16 @@ def test_class_counts(self): be divided by the number of classes before being summed.""" # Each feature contributes a single count to its listed classes - raw_counts_df = plotter.tokenize_feature_classes( - pd.DataFrame.from_dict( - {('feat1', pd.NA): ['', 'wago', 1, 1, 1], - ('feat2', pd.NA): ['', 'csr,wago', 2, 2, 2], - ('feat3', pd.NA): ['', 'wago,csr,other', 3, 3, 3]}, - orient='index', - columns=['Feature Name', 'Feature Class', 'lib1', 'lib2', 'lib3']) - ) + raw_counts_df = pd.DataFrame.from_dict( + {('feat1', 'wago'): ['', 1.0, 1.0, 1.0], + ('feat2', 'csr'): ['', 1.0, 1.0, 1.0], + ('feat2', 'wago'): ['', 1.0, 1.0, 1.0], + ('feat3', 'wago'): ['', 1.0, 1.0, 1.0], + ('feat3', 'csr'): ['', 1.0, 1.0, 1.0], + ('feat3', 'other'): ['', 1.0, 1.0, 1.0]}, + orient='index', + columns=['Feature Name', 'lib1', 'lib2', 'lib3']) + raw_counts_df.index = pd.MultiIndex.from_tuples(raw_counts_df.index) actual = plotter.get_class_counts(raw_counts_df) expected = 
pd.DataFrame.from_dict( diff --git a/tiny/entry.py b/tiny/entry.py index 6dc6c44c..4a9dab13 100644 --- a/tiny/entry.py +++ b/tiny/entry.py @@ -5,7 +5,7 @@ data. This entry point also provides options for only returning template files and workflows that can be used separately. -Subcommands: get-template, setup-cwl, recount, replot, run. +Subcommands: get-templates, setup-cwl, recount, replot, run. When installed, run, recount and setup-cwl should be invoked with: tiny --config @@ -46,7 +46,7 @@ def get_args(): parser = ArgumentParser(description=__doc__) - # Parser for subcommands: (run, recount, replot, setup-cwl, get-template, etc.) + # Parser for subcommands: (run, recount, replot, setup-cwl, get-templates, etc.) subparsers = parser.add_subparsers(required=True, dest='command') subcommands_with_configfile = { "run": "Processes the provided config file and executes the workflow it specifies.", @@ -61,8 +61,8 @@ def get_args(): '--config', metavar='configFile', required=True, help=desc ) - # Subcommand get-template has no additional arguments - subparsers.add_parser("get-template", + # Subcommand get-templates has no additional arguments + subparsers.add_parser("get-templates", help="Copies run config, sample, and reference templates to current directory") return parser.parse_args() @@ -257,7 +257,7 @@ def furnish_if_file_record(file_dict): return 0 -def get_template(templates_path: str) -> None: +def get_templates(templates_path: str) -> None: """Copies all configuration file templates to the current working directory Args: @@ -310,7 +310,7 @@ def main(): run: Run the end-to-end analysis based on a config file. recount: Resume pipeline execution at the tiny-count step replot: Resume pipeline execution at the tiny-plot step - get-template: Get the input sheets & template config files. + get-templates: Get the input sheets & template config files. 
setup-cwl: Get the CWL workflow for a run """ @@ -327,7 +327,7 @@ def main(): "replot": lambda: resume(cwl_path, args.config, "tiny-plot"), "recount": lambda: resume(cwl_path, args.config, "tiny-count"), "setup-cwl": lambda: setup_cwl(cwl_path, args.config), - "get-template": lambda: get_template(templates_path) + "get-templates": lambda: get_templates(templates_path) } command_map[args.command]() diff --git a/tiny/rna/configuration.py b/tiny/rna/configuration.py index 73546545..21ebe081 100644 --- a/tiny/rna/configuration.py +++ b/tiny/rna/configuration.py @@ -9,13 +9,11 @@ from pkg_resources import resource_filename from collections import Counter, OrderedDict -from datetime import datetime from typing import Union, Any, Optional, List from glob import glob from tiny.rna.counter.validation import GFFValidator - -timestamp_format = re.compile(r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}") +from tiny.rna.util import get_timestamp class ConfigBase: @@ -240,7 +238,7 @@ def setup_file_groups(self): def setup_pipeline(self): """Overall settings for the whole pipeline""" - self.dt = datetime.now().strftime('%Y-%m-%d_%H-%M-%S') + self.dt = get_timestamp() self['run_date'], self['run_time'] = self.dt.split('_') default_run_name = '_'.join(x for x in [self['user'], "tinyrna"] if x) @@ -605,7 +603,7 @@ class CSVReader(csv.DictReader): "Overlap": "Overlap", }), "Samples Sheet": OrderedDict({ - "Input FASTQ Files": "File", + "FASTQ/SAM Files": "File", "Sample/Group Name": "Group", "Replicate Number": "Replicate", "Control": "Control", diff --git a/tiny/rna/counter/counter.py b/tiny/rna/counter/counter.py index f18a272f..718e0593 100644 --- a/tiny/rna/counter/counter.py +++ b/tiny/rna/counter/counter.py @@ -1,22 +1,20 @@ -"""This submodule assigns feature counts for SAM alignments using a Feature Sheet ruleset. - -If you find that you are sourcing all of your input files from a prior run, we recommend -that you instead run `tiny recount` within that run's directory. 
-""" +"""tiny-count is a precision counting tool for hierarchical classification and quantification of small RNA-seq reads""" import multiprocessing as mp import traceback import argparse +import shutil import sys import os from collections import defaultdict from typing import Tuple, List, Dict +from pkg_resources import resource_filename from tiny.rna.counter.validation import GFFValidator from tiny.rna.counter.features import Features, FeatureCounter from tiny.rna.counter.statistics import MergedStatsManager -from tiny.rna.util import report_execution_time, from_here, ReadOnlyDict +from tiny.rna.util import report_execution_time, from_here, ReadOnlyDict, get_timestamp from tiny.rna.configuration import CSVReader, PathsFile # Global variables for multiprocessing @@ -26,30 +24,42 @@ def get_args(): """Get input arguments from the user/command line.""" - arg_parser = argparse.ArgumentParser(description=__doc__, add_help=False) - required_args = arg_parser.add_argument_group("Required arguments") - optional_args = arg_parser.add_argument_group("Optional arguments") + arg_parser = argparse.ArgumentParser( + formatter_class=argparse.ArgumentDefaultsHelpFormatter, + description=__doc__, + add_help=False, + ) + + required_args = arg_parser.add_argument_group( + title="Required arguments", + description="You must either provide a Paths File or request templates for detailing your configuration.") + optional_args = arg_parser.add_argument_group( + title="Optional arguments", + description="These options can be used in conjunction with the Paths File (-pf) argument mentioned above.") # Required arguments - required_args.add_argument('-pf', '--paths-file', metavar='PATHS', required=True, - help='your Paths File') - required_args.add_argument('-o', '--out-prefix', metavar='OUTPUTPREFIX', required=True, - help='output prefix to use for file names') + mutex_top_grp = required_args.add_mutually_exclusive_group(required=True) + mutex_top_grp.add_argument('-pf', '--paths-file', 
metavar='FILE', help='your Paths File') + mutex_top_grp.add_argument('--get-templates', action='store_true', + help='Copies the template configuration files required by ' + 'tiny-count into the current directory.') # Optional arguments - optional_args.add_argument('-h', '--help', action="help", help="show this help message and exit") + optional_args.add_argument('-h', '--help', action="help", help=argparse.SUPPRESS) + optional_args.add_argument('-o', '--out-prefix', metavar='PREFIX', default='tiny-count_{timestamp}', + help='The output prefix to use for file names. All occurrences of the ' + 'substring {timestamp} will be replaced with the current date and time.') optional_args.add_argument('-nh', '--normalize-by-hits', metavar='T/F', default='T', - help='If T/true, normalize counts by (selected) ' - 'overlapping feature counts. Default: true.') + help='If T/true, normalize counts by (selected) overlapping feature counts.') optional_args.add_argument('-dc', '--decollapse', action='store_true', - help='Create a decollapsed copy of all SAM files listed in your ' - 'Samples Sheet. This option is ignored for non-collapsed inputs.') + help='Create a decollapsed copy of all SAM files listed in your Samples Sheet. 
' + 'This option is ignored for non-collapsed inputs.') optional_args.add_argument('-sv', '--stepvector', choices=['Cython', 'HTSeq'], default='Cython', help='Select which StepVector implementation is used to find ' 'features overlapping an interval.') - optional_args.add_argument('-a', '--all-features', action='store_true', - help='Represent all features in output counts table, ' - 'even if they did not match a Select for / with value.') + optional_args.add_argument('-a', '--all-features', action='store_true', help=argparse.SUPPRESS) + #help='Represent all features in output counts table, ' + # 'even if they did not match in Stage 1 selection.') optional_args.add_argument('-p', '--is-pipeline', action='store_true', help='Indicates that tiny-count was invoked as part of a pipeline run ' 'and that input files should be sourced as such.') @@ -58,9 +68,26 @@ def get_args(): 'selection elements.') args = arg_parser.parse_args() - setattr(args, 'normalize_by_hits', args.normalize_by_hits.lower() in ['t', 'true']) - return ReadOnlyDict(vars(args)) + if args.get_templates: + get_templates() + sys.exit(0) + else: + args_dict = vars(args) + args_dict['out_prefix'] = args.out_prefix.replace('{timestamp}', get_timestamp()) + args_dict['normalize_by_hits'] = args.normalize_by_hits.lower() in ['t', 'true'] + return ReadOnlyDict(args_dict) + + +def get_templates(): + """Copies config file templates required by tiny-count into the current directory""" + + templates_path = resource_filename('tiny', 'templates') + template_files = ['paths.yml', 'samples.csv', 'features.csv'] + + # Copy template files to the current working directory + for template in template_files: + shutil.copyfile(f"{templates_path}/{template}", f"{os.getcwd()}/{template}") def load_samples(samples_csv: str, is_pipeline: bool) -> List[Dict[str, str]]: diff --git a/tiny/rna/plotter.py b/tiny/rna/plotter.py index be523dfb..54db1435 100644 --- a/tiny/rna/plotter.py +++ b/tiny/rna/plotter.py @@ -17,9 +17,8 @@ from 
typing import Dict, Union, Tuple, DefaultDict from pkg_resources import resource_filename -from tiny.rna.configuration import timestamp_format from tiny.rna.plotterlib import plotterlib -from tiny.rna.util import report_execution_time, make_filename, SmartFormatter +from tiny.rna.util import report_execution_time, make_filename, SmartFormatter, timestamp_format aqplt: plotterlib RASTER: bool diff --git a/tiny/rna/plotterlib.py b/tiny/rna/plotterlib.py index f8cd06c9..5fc30a6e 100644 --- a/tiny/rna/plotterlib.py +++ b/tiny/rna/plotterlib.py @@ -2,7 +2,7 @@ This module contains functions to create relevant plots for small RNA data for use with the tinyRNA pipeline. The plots are built using matplotlib and our style sheet. -You may override these styles by obtaining a copy of the style sheet (tiny get-template), +You may override these styles by obtaining a copy of the style sheet (tiny get-templates), modifying it, and passing it to tiny-plot via the -s/--style-sheet argument. If using this module directly, it may be passed at construction time. 
""" diff --git a/tiny/rna/resume.py b/tiny/rna/resume.py index 15d29757..f1e2745d 100644 --- a/tiny/rna/resume.py +++ b/tiny/rna/resume.py @@ -6,10 +6,10 @@ from ruamel.yaml.comments import CommentedOrderedMap from pkg_resources import resource_filename from abc import ABC, abstractmethod -from datetime import datetime from glob import glob -from tiny.rna.configuration import ConfigBase, timestamp_format, PathsFile +from tiny.rna.configuration import ConfigBase, PathsFile +from tiny.rna.util import timestamp_format, get_timestamp class ResumeConfig(ConfigBase, ABC): @@ -33,7 +33,7 @@ def __init__(self, processed_config, workflow, steps, entry_inputs): self.workflow: CommentedOrderedMap self.workflow = self.yaml.load(f) - self.dt = datetime.now().strftime('%Y-%m-%d_%H-%M-%S') + self.dt = get_timestamp() self.entry_inputs = entry_inputs self.steps = steps + [f"organize_{s}" for s in steps] diff --git a/tiny/rna/util.py b/tiny/rna/util.py index c222469e..5172e30c 100644 --- a/tiny/rna/util.py +++ b/tiny/rna/util.py @@ -6,6 +6,8 @@ import os import re +from datetime import datetime + class Singleton(type): _instances = {} @@ -98,4 +100,11 @@ def sorted_natural(lines, reverse=False): # File IO interface for reading and writing Gzip files -gzip_open = functools.partial(gzip.GzipFile, compresslevel=6, fileobj=None, mtime=0) \ No newline at end of file +gzip_open = functools.partial(gzip.GzipFile, compresslevel=6, fileobj=None, mtime=0) + + +# For timestamp matching and creation +timestamp_format = re.compile(r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}") +def get_timestamp(): + return datetime.now().strftime('%Y-%m-%d_%H-%M-%S') + diff --git a/tiny/templates/features.csv b/tiny/templates/features.csv index 66977d07..8ced3c8d 100755 --- a/tiny/templates/features.csv +++ b/tiny/templates/features.csv @@ -1,7 +1,2 @@ Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Strand,5' End Nucleotide,Length,Overlap -Class,mask,,,,1,both,all,all,Partial 
-Class,miRNA,,,,2,sense,all,16-22,Full -Class,piRNA,5pA,,,2,both,A,24-32,Full -Class,piRNA,5pT,,,2,both,T,24-32,Full -Class,siRNA,,,,2,both,all,15-22,Full -Class,unk,,,,3,both,all,all,Full \ No newline at end of file +,,,,,,,,, \ No newline at end of file diff --git a/tiny/templates/paths.yml b/tiny/templates/paths.yml index 4a8127b1..3f7a332d 100644 --- a/tiny/templates/paths.yml +++ b/tiny/templates/paths.yml @@ -12,13 +12,13 @@ ######-------------------------------------------------------------------------------###### ##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --## -samples_csv: './samples.csv' -features_csv: './features.csv' +samples_csv: samples.csv +features_csv: features.csv ##-- Each entry: 1. the file, 2. (optional) list of attribute keys for feature aliases --## gff_files: -- path: "../../START_HERE/reference_data/ram1.gff3" - alias: [ID] +- path: + alias: #- path: # alias: [ ] @@ -48,7 +48,7 @@ ebwt: '' ##-- If you do not have a bowtie index, provide your reference genome file(s) here --## ##-- One file per line, with "- " at the beginning (think: bulleted list) --## reference_genome_files: -- '../../START_HERE/reference_data/ram1.fa' +- # First genome file goes here! 
######----------------------------------- fastp -------------------------------------###### # Optional: provide a FASTA file containing the specific adapters you wish to trim @@ -59,7 +59,7 @@ adapter_fasta: ######--------------------------------- tiny-plot -----------------------------------###### # # Optional: override the styles used by tiny-plot by providing your own .mplstyle sheet -# Run "tiny get-template" in your terminal to get a copy of the current style sheet +# Run "tiny get-templates" in your terminal to get a copy of the current style sheet # ######-------------------------------------------------------------------------------###### diff --git a/tiny/templates/run_config_template.yml b/tiny/templates/run_config_template.yml index 87fdeedc..c6f6b77e 100644 --- a/tiny/templates/run_config_template.yml +++ b/tiny/templates/run_config_template.yml @@ -317,7 +317,7 @@ dir_name_plotter: plots # ########################################################################################### -version: 1.2 +version: 1.2.1 ######--------------------------- DERIVED FROM PATHS FILE ---------------------------###### # diff --git a/tiny/templates/samples.csv b/tiny/templates/samples.csv index 34f9a94d..32305007 100755 --- a/tiny/templates/samples.csv +++ b/tiny/templates/samples.csv @@ -1,7 +1,2 @@ -Input FASTQ Files,Sample/Group Name,Replicate Number,Control,Normalization -../../START_HERE/fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE, -../../START_HERE/fastq_files/cond1_rep2.fastq.gz,condition1,2,, -../../START_HERE/fastq_files/cond1_rep3.fastq.gz,condition1,3,, -../../START_HERE/fastq_files/cond2_rep1.fastq.gz,condition2,1,, -../../START_HERE/fastq_files/cond2_rep2.fastq.gz,condition2,2,, -../../START_HERE/fastq_files/cond2_rep3.fastq.gz,condition2,3,, \ No newline at end of file +FASTQ/SAM Files,Sample/Group Name,Replicate Number,Control,Normalization +,,,, \ No newline at end of file
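
Note on the new `--out-prefix` default: the `tiny/rna/counter/counter.py` hunk replaces every `{timestamp}` substring in the prefix with the value returned by `get_timestamp()`, which is added to `tiny/rna/util.py` together with the relocated `timestamp_format` pattern. The sketch below shows how these pieces appear to fit together. `get_timestamp` and `timestamp_format` are copied from the diff; `resolve_out_prefix` is a hypothetical helper that stands in for the inline `replace()` call in `get_args()` and is not part of the codebase.

```python
import re
from datetime import datetime

# Copied from the tiny/rna/util.py hunk in this diff
timestamp_format = re.compile(r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}")


def get_timestamp():
    return datetime.now().strftime('%Y-%m-%d_%H-%M-%S')


def resolve_out_prefix(out_prefix: str) -> str:
    """Hypothetical helper: mirrors the inline replace() in counter.get_args()"""
    return out_prefix.replace('{timestamp}', get_timestamp())


# The new default prefix gains a date and time suffix
default_prefix = resolve_out_prefix('tiny-count_{timestamp}')

# A prefix without the placeholder passes through unchanged
custom_prefix = resolve_out_prefix('my_run')

print(default_prefix)
print(custom_prefix)
```

Because the string written by `get_timestamp()` matches `timestamp_format`, modules that import the pattern from `tiny.rna.util` (`plotter.py` and `resume.py` in this diff) can recognize prefixes produced this way.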