Merged
2 changes: 1 addition & 1 deletion README.md
@@ -85,7 +85,7 @@ The pipeline requires that you identify:

For more information, please see the [configuration file documentation](doc/Configuration.md). The `START_HERE` directory demonstrates a working configuration using these files. You can also get a copy of them by running the command:
```shell
tiny get-template
tiny get-templates
```


2 changes: 1 addition & 1 deletion START_HERE/paths.yml
@@ -59,7 +59,7 @@ adapter_fasta:
######--------------------------------- tiny-plot -----------------------------------######
#
# Optional: override the styles used by tiny-plot by providing your own .mplstyle sheet
# Run "tiny get-template" in your terminal to get a copy of the current style sheet
# Run "tiny get-templates" in your terminal to get a copy of the current style sheet
#
######-------------------------------------------------------------------------------######

2 changes: 1 addition & 1 deletion START_HERE/run_config.yml
@@ -317,7 +317,7 @@ dir_name_plotter: plots
#
###########################################################################################

version: 1.2
version: 1.2.1

######--------------------------- DERIVED FROM PATHS FILE ---------------------------######
#
2 changes: 1 addition & 1 deletion START_HERE/samples.csv
@@ -1,4 +1,4 @@
Input FastQ Files,Sample/Group Name,Replicate number,Control,Normalization
FASTQ/SAM Files,Sample/Group Name,Replicate number,Control,Normalization
./fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE,
./fastq_files/cond1_rep2.fastq.gz,condition1,2,,
./fastq_files/cond1_rep3.fastq.gz,condition1,3,,
6 changes: 3 additions & 3 deletions doc/Configuration.md
@@ -9,7 +9,7 @@ The pipeline requires that you identify:

The `START_HERE` directory demonstrates a working configuration using these files. You can also get a copy of them (and other optional template files) with:
```
tiny get-template
tiny get-templates
```

## Overview
@@ -117,7 +117,7 @@ The final output directory name has three components:
- The `run_directory` basename defined in your Paths File

## Samples Sheet Details
| _Column:_ | Input FASTQ Files | Sample/Group Name | Replicate Number | Control | Normalization |
| _Column:_ | FASTQ/SAM Files | Sample/Group Name | Replicate Number | Control | Normalization |
|-----------:|---------------------|-------------------|------------------|---------|---------------|
| _Example:_ | cond1_rep1.fastq.gz | condition1 | 1 | True | RPM |

@@ -151,4 +151,4 @@ Rules that match features in the first stage of selection will be used in a seco
See [tiny-count's documentation](tiny-count.md#feature-selection) for an explanation of each column.

## Plot Stylesheet Details
Matplotlib uses key-value "rc parameters" to allow for customization of its properties and styles, and one way these parameters can be specified is with a [matplotlibrc file](https://matplotlib.org/3.4.3/tutorials/introductory/customizing.html#a-sample-matplotlibrc-file), which we simply refer to as the Plot Stylesheet. You can obtain a copy of the default stylesheet used by tiny-plot with the command `tiny get-template`. Please keep in mind that tiny-plot overrides these defaults for a few specific elements of certain plots. Feel free to reach out if there is a plot style you wish to override but find you are unable to.
Matplotlib uses key-value "rc parameters" to allow for customization of its properties and styles, and one way these parameters can be specified is with a [matplotlibrc file](https://matplotlib.org/3.4.3/tutorials/introductory/customizing.html#a-sample-matplotlibrc-file), which we simply refer to as the Plot Stylesheet. You can obtain a copy of the default stylesheet used by tiny-plot with the command `tiny get-templates`. Please keep in mind that tiny-plot overrides these defaults for a few specific elements of certain plots. Feel free to reach out if there is a plot style you wish to override but find you are unable to.
85 changes: 43 additions & 42 deletions doc/Parameters.md
@@ -63,22 +63,22 @@ Optional arguments:

## tiny-count

### All Features
| Run Config Key | Commandline Argument |
|------------------------|------------------------|
| counter_all_features: | `--all-features` |
### Get Templates
| Run Config Key | Commandline Argument |
|----------------|----------------------|
| | `--get-templates` |

By default, tiny-count will only evaluate alignments to features which match a `Select for...` & `with value...` of at least one rule in your Features Sheet. It is this matching feature set, and only this set, which is included in `feature_counts.csv` and therefore available for analysis by tiny-deseq.r and tiny-plot. Switching this option "on" will include all features in every input GFF file, regardless of attribute matches, for tiny-count and downstream steps.
Copies the template configuration files required by tiny-count into the current directory. This argument can't be combined with `--paths-file`. All other arguments are ignored when provided, and once the templates have been copied tiny-count exits.

### Normalize by Hits
| Run Config Key | Commandline Argument |
| Run Config Key | Commandline Argument |
|----------------------------|---------------------------|
| counter-normalize-by-hits: | `--normalize-by-hits T/F` |

By default, tiny-count will divide the number of counts associated with each sequence, twice, before they are assigned to a feature. Each unique sequence's count is determined by tiny-collapse (or a compatible collapsing utility) and is preserved through the alignment process. The original count is divided first by the number of loci that the sequence aligns to, and second by the number of features passing selection at each locus. Switching this option "off" disables the latter normalization step.
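
The two divisions described above can be sketched as follows. This is only an illustration of the arithmetic, not tiny-count's actual code; the function and parameter names are hypothetical, and it assumes at least one locus and at least one selected feature.

```python
def normalized_count(seq_count, n_loci, n_selected_features, normalize_by_hits=True):
    """Return the count contributed to each selected feature at one locus.

    seq_count: the unique sequence's count as reported by the collapser
    n_loci: number of loci the sequence aligns to (assumed >= 1)
    n_selected_features: features passing selection at this locus (assumed >= 1)
    """
    # First division: spread the count evenly across all alignment loci
    per_locus = seq_count / n_loci
    if not normalize_by_hits:
        # --normalize-by-hits F: skip the second division
        return per_locus
    # Second division: spread the per-locus count across the selected features
    return per_locus / n_selected_features


print(normalized_count(12, 3, 2))         # 2.0
print(normalized_count(12, 3, 2, False))  # 4.0
```

With the option switched "off", each selected feature at a locus receives the full per-locus count rather than a share of it.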

### Decollapse
| Run Config Key | Commandline Argument |
| Run Config Key | Commandline Argument |
|---------------------|------------------------|
| counter_decollapse: | `--decollapse` |

@@ -89,63 +89,64 @@ The SAM files produced by the tinyRNA pipeline are collapsed by default; alignme
|--------------------|----------------------|
| counter_stepvector | `--stepvector` |

A custom Cython implementation of HTSeq's StepVector is used for finding features that overlap each alignment interval. While the core C++ component of the StepVector is the same, we have found that our Cython implementation can result in runtimes up to 50% faster than HTSeq's implementation. This parameter allows you to use HTSeq's StepVector if you wish (for example, if the Cython StepVector is incompatible with your system)

### Allow Features with Multiple ID Values
| Run Config Key | Commandline Argument |
|------------------------|----------------------|
| counter_allow_multi_id | `--multi-id` |

By default, an error will be produced if a GFF file contains a feature with multiple comma separated values listed under its ID attribute. Switching this option "on" instructs tiny-count to accept these features without error, but only the first listed value is used as the ID.
A custom Cython implementation of HTSeq's StepVector is used for finding features that overlap each alignment interval. While the core C++ component of the StepVector is the same, we have found that our Cython implementation can result in runtimes up to 50% faster than HTSeq's implementation. This parameter allows you to use HTSeq's StepVector if you wish.

### Is Pipeline
| Run Config Key | Commandline Argument |
| Run Config Key | Commandline Argument |
|----------------|----------------------|
| | `--is-pipeline` |

This commandline argument tells tiny-count that it is running as a workflow step rather than a standalone/manual run. Under these conditions tiny-count will look for all input files in the current working directory regardless of the paths defined in the Samples Sheet and Features Sheet.

### Report Diags
| Run Config Key | Commandline Argument |
| Run Config Key | Commandline Argument |
|----------------|----------------------|
| counter_diags: | `--report-diags` |

Diagnostic information will include intermediate alignment files for each library and an additional stats table with information about counts that were not assigned to a feature. See [the description of these outputs](../README.md#Diagnostics) for details.

### Full tiny-count Help String
```
tiny-count -pf PATHS -o OUTPUTPREFIX [-h] [-nh T/F] [-dc]
[-sv {Cython,HTSeq}] [-a] [-p] [-d]
tiny-count (-pf FILE | --get-templates) [-o PREFIX] [-nh T/F] [-dc]
[-sv {Cython,HTSeq}] [-p] [-d]

This submodule assigns feature counts for SAM alignments using a Feature Sheet
ruleset. If you find that you are sourcing all of your input files from a
prior run, we recommend that you instead run `tiny recount` within that run's
directory.
tiny-count is a precision counting tool for hierarchical classification and
quantification of small RNA-seq reads

Required arguments:
-pf PATHS, --paths-file PATHS
your Paths File
-o OUTPUTPREFIX, --out-prefix OUTPUTPREFIX
output prefix to use for file names
You must either provide a Paths File or request templates for detailing
your configuration.

-pf FILE, --paths-file FILE
your Paths File (default: None)
--get-templates Copies the template configuration files required by
tiny-count into the current directory. (default:
False)

Optional arguments:
-h, --help show this help message and exit
These options can be used in conjunction with the Paths File (-pf)
argument mentioned above.

-o PREFIX, --out-prefix PREFIX
The output prefix to use for file names. All
occurrences of the substring {timestamp} will be
replaced with the current date and time. (default:
tiny-count_{timestamp})
-nh T/F, --normalize-by-hits T/F
If T/true, normalize counts by (selected) overlapping
feature counts. Default: true.
feature counts. (default: T)
-dc, --decollapse Create a decollapsed copy of all SAM files listed in
your Samples Sheet. This option is ignored for non-
collapsed inputs.
collapsed inputs. (default: False)
-sv {Cython,HTSeq}, --stepvector {Cython,HTSeq}
Select which StepVector implementation is used to find
features overlapping an interval.
-a, --all-features Represent all features in output counts table, even if
they did not match a Select for / with value.
features overlapping an interval. (default: Cython)
-p, --is-pipeline Indicates that tiny-count was invoked as part of a
pipeline run and that input files should be sourced as
such.
such. (default: False)
-d, --report-diags Produce diagnostic information about
uncounted/eliminated selection elements.
uncounted/eliminated selection elements. (default:
False)
```
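
The new usage line `(-pf FILE | --get-templates)` and the `{timestamp}` substitution in `--out-prefix` can be approximated with Python's `argparse`. The sketch below is an assumption about how such an interface could be built, not a copy of tiny-count's parser; `build_parser`, `fill_timestamp`, and the timestamp layout are hypothetical.

```python
import argparse
from datetime import datetime


def build_parser():
    parser = argparse.ArgumentParser(prog="tiny-count")
    # Exactly one of these two must be given, mirroring "(-pf FILE | --get-templates)"
    required = parser.add_mutually_exclusive_group(required=True)
    required.add_argument("-pf", "--paths-file", metavar="FILE")
    required.add_argument("--get-templates", action="store_true")
    # Optional prefix whose default contains the {timestamp} placeholder
    parser.add_argument("-o", "--out-prefix", metavar="PREFIX",
                        default="tiny-count_{timestamp}")
    return parser


def fill_timestamp(prefix, now=None):
    """Replace every {timestamp} occurrence with a date and time string."""
    # The timestamp layout used here is an assumption for illustration only
    now = now or datetime.now()
    return prefix.replace("{timestamp}", now.strftime("%Y-%m-%d_%H-%M-%S"))


args = build_parser().parse_args(["-pf", "paths.yml"])
print(fill_timestamp(args.out_prefix, datetime(2023, 1, 2, 3, 4, 5)))
# tiny-count_2023-01-02_03-04-05
```

Passing both `-pf` and `--get-templates` to this sketch makes `argparse` exit with a usage error, which matches the documented restriction that the two cannot be combined.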
## tiny-deseq.r

@@ -198,36 +199,36 @@ Optional arguments:
## tiny-plot

### Plot Requests
| Run Config Key | Commandline Argument |
| Run Config Key | Commandline Argument |
|----------------|------------------------------|
| plot_requests: | `--plots PLOT PLOT PLOT ...` |

tiny-plot will only produce the list of plots requested.

### P value
| Run Config Key | Commandline Argument |
| Run Config Key | Commandline Argument |
|----------------|----------------------|
| plot_pval: | `--p-value VALUE` |

Feature expression levels are considered significant if their P value is less than this value, with a default of 0.05. Non-differentially expressed features are plotted as gray points, and in `sample_avg_scatter_by_dge_class`, these points are not colored by feature class.
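
A minimal sketch of the threshold logic described above, assuming a strict "less than" comparison; the function names and the color strings are hypothetical and are not taken from tiny-plot.

```python
def is_significant(p_value, threshold=0.05):
    # Strictly "less than" the cutoff, as described above
    return p_value < threshold


def point_color(p_value, class_color, threshold=0.05):
    # Non-significant points are drawn gray instead of their class color
    return class_color if is_significant(p_value, threshold) else "gray"


print(point_color(0.01, "tab:blue"))  # tab:blue
print(point_color(0.05, "tab:blue"))  # gray
```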

### Style Sheet
| Run Config Key | Paths File Key | Commandline Argument |
| Run Config Key | Paths File Key | Commandline Argument |
|----------------|-------------------|--------------------------|
| | plot_style_sheet: | `--style-sheet MPLSTYLE` |

The plot style sheet can be used to override the default Matplotlib styles used by tiny-plot. Unlike the other parameters, this option is found in the Paths File. See the [Plot Stylesheet documentation](Configuration.md#plot-stylesheet-details) for more information.

### Vector Scatter
| Run Config Key | Commandline Argument |
| Run Config Key | Commandline Argument |
|---------------------|----------------------|
| plot_vector_points: | `--vector-scatter` |

The scatter plots produced by tiny-plot have rasterized points by default. This allows for faster plot generation, smaller file sizes, and files that are more easily handled by PDF readers. Plots are produced in 300 dpi by default, so in most cases this rasterization is seldom noticeable under normal zoom levels. Switching this option "on" will cause points to be vectorized allowing for zooming without pixelation.
>**Note**: only scatter points are rasterized with this option switched "off"; all other elements are vectorized in every plot type.

### Bounds for len_dist Charts
| Run Config Key | Commandline Argument |
| Run Config Key | Commandline Argument |
|--------------------|------------------------|
| plot_len_dist_min: | `--len-dist-min VALUE` |
| plot_len_dist_max: | `--len-dist-max VALUE` |
@@ -247,8 +248,8 @@ The labels that should be used for special groups in `class_charts` and `sample_
tiny-plot [-rc RAW_COUNTS] [-nc NORM_COUNTS] [-uc RULE_COUNTS]
[-ss STAT] [-dge COMPARISON [COMPARISON ...]]
[-len 5P_LEN [5P_LEN ...]] [-h] [-o PREFIX] [-pv VALUE]
[-s MPLSTYLE] [-v] [-ldi VALUE] [-lda VALUE] -p PLOT
[PLOT ...]
[-s MPLSTYLE] [-v] [-ldi VALUE] [-lda VALUE] [-una LABEL]
[-unk LABEL] -p PLOT [PLOT ...]

This script produces basic static plots for publication as part of the tinyRNA
workflow. Input file requirements vary by plot type and you are free to supply
2 changes: 1 addition & 1 deletion doc/Pipeline.md
@@ -3,7 +3,7 @@ The following commands deal with pipeline operations for carrying out end-to-end

```shell
# Retrieving config files
tiny get-template
tiny get-templates
tiny setup-cwl

# End-to-end analysis
11 changes: 5 additions & 6 deletions doc/tiny-count.md
@@ -4,19 +4,16 @@
For an explanation of tiny-count's parameters in the Run Config and by commandline, see [the parameters documentation](Parameters.md#tiny-count).

## Resuming an End-to-End Analysis
tiny-count offers a variety of options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. However, the earlier pipeline steps are resource and time intensive, so it is inconvenient to rerun an end-to-end analysis to test new selection rules. Using the command `tiny recount`, tinyRNA will run the workflow starting at the tiny-count step using inputs from a prior end-to-end run. See the [pipeline resume documentation](Pipeline.md#resuming-a-prior-analysis) for details and prerequesites.
tiny-count offers a variety of options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. Using the command `tiny recount`, tinyRNA will run the workflow starting at the tiny-count step using inputs from a prior end-to-end run to save time. See the [pipeline resume documentation](Pipeline.md#resuming-a-prior-analysis) for details and prerequisites.

## Running as a Standalone Tool
If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command requires that you specify the paths to your Samples Sheet and Features Sheet, and a filename prefix for outputs. [All other arguments are optional](Parameters.md#full-tiny-count-help-string). You will need to make a copy of your Samples Sheet and modify it so that the `Input FASTQ Files` column instead contains paths to the corresponding SAM files from a prior end-to-end run. SAM files from third party sources are also supported, and can be produced from reads collapsed by tiny-collapse or fastx, or from non-collapsed reads.

>**Important:** reusing the same output filename prefix between standalone runs will result in prior outputs being overwritten.
If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command has [one required argument](Parameters.md#full-tiny-count-help-string): your Paths File. Your Samples Sheet will need to list SAM files rather than FASTQ files in the `FASTQ/SAM Files` column. SAM files from third party sources are also supported, and if they have been produced from reads collapsed by tiny-collapse or fastx, tiny-count will honor the reported read counts.
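
Before a standalone run, it can be useful to confirm that every entry in the `FASTQ/SAM Files` column points to a SAM file. The helper below is a hypothetical pre-flight check written for illustration; it is not part of tinyRNA, and it assumes SAM files use a `.sam` extension.

```python
import csv
import io


def non_sam_entries(samples_csv_text, column="FASTQ/SAM Files"):
    """Return the entries in the given column that do not end in .sam."""
    reader = csv.DictReader(io.StringIO(samples_csv_text))
    return [row[column] for row in reader
            if not row[column].lower().endswith(".sam")]


sheet = """FASTQ/SAM Files,Sample/Group Name,Replicate number,Control,Normalization
./sams/cond1_rep1.sam,condition1,1,TRUE,
./fastq_files/cond1_rep2.fastq.gz,condition1,2,,
"""
print(non_sam_entries(sheet))  # ['./fastq_files/cond1_rep2.fastq.gz']
```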

#### Using Non-collapsed Sequence Alignments
While third-party SAM files from non-collapsed reads are supported, there are some caveats. These files will result in substantially higher resource usage and runtimes; we strongly recommend collapsing prior to alignment. Additionally, the sequence-related stats produced by tiny-count will no longer represent _unique_ sequences. These stats will instead refer to all sequences with unique QNAMEs (that is, multi-alignment bundles still carry a sequence count of 1.)
While third-party SAM files from non-collapsed reads are supported, there are some caveats. These files will result in substantially higher resource usage and runtimes; we strongly recommend collapsing prior to alignment. Additionally, the sequence-related stats produced by tiny-count will no longer represent _unique_ sequences. These stats will instead refer to all sequences with unique QNAMEs (that is, multi-alignment bundles still carry a sequence count of 1).


# Feature Selection
![Feature Selection Diagram](../images/tiny-count_selection.png)

We provide a Features Sheet (`features.csv`) in which you can define selection rules to more accurately capture counts for the small RNAs of interest. The parameters for these rules include attributes commonly used in the classification of small RNAs, such as length, strandedness, and 5' nucleotide.

@@ -26,6 +23,8 @@ Selection occurs in three stages, with the output of each stage as input to the
1. Features are matched to rules based on their attributes defined in GFF files
2. At each alignment locus, overlapping features are selected based on the overlap requirements of their matched rules. Selected features are sorted by hierarchy value so that smaller values take precedence in the next stage.
3. Finally, features are selected for read assignment based on the small RNA attributes of the alignment locus. Once reads are assigned to a feature, they are excluded from matches with larger hierarchy values.

![Feature Selection Diagram](../images/tiny-count_selection.png)
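
The three stages listed above can be sketched in simplified form. This is an illustrative model only, not tiny-count's implementation: the data layout, the function names, the half-open coordinates, and the exact meaning of `Full` and `Partial` overlap are assumptions, and strand is omitted for brevity. The two rules are modeled on rows of the test `features.csv`.

```python
def overlaps(feature, aln, mode):
    """Half-open [start, end) interval test; the meaning of each mode is assumed."""
    if mode == "Full":
        # Assumed: the alignment lies entirely within the feature
        return feature["start"] <= aln["start"] and aln["end"] <= feature["end"]
    # Assumed "Partial": at least one shared base
    return feature["start"] < aln["end"] and aln["start"] < feature["end"]


def matches_alignment(rule, aln):
    """Stage 3 test on the alignment's own attributes (5' nucleotide, length)."""
    seq = aln["seq"]
    nt_ok = rule["nt5"] == "all" or seq[0] == rule["nt5"]
    len_ok = rule["length"] is None or rule["length"][0] <= len(seq) <= rule["length"][1]
    return nt_ok and len_ok


def assign(aln, features, rules):
    candidates = []
    for fid, feature in features.items():
        for rule in rules:
            # Stage 1: match features to rules by their GFF attributes
            if feature["attrs"].get(rule["key"]) != rule["value"]:
                continue
            # Stage 2: keep features meeting the matched rule's overlap requirement
            if overlaps(feature, aln, rule["overlap"]):
                candidates.append((rule["hierarchy"], fid, rule))
    # Smaller hierarchy values take precedence in the next stage
    candidates.sort(key=lambda c: c[0])

    # Stage 3: select by alignment attributes; once a hierarchy level wins,
    # candidates with larger hierarchy values are excluded
    selected, winning = set(), None
    for hierarchy, fid, rule in candidates:
        if winning is not None and hierarchy > winning:
            break
        if matches_alignment(rule, aln):
            selected.add(fid)
            winning = hierarchy
    return selected


rules = [
    {"key": "Class", "value": "miRNA", "hierarchy": 2, "nt5": "all",
     "length": (16, 22), "overlap": "Full"},
    {"key": "Class", "value": "unk", "hierarchy": 3, "nt5": "all",
     "length": None, "overlap": "Full"},
]
features = {
    "mir-1": {"attrs": {"Class": "miRNA"}, "start": 100, "end": 200},
    "unk-1": {"attrs": {"Class": "unk"}, "start": 50, "end": 300},
}
short = {"start": 120, "end": 141, "seq": "T" + "ACGT" * 5}  # 21 nt
long_ = {"start": 120, "end": 150, "seq": "ACGTA" * 6}       # 30 nt
print(assign(short, features, rules))  # {'mir-1'}
print(assign(long_, features, rules))  # {'unk-1'}
```

In this toy example the 21 nt alignment is claimed by the hierarchy 2 rule, so the hierarchy 3 `unk` rule never receives it; the 30 nt alignment fails the miRNA length requirement and falls through to `unk`.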

## Stage 1: Feature Attribute Parameters
| _features.csv columns:_ | Select for... | with value... | Classify as... | Source Filter | Type Filter |
2 changes: 1 addition & 1 deletion setup.py
@@ -14,7 +14,7 @@
AUTHOR = 'Kristen Brown, Alex Tate'
PLATFORM = 'Unix'
REQUIRES_PYTHON = '>=3.9.0'
VERSION = '1.2'
VERSION = '1.2.1'
REQUIRED = [] # Required packages are installed via Conda's environment.yml


7 changes: 7 additions & 0 deletions tests/testdata/config_files/features.csv
@@ -0,0 +1,7 @@
Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Strand,5' End Nucleotide,Length,Overlap
Class,mask,,,,1,both,all,all,Partial
Class,miRNA,,,,2,sense,all,16-22,Full
Class,piRNA,5pA,,,2,both,A,24-32,Full
Class,piRNA,5pT,,,2,both,T,24-32,Full
Class,siRNA,,,,2,both,all,15-22,Full
Class,unk,,,,3,both,all,all,Full