Merged
18 commits
d729546
Cleaning up troublesome uses of dict.get(key) and dict.get(key, defau…
AlexTate Feb 25, 2023
a639c4d
While performing the earlier audit I realized that we don't check for…
AlexTate Feb 25, 2023
163ce43
Removing the % character from tick labels in class_charts and rule_ch…
AlexTate Feb 26, 2023
66e8167
Adding a new option, -m, for bowtie and doing some CWL cleanup.
AlexTate Feb 26, 2023
c699a54
Adding documentation for sequence-based counting and shift parameters…
AlexTate Feb 27, 2023
acb656b
The Features Sheet section in Configuration.md has been cleaned up. I…
AlexTate Feb 27, 2023
4150e71
The Features Sheet section in Configuration.md has been cleaned up. I…
AlexTate Feb 27, 2023
3e2c5a6
Correcting documentation and config file comments to specify that the…
AlexTate Feb 27, 2023
aa39680
Updates to Configuration.md:
AlexTate Feb 27, 2023
23a5da7
Value lists in Source Filter and Type Filter are already covered in C…
AlexTate Feb 27, 2023
6387ba0
Correcting legend for anchored overlap selector
AlexTate Feb 27, 2023
82a029d
Correcting broken links in tiny-plot.md
AlexTate Feb 27, 2023
051808b
Updating the workflow figure's legend to use "Custom ___ utilities" r…
AlexTate Feb 27, 2023
a7153cf
Updating the description of resume runs now that the Paths File can b…
AlexTate Feb 27, 2023
2b0938a
Rephrasing description in sequence based counting mode
AlexTate Feb 27, 2023
2855ced
Making the routine for determining run_name and run_directory more ro…
AlexTate Mar 1, 2023
b8d51be
Removing default `user` values from Run Configs
AlexTate Mar 1, 2023
130b4a4
Clarifying edits in the Reducing Storage Usage section
AlexTate Mar 1, 2023
7 changes: 4 additions & 3 deletions README.md
@@ -24,7 +24,7 @@ tinyRNA is a set of tools to simplify the analysis of next-generation sequencing

### The Current Workflow

![tinyRNA basic pipeline](images/tinyrna-workflow_current.png)
![tinyRNA basic pipeline](images/tinyrna_workflow_current.png)

## tinyRNA Installation

@@ -161,7 +161,7 @@ At the core of tinyRNA is tiny-count, a highly flexible counting utility that al
A wrapper R script for DESeq2 facilitates DGE analysis of counted sample files.

### `tiny-plot`
The results of feature counting and DGE are visualized with high resolution plot PDFs. User-defined plot styles are also supported via a Matplotlib stylesheet.
The results of feature counting and DGE analysis are visualized with high resolution plot PDFs. User-defined plot styles are also supported via a Matplotlib stylesheet.

[Full documentation for tiny-plot can be found here.](doc/tiny-plot.md)

@@ -276,13 +276,14 @@ Simple static plots are generated from the outputs of tiny-count and tiny-deseq.
tiny-deseq.r will produce a standard **PCA plot** from variance stabilizing transformed feature counts. This output is controlled by the `dge_pca_plot` key in the Run Config and by your experiment design. DGE outputs, including the PCA plot, will not be produced for experiments with less than 1 degree of freedom.

### Reducing Storage Usage
The files produced by certain steps can be very large and after several runs this may present significant storage usage. You can remove the following subdirectories from a Run Directory to free up space, but **you will no longer be able to perform repeat analyses within it (i.e. `tiny recount` or `tiny replot`)**:
The files produced by certain steps can be very large, and after several runs they may consume significant storage. You can remove the following subdirectories from a Run Directory to free up space, but **you will no longer be able to perform recount analyses within it** (i.e. `tiny recount`):
- fastp (though we recommend keeping the reports)
- collapser
- bowtie

Cleanup commands will be added to tinyRNA in a future release, but for now the following command will remove commonly large files while preserving report files:
```shell
# Execute within the Run Directory you want to clean
rm {fastp/*.fastq,{collapser,bowtie}/*.fa,bowtie/*.sam}
```
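Before deleting anything, it can help to preview what the command above would remove. The following is a minimal Python sketch, not part of tinyRNA: the function names are hypothetical, and the glob patterns are simply the expanded form of the shell brace expression above.

```python
from pathlib import Path

# Expanded form of the shell pattern above, relative to the Run Directory
CLEANUP_PATTERNS = [
    "fastp/*.fastq",
    "collapser/*.fa",
    "bowtie/*.fa",
    "bowtie/*.sam",
]


def find_cleanup_targets(run_directory):
    """Return the files matched by the cleanup patterns, sorted by path."""
    root = Path(run_directory)
    targets = []
    for pattern in CLEANUP_PATTERNS:
        targets.extend(p for p in root.glob(pattern) if p.is_file())
    return sorted(targets)


def report(run_directory):
    """Print each matched file and the total space that removal would free."""
    targets = find_cleanup_targets(run_directory)
    total_bytes = sum(p.stat().st_size for p in targets)
    for path in targets:
        print(path)
    print(f"{len(targets)} files, {total_bytes / 1e6:.1f} MB")
    return targets
```

Calling `report(".")` from within a Run Directory lists the candidates without deleting them; report files such as HTML or log outputs are not matched by these patterns.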

2 changes: 1 addition & 1 deletion START_HERE/paths.yml
@@ -22,7 +22,7 @@ gff_files:
#- path:
# alias: [ ]

##-- The final output directory for files produced by the pipeline --#
##-- The suffix to use in the final output directory name (optional) --#
run_directory: run_directory

##-- The directory for temporary files. Determined by cwltool if blank. --##
393 changes: 392 additions & 1 deletion START_HERE/run_config.yml

Large diffs are not rendered by default.

58 changes: 41 additions & 17 deletions doc/Configuration.md
@@ -92,7 +92,7 @@ When the pipeline starts up, tinyRNA will process the Run Config based on the co
## Paths File Details

### GFF Files
GFF annotations are required by tinyRNA. For each file, you can optionally provide an `alias` which is a list of attributes to represent each feature in the Feature Name column of output counts tables. Each entry under the `gff_files` parameter must look something like the following mock example:
GFF annotations are optional but recommended. If not provided, tiny-count will perform [sequence-based counting](tiny-count.md#sequence-based-counting-mode) rather than feature-based counting. For each file, you can optionally provide an `alias` which is a list of attributes to represent each feature in the Feature Name column of output counts tables. Each entry under the `gff_files` parameter must look something like the following mock example:
```yaml
- path: 'a/path/to/your/file.gff' # 0 spaces before -
  alias: [optional, list, of attributes] # 2 spaces before alias
@@ -112,24 +112,28 @@ Once your indexes have been built, your Paths File will be modified such that `e

### The Run Directory
The final output directory name has three components:
- The `run_name` defined in your Run Config
- The date and time at pipeline startup
- The `run_directory` basename defined in your Paths File
1. The `run_name` defined in your Run Config
2. The date and time at pipeline startup
3. The basename of `run_directory` defined in your Paths File

The `run_directory` suffix in the Paths File supports subdirectories; if provided, the final output directory will be named as indicated above, but the subdirectory structure specified in `run_directory` will be retained.
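To illustrate how the three components and an optional subdirectory structure might combine, here is a minimal Python sketch. It is not tinyRNA's actual routine: the function name, the underscore separators, and the timestamp format are all assumptions made for the example.

```python
from datetime import datetime
from pathlib import PurePosixPath


def compose_run_directory(run_name, run_directory, started):
    """Build an output path from run_name, startup time, and the run_directory suffix.

    ASSUMPTION: underscore separators and the timestamp format below are
    illustrative only; the real naming scheme may differ.
    """
    suffix = PurePosixPath(run_directory)
    stamp = started.strftime("%Y-%m-%d_%H-%M-%S")
    # Skip empty components so an omitted suffix leaves no trailing separator
    final_name = "_".join(part for part in (run_name, stamp, suffix.name) if part)
    # Any subdirectories given in run_directory are kept as the parent path
    return str(suffix.parent / final_name)
```

Under these assumptions, a `run_directory` of `results/batch1/run_directory` keeps `results/batch1/` as the parent, and only the basename `run_directory` is used as the suffix of the final directory name.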

## Samples Sheet Details
| _Column:_ | FASTQ/SAM Files | Sample/Group Name | Replicate Number | Control | Normalization |
|-----------:|---------------------|-------------------|------------------|---------|---------------|
| _Example:_ | cond1_rep1.fastq.gz | condition1 | 1 | True | RPM |

### Assigning the Control Group
Assigning the control group allows the proper DGE comparisons to be made and plotted. The Control column is where you'll make this indication by writing `true` on any corresponding row. Regardless of the number of replicates in each group, only one associated row needs to have this indication. Do not write `false` or anything else for the other groups; this column should only be used to indicate the affirmative.
Assigning the control group allows the proper DGE comparisons to be made and plotted. The Control column is where you'll make this indication by writing `true` on any corresponding row. Regardless of the number of replicates in each group, only one row needs to have this indication.

tinyRNA doesn't support experiments with more than one control condition. However, if you omit all control condition labels, every possible comparison will be made, which should include the comparisons you want.
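The difference between the two cases can be sketched in Python as follows. This is an illustration rather than the logic used by tiny-deseq.r: the function name is hypothetical, and treating "every possible comparison" as unordered pairs of groups is an assumption.

```python
from itertools import combinations


def planned_comparisons(groups, control=None):
    """List the group pairs that would be compared.

    ASSUMPTION: without a control, comparisons are unordered pairs of groups.
    """
    if control is None:
        # No control labelled: compare every group against every other group
        return list(combinations(groups, 2))
    # One control labelled: compare each remaining group against it
    return [(control, group) for group in groups if group != control]
```

With three groups and no control, this yields three pairs; with a control, it yields two.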

### Applying Custom Normalization
Custom normalization can be applied at the conclusion of feature counting using the Normalization column. Unlike the Control column, values in the Normalization column apply to the specific library that they share a row with.

Supported values are:
- **Blank or 1**: no normalization is applied to the corresponding library
- **Any number**: the corresponding library's counts are divided by this number
- **Any number**: the corresponding library's counts are divided by this number (useful for spike-in normalization)
- **RPM or rpm**: the corresponding library's counts are divided by (its mapped read count / 1,000,000)
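The supported values listed above amount to choosing a per-library divisor. A minimal Python sketch, assuming counts are held in a dictionary keyed by feature (the function names and data layout are hypothetical, not tiny-count's internals):

```python
def normalization_divisor(value, mapped_reads):
    """Translate a Normalization cell into the number that counts are divided by."""
    text = "" if value is None else str(value).strip()
    if text == "":
        return 1.0  # blank cell: no normalization
    if text in ("RPM", "rpm"):
        return mapped_reads / 1_000_000  # reads per million mapped reads
    return float(text)  # any number, e.g. a spike-in factor; 1 leaves counts unchanged


def normalize(counts, value, mapped_reads):
    """Divide every feature count in one library by that library's divisor."""
    divisor = normalization_divisor(value, mapped_reads)
    return {feature: count / divisor for feature, count in counts.items()}
```

For example, a library with 2,000,000 mapped reads and `rpm` in its Normalization column has each count divided by 2.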

>**NOTE**: These normalizations operate independently of tiny-count's --normalize-by-hits commandline option. The former is concerned with per-library normalization, whereas the latter is concerned with normalization by selected feature count at each locus ([more info](tiny-count.md#count-normalization)). The commandline option does not enable or disable the normalizations detailed above.
@@ -138,17 +142,37 @@ Supported values are:
DESeq2 requires that your experiment design has at least one degree of freedom. If your experiment doesn't include at least one sample group with more than one replicate, tiny-deseq.r will be skipped and DGE related plots will not be produced.
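The replicate condition described above can be checked ahead of time with a short Python sketch. The function name is hypothetical, and the input is assumed to be the Sample/Group Name value from each row of the Samples Sheet.

```python
from collections import Counter


def has_replicated_group(group_names):
    """Return True if at least one sample group has more than one replicate."""
    replicate_counts = Counter(group_names)
    return any(count > 1 for count in replicate_counts.values())
```

A design with rows `condition1, condition1, condition2` passes this check, while `condition1, condition2` does not, so DGE outputs would be skipped in the second case.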

## Features Sheet Details
| _Column:_ | Select for... | with value... | Classify as... | Source Filter | Type Filter | Hierarchy | Strand | 5' End Nucleotide | Length | Overlap |
|------------|---------------|---------------|----------------|----------------|-------------|-----------|--------|-------------------|--------|-------------|
| _Example:_ | Class | miRNA | miRNA | | | 1 | sense | all | all | 5' anchored |

The Features Sheet allows you to define selection rules that determine how features are chosen when multiple features are found to overlap an alignment locus. Selected features are "assigned" a portion of the reads associated with the alignment.

Selection first takes place against the feature attributes defined in your GFF files, and is directed by defining the attribute you want to be considered (Select for...) and the acceptable values for that attribute (with value...).

Rules that match features in the first stage of selection will be used in a second stage which evaluates alignment vs. feature interval overlap. These matches are sorted by hierarchy value and passed to the third and final stage of selection which examines characteristics of the alignment itself: strand relative to the feature of interest, 5' end nucleotide, and length.

See [tiny-count's documentation](tiny-count.md#feature-selection) for an explanation of each column.
![Features Sheet Header](../images/features_sheet_header.png)

The Features Sheet allows you to define selection rules that control how reads are assigned to features. We refer to each row as a rule, and to each column as a selector. `Classify as...` isn't a selector because it is used for labelling and subsetting matches rather than determining them. See [tiny-count's documentation](tiny-count.md#feature-selection) for an explanation of the selection process and the role that each selector plays.

### Selector Formats
Selectors in the Features Sheet can be specified as a single value, a list of comma-separated values, a range, or a wildcard. The supported formats vary from selector to selector. For list and range formats, just one of the specified values has to match for the target to be selected. Wildcard formats can be implicitly defined with a blank cell, or explicitly defined using the example keywords below.

| Selector | Wildcard | Single | List | Range |
|:----------------|:--------:|:------:|:----:|:-----:|
| `Select for...` | ✓ | ✓ | | |
| `with value...` | ✓ | ✓ | | |
| `Source Filter` | ✓ | ✓ | ✓ | |
| `Type Filter` | ✓ | ✓ | ✓ | |
| `Hierarchy` | | ✓ | | |
| `Overlap` | ✓ | ✓ | | |
| `Strand` | ✓ | ✓ | | |
| `5' nt` | ✓ | ✓ | ✓ | |
| `Length` | ✓ | ✓ | ✓ | ✓ |

Examples:
- **Wildcard** <sup>†</sup>: `any`, `all`, `*`, or a blank cell
- **Single**: `G` or `22`
- **List**: `C,G,U` or `25, 26` (spaces do not matter)
- **Range**: `20-25`
- **Mixed** <sup>§</sup>: `19, 21-23, 25-30`

<sup>†</sup> the `Strand` selector also supports `both`<br>
<sup>§</sup> only supported by the `Length` selector
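As an illustration of how these formats could be interpreted for the `Length` selector, here is a minimal Python sketch. It is not tiny-count's parser: the function names are hypothetical, and treating ranges as inclusive of both endpoints is an assumption.

```python
WILDCARDS = {"", "any", "all", "*"}


def parse_length_selector(cell):
    """Return None for a wildcard, otherwise the set of accepted lengths."""
    text = cell.strip().lower()  # selectors are case-insensitive
    if text in WILDCARDS:
        return None
    lengths = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            # ASSUMPTION: a range such as 20-25 includes both endpoints
            start, end = (int(bound) for bound in part.split("-"))
            lengths.update(range(start, end + 1))
        else:
            lengths.add(int(part))
    return lengths


def length_matches(cell, length):
    """A wildcard matches every length; otherwise any listed value may match."""
    accepted = parse_length_selector(cell)
    return accepted is None or length in accepted
```

For instance, `length_matches("19, 21-23, 25-30", 22)` is true because only one of the listed values or ranges needs to match.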

### Case Sensitivity
All selectors are case-insensitive.

## Plot Stylesheet Details
Matplotlib uses key-value "rc parameters" to allow for customization of its properties and styles, and one way these parameters can be specified is with a [matplotlibrc file](https://matplotlib.org/3.4.3/tutorials/introductory/customizing.html#a-sample-matplotlibrc-file), which we simply refer to as the Plot Stylesheet. You can obtain a copy of the default stylesheet used by tiny-plot with the command `tiny get-templates`. Please keep in mind that tiny-plot overrides these defaults for a few specific elements of certain plots. Feel free to reach out if there is a plot style you wish to override but find you are unable to.
6 changes: 5 additions & 1 deletion doc/Pipeline.md
@@ -32,7 +32,8 @@ The commands `tiny recount` and `tiny replot` seek to solve this problem. As dis

You can modify the behavior of a resume run by changing settings in:
- The **processed** Run Config
- The **original** Features Sheet that was used for the end-to-end run (as indicated by the `features_csv` key in the **processed** Run Config)
- The **original** Features Sheet that was used for the end-to-end run (as indicated by `features_csv` in the processed Run Config)
- The **original** Paths File (as indicated by `paths_config` in the processed Run Config)

### The Steps
1. Make and save the desired changes in the files above
@@ -45,6 +46,9 @@ File inputs are sourced from the **original** output subdirectories of prior ste
### Where to Find Outputs from Resume Runs
Output subdirectories for resume runs can be found alongside the originals, and will have a timestamp appended to their name to differentiate them.

### Auto-Documentation of Resume Runs
A new processed Run Config will be saved in the Run Directory at the beginning of each resume run. It will be labelled with the same timestamp used in the resume run's other outputs to differentiate it. It includes the changes to your Paths File and Run Config. A copy of your Features Sheet is saved to the timestamped tiny-count output directory during `tiny recount` runs.

## Parallelization
Most steps in the pipeline run in parallel to minimize runtimes. This is particularly advantageous for multiprocessor systems like server environments. However, parallelization isn't always beneficial. If your computer doesn't have enough free memory, or if you have a large sample file set and/or reference genome, parallel execution might push your machine to its limits. When this happens you might see memory errors or your computer may become unresponsive. In these cases it makes more sense to run resource intensive steps one at a time, in serial, rather than in parallel. To do so, set `run_parallel: false` in your Run Config. This will affect fastp, tiny-collapse, and bowtie since these steps typically handle the largest volumes of data.
