26 commits
1f53130
Added GFF input handling to process_paths_sheet(). Also includes styl…
AlexTate Oct 15, 2022
53f1683
Commandline arguments for tiny-count have been modified. The required…
AlexTate Oct 15, 2022
4bfa445
Updated the directions in the comments to include providing GFF files…
AlexTate Oct 15, 2022
f179966
Progress commit, switching branches
AlexTate Oct 15, 2022
6467672
Tiny-count is now able to perform counting using the new Paths File i…
AlexTate Oct 23, 2022
9105779
Updated config files in START_HERE for the new GFF input method
AlexTate Oct 23, 2022
c5ecbf5
Bugfixes and reliability improvements. ConfigBase.from_here() can now…
AlexTate Oct 23, 2022
85cbd0c
Merge branch 'master' into issue-234
AlexTate Oct 23, 2022
49330f2
Updates to ConfigBase.joinpath() to make it more robust against input…
AlexTate Oct 25, 2022
e5956fd
A new configuration class has been created: PathsFile. In addition to…
AlexTate Oct 25, 2022
0d1a017
Updating the Configuration class to use the new PathsFile class when …
AlexTate Oct 25, 2022
ae25379
The new PathsFile class has been integrated with tiny-count. This has…
AlexTate Oct 26, 2022
3932062
Correcting the tuple unpacking order in some functions. This isn't ne…
AlexTate Oct 26, 2022
cd9b050
Unit tests have been added for the new PathsFile class
AlexTate Oct 26, 2022
bb0293f
Config files in the templates directory have been updated to reflect …
AlexTate Oct 26, 2022
509f881
ConfigBase.from_here() has slightly cleaner logic and is much more re…
AlexTate Oct 26, 2022
b10da8a
The PathsFile class is now compatible with pipeline-context path reso…
AlexTate Oct 26, 2022
5c7cd06
Updated load_gff_files() to utilize PathsFile's new pipeline-context …
AlexTate Oct 27, 2022
5dfffdd
Updated CWL for new tiny-count inputs
AlexTate Oct 27, 2022
5b02f1e
Cleaning up the Run Config templates
AlexTate Oct 27, 2022
e9257ef
Moving paths_template_file and make_paths_file() into unit_test_helpe…
AlexTate Oct 27, 2022
5949395
Added new test for pipeline-context path mapping. This commit also co…
AlexTate Oct 27, 2022
80dfe5d
Updated tests for new Features Sheet layout, and added PathsFile-spec…
AlexTate Oct 27, 2022
a10c4e9
Minor formatting simplification and helpful comment for the gff_files…
AlexTate Oct 27, 2022
64adab6
Documentation updates for the new Paths-File-hosted GFF files and the…
AlexTate Oct 27, 2022
2a2c4bd
Merge branch 'master' into issue-234
AlexTate Oct 28, 2022
10 changes: 5 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,8 +105,8 @@ In most cases you will use this toolset as an end-to-end pipeline. This will run
2. The genome sequence of interest in fasta format.
3. Genome coordinates of small RNA features of interest in GFF format.
4. A completed Samples Sheet (`samples.csv`) with paths to the fastq files.
5. A completed Features Sheet (`features.csv`) with paths to the GFF file(s).
6. An updated Paths File (`paths.yml`) with the path to the genome sequence and/or your bowtie index prefix.
5. A completed Features Sheet (`features.csv`) with feature selection rules.
6. An updated Paths File (`paths.yml`) with paths to your GFF files, the genome sequence and/or your bowtie index prefix, as well as the paths to `samples.csv` and `features.csv`.
7. A Run Config file (`run_config.yml`) located in your working directory or the path to the file. The template provided does not need to be updated if you wish to use the default settings.

To run an end-to-end analysis, be sure that you're working within the conda tinyrna environment ([instructions above](#usage)) in your terminal and optionally navigate to the location of your Run Config file. Then, simply run the following in your terminal:
Expand Down Expand Up @@ -177,13 +177,13 @@ The tiny-count step produces a variety of outputs
Custom Python scripts and HTSeq are used to generate a single table of feature counts which includes each counted library. Each matched feature is represented with the following metadata columns:
- **_Feature ID_** is determined, in order of preference, by one of the following GFF column 9 attributes: `ID`, `gene_id`, `Parent`.
- **_Classifier_** is determined by the rules in your Features Sheet. It is the _Classify as..._ value of each matching rule. Since multiple rules can match a feature, some Feature IDs will be listed multiple times with different classifiers.
- **_Feature Name_** displays aliases of your choice, as specified in the _Alias by..._ column of the Features Sheet. If _Alias by..._ is set to`ID`, the _Feature Name_ column is left empty.
- **_Feature Name_** displays aliases of your choice, as specified in the `alias` key under each GFF listed in your Paths File. If `alias` is set to `ID`, the _Feature Name_ column is left empty.

For example, if your Features Sheet has a rule which specifies _Alias by..._ `sequence_name`, _Classify as..._ `miRNA`, and the GFF entry for this feature has the following attributes column:
For example, if your Paths File has a GFF entry which specifies `alias: [sequence_name]`, and the corresponding GFF file has a feature with the following attributes column:
```
... ID=406904;sequence_name=mir-1,hsa-miR-1; ...
```
The row for this feature in the feature counts table would read:
And this feature matched a rule in your Features Sheet defining _Classify as..._ `miRNA`, then the entry for this feature in the final counts table would read:

| Feature ID | Classifier | Feature Name | Group1_rep_1 | Group1_rep_2 | ... |
|------------|------------|------------------|--------------|--------------|-----|
Expand Down
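
The Feature ID and Feature Name lookup described above can be sketched roughly as follows. This is an illustration only: the helper names (`parse_attributes`, `feature_id`, `feature_names`) are hypothetical and not part of tinyRNA, and the sketch assumes GFF3-style `key=value` pairs separated by `;` with comma-separated values. It does not show how tiny-count formats the final Feature Name column.

```python
# Illustrative sketch only; these helpers are hypothetical, not tinyRNA's API.

ID_PREFERENCE = ("ID", "gene_id", "Parent")  # order of preference stated above


def parse_attributes(column9):
    """Parse a GFF3-style attributes string into {key: [values]}."""
    attrs = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if not field or "=" not in field:
            continue  # skip empty or malformed fields
        key, _, value = field.partition("=")
        attrs[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return attrs


def feature_id(attrs):
    """Return the first available ID attribute, following ID_PREFERENCE."""
    for key in ID_PREFERENCE:
        if attrs.get(key):
            return attrs[key][0]
    raise ValueError("no ID, gene_id, or Parent attribute found")


def feature_names(attrs, alias_keys):
    """Collect alias values; an alias of 'ID' contributes nothing (assumed)."""
    names = []
    for key in alias_keys:
        if key == "ID":
            continue  # the text says Feature Name is left empty for ID
        names.extend(attrs.get(key, []))
    return names


attrs = parse_attributes("ID=406904;sequence_name=mir-1,hsa-miR-1;")
print(feature_id(attrs))                        # 406904
print(feature_names(attrs, ["sequence_name"]))  # ['mir-1', 'hsa-miR-1']
```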
14 changes: 7 additions & 7 deletions START_HERE/features.csv
@@ -1,7 +1,7 @@
Select for...,with value...,Alias by...,Classify as...,Hierarchy,Strand,5' End Nucleotide,Length,Overlap,Feature Source
Class,mask,Alias,,1,both,all,all,Partial,./reference_data/ram1.gff3
Class,miRNA,Alias,,2,sense,all,16-22,Full,./reference_data/ram1.gff3
Class,piRNA,Alias,5pA,2,both,A,24-32,Full,./reference_data/ram1.gff3
Class,piRNA,Alias,5pT,2,both,T,24-32,Full,./reference_data/ram1.gff3
Class,siRNA,Alias,,2,both,all,15-22,Full,./reference_data/ram1.gff3
Class,unk,Alias,,3,both,all,all,Full,./reference_data/ram1.gff3
Select for...,with value...,Classify as...,Hierarchy,Strand,5' End Nucleotide,Length,Overlap
Class,mask,,1,both,all,all,Partial
Class,miRNA,,2,sense,all,16-22,Full
Class,piRNA,5pA,2,both,A,24-32,Full
Class,piRNA,5pT,2,both,T,24-32,Full
Class,siRNA,,2,both,all,15-22,Full
Class,unk,,3,both,all,all,Full
12 changes: 10 additions & 2 deletions START_HERE/paths.yml
Expand Up @@ -5,15 +5,23 @@
#
# Directions:
# 1. Fill out the Samples Sheet with files to process + naming scheme. [samples.csv]
# 2. Fill out the Features Sheet with reference files and selection rules [features.csv]
# 3. Set samples_csv and features_csv to point to these files
# 2. Fill out the Features Sheet with selection rules [features.csv]
# 3. Set samples_csv and features_csv (below) to point to these files
# 4. Add annotation files and per-file alias preferences to gff_files
#
######-------------------------------------------------------------------------------######

##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --##
samples_csv: ./samples.csv
features_csv: ./features.csv

##-- Each entry: 1. the file, 2. (optional) list of attribute keys for feature aliases --##
gff_files:
- path: "./reference_data/ram1.gff3"
  alias: [Alias]
#- path:
#  alias: [ ]

##-- The final output directory for files produced by the pipeline --##
run_directory: run_directory

Expand Down
47 changes: 22 additions & 25 deletions START_HERE/run_config.yml
@@ -1,21 +1,21 @@
######----------------------------- tinyRNA Configuration -----------------------------######
#
# In this file you may specify your configuration preferences for the workflow and
# In this file you can specify your configuration preferences for the workflow and
# each workflow step.
#
# If you want to use DEFAULT settings for the workflow, all you need to do is provide the path
# to your Samples Sheet and Features Sheet in your Paths file, then make sure that the
# 'paths_config' setting below points to your Paths file.
# to your Samples Sheet and Features Sheet in your Paths File, then make sure that the
# 'paths_config' setting below points to your Paths File.
#
# We suggest that you also:
# 1. Add a username to identify the person performing runs, if desired for record keeping
# 2. Add a run directory name in your Paths file. If not provided, "run_directory" is used
# 2. Add a run directory name in your Paths File. If not provided, "run_directory" is used
# 3. Add a run name to label your run directory and run-specific summary reports.
# If not provided, user_tinyrna will be used.
#
# This file will be further processed at run time to generate the appropriate pipeline
# settings for each workflow step. A copy of this processed configuration will be stored
# in your run directory (as specified by your Paths configuration file).
# in your run directory.
#
######-------------------------------------------------------------------------------######

Expand Down Expand Up @@ -48,7 +48,7 @@ run_native: false
# paths_config file.
#
# We have specified default parameters for small RNA data based on our own "best practices".
# You may change the parameters here.
# You can change the parameters here.
#
######-------------------------------------------------------------------------------######

Expand All @@ -75,7 +75,7 @@ ftabchars: ~
# pipeline github: https://github.com/MontgomeryLab/tinyrna
#
# We have specified default parameters for small RNA data based on our own "best practices".
# You may change the parameters here.
# You can change the parameters here.
#
######-------------------------------------------------------------------------------######

Expand Down Expand Up @@ -135,7 +135,7 @@ compression: 4
# Trimming takes place prior to counting/collapsing.
#
# We have specified default parameters for small RNA data based on our own "best practices".
# You may change the parameters here.
# You can change the parameters here.
#
######-------------------------------------------------------------------------------######

Expand All @@ -157,7 +157,7 @@ compress: False
# We use bowtie for read alignment to a genome.
#
# We have specified default parameters for small RNA data based on our own "best practices".
# You may change the parameters here.
# You can change the parameters here.
#
######-------------------------------------------------------------------------------######

Expand Down Expand Up @@ -263,12 +263,11 @@ dge_drop_zero: False

######-------------------------------- PLOTTING OPTIONS -----------------------------######
#
# We use a custom Python script for creating all plots. The default base style is called
# 'smrna-light'. If you wish to use another matplotlib stylesheet you may specify that in
# the Paths File.
# We use a custom Python script for creating all plots. If you wish to use another matplotlib
# stylesheet you can specify that in the Paths File.
#
# We have specified default parameters for small RNA data based on our own "best practices".
# You may change the parameters here.
# You can change the parameters here.
#
######-------------------------------------------------------------------------------######

Expand Down Expand Up @@ -303,7 +302,7 @@ plot_unassigned_class: "_UNASSIGNED_"
######----------------------------- OUTPUT DIRECTORIES ------------------------------######
#
# Outputs for each step are organized into their own subdirectories in your run
# directory. You may set these folder names here.
# directory. You can set these folder names here.
#
######-------------------------------------------------------------------------------######

Expand All @@ -320,32 +319,34 @@ dir_name_plotter: plots
######################### AUTOMATICALLY GENERATED CONFIGURATIONS #########################
#
# Do not make any changes to the following sections. These options are automatically
# generated using your Paths file, your Samples and Features sheets, and the above
# generated using your Paths File, your Samples and Features sheets, and the above
# settings in this file.
#
###########################################################################################


######--------------------------- DERIVED FROM PATHS SHEET --------------------------######
######--------------------------- DERIVED FROM PATHS FILE ---------------------------######
#
# The following configuration settings are automatically derived from the sample sheet
# The following configuration settings are automatically derived from the Paths File
#
######-------------------------------------------------------------------------------######

run_directory: ~
tmp_directory: ~
features_csv: { }
samples_csv: { }
paths_file: { }
gff_files: [ ]
run_bowtie_build: false
reference_genome_files: [ ]
plot_style_sheet: ~
adapter_fasta: ~
ebwt: ~


######-------------------------- DERIVED FROM SAMPLE SHEET --------------------------######
######------------------------- DERIVED FROM SAMPLES SHEET --------------------------######
#
# The following configuration settings are automatically derived from the sample sheet
# The following configuration settings are automatically derived from the Samples Sheet
#
######-------------------------------------------------------------------------------######

Expand All @@ -370,10 +371,6 @@ run_deseq: True

######------------------------- DERIVED FROM FEATURES SHEET -------------------------######
#
# The following configuration settings are automatically derived from the sample sheet
# The following configuration settings are automatically derived from the Features Sheet
#
######-------------------------------------------------------------------------------######

###-- Utilized by tiny-count --###
# a list of only unique GFF files
gff_files: [ ]
######-------------------------------------------------------------------------------######
31 changes: 22 additions & 9 deletions doc/Configuration.md
Expand Up @@ -14,15 +14,15 @@ tiny get-template

## Overview

>**Tip**: Each of the following will allow you to map out paths to your input files for analysis. You can use either relative or absolute paths to do so. **Relative paths will be evaluated relative to the file in which they are defined.** This allows you to flexibly organize and share configurations between projects.
>**Tip**: You can use either relative or absolute paths for your file inputs. **Relative paths will be evaluated relative to the file in which they are defined.** This allows you to flexibly organize and share configurations between projects.
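
To make the relative-path rule concrete, here is a minimal Python sketch of resolving a path against the file that defines it. The function name `resolve_from` and the example locations are hypothetical; this is not the project's own resolution code, only an illustration of the behavior the tip describes.

```python
import os
from pathlib import Path


def resolve_from(config_file, path):
    """Resolve `path` relative to the directory containing `config_file`.

    Absolute paths are returned unchanged. Hypothetical helper for
    illustration; not part of tinyRNA.
    """
    path = Path(path)
    if path.is_absolute():
        return path
    base = Path(config_file).parent
    # normpath collapses "." and ".." segments without touching the filesystem
    return Path(os.path.normpath(base / path))


# A Paths File at /projects/run1/paths.yml that lists "./samples.csv"
print(resolve_from("/projects/run1/paths.yml", "./samples.csv"))
# A path that climbs out of the config directory
print(resolve_from("/projects/run1/paths.yml", "../shared/features.csv"))
```

Because each path is anchored to the file that mentions it, a configuration folder can be moved or shared as a unit without editing the paths inside it.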

#### Run Config

The overall behavior of the pipeline and its steps is determined by the Run Config file (`run_config.yml`). This YAML file can be edited using a simple text editor. Within it you must specify the location of your Paths file (`paths.yml`). All other settings are optional. [More info](#run-config-details).
The overall behavior of the pipeline and its steps is determined by the Run Config file (`run_config.yml`). This YAML file can be edited using a simple text editor. Within it you must specify the location of your Paths File (`paths.yml`). All other settings are optional. [More info](#run-config-details).

#### Paths File

The locations of pipeline file inputs are defined in the Paths file (`paths.yml`). This YAML file includes paths to your Samples and Features Sheets, in addition to your bowtie index prefix (optional) and the final run directory name. The final run directory will contain all pipeline outputs. The directory name is prepended with the `run_name` and current date and time to keep outputs separate. [More info](#paths-file-details).
The locations of pipeline file inputs are defined in the Paths File (`paths.yml`). This YAML file includes paths to your configuration files, your GFF files, and your bowtie indexes and/or reference genome. [More info](#paths-file-details).

#### Samples Sheet

Expand Down Expand Up @@ -91,6 +91,15 @@ When the pipeline starts up, tinyRNA will process the Run Config based on the co

## Paths File Details

### GFF Files
GFF annotations are required by tinyRNA. For each file, you can optionally provide an `alias`: a list of attribute keys whose values represent each feature in the Feature Name column of the output counts tables. Each entry under the `gff_files` parameter must look something like the following mock example:
```yaml
- path: 'a/path/to/your/file.gff' # 0 spaces before -
  alias: [optional, list, of attributes] # 2 spaces before alias

# ^ Each new GFF path must begin with -
```
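
As a hedged illustration of how entries like the mock example above might be checked once the YAML has been loaded, the sketch below validates a parsed `gff_files` list using only the standard library. The function `normalize_gff_entries` is hypothetical, and the choices to skip entries with an empty `path` and to accept a bare string for `alias` are assumptions, not documented tinyRNA behavior.

```python
def normalize_gff_entries(entries):
    """Validate a parsed `gff_files` list and return (path, alias_list) tuples.

    `entries` is what a YAML loader would produce for the `gff_files` key:
    a list of mappings, each with a `path` and an optional `alias` list.
    Hypothetical helper; not part of tinyRNA.
    """
    normalized = []
    for i, entry in enumerate(entries or []):
        if not isinstance(entry, dict):
            raise TypeError(f"gff_files entry {i} must be a mapping")
        path = entry.get("path")
        if not path:
            continue  # assumed: blank template entries are skipped
        alias = entry.get("alias") or []
        if isinstance(alias, str):
            alias = [alias]  # assumed: tolerate a single bare string
        normalized.append((str(path), [str(a) for a in alias]))
    return normalized


parsed = [
    {"path": "./reference_data/ram1.gff3", "alias": ["Alias"]},
    {"path": "./reference_data/other.gff3"},  # hypothetical file, no alias
    {"path": None, "alias": []},              # empty template entry
]
print(normalize_gff_entries(parsed))
```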

### Building Bowtie Indexes
If you don't have bowtie indexes already built for your reference genome, tinyRNA can build them for you at the beginning of an end-to-end run and reuse them on subsequent runs with the same Paths File.

Expand All @@ -101,6 +110,12 @@ To build bowtie indexes:

Once your indexes have been built, your Paths File will be modified such that `ebwt` points to their location (prefix) within your Run Directory. This means that indexes will not be unnecessarily rebuilt on subsequent runs as long as the same Paths File is used. If you need them rebuilt, simply repeat steps 2 and 3 above.

### The Run Directory
The final output directory name has three components:
- The `run_name` defined in your Run Config
- The date and time at pipeline startup
- The `run_directory` basename defined in your Paths File
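
The three components listed above could be combined along these lines. This is only a sketch: the separator, the timestamp format, and the function name `final_run_directory` are assumptions made for illustration, and the actual directory name produced by tinyRNA may be formatted differently.

```python
from datetime import datetime
from pathlib import Path


def final_run_directory(run_name, run_directory, started=None,
                        time_format="%Y-%m-%d_%H-%M-%S", sep="_"):
    """Prefix the run_directory basename with run_name and a timestamp.

    Hypothetical helper; `time_format` and `sep` are assumed values.
    """
    started = started or datetime.now()
    target = Path(run_directory)
    name = sep.join([run_name, started.strftime(time_format), target.name])
    # Only the basename changes; any parent directories are preserved
    return target.with_name(name)


startup = datetime(2022, 10, 28, 9, 30, 0)  # fixed time for a repeatable example
print(final_run_directory("user_tinyrna", "run_directory", startup))
```

Including the timestamp keeps the outputs of repeated runs with the same `run_name` in separate directories.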

## Samples Sheet Details
| _Column:_ | Input FASTQ Files | Sample/Group Name | Replicate Number | Control | Normalization |
|-----------:|---------------------|-------------------|------------------|---------|---------------|
Expand All @@ -123,19 +138,17 @@ Supported values are:
DESeq2 requires that your experiment design has at least one degree of freedom. If your experiment doesn't include at least one sample group with more than one replicate, tiny-deseq.r will be skipped and DGE related plots will not be produced.

## Features Sheet Details
| _Column:_ | Select for... | with value... | Alias by... | Classify as... | Hierarchy | Strand | 5' End Nucleotide | Length | Overlap | Feature Source |
|------------|---------------|---------------|-------------|----------------|-----------|--------|-------------------|--------|-------------|----------------|
| _Example:_ | Class | miRNA | Name | miRNA | 1 | sense | all | all | 5' anchored | ram1.gff3 |
| _Column:_ | Select for... | with value... | Classify as... | Hierarchy | Strand | 5' End Nucleotide | Length | Overlap |
|------------|---------------|---------------|----------------|-----------|--------|-------------------|--------|-------------|
| _Example:_ | Class | miRNA | miRNA | 1 | sense | all | all | 5' anchored |

The Features Sheet allows you to define selection rules that determine how features are chosen when multiple features are found to overlap an alignment locus. Selected features are "assigned" a portion of the reads associated with the alignment.

Rules apply to features parsed from **all** Feature Sources, with the exception of "Alias by..." which only applies to the Feature Source on the same row. Selection first takes place against feature attributes (GFF column 9), and is directed by defining the attribute you want to be considered (Select for...) and the acceptable values for that attribute (with value...).
Selection first takes place against the feature attributes defined in your GFF files, and is directed by defining the attribute you want to be considered (Select for...) and the acceptable values for that attribute (with value...).

Rules that match features in the first stage of selection will be used in a second stage which evaluates alignment vs. feature interval overlap. These matches are sorted by hierarchy value and passed to the third and final stage of selection which examines characteristics of the alignment itself: strand relative to the feature of interest, 5' end nucleotide, and length.
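
The first stage and the hierarchy ordering can be sketched as below. This is a simplified, hypothetical illustration rather than tiny-count's implementation: it omits the overlap evaluation and the alignment-based checks entirely, and it assumes exact, case-sensitive value matching and ascending hierarchy order. The rule dictionaries mirror the Features Sheet column names, and the mock feature is invented for the example.

```python
# Hypothetical sketch; not tiny-count's implementation.

def stage1_matches(feature_attrs, rules):
    """Stage 1: indexes of rules whose key/value pair appears in the feature."""
    matches = []
    for index, rule in enumerate(rules):
        values = feature_attrs.get(rule["Select for..."], [])
        if rule["with value..."] in values:  # exact, case-sensitive (assumed)
            matches.append(index)
    return matches


def sort_by_hierarchy(matches, rules):
    """Order matched rule indexes by Hierarchy value (ascending is assumed)."""
    return sorted(matches, key=lambda i: int(rules[i]["Hierarchy"]))


rules = [
    {"Select for...": "Class", "with value...": "unk", "Hierarchy": "3"},
    {"Select for...": "Class", "with value...": "miRNA", "Hierarchy": "2"},
    {"Select for...": "Class", "with value...": "mask", "Hierarchy": "1"},
]
feature_attrs = {"ID": ["406904"], "Class": ["miRNA", "mask"]}  # mock feature

matched = stage1_matches(feature_attrs, rules)
print(matched)                            # [1, 2]
print(sort_by_hierarchy(matched, rules))  # [2, 1]
```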

See [tiny-count's documentation](tiny-count.md#feature-selection) for an explanation of each column.

>**Tip**: Don't worry about having duplicate Feature Source entries. Each GFF file is parsed only once.

## Plot Stylesheet Details
Matplotlib uses key-value "rc parameters" to allow for customization of its properties and styles, and one way these parameters can be specified is with a [matplotlibrc file](https://matplotlib.org/3.4.3/tutorials/introductory/customizing.html#a-sample-matplotlibrc-file), which we simply refer to as the Plot Stylesheet. You can obtain a copy of the default stylesheet used by tiny-plot with the command `tiny get-template`. Please keep in mind that tiny-plot overrides these defaults for a few specific elements of certain plots. Feel free to reach out if there is a plot style you wish to override but find you are unable to.
10 changes: 4 additions & 6 deletions doc/Parameters.md
Expand Up @@ -122,8 +122,8 @@ Diagnostic information will include intermediate alignment files for each librar

### Full tiny-count Help String
```
tiny-count -i SAMPLES -f FEATURES -o OUTPUTPREFIX [-h]
[-sf [SOURCE ...]] [-tf [TYPE ...]] [-nh T/F] [-dc] [-a]
tiny-count -pf PATHS -o OUTPUTPREFIX [-h] [-sf [SOURCE ...]]
[-tf [TYPE ...]] [-nh T/F] [-dc] [-sv {Cython,HTSeq}] [-a]
[-p] [-d]

This submodule assigns feature counts for SAM alignments using a Feature Sheet
Expand All @@ -132,10 +132,8 @@ prior run, we recommend that you instead run `tiny recount` within that run's
directory.

Required arguments:
-i SAMPLES, --samples-csv SAMPLES
your Samples Sheet
-f FEATURES, --features-csv FEATURES
your Features Sheet
-pf PATHS, --paths-file PATHS
your Paths File
-o OUTPUTPREFIX, --out-prefix OUTPUTPREFIX
output prefix to use for file names

Expand Down