MontgomeryLab · taimontgomery · Nov 2, 2022 · Oct 15, 2022
diff --git a/README.md b/README.md
@@ -105,8 +105,8 @@ In most cases you will use this toolset as an end-to-end pipeline. This will run
 2. The genome sequence of interest in fasta format.
 3. Genome coordinates of small RNA features of interest in GFF format.
 4. A completed Samples Sheet (`samples.csv`) with paths to the fastq files.
-5. A completed Features Sheet (`features.csv`) with paths to the GFF file(s).
-6. An updated Paths File (`paths.yml`) with the path to the genome sequence and/or your bowtie index prefix.
+5. A completed Features Sheet (`features.csv`) with feature selection rules.
+6. An updated Paths File (`paths.yml`) with paths to your GFF files, the genome sequence and/or your bowtie index prefix, as well as the paths to `samples.csv` and `features.csv`.
 7. A Run Config file (`run_config.yml`) located in your working directory or the path to the file. The template provided does not need to be updated if you wish to use the default settings.
 
 To run an end-to-end analysis, be sure that you're working within the conda tinyrna environment ([instructions above](#usage)) in your terminal and optionally navigate to the location of your Run Config file. Then, simply run the following in your terminal:
@@ -177,13 +177,13 @@ The tiny-count step produces a variety of outputs
 Custom Python scripts and HTSeq are used to generate a single table of feature counts which includes each counted library. Each matched feature is represented with the following metadata columns:
 - **_Feature ID_** is determined, in order of preference, by one of the following GFF column 9 attributes: `ID`, `gene_id`, `Parent`. 
 - **_Classifier_** is determined by the rules in your Features Sheet. It is the _Classify as..._ value of each matching rule. Since multiple rules can match a feature, some Feature IDs will be listed multiple times with different classifiers.
-- **_Feature Name_** displays aliases of your choice, as specified in the _Alias by..._ column of the Features Sheet. If _Alias by..._ is set to`ID`, the _Feature Name_ column is left empty.
+- **_Feature Name_** displays aliases of your choice, as specified in the `alias` key under each GFF listed in your Paths File. If `alias` is set to `ID`, the _Feature Name_ column is left empty.
 
-For example, if your Features Sheet has a rule which specifies _Alias by..._ `sequence_name`, _Classify as..._ `miRNA`, and the GFF entry for this feature has the following attributes column:
+For example, if your Paths File has a GFF entry which specifies `alias: [sequence_name]`, and the corresponding GFF file has a feature with the following attributes column:
 ```
 ... ID=406904;sequence_name=mir-1,hsa-miR-1; ...
 ```
-The row for this feature in the feature counts table would read:
+And this feature matched a rule in your Features Sheet defining _Classify as..._ `miRNA`, then the entry for this feature in the final counts table would read:
 
 | Feature ID | Classifier | Feature Name     | Group1_rep_1 | Group1_rep_2 | ... |
 |------------|------------|------------------|--------------|--------------|-----|

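The Feature ID and Feature Name rules described in the README hunk above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `parse_gff_attributes` and `feature_id_and_names` are hypothetical helpers, not tinyRNA functions, and the sketch returns the aliases as a plain list rather than guessing how the counts table formats them.

```python
def parse_gff_attributes(column9: str) -> dict:
    """Split a GFF column 9 string into {key: [value, ...]}."""
    attrs = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return attrs


def feature_id_and_names(column9: str, alias_keys: list) -> tuple:
    """Hypothetical helper: return (feature_id, aliases) for one GFF record."""
    attrs = parse_gff_attributes(column9)

    # Feature ID: first attribute present, in the documented order of preference.
    feature_id = None
    for key in ("ID", "gene_id", "Parent"):
        if attrs.get(key):
            feature_id = attrs[key][0]
            break

    # Feature Name: values of each alias key; an alias of `ID` contributes nothing.
    aliases = []
    for key in alias_keys:
        if key == "ID":
            continue
        for value in attrs.get(key, []):
            if value not in aliases:
                aliases.append(value)
    return feature_id, aliases


feature_id, names = feature_id_and_names(
    "ID=406904;sequence_name=mir-1,hsa-miR-1;", ["sequence_name"]
)
print(feature_id, names)  # 406904 ['mir-1', 'hsa-miR-1']
```

With `alias: [ID]` the same record would yield an empty alias list, which corresponds to the empty Feature Name column described above.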
diff --git a/START_HERE/features.csv b/START_HERE/features.csv
@@ -1,7 +1,7 @@
-Select for...,with value...,Alias by...,Classify as...,Hierarchy,Strand,5' End Nucleotide,Length,Overlap,Feature Source
-Class,mask,Alias,,1,both,all,all,Partial,./reference_data/ram1.gff3
-Class,miRNA,Alias,,2,sense,all,16-22,Full,./reference_data/ram1.gff3
-Class,piRNA,Alias,5pA,2,both,A,24-32,Full,./reference_data/ram1.gff3
-Class,piRNA,Alias,5pT,2,both,T,24-32,Full,./reference_data/ram1.gff3
-Class,siRNA,Alias,,2,both,all,15-22,Full,./reference_data/ram1.gff3
-Class,unk,Alias,,3,both,all,all,Full,./reference_data/ram1.gff3
+Select for...,with value...,Classify as...,Hierarchy,Strand,5' End Nucleotide,Length,Overlap
+Class,mask,,1,both,all,all,Partial
+Class,miRNA,,2,sense,all,16-22,Full
+Class,piRNA,5pA,2,both,A,24-32,Full
+Class,piRNA,5pT,2,both,T,24-32,Full
+Class,siRNA,,2,both,all,15-22,Full
+Class,unk,,3,both,all,all,Full
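For anyone migrating an existing Features Sheet by hand, the change above amounts to dropping the `Alias by...` and `Feature Source` columns and moving that information into `gff_files` in the Paths File. The following sketch shows one way to do that split with the standard library. `split_old_features_sheet` is a hypothetical helper, not something shipped with tinyRNA, and collapsing rules that become identical once the two columns are removed is an assumption of this sketch.

```python
import csv
import io

# Columns that exist only in the old Features Sheet format
OLD_ONLY_COLUMNS = ("Alias by...", "Feature Source")


def split_old_features_sheet(old_csv_text: str):
    """Hypothetical helper: return (new_csv_text, gff_files) from an old-format sheet."""
    reader = csv.DictReader(io.StringIO(old_csv_text))
    new_header = [c for c in reader.fieldnames if c not in OLD_ONLY_COLUMNS]

    rules, seen = [], set()
    aliases_by_source = {}  # Feature Source -> ordered list of alias keys
    for row in reader:
        aliases = aliases_by_source.setdefault(row["Feature Source"], [])
        alias = row["Alias by..."].strip()
        if alias and alias not in aliases:
            aliases.append(alias)

        # Assumption: rules that are identical after dropping columns are kept once.
        rule = tuple(row[c] for c in new_header)
        if rule not in seen:
            seen.add(rule)
            rules.append(rule)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(new_header)
    writer.writerows(rules)

    gff_files = [{"path": p, "alias": a} for p, a in aliases_by_source.items()]
    return out.getvalue(), gff_files


old_sheet = (
    "Select for...,with value...,Alias by...,Classify as...,Hierarchy,Strand,"
    "5' End Nucleotide,Length,Overlap,Feature Source\n"
    "Class,mask,Alias,,1,both,all,all,Partial,./reference_data/ram1.gff3\n"
    "Class,miRNA,Alias,,2,sense,all,16-22,Full,./reference_data/ram1.gff3\n"
)
new_sheet, gff_files = split_old_features_sheet(old_sheet)
print(new_sheet)
print(gff_files)  # [{'path': './reference_data/ram1.gff3', 'alias': ['Alias']}]
```

The returned `gff_files` list has the same shape as the `gff_files` entries added to `paths.yml` in this change.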
diff --git a/START_HERE/paths.yml b/START_HERE/paths.yml
@@ -5,15 +5,23 @@
 #
 # Directions:
 #   1. Fill out the Samples Sheet with files to process + naming scheme. [samples.csv]
-#   2. Fill out the Features Sheet with reference files and selection rules [features.csv]
-#   3. Set samples_csv and features_csv to point to these files
+#   2. Fill out the Features Sheet with selection rules [features.csv]
+#   3. Set samples_csv and features_csv (below) to point to these files
+#   4. Add annotation files and per-file alias preferences to gff_files
 #
 ######-------------------------------------------------------------------------------######
 
 ##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --##
 samples_csv: ./samples.csv
 features_csv: ./features.csv
 
+##-- Each entry: 1. the file, 2. (optional) list of attribute keys for feature aliases --##
+gff_files:
+- path: "./reference_data/ram1.gff3"
+  alias: [Alias]
+#- path:
+#  alias: [ ]
+
 ##-- The final output directory for files produced by the pipeline --#
 run_directory: run_directory
 

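To make the new `gff_files` layout concrete, here is a minimal sketch of how its entries could be resolved once the Paths File has been parsed. It assumes the YAML has already been loaded into a plain dictionary (no YAML library is called here), and `resolve_gff_files` is a hypothetical name. Skipping entries with an empty `path` is also an assumption of this sketch, not documented tinyRNA behavior.

```python
from pathlib import Path


def resolve_gff_files(paths_config: dict, paths_file: str) -> list:
    """Hypothetical helper: resolve gff_files entries relative to the Paths File."""
    base = Path(paths_file).parent
    resolved = []
    for entry in paths_config.get("gff_files") or []:
        path = entry.get("path")
        if not path:
            # Assumption: an unfilled template entry is ignored.
            continue
        resolved.append({
            # Joining with an absolute path leaves it unchanged.
            "path": base / path,
            # `alias` is optional; treat a missing or empty value as no aliases.
            "alias": list(entry.get("alias") or []),
        })
    return resolved


# The dictionary a YAML parser would produce from the template above,
# with the commented-out second entry uncommented but left blank.
config = {
    "samples_csv": "./samples.csv",
    "features_csv": "./features.csv",
    "gff_files": [
        {"path": "./reference_data/ram1.gff3", "alias": ["Alias"]},
        {"path": None, "alias": []},
    ],
}

for entry in resolve_gff_files(config, "/project/START_HERE/paths.yml"):
    print(entry["path"], entry["alias"])
```

Resolving against the Paths File's own directory follows the documented rule that relative paths are evaluated relative to the file in which they are defined.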
diff --git a/START_HERE/run_config.yml b/START_HERE/run_config.yml
@@ -1,21 +1,21 @@
 ######----------------------------- tinyRNA Configuration -----------------------------######
 #
-# In this file you may specify your configuration preferences for the workflow and
+# In this file you can specify your configuration preferences for the workflow and
 # each workflow step.
 #
 # If you want to use DEFAULT settings for the workflow, all you need to do is provide the path
-# to your Samples Sheet and Features Sheet in your Paths file, then make sure that the
-# 'paths_config' setting below points to your Paths file.
+# to your Samples Sheet and Features Sheet in your Paths File, then make sure that the
+# 'paths_config' setting below points to your Paths File.
 #
 # We suggest that you also:
 #   1. Add a username to identify the person performing runs, if desired for record keeping
-#   2. Add a run directory name in your Paths file. If not provided, "run_directory" is used
+#   2. Add a run directory name in your Paths File. If not provided, "run_directory" is used
 #   3. Add a run name to label your run directory and run-specific summary reports.
 #      If not provided, user_tinyrna will be used.
 #
 # This file will be further processed at run time to generate the appropriate pipeline
 # settings for each workflow step. A copy of this processed configuration will be stored
-# in your run directory (as specified by your Paths configuration file).
+# in your run directory.
 #
 ######-------------------------------------------------------------------------------######
 
@@ -48,7 +48,7 @@ run_native: false
 # paths_config file.
 #
 # We have specified default parameters for small RNA data based on our own "best practices".
-# You may change the parameters here.
+# You can change the parameters here.
 #
 ######-------------------------------------------------------------------------------######
 
@@ -75,7 +75,7 @@ ftabchars: ~
 # pipeline github: https://github.com/MontgomeryLab/tinyrna
 #
 # We have specified default parameters for small RNA data based on our own "best practices".
-# You may change the parameters here.
+# You can change the parameters here.
 #
 ######-------------------------------------------------------------------------------######
 
@@ -135,7 +135,7 @@ compression: 4
 # Trimming takes place prior to counting/collapsing.
 #
 # We have specified default parameters for small RNA data based on our own "best practices".
-# You may change the parameters here.
+# You can change the parameters here.
 #
 ######-------------------------------------------------------------------------------######
 
@@ -157,7 +157,7 @@ compress: False
 # We use bowtie for read alignment to a genome.
 #
 # We have specified default parameters for small RNA data based on our own "best practices".
-# You may change the parameters here.
+# You can change the parameters here.
 #
 ######-------------------------------------------------------------------------------######
 
@@ -263,12 +263,11 @@ dge_drop_zero: False
 
 ######-------------------------------- PLOTTING OPTIONS -----------------------------######
 #
-# We use a custom Python script for creating all plots. The default base style is called
-# 'smrna-light'. If you wish to use another matplotlib stylesheet you may specify that in
-# the Paths File.
+# We use a custom Python script for creating all plots. If you wish to use another matplotlib
+# stylesheet you can specify that in the Paths File.
 #
 # We have specified default parameters for small RNA data based on our own "best practices".
-# You may change the parameters here.
+# You can change the parameters here.
 #
 ######-------------------------------------------------------------------------------######
 
@@ -303,7 +302,7 @@ plot_unassigned_class: "_UNASSIGNED_"
 ######----------------------------- OUTPUT DIRECTORIES ------------------------------######
 #
 # Outputs for each step are organized into their own subdirectories in your run
-# directory. You may set these folder names here.
+# directory. You can set these folder names here.
 #
 ######-------------------------------------------------------------------------------######
 
@@ -320,32 +319,34 @@ dir_name_plotter: plots
 #########################  AUTOMATICALLY GENERATED CONFIGURATIONS #########################
 #
 # Do not make any changes to the following sections. These options are automatically
-# generated using your Paths file, your Samples and Features sheets, and the above
+# generated using your Paths File, your Samples and Features sheets, and the above
 # settings in this file.
 #
 ###########################################################################################
 
 
-######--------------------------- DERIVED FROM PATHS SHEET --------------------------######
+######--------------------------- DERIVED FROM PATHS FILE ---------------------------######
 #
-# The following configuration settings are automatically derived from the sample sheet
+# The following configuration settings are automatically derived from the Paths File
 #
 ######-------------------------------------------------------------------------------######
 
 run_directory: ~
 tmp_directory: ~
 features_csv: { }
 samples_csv: { }
+paths_file: { }
+gff_files: [ ]
 run_bowtie_build: false
 reference_genome_files: [ ]
 plot_style_sheet: ~
 adapter_fasta: ~
 ebwt: ~
 
 
-######-------------------------- DERIVED FROM SAMPLE SHEET --------------------------######
+######------------------------- DERIVED FROM SAMPLES SHEET --------------------------######
 #
-# The following configuration settings are automatically derived from the sample sheet
+# The following configuration settings are automatically derived from the Samples Sheet
 #
 ######-------------------------------------------------------------------------------######
 
@@ -370,10 +371,6 @@ run_deseq: True
 
 ######------------------------- DERIVED FROM FEATURES SHEET -------------------------######
 #
-# The following configuration settings are automatically derived from the sample sheet
+# The following configuration settings are automatically derived from the Features Sheet
 #
-######-------------------------------------------------------------------------------######
-
-###-- Utilized by tiny-count --###
-# a list of only unique GFF files
-gff_files: [ ]
+######-------------------------------------------------------------------------------######
diff --git a/doc/Configuration.md b/doc/Configuration.md
@@ -14,15 +14,15 @@ tiny get-template
 
 ## Overview
 
->**Tip**: Each of the following will allow you to map out paths to your input files for analysis. You can use either relative or absolute paths to do so. **Relative paths will be evaluated relative to the file in which they are defined.** This allows you to flexibly organize and share configurations between projects.
+>**Tip**: You can use either relative or absolute paths for your file inputs. **Relative paths will be evaluated relative to the file in which they are defined.** This allows you to flexibly organize and share configurations between projects.
 
 #### Run Config
 
-The overall behavior of the pipeline and its steps is determined by the Run Config file (`run_config.yml`). This YAML file can be edited using a simple text editor. Within it you must specify the location of your Paths file (`paths.yml`). All other settings are optional. [More info](#run-config-details).
+The overall behavior of the pipeline and its steps is determined by the Run Config file (`run_config.yml`). This YAML file can be edited using a simple text editor. Within it you must specify the location of your Paths File (`paths.yml`). All other settings are optional. [More info](#run-config-details).
 
 #### Paths File
 
-The locations of pipeline file inputs are defined in the Paths file (`paths.yml`). This YAML file includes paths to your Samples and Features Sheets, in addition to your bowtie index prefix (optional) and the final run directory name. The final run directory will contain all pipeline outputs. The directory name is prepended with the `run_name` and current date and time to keep outputs separate. [More info](#paths-file-details).
+The locations of pipeline file inputs are defined in the Paths File (`paths.yml`). This YAML file includes paths to your configuration files, your GFF files, and your bowtie indexes and/or reference genome. [More info](#paths-file-details).
 
 #### Samples Sheet
 
@@ -91,6 +91,15 @@ When the pipeline starts up, tinyRNA will process the Run Config based on the co
 
 ## Paths File Details
 
+### GFF Files
+GFF annotations are required by tinyRNA. For each file, you can optionally provide an `alias`: a list of attribute keys whose values represent each feature in the Feature Name column of output counts tables. Each entry under the `gff_files` parameter should follow the format of this mock example:
+```yaml
+- path: 'a/path/to/your/file.gff'         # 0 spaces before -
+  alias: [optional, list, of attributes]  # 2 spaces before alias
+
+# Each new GFF path must begin with -
+```
+
 ### Building Bowtie Indexes
 If you don't have bowtie indexes already built for your reference genome, tinyRNA can build them for you at the beginning of an end-to-end run and reuse them on subsequent runs with the same Paths File.
 
@@ -101,6 +110,12 @@ To build bowtie indexes:
 
 Once your indexes have been built, your Paths File will be modified such that `ebwt` points to their location (prefix) within your Run Directory. This means that indexes will not be unnecessarily rebuilt on subsequent runs as long as the same Paths File is used. If you need them rebuilt, simply repeat steps 2 and 3 above.
 
+### The Run Directory
+The final output directory name has three components: 
+- The `run_name` defined in your Run Config
+- The date and time at pipeline startup
+- The `run_directory` basename defined in your Paths File
+
 ## Samples Sheet Details
 |  _Column:_ | Input FASTQ Files   | Sample/Group Name | Replicate Number | Control | Normalization |
 |-----------:|---------------------|-------------------|------------------|---------|---------------|
@@ -123,19 +138,17 @@ Supported values are:
 DESeq2 requires that your experiment design has at least one degree of freedom. If your experiment doesn't include at least one sample group with more than one replicate, tiny-deseq.r will be skipped and DGE related plots will not be produced.
 
 ## Features Sheet Details
-| _Column:_  | Select for... | with value... | Alias by... | Classify as... | Hierarchy | Strand | 5' End Nucleotide | Length | Overlap     | Feature Source |
-|------------|---------------|---------------|-------------|----------------|-----------|--------|-------------------|--------|-------------|----------------|
-| _Example:_ | Class         | miRNA         | Name        | miRNA          | 1         | sense  | all               | all    | 5' anchored | ram1.gff3      |
+| _Column:_  | Select for... | with value... | Classify as... | Hierarchy | Strand | 5' End Nucleotide | Length | Overlap     |
+|------------|---------------|---------------|----------------|-----------|--------|-------------------|--------|-------------|
+| _Example:_ | Class         | miRNA         | miRNA          | 1         | sense  | all               | all    | 5' anchored |
 
 The Features Sheet allows you to define selection rules that determine how features are chosen when multiple features are found overlap an alignment locus. Selected features are "assigned" a portion of the reads associated with the alignment.
 
-Rules apply to features parsed from **all** Feature Sources, with the exception of "Alias by..." which only applies to the Feature Source on the same row. Selection first takes place against feature attributes (GFF column 9), and is directed by defining the attribute you want to be considered (Select for...) and the acceptable values for that attribute (with value...). 
+Selection first takes place against the feature attributes (GFF column 9) defined in your GFF files. It is directed by the attribute you want considered (Select for...) and the acceptable values for that attribute (with value...).
 
 Rules that match features in the first stage of selection will be used in a second stage which evaluates alignment vs. feature interval overlap. These matches are sorted by hierarchy value and passed to the third and final stage of selection which examines characteristics of the alignment itself: strand relative to the feature of interest, 5' end nucleotide, and length. 
 
 See [tiny-count's documentation](tiny-count.md#feature-selection) for an explanation of each column.
 
->**Tip**: Don't worry about having duplicate Feature Source entries. Each GFF file is parsed only once.
-
 ## Plot Stylesheet Details
 Matplotlib uses key-value "rc parameters" to allow for customization of its properties and styles, and one way these parameters can be specified is with a [matplotlibrc file](https://matplotlib.org/3.4.3/tutorials/introductory/customizing.html#a-sample-matplotlibrc-file), which we simply refer to as the Plot Stylesheet. You can obtain a copy of the default stylesheet used by tiny-plot with the command `tiny get-template`. Please keep in mind that tiny-plot overrides these defaults for a few specific elements of certain plots. Feel free to reach out if there is a plot style you wish to override but find you are unable to.
diff --git a/doc/Parameters.md b/doc/Parameters.md
@@ -122,8 +122,8 @@ Diagnostic information will include intermediate alignment files for each librar
 
 ### Full tiny-count Help String
 ```
-tiny-count -i SAMPLES -f FEATURES -o OUTPUTPREFIX [-h]
-           [-sf [SOURCE ...]] [-tf [TYPE ...]] [-nh T/F] [-dc] [-a]
+tiny-count -pf PATHS -o OUTPUTPREFIX [-h] [-sf [SOURCE ...]]
+           [-tf [TYPE ...]] [-nh T/F] [-dc] [-sv {Cython,HTSeq}] [-a]
            [-p] [-d]
 
 This submodule assigns feature counts for SAM alignments using a Feature Sheet
@@ -132,10 +132,8 @@ prior run, we recommend that you instead run `tiny recount` within that run's
 directory.
 
 Required arguments:
-  -i SAMPLES, --samples-csv SAMPLES
-                        your Samples Sheet
-  -f FEATURES, --features-csv FEATURES
-                        your Features Sheet
+  -pf PATHS, --paths-file PATHS
+                        your Paths File
   -o OUTPUTPREFIX, --out-prefix OUTPUTPREFIX
                         output prefix to use for file names