MontgomeryLab · taimontgomery · Mar 1, 2023 · Feb 25, 2023 · Feb 25, 2023 · Feb 26, 2023
diff --git a/README.md b/README.md
@@ -24,7 +24,7 @@ tinyRNA is a set of tools to simplify the analysis of next-generation sequencing
 
 ### The Current Workflow
 
-![tinyRNA basic pipeline](images/tinyrna-workflow_current.png)
+![tinyRNA basic pipeline](images/tinyrna_workflow_current.png)
 
 ## tinyRNA Installation
 
@@ -161,7 +161,7 @@ At the core of tinyRNA is tiny-count, a highly flexible counting utility that al
 A wrapper R script for DESeq2 facilitates DGE analysis of counted sample files.
 
 ### `tiny-plot`
-The results of feature counting and DGE are visualized with high resolution plot PDFs. User-defined plot styles are also supported via a Matplotlib stylesheet. 
+The results of feature counting and DGE analysis are visualized with high resolution plot PDFs. User-defined plot styles are also supported via a Matplotlib stylesheet. 
 
 [Full documentation for tiny-plot can be found here.](doc/tiny-plot.md)
 
@@ -276,13 +276,14 @@ Simple static plots are generated from the outputs of tiny-count and tiny-deseq.
 tiny-deseq.r will produce a standard **PCA plot** from variance stabilizing transformed feature counts. This output is controlled by the `dge_pca_plot` key in the Run Config and by your experiment design. DGE outputs, including the PCA plot, will not be produced for experiments with less than 1 degree of freedom.
 
 ### Reducing Storage Usage
-The files produced by certain steps can be very large and after several runs this may present significant storage usage. You can remove the following subdirectories from a Run Directory to free up space, but **you will no longer be able to perform repeat analyses within it (i.e. `tiny recount` or `tiny replot`)**:
+The files produced by certain steps can be very large and after several runs this may present significant storage usage. You can remove the following subdirectories from a Run Directory to free up space, but **you will no longer be able to perform recount analyses within it** (i.e. `tiny recount`):
 - fastp (though we recommend keeping the reports)
 - collapser
 - bowtie
 
 Cleanup commands will be added to tinyRNA in a future release, but for now the following command will remove commonly large files while preserving report files:
 ```shell
+# Execute within the Run Directory you want to clean
 rm {fastp/*.fastq,{collapser,bowtie}/*.fa,bowtie/*.sam}
 ```
 

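Reviewer sketch (not part of the patch): since the `rm` command in the README hunk above deletes files irreversibly, a dry run can be useful first. The following is a minimal Python sketch that lists roughly what that command would match inside a Run Directory. The function names `find_cleanup_targets` and `report_cleanup` are hypothetical and are not provided by tinyRNA.

```python
from pathlib import Path

# Glob patterns mirroring the shell command in the README:
#   rm {fastp/*.fastq,{collapser,bowtie}/*.fa,bowtie/*.sam}
CLEANUP_PATTERNS = ("fastp/*.fastq", "collapser/*.fa", "bowtie/*.fa", "bowtie/*.sam")


def find_cleanup_targets(run_directory):
    """Return the files matched by the cleanup patterns, sorted by path."""
    root = Path(run_directory)
    targets = []
    for pattern in CLEANUP_PATTERNS:
        targets.extend(p for p in root.glob(pattern) if p.is_file())
    return sorted(targets)


def report_cleanup(run_directory):
    """Print each candidate file with its size; return the total bytes (dry run only)."""
    total = 0
    for path in find_cleanup_targets(run_directory):
        size = path.stat().st_size
        total += size
        print(f"{size:>12}  {path}")
    return total
```

Calling `report_cleanup(".")` from within a Run Directory prints the candidates without deleting anything; report files such as the fastp reports are not matched by these patterns.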
diff --git a/START_HERE/paths.yml b/START_HERE/paths.yml
@@ -22,7 +22,7 @@ gff_files:
 #- path:
 #  alias: [ ]
 
-##-- The final output directory for files produced by the pipeline --#
+##-- The suffix to use in the final output directory name (optional) --#
 run_directory: run_directory
 
 ##-- The directory for temporary files. Determined by cwltool if blank. --##

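Reviewer sketch (not part of the patch): the `run_directory` key is now described as a suffix, and doc/Configuration.md states that the final directory name combines the `run_name`, the startup date and time, and the basename of `run_directory`, with any subdirectory structure retained. The Python sketch below illustrates that composition. The separator, the timestamp format, the component order within the name, and the function name `build_run_directory` are all assumptions made for illustration; the diff does not specify them.

```python
from datetime import datetime
from pathlib import PurePosixPath


def build_run_directory(run_name, run_directory, started, sep="_"):
    """Compose an output path from the three documented components.

    ASSUMPTIONS (not confirmed by the diff): the separator `sep`, the
    timestamp format, and the order of components in the final name.
    """
    suffix = PurePosixPath(run_directory)
    timestamp = started.strftime("%Y-%m-%d_%H-%M-%S")  # hypothetical format
    final_name = sep.join([run_name, timestamp, suffix.name])
    # Any subdirectory structure given in run_directory is retained
    return suffix.parent / final_name
```

For example, under these assumptions `build_run_directory("tinyrna", "results/exp1/run_directory", datetime.now())` would place the timestamped directory beneath `results/exp1/`.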
diff --git a/START_HERE/run_config.yml b/START_HERE/run_config.yml
diff --git a/doc/Configuration.md b/doc/Configuration.md
@@ -92,7 +92,7 @@ When the pipeline starts up, tinyRNA will process the Run Config based on the co
 ## Paths File Details
 
 ### GFF Files
-GFF annotations are required by tinyRNA. For each file, you can optionally provide an `alias` which is a list of attributes to represent each feature in the Feature Name column of output counts tables. Each entry under the `gff_files` parameter must look something like the following mock example:
+GFF annotations are optional but recommended. If not provided, tiny-count will perform [sequence-based counting](tiny-count.md#sequence-based-counting-mode) rather than feature-based counting. For each file, you can optionally provide an `alias` which is a list of attributes to represent each feature in the Feature Name column of output counts tables. Each entry under the `gff_files` parameter must look something like the following mock example:
 ```yaml
   - path: 'a/path/to/your/file.gff'         # 0 spaces before -
     alias: [optional, list, of attributes]  # 2 spaces before alias
@@ -112,24 +112,28 @@ Once your indexes have been built, your Paths File will be modified such that `e
 
 ### The Run Directory
 The final output directory name has three components: 
-- The `run_name` defined in your Run Config
-- The date and time at pipeline startup
-- The `run_directory` basename defined in your Paths File
+1. The `run_name` defined in your Run Config
+2. The date and time at pipeline startup
+3. The basename of `run_directory` defined in your Paths File
+
+The `run_directory` suffix in the Paths File supports subdirectories; if subdirectories are provided, the final output directory will be named as indicated above, but the subdirectory structure specified in `run_directory` will be retained.
 
 ## Samples Sheet Details
 |  _Column:_ | FASTQ/SAM Files     | Sample/Group Name | Replicate Number | Control | Normalization |
 |-----------:|---------------------|-------------------|------------------|---------|---------------|
 | _Example:_ | cond1_rep1.fastq.gz | condition1        | 1                | True    | RPM           |
 
 ### Assigning the Control Group
-Assigning the control group allows the proper DGE comparisons to be made and plotted. The Control column is where you'll make this indication by writing `true` on any corresponding row. Regardless of the number of replicates in each group, only one associated row needs to have this indication. Do not write `false` or anything else for the other groups; this column should only be used to indicate the affirmative.
+Assigning the control group allows the proper DGE comparisons to be made and plotted. The Control column is where you'll make this indication by writing `true` on any corresponding row. Regardless of the number of replicates in each group, only one row needs to have this indication.
+
+tinyRNA doesn't support experiments with more than one control condition. However, if you omit all control condition labels, then every possible comparison will be made, which should include the desired comparisons.
 
 ### Applying Custom Normalization
 Custom normalization can be applied at the conclusion of feature counting using the Normalization column. Unlike the Control column, values in the Normalization column apply to the specific library that they share a row with.
 
 Supported values are:
 - **Blank or 1**: no normalization is applied to the corresponding library
-- **Any number**: the corresponding library's counts are divided by this number
+- **Any number**: the corresponding library's counts are divided by this number (useful for spike-in normalization)
 - **RPM or rpm**: the corresponding library's counts are divided by (its mapped read count / 1,000,000)
 
 >**NOTE**: These normalizations operate independently of tiny-count's --normalize-by-hits commandline option. The former is concerned with per-library normalization, whereas the latter is concerned with normalization by selected feature count at each locus ([more info](tiny-count.md#count-normalization)). The commandline option does not enable or disable the normalizations detailed above.
@@ -138,17 +142,37 @@ Supported values are:
 DESeq2 requires that your experiment design has at least one degree of freedom. If your experiment doesn't include at least one sample group with more than one replicate, tiny-deseq.r will be skipped and DGE related plots will not be produced.
 
 ## Features Sheet Details
-| _Column:_  | Select for... | with value... | Classify as... |  Source Filter | Type Filter | Hierarchy | Strand | 5' End Nucleotide | Length | Overlap     |
-|------------|---------------|---------------|----------------|----------------|-------------|-----------|--------|-------------------|--------|-------------|
-| _Example:_ | Class         | miRNA         | miRNA          |                |             | 1         | sense  | all               | all    | 5' anchored |
-
-The Features Sheet allows you to define selection rules that determine how features are chosen when multiple features are found overlap an alignment locus. Selected features are "assigned" a portion of the reads associated with the alignment.
-
-Selection first takes place against the feature attributes defined in your GFF files, and is directed by defining the attribute you want to be considered (Select for...) and the acceptable values for that attribute (with value...). 
-
-Rules that match features in the first stage of selection will be used in a second stage which evaluates alignment vs. feature interval overlap. These matches are sorted by hierarchy value and passed to the third and final stage of selection which examines characteristics of the alignment itself: strand relative to the feature of interest, 5' end nucleotide, and length. 
-
-See [tiny-count's documentation](tiny-count.md#feature-selection) for an explanation of each column.
+![Features Sheet Header](../images/features_sheet_header.png)
+
+The Features Sheet allows you to define selection rules that control how reads are assigned to features. We refer to each row as a rule and each column as a selector. `Classify as...` isn't a selector because it is used for labelling and subsetting matches rather than determining them. See [tiny-count's documentation](tiny-count.md#feature-selection) for an explanation of the selection process and the role that each selector plays.
+
+### Selector Formats
+Selectors in the Features Sheet can be specified as a single value, a list of comma-separated values, a range, or a wildcard. The supported formats vary from selector to selector. For list and range formats, just one of the specified values has to match for the target to be selected. Wildcard formats can be implicitly defined with a blank cell, or explicitly defined using the example keywords below.
+
+| Selector        | Wildcard | Single | List | Range | 
+|:----------------|:--------:|:------:|:----:|:-----:|
+| `Select for...` |    ✓     |   ✓    |      |       |
+| `with value...` |    ✓     |   ✓    |      |       |
+| `Source Filter` |    ✓     |   ✓    |  ✓   |       |
+| `Type Filter`   |    ✓     |   ✓    |  ✓   |       |
+| `Hierarchy`     |          |   ✓    |      |       |
+| `Overlap`       |    ✓     |   ✓    |      |       |
+| `Strand`        |    ✓     |   ✓    |      |       |
+| `5' nt`         |    ✓     |   ✓    |  ✓   |       |
+| `Length`        |    ✓     |   ✓    |  ✓   |   ✓   |
+
+Examples:
+- **Wildcard** <sup>†</sup>: `any`, `all`, `*`, or a blank cell
+- **Single**: `G` or `22`
+- **List**: `C,G,U` or `25, 26` (spaces do not matter)
+- **Range**: `20-25`
+- **Mixed** <sup>§</sup>: `19, 21-23, 25-30` 
+
+<sup>†</sup> the `Strand` selector also supports `both`<br>
+<sup>§</sup> only supported by the `Length` selector
+
+### Case Sensitivity
+All selectors are case-insensitive.
 
 ## Plot Stylesheet Details
 Matplotlib uses key-value "rc parameters" to allow for customization of its properties and styles, and one way these parameters can be specified is with a [matplotlibrc file](https://matplotlib.org/3.4.3/tutorials/introductory/customizing.html#a-sample-matplotlibrc-file), which we simply refer to as the Plot Stylesheet. You can obtain a copy of the default stylesheet used by tiny-plot with the command `tiny get-templates`. Please keep in mind that tiny-plot overrides these defaults for a few specific elements of certain plots. Feel free to reach out if there is a plot style you wish to override but find you are unable to.
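Reviewer sketch (not part of the patch): the Selector Formats section added in this file describes wildcard, single, list, range, and mixed formats for the `Length` selector. The Python sketch below shows one way such a cell could be interpreted. It is not tiny-count's implementation; the function names are hypothetical, and treating ranges as inclusive of both endpoints is an assumption.

```python
# Wildcard keywords from the documented examples; a blank cell is also a wildcard
WILDCARDS = {"", "any", "all", "*"}


def parse_length_selector(cell):
    """Return None for a wildcard, otherwise the set of accepted lengths."""
    text = cell.strip().lower()  # selectors are documented as case-insensitive
    if text in WILDCARDS:
        return None
    accepted = set()
    for part in text.split(","):
        part = part.strip()  # spaces do not matter
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            # ASSUMPTION: ranges include both endpoints
            accepted.update(range(start, end + 1))
        else:
            accepted.add(int(part))
    return accepted


def length_matches(cell, length):
    """True if `length` satisfies the selector; one matching value is enough."""
    accepted = parse_length_selector(cell)
    return accepted is None or length in accepted
```

With the documented mixed example, `length_matches("19, 21-23, 25-30", 22)` is true and `length_matches("19, 21-23, 25-30", 20)` is false, while a blank cell accepts any length.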
diff --git a/doc/Pipeline.md b/doc/Pipeline.md
@@ -32,7 +32,8 @@ The commands `tiny recount` and `tiny replot` seek to solve this problem. As dis
 
 You can modify the behavior of a resume run by changing settings in:
 - The **processed** Run Config
-- The **original** Features Sheet that was used for the end-to-end run (as indicated by the `features_csv` key in the **processed** Run Config)
+- The **original** Features Sheet that was used for the end-to-end run (as indicated by `features_csv` in the processed Run Config)
+- The **original** Paths File (as indicated by `paths_config` in the processed Run Config)
 
 ### The Steps
 1. Make and save the desired changes in the files above
@@ -45,6 +46,9 @@ File inputs are sourced from the **original** output subdirectories of prior ste
 ### Where to Find Outputs from Resume Runs
 Output subdirectories for resume runs can be found alongside the originals, and will have a timestamp appended to their name to differentiate them.
 
+### Auto-Documentation of Resume Runs
+A new processed Run Config will be saved in the Run Directory at the beginning of each resume run. It will be labelled with the same timestamp used in the resume run's other outputs to differentiate it. It includes the changes to your Paths File and Run Config. A copy of your Features Sheet is saved to the timestamped tiny-count output directory during `tiny recount` runs.
+
 ## Parallelization
 Most steps in the pipeline run in parallel to minimize runtimes. This is particularly advantageous for multiprocessor systems like server environments. However, parallelization isn't always beneficial. If your computer doesn't have enough free memory, or if you have a large sample file set and/or reference genome, parallel execution might push your machine to its limits. When this happens you might see memory errors or your computer may become unresponsive. In these cases it makes more sense to run resource intensive steps one at a time, in serial, rather than in parallel. To do so, set `run_parallel: false` in your Run Config. This will affect fastp, tiny-collapse, and bowtie since these steps typically handle the largest volumes of data.