MontgomeryLab · taimontgomery · Oct 27, 2022 · Oct 16, 2022 · Oct 16, 2022 · Oct 16, 2022
diff --git a/README.md b/README.md
@@ -90,11 +90,11 @@ tiny get-template
 
 ### Requirements for User-Provided Input Files
 
-| Input Type                                                                 | File Extension    | Requirements                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
-|----------------------------------------------------------------------------|-------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Reference annotations<br/>[(example)](START_HERE/reference_data/ram1.gff3) | GFF3 / GFF2 / GTF | Column 9 attributes (defined as "tag=value" or "tag "):<ul><li>Each feature must have an `ID` or `gene_id`  or `Parent` tag (referred to as `ID` henceforth).</li><li>Feature classes can be defined with the `Class` tag. If undefined, the default value \__UNKNOWN_\_ will be used.</li><li>Discontinuous features must be defined with the `Parent` tag whose value is the logical parent's `ID`, or by sharing the same `ID`.</li><li>Attribute values containing commas must represent lists.</li><li>`Parent` tags with multiple values are not yet supported.</li><li>See the example link (left) for col. 9 formatting.</li></ul> |
-| Sequencing data<br/>[(example)](START_HERE/fastq_files)                    | FASTQ(.gz)        | Files must be demultiplexed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
-| Reference genome<br/>[(example)](START_HERE/reference_data/ram1.fa)        | FASTA             | Chromosome identifiers (e.g. Chr1): <ul><li>Must match your reference annotation file chromosome identifiers</li><li>Are case sensitive</li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+| Input Type                                                                 | File Extension    | Requirements                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+|----------------------------------------------------------------------------|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Reference annotations<br/>[(example)](START_HERE/reference_data/ram1.gff3) | GFF3 / GFF2 / GTF | Column 9 attributes (defined as "tag=value" or "tag "):<ul><li>Each feature must have an `ID` or `gene_id`  or `Parent` tag (referred to as `ID` henceforth).</li><li>Discontinuous features must be defined with the `Parent` tag whose value is the logical parent's `ID`, or by sharing the same `ID`.</li><li>Attribute values containing commas must represent lists.</li><li>`Parent` tags with multiple values are not yet supported.</li><li>See the example link (left) for col. 9 formatting.</li></ul> |
+| Sequencing data<br/>[(example)](START_HERE/fastq_files)                    | FASTQ(.gz)        | Files must be demultiplexed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+| Reference genome<br/>[(example)](START_HERE/reference_data/ram1.fa)        | FASTA             | Chromosome identifiers (e.g. Chr1): <ul><li>Should match your reference annotation file chromosome identifiers</li><li>Are case sensitive</li></ul>                                                                                                                                                                                                                                                                                                                                                               |
 
 
 
@@ -174,17 +174,20 @@ A "collapsed" FASTA contains unique reads found in fastp's quality filtered FAST
 The tiny-count step produces a variety of outputs
 
 #### Feature Counts
-Custom Python scripts and HTSeq are used to generate a single table of feature counts that includes columns for each library analyzed. A feature's _Feature ID_ and _Feature Class_ are simply the values of its `ID` and `Class` attributes. Features lacking a Class attribute will be assigned class `_UNKNOWN_`. We have also included a _Feature Name_ column which displays aliases of your choice, as specified in the _Alias by..._ column of the Features Sheet. If _Alias by..._ is set to`ID`, the _Feature Name_ column is left empty.
+Custom Python scripts and HTSeq are used to generate a single table of feature counts with one column per counted library. Each matched feature is described by the following metadata columns:
+- **_Feature ID_** is determined, in order of preference, by one of the following GFF column 9 attributes: `ID`, `gene_id`, `Parent`. 
+- **_Classifier_** is determined by the rules in your Features Sheet. It is the _Classify as..._ value of each matching rule. Since multiple rules can match a feature, some Feature IDs will be listed multiple times with different classifiers.
+- **_Feature Name_** displays aliases of your choice, as specified in the _Alias by..._ column of the Features Sheet. If _Alias by..._ is set to `ID`, the _Feature Name_ column is left empty.
 
-For example, if your Features Sheet has a rule which specifies _Alias by..._ `sequence_name` and the GFF entry for this feature has the following attributes column:
+For example, if your Features Sheet has a rule which specifies _Alias by..._ `sequence_name`, _Classify as..._ `miRNA`, and the GFF entry for this feature has the following attributes column:
 ```
-... ID=406904;sequence_name=mir-1,hsa-miR-1;Class=miRNA; ...
+... ID=406904;sequence_name=mir-1,hsa-miR-1; ...
 ```
 The row for this feature in the feature counts table would read:
 
-| Feature ID | Feature Name     | Feature Class | Group1_rep_1 | Group1_rep_2 | ... |
-|------------|------------------|---------------|--------------|--------------|-----|
-| 406904     | mir-1, hsa-miR-1 | miRNA         | 1234         | 999          | ... |
+| Feature ID | Classifier | Feature Name     | Group1_rep_1 | Group1_rep_2 | ... |
+|------------|------------|------------------|--------------|--------------|-----|
+| 406904     | miRNA      | mir-1, hsa-miR-1 | 1234         | 999          | ... |
 
 #### Normalized Counts
 If your Samples Sheet has settings for Normalization, an additional copy of the Feature Counts table is produced with the specified per-library normalizations applied.

diff --git a/START_HERE/features.csv b/START_HERE/features.csv
@@ -1,4 +1,4 @@
-Select for...,with value...,Alias by...,Tag,Hierarchy,Strand,5' End Nucleotide,Length,Overlap,Feature Source
+Select for...,with value...,Alias by...,Classify as...,Hierarchy,Strand,5' End Nucleotide,Length,Overlap,Feature Source
 Class,mask,Alias,,1,both,all,all,Partial,./reference_data/ram1.gff3
 Class,miRNA,Alias,,2,sense,all,16-22,Full,./reference_data/ram1.gff3
 Class,piRNA,Alias,5pA,2,both,A,24-32,Full,./reference_data/ram1.gff3

diff --git a/START_HERE/run_config.yml b/START_HERE/run_config.yml
@@ -293,6 +293,12 @@ plot_vector_points: False
 plot_len_dist_min:
 plot_len_dist_max:
 
+##-- Use this label in class plots for counts assigned by rules lacking a classifier --##
+plot_unknown_class: "_UNKNOWN_"
+
+##-- Use this label in class plots for unassigned counts --##
+plot_unassigned_class: "_UNASSIGNED_"
+
 
 ######----------------------------- OUTPUT DIRECTORIES ------------------------------######
 #

diff --git a/doc/Configuration.md b/doc/Configuration.md
@@ -123,9 +123,9 @@ Supported values are:
 DESeq2 requires that your experiment design has at least one degree of freedom. If your experiment doesn't include at least one sample group with more than one replicate, tiny-deseq.r will be skipped and DGE related plots will not be produced.
 
 ## Features Sheet Details
-| _Column:_  | Select for... | with value... | Alias by... | Tag | Hierarchy | Strand | 5' End Nucleotide | Length | Overlap     | Feature Source |
-|------------|---------------|---------------|-------------|-----|-----------|--------|-------------------|--------|-------------|----------------|
-| _Example:_ | Class         | miRNA         | Name        |     | 1         | sense  | all               | all    | 5' anchored | ram1.gff3      |
+| _Column:_  | Select for... | with value... | Alias by... | Classify as... | Hierarchy | Strand | 5' End Nucleotide | Length | Overlap     | Feature Source |
+|------------|---------------|---------------|-------------|----------------|-----------|--------|-------------------|--------|-------------|----------------|
+| _Example:_ | Class         | miRNA         | Name        | miRNA          | 1         | sense  | all               | all    | 5' anchored | ram1.gff3      |
 
 The Features Sheet allows you to define selection rules that determine how features are chosen when multiple features are found overlap an alignment locus. Selected features are "assigned" a portion of the reads associated with the alignment.
 

diff --git a/doc/Parameters.md b/doc/Parameters.md
@@ -254,6 +254,14 @@ The scatter plots produced by tiny-plot have rasterized points by default. This
 
 The min and/or max bounds for plotted lengths can be set with this option. See [tiny-plot's documentation](tiny-plot.md#length-bounds) for more information about how these values are determined if they aren't set.
 
+### Labels for Class-related Plots
+| Run Config Key         | Commandline Argument |
+|------------------------|----------------------|
+| plot_unknown_class:    | `--unknown-class`    | 
+| plot_unassigned_class: | `--unassigned-class` |
+
+These options set the labels used for special groups in `class_charts` and `sample_avg_scatter_by_dge_class` plots. The "unknown" class group represents counts that were assigned by a Features Sheet rule lacking a "Classify as..." label. The "unassigned" class group represents counts that weren't assigned to any feature.
+
 ### Full tiny-plot Help String
 ```
 tiny-plot [-rc RAW_COUNTS] [-nc NORM_COUNTS] [-uc RULE_COUNTS]
@@ -318,4 +326,11 @@ Optional arguments:
                         len_dist plots will start at this value
   -lda VALUE, --len-dist-max VALUE
                         len_dist plots will end at this value
+  -una LABEL, --unassigned-class LABEL
+                        Use this label in class-related plots for unassigned
+                        counts
+  -unk LABEL, --unknown-class LABEL
+                        Use this label in class-related plots for counts which
+                        were assigned by rules lacking a "Classify as..."
+                        value
 ```
diff --git a/doc/tiny-count.md b/doc/tiny-count.md
@@ -26,11 +26,14 @@ Selection occurs in three stages, with the output of each stage as input to the
 3. Finally, features are selected for read assignment based on the small RNA attributes of the alignment locus. Once reads are assigned to a feature, they are excluded from matches with larger hierarchy values.
 
 ## Stage 1: Feature Attribute Parameters
-| _features.csv columns:_ | Select for... | with value... | Tag |
-|-------------------------|---------------|---------------|-----|
+| _features.csv columns:_ | Select for... | with value... | Classify as... |
+|-------------------------|---------------|---------------|----------------|
 
 Each feature's column 9 attributes are searched for the key-value combinations defined in the `Select for...` and `with value...` columns. Features, and the rules they matched, are retained for later evaluation at alignment loci in Stages 2 and 3.
 
+#### Feature Classification
+You can optionally specify a classifier for each rule. These classifiers are later used to group and label counts in class-related plots. Features that match rules with a classifier are counted separately; the classifier becomes part of the feature's ID to create a distinct "sub-feature", and these sub-features continue to be treated as distinct in downstream DGE analysis. Classified features receive counts exclusively from the rule(s) which hold the same `Classify as...` value. Counts from multiple rules can be pooled by using the same classifier. In the final counts table, this value is displayed in the Classifier column of features matching the rule, and each feature-classifier pair is shown on its own row.
+
 #### Value Lists
 Attribute keys are allowed to have multiple comma separated values, and these values are treated as a list; only one of the listed values needs to match the `with value...` to be considered a valid match to the rule. For example, if a rule contained `Class` and `WAGO` in these columns, then a feature with attributes `... ;Class=CSR,WAGO; ...` would be considered a match for the rule.
 
@@ -39,9 +42,6 @@ Attribute keys are allowed to have multiple comma separated values, and these va
 #### Wildcard Support
 Wildcard values (`all`, `*`, or an empty cell) can be used in the `Select for...` / `with value...` fields. With this functionality you can evaluate features for the presence of an attribute key without regarding its values, or you can check all attribute keys for the presence of a specific value, or you can skip Stage 1 selection altogether to permit the evaluation of the complete feature set in Stage 2. In the later case, feature-rule matching pairs still serve as the basis for selection; each rule still applies only to its matching subset from previous Stages.
 
-#### Tagged Counting (advanced)
-You can optionally specify a tag for each rule. Feature assignments resulting from tagged rules will have reads counted separately from those assigned by non-tagged rules. This essentially creates a new "sub-feature" for each feature that a tagged rule matches, and these "sub-features" are treated as distinct during downstream DGE analysis. Additionally, these counts subsets can be pooled across any number of rules by specifying the same tag. We recommend using tag names which _do not_ pertain to the `Select for...` / `with value...` in order to avoid potentially confusing results in class-related plots. 
-
 ## Stage 2: Overlap and Hierarchy Parameters
 | _features.csv columns:_ | Hierarchy | Overlap |
 |-------------------------|-----------|---------|

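Note on the Feature Classification paragraph added to doc/tiny-count.md: the pooling and row-splitting behavior it describes can be pictured with a small sketch. This is not tiny-count's implementation; `tally` and the `assignments` tuples are hypothetical names, and the only assumption taken from the docs is that counts are keyed by feature-classifier pair.

```python
from collections import defaultdict


def tally(assignments):
    """Sum read counts per (feature ID, classifier) pair.

    `assignments` is a hypothetical iterable of
    (feature_id, classifier, reads) tuples, one per rule-based assignment.
    """
    counts = defaultdict(int)
    for feature_id, classifier, reads in assignments:
        # Rules sharing a classifier pool into the same row;
        # a different classifier yields a separate "sub-feature" row.
        counts[(feature_id, classifier)] += reads
    return dict(counts)


# Illustrative data only: one feature matched by four rules.
assignments = [
    ("406904", "miRNA", 10),
    ("406904", "miRNA", 5),  # second rule, same classifier: pooled
    ("406904", "5pA", 3),    # different classifier: its own row
    ("406904", "", 2),       # rule without a "Classify as..." value
]

table = tally(assignments)
for (feature_id, classifier), reads in sorted(table.items()):
    print(feature_id, repr(classifier), reads)
```

In this sketch the single feature `406904` ends up on three rows, which mirrors the statement that each feature-classifier pair is shown on its own row of the counts table.
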
diff --git a/doc/tiny-plot.md b/doc/tiny-plot.md
@@ -64,19 +64,21 @@ Percentage label darkness and bar colors reflect the magnitude of the rule's con
 
 
 ## class_charts
-Features can have multiple classes associated with them, so it is useful to see the proportions of counts by class. The class_charts plot type shows the percentage of _mapped_ reads that were assigned to features by class. Each feature's associated classes are determined by the `Class=` attribute in your GFF files.
+Features can have multiple classes associated with them, so it is useful to see the proportions of counts by class. The class_charts plot type shows the percentage of _mapped_ reads that were assigned to features by class. Each feature's associated classes are determined by the rules that it matched during Stage 1 selection, and are therefore determined by its GFF annotations.
 
 <p float="left" align="center">
     <img src="../images/plots/class_chart.jpg" width="80%" alt="class_chart with 8 classes"/>
 </p>
 
 #### Class \_UNASSIGNED_
-This category represents the percentage of mapped reads that were unassigned. Sources of unassigned reads include:
+This category represents the percentage of mapped reads that weren't assigned to any features. Sources of unassigned reads include:
 - A lack of features passing selection at alignment loci
 - Alignments which do not overlap with any features
 
-#### Count Normalization
-A feature with multiple associated classes will have its counts split evenly across these classes before being grouped and summed.
+You can customize this label using the [unassigned class parameter](Parameters.md#labels-for-class-related-plots).
+
+#### Class \_UNKNOWN_
+This category represents the percentage of mapped reads that matched rules which did not have a specified `Classify as...` value. You can customize this label using the [unknown class parameter](Parameters.md#labels-for-class-related-plots).
 
 #### Class Chart Styles
 Proportions in rule_charts and class_charts are plotted using the same function. Styles are the same between the two. See [rule chart styles](#rule-chart-styles) for more info.

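Note on the `_UNKNOWN_` and `_UNASSIGNED_` groups described in doc/tiny-plot.md: a minimal sketch of how the two labels could relate to class proportions, assuming percentages are taken over mapped reads as the class_charts text states. The function `class_percentages` and its arguments are hypothetical and are not part of tiny-plot; only the two default label strings come from run_config.yml.

```python
def class_percentages(class_counts, mapped_reads,
                      unknown_label="_UNKNOWN_",
                      unassigned_label="_UNASSIGNED_"):
    """Return the percentage of mapped reads per class label.

    `class_counts` is a hypothetical list of (classifier, reads) pairs.
    An empty classifier stands for a rule without a "Classify as..." value.
    """
    totals = {}
    for classifier, reads in class_counts:
        # Counts from rules lacking a classifier go to the unknown group.
        label = classifier or unknown_label
        totals[label] = totals.get(label, 0) + reads

    # Mapped reads that were not assigned to any feature.
    totals[unassigned_label] = mapped_reads - sum(totals.values())

    return {label: 100 * reads / mapped_reads
            for label, reads in totals.items()}


# Illustrative numbers only.
percentages = class_percentages(
    [("miRNA", 60), ("", 15), ("piRNA", 5)], mapped_reads=100)
print(percentages)
```

Passing different `unknown_label` or `unassigned_label` strings plays the role that `plot_unknown_class` and `plot_unassigned_class` play in the Run Config.
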
diff --git a/tests/unit_test_helpers.py b/tests/unit_test_helpers.py
@@ -19,7 +19,7 @@
 rules_template = [{'Identity': ("Name", "N/A"),
                    'Strand': "both",
                    'Hierarchy': 0,
-                   'Tag': '',
+                   'Class': '',
                    'nt5end': "all",
                    'Length': "all",   # A string is expected by FeatureSelector due to support for lists and ranges
                    'Overlap': "partial"}]

diff --git a/tests/unit_tests_counter.py b/tests/unit_tests_counter.py
@@ -33,7 +33,7 @@ def setUpClass(self):
             'Key':       "Class",
             'Value':     "CSR",
             'Name':      "Alias",
-            'Tag':       "",
+            'Class':     "",
             'Hierarchy': "1",
             'Strand':    "antisense",
             "nt5end":    '"C,G,U"',  # Needs to be double-quoted due to commas
@@ -47,7 +47,7 @@ def setUpClass(self):
         _row = self.csv_feat_row_dict
         self.parsed_feat_rule = [{
             'Identity':  (_row['Key'], _row['Value']),
-            'Tag':       _row['Tag'],
+            'Class':     _row['Class'],
             'Hierarchy': int(_row['Hierarchy']),
             'Strand':    _row['Strand'],
             'nt5end':    _row["nt5end"].upper().translate({ord('U'): 'T'}),