MontgomeryLab · taimontgomery · Apr 17, 2023 · Apr 6, 2023
diff --git a/README.md b/README.md
@@ -196,7 +196,7 @@ And this feature matched a rule in your Features Sheet defining _Classify as..._
 | 406904     | miRNA      | mir-1, hsa-miR-1 | 1234         | 999          | ... |
 
 #### Normalized Counts
-If your Samples Sheet has settings for Normalization, an additional copy of the Feature Counts table is produced with the specified per-library normalizations applied.
+If your Samples Sheet has settings for Normalization, an additional copy of the Feature Counts table is produced with the specified per-library normalizations applied. Note that these normalizations are [unrelated to normalization by genomic/feature hits](doc/Configuration.md#applying-custom-normalization).
 
 #### Counts by Rule
 This table shows the counts assigned by each rule on a per-library basis. It is indexed by the rule's corresponding row number in the Features Sheet, where the first non-header row is considered row 0. For convenience a Rule String column is added which contains a human friendly concatenation of each rule. Finally, a Mapped Reads row is added which represents each library's total read counts which were available for assignment prior to counting/selection.
@@ -229,6 +229,13 @@ A single table of summary statistics includes columns for each library and the f
 | Mapped Reads                  | Total genome-mapping reads passing quality filtering prior to counting/selection                                                           |                                                          |
 | Assigned Reads                | Total genome-mapping reads passing quality filtering that were assigned to at least one feature due to a rule match in your Features Sheet |
 
+When normalization by feature and/or genomic hits is disabled, the following stats are reported instead of `Mapped Reads`:
+
+| Stat                        | Description                                                                      |
+|-----------------------------|----------------------------------------------------------------------------------|
+| Normalized Mapped Reads     | The true mapped read count                                                       |
+| Non-normalized Mapped Reads | The sum of assigned and unassigned reads according to the normalization settings |
+
 #### 5'nt vs. Length Matrix
 
 During counting, size and 5' nt distribution tables are created for each library. The distribution of lengths and 5' nt can be used to assess the overall quality of your libraries. This can also be used for analyzing small RNA distributions in non-model organisms without annotations.

diff --git a/START_HERE/run_config.yml b/START_HERE/run_config.yml
@@ -225,9 +225,11 @@ shared_memory: False
 ##-- If True: show all parsed features in the counts csv, regardless of count/identity --##
 counter_all_features: False
 
-##-- If True: counts will be normalized by genomic hits AND selected feature count --##
-##-- If False: counts will only be normalized by genomic hits --##
-counter_normalize_by_hits: True
+##-- If True: counts are normalized by genomic hits (number of multi-alignments) --##
+counter_normalize_by_genomic_hits: True
+
+##-- If True: counts are normalized by feature hits (selected feature count per-locus) --##
+counter_normalize_by_feature_hits: True
 
 ##-- If True: a decollapsed copy of each SAM file will be produced (useful for IGV) --##
 counter_decollapse: False
@@ -326,7 +328,7 @@ dir_name_logs: logs
 #
 ###########################################################################################
 
-version: 1.3.0
+version: 1.4.0
 
 ######--------------------------- DERIVED FROM PATHS FILE ---------------------------######
 #

diff --git a/doc/Configuration.md b/doc/Configuration.md
@@ -136,7 +136,7 @@ Supported values are:
 - **Any number**: the corresponding library's counts are divided by this number (useful for spike-in normalization)
 - **RPM or rpm**: the corresponding library's counts are divided by (its mapped read count / 1,000,000)
 
->**NOTE**: These normalizations operate independently of tiny-count's --normalize-by-hits commandline option. The former is concerned with per-library normalization, whereas the latter is concerned with normalization by selected feature count at each locus ([more info](tiny-count.md#count-normalization)). The commandline option does not enable or disable the normalizations detailed above.
+>**NOTE**: These normalizations operate independently of tiny-count's `--normalize-by-genomic-hits` and `--normalize-by-feature-hits` commandline options. The former are concerned with per-library normalization, whereas the latter are concerned with normalization by each sequence's alignment count and by the number of selected features at each locus ([more info](tiny-count.md#count-normalization)). The commandline options do not enable or disable the normalizations detailed above.
 
 ### Low DF Experiments
 DESeq2 requires that your experiment design has at least one degree of freedom. If your experiment doesn't include at least one sample group with more than one replicate, tiny-deseq.r will be skipped and DGE related plots will not be produced.
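
Review note on the per-library normalization values in the `doc/Configuration.md` hunk above: the two settings visible there (a number, or `RPM`/`rpm`) both reduce to dividing a library's counts by a divisor. The sketch below is only an illustration of that arithmetic. The function name `normalize_counts` and its parameters are hypothetical and are not taken from tinyRNA's code.

```python
def normalize_counts(counts, setting, mapped_reads):
    """Divide one library's feature counts by a per-library divisor.

    counts:       {feature_id: count} for a single library (hypothetical shape)
    setting:      a number, or the string "RPM"/"rpm"
    mapped_reads: the library's mapped read count (used only for RPM)
    """
    if isinstance(setting, str) and setting.strip().lower() == "rpm":
        # RPM: divide by (mapped read count / 1,000,000)
        divisor = mapped_reads / 1_000_000
    else:
        # Any number: divide by that number (e.g. a spike-in factor)
        divisor = float(setting)
    return {feature: count / divisor for feature, count in counts.items()}
```

For example, a count of 50 in a library with 2,000,000 mapped reads becomes 25.0 under `RPM`, and 5.0 under a numeric setting of 10. Neither result depends on the genomic-hit or feature-hit options discussed in the note.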

diff --git a/doc/Parameters.md b/doc/Parameters.md
@@ -70,12 +70,19 @@ Optional arguments:
 
 Copies the template configuration files required by tiny-count into the current directory. This argument can't be combined with `--paths-file`. All other arguments are ignored when provided, and once the templates have been copied tiny-count exits.
 
-### Normalize by Hits
-| Run Config Key             | Commandline Argument      |
-|----------------------------|---------------------------|
-| counter-normalize-by-hits: | `--normalize-by-hits T/F` |
+### Normalize by Genomic Hits
+| Run Config Key                     | Commandline Argument              |
+|------------------------------------|-----------------------------------|
+| counter_normalize_by_genomic_hits: | `--normalize-by-genomic-hits T/F` |
 
-By default, tiny-count will divide the number of counts associated with each sequence, twice, before they are assigned to a feature. Each unique sequence's count is determined by tiny-collapse (or a compatible collapsing utility) and is preserved through the alignment process. The original count is divided first by the number of loci that the sequence aligns to, and second by the number of features passing selection at each locus. Switching this option "off" disables the latter normalization step.
+By default, tiny-count will increment feature counts by a normalized amount to avoid overcounting. Each unique sequence's read count is determined by tiny-collapse (or a compatible collapsing utility) and is preserved through the alignment process. For sequences with multiple alignments, a portion of the sequence's original count is allocated to each of its alignments to be assigned to features that pass selection at the locus. This portion is the original count divided by the number of alignments, or _genomic hits_. By disabling this normalization step, each of the sequence's alignments will be allocated the full original read count rather than the normalized portion.
+
+### Normalize by Feature Hits
+| Run Config Key                     | Commandline Argument              |
+|------------------------------------|-----------------------------------|
+| counter_normalize_by_feature_hits: | `--normalize-by-feature-hits T/F` |
+
+By default, tiny-count will increment feature counts by a normalized amount to avoid overcounting. Each sequence alignment locus is allocated a portion of the sequence's original read count (depending on `counter_normalize_by_genomic_hits`), and once selection is complete the allocated count is divided by the number of selected features, or _feature hits_, at the alignment. The resulting value is added to the totals for each matching feature. By disabling this normalization step, each selected feature will receive the full amount allocated to the locus rather than the normalized portion.
 
 ### Decollapse
 | Run Config Key      | Commandline Argument   |
@@ -107,8 +114,8 @@ Diagnostic information will include intermediate alignment files for each librar
 
 ### Full tiny-count Help String
 ```
-tiny-count (-pf FILE | --get-templates) [-o PREFIX] [-nh T/F] [-dc]
-           [-sv {Cython,HTSeq}] [-p] [-d]
+tiny-count (-pf FILE | --get-templates) [-o PREFIX] [-ng T/F] [-nf T/F]
+                  [-vs T/F] [-dc] [-sv {Cython,HTSeq}] [-p] [-d]
 
 tiny-count is a precision counting tool for hierarchical classification and
 quantification of small RNA-seq reads
@@ -132,9 +139,13 @@ Optional arguments:
                         occurrences of the substring {timestamp} will be
                         replaced with the current date and time. (default:
                         tiny-count_{timestamp})
-  -nh T/F, --normalize-by-hits T/F
-                        If T/true, normalize counts by (selected) overlapping
-                        feature counts. (default: T)
+  -ng T/F, --normalize-by-genomic-hits T/F
+                        Normalize counts by genomic hits. (default: T)
+  -nf T/F, --normalize-by-feature-hits T/F
+                        Normalize counts by feature hits. (default: T)
+  -vs T/F, --verify-stats T/F
+                        Verify that all reported stats are internally
+                        consistent. (default: T)
   -dc, --decollapse     Create a decollapsed copy of all SAM files listed in
                         your Samples Sheet. This option is ignored for non-
                         collapsed inputs. (default: False)
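
Review note on the `doc/Parameters.md` changes above: the two normalization steps can be sketched in a few lines of Python. This is not tiny-count's implementation. The function name `count_increments`, its parameters, and the use of `Fraction` for exact arithmetic are assumptions made for illustration. The input is one sequence's read count plus, for each of its alignments, the list of features that passed selection at that locus.

```python
from fractions import Fraction


def count_increments(read_count, selected_per_alignment,
                     by_genomic_hits=True, by_feature_hits=True):
    """Return {feature: total increment} for a single sequence.

    selected_per_alignment has one entry per alignment (genomic hit);
    each entry lists the features selected at that alignment.
    """
    genomic_hits = len(selected_per_alignment)
    totals = {}
    for selected in selected_per_alignment:
        if not selected:
            continue  # nothing selected here; this portion stays unassigned
        # Step 1: portion of the original count allocated to this alignment
        allocated = Fraction(read_count)
        if by_genomic_hits:
            allocated /= genomic_hits
        # Step 2: split the allocation among the selected features
        increment = allocated / len(selected) if by_feature_hits else allocated
        for feature in selected:
            totals[feature] = totals.get(feature, 0) + increment
    return totals
```

With a read count of 12 and three alignments selecting `["A", "B"]`, `["A"]`, and nothing, the defaults give A=6 and B=2. Disabling only feature hits gives A=8 and B=4, disabling only genomic hits gives A=18 and B=6, and disabling both gives A=24 and B=12, which shows the overcounting that the defaults are meant to avoid.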

diff --git a/doc/tiny-count.md b/doc/tiny-count.md
@@ -138,7 +138,7 @@ Examples:
 >**Tip:** you may specify U and T bases in your rules. Uracil bases will be converted to thymine when your Features Sheet is loaded. N bases are also allowed.
 
 ## Count Normalization
-Small RNA reads passing selection will receive a normalized count increment. By default, read counts are normalized twice before being assigned to a feature. The second normalization step can be disabled in `run_config.yml` if desired. Counts for each small RNA sequence are divided: 
+Small RNA reads passing selection will receive a normalized count increment. By default, read counts are normalized twice before being assigned to a feature. Both normalization steps can be disabled in `run_config.yml` if desired. Counts for each small RNA sequence are divided: 
 1. By the number of loci it aligns to in the genome.
 2. By the number of _selected_ features for each of its alignments.
 

diff --git a/setup.py b/setup.py
@@ -15,7 +15,7 @@
 AUTHOR = 'Kristen Brown, Alex Tate'
 PLATFORM = 'Unix'
 REQUIRES_PYTHON = '>=3.9.0'
-VERSION = '1.3.0'
+VERSION = '1.4.0'
 REQUIRED = []  # Required packages are installed via Conda's environment.yml
 
 

diff --git a/tests/testdata/config_files/run_config_template.yml b/tests/testdata/config_files/run_config_template.yml
@@ -225,9 +225,11 @@ shared_memory: False
 ##-- If True: show all parsed features in the counts csv, regardless of count/identity --##
 counter_all_features: False
 
-##-- If True: counts will be normalized by genomic hits AND selected feature count --##
-##-- If False: counts will only be normalized by genomic hits --##
-counter_normalize_by_hits: True
+##-- If True: counts are normalized by genomic hits (number of multi-alignments) --##
+counter_normalize_by_genomic_hits: True
+
+##-- If True: counts are normalized by feature hits (selected feature count per-locus) --##
+counter_normalize_by_feature_hits: True
 
 ##-- If True: a decollapsed copy of each SAM file will be produced (useful for IGV) --##
 counter_decollapse: False
@@ -326,7 +328,7 @@ dir_name_logs: logs
 #
 ###########################################################################################
 
-version: 1.3.0
+version: 1.4.0
 
 ######--------------------------- DERIVED FROM PATHS FILE ---------------------------######
 #

diff --git a/tiny/cwl/tools/tiny-count.cwl b/tiny/cwl/tools/tiny-count.cwl
@@ -30,10 +30,15 @@ inputs:
 
   # Optional inputs
 
-  normalize_by_hits:
+  normalize_by_feature_hits:
     type: string?
     inputBinding:
-      prefix: --normalize-by-hits
+      prefix: --normalize-by-feature-hits
+
+  normalize_by_genomic_hits:
+    type: string?
+    inputBinding:
+      prefix: --normalize-by-genomic-hits
 
   decollapse:
     type: boolean?
@@ -140,5 +145,10 @@ outputs:
     outputBinding:
       glob: $(inputs.out_prefix)_selection_diags.txt
 
+  stats_check:
+    type: File?
+    outputBinding:
+      glob: "*_stats_check.csv"
+
   console_output:
     type: stdout
diff --git a/tiny/cwl/workflows/tinyrna_wf.cwl b/tiny/cwl/workflows/tinyrna_wf.cwl
@@ -87,7 +87,8 @@ inputs:
   counter_decollapse: boolean?
   counter_stepvector: string?
   counter_all_features: boolean?
-  counter_normalize_by_hits: boolean?
+  counter_normalize_by_feature_hits: boolean?
+  counter_normalize_by_genomic_hits: boolean?
 
   # tiny-deseq inputs
   run_deseq: boolean
@@ -214,8 +215,11 @@ steps:
       gff_files: gff_files
       out_prefix: run_name
       all_features: counter_all_features
-      normalize_by_hits:
-        source: counter_normalize_by_hits
+      normalize_by_feature_hits:
+        source: counter_normalize_by_feature_hits
+        valueFrom: $(String(self))  # convert boolean -> string
+      normalize_by_genomic_hits:
+        source: counter_normalize_by_genomic_hits
         valueFrom: $(String(self))  # convert boolean -> string
       decollapse: counter_decollapse
       stepvector: counter_stepvector
@@ -225,7 +229,7 @@ steps:
       collapsed_fa: preprocessing/uniq_seqs
     out: [ feature_counts, rule_counts, norm_counts, mapped_nt_len_dist, assigned_nt_len_dist,
            alignment_stats, summary_stats, decollapsed_sams, alignment_tables,
-           assignment_diags, selection_diags ]
+           assignment_diags, selection_diags, stats_check ]
 
   tiny-deseq:
     run: ../tools/tiny-deseq.cwl
@@ -322,7 +326,7 @@ steps:
                   tiny-count/mapped_nt_len_dist, tiny-count/assigned_nt_len_dist,
                   tiny-count/alignment_stats, tiny-count/summary_stats, tiny-count/decollapsed_sams,
                   tiny-count/assignment_diags, tiny-count/selection_diags, tiny-count/alignment_tables,
-                  features_csv ]
+                  tiny-count/stats_check, features_csv ]
       dir_name: dir_name_tiny-count
     out: [ subdir ]
 

diff --git a/tiny/rna/configuration.py b/tiny/rna/configuration.py
@@ -896,8 +896,8 @@ def check_backward_compatibility(self, header_vals):
             if 'mismatches' not in header_vals_lc:
                 compat_errors.append('\n'.join([
                     "It looks like you're using a Features Sheet from an earlier version of",
-                    'tinyRNA. An additional column, "Mismatches", is now expected. Please review'
-                    "the Stage 2 section in tiny-count's documentation for more info, then add"
+                    'tinyRNA. An additional column, "Mismatches", is now expected. Please review',
+                    "the Stage 2 section in tiny-count's documentation for more info, then add",
                     "the new column to your Features Sheet to avoid this error."
                 ]))
 
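Review note on the comma fix in `check_backward_compatibility` above: Python concatenates adjacent string literals, so a list entry without a trailing comma silently merges with the next one. The sketch below uses shortened placeholder strings (not the real error message) to show the effect on a `'\n'.join(...)` call like the one in this hunk.

```python
# Placeholder strings standing in for the real message lines.
before = [
    "Line one of the message.",
    'An additional column is now expected. Please review'   # no comma
    "the documentation for more info, then add"             # no comma
    "the new column to avoid this error."
]

after = [
    "Line one of the message.",
    'An additional column is now expected. Please review',
    "the documentation for more info, then add",
    "the new column to avoid this error."
]

# Without commas the last three literals collapse into one list element,
# so the joined text loses both the newline and the space between words.
joined_before = '\n'.join(before)
joined_after = '\n'.join(after)
```

Here `before` has two elements and its joined text contains `reviewthe`, while `after` has four elements and keeps one message line per entry.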

diff --git a/tiny/rna/counter/counter.py b/tiny/rna/counter/counter.py
@@ -53,8 +53,12 @@ def get_args():
     optional_args.add_argument('-o', '--out-prefix', metavar='PREFIX', default='tiny-count_{timestamp}',
                                help='The output prefix to use for file names. All occurrences of the '
                                     'substring {timestamp} will be replaced with the current date and time.')
-    optional_args.add_argument('-nh', '--normalize-by-hits', metavar='T/F', default='T',
-                               help='If T/true, normalize counts by (selected) overlapping feature counts.')
+    optional_args.add_argument('-ng', '--normalize-by-genomic-hits', metavar='T/F', default='T',
+                               help='Normalize counts by genomic hits.')
+    optional_args.add_argument('-nf', '--normalize-by-feature-hits', metavar='T/F', default='T',
+                               help='Normalize counts by feature hits.')
+    optional_args.add_argument('-vs', '--verify-stats', metavar='T/F', default='T',
+                               help='Verify that all reported stats are internally consistent.')
     optional_args.add_argument('-dc', '--decollapse', action='store_true',
                                help='Create a decollapsed copy of all SAM files listed in your Samples Sheet. '
                                     'This option is ignored for non-collapsed inputs.')
@@ -80,7 +84,8 @@ def get_args():
     else:
         args_dict = vars(args)
         args_dict['out_prefix'] = args.out_prefix.replace('{timestamp}', get_timestamp())
-        args_dict['normalize_by_hits'] = args.normalize_by_hits.lower() in ['t', 'true']
+        for tf in ('normalize_by_feature_hits', 'normalize_by_genomic_hits', 'verify_stats'):
+            args_dict[tf] = args_dict[tf].lower() in ['t', 'true']
         return ReadOnlyDict(args_dict)
 
 
@@ -221,11 +226,11 @@ def map_and_reduce(libraries, paths, prefs):
         with mp.Pool(len(libraries)) as pool:
             async_results = pool.imap_unordered(counter.count_reads, libraries)
 
-            for stats_result in async_results:
-                summary.add_library(stats_result)
+            for result in async_results:
+                summary.add_library_stats(result)
     else:
         # Only one library, multiprocessing not beneficial for task
-        summary.add_library(counter.count_reads(libraries[0]))
+        summary.add_library_stats(counter.count_reads(libraries[0]))
 
     return summary
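
Review note on the `get_args()` change above: the three `T/F` options are parsed as strings and converted to booleans in a single loop. The standalone sketch below reproduces that pattern with `argparse` so the conversion can be tried in isolation. The function name `parse_tf_args` and the `prog` value are hypothetical, and the parser is reduced to only these three options.

```python
import argparse


def parse_tf_args(argv):
    """Parse the three T/F options and convert each to a bool."""
    parser = argparse.ArgumentParser(prog="tiny-count-sketch")
    parser.add_argument('-ng', '--normalize-by-genomic-hits', metavar='T/F', default='T')
    parser.add_argument('-nf', '--normalize-by-feature-hits', metavar='T/F', default='T')
    parser.add_argument('-vs', '--verify-stats', metavar='T/F', default='T')

    args_dict = vars(parser.parse_args(argv))
    # Same conversion as in the diff: only 't' or 'true' (any case) is True
    for tf in ('normalize_by_feature_hits', 'normalize_by_genomic_hits', 'verify_stats'):
        args_dict[tf] = args_dict[tf].lower() in ['t', 'true']
    return args_dict
```

One consequence of this conversion is that any value other than `t` or `true` (case-insensitive) is treated as False, so `-nf yes` disables feature-hit normalization rather than raising an error.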