Merged
35 commits
41b819f
Preliminary refactor and addition of normalization options to accommo…
AlexTate Apr 6, 2023
ba2fb5e
Adding command line option to tiny-count, CWL, and the Run Config for…
AlexTate Apr 6, 2023
2c9c17a
Updating the MergedStat abstract base class to include a finalize() a…
AlexTate Apr 6, 2023
37574d5
Refactoring FeatureCounts to use the new ABC methods
AlexTate Apr 6, 2023
9f33018
Refactoring RuleCounts to use the new ABC methods. Also cleaning up t…
AlexTate Apr 6, 2023
fc967d2
Refactoring NtLenMatrices to use the new ABC methods.
AlexTate Apr 6, 2023
8d095cb
Refactoring AlignmentStats to use the new ABC methods.
AlexTate Apr 6, 2023
38db0d2
Refactoring SummaryStats to use the new ABC methods. Also changing me…
AlexTate Apr 6, 2023
e776d16
Making corrections to the NT/length stats to make them compatible wit…
AlexTate Apr 6, 2023
fe61785
Introducing the StatisticsValidator class. It performs a thorough eva…
AlexTate Apr 8, 2023
b7195e2
Detail and readability improvements
AlexTate Apr 8, 2023
be25ac7
Merge branch 'issue-299' into issue-295
AlexTate Apr 8, 2023
dae09ca
Removing unused parameter from the Diagnostics class
AlexTate Apr 8, 2023
a82c57c
Version bump to 1.4
AlexTate Apr 8, 2023
fe18584
Adding the optional stats_check.csv file to the CWL. This file is pro…
AlexTate Apr 8, 2023
51ffb71
Raising the floating point error tolerance in StatisticsValidator.app…
AlexTate Apr 8, 2023
f4432ff
Correcting the calculation of total unassigned reads and mapped nt5/l…
AlexTate Apr 8, 2023
5279a61
SummaryStats now reports Mapped Reads as Normalized Mapped Reads and …
AlexTate Apr 8, 2023
aeda3eb
tiny-plot has been updated to use Non-normalized Mapped Reads instead…
AlexTate Apr 8, 2023
9d98294
Minor bugfix unrelated to issue-295
AlexTate Apr 8, 2023
23e1079
Documentation updates for the new normalization options
AlexTate Apr 8, 2023
5739a94
Cleaning up the tiny-count helpstring for normalization and stats ver…
AlexTate Apr 8, 2023
7743fa2
Increasing rounding precision. 2 decimal places was far too strict fo…
AlexTate Apr 9, 2023
5f489ff
Removing rounding steps for floating point error mitigation. After fu…
AlexTate Apr 16, 2023
5ee804f
Adding clarified docstrings to the finalize() and write_output_logfil…
AlexTate Apr 16, 2023
3bde3d3
Updating df_to_csv() so that FeatureCounts can use it with its multii…
AlexTate Apr 16, 2023
f45ab41
Updating FeatureCounts to label its multiindex during construction, t…
AlexTate Apr 16, 2023
63ed696
Updating RuleCounts to populate and label its index at construction a…
AlexTate Apr 16, 2023
fcda06d
Updating NtLenMatrices to not round its values until write_output_log…
AlexTate Apr 16, 2023
7f671ee
Updating AlignmentStats to not round its values until write_output_lo…
AlexTate Apr 16, 2023
e78431c
Updating SummaryStats to not round its values until write_output_logf…
AlexTate Apr 16, 2023
1107180
Consistency updates for MergedDiags
AlexTate Apr 16, 2023
203b4f9
Minor consistency updates in StatisticsValidator. Also updating hand…
AlexTate Apr 16, 2023
dae930d
Removing the sort_cols_and_round() helper function because it is no l…
AlexTate Apr 16, 2023
0672c3c
Small correction for mapped len dist tables. Rounding should be perfo…
AlexTate Apr 17, 2023
9 changes: 8 additions & 1 deletion README.md
@@ -196,7 +196,7 @@ And this feature matched a rule in your Features Sheet defining _Classify as..._
| 406904 | miRNA | mir-1, hsa-miR-1 | 1234 | 999 | ... |

#### Normalized Counts
If your Samples Sheet has settings for Normalization, an additional copy of the Feature Counts table is produced with the specified per-library normalizations applied.
If your Samples Sheet has settings for Normalization, an additional copy of the Feature Counts table is produced with the specified per-library normalizations applied. Note that these normalizations are [unrelated to normalization by genomic/feature hits](doc/Configuration.md#applying-custom-normalization).

#### Counts by Rule
This table shows the counts assigned by each rule on a per-library basis. It is indexed by the rule's corresponding row number in the Features Sheet, where the first non-header row is considered row 0. For convenience a Rule String column is added which contains a human friendly concatenation of each rule. Finally, a Mapped Reads row is added which represents each library's total read counts which were available for assignment prior to counting/selection.
@@ -229,6 +229,13 @@ A single table of summary statistics includes columns for each library and the f
| Mapped Reads | Total genome-mapping reads passing quality filtering prior to counting/selection | |
| Assigned Reads | Total genome-mapping reads passing quality filtering that were assigned to at least one feature due to a rule match in your Features Sheet |

When normalization by feature and/or genomic hits is disabled, the following stats are reported instead of `Mapped Reads`:

| Stat | Description |
|-----------------------------|----------------------------------------------------------------------------------|
| Normalized Mapped Reads | The true mapped read count |
| Non-normalized Mapped Reads | The sum of assigned and unassigned reads according to the normalization settings |

#### 5'nt vs. Length Matrix

During counting, size and 5' nt distribution tables are created for each library. The distribution of lengths and 5' nt can be used to assess the overall quality of your libraries. This can also be used for analyzing small RNA distributions in non-model organisms without annotations.
10 changes: 6 additions & 4 deletions START_HERE/run_config.yml
@@ -225,9 +225,11 @@ shared_memory: False
##-- If True: show all parsed features in the counts csv, regardless of count/identity --##
counter_all_features: False

##-- If True: counts will be normalized by genomic hits AND selected feature count --##
##-- If False: counts will only be normalized by genomic hits --##
counter_normalize_by_hits: True
##-- If True: counts are normalized by genomic hits (number of multi-alignments) --##
counter_normalize_by_genomic_hits: True

##-- If True: counts are normalized by feature hits (selected feature count per-locus) --##
counter_normalize_by_feature_hits: True

##-- If True: a decollapsed copy of each SAM file will be produced (useful for IGV) --##
counter_decollapse: False
@@ -326,7 +328,7 @@ dir_name_logs: logs
#
###########################################################################################

version: 1.3.0
version: 1.4.0

######--------------------------- DERIVED FROM PATHS FILE ---------------------------######
#
2 changes: 1 addition & 1 deletion doc/Configuration.md
@@ -136,7 +136,7 @@ Supported values are:
- **Any number**: the corresponding library's counts are divided by this number (useful for spike-in normalization)
- **RPM or rpm**: the corresponding library's counts are divided by (its mapped read count / 1,000,000)
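
The two supported values above can be sketched as follows. This is a minimal illustration only: the function name `normalize_library` and the sample numbers are hypothetical and are not taken from tinyRNA's code.

```python
def normalize_library(counts, setting, mapped_reads):
    """Apply a per-library normalization to a dict of feature counts.

    `setting` is either a number (counts are divided by it) or the
    string "RPM"/"rpm" (counts are divided by mapped_reads / 1,000,000).
    """
    if isinstance(setting, str) and setting.lower() == "rpm":
        divisor = mapped_reads / 1_000_000
    else:
        divisor = float(setting)
    return {feature: count / divisor for feature, count in counts.items()}


# Hypothetical library with 2,000,000 mapped reads
counts = {"mir-1": 1234.0, "mir-2": 500.0}
rpm = normalize_library(counts, "rpm", mapped_reads=2_000_000)
spike_in = normalize_library(counts, 2, mapped_reads=2_000_000)
```

In this example both settings happen to produce the same divisor (2), so the two results are identical; with real data the spike-in factor and the RPM divisor would generally differ.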

>**NOTE**: These normalizations operate independently of tiny-count's --normalize-by-hits commandline option. The former is concerned with per-library normalization, whereas the latter is concerned with normalization by selected feature count at each locus ([more info](tiny-count.md#count-normalization)). The commandline option does not enable or disable the normalizations detailed above.
>**NOTE**: These normalizations operate independently of tiny-count's --normalize-by-genomic/feature-hits commandline options. The former is concerned with per-library normalization, whereas the latter are concerned with normalization by each sequence's alignment count and the number of selected features at each locus ([more info](tiny-count.md#count-normalization)). The commandline options do not enable or disable the normalizations detailed above.

### Low DF Experiments
DESeq2 requires that your experiment design has at least one degree of freedom. If your experiment doesn't include at least one sample group with more than one replicate, tiny-deseq.r will be skipped and DGE related plots will not be produced.
31 changes: 21 additions & 10 deletions doc/Parameters.md
@@ -70,12 +70,19 @@ Optional arguments:

Copies the template configuration files required by tiny-count into the current directory. This argument can't be combined with `--paths-file`. All other arguments are ignored when provided, and once the templates have been copied tiny-count exits.

### Normalize by Hits
| Run Config Key | Commandline Argument |
|----------------------------|---------------------------|
| counter-normalize-by-hits: | `--normalize-by-hits T/F` |
### Normalize by Genomic Hits
| Run Config Key | Commandline Argument |
|------------------------------------|-----------------------------------|
| counter_normalize_by_genomic_hits: | `--normalize-by-genomic-hits T/F` |

By default, tiny-count will divide the number of counts associated with each sequence, twice, before they are assigned to a feature. Each unique sequence's count is determined by tiny-collapse (or a compatible collapsing utility) and is preserved through the alignment process. The original count is divided first by the number of loci that the sequence aligns to, and second by the number of features passing selection at each locus. Switching this option "off" disables the latter normalization step.
By default, tiny-count will increment feature counts by a normalized amount to avoid overcounting. Each unique sequence's read count is determined by tiny-collapse (or a compatible collapsing utility) and is preserved through the alignment process. For sequences with multiple alignments, a portion of the sequence's original count is allocated to each of its alignments to be assigned to features that pass selection at the locus. This portion is the original count divided by the number of alignments, or _genomic hits_. By disabling this normalization step, each of the sequence's alignments will be allocated the full original read count rather than the normalized portion.
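
As a rough sketch of the allocation described above (the function name `allocate_per_alignment` and the numbers are hypothetical, not tiny-count's actual implementation):

```python
def allocate_per_alignment(read_count, genomic_hits, normalize_by_genomic_hits=True):
    """Return the portion of a sequence's read count given to each alignment."""
    if normalize_by_genomic_hits:
        # Split the original count evenly across all of the sequence's alignments
        return read_count / genomic_hits
    # Normalization disabled: every alignment receives the full original count
    return float(read_count)


# A sequence with 12 reads that aligns to 3 loci
normalized = allocate_per_alignment(12, 3)
unnormalized = allocate_per_alignment(12, 3, normalize_by_genomic_hits=False)
```

With normalization enabled each of the three alignments is allocated 4 reads, so the allocations sum back to the original 12. With it disabled each alignment is allocated all 12.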

### Normalize by Feature Hits
| Run Config Key | Commandline Argument |
|------------------------------------|-----------------------------------|
| counter_normalize_by_feature_hits: | `--normalize-by-feature-hits T/F` |

By default, tiny-count will increment feature counts by a normalized amount to avoid overcounting. Each sequence alignment locus is allocated a portion of the sequence's original read count (depending on `counter_normalize_by_genomic_hits`), and once selection is complete the allocated count is divided by the number of selected features, or _feature hits_, at the alignment. The resulting value is added to the totals for each matching feature. By disabling this normalization step, each selected feature will receive the full amount allocated to the locus rather than the normalized portion.
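
The second step can be sketched the same way. Again this is only an illustration under stated assumptions: `assign_counts`, the feature names, and the allocated value of 4.0 are hypothetical.

```python
from collections import Counter


def assign_counts(allocated, selected_features, feature_counts,
                  normalize_by_feature_hits=True):
    """Add one alignment's allocated count to each selected feature."""
    if not selected_features:
        return  # nothing passed selection at this locus
    if normalize_by_feature_hits:
        # Split the locus allocation evenly across the selected features
        increment = allocated / len(selected_features)
    else:
        # Normalization disabled: each feature receives the full allocation
        increment = allocated
    for feature in selected_features:
        feature_counts[feature] += increment


normalized = Counter()
assign_counts(4.0, ["featA", "featB"], normalized)

unnormalized = Counter()
assign_counts(4.0, ["featA", "featB"], unnormalized,
              normalize_by_feature_hits=False)
```

Here two features are selected at the alignment, so each receives 2.0 when the step is enabled and 4.0 when it is disabled.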

### Decollapse
| Run Config Key | Commandline Argument |
@@ -107,8 +114,8 @@ Diagnostic information will include intermediate alignment files for each librar

### Full tiny-count Help String
```
tiny-count (-pf FILE | --get-templates) [-o PREFIX] [-nh T/F] [-dc]
[-sv {Cython,HTSeq}] [-p] [-d]
tiny-count (-pf FILE | --get-templates) [-o PREFIX] [-ng T/F] [-nf T/F]
[-vs T/F] [-dc] [-sv {Cython,HTSeq}] [-p] [-d]

tiny-count is a precision counting tool for hierarchical classification and
quantification of small RNA-seq reads
@@ -132,9 +139,13 @@ Optional arguments:
occurrences of the substring {timestamp} will be
replaced with the current date and time. (default:
tiny-count_{timestamp})
-nh T/F, --normalize-by-hits T/F
If T/true, normalize counts by (selected) overlapping
feature counts. (default: T)
-ng T/F, --normalize-by-genomic-hits T/F
Normalize counts by genomic hits. (default: T)
-nf T/F, --normalize-by-feature-hits T/F
Normalize counts by feature hits. (default: T)
-vs T/F, --verify-stats T/F
Verify that all reported stats are internally
consistent. (default: T)
-dc, --decollapse Create a decollapsed copy of all SAM files listed in
your Samples Sheet. This option is ignored for non-
collapsed inputs. (default: False)
2 changes: 1 addition & 1 deletion doc/tiny-count.md
@@ -138,7 +138,7 @@ Examples:
>**Tip:** you may specify U and T bases in your rules. Uracil bases will be converted to thymine when your Features Sheet is loaded. N bases are also allowed.

## Count Normalization
Small RNA reads passing selection will receive a normalized count increment. By default, read counts are normalized twice before being assigned to a feature. The second normalization step can be disabled in `run_config.yml` if desired. Counts for each small RNA sequence are divided:
Small RNA reads passing selection will receive a normalized count increment. By default, read counts are normalized twice before being assigned to a feature. Both normalization steps can be disabled in `run_config.yml` if desired. Counts for each small RNA sequence are divided:
1. By the number of loci it aligns to in the genome.
2. By the number of _selected_ features for each of its alignments.
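
When both steps are enabled, the two divisions above combine into a single increment. The symbols below are introduced here for illustration only: $c$ is the sequence's collapsed read count, $G$ is its number of alignments (genomic hits), and $F_\ell$ is the number of selected features at alignment $\ell$ (feature hits).

```latex
\text{increment}_{\ell} = \frac{c}{G \cdot F_{\ell}}
\qquad\text{e.g.}\qquad
c = 12,\; G = 3,\; F_{\ell} = 2
\;\Rightarrow\;
\text{increment}_{\ell} = \frac{12}{3 \cdot 2} = 2
```

Disabling a step is equivalent to replacing the corresponding divisor with 1. Assuming every alignment has at least one selected feature, the increments summed over all features and alignments equal $c$ when both steps are enabled, which is why the default settings avoid overcounting.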

2 changes: 1 addition & 1 deletion setup.py
@@ -15,7 +15,7 @@
AUTHOR = 'Kristen Brown, Alex Tate'
PLATFORM = 'Unix'
REQUIRES_PYTHON = '>=3.9.0'
VERSION = '1.3.0'
VERSION = '1.4.0'
REQUIRED = [] # Required packages are installed via Conda's environment.yml


10 changes: 6 additions & 4 deletions tests/testdata/config_files/run_config_template.yml
@@ -225,9 +225,11 @@ shared_memory: False
##-- If True: show all parsed features in the counts csv, regardless of count/identity --##
counter_all_features: False

##-- If True: counts will be normalized by genomic hits AND selected feature count --##
##-- If False: counts will only be normalized by genomic hits --##
counter_normalize_by_hits: True
##-- If True: counts are normalized by genomic hits (number of multi-alignments) --##
counter_normalize_by_genomic_hits: True

##-- If True: counts are normalized by feature hits (selected feature count per-locus) --##
counter_normalize_by_feature_hits: True

##-- If True: a decollapsed copy of each SAM file will be produced (useful for IGV) --##
counter_decollapse: False
@@ -326,7 +328,7 @@ dir_name_logs: logs
#
###########################################################################################

version: 1.3.0
version: 1.4.0

######--------------------------- DERIVED FROM PATHS FILE ---------------------------######
#
14 changes: 12 additions & 2 deletions tiny/cwl/tools/tiny-count.cwl
@@ -30,10 +30,15 @@ inputs:

# Optional inputs

normalize_by_hits:
normalize_by_feature_hits:
type: string?
inputBinding:
prefix: --normalize-by-hits
prefix: --normalize-by-feature-hits

normalize_by_genomic_hits:
type: string?
inputBinding:
prefix: --normalize-by-genomic-hits

decollapse:
type: boolean?
@@ -140,5 +145,10 @@ outputs:
outputBinding:
glob: $(inputs.out_prefix)_selection_diags.txt

stats_check:
type: File?
outputBinding:
glob: "*_stats_check.csv"

console_output:
type: stdout
14 changes: 9 additions & 5 deletions tiny/cwl/workflows/tinyrna_wf.cwl
@@ -87,7 +87,8 @@ inputs:
counter_decollapse: boolean?
counter_stepvector: string?
counter_all_features: boolean?
counter_normalize_by_hits: boolean?
counter_normalize_by_feature_hits: boolean?
counter_normalize_by_genomic_hits: boolean?

# tiny-deseq inputs
run_deseq: boolean
@@ -214,8 +215,11 @@ steps:
gff_files: gff_files
out_prefix: run_name
all_features: counter_all_features
normalize_by_hits:
source: counter_normalize_by_hits
normalize_by_feature_hits:
source: counter_normalize_by_feature_hits
valueFrom: $(String(self)) # convert boolean -> string
normalize_by_genomic_hits:
source: counter_normalize_by_genomic_hits
valueFrom: $(String(self)) # convert boolean -> string
decollapse: counter_decollapse
stepvector: counter_stepvector
@@ -225,7 +229,7 @@ steps:
collapsed_fa: preprocessing/uniq_seqs
out: [ feature_counts, rule_counts, norm_counts, mapped_nt_len_dist, assigned_nt_len_dist,
alignment_stats, summary_stats, decollapsed_sams, alignment_tables,
assignment_diags, selection_diags ]
assignment_diags, selection_diags, stats_check ]

tiny-deseq:
run: ../tools/tiny-deseq.cwl
@@ -322,7 +326,7 @@ steps:
tiny-count/mapped_nt_len_dist, tiny-count/assigned_nt_len_dist,
tiny-count/alignment_stats, tiny-count/summary_stats, tiny-count/decollapsed_sams,
tiny-count/assignment_diags, tiny-count/selection_diags, tiny-count/alignment_tables,
features_csv ]
tiny-count/stats_check, features_csv ]
dir_name: dir_name_tiny-count
out: [ subdir ]

4 changes: 2 additions & 2 deletions tiny/rna/configuration.py
@@ -896,8 +896,8 @@ def check_backward_compatibility(self, header_vals):
if 'mismatches' not in header_vals_lc:
compat_errors.append('\n'.join([
"It looks like you're using a Features Sheet from an earlier version of",
'tinyRNA. An additional column, "Mismatches", is now expected. Please review'
"the Stage 2 section in tiny-count's documentation for more info, then add"
'tinyRNA. An additional column, "Mismatches", is now expected. Please review',
"the Stage 2 section in tiny-count's documentation for more info, then add",
"the new column to your Features Sheet to avoid this error."
]))

17 changes: 11 additions & 6 deletions tiny/rna/counter/counter.py
@@ -53,8 +53,12 @@ def get_args():
optional_args.add_argument('-o', '--out-prefix', metavar='PREFIX', default='tiny-count_{timestamp}',
help='The output prefix to use for file names. All occurrences of the '
'substring {timestamp} will be replaced with the current date and time.')
optional_args.add_argument('-nh', '--normalize-by-hits', metavar='T/F', default='T',
help='If T/true, normalize counts by (selected) overlapping feature counts.')
optional_args.add_argument('-ng', '--normalize-by-genomic-hits', metavar='T/F', default='T',
help='Normalize counts by genomic hits.')
optional_args.add_argument('-nf', '--normalize-by-feature-hits', metavar='T/F', default='T',
help='Normalize counts by feature hits.')
optional_args.add_argument('-vs', '--verify-stats', metavar='T/F', default='T',
help='Verify that all reported stats are internally consistent.')
optional_args.add_argument('-dc', '--decollapse', action='store_true',
help='Create a decollapsed copy of all SAM files listed in your Samples Sheet. '
'This option is ignored for non-collapsed inputs.')
@@ -80,7 +84,8 @@ def get_args():
else:
args_dict = vars(args)
args_dict['out_prefix'] = args.out_prefix.replace('{timestamp}', get_timestamp())
args_dict['normalize_by_hits'] = args.normalize_by_hits.lower() in ['t', 'true']
for tf in ('normalize_by_feature_hits', 'normalize_by_genomic_hits', 'verify_stats'):
args_dict[tf] = args_dict[tf].lower() in ['t', 'true']
return ReadOnlyDict(args_dict)
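
For readers who want to try the T/F conversion in isolation, here is a standalone sketch. It reuses the three flag definitions and the conversion loop from this hunk, but `parse_tf_flags` and the parser's `prog` name are hypothetical, and the project's `ReadOnlyDict` wrapper is left out.

```python
import argparse


def parse_tf_flags(argv):
    """Parse the three T/F options and convert their string values to bools."""
    parser = argparse.ArgumentParser(prog="tiny-count-sketch")
    parser.add_argument('-ng', '--normalize-by-genomic-hits', metavar='T/F', default='T')
    parser.add_argument('-nf', '--normalize-by-feature-hits', metavar='T/F', default='T')
    parser.add_argument('-vs', '--verify-stats', metavar='T/F', default='T')

    args_dict = vars(parser.parse_args(argv))
    for tf in ('normalize_by_feature_hits', 'normalize_by_genomic_hits', 'verify_stats'):
        # Only "t" and "true" (case-insensitive) are treated as True
        args_dict[tf] = args_dict[tf].lower() in ['t', 'true']
    return args_dict


prefs = parse_tf_flags(['-ng', 'F', '--verify-stats', 'true'])
```

Note that with this conversion any value other than `t` or `true` becomes `False` without raising an error, so a value such as `yes` would disable the option.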


@@ -221,11 +226,11 @@ def map_and_reduce(libraries, paths, prefs):
with mp.Pool(len(libraries)) as pool:
async_results = pool.imap_unordered(counter.count_reads, libraries)

for stats_result in async_results:
summary.add_library(stats_result)
for result in async_results:
summary.add_library_stats(result)
else:
# Only one library, multiprocessing not beneficial for task
summary.add_library(counter.count_reads(libraries[0]))
summary.add_library_stats(counter.count_reads(libraries[0]))

return summary
