tiny-count: option for normalization by genomic hits, improvements in stats collection precision by AlexTate · Pull Request #301 · MontgomeryLab/tinyRNA

AlexTate · 2023-04-08T21:18:59Z

The two normalization-by-hits options, by genomic hits and feature hits, can now be disabled both independently and in tandem. When either normalization step is disabled, two new stats are reported in summary stats in place of Mapped Reads:

Non-normalized Mapped Reads: the sum of assigned and unassigned reads according to the normalization config
Normalized Mapped Reads: the true read count

tiny-plot automatically uses the appropriate stat for calculating proportions in rule_charts and class_charts.
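
To illustrate why the two stats diverge, here is a minimal sketch under a simplified counting model. It assumes each sequence has a read count and a number of genomic hits, and that each alignment contributes the read count divided by the genomic hits when that normalization is enabled, or the full read count when it is disabled. The function name and data layout are hypothetical and are not taken from tiny-count.

```python
def mapped_reads(sequences, normalize_by_genomic_hits=True):
    """Sum per-alignment read contributions for one library.

    `sequences` is a list of (read_count, genomic_hits) pairs. This layout
    is an assumption made for illustration only.
    """
    total = 0.0
    for read_count, genomic_hits in sequences:
        # With normalization on, each alignment receives an equal share of
        # the sequence's reads; with it off, each receives the full count.
        divisor = genomic_hits if normalize_by_genomic_hits else 1
        per_alignment = read_count / divisor
        total += per_alignment * genomic_hits
    return total


# Two sequences: 10 reads aligning at 1 locus, 6 reads aligning at 3 loci
library = [(10, 1), (6, 3)]

normalized = mapped_reads(library, normalize_by_genomic_hits=True)
non_normalized = mapped_reads(library, normalize_by_genomic_hits=False)

print(normalized)      # true read count: 10 + 6
print(non_normalized)  # multi-mapping reads counted once per locus: 10 + 18
```

Under this model, assigned counts collected without normalization only form valid proportions against the non-normalized total, which is consistent with tiny-plot choosing that stat as the denominator.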

Internal Consistency Checks for Reported Stats

The counts reported in all output stats files are now thoroughly checked for internal consistency. Discrepancies are reported to the console with a clear description rather than being treated as errors. If alignment stats and summary stats disagree internally, a checksum table is written to a CSV file for further diagnosis. The consistency checker is designed to fail gracefully if an unforeseen exception is raised, so counting outputs are not lost because of a fault in the checker. Consistency is checked after every run and usually takes a fraction of a second. Nonetheless, a command line option has been added to tiny-count so that this step can be turned off if needed. The following checks are performed for each library:

Internal consistency for all assigned/unassigned read/sequence counts in alignment stats and summary stats
Non-normalized Mapped Reads >= Normalized Mapped Reads
Reported count totals in feature counts == reported count totals in rule counts
Reported count totals in rule counts == assigned read count in summary stats (same for feature counts by transitive property)
Reported count totals in rule counts <= its mapped reads row
Reported counts per classifier have equal sums in feature counts and rule counts
Total reads in mapped nt/len matrices == mapped reads in summary stats
Total reads in assigned nt/len matrices == assigned reads in summary stats
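
A few of the checks above can be sketched as follows. This is only an illustration of the approach (tolerant comparison, reporting instead of raising, and a guard so the checker cannot abort the run). The dictionary keys and function names are hypothetical and do not reflect the actual StatisticsValidator code. The tolerance of 1.0 mirrors the default mentioned in the commit history.

```python
TOLERANCE = 1.0  # allowed absolute difference between two count totals


def approx_equal(a, b, tolerance=TOLERANCE):
    """Compare two count totals while allowing for floating point drift."""
    return abs(a - b) <= tolerance


def check_library(stats):
    """Return a list of human-readable discrepancies for one library.

    `stats` is a plain dict with hypothetical keys; the real stats tables
    are not modeled here.
    """
    problems = []
    if not approx_equal(stats["feature_counts_total"], stats["rule_counts_total"]):
        problems.append("feature counts total != rule counts total")
    if not approx_equal(stats["rule_counts_total"], stats["assigned_reads"]):
        problems.append("rule counts total != assigned reads in summary stats")
    if stats["rule_counts_total"] > stats["mapped_reads"] + TOLERANCE:
        problems.append("rule counts total > mapped reads")
    return problems


def validate(stats):
    """Run the checks, but never let a fault in the checker abort the run."""
    try:
        for problem in check_library(stats):
            print(f"Stats check: {problem}")  # reported, not raised
        return True
    except Exception as exc:
        # Any unforeseen error is reported and swallowed so that counting
        # outputs can still be written.
        print(f"Stats check could not be completed: {exc}")
        return False
```

Returning a list of messages keeps the individual checks separate from how they are reported, and the broad `except` is deliberate here because the checker is advisory.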

Codebase Improvements

MergedStats classes have been mildly refactored to ensure that stats are complete and final before being validated and written to output files.

Closes #295

…bstract method. The goal is to separate the finalization steps from write_output_logfile(), leaving minimal calls to whichever writing function is used. This is more in line with the function's name and allows the stats objects to be validated before writing output files

…he code used for reading Features Sheet rules. It will be useful for statistics.py to have a rudimentary copy of the rules table to work with for validation. The get_inverted_classifiers() function returns a dictionary of {Classifier: [rule, indexes]} so that we can look up rules by classifier. This will be used to validate feature counts per classifier to make sure the feature counts table agrees with the rule counts table
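
A minimal sketch of the inversion described above, assuming the rules table is a list of dictionaries with a "Classifier" key. The function name, table layout, and classifier values are hypothetical and are not the actual get_inverted_classifiers() implementation.

```python
from collections import defaultdict


def invert_classifiers(rules):
    """Map each classifier to the indexes of the rules that use it."""
    inverted = defaultdict(list)
    for index, rule in enumerate(rules):
        inverted[rule["Classifier"]].append(index)
    return dict(inverted)


# Illustrative rules table; only the classifier column matters here
rules = [
    {"Classifier": "miRNA"},
    {"Classifier": "siRNA"},
    {"Classifier": "miRNA"},
]

by_classifier = invert_classifiers(rules)
print(by_classifier)  # {'miRNA': [0, 2], 'siRNA': [1]}

# Per-classifier totals from a rule counts column, for comparison against
# the per-classifier sums in a feature counts table
rule_counts = [5.0, 2.0, 3.0]
totals = {c: sum(rule_counts[i] for i in idx) for c, idx in by_classifier.items()}
print(totals)  # {'miRNA': 8.0, 'siRNA': 2.0}
```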

…h disabled normalization. Also introducing some strategic rounding to prevent floating point error accumulation. It turns out this can be quite significant given enough alignments, and it grows unpredictably. It costs a small amount of runtime but allows stats to be validated and internally consistent

…luation of the internal consistency of all reported stats, between output files and in some cases within. I've made an effort to write this class in a way that it won't bring the rest of tiny-count down if it encounters an error. Validation is performed in a floating-point-error-conscious way. Issues are reported but not treated as errors.

…Non-normalized Mapped Reads when the hit normalization steps are disabled. This allows us to report the true mapped read count while still maintaining valid proportion calculations in tiny-plot for rule_charts and class_charts. Otherwise, only the Mapped Read stat is reported when default normalization is used. StatisticsValidator verifies that Normalized Mapped Reads <= Non-normalized Mapped Reads when these stats are reported

…e() abstract methods. After further testing I realized that the trouble we were seeing with internal consistency disagreement was actually caused by prematurely rounding final counts in finalize(). Counts should instead be rounded to 2 decimal places in write_output_logfile().
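
The effect of rounding too early can be shown with a small, self-contained example. It assumes a toy case in which 1000 single-read sequences each align at 3 loci, so every alignment contributes 1/3 of a read. The variable names are illustrative only.

```python
n_sequences = 1000
hits_per_sequence = 3

# One fractional contribution per alignment
contributions = [1 / hits_per_sequence] * (n_sequences * hits_per_sequence)

# Rounding each contribution to 2 decimal places before summing discards
# a little from every alignment, and the loss accumulates.
rounded_early = sum(round(c, 2) for c in contributions)

# Summing at full precision and rounding once, at write time, keeps the
# total in agreement with the true read count.
rounded_late = round(sum(contributions), 2)

print(rounded_early)  # about 990, i.e. roughly 10 reads lost
print(rounded_late)   # about 1000
```

Totals computed from early-rounded values can then disagree with totals computed elsewhere at full precision, which is the kind of disagreement a consistency check would flag.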

taimontgomery · 2023-04-17T17:05:33Z

Tested successfully with ram1 and Lib303 data.

AlexTate added 22 commits April 5, 2023 19:12

Preliminary refactor and addition of normalization options to accommo…

41b819f

…date normalization by genomic hits

Adding command line option to tiny-count, CWL, and the Run Config for…

ba2fb5e

… disabling normalization by genomic hits. I renamed "normalize hits" to "normalize by feature hits" to disambiguate and make consistent with the new option

Refactoring FeatureCounts to use the new ABC methods

37574d5

Refactoring NtLenMatrices to use the new ABC methods.

fc967d2

Refactoring AlignmentStats to use the new ABC methods.

8d095cb

Refactoring SummaryStats to use the new ABC methods. Also changing me…

38db0d2

…thod order so that get_mapped_seqs/reads is listed near the other stats calculations.

Detail and readability improvements

b7195e2

Merge branch 'issue-299' into issue-295

be25ac7

# Conflicts: # tiny/rna/counter/statistics.py

Removing unused parameter from the Diagnostics class

dae09ca

Version bump to 1.4

a82c57c

Updating the two normalization keys in all Run Configs. Updating compatibility definitions as well.

Adding the optional stats_check.csv file to the CWL. This file is pro…

fe18584

…duced when alignment stats are found to not be internally consistent

Raising the floating point error tolerance in StatisticsValidator.app…

51ffb71

…rox_equal() now that testing is complete

Correcting the calculation of total unassigned reads and mapped nt5/l…

f4432ff

…ength in finalize_bundle()

tiny-plot has been updated to use Non-normalized Mapped Reads instead…

aeda3eb

… of Mapped Reads when it is reported so that proper proportion calculations can be made for class_charts and rule_charts

Minor bugfix unrelated to issue-295

9d98294

Documentation updates for the new normalization options

23e1079

Cleaning up the tiny-count helpstring for normalization and stats ver…

5739a94

…ification so that it reads and formats more cleanly. Also updating the helpstring in Parameters.md

AlexTate requested a review from taimontgomery April 8, 2023 21:18

AlexTate added 7 commits April 9, 2023 12:29

Increasing rounding precision. 2 decimal places was far too strict fo…

7743fa2

…r intermediate operations. Also increasing the equality tolerance by a small amount to help with erroneous warnings for large counting tasks

Removing rounding steps for floating point error mitigation. After fu…

5f489ff

…rther reflection and testing, this was hurting accuracy more than it was helping

Updating df_to_csv() so that FeatureCounts can use it with its multii…

3bde3d3

…ndex. It also makes more sense to set the index name in each class' constructor instead of in this method

Updating FeatureCounts to label its multiindex during construction, t…

f45ab41

…o not round its values until write_output_logfile(), and to use MergedStats.df_to_csv() to write the feature_counts and norm_counts tables for consistency

Updating RuleCounts to populate and label its index at construction a…

63ed696

…nd not round its values until write_output_logfile(). Also simplifying the dataframe operations in add_library() and finalize() so that they are consistent with other MergedStats classes.

Updating NtLenMatrices to not round its values until write_output_log…

fcda06d

…file(), and to use MergedStats.df_to_csv() to write its outputs for consistency.

AlexTate added 6 commits April 16, 2023 16:44

Updating AlignmentStats to not round its values until write_output_lo…

7f671ee

…gfile(), and to label its index name at construction

Updating SummaryStats to not round its values until write_output_logf…

e78431c

…ile(), and to label its index name at construction

Consistency updates for MergedDiags

1107180

Minor consistency updates in StatisticsValidator. Also updating hand…

203b4f9

…ling of FeatureCounts now that it properly maintains its multiindex past finalize() and through the csv writing step

Removing the sort_cols_and_round() helper function because it is no l…

dae930d

…onger used by MergedStats classes

Small correction for mapped len dist tables. Rounding should be perfo…

0672c3c

…rmed before conversion to int since .astype('int64') is functionally equivalent to .apply(np.floor). Also raising the default equality tolerance in StatisticsValidator back to 1.0

taimontgomery merged commit 5deeb87 into master Apr 17, 2023

taimontgomery merged 35 commits into master from issue-295
