tiny-count: option for normalization by genomic hits, improvements in stats collection precision #301

Merged
taimontgomery merged 35 commits into master from issue-295
Apr 17, 2023

Conversation

@AlexTate
Member

@AlexTate AlexTate commented Apr 8, 2023

The two normalization-by-hits options, by genomic hits and feature hits, can now be disabled both independently and in tandem. When either normalization step is disabled, two new stats are reported in summary stats in place of Mapped Reads:

  • Non-normalized Mapped Reads: the sum of assigned and unassigned reads according to the normalization config
  • Normalized Mapped Reads: the true read count

tiny-plot automatically uses the appropriate stat for calculating proportions in rule_charts and class_charts.
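As a rough illustration of why the two mapped-read stats can differ, the sketch below assumes that a sequence's read count is divided by its number of genomic hits, and then by the number of features matched at each locus, whenever the corresponding normalization step is enabled. The function name, its parameters, and the input layout are hypothetical and are not taken from tiny-count's code.

```python
def credited_counts(read_count, features_per_locus,
                    by_genomic_hits=True, by_feature_hits=True):
    """Return the count credited to every feature match of one sequence.

    Hypothetical helper for illustration only.
    features_per_locus: one entry per alignment locus, giving the number of
    features matched at that locus (assumed to be >= 1).
    """
    genomic_hits = len(features_per_locus)
    # Normalization by genomic hits: split the reads evenly across loci
    per_locus = read_count / genomic_hits if by_genomic_hits else read_count
    credited = []
    for n_features in features_per_locus:
        # Normalization by feature hits: split each locus share across features
        share = per_locus / n_features if by_feature_hits else per_locus
        credited.extend([share] * n_features)
    return credited


reads, loci = 12, [2, 1, 3]  # 12 reads, 3 loci with 2, 1 and 3 feature matches

both_on = sum(credited_counts(reads, loci))                          # about 12
genomic_off = sum(credited_counts(reads, loci, by_genomic_hits=False))  # 36
both_off = sum(credited_counts(reads, loci,
                               by_genomic_hits=False,
                               by_feature_hits=False))                # 72
```

With both steps enabled the total stays at the true read count. Disabling either step inflates the total, which is why proportions need the non-normalized total as their denominator in that configuration.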

Internal Consistency Checks for Reported Stats

The counts reported in all output stats files are now checked for internal consistency. Discrepancies are reported to the console with a clear description rather than being treated as errors. If alignment stats and summary stats disagree internally, a checksum table is written as a CSV file for further diagnosis. The consistency checker is designed to fail gracefully if an unforeseen exception is raised, so that a fault in the checker does not cause counting outputs to be lost. Consistency is checked after every run and usually takes a fraction of a second. Nonetheless, a command line option has been added to tiny-count to allow this step to be turned off if needed. The following checks are performed for each library:

  • Internal consistency for all assigned/unassigned read/sequence counts in alignment stats and summary stats
  • Non-normalized Mapped Reads >= Normalized Mapped Reads
  • Reported count totals in feature counts == reported count totals in rule counts
  • Reported count totals in rule counts == assigned read count in summary stats (same for feature counts by transitive property)
  • Reported count totals in rule counts <= its mapped reads row
  • Reported counts per classifier have equal sums in feature counts and rule counts
  • Total reads in mapped nt/len matrices == mapped reads in summary stats
  • Total reads in assigned nt/len matrices == assigned reads in summary stats
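A few of these checks can be sketched as follows. This is a minimal illustration, not the StatisticsValidator implementation: the function name and the dictionary keys are hypothetical, and only the tolerance of 1.0 comes from this PR's commit notes. Comparisons allow for small floating point differences, and any unexpected exception is turned into a message so that the caller can carry on writing outputs.

```python
def check_library(stats, tolerance=1.0):
    """Return a list of discrepancy messages for one library.

    Hypothetical sketch: `stats` is assumed to be a dict of totals that were
    already collected from the output tables.
    """
    def differs(a, b):
        # Treat values within `tolerance` of each other as equal
        return abs(a - b) > tolerance

    issues = []
    try:
        if stats["normalized_mapped"] - stats["non_normalized_mapped"] > tolerance:
            issues.append("Normalized Mapped Reads exceeds Non-normalized Mapped Reads")
        if differs(stats["feature_counts_total"], stats["rule_counts_total"]):
            issues.append("Feature counts total != rule counts total")
        if differs(stats["rule_counts_total"], stats["assigned_reads"]):
            issues.append("Rule counts total != assigned reads in summary stats")
    except Exception as e:
        # Fail gracefully: report the problem instead of raising
        issues.append(f"Consistency check could not complete: {e!r}")
    return issues


consistent = {
    "normalized_mapped": 1200.0,
    "non_normalized_mapped": 1500.0,
    "feature_counts_total": 1000.0,
    "rule_counts_total": 1000.4,   # within tolerance of the feature total
    "assigned_reads": 1000.0,
}
for message in check_library(consistent):
    print("WARNING:", message)     # prints nothing for consistent stats
```

Returning messages rather than raising keeps the reporting decision with the caller, which matches the stated goal of reporting issues without treating them as errors.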

Codebase Improvements

MergedStats classes have been mildly refactored to ensure that stats are complete and final before being validated and written to output files.

Closes #295

AlexTate added 22 commits April 5, 2023 19:12
… disabling normalization by genomic hits. I renamed "normalize hits" to "normalize by feature hits" to disambiguate and make consistent with the new option
…bstract method. The goal is to separate the finalization steps from write_output_logfile() leaving minimal calls to whichever writing function is used. This is more in line with the function's name and allows the stats objects to be validated before writing output files
…he code used for reading Features Sheet rules. It will be useful for statistics.py to have a rudimentary copy of the rules table to work with for validation. The get_inverted_classifiers() function returns a dictionary of {Classifier: [rule, indexes]} so that we can look up rules by classifier. This will be used to validate feature counts per classifier to make sure the feature counts table agrees with the rule counts table
…thod order so that get_mapped_seqs/reads is listed near the other stats calculations.
…h disabled normalization. Also introducing some strategic rounding to prevent floating point error accumulation. It turns out this can be quite significant given enough alignments, and it grows unpredictably. It costs a small amount of runtime but allows stats to be validated and internally consistent
…luation of the internal consistency of all reported stats, between output files and in some cases within.

I've made an effort to write this class in a way that it won't bring the rest of tiny-count down if it encounters an error. Validation is performed in a floating-point-error-conscious way. Issues are reported but not treated as errors.
# Conflicts:
#	tiny/rna/counter/statistics.py
Updating the two normalization keys in all Run Configs. Updating compatibility definitions as well.
…duced when alignment stats are found to not be internally consistent
…Non-normalized Mapped Reads when the hit normalization steps are disabled. This allows us to report the true mapped read count while still maintaining valid proportion calculations in tiny-plot for rule_charts and class_charts. Otherwise, only the Mapped Read stat is reported when default normalization is used.

 StatisticsValidator verifies that Normalized Mapped Reads <= Non-normalized Mapped Reads when these stats are reported
… of Mapped Reads when it is reported so that proper proportion calculations can be made for class_charts and rule_charts
…ification so that it reads and formats more cleanly. Also updating the helpstring in Parameters.md
@AlexTate AlexTate requested a review from taimontgomery April 8, 2023 21:18
AlexTate added 7 commits April 9, 2023 12:29
…r intermediate operations. Also increasing the equality tolerance by a small amount to help with erroneous warnings for large counting tasks
…rther reflection and testing, this was hurting accuracy more than it was helping
…e() abstract methods. After further testing I realized that the trouble we were seeing with internal consistency disagreement was actually caused by prematurely rounding final counts in finalize(). Counts should instead be rounded to 2 decimal places in write_output_logfile().
…ndex. It also makes more sense to set the index name in each class' constructor instead of in this method
…o not round its values until write_output_logfile(), and to use MergedStats.df_to_csv() to write the feature_counts and norm_counts tables for consistency
…nd not round its values until write_output_logfile(). Also simplifying the dataframe operations in add_library() and finalize() so that they are consistent with other MergedStats classes.
…file(), and to use MergedStats.df_to_csv() to write its outputs for consistency.
…gfile(), and to label its index name at construction
…ile(), and to label its index name at construction
…ling of FeatureCounts now that it properly maintains its multiindex past finalize() and through the csv writing step
…rmed before conversion to int since .astype('int64') is functionally equivalent to .apply(np.floor). Also raising the default equality tolerance in StatisticsValidator back to 1.0
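The effect of rounding too early, as described in the commits above, can be reproduced with a small standard-library sketch. The values below are made up for illustration, and Python's int() is used only as a stand-in for the truncating behaviour of .astype('int64').

```python
import math

# 3000 hypothetical normalized counts of one third of a read each
counts = [1 / 3] * 3000

# Premature rounding: every count is rounded to 2 decimals before summing
premature_total = math.fsum(round(c, 2) for c in counts)

# Deferred rounding: sum at full precision, round once when writing
deferred_total = round(math.fsum(counts), 2)

# The per-value error of about 0.0033 accumulates to roughly 10 reads,
# well beyond an equality tolerance of 1.0
drift = abs(deferred_total - premature_total)

# Truncation and floor agree for non-negative values such as counts,
# but not for negative values
same_for_counts = int(2.7) == math.floor(2.7)
same_for_negatives = int(-2.7) == math.floor(-2.7)
```

Here the deferred total is 1000 while the prematurely rounded total is about 990, so a consistency check comparing the two would flag a discrepancy that exists only because of when the rounding happened.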
@taimontgomery
Collaborator

Tested successfully with ram1 and Lib303 data.

@taimontgomery taimontgomery merged commit 5deeb87 into master Apr 17, 2023

Development

Successfully merging this pull request may close these issues.

tiny-count: make normalization by genomic hits optional