tiny-count: option for normalization by genomic hits, improvements in stats collection precision #301
taimontgomery merged 35 commits into master on Apr 17, 2023
…date normalization by genomic hits
… disabling normalization by genomic hits. I renamed "normalize hits" to "normalize by feature hits" to disambiguate and make consistent with the new option
…bstract method. The goal is to separate the finalization steps from write_output_logfile(), leaving minimal calls to whichever writing function is used. This is more in line with the function's name and allows the stats objects to be validated before writing output files
…he code used for reading Features Sheet rules. It will be useful for statistics.py to have a rudimentary copy of the rules table to work with for validation. The get_inverted_classifiers() function returns a dictionary of {Classifier: [rule, indexes]} so that we can look up rules by classifier. This will be used to validate feature counts per classifier to make sure the feature counts table agrees with the rule counts table
…thod order so that get_mapped_seqs/reads is listed near the other stats calculations.
…h disabled normalization. Also introducing some strategic rounding to prevent floating point error accumulation. It turns out this can be quite significant given enough alignments, and it grows unpredictably. It costs a small amount of runtime but allows stats to be validated and internally consistent
…luation of the internal consistency of all reported stats, between output files and in some cases within. I've made an effort to write this class in a way that it won't bring the rest of tiny-count down if it encounters an error. Validation is performed in a floating-point-error-conscious way. Issues are reported but not treated as errors.
# Conflicts: # tiny/rna/counter/statistics.py
Updating the two normalization keys in all Run Configs. Updating compatibility definitions as well.
…duced when alignment stats are found to not be internally consistent
…rox_equal() now that testing is complete
…ength in finalize_bundle()
…Non-normalized Mapped Reads when the hit normalization steps are disabled. This allows us to report the true mapped read count while still maintaining valid proportion calculations in tiny-plot for rule_charts and class_charts. Otherwise, only the Mapped Read stat is reported when default normalization is used. StatisticsValidator verifies that Normalized Mapped Reads <= Non-normalized Mapped Reads when these stats are reported
… of Mapped Reads when it is reported so that proper proportion calculations can be made for class_charts and rule_charts
…ification so that it reads and formats more cleanly. Also updating the helpstring in Parameters.md
…r intermediate operations. Also increasing the equality tolerance by a small amount to help with erroneous warnings for large counting tasks
…rther reflection and testing, this was hurting accuracy more than it was helping
…e() abstract methods. After further testing I realized that the trouble we were seeing with internal consistency disagreement was actually caused by prematurely rounding final counts in finalize(). Counts should instead be rounded to 2 decimal places in write_output_logfile().
…ndex. It also makes more sense to set the index name in each class' constructor instead of in this method
…o not round its values until write_output_logfile(), and to use MergedStats.df_to_csv() to write the feature_counts and norm_counts tables for consistency
…nd not round its values until write_output_logfile(). Also simplifying the dataframe operations in add_library() and finalize() so that they are consistent with other MergedStats classes.
…file(), and to use MergedStats.df_to_csv() to write its outputs for consistency.
…gfile(), and to label its index name at construction
…ile(), and to label its index name at construction
…ling of FeatureCounts now that it properly maintains its multiindex past finalize() and through the csv writing step
…onger used by MergedStats classes
…rmed before conversion to int since .astype('int64') is functionally equivalent to .apply(np.floor). Also raising the default equality tolerance in StatisticsValidator back to 1.0
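Several of the commit notes above concern floating-point drift, tolerance-based equality, and integer conversion. The following is a minimal standard-library sketch of those ideas, not code from this PR: `within_tolerance` is a hypothetical helper, and its default of 1.0 only mirrors the tolerance mentioned in the last commit note. Plain `int()` stands in for a float-to-int64 cast, which likewise truncates toward zero.

```python
import math


def within_tolerance(x, y, tolerance=1.0):
    """Hypothetical tolerance check; name and default are assumptions."""
    return abs(x - y) <= tolerance


# 300,000 alignments that each contribute one third of a read.
increments = [1 / 3] * 300_000

naive_total = 0.0
for inc in increments:
    naive_total += inc  # rounding error can build up step by step

reference = math.fsum(increments)  # error-compensated sum for comparison
drift = abs(naive_total - reference)

# Exact equality is fragile for accumulated floats; a tolerance is not.
consistent = within_tolerance(naive_total, 100_000)

# For non-negative values, truncation and floor agree...
assert all(int(c) == math.floor(c) for c in (0.0, 2.5, 7.99))
# ...but they differ for negative values, which counts never are.
assert int(-2.5) == -2 and math.floor(-2.5) == -3
```

Because counts are never negative, truncating and flooring give the same integer, which is why a separate floor step before the cast would be redundant.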
Collaborator
Tested successfully with ram1 and Lib303 data.
The two normalization-by-hits options, by genomic hits and by feature hits, can now be disabled independently or together. When either normalization step is disabled, two new stats are reported in summary stats in place of Mapped Reads: Normalized Mapped Reads and Non-normalized Mapped Reads.
tiny-plot automatically uses the appropriate stat for calculating proportions in rule_charts and class_charts.
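To illustrate how two independent normalization switches can interact, here is a rough sketch. It assumes that each normalization step divides a sequence's read count by the corresponding hit count, and that both hit counts are at least 1. The function and parameter names are hypothetical and are not tiny-count's actual API or option names.

```python
def alignment_increment(read_count, genomic_hits, feature_hits,
                        normalize_by_genomic_hits=True,
                        normalize_by_feature_hits=True):
    """Count contributed by one alignment to one matched feature.

    Assumption: each enabled step divides by its hit count
    (genomic_hits >= 1 and feature_hits >= 1).
    """
    increment = float(read_count)
    if normalize_by_genomic_hits:
        increment /= genomic_hits   # split across genomic alignments
    if normalize_by_feature_hits:
        increment /= feature_hits   # split across matched features
    return increment


# A sequence with 12 reads, 3 genomic hits, and 2 matched features per hit.
both = alignment_increment(12, 3, 2)
no_genomic = alignment_increment(12, 3, 2, normalize_by_genomic_hits=False)
no_feature = alignment_increment(12, 3, 2, normalize_by_feature_hits=False)
neither = alignment_increment(12, 3, 2,
                              normalize_by_genomic_hits=False,
                              normalize_by_feature_hits=False)
```

Under these assumptions, disabling a step leaves the corresponding division out, so totals summed over alignments are larger than with both steps enabled.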
Internal Consistency Checks for Reported Stats
The counts reported in all output stats files are now thoroughly checked for internal consistency. Discrepancies are reported to the console with a clear description rather than being treated as errors. If alignment stats and summary stats disagree internally, a checksum table is produced as a CSV file for further diagnosis. Care has been taken to ensure that the consistency checker fails gracefully if any unforeseen exception is raised, so that counting outputs are never lost because of a fault in the checker. Consistency is checked after every run and typically takes only a fraction of a second. Nonetheless, a command line option has been added to tiny-count to allow this step to be turned off if needed. The following checks are performed for each library:
Codebase Improvements
MergedStats classes have been mildly refactored to ensure that stats are complete and final before being validated and written to output files.
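The finalize, validate, then write ordering can be sketched as follows. This is a simplified illustration, not the project's code: `SketchStats` and `validate_safely` are hypothetical names, and only the method names `add_library()`, `finalize()`, and `write_output_logfile()` come from the commit notes. Values stay unrounded until they are written, and a fault inside validation is reported instead of raised.

```python
import csv
import io


class SketchStats:
    """Hypothetical stand-in for a MergedStats-style class (not the real API)."""

    def __init__(self):
        self.totals = {}        # library name -> unrounded total
        self.finalized = False

    def add_library(self, name, increments):
        self.totals[name] = sum(increments)

    def finalize(self):
        # Mark the table complete, but keep full precision for validation.
        self.finalized = True

    def write_output_logfile(self, handle):
        # Rounding happens only here, at the moment of writing.
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["Library", "Total"])
        for name, total in self.totals.items():
            writer.writerow([name, round(total, 2)])


def validate_safely(stats, expected, tolerance=1.0):
    """Return a list of issue descriptions; never raise."""
    try:
        return [
            f"{name}: total {total} differs from expected {expected[name]}"
            for name, total in stats.totals.items()
            if abs(total - expected[name]) > tolerance
        ]
    except Exception as err:  # a validator fault must not lose the outputs
        return [f"validation skipped: {err!r}"]


stats = SketchStats()
stats.add_library("lib_a", [1 / 3] * 4)
stats.finalize()
issues = validate_safely(stats, expected={"lib_a": 4 / 3})
broken = validate_safely(stats, expected={})  # the KeyError is caught
buffer = io.StringIO()
stats.write_output_logfile(buffer)
```

Even when validation itself fails, as in the `broken` call, the output is still written.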
Closes #295