Pipeline: tagged counting repurposed as classifier#241
taimontgomery merged 18 commits into master
Removing the accounting of the Class= attribute from ReferenceTables, along with usages of this output in the Features data class and the FeatureCounts output file class. Also correcting some erroneous uses of the StepVector type hint (these should have been GenomicArray). The Feature Class column has been removed from the output feature_counts.csv table, and the Tag column has been renamed to "Classifier".
…table format. The Feature Class column has been dropped and the Tag column has been renamed to Classifier. Also further improved code flexibility for tiny-deseq.r by not hardcoding character columns when calling write.csv
…Tag multiindex column as a classifier. Wow, this opened the door for some really satisfying simplifications to the class-related codebase. Backward compatibility is not offered with this commit. I've been thinking about how to reconcile this but I think ultimately that will be a bad idea. Number one, the counting semantics of the old Feature Class vs Tag columns are completely different; these inputs wouldn't be interchangeable. I've also removed the show_unknown option from scatter_dges() because we have yet to use this feature and we've never added the option to any user-facing config files
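To illustrate why the same feature can occupy more than one row once the classifier is part of the index, here is a minimal sketch that accumulates counts under a (Feature ID, Classifier) key. The feature IDs, class names, and counts are made up for illustration, and `tally` is a hypothetical helper, not tiny-count's actual code, which also applies selection rules that are not shown here.

```python
from collections import defaultdict

# Hypothetical rule matches: (feature ID, classifier of the matching rule, count).
matches = [
    ("WBGene0001", "miRNA", 10.0),
    ("WBGene0001", "siRNA", 2.5),
    ("WBGene0002", "miRNA", 4.0),
    ("WBGene0001", "miRNA", 1.0),
]

def tally(matches):
    """Accumulate counts under a (Feature ID, Classifier) key."""
    counts = defaultdict(float)
    for feature_id, classifier, n in matches:
        counts[(feature_id, classifier)] += n
    return dict(counts)

counts = tally(matches)
for (feature_id, classifier), total in sorted(counts.items()):
    print(feature_id, classifier, total)
```

Because the classifier is part of the key, WBGene0001 is reported once per classifier rather than once overall, which is one reason a table keyed this way is not interchangeable with one keyed by feature ID alone.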
--unassigned-class-label and --unknown-class-label. These options retain their previous default values of _UNASSIGNED_ and _UNKNOWN_. Class labels in class_charts are now sorted.
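As a rough sketch of how two such options could be declared with their stated defaults, assuming an argparse-based command line. The parser name, the `build_parser` helper, and the help strings are hypothetical; only the flag names and default values come from this PR.

```python
import argparse

def build_parser():
    # Hypothetical parser; only the flag names and defaults are from the PR.
    parser = argparse.ArgumentParser(prog="tiny-plot-sketch")
    parser.add_argument(
        "--unassigned-class-label",
        default="_UNASSIGNED_",
        help="label to use for the unassigned class (assumed wording)",
    )
    parser.add_argument(
        "--unknown-class-label",
        default="_UNKNOWN_",
        help="label to use for the unknown class (assumed wording)",
    )
    return parser

# With no arguments, the previous default labels are retained.
defaults = build_parser().parse_args([])

# A user-supplied label overrides one default and leaves the other untouched.
custom = build_parser().parse_args(["--unknown-class-label", "Unannotated"])
```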
…ed from the file requirements table. Description of the feature_counts.csv output has also been updated
…lassified" counting approach.
…. The count normalization section has been removed because it no longer applies with the new counting method. A link to the corresponding parameters has been added to the _UNKNOWN_ and _UNASSIGNED_ sections.
…ts with a Tag column. Previously, the user would have been notified that the "Classify as..." column was missing from their Features Sheet, which isn't quite as helpful.
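A startup check of this kind could look roughly like the following sketch, which reads only the header row of a Features Sheet and raises if the old column name is found. The function name, the exception type, and the wording of the message are assumptions for illustration and are not taken from the tinyRNA codebase.

```python
import csv
import io

def check_features_sheet_header(csv_text):
    """Return the header columns, or raise if the old Tag column is present."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    columns = [name.strip() for name in header]
    if "Tag" in columns:
        # Hypothetical message; the real error text and fix steps may differ.
        raise ValueError(
            "The Features Sheet contains a 'Tag' column. "
            "This column has been renamed to 'Classify as...'; "
            "update the column header and run again."
        )
    return columns
```

Checking for the old name directly lets the error point at the rename, rather than only reporting that the new column is missing.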
Since this PR introduces changes that are backward incompatible, I would like to make a release for the project in its current state before this one is merged. |
…eferenceTables.get(). The get() function is significantly shorter after the recent changes for GFF validation, so it can accommodate the finalization routine without becoming too crowded.
# Conflicts: # README.md # tests/unit_tests_counter.py
With this new, much improved approach to classification, won't the class and rule plots always be the same? And thus can we get rid of the rules plots? Perhaps also change counts_by_rule.csv to counts_by_classification.csv, changing the Rule String column to Classification? |
No, class and rule plots will differ if any rules share a |
I see. In that case, perhaps we can add a counts_by_classification.csv table at some point. |
Tested successfully with ram1 data. |
The `Tag` column has been renamed to `Classify as...` and will be used to apply a user-defined class to features that match the rule. The `Class=` attribute is no longer used to determine a feature's class. Tagged counting semantics still apply.

The counts table produced by tiny-count therefore now has a multiindex of (Feature ID, Classifier). Backward compatibility is not offered for counts tables produced by an earlier version of tinyRNA. The Features Sheet is checked for the presence of a `Tag` column at pipeline/tiny-count startup and, if present, an error is produced along with steps to fix it.

These changes opened the door for some very satisfying improvements to the code quality in plotter.py. Two additional parameters have been added to the pipeline/tiny-plot:

… `Classify as...` value. This is used in class_charts and scatter_dge_class.

Closes #240