
Pipeline: tagged counting repurposed as classifier #241

Merged
taimontgomery merged 18 commits into master from issue-240
Oct 27, 2022

@AlexTate
Member

@AlexTate AlexTate commented Oct 18, 2022

The Tag column has been renamed to Classify as... and will be used to apply a user-defined class to features that match the rule. The Class= attribute is no longer used to determine a feature's class. Tagged counting semantics still apply.

The counts table produced by tiny-count therefore now has a multiindex of (Feature ID, Classifier). Backward compatibility is not offered for counts tables produced by an earlier version of tinyRNA. The Features Sheet is checked for the presence of a Tag column at pipeline/tiny-count startup and, if present, an error is produced along with steps to fix it.
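The startup check described above could look roughly like the following. This is a minimal sketch, not tinyRNA's actual code: the function name `check_for_legacy_tag_column`, the exception type, the error wording, and the suggested fix are all assumptions made for illustration. Only the column names `Tag` and `Classify as...` come from this PR.

```python
import csv
import io


def check_for_legacy_tag_column(features_csv):
    """Raise ValueError if the Features Sheet header still has a 'Tag' column.

    Hypothetical helper: the name, exception type, and message are
    illustrative assumptions, not the pipeline's real implementation.
    """
    # Only the header row is needed for this check
    header = next(csv.reader(features_csv), [])
    columns = [col.strip() for col in header]
    if "Tag" in columns:
        raise ValueError(
            "Features Sheet contains a 'Tag' column. "
            "Rename it to 'Classify as...' and rerun."
        )
    return columns


# Made-up sheets, used only to exercise the check
old_sheet = io.StringIO("Select for...,with value...,Tag\nClass,miRNA,mi\n")
new_sheet = io.StringIO("Select for...,with value...,Classify as...\nClass,miRNA,mi\n")

print(check_for_legacy_tag_column(new_sheet))
try:
    check_for_legacy_tag_column(old_sheet)
except ValueError as err:
    print(f"Error: {err}")
```

Failing at startup, before any counting begins, means a user with an older Features Sheet sees the fix instructions immediately rather than a less specific error later in the run.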

These changes opened the door for some very satisfying improvements to the code quality in plotter.py. Two additional parameters have been added to the pipeline/tiny-plot:

  • --unassigned-class: the label to use for unassigned counts in class_charts
  • --unknown-class: the label to use for counts assigned by rules lacking a Classify as... value. This is used in class_charts and scatter_dge_class.
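As an illustration of how the two options might be exposed, here is a small argparse sketch. It is an assumption-laden example rather than tiny-plot's real parser: the `build_parser` function, the `prog` name, and the help strings are hypothetical. The flag names follow the PR description (one commit message in this PR refers to them with a `-label` suffix), and the defaults `_UNASSIGNED_` and `_UNKNOWN_` are the values that commit message says were retained.

```python
import argparse


def build_parser():
    # Hypothetical parser: only the two flag names and the default labels
    # are taken from this PR; everything else is illustrative.
    parser = argparse.ArgumentParser(prog="tiny-plot-sketch")
    parser.add_argument(
        "--unassigned-class",
        default="_UNASSIGNED_",
        help="label for unassigned counts in class_charts",
    )
    parser.add_argument(
        "--unknown-class",
        default="_UNKNOWN_",
        help="label for counts from rules lacking a 'Classify as...' value",
    )
    return parser


# Override one label and leave the other at its default
args = build_parser().parse_args(["--unknown-class", "Other"])
print(args.unassigned_class, args.unknown_class)
```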

Closes #240

Removing the accounting of the Class= attribute from ReferenceTables, usages of this output in the Features data class, and the FeatureCounts output file class. Also correcting some erroneous use of the StepVector typehint (this should have been GenomicArray).

The Feature Class column has been removed from the output feature_counts.csv table, and the Tag column has been renamed to Classifier.
…table format. The Feature Class column has been dropped and the Tag column has been renamed to Classifier.

Also further improved code flexibility for tiny-deseq.r by not hardcoding character columns when calling write.csv
…Tag multiindex column as a classifier. Wow, this opened the door for some really satisfying simplifications to the class-related codebase.

Backward compatibility is not offered with this commit. I've been thinking about how to reconcile this, but I think it would ultimately be a bad idea. For one, the counting semantics of the old Feature Class and Tag columns are completely different, so these inputs wouldn't be interchangeable.

I've also removed the show_unknown option from scatter_dges() because we have yet to use this feature and have never added the option to any user-facing config files.
--unassigned-class-label and --unknown-class-label. These options retain their previous default values of _UNASSIGNED_ and _UNKNOWN_.

Class labels in class_charts are now sorted.
…ed from the file requirements table.

The description of the feature_counts.csv output has also been updated.
…. The count normalization section has been removed because it no longer applies with the new counting method. A link to the corresponding parameters has been added to the _UNKNOWN_ and _UNASSIGNED_ sections.
…ts with a Tag column. Previously, the user would have been notified that the "Classify as..." column was missing from their Features Sheet, which isn't quite as helpful.
@AlexTate AlexTate marked this pull request as draft October 19, 2022 19:29
@AlexTate
Member Author

Since this PR introduces changes that are backward incompatible, I would like to make a release for the project in its current state before this one is merged.

…eferenceTables.get(). The get() function is significantly shorter after the recent changes for GFF validation, so it can accommodate the finalization routine without becoming too crowded.
# Conflicts:
#	README.md
#	tests/unit_tests_counter.py
@AlexTate AlexTate marked this pull request as ready for review October 22, 2022 21:20
@taimontgomery
Collaborator

With this new, much improved approach to classification, won't the class and rule plots always be the same? And thus can we get rid of the rules plots? Perhaps also change counts_by_rule.csv to counts_by_classification.csv, changing the Rule String column to Classification?

@AlexTate
Member Author

No, class and rule plots will differ if any rules share a Classify as... value. In that case, rule plots show how much each rule contributed to the pooled classes. For this reason I think the proposed changes to output files would be incorrect.
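To make the distinction concrete, here is a small sketch of how per-rule counts collapse into per-class counts when two rules share a Classify as... value. The data, the `pool_by_class` function, and its `unknown_label` parameter are made up for illustration and are not taken from plotter.py.

```python
from collections import defaultdict

# Made-up example rows: (rule index, "Classify as..." value, assigned count)
rule_counts = [
    (0, "miRNA", 120.0),
    (1, "siRNA", 40.0),
    (2, "miRNA", 30.0),  # shares a classifier with rule 0
    (3, "", 10.0),       # rule without a "Classify as..." value
]


def pool_by_class(rule_counts, unknown_label="_UNKNOWN_"):
    """Sum per-rule counts into per-class counts, sorted by class label."""
    pooled = defaultdict(float)
    for _rule, classifier, count in rule_counts:
        # Rules with an empty classifier fall under the unknown label
        pooled[classifier or unknown_label] += count
    return dict(sorted(pooled.items()))


print(pool_by_class(rule_counts))
# {'_UNKNOWN_': 10.0, 'miRNA': 150.0, 'siRNA': 40.0}
```

The four rule-level rows reduce to three classes, and the miRNA total no longer shows that rule 0 contributed 120 and rule 2 contributed 30. That per-rule breakdown is the information the rule plots keep.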

@taimontgomery
Collaborator

I see. In that case, perhaps we can add a counts_by_classification.csv table at some point.

@taimontgomery
Collaborator

Tested successfully with ram1 data.

@taimontgomery taimontgomery merged commit 139ebc1 into master Oct 27, 2022


Development

Successfully merging this pull request may close these issues.

Pipeline: changes to tag/class counting

2 participants