tiny-count: new selector: Mismatches #298
Merged
taimontgomery merged 21 commits into master on Apr 7, 2023
… SamSqValidator class has been updated accordingly. SAM header validation is delegated to pysam's AlignmentHeader.to_dict() method. When the user elects to have decollapsed SAM files produced, alignments are buffered as AlignedSegment objects because they use about 50% of the memory that the string representation does, and much less than the previous tuple representation. On the other hand, using pysam to parse alignments is 3.5x slower in _parse_alignments() and increases overall runtime by about 20%. This might pay off in the form of BAM file support, but after looking over pysam's codebase I don't think there is much I can do to speed this up without writing our own parser in Cython. Ultimately it might be worthwhile to resurrect the old code and have two separate parsing functions, one for SAM and one for BAM
…_alignments(). It's better but not quite enough. Need to move this function into Cython space for any further gains
…edicated Cython extension class. The extension acts as an iterable over pysam alignments that have been converted to dictionary form on the fly. Importantly, it allows us to eliminate nearly all calls to the Python-space API for pysam and instead retrieve information directly from Cython space. It is much faster. The runtime cost of switching to pysam is now a mere 5% slowdown (down from 20-30% initially). In the spirit of minimizing Cython footprint due to debugging complications, the extension class still accumulates alignments for decollapsing, but uses a callback method from SAM_reader's Python space to actually write the decollapsed alignments
…and the Features Sheet. An additional class has not been added to matching.py for this selector because it is most efficiently achieved using the existing NumericalMatch class. The Mismatch selector is embedded in the feature's match tuples alongside the Overlap selector. It is unpacked from the match tuple and evaluated in Stage 2 selection.
…type annotations to the AlignmentIter constructor
- QNAME splitting with the collapser token is slightly more reliable
- Updating type hints
…d to make sure it has a read sequence. If it doesn't, it's an error
…necessary to use SAM_reader here)
… column. Column order now matches current selection diagram.
…te strings are now just strings)
Tested with ram1 data, with 0, 1, or 3 mutations introduced into the genome FASTA, using the Mismatch selector.
This PR introduces a new selector for tiny-count: Mismatches. It is used for placing constraints on the edit distance between an alignment and the reference, and it is evaluated in Stage 2 after the Overlap selector. Users can specify ranges, lists, wildcards, and single values in this column.
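To illustrate the kinds of values the Mismatches column accepts, here is a minimal sketch of how such a definition could be turned into a membership test on edit distance. It is not the PR's implementation (which reuses the existing NumericalMatch class). The function names and the exact spellings assumed here (`*` for a wildcard, `a-b` for an inclusive range, commas for lists) are hypothetical and may differ from tiny-count's actual syntax.

```python
from typing import FrozenSet, Optional


def parse_mismatch_selector(spec: str) -> Optional[FrozenSet[int]]:
    """Hypothetical parser: returns the allowed edit distances,
    or None when the definition is a wildcard (accept any value)."""
    spec = spec.strip()
    if spec in ("", "*"):  # assumed wildcard spellings
        return None

    allowed = set()
    for part in spec.split(","):  # assumed list separator
        part = part.strip()
        if "-" in part:  # assumed inclusive range syntax, e.g. "0-2"
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"Invalid range: {part}")
            allowed.update(range(lo, hi + 1))
        else:  # single value
            allowed.add(int(part))
    return frozenset(allowed)


def passes_mismatch(edit_distance: int, allowed: Optional[FrozenSet[int]]) -> bool:
    """True if the alignment's edit distance satisfies the selector."""
    return allowed is None or edit_distance in allowed
```

Under these assumptions, a definition of `0-2` would accept alignments with at most two mismatches, while `*` would leave edit distance unconstrained.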
Edit distance is determined from:
The former function for producing alignment dictionaries, SAM_reader._parse_alignments(), has been converted to a standalone Cython class which utilizes pysam's Cython API. As a result, runtimes appear to be only marginally affected (~4-5% slower), compared with the 20-30% slowdown measured while using pysam's Python API. This dedicated class is also responsible for accumulating alignments for decollapsed outputs, but delegates all other decollapsing responsibility to the Python-space SAM_reader class. I've made an effort to minimize the Cython surface area due to its complications with debugging.
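The division of labor described above can be sketched in plain Python. This is only an illustration of the pattern (an iterable that yields alignment dictionaries, buffers the raw records, and hands them to a caller-supplied callback for writing). The real class is a Cython extension built on pysam; the class name, dictionary keys, record layout, and `buffer_limit` parameter below are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Tuple[str, str, int]  # hypothetical: (read name, sequence, edit distance)


class AlignmentIterSketch:
    """Iterates over raw records, yielding each one as a dictionary.

    When a callback is supplied, raw records are also accumulated and
    periodically handed to it, so the writing logic stays with the caller.
    """

    def __init__(self, records: Iterable[Record],
                 decollapse_callback: Optional[Callable[[List[Record]], None]] = None,
                 buffer_limit: int = 3):
        self._records = iter(records)
        self._callback = decollapse_callback
        self._buffer: List[Record] = []
        self._limit = buffer_limit

    def __iter__(self):
        return self

    def __next__(self) -> Dict[str, object]:
        try:
            rec = next(self._records)
        except StopIteration:
            self._flush()  # hand over whatever is left before finishing
            raise
        if self._callback is not None:
            self._buffer.append(rec)
            if len(self._buffer) >= self._limit:
                self._flush()
        return {"Name": rec[0], "Seq": rec[1], "NM": rec[2]}

    def _flush(self) -> None:
        if self._callback is not None and self._buffer:
            self._callback(self._buffer)
            self._buffer = []  # start a fresh list; the callback keeps the old one
```

Keeping the callback in the caller's space mirrors the stated goal of a small Cython surface area: the iterator only accumulates, and everything else about decollapsing can be debugged as ordinary Python.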
Additionally:
Closes #296