
tiny-count: new selector: Mismatches #298

Merged
taimontgomery merged 21 commits into master from issue-296 on Apr 7, 2023

AlexTate (Member) commented Apr 5, 2023

This PR introduces a new selector for tiny-count: Mismatches. It is used for placing constraints on the edit distance between an alignment and the reference, and it is evaluated in Stage 2 after the Overlap selector. Users can specify ranges, lists, wildcards, and single values in this column.
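To illustrate how the four kinds of definitions might be interpreted, here is a minimal sketch. The exact cell syntax is an assumption on my part (`*` for wildcard, `a-b` for an inclusive range, commas for lists), and `parse_mismatch_selector` is a hypothetical helper, not the code in this PR, which reuses the existing NumericalMatch class.

```python
from typing import Callable


def parse_mismatch_selector(definition: str) -> Callable[[int], bool]:
    """Build a predicate over edit distance from a Mismatches cell value.

    Assumed (hypothetical) syntax: "*" wildcard, "2" single value,
    "0-2" inclusive range, "1,3" list. Lists may contain ranges.
    """
    text = definition.strip()
    if text == "*":
        # Wildcard: no constraint on edit distance
        return lambda edit_distance: True

    allowed = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(bound) for bound in part.split("-"))
            if low > high:
                raise ValueError(f"Invalid range: {part}")
            allowed.update(range(low, high + 1))
        else:
            allowed.add(int(part))

    return lambda edit_distance: edit_distance in allowed


# Usage: keep only alignments with at most two mismatches
accepts = parse_mismatch_selector("0-2")
kept = [nm for nm in (0, 1, 3, 5) if accepts(nm)]
```

Precomputing the allowed set keeps the per-alignment check to a single membership test, which matters when it runs once per alignment in Stage 2.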

Edit distance is determined from:

  • The NM tag, if present
  • The CIGAR string (I, D, and X operations) if the NM tag is not present in the first alignment
  • If the NM tag is present in the first alignment but missing from a subsequent alignment, then the subsequent alignment's edit distance assumes a default value of zero
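The rules above can be sketched in plain Python. This is an illustration only: the actual implementation lives in the Cython class and reads these fields through pysam, whereas here `tags` is a plain dict and `cigar` is a string, and the names `EditDistanceResolver` and `cigar_edit_distance` are hypothetical. The case where NM is absent from the first alignment but present later is not covered by the description; this sketch simply keeps using the CIGAR string.

```python
import re
from typing import Optional

# One CIGAR operation: a length followed by an operation code
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_edit_distance(cigar: str) -> int:
    """Sum the lengths of I, D, and X operations in a CIGAR string."""
    return sum(int(length) for length, op in CIGAR_OP.findall(cigar) if op in "IDX")


class EditDistanceResolver:
    """Decides between NM and CIGAR based on the first alignment seen."""

    def __init__(self) -> None:
        self.first_had_nm: Optional[bool] = None

    def resolve(self, tags: dict, cigar: str) -> int:
        if self.first_had_nm is None:
            # The first alignment determines the strategy for the whole file
            self.first_had_nm = "NM" in tags
        if self.first_had_nm:
            # NM expected; a later alignment missing the tag defaults to zero
            return tags.get("NM", 0)
        return cigar_edit_distance(cigar)


# Usage: first alignment has NM, so the second (missing NM) defaults to 0
resolver = EditDistanceResolver()
distances = [resolver.resolve({"NM": 2}, "20M"), resolver.resolve({}, "10M1X9M")]
```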

The former function for producing alignment dictionaries, SAM_reader._parse_alignments(), has been converted to a standalone Cython class which uses pysam's Cython API. As a result, runtimes are only slightly affected (~4-5% slower), compared with the 20-30% slowdown measured while using pysam's Python API. This dedicated class is also responsible for accumulating alignments for decollapsed outputs, but delegates all other decollapsing responsibility to the Python-space SAM_reader class. I've made an effort to minimize the Cython surface area because Cython complicates debugging.
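The division of labor between the iterator and SAM_reader can be sketched in pure Python. Everything here is a stand-in: `AlignmentIterSketch`, the tuple records, the dictionary keys, and `buffer_size` are hypothetical, and the real class is a Cython extension over pysam alignments. The point is only the shape: convert each record to a dictionary on the fly, buffer the raw records, and hand them to a Python-space callback for writing.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str, int]  # (query name, sequence, 0-based start); stand-in for a pysam record


class AlignmentIterSketch:
    """Iterates over records, yielding dictionaries and buffering raw records."""

    def __init__(
        self,
        records: Iterable[Record],
        decollapse_callback: Optional[Callable[[List[Record]], None]] = None,
        buffer_size: int = 2,
    ) -> None:
        self.records = records
        self.callback = decollapse_callback
        self.buffer_size = buffer_size
        self.buffer: List[Record] = []

    def _flush(self) -> None:
        # Writing is delegated entirely to the caller's callback
        self.callback(self.buffer)
        self.buffer = []

    def __iter__(self) -> Iterator[dict]:
        for name, seq, start in self.records:
            if self.callback is not None:
                self.buffer.append((name, seq, start))
                if len(self.buffer) >= self.buffer_size:
                    self._flush()
            yield {"Name": name, "Seq": seq, "Start": start, "End": start + len(seq)}
        # Hand over whatever remains once the input is exhausted
        if self.callback is not None and self.buffer:
            self._flush()


# Usage: the "writer" here just collects the raw records it is handed
written: List[Record] = []
records = [("r1", "ACGT", 0), ("r2", "GGCC", 5), ("r3", "TTAA", 9)]
alignments = list(AlignmentIterSketch(records, decollapse_callback=written.extend))
```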

Additionally:

  • The first alignment in each alignment file is checked to make sure that it contains a read sequence (error otherwise)
  • If the user elects to have decollapsed outputs produced, alignments are accumulated as pysam's AlignedSegment objects rather than alignment dictionaries because they have a significantly smaller memory footprint
  • The column order of the Features Sheet (specifically the Overlap column) has been updated to match the order shown in the selection diagram

Closes #296

AlexTate added 21 commits March 23, 2023 18:54
… SamSqValidator class has been updated accordingly.

SAM header validation is delegated to pysam's AlignmentHeader.to_dict() method.

When the user elects to have decollapsed SAM files produced, alignments are buffered as AlignedSegment objects because they use about 50% of the memory that the string representation does, and much less than the previous tuple representation.

On the other hand, using pysam to parse alignments is 3.5x slower in _parse_alignments() and results in an overall runtime increase of about 20%. This might pay off in the form of BAM file support, but after looking over pysam's codebase I don't think there's a lot I can do to speed this up without writing our own parser in Cython. Ultimately it might be worthwhile to resurrect the old code and just have two separate parsing functions for SAM and BAM.
…_alignments(). It's better but not quite enough. Need to move this function into Cython space for any further gains.
…edicated Cython extension class. The extension acts as an iterable over pysam alignments that have been converted to dictionary form on the fly. Importantly, it allows us to eliminate nearly all calls to the Python-space API for pysam and instead retrieve information directly from Cython space.

It is much faster. The runtime cost of switching to pysam is now a mere 5% slowdown (down from 20-30% initially).

In the spirit of minimizing the Cython footprint due to debugging complications, the extension class still accumulates alignments for decollapsing, but uses a callback method from SAM_reader's Python space to actually write the decollapsed alignments.
…and the Features Sheet. An additional class has not been added to matching.py for this selector because it is most efficiently achieved using the existing NumericalMatch class.

The Mismatch selector is embedded in the feature's match tuples alongside the Overlap selector. It is unpacked from the match tuple and evaluated in Stage 2 selection.
…type annotations to the AlignmentIter constructor
- QNAME splitting with the collapser token is slightly more reliable
- Updating type hints
…d to make sure it has a read sequence. If it doesn't, it's an error.
… column. Column order now matches current selection diagram.
taimontgomery (Collaborator) commented

Tested with ram1 data, introducing 0, 1, or 3 mutations into the genome FASTA and using the Mismatches selector.

@taimontgomery taimontgomery merged commit e4caf29 into master Apr 7, 2023

Linked issue: tiny-count: new selector for filtering alignments by edit distance