Potential idea: add PIF format with CIGAR-less features #5224
Whole-genome overviews can be slow because large amounts of CIGAR data and other tags slow down the data fetching and display code.
This PR proposes having make-pif write separate "prefix areas" of the PIF file in which the CIGAR string has been stripped.
Background
The PIF format is a special tabix file with the data sorted by both query and target coordinates. It already has two "prefix areas" of the tabix file:
prefix q: query by query genome coordinates
prefix t: query by target genome coordinates
This PR
This PR adds two new prefixes:
prefix a: query by query genome coordinates, without CIGAR
prefix b: query by target genome coordinates, without CIGAR
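As a rough illustration of the four prefix areas, here is a minimal Python sketch that turns one PAF line into q, t, a, and b rows. It is not the actual make-pif code. The function names are hypothetical, and the output column order (indexed genome first) is an assumption. It also leaves the CIGAR untouched in the swapped rows, which a real implementation may need to handle differently. The only facts it relies on are the standard PAF layout (12 mandatory columns followed by tags) and the `cg:Z:` tag that carries the CIGAR.

```python
def strip_cigar(fields):
    """Drop the cg:Z: (CIGAR) tag, keeping the 12 mandatory PAF columns and other tags."""
    return [f for i, f in enumerate(fields) if i < 12 or not f.startswith("cg:Z:")]


def pif_rows(paf_line):
    """Return the q, t, a and b rows for one PAF line (hypothetical layout)."""
    fields = paf_line.rstrip("\n").split("\t")
    qname, qlen, qstart, qend, strand, tname, tlen, tstart, tend = fields[:9]
    rest = fields[9:]

    # Assumption: each row puts the genome it is indexed by in the first columns.
    query_first = [qname, qlen, qstart, qend, strand, tname, tlen, tstart, tend] + rest
    target_first = [tname, tlen, tstart, tend, strand, qname, qlen, qstart, qend] + rest

    def row(prefix, cols):
        # The prefix is glued onto the reference name so tabix sees separate "areas".
        return "\t".join([prefix + cols[0]] + cols[1:])

    return [
        row("q", query_first),                # query coordinates, with CIGAR
        row("t", target_first),               # target coordinates, with CIGAR
        row("a", strip_cigar(query_first)),   # query coordinates, without CIGAR
        row("b", strip_cigar(target_first)),  # target coordinates, without CIGAR
    ]
```

The rows would still need to be sorted, bgzipped, and tabix-indexed afterwards, as with the existing q and t rows.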
The hope is fast whole-genome overviews, plotted without CIGAR, that stay relatively faithful to the data and can be zoomed in to show CIGAR at arbitrary zoom levels.
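To make the zooming idea concrete, here is a small sketch of how a reader might choose which prefix area to query. The function names, the `bp_per_px` parameter, and the cutoff value are all hypothetical and are not part of this PR. The region string just follows the usual tabix `name:start-end` form, with the prefix prepended to the reference name.

```python
def pick_prefix(index_by_query, bp_per_px, cigar_cutoff_bp_per_px=100.0):
    """Pick a PIF prefix area; the cutoff is an arbitrary illustrative value."""
    zoomed_out = bp_per_px > cigar_cutoff_bp_per_px
    if zoomed_out:
        # CIGAR-less areas: less data to fetch and parse for overviews.
        return "a" if index_by_query else "b"
    # Zoomed in: fetch the rows that still carry the CIGAR.
    return "q" if index_by_query else "t"


def tabix_region(prefix, ref_name, start, end):
    """Build a tabix-style region string against the prefixed reference name."""
    return f"{prefix}{ref_name}:{start}-{end}"
```

For example, a whole-chromosome view indexed by target coordinates would use `tabix_region(pick_prefix(False, 50000.0), "chr1", 1, 248000000)`, which queries the `b` area.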
Alternatives
An alternative would be a "reduced CIGAR" that preserves only the insertions and deletions that are large relative to your zoom level. This is sort of hard to do because each CIGAR operation maps to specific coordinates, so producing the reduced form means iterating through the whole string.
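A minimal sketch of what such a reduction could look like, assuming the CIGAR contains only M, =, X, I, and D operations. The function name and the `min_indel` threshold are hypothetical. Small deletions are folded into the surrounding match run and small insertions are dropped, so the target span is preserved but the query span becomes approximate. The loop still has to visit every operation, which is the cost described above.

```python
import re

# Only M, =, X, I and D are handled; this is an assumption of the sketch.
CIGAR_OP = re.compile(r"(\d+)([MIDX=])")


def reduce_cigar(cigar, min_indel):
    """Keep indels of at least min_indel bases and merge everything else into M runs."""
    out = []
    run = 0  # target bases accumulated in the current merged match run
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "ID" and n >= min_indel:
            # Large indel: flush the pending run, then keep the operation as-is.
            if run:
                out.append(f"{run}M")
                run = 0
            out.append(f"{n}{op}")
        elif op != "I":
            # M, =, X and small D all consume target bases, so they extend the run.
            run += n
        # Small insertions consume only query bases and are dropped here.
    if run:
        out.append(f"{run}M")
    return "".join(out)
```

With `min_indel=1000`, `reduce_cigar("100M2D50M1I30M5000D200M", 1000)` gives `"182M5000D200M"`: the target span (5382 bases) is unchanged, while the one-base insertion is lost from the query span.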
I think tracepoints (see Gene Myers' blog post, https://dazzlerblog.wordpress.com/2015/11/05/trace-points/) could potentially be another alternative, but I don't fully understand them yet. Some pangenome people have been building tooling around tracepoints: https://github.com/AndreaGuarracino/lib_tracepoints