@cmdcolin cmdcolin commented Nov 15, 2025

Whole-genome overviews can be slow because large amounts of CIGAR data and other tags slow down both the data fetching and the display code.

This proposes having make-pif write separate "prefix areas" of the PIF format with the CIGAR string stripped.

Background

The PIF format is a special tabix file with the data sorted by both query and target coordinates. It already has two "prefix areas" of the tabix file:

prefix q: query by query genome coordinates
prefix t: query by target genome coordinates

This PR

This PR adds two new prefixes:

prefix a: query by query genome coordinates, without CIGAR
prefix b: query by target genome coordinates, without CIGAR
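
As a rough illustration of what the CIGAR-less lines could look like, here is a minimal sketch that takes one PAF record, drops its `cg:Z:` tag, and emits an `a` line keyed by the query name and a `b` line keyed by the target name. The function names (`stripCigar`, `cigarlessLines`) are hypothetical, and the column swap for the target-keyed line is an assumption about the layout, not necessarily what make-pif does.

```typescript
// Hypothetical helper: drop the PAF cg:Z: (CIGAR) tag from the optional tag fields.
function stripCigar(tags: string[]): string[] {
  return tags.filter(tag => !tag.startsWith('cg:Z:'))
}

// Sketch: build the two proposed CIGAR-less lines for one PAF record.
// Assumes the standard 12 mandatory PAF columns followed by optional tags.
function cigarlessLines(pafLine: string): string[] {
  const [qname, qlen, qstart, qend, strand, tname, tlen, tstart, tend, ...rest] =
    pafLine.split('\t')
  const fixed = rest.slice(0, 3) // residue matches, alignment block length, mapq
  const tags = stripCigar(rest.slice(3))

  // "a": keyed by query coordinates, original column order
  const a = [`a${qname}`, qlen, qstart, qend, strand, tname, tlen, tstart, tend, ...fixed, ...tags]
  // "b": keyed by target coordinates, so the target columns are moved to the front (assumed layout)
  const b = [`b${tname}`, tlen, tstart, tend, strand, qname, qlen, qstart, qend, ...fixed, ...tags]

  return [a.join('\t'), b.join('\t')]
}
```

Because the prefix is part of the first column, a tabix query for a prefixed name such as `achr1` would only touch the CIGAR-less lines.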

The hope is fast whole-genome overviews, plotted without CIGAR, that are relatively faithful to the data and can be zoomed in to show CIGAR at arbitrary zoom levels.

Alternatives

An alternative idea would be to use a "reduced CIGAR" that preserves only the deletions and insertions that are large relative to your zoom level, but this is sort of hard to do because each CIGAR operation maps to specific coordinates, so you have to iterate through the whole string to reduce it.
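
To make the difficulty concrete, here is a minimal sketch of one possible reduction. It is hypothetical and not part of this PR: `reduceCigar` and its `minIndel` threshold are made-up names, and the rule of folding small deletions into matches while dropping small insertions is just one assumption about how a reduction could work.

```typescript
type CigarOp = { len: number; op: string }

// Parse a CIGAR string such as "100M2D50M" into a list of operations.
function parseCigar(cigar: string): CigarOp[] {
  const ops: CigarOp[] = []
  const re = /(\d+)([MIDNSHP=X])/g
  let m: RegExpExecArray | null
  while ((m = re.exec(cigar)) !== null) {
    ops.push({ len: Number(m[1]), op: m[2] ?? '' })
  }
  return ops
}

// Hypothetical reduction: keep indels of at least minIndel bases and merge everything else.
// Every operation still has to be visited to keep the coordinates consistent.
function reduceCigar(cigar: string, minIndel: number): string {
  const out: CigarOp[] = []
  const push = (len: number, op: string) => {
    const last = out[out.length - 1]
    if (last && last.op === op) {
      last.len += len // merge with the previous run of the same operation
    } else {
      out.push({ len, op })
    }
  }
  for (const { len, op } of parseCigar(cigar)) {
    if (op === 'I' || op === 'D') {
      if (len >= minIndel) {
        push(len, op) // large indel: preserved as-is
      } else if (op === 'D') {
        push(len, 'M') // small deletion: folded into a match so the target span is unchanged
      }
      // small insertion: dropped, since it consumes no target bases
    } else if (op === 'M' || op === '=' || op === 'X') {
      push(len, 'M')
    } else {
      push(len, op) // clips, skips, padding: passed through
    }
  }
  return out.map(o => `${o.len}${o.op}`).join('')
}
```

For example, `reduceCigar('100M2D50M1I30M5000D200M', 1000)` gives `182M5000D200M`. The target span is still 5382 bases, but the query span changes from 381 to 382, which is the kind of coordinate drift that makes this approach awkward.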

I think the concept of tracepoints (Gene Myers' blog: https://dazzlerblog.wordpress.com/2015/11/05/trace-points/) could potentially be an alternative, but I don't fully understand them yet. Some pangenome people have been making tooling around tracepoints: https://github.com/AndreaGuarracino/lib_tracepoints

@cmdcolin changed the title from "Potential idea: add PIF format with CIGAR-less prefixes" to "Potential idea: add PIF format with CIGAR-less features" on Nov 15, 2025