Track regularity through distance index #4857

Merged: adamnovak merged 16 commits into hublabel from hublabel-debug on Mar 26, 2026

Conversation

@adamnovak
Member

This should hopefully help @electricEpilith with the minimizer indexing from hub labeling distance indexes.

Instead of traversing the net graph or hub labeling distance data to decide whether a snarl is regular while creating the zip codes during minimizer indexing, we now track the (strict, so no internal reversals) notion of regularity as we build up the distance index.
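
To illustrate the idea of recording regularity at build time rather than recomputing it later, here is a minimal C++ sketch. Everything in it is hypothetical: `ChildSummary`, `SnarlRecord`, `add_child`, and `build_record` are not vg types or functions, and the regularity test used (every child reaches both boundaries, no child touches a sibling, no child allows a turnaround) is an assumed simplification, not the exact rule from the distance index.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT vg's actual types.
struct ChildSummary {
    bool reaches_start;    // child connects to the snarl's start boundary
    bool reaches_end;      // child connects to the snarl's end boundary
    bool touches_sibling;  // child has an edge to another child of the snarl
    bool can_reverse;      // a traversal can turn around at this child
};

struct SnarlRecord {
    std::size_t child_count = 0;
    bool is_regular = true;  // stored with the record while building
};

// Fold one child into the record. Once any child violates the (assumed)
// strict regularity rule, the flag stays false for the whole snarl.
void add_child(SnarlRecord& record, const ChildSummary& child) {
    ++record.child_count;
    if (!child.reaches_start || !child.reaches_end ||
        child.touches_sibling || child.can_reverse) {
        record.is_regular = false;
    }
}

// Build a record in a single pass over the children, so that later
// consumers can read is_regular without traversing the snarl again.
SnarlRecord build_record(const std::vector<ChildSummary>& children) {
    SnarlRecord record;
    for (const ChildSummary& child : children) {
        add_child(record, child);
    }
    return record;
}
```

With this layout, a later lookup of `is_regular` is a plain field read, which is the kind of saving the change is aiming for during zip code construction.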

This includes a distance index version bump.

I'm PRing this against the hublabel branch.

@adamnovak
Member Author

Right now this passes all the unit tests for the distance index:

Filters: [snarl_distance]
===============================================================================
All tests passed (1565997 assertions in 55 test cases)

I'm not sure it really covers all the edge cases for regular snarls, like turnarounds in the middle of chains.

@adamnovak
Member Author

Looks like I'm introducing a bunch of test failures in the "Simple chain zipcode" and "Nested snarl zipcode" unit tests. Also, @electricEpilith noted that this doesn't actually let minimizer indexing complete in a timely fashion after rebuilding the distance index. I'll have to fix this.

electricEpilith and others added 6 commits March 23, 2026 18:28
cache_payloads was single-threaded despite the -t flag; with 164M nodes
on an HPRC graph it hung for hours. Two fixes:

1. Pass `true` to for_each_handle to enable OpenMP parallelism; guard
   the non-thread-safe writes (oversized_zipcodes vector and
   node_id_to_payload map) with named omp critical sections.

2. Call distance_index->preload(true) immediately before cache_payloads
   in build_minimizer_index. find_frequent_kmers runs for ~3300 s before
   this point and evicts the mmap'd index pages, causing a page fault on
   every snarl-tree lookup in fill_in_zipcode_from_pos. Reloading here
   ensures the index is warm when the parallel loop starts.
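
The locking pattern from fix 1 can be sketched as follows. This is an illustration only: it uses a plain OpenMP `parallel for` over integer IDs instead of `for_each_handle`, and `PayloadCache`, `cache_payloads_sketch`, the fake zipcode strings, and the "every 1000th node is oversized" rule are all made up for the example. Only the two container names and the use of named `omp critical` sections come from the commit message.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical container pair mirroring the two shared structures named above.
struct PayloadCache {
    std::vector<std::string> oversized_zipcodes;
    std::unordered_map<std::int64_t, std::size_t> node_id_to_payload;
};

PayloadCache cache_payloads_sketch(std::int64_t node_count) {
    PayloadCache cache;

    // Per-node work runs in parallel; only the shared writes are serialized.
    #pragma omp parallel for schedule(dynamic, 1024)
    for (std::int64_t id = 1; id <= node_count; ++id) {
        // Thread-local work (stand-in for building a zipcode and payload).
        std::string zipcode = "zip" + std::to_string(id);
        std::size_t payload = static_cast<std::size_t>(id) * 2;
        bool oversized = (id % 1000 == 0);  // assumed rule, for illustration

        if (oversized) {
            // Named section: only contends with other writers of this vector.
            #pragma omp critical (oversized_zipcodes_write)
            {
                cache.oversized_zipcodes.push_back(std::move(zipcode));
            }
        }

        // Separate name, so map writers do not block vector writers.
        #pragma omp critical (node_id_to_payload_write)
        {
            cache.node_id_to_payload[id] = payload;
        }
    }
    return cache;
}
```

Giving the two critical sections different names lets a thread append to the vector while another inserts into the map. Compiled without OpenMP support, the pragmas are ignored and the loop simply runs serially.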

Also add a depth guard (abort at >10000) in fill_in_zipcode_from_pos to
catch any future infinite loops in the snarl tree traversal.
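
A depth guard of this kind might look like the sketch below. It is not the code from `fill_in_zipcode_from_pos`: the tree is a hypothetical parent-index array in which the root is its own parent, `ancestors_with_guard` and `MAX_DEPTH` are invented names, and the sketch throws `std::runtime_error` where the commit aborts, purely so the failure path is easy to exercise.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Same limit as described in the commit message.
constexpr std::size_t MAX_DEPTH = 10000;

// Walk from `start` up to the root of a hypothetical tree stored as a
// parent-index array (the root is its own parent). If the structure is
// corrupt and contains a cycle, the walk would never reach a root, so
// we stop once more than MAX_DEPTH steps have been taken.
std::vector<std::size_t> ancestors_with_guard(
        const std::vector<std::size_t>& parent, std::size_t start) {
    std::vector<std::size_t> path;
    std::size_t current = start;
    std::size_t depth = 0;
    while (true) {
        path.push_back(current);
        if (parent.at(current) == current) {
            return path;  // reached the root
        }
        if (++depth > MAX_DEPTH) {
            throw std::runtime_error("snarl tree traversal exceeded depth limit");
        }
        current = parent.at(current);
    }
}
```

The guard costs one increment and comparison per level, so it is cheap relative to the lookups it protects.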

Also use distance_index.get_snarl_child_count() (O(1) record read)
instead of for_each_child iteration in get_regular/irregular_snarl_code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@adamnovak adamnovak merged commit 957ebeb into hublabel Mar 26, 2026
@electricEpilith electricEpilith deleted the hublabel-debug branch March 31, 2026 04:02