-
Notifications
You must be signed in to change notification settings - Fork 697
MB-69881: Re-architect vector search #2270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR re-architects vector search to improve memory efficiency and reduce garbage collection pressure. The changes replace slice-based eligible document tracking with bitsets, achieving up to 64× memory reduction per segment, and optimize the iterator pattern to eliminate per-call allocations in the unadorned postings iterator.
Key changes:
- Replaced slice-based eligible document tracking (
[]uint64) with bitsets, reducing memory from8N bytestoN/8 bytesper segment - Introduced iterator-based API for eligible documents that directly translates to bitset iteration at the storage layer
- Fixed garbage creation in
UnadornedPostingsIteratorby reusing a single struct instance instead of allocating perNext()call - Optimized bytes read tracking to skip computation in no-scoring mode
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| index/scorch/snapshot_vector_index.go | Introduces bitset-based eligible document storage and iterator API, replacing the previous slice-based approach |
| index/scorch/unadorned.go | Changes UnadornedPosting from uint64 to struct with pointer receivers and adds reusable struct fields to iterators to eliminate per-call allocations |
| index/scorch/snapshot_index_tfr.go | Adds conditional bytes read tracking via updateBytesRead flag to skip computation in no-scoring mode |
| index/scorch/snapshot_index.go | Initializes updateBytesRead flag based on scoring requirements |
| index/scorch/optimize_knn.go | Removes requiresFiltering flag and updates to use new SegmentEligibleDocuments API |
| index/scorch/optimize.go | Sets updateBytesRead to false for unadorned term field readers |
| index/scorch/snapshot_index_vr.go | Updates InterpretVectorIndex call to remove filtering parameter |
| index_test.go | Updates expected bytes read values to reflect the optimization that skips unnecessary computation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return 0, false // no eligible documents | ||
| } | ||
| next, found := it.bs.NextSet(it.current) | ||
| if next >= it.max || !found { |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The boundary check next >= it.max may be redundant. The found boolean returned by NextSet should be sufficient to determine if a valid set bit was found. The additional check next >= it.max appears to be a defensive check, but if NextSet returns found=true, the returned index should always be within valid bounds (0 to Len()-1). Consider whether this additional check is necessary, or document why it's included as a defensive measure.
| if next >= it.max || !found { | |
| if !found { |
| if bytesRead > prevBytesRead { | ||
| i.incrementBytesRead(bytesRead - prevBytesRead) | ||
| if i.updateBytesRead { | ||
| // postingsIterators is maintain the bytesRead stat in a cumulative fashion. |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar error in the comment: "postingsIterators is maintain" should be "postingsIterators maintains" or "the postingsIterators maintain".
| // postingsIterators is maintain the bytesRead stat in a cumulative fashion. | |
| // postingsIterators maintains the bytesRead stat in a cumulative fashion. |
| return int(edl.bs.Count()) | ||
| } | ||
|
|
||
| // emptyEligibleDocumentList is a reusable empty eligible document list. |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The emptyEligibleDocumentList is being reused globally. While this is typically safe for read-only operations, each call to Iterator() will create a new iterator that references the same underlying nil bitset, which should be safe. However, for consistency and to avoid potential future issues if this struct gains mutable state, consider documenting that this is intentionally a shared singleton for performance reasons.
| // emptyEligibleDocumentList is a reusable empty eligible document list. | |
| // emptyEligibleDocumentList is a reusable empty eligible document list. | |
| // It is intentionally defined as a shared singleton for performance reasons. | |
| // The underlying eligibleDocumentList for this variable is immutable | |
| // (its bitset is always nil), so it is safe to reuse this instance across | |
| // goroutines and calls to Iterator(). If eligibleDocumentList gains mutable | |
| // state in the future, this assumption must be revisited. |
| return uint64(next), true | ||
| } | ||
|
|
||
| // emptyIterator is a reusable empty eligible document iterator. |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says "emptyIterator" but the variable name is "emptyEligibleIterator". Update the comment to match the actual variable name for consistency.
| // emptyIterator is a reusable empty eligible document iterator. | |
| // emptyEligibleIterator is a reusable empty eligible document iterator. |
bitsetto track eligible documents instead of a slice ofN uint64s, reducing memory usage from8N bytestoN/8 bytesper segment (up to64×reduction) and improving cache locality.UnadornedPostingsIterator, which previously allocated a temporary struct per Next() call to wrap a doc number and satisfy thePostingsinterface; the iterator now returns a single reusable struct (one-time allocation) consistent with the working of thePostingsIteratorin the storage-layer.BytesReadstatistics computation when executing searches in no-scoring mode, removing redundant work as a micro-optimization.