MB-69881: [v17] Re-architect vector search #356
base: unstable-v17
Pull request overview
This PR re-architects the vector search implementation in the zap segment format by transitioning from hash-based vector ID assignment to monotonically increasing sequential IDs (0 to N-1). The refactoring enables significant performance improvements through faster FAISS APIs, reduced memory overhead, and optimized bitmap-based filtering.
Key changes:
- Replaces hash-based vector IDs with sequential IDs (0...N-1) per segment, enabling use of FAISS `Add` instead of `AddWithIDs` and array-based direct maps
- Refactors vector cache to use slice-based bidirectional mappings (vectorID ↔ documentID), reducing memory from ~30N to ~8N bytes per N vectors
- Introduces custom bitmap abstraction for inclusion/exclusion filtering, replacing `IDSelectorBatch` with `IDSelectorBitmap` for zero-copy CGO sharing
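To illustrate why dense sequential IDs allow slices in place of hash maps, here is a minimal sketch. The `idMap` type and its methods are hypothetical and are not the PR's `idMapping`. The sketch also assumes vectors are added in non-decreasing document order, which the PR does not state.

```go
package main

import (
	"fmt"
	"sort"
)

// idMap is a hypothetical illustration, not the PR's idMapping type.
// Because vector IDs are dense (0..N-1), the vector ID itself is the
// slice index and no hash table is needed.
type idMap struct {
	vecToDoc []uint32 // index = vector ID, value = document number
}

// add records a vector for docNum and returns its sequential vector ID.
// Assumption: callers add documents in non-decreasing docNum order.
func (m *idMap) add(docNum uint32) int {
	m.vecToDoc = append(m.vecToDoc, docNum)
	return len(m.vecToDoc) - 1
}

// docOf resolves a vector ID to its document number by direct indexing.
func (m *idMap) docOf(vecID int) (uint32, bool) {
	if vecID < 0 || vecID >= len(m.vecToDoc) {
		return 0, false
	}
	return m.vecToDoc[vecID], true
}

// vecsOf returns the half-open range [lo, hi) of vector IDs that belong
// to docNum, found by binary search over the sorted slice.
func (m *idMap) vecsOf(docNum uint32) (lo, hi int) {
	n := len(m.vecToDoc)
	lo = sort.Search(n, func(i int) bool { return m.vecToDoc[i] >= docNum })
	hi = sort.Search(n, func(i int) bool { return m.vecToDoc[i] > docNum })
	return lo, hi
}

func main() {
	m := &idMap{}
	for _, d := range []uint32{0, 0, 2, 5} {
		m.add(d)
	}
	doc, _ := m.docOf(2)
	lo, hi := m.vecsOf(0)
	fmt.Println(doc, lo, hi) // prints "2 0 2"
}
```

The forward lookup is a single slice index. The reverse lookup here uses binary search only to keep the sketch short; the PR's actual mapping layout may differ.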
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| zap.md | Updates file format documentation to reflect sequential vector IDs and simplified mapping structure |
| segment.go | Adds VectorAddr helper function to compute vector index file offset for a specified field |
| section_faiss_vector_index.go | Refactors vector indexing to use sequential IDs, removes hash-based assignment, updates merge logic and index creation |
| faiss_vector_wrapper.go | Introduces bitmap and idMapping abstractions, refactors search methods to use bitmap selectors instead of ID slices |
| faiss_vector_test.go | Updates test calls to match simplified InterpretVectorIndex API signature |
| faiss_vector_posting.go | Simplifies InterpretVectorIndex function signature by removing requiresFiltering parameter |
| faiss_vector_cache.go | Refactors cache to use idMapping instead of hash maps, updates exclusion bitmap generation |
| cmd/zap/cmd/vector.go | Simplifies vector command implementation to use new VectorAddr helper and sequential ID format |
| build.go | Changes version constant and adds type annotation to fieldNotUninverted constant |
- Assigns vector IDs sequentially `(0 … N-1)` per segment, replacing the previous hash-based and non-stable ID assignment. This enables use of the faster `Add` FAISS API instead of `AddWithIDs`, allows `DirectMap::Array` instead of `DirectMap::Map`, and removes hash-table overhead from both indexing and lookup paths.
- Reduces the memory used by the vector ID ↔ document ID mappings from `~30N` bytes to `~8N` bytes for `N` vectors per segment.
- Replaces `IDSelectorBatch` (hash-map + Bloom filter) with `IDSelectorBitmap` to express vector inclusion/exclusion directly. The bitmap is constructed once on the Go side and shared across the CGO boundary, allowing FAISS to evaluate eligibility via direct bit access with zero intermediate materialization.
- Introduces a `bitset` abstraction to build and manage `inclusion/exclusion` sets, providing a stable, contiguous representation that `Faiss::IDSelectorBitmap` can wrap directly, so FAISS reads a Go-allocated `[]byte` without copying.
- Refactors `SearchWithFilter` to operate entirely on bitsets, with a single construction of the eligible-vector bitmap that is reused across all FAISS calls involved in a query, eliminating repeated selector reconstruction.
- Updates `zap.md` and command line tooling to account for the new file format.
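As a rough illustration of the bitset idea described above, the following sketch builds an inclusion bitmap over sequential vector IDs in a contiguous `[]byte`. The `bitset` type and its method names are hypothetical and are not the PR's implementation. The bit layout (bit `i` stored in byte `i>>3` at position `i&7`) is my understanding of what `IDSelectorBitmap` reads and should be checked against the FAISS version in use.

```go
package main

import "fmt"

// bitset is a hypothetical stand-in for the PR's bitset abstraction.
// It keeps one bit per vector ID in a contiguous byte slice so the same
// memory could be handed to a bitmap-based selector without copying.
type bitset struct {
	bits []byte
	n    int // number of addressable vector IDs
}

// newBitset allocates a zeroed bitmap covering IDs 0..n-1.
func newBitset(n int) *bitset {
	return &bitset{bits: make([]byte, (n+7)/8), n: n}
}

// set marks a vector ID as eligible. The caller must pass 0 <= id < n.
func (b *bitset) set(id int) {
	b.bits[id>>3] |= byte(1) << (uint(id) & 7)
}

// clear marks a vector ID as ineligible. The caller must pass 0 <= id < n.
func (b *bitset) clear(id int) {
	b.bits[id>>3] &^= byte(1) << (uint(id) & 7)
}

// test reports whether a vector ID is eligible; out-of-range IDs are not.
func (b *bitset) test(id int) bool {
	if id < 0 || id >= b.n {
		return false
	}
	return b.bits[id>>3]&(byte(1)<<(uint(id)&7)) != 0
}

func main() {
	// Build the eligible set once; a query would then reuse it for every
	// filtered search call instead of rebuilding a selector each time.
	eligible := newBitset(10)
	for _, id := range []int{1, 3, 9} {
		eligible.set(id)
	}
	fmt.Printf("%08b\n", eligible.bits) // prints "[00001010 00000010]"
	fmt.Println(eligible.test(3), eligible.test(4))
}
```

Sharing such a slice across the CGO boundary is outside this sketch; any real use has to respect the cgo pointer-passing rules for the lifetime of the FAISS call.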