
Conversation

@CascadingRadium commented Dec 22, 2025

  • Enforce monotonically increasing vector IDs (0 … N-1) per segment, replacing the previous hash-based and non-stable ID assignment. This enables use of the faster Add FAISS API instead of AddWithIDs, allows DirectMap::Array instead of DirectMap::Map, and removes hash-table overhead from both indexing and lookup paths.
  • Refactor the vector cache to use slice-based forward and reverse mappings (vectorID <=> documentID), enabling O(1) indexed access and linear iteration, eliminating key hashing costs, and reducing memory usage from ~30N bytes to ~8N bytes for N vectors per segment.
  • Replace IDSelectorBatch (hash-map + Bloom filter) with IDSelectorBitmap to express vector inclusion/exclusion directly. The bitmap is constructed once on the Go side and shared across the CGO boundary, allowing FAISS to evaluate eligibility via direct bit access with zero intermediate materialization.
  • Introduce a custom bitset abstraction to build and manage inclusion/exclusion sets, providing a stable, contiguous representation that faiss::IDSelectorBitmap can wrap directly, giving FAISS direct access to a Go-allocated []byte.
  • Overhaul SearchWithFilter to operate entirely on bitsets, with a single construction of the eligible-vector bitmap that is reused across all FAISS calls involved in a query, eliminating repeated selector reconstruction.
  • Optimize multi-vector document retrieval by reusing the same inclusion/exclusion bitsets across repeated IVF searches during docSearch, significantly reducing per-iteration overhead in the FAISS layer and improving filtered retrieval throughput.
  • Update zap.md and command line tooling to account for the new file format.

Base automatically changed from refactor to unstable-v17 December 22, 2025 16:26
@CascadingRadium CascadingRadium changed the title Improve performance of Vector Search MB-69881: Re-architect vector search Dec 28, 2025
@CascadingRadium CascadingRadium changed the title MB-69881: Re-architect vector search [v17] MB-69881: Re-architect vector search Dec 28, 2025
@CascadingRadium CascadingRadium changed the title [v17] MB-69881: Re-architect vector search MB-69881: [v17] Re-architect vector search Dec 28, 2025
@CascadingRadium CascadingRadium moved this from Todo to Done in Vector Search v2 Dec 28, 2025
Copilot AI left a comment

Pull request overview

This PR re-architects the vector search implementation in the zap segment format by transitioning from hash-based vector ID assignment to monotonically increasing sequential IDs (0 to N-1). The refactoring enables significant performance improvements through faster FAISS APIs, reduced memory overhead, and optimized bitmap-based filtering.

Key changes:

  • Replaces hash-based vector IDs with sequential IDs (0...N-1) per segment, enabling use of FAISS Add instead of AddWithIDs and array-based direct maps
  • Refactors vector cache to use slice-based bidirectional mappings (vectorID ↔ documentID), reducing memory from ~30N to ~8N bytes per N vectors
  • Introduces custom bitmap abstraction for inclusion/exclusion filtering, replacing IDSelectorBatch with IDSelectorBitmap for zero-copy CGO sharing

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Summary per file:

  • zap.md: Updates the file format documentation to reflect sequential vector IDs and the simplified mapping structure
  • segment.go: Adds a VectorAddr helper to compute the vector index file offset for a specified field
  • section_faiss_vector_index.go: Refactors vector indexing to use sequential IDs, removes hash-based assignment, and updates merge logic and index creation
  • faiss_vector_wrapper.go: Introduces the bitmap and idMapping abstractions, and refactors search methods to use bitmap selectors instead of ID slices
  • faiss_vector_test.go: Updates test calls to match the simplified InterpretVectorIndex API signature
  • faiss_vector_posting.go: Simplifies the InterpretVectorIndex signature by removing the requiresFiltering parameter
  • faiss_vector_cache.go: Refactors the cache to use idMapping instead of hash maps and updates exclusion bitmap generation
  • cmd/zap/cmd/vector.go: Simplifies the vector command to use the new VectorAddr helper and sequential ID format
  • build.go: Changes the version constant and adds a type annotation to the fieldNotUninverted constant


@abhinavdangeti left a comment

First pass.

@abhinavdangeti left a comment

This is a massive refactor, and being completely honest here - I think we should either have you URGENTLY do a proper code walk through for everyone to quickly follow the changes here OR just ditch this effort for now and simplify it down to only JUST the code that helps with the pre-filtering improvement.

Time spent reviewing this is time spent away from reviewing the nested search code which we need to get in by end of next week - which is looking impossible at the moment.

@abhinavdangeti

Ok @CascadingRadium - let's do a quick code walk through this Thursday.


CascadingRadium commented Jan 7, 2026

> This is a massive refactor, and being completely honest here - I think we should either have you URGENTLY do a proper code walk through for everyone to quickly follow the changes here OR just ditch this effort for now and simplify it down to only JUST the code that helps with the pre-filtering improvement. Time spent reviewing this is time spent away from reviewing the nested search code which we need to get in by end of next week - which is looking impossible at the moment.

Hi @abhinavdangeti, the code is already written; simplifying it now would basically be another rewrite, which would take more effort to execute.

> Ok @CascadingRadium - let's do a quick code walk through this Thursday.

Sounds good 👍🏻

// ones won't be reconstructed in the final index.
vecToDocID[vecID] = newDocID
indexes[curIdx].vecIds = append(indexes[curIdx].vecIds, vecID)
newIndexInfo.vecIds = append(newIndexInfo.vecIds, int64(vecID))
@Thejas-bhat Jan 7, 2026

In the case of updates/deletes, the vecIds array wouldn't be a continuous set of 0-N, right? So while flushing out the vector index and writing the vecID -> docID mapping (where the assumption is that the IDs are 0-N), wouldn't there be data-loss type situations, since the vectors are not really associated with the right IDs? As an example, with 5 vectors and one of them deleted, the new vecIDs here would be (0, 1, 3, 4), while writing out the vector index and the docID mapping would operate with (0, 1, 2, 3) in consideration, right?

@Thejas-bhat Jan 7, 2026

Okay, I don't think this should happen, since the mapping of docID and (new) vecID is preserved and identical in the vector index; but I think you should still test the update/delete path thoroughly, plus force merges, for more confidence.


The vecID -> docID mapping is being changed - so this should work fine I think ?
