Skip to content

Conversation

@CascadingRadium
Copy link
Member

@CascadingRadium CascadingRadium commented Dec 28, 2025

  • Use a bitset to track eligible documents instead of a slice of N uint64s, reducing memory usage from 8N bytes to N/8 bytes per segment (up to 64× reduction) and improving cache locality.
  • Pass an iterator over eligible documents that iterates the bitset directly, allowing direct translation into a bitset of eligible vector IDs in the storage layer and eliminating the need for a separate slice intermediary.
  • Fix garbage creation in the UnadornedPostingsIterator, which previously allocated a temporary struct per Next() call to wrap a doc number and satisfy the Postings interface; the iterator now returns a single reusable struct (one-time allocation) consistent with the working of the PostingsIterator in the storage-layer.
  • Avoid unnecessary BytesRead statistics computation when executing searches in no-scoring mode, removing redundant work as a micro-optimization.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR re-architects vector search to improve memory efficiency and reduce garbage collection pressure. The changes replace slice-based eligible document tracking with bitsets, achieving up to 64× memory reduction per segment, and optimize the iterator pattern to eliminate per-call allocations in the unadorned postings iterator.

Key changes:

  • Replaced slice-based eligible document tracking ([]uint64) with bitsets, reducing memory from 8N bytes to N/8 bytes per segment
  • Introduced iterator-based API for eligible documents that directly translates to bitset iteration at the storage layer
  • Fixed garbage creation in UnadornedPostingsIterator by reusing a single struct instance instead of allocating per Next() call
  • Optimized bytes read tracking to skip computation in no-scoring mode

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
index/scorch/snapshot_vector_index.go Introduces bitset-based eligible document storage and iterator API, replacing the previous slice-based approach
index/scorch/unadorned.go Changes UnadornedPosting from uint64 to struct with pointer receivers and adds reusable struct fields to iterators to eliminate per-call allocations
index/scorch/snapshot_index_tfr.go Adds conditional bytes read tracking via updateBytesRead flag to skip computation in no-scoring mode
index/scorch/snapshot_index.go Initializes updateBytesRead flag based on scoring requirements
index/scorch/optimize_knn.go Removes requiresFiltering flag and updates to use new SegmentEligibleDocuments API
index/scorch/optimize.go Sets updateBytesRead to false for unadorned term field readers
index/scorch/snapshot_index_vr.go Updates InterpretVectorIndex call to remove filtering parameter
index_test.go Updates expected bytes read values to reflect the optimization that skips unnecessary computation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

return 0, false // no eligible documents
}
next, found := it.bs.NextSet(it.current)
if next >= it.max || !found {
Copy link

Copilot AI Dec 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The boundary check next >= it.max may be redundant. The found boolean returned by NextSet should be sufficient to determine if a valid set bit was found. The additional check next >= it.max appears to be a defensive check, but if NextSet returns found=true, the returned index should always be within valid bounds (0 to Len()-1). Consider whether this additional check is necessary, or document why it's included as a defensive measure.

Suggested change
if next >= it.max || !found {
if !found {

Copilot uses AI. Check for mistakes.
if bytesRead > prevBytesRead {
i.incrementBytesRead(bytesRead - prevBytesRead)
if i.updateBytesRead {
// postingsIterators is maintain the bytesRead stat in a cumulative fashion.
Copy link

Copilot AI Dec 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar error in the comment: "postingsIterators is maintain" should be "postingsIterators maintains" or "the postingsIterators maintain".

Suggested change
// postingsIterators is maintain the bytesRead stat in a cumulative fashion.
// postingsIterators maintains the bytesRead stat in a cumulative fashion.

Copilot uses AI. Check for mistakes.
return int(edl.bs.Count())
}

// emptyEligibleDocumentList is a reusable empty eligible document list.
Copy link

Copilot AI Dec 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The emptyEligibleDocumentList is being reused globally. While this is typically safe for read-only operations, each call to Iterator() will create a new iterator that references the same underlying nil bitset, which should be safe. However, for consistency and to avoid potential future issues if this struct gains mutable state, consider documenting that this is intentionally a shared singleton for performance reasons.

Suggested change
// emptyEligibleDocumentList is a reusable empty eligible document list.
// emptyEligibleDocumentList is a reusable empty eligible document list.
// It is intentionally defined as a shared singleton for performance reasons.
// The underlying eligibleDocumentList for this variable is immutable
// (its bitset is always nil), so it is safe to reuse this instance across
// goroutines and calls to Iterator(). If eligibleDocumentList gains mutable
// state in the future, this assumption must be revisited.

Copilot uses AI. Check for mistakes.
return uint64(next), true
}

// emptyIterator is a reusable empty eligible document iterator.
Copy link

Copilot AI Dec 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "emptyIterator" but the variable name is "emptyEligibleIterator". Update the comment to match the actual variable name for consistency.

Suggested change
// emptyIterator is a reusable empty eligible document iterator.
// emptyEligibleIterator is a reusable empty eligible document iterator.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants