feat: implement vector index details #6099

Open
wjones127 wants to merge 28 commits into lance-format:main from wjones127:feat/vector-index-details

wjones127 (Contributor) commented Mar 4, 2026

Cache vector index configuration within the index metadata, such as the distance type and build parameters.

Previously, to determine things like the distance type or index type of a vector index, the index file itself had to be opened. This PR stores that information in VectorIndexDetails within the manifest's index_details field, which is fetched and cached eagerly when loading the manifest.

Old indexes have this field left blank. When blank, the details are extracted from the index files and cached. This migration happens on the first write with a new library version.

What's stored in VectorIndexDetails

Core build parameters (typed fields — required for any runtime to build the index):

  • metric_type
  • target_partition_size (IVF)
  • hnsw_index_config: max_connections, construction_ef, max_level (HNSW)
  • compression — PQ/SQ/RQ/flat, including num_bits, num_sub_vectors, rotation_type
  • index_version

Runtime hints (map<string, string> runtime_hints):
Optional build preferences that don't affect index structure. Stored so a background rebuild process can reproduce the original configuration. Runtimes that don't recognize a key must silently ignore it. Only non-default values are written.

Keys use reverse-DNS namespacing: lance.* for core Lance hints, other prefixes for runtime-specific hints (e.g., lancedb.accelerator for GPU acceleration in LanceDB Enterprise).

Current lance.* hints: lance.ivf.max_iters, lance.ivf.sample_rate, lance.ivf.shuffle_partition_batches, lance.ivf.shuffle_partition_concurrency, lance.pq.max_iters, lance.pq.sample_rate, lance.pq.kmeans_redos, lance.sq.sample_rate, lance.hnsw.prefetch_distance, lance.skip_transpose.

Also adds apply_runtime_hints() to read hints back into build params for future rebuild logic.
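
The round trip described above (write only non-default values, read them back, ignore unknown keys) could look roughly like the sketch below. It is not the code from this PR: `IvfHints`, `to_runtime_hints`, `apply_hints`, and the default values are hypothetical stand-ins, and only the two `lance.ivf.*` key names come from the list above. Skipping values that fail to parse is also an assumption of this sketch.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for IVF build preferences.
#[derive(Debug, Clone, PartialEq)]
pub struct IvfHints {
    pub max_iters: usize,
    pub sample_rate: usize,
}

impl Default for IvfHints {
    fn default() -> Self {
        // Illustrative defaults only; not necessarily Lance's real defaults.
        Self { max_iters: 50, sample_rate: 256 }
    }
}

/// Serialize build preferences, writing only values that differ from the defaults.
pub fn to_runtime_hints(params: &IvfHints) -> BTreeMap<String, String> {
    let defaults = IvfHints::default();
    let mut hints = BTreeMap::new();
    if params.max_iters != defaults.max_iters {
        hints.insert("lance.ivf.max_iters".to_string(), params.max_iters.to_string());
    }
    if params.sample_rate != defaults.sample_rate {
        hints.insert("lance.ivf.sample_rate".to_string(), params.sample_rate.to_string());
    }
    hints
}

/// Read hints back into build params.
/// Unrecognized keys (and, in this sketch, unparsable values) are silently ignored.
pub fn apply_hints(hints: &BTreeMap<String, String>, params: &mut IvfHints) {
    for (key, value) in hints {
        match key.as_str() {
            "lance.ivf.max_iters" => {
                if let Ok(v) = value.parse::<usize>() {
                    params.max_iters = v;
                }
            }
            "lance.ivf.sample_rate" => {
                if let Ok(v) = value.parse::<usize>() {
                    params.sample_rate = v;
                }
            }
            // For example "lancedb.accelerator": not understood by this runtime.
            _ => {}
        }
    }
}
```

With these definitions, params that only change `max_iters` produce a single-entry map, and applying that map (plus any foreign key such as `lancedb.accelerator`) to default params reproduces the original values.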

Closes #5963

@github-actions github-actions Bot added the enhancement label Mar 4, 2026
github-actions Bot (Contributor) commented Mar 4, 2026

PR Review: feat: implement vector index details

P0: Bits enum doesn't cover RaBitQ's actual bit-width

RaBitQ in this codebase uses 1 bit per dimension (see rust/lance-index/src/vector/bq/builder.rs:40 where num_bits: 1). The Bits enum only has BIT_8 = 0 and BIT_4 = 1 — there's no BIT_1 variant. This means RabitQuantization.num_bits cannot represent the actual quantization used by RaBitQ today.

Either:

  • Add a BIT_1 = 2 variant, or
  • Remove num_bits from RabitQuantization since RaBitQ is always 1-bit in practice (the field in the Rust struct exists but is always set to 1).

P1: No corresponding Rust code to populate the new fields

All existing call sites create VectorIndexDetails::default() (e.g., rust/lance/src/index.rs:361, rust/lance/src/dataset/index.rs:93). This PR only changes the proto — is there a follow-up PR planned to actually populate these fields during index creation? Without that, describe_indices() still won't have the metadata. Consider noting this in the PR description or linking a follow-up issue.

P1: target_partition_size may be better as uint32 or have docs clarifying semantics

target_partition_size is uint64 but IVF partition counts are typically small numbers (sqrt of dataset size). More importantly, is this the target number of partitions (i.e. num_partitions) or the target size per partition? The issue mentions num_partitions — if this represents partition count, the field name is misleading. Consider renaming to num_partitions to match the terminology used elsewhere in the codebase and in the issue description.

Minor

  • The Bits enum name is generic — if other proto messages in this file or future messages need a Bits enum with different semantics, there could be a naming collision. Since it's nested inside VectorIndexDetails this is scoped, but worth noting.
  • The comment on VectorIndexDetails still says "Empty details messages for older indexes" — should be updated to reflect the message is no longer empty.

codecov Bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 87.71186% with 116 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/index/vector/details.rs | 87.59% | 67 Missing and 30 partials ⚠️ |
| rust/lance/src/dataset/scanner.rs | 25.00% | 9 Missing ⚠️ |
| rust/lance/src/index.rs | 93.67% | 4 Missing and 1 partial ⚠️ |
| rust/lance/src/index/vector/ivf.rs | 77.27% | 5 Missing ⚠️ |

Previously, vector indices returned index_type "Unknown" and empty details
in describe_indices(). This populates VectorIndexDetails at creation time
from build params, derives a human-readable index type string (e.g.
"IVF_PQ"), serializes details as JSON, and infers details from index files
on disk as a fallback for legacy indices.

Also changes proto num_bits from Bits enum to uint32 to support RQ's
default of 1 bit, and adds rotation_type to RabitQuantization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the python label Mar 4, 2026
wjones127 and others added 8 commits March 4, 2026 15:30
Replace imperative serde_json::Map construction with #[derive(Serialize)]
structs for clearer, more maintainable JSON serialization. This also adds
the missing rotation_type field to RQ compression output.

Add snapshot-style unit tests that assert exact JSON strings to guard
backwards compatibility of the describe_indices() output format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, vector index details for legacy indices were only inferred
lazily in describe_indices(). This moves inference to load_indices() and
migrate_indices(), so details are populated before caching and persisted
into new manifest versions. Inference runs once per index name,
concurrently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Also handles the case where index_details is None (very old indices)
by checking if the indexed field is a vector type. Moves inference
outside the cache-miss branch in load_indices so it also runs on
indices that were opportunistically cached during Dataset::open.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix compression oneof field numbers (5,6,7 -> 4,5,6) to avoid gap
- Add comment that target_partition_size = 0 means unset
- Extract infer_missing_vector_details helper to deduplicate logic
  between load_indices and migrate_indices

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…etails

# Conflicts:
#	rust/lance/src/index/append.rs
#	rust/lance/src/index/create.rs
#	rust/lance/src/index/vector.rs
Comment thread protos/table.proto Outdated
Comment on lines +463 to +512
// Details for vector indexes.
message VectorIndexDetails {
  enum VectorMetricType {
    L2 = 0;
    COSINE = 1;
    DOT = 2;
    HAMMING = 3;
  }

  VectorMetricType metric_type = 1;

  // 0 means unset (unknown or not applicable).
  uint64 target_partition_size = 2;

  optional HnswIndexDetails hnsw_index_config = 3;

  message ProductQuantization {
    uint32 num_bits = 1;
    uint32 num_sub_vectors = 2;
  }
  message ScalarQuantization {
    uint32 num_bits = 1;
  }
  message RabitQuantization {
    enum RotationType {
      FAST = 0;
      MATRIX = 1;
    }
    uint32 num_bits = 1;
    RotationType rotation_type = 2;
  }

  // An unset compression oneof means flat / no quantization.
  oneof compression {
    ProductQuantization pq = 4;
    ScalarQuantization sq = 5;
    RabitQuantization rq = 6;
  }
}

// Hierarchical Navigable Small World (HNSW) index details, used as an optional configuration for IVF indexes.
message HnswIndexDetails {
  // The maximum number of outgoing edges per node in the HNSW graph. Higher values
  // means more connections, better recall, but more memory and slower builds.
  // Referred to as "M" in the HNSW literature.
  uint32 max_connections = 1;
  // "construction exploration factor": The size of the dynamic list used during
  // index construction.
  uint32 construction_ef = 2;
}
Contributor Author

This representation works for the existing set of vector indices, but I'm wondering whether it is good for future plans. The current internal design has the concept of "stages", so you could have something like ivf-ivf-pq or hnsw-pq (no IVF). Some sequences of stages just don't make sense, like pq-pq. So I'm not sure the stages representation makes sense.

I was thinking it could be a tree-like system, where IVF and HNSW could have children, but PQ and SQ can't.

What do you think? @BubbleCal @eddyxu

Member

I've always seen it as three layers. The first is partitioning (IVF is really the only choice here). The second is searching within a partition (flat vs hnsw), and the third is quantization (pq, rq, sq, etc.). So I don't think the concept of stages makes sense. For example, I don't see ivf-ivf-pq. I see ivf (layers=2) - pq.
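
To make the three-layer view concrete, here is a minimal sketch of how a human-readable index type string such as "IVF_PQ" could be derived from the second and third layers. The `SubIndex` and `Quantization` enums and the `index_type_name` function are hypothetical and are not the `derive_vector_index_type` helper from this PR. Names other than "IVF_PQ" are formed by analogy and are an assumption.

```rust
/// Second layer: how vectors are searched within a partition.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SubIndex {
    Flat,
    Hnsw,
}

/// Third layer: how vectors are compressed.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Quantization {
    Flat,
    Pq,
    Sq,
    Rq,
}

/// Build a type name from the layers. The first layer is always IVF in this
/// model, so it is not a parameter.
pub fn index_type_name(sub_index: SubIndex, quantization: Quantization) -> String {
    let mut parts = vec!["IVF"];
    // A flat search within the partition adds no component of its own.
    if sub_index == SubIndex::Hnsw {
        parts.push("HNSW");
    }
    parts.push(match quantization {
        Quantization::Flat => "FLAT",
        Quantization::Pq => "PQ",
        Quantization::Sq => "SQ",
        Quantization::Rq => "RQ",
    });
    parts.join("_")
}
```

Because each layer is a closed enum, combinations like pq-pq cannot be expressed at all, which is the property the three-layer model is meant to provide.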

…x.proto

These messages belong with other index-related protos. After the move,
VectorIndexDetails reuses the existing top-level VectorMetricType enum
instead of defining its own nested copy. Rust imports updated from
lance_table::format::pb to lance_index::pb throughout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rd, type_url

- Fix inverted comment on hnsw_index_config field in index.proto
- Use tracing::warn! instead of log::warn! in details.rs
- Prefer non-empty index_details when carrying forward in append
- Revert describe_indices to original chunk_by pattern
- Update Python test type_url to match new proto package

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread protos/table.proto Outdated
@wjones127 wjones127 marked this pull request as ready for review March 5, 2026 22:21
@wjones127 wjones127 requested a review from BubbleCal March 5, 2026 22:21
westonpace (Member) left a comment

I'm not 100% sure we can get away with changing the protobuf type URL. Can you create a compatibility test to ensure that old versions can read new indexes created with these new details?

Comment thread protos/index.proto
}

// Details for vector indexes, stored in the manifest's index_details field.
message VectorIndexDetails {
Member

Where are the IVF details? Was it hierarchical? How many partitions in each stage?

Contributor Author

I'm using VectorIndexDetails here for the IVF stage. I suppose I could make it a substruct to make it clearer. The only parameter right now is target_partition_size.

> How many partitions in each stage?

What do you mean by this?

Comment thread protos/index.proto Outdated
}

// Hierarchical Navigable Small World (HNSW) index details, used as an optional configuration for IVF indexes.
message HnswIndexDetails {
Member

Minor nit: I think of "index details" as the top-level message describing a type of index. This is a nested message that cannot stand on its own so maybe just HnswParameters?

Contributor Author

Yeah that's fair.

- assert info.type_url == "/lance.table.VectorIndexDetails"
- # This is currently Unknown because vector indices are not yet handled by plugins
- assert info.index_type == "Unknown"
+ assert info.type_url == "/lance.index.pb.VectorIndexDetails"
Member

Regrettably, I do not think we can change the type URL for backwards compatibility reasons.

Contributor Author

I can move that back if it's necessary.

Comment on lines +215 to +216
// Carry forward existing index details, preferring the first segment
// that has populated (non-empty) details.
Member

I think this is a safe assumption, but we are assuming all segments have the same details, right?

Contributor Author

Should be. Ideally, we would create some top-level index configuration that is deduplicated across segments, but that's a complex format change for another day.
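
A minimal sketch of the carry-forward rule under discussion, assuming each segment holds optional encoded details bytes. The `Segment` type and `carry_forward_details` function are hypothetical, and returning `None` when no segment has populated details is a choice made for this sketch, not necessarily what the PR does.

```rust
/// Hypothetical stand-in for one index segment's metadata.
pub struct Segment {
    /// Encoded index details. `None` or empty bytes for legacy segments.
    pub details: Option<Vec<u8>>,
}

/// Return the details of the first segment whose details are populated
/// (present and non-empty), or `None` if no segment has any.
pub fn carry_forward_details(segments: &[Segment]) -> Option<&[u8]> {
    segments
        .iter()
        .filter_map(|segment| segment.details.as_deref())
        .find(|details| !details.is_empty())
}
```

This relies on the assumption raised above: all segments of one index share the same details, so whichever populated copy comes first is representative.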

Comment thread rust/lance/src/index.rs Outdated
Comment on lines +414 to +417
use vector::details::{
    derive_vector_index_type, infer_missing_vector_details, vector_details_as_json,
};
pub(crate) use vector::details::{vector_index_details, vector_index_details_default};
Member

Should these be at the top of the file?

Contributor Author

Yeah I should move those.

wjones127 (Contributor Author) commented

> I'm not 100% sure we can get away with changing the protobuf type URL. Can you create a compatibility test to ensure that old versions can read new indexes created with these new details?

We have one here that goes back to 0.29.1.beta2:

@compat_test(min_version="0.29.1.beta2")
class PqVectorIndex(UpgradeDowngradeTest):
    """Test PQ (Product Quantization) vector index compatibility."""

    def __init__(self, path: Path):
        self.path = path

    def create(self):

I wonder if we need to check earlier than that to see incompatibilities.

wjones127 (Contributor Author) commented

Should be able to implement this TODO:

// TODO: Once we do https://github.com/lance-format/lance/issues/5231, we
// should be able to get the metric type directly from the index metadata,
// at least for newer indexes.
let idx = self
    .dataset
    .open_vector_index(
        q.column.as_str(),
        &index.uuid.to_string(),
        &NoOpMetricsCollector,
    )
    .await?;
let index_metric = idx.metric_type();

wjones127 (Contributor Author) commented

@westonpace I had Claude go through and check whether this breaks backwards compatibility as is. Other than the type_url, it claims that the indexes don't have different compatibility from what's on main:

  Results

  The protobuf move (VectorIndexDetails from table.proto to index.proto) does not break backward compatibility. All failures seen are pre-existing and identical when reading indices written by main vs this branch.

  ┌───────────────┬────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ Version range │              Behavior              │                                                    Cause                                                    │
  ├───────────────┼────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 0.8.0         │ Can't open dataset                 │ Old manifest format                                                                                         │
  ├───────────────┼────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 0.9.0-0.10.0  │ Opens, panics on search            │ Very old index format                                                                                       │
  ├───────────────┼────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 0.11.0-0.14.0 │ Import error                       │ NumPy 1.x/2.x incompatibility                                                                               │
  ├───────────────┼────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 0.15.0-0.16.0 │ Works (brute force)                │ Doesn't recognize index, falls back to scan                                                                 │
  ├───────────────┼────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 0.17.0-0.29.0 │ Opens, sees index, fails to search │ Pre-existing index format incompatibility (missing field num_bits, 2-D tensor shape) — same failure on main │
  ├───────────────┼────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 0.30.0+       │ Works                              │ Full compatibility                                                                                          │
  └───────────────┴────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

  The earliest version that can successfully use an IVF_PQ index written by this branch is 0.30.0, but this is the same boundary as main. The registry lookup at rust/lance-index/src/registry.rs:98 uses details.type_url.split('.').next_back() which extracts just "VectorIndexDetails" regardless of package, so the package name change is transparent.
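
The claim about the registry lookup can be checked in isolation. The sketch below wraps the quoted expression, `type_url.split('.').next_back()`, in a hypothetical helper named `message_name` and applies it to the two type URLs that appear in the Python test in this PR. It is an illustration of that one expression, not the registry code itself.

```rust
/// Extract the trailing message name from a protobuf type URL by taking the
/// last dot-separated component, ignoring the package prefix.
pub fn message_name(type_url: &str) -> Option<&str> {
    type_url.split('.').next_back()
}
```

Both `"/lance.table.VectorIndexDetails"` and `"/lance.index.pb.VectorIndexDetails"` yield `Some("VectorIndexDetails")`, so a lookup keyed on this value does not depend on which proto package the message lives in.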

westonpace (Member) left a comment

Sweet. Let's go for it.

Previously, `scanner.rs:3425` called `open_vector_index` just to get
the metric type for metric compatibility checks. This required expensive
deserialization of the index file.

Now we read the metric type directly from `IndexMetadata.index_details`
(a `VectorIndexDetails` proto) added on the feat/vector-index-details
branch. This provides a fast path for newer indices without I/O.

For legacy indices without populated details (empty proto value bytes),
we fall back to the original expensive path.

Adds `metric_type_from_index_metadata` helper in `details.rs` that:
- Returns `None` for missing or empty details (legacy indices)
- Converts `VectorIndexDetails.metric_type` to `DistanceType` for populated details
- Uses the existing `From<VectorMetricType> for DistanceType` impl

Changes `matching_index` tuple from `(index, idx, index_metric)` to
`(index, index_metric)` since `idx` is only used in the fallback path.

Fixes lance-format#5231

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
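
The fast path and fallback described in this commit message can be sketched as follows. This is a simplified model, not the PR's code: `IndexMetadata`, `VectorDetails`, and `resolve_metric` are hypothetical, the proto decoding step is omitted by modelling already-decoded details as an `Option`, and the expensive `open_vector_index` call is represented by a closure.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DistanceType {
    L2,
    Cosine,
    Dot,
    Hamming,
}

/// Hypothetical decoded form of the vector index details.
pub struct VectorDetails {
    pub metric_type: DistanceType,
}

/// Hypothetical index metadata. `None` models missing or empty details,
/// which is how legacy indices appear.
pub struct IndexMetadata {
    pub details: Option<VectorDetails>,
}

/// Fast path: read the metric type from metadata without any I/O.
pub fn metric_type_from_index_metadata(meta: &IndexMetadata) -> Option<DistanceType> {
    meta.details.as_ref().map(|details| details.metric_type)
}

/// Use the metadata when it is populated. Otherwise call the expensive
/// fallback, which stands in for opening the index file.
pub fn resolve_metric(
    meta: &IndexMetadata,
    open_index: impl FnOnce() -> DistanceType,
) -> DistanceType {
    metric_type_from_index_metadata(meta).unwrap_or_else(open_index)
}
```

Taking the fallback as a closure keeps it lazy: for an index with populated details the closure is never invoked, which is the point of the change.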
Comment thread protos/index.proto
}

// An unset compression oneof means flat / no quantization.
oneof compression {
Contributor

Maybe worth adding index_version for each one.
I've found it's complicated for vector indexes to maintain compatibility when introducing breaking changes, because it's hard to get the index version.

The scalar indexes already have index_version today.

Comment thread protos/index.proto
Comment thread protos/index.proto
wjones127 and others added 10 commits March 17, 2026 10:45
- Rename HnswIndexDetails -> HnswParameters (westonpace)
- Move imports to top of index.rs (westonpace)
- Add index_version field to VectorIndexDetails proto (BubbleCal)
- Add explicit FlatCompression message instead of using unset oneof (westonpace)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…etails

# Conflicts:
#	rust/lance/src/index/append.rs
#	rust/lance/src/index/vector/ivf.rs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
VectorIndexDetails moved from table.proto to index.proto, so the Python
binding needs to reference lance_index::pb instead of lance_table::format::pb.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a `map<string, string> runtime_hints` field to `VectorIndexDetails`
for storing optional build preferences that don't affect index structure
(e.g., KMeans iterations, shuffle concurrency, GPU accelerator). These
are needed so a background index rebuild process can reproduce the
original build configuration.

Keys use reverse-DNS namespacing: `lance.*` for core Lance hints,
`lancedb.*` for LanceDB-specific hints (e.g., `lancedb.accelerator`).
Runtimes that don't recognize a key must silently ignore it. Only
non-default values are written to keep the map minimal.

Also adds `apply_runtime_hints()` to read hints back into build params.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…uild

Adds `vector_params_from_details()` to reconstruct a full `VectorIndexParams`
from stored `VectorIndexDetails` (core spec fields + runtime hints). This
enables future index rebuild logic to reproduce the original build config
from the manifest without re-opening index files.

Also wires the `lance.skip_transpose` hint into `optimize_vector_indices_v2`
so incremental rebuilds honour the original skip_transpose preference rather
than silently reverting to false on each append.

Adds Python tests validating that non-default build params appear as
`runtime_hints` in `describe_indices()` output, and that default values
are omitted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `python/src/dataset.rs`: `max_iters` kwarg was not forwarded to
  `IvfBuildParams`/`PQBuildParams` in `prepare_vector_index_params`,
  so it was silently ignored and never stored as a runtime hint
- `python/src/indices.rs`: wrong proto path for `VectorIndexDetails`
  (`lance_table::format::pb` → `lance_index::pb`)
- `java/lance-jni/src/utils.rs`: missing `runtime_hints` field in
  `VectorIndexParams` struct literal

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the java label Apr 8, 2026
wjones127 (Contributor Author) commented

@westonpace @BubbleCal do you want to take another look? I changed up the format a little bit to handle other index parameters.


#[derive(Serialize)]
#[serde(tag = "type", rename_all = "lowercase")]
enum CompressionDetailsJson {
Contributor

missed Flat

StageParams::Hnsw(hnsw) => {
    if let Some(raw) = hints.get("lance.hnsw.prefetch_distance") {
        hnsw.prefetch_distance = if raw == "none" {
            None
Contributor

IIRC the default is Some(2)

Contributor

nvm, just noticed this is setting it to None

Labels

enhancement New feature or request java python

Development

Successfully merging this pull request may close these issues.

Add lightweight vector index metadata to VectorIndexDetails

3 participants