feat: add describe_indices function by westonpace · Pull Request #5221 · lance-format/lance

westonpace · 2025-11-12T14:29:27Z

The list_indices function relies on APIs from the index objects themselves. This means we need to load the indices to populate the information. In addition, the Python function uses the index statistics, which can be slow.

Rather than modify the existing method (which may introduce a breaking change), this creates a new method, describe_indices. This method only uses information available in the dataset manifest. This ensures that minimal I/O will be required (loading the manifest if it hasn't been loaded) and the call shouldn't be slow.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

chatgpt-codex-connector · 2025-11-12T14:33:45Z

+struct IndexDescriptionImpl {
+    metadata: IndexMetadata,
+    details: IndexDetails,
+    rows_indexed: u64,
+}
+
+impl IndexDescriptionImpl {
+    fn try_new(metadata: IndexMetadata, dataset: &Dataset) -> Result<Self> {
+        // This should not fail as we should have already filtered out indexes without index details.
+        let index_details = metadata.index_details.as_ref().ok_or(Error::Index {
+            message:
+                "Index details are required for index description.  This index must be retrained to support this method."
+                    .to_string(),
+            location: location!(),
+        })?;
+        let fragment_bitmap = metadata
+            .fragment_bitmap
+            .as_ref()
+            .ok_or_else(|| Error::Index {
+                message: "Fragment bitmap is required for index description.  This index must be retrained to support this method.".to_string(),
+                location: location!(),
+            })?;
+        let details = IndexDetails(index_details.clone());
+        let mut rows_indexed = 0;
+
+        for fragment in dataset.get_fragments() {
+            if fragment_bitmap.contains(fragment.id() as u32) {
+                rows_indexed += fragment.fast_physical_rows()? as u64;
+            }
+        }


Handle missing physical row counts in describe_indexes

The new index description path sums fragment.fast_physical_rows() for every fragment referenced by an index. fast_physical_rows deliberately errors unless both the dataset manifest and fragment metadata contain a stored physical row count (see its implementation in dataset/fragment.rs). For datasets or indexes written by older Lance versions this metadata is absent, so calling describe_indexes() will immediately return an error even though the index itself is still usable. This makes the API unusable for existing data instead of gracefully degrading. Consider falling back to physical_rows().await, skipping the count, or returning None when the metadata is missing.

This was intentional. I consider this to be a new feature and want to keep the API as synchronous as possible to avoid accidentally introducing slow I/O.

rpgreen · 2025-11-12T14:34:04Z

    def index_statistics(self, index_name: str) -> str: ...
    def serialized_manifest(self) -> bytes: ...
    def load_indices(self) -> List[Index]: ...
+    def describe_indexes(self) -> List[IndexDescription]: ...


Naming nit: is there a reason we're going with "indexes" vs "indices"?

We probably should choose a plural form and be consistent everywhere. My preference would be "indices"

Agree we should pick one and be consistent. I was in team "indices" for a while but was persuaded a while back that "indices" is math/stats and "indexes" is more db historical.

I don't have strong feelings here

I'll make an issue / discussion for this.

I've gone back to describe_indices as I think we use indices in too many places to try and change this.

wjones127 · 2025-11-12T17:22:23Z

+    async fn describe_indexes<'a, 'b>(
+        &'a self,
+        criteria: Option<IndexCriteria<'b>>,
+    ) -> Result<Vec<Arc<dyn IndexDescription>>> {


One thing I wanted to do with a new list indexes API is aggregate the index deltas, so we return just one entry per index name. What would you think of doing that here?

Yes, I think that is a good idea. I'll do that.

This would change the IndexDescription to something like:

struct IndexDescriptionImpl {
    name: String,
    fragments: Vec<IndexMetadata>,
    details: IndexDetails,
    rows_indexed: u64,
}

Updated. Now we have:

struct IndexDescriptionImpl {
    name: String,
    field_ids: Vec<u32>,
    segments: Vec<IndexMetadata>,
    index_type: String,
    details: IndexDetails,
    rows_indexed: u64,
}

I'm assuming that the field_ids and index_type will be consistent across all ~~shards~~ segments.

I'm also assuming the details will be consistent across all segments. This may not be true at some point in the future. However, when we get there, I think we will have "common details" (for the whole index) and segment details (for individual segments) so I think it's ok to still have details at the index-level.
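The aggregation discussed in this thread (one entry per index name, with the deltas collected as segments) can be sketched roughly as follows. This is only an illustration in Python, not the code from this PR: `SegmentMetadata`, `IndexDescription`, `aggregate_segments`, the `rows` field, and the index type strings are hypothetical stand-ins, and the consistency check simply encodes the assumption stated above that `field_ids` and `index_type` agree across segments.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentMetadata:
    # Hypothetical stand-in for one manifest entry (one index delta).
    name: str
    uuid: str
    field_ids: list
    index_type: str
    rows: int


@dataclass
class IndexDescription:
    # Hypothetical stand-in for the aggregated, per-name description.
    name: str
    field_ids: list
    index_type: str
    segments: list = field(default_factory=list)
    rows_indexed: int = 0


def aggregate_segments(segments):
    """Group segments by index name, preserving first-seen order."""
    by_name = {}
    for seg in segments:
        desc = by_name.get(seg.name)
        if desc is None:
            desc = IndexDescription(seg.name, list(seg.field_ids), seg.index_type)
            by_name[seg.name] = desc
        elif desc.field_ids != seg.field_ids or desc.index_type != seg.index_type:
            # The thread assumes these match across all segments of one index.
            raise ValueError(f"inconsistent segments for index {seg.name!r}")
        desc.segments.append(seg)
        desc.rows_indexed += seg.rows
    return list(by_name.values())


# Illustrative data only: two deltas of one index plus a second index.
example = [
    SegmentMetadata("vec_idx", "uuid-a", [1], "IVF_PQ", 100),
    SegmentMetadata("vec_idx", "uuid-b", [1], "IVF_PQ", 50),
    SegmentMetadata("id_idx", "uuid-c", [0], "BTREE", 150),
]
descriptions = aggregate_segments(example)
```

With this shape, a caller sees two descriptions for the three manifest entries, and the per-segment information stays available under `segments`.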

codecov-commenter · 2025-11-14T00:55:36Z

Codecov Report

❌ Patch coverage is 30.82192% with 202 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.12%. Comparing base (975c59c) to head (5e05a7f).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
rust/lance/src/index.rs	0.00%	117 Missing ⚠️
rust/lance-index/src/scalar/inverted/tokenizer.rs	0.00%	20 Missing ⚠️
rust/lance/src/dataset/fragment.rs	0.00%	17 Missing ⚠️
rust/lance-index/src/scalar/json.rs	6.25%	15 Missing ⚠️
rust/lance-index/src/scalar/inverted.rs	0.00%	8 Missing ⚠️
rust/lance-index/src/registry.rs	92.15%	4 Missing ⚠️
rust/lance-index/src/scalar/bitmap.rs	0.00%	3 Missing ⚠️
rust/lance-index/src/scalar/bloomfilter.rs	0.00%	3 Missing ⚠️
rust/lance-index/src/scalar/btree.rs	0.00%	3 Missing ⚠️
rust/lance-index/src/scalar/label_list.rs	0.00%	3 Missing ⚠️
... and 4 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5221      +/-   ##
==========================================
- Coverage   82.21%   82.12%   -0.10%     
==========================================
  Files         344      345       +1     
  Lines      144901   145054     +153     
  Branches   144901   145054     +153     
==========================================
- Hits       119135   119125      -10     
- Misses      21836    21994     +158     
- Partials     3930     3935       +5

Flag	Coverage Δ
unittests	`82.12% <30.82%> (-0.10%)`	⬇️


wjones127

Thanks for working on this, it's an excellent refactor of the API.

Two main concerns I'd like to see addressed:

1. Handling of deletions in rows_indexed
2. Double-encoding of index details in JSON

wjones127 · 2025-11-14T17:15:54Z

+
+class IndexSegmentDescription:
+    uuid: str
+    dataset_version: int


Do we want a better name than this? As-is, I think it's unclear which version this is supposed to be.

Probably a good idea since I don't even know the answer. I think it's the version the index was trained against? I'll look it up.

Yeah I think so too.

wjones127 · 2025-11-14T17:17:13Z

+class IndexSegmentDescription:
+    uuid: str
+    dataset_version: int
+    fragment_ids: list[int]


nit: should this be a set? I say this because I made the fragment bitmap a set over here:

https://github.com/lancedb/lance/blob/f48bbd9cd0885b9b96b578eb967cc7eaa270b409/python/python/lance/dataset.py#L3698

I don't see any harm in it but I also don't see the reasoning? Is it because we assume users will want to do set-like operations on this?

Yeah, that's my thinking. I don't feel strongly about this.

Ok, switched to set
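As a small illustration of the set-like operations mentioned in this thread, the sketch below computes which fragments are not covered by any segment of an index. The fragment ids are made up, and the variable names are hypothetical rather than part of the Lance API.

```python
# Hypothetical fragment_ids sets for two segments of the same index.
segment_fragment_ids = [{0, 1, 2}, {3}]

# Hypothetical ids of all fragments currently in the dataset.
dataset_fragment_ids = {0, 1, 2, 3, 4, 5}

# Union of everything the index covers, then set difference.
covered = set().union(*segment_fragment_ids)
unindexed = dataset_fragment_ids - covered
print(sorted(unindexed))
```

With lists, the same question would need explicit membership loops or a conversion to sets first.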

wjones127 · 2025-11-14T17:17:53Z

+    fields: list[int]
+    field_names: list[str]
+    segments: list[IndexSegmentDescription]
+    details: str


Are details a string? Or bytes?

Oh it's JSON. I wonder if we should parse it eagerly for them or if that's a bad idea.

Parse it into what? Python dicts?

Yeah Python dicts. Maybe a bad idea, but could be nice to do for them. Or at least specify in the docstring or something it is JSON.

I went ahead and converted it to a Python dict.

wjones127 · 2025-11-14T17:26:17Z

+    # This is currently Unknown because vector indices are not yet handled by plugins
+    assert info.index_type == "Unknown"


This seems unfortunate. Hopefully we can fix this very soon!

Yeah, the current implementation (list_indices) can actually determine the type for vector indexes so this is a bit of a regression but I think it involves opening the index and I'd like to be able to do it from the details / manifest only.

wjones127 · 2025-11-14T17:38:56Z

+        Ok(serde_json::json!({
+            "path": json_details.path,
+            "target_details": target_details_json,
+        })


The nested details means that just one json.loads() later won't be enough, which is annoying.

Could avoid this by parsing the target_details_json into serde_json::Value before putting it in the target_details field.

That's a good suggestion. Will do.
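To make the double-encoding concern concrete, here is a minimal sketch using only Python's standard `json` module. The `path` value and the contents of `target_details` are invented for illustration; only the two key names come from the snippet above.

```python
import json

# Hypothetical inner details for the target index.
target_details = {"lower_case": True}

# Double-encoded: the inner details are serialized to a string first,
# then embedded as a string value in the outer document.
double_encoded = json.dumps(
    {"path": "a.b", "target_details": json.dumps(target_details)}
)

# Nested: the inner details are embedded as a JSON object.
nested = json.dumps({"path": "a.b", "target_details": target_details})

decoded_double = json.loads(double_encoded)
decoded_nested = json.loads(nested)

# One json.loads() is not enough for the double-encoded form.
inner_from_double = json.loads(decoded_double["target_details"])
```

After a single `json.loads()`, `decoded_double["target_details"]` is still a `str`, while `decoded_nested["target_details"]` is already a `dict`. Parsing the inner JSON into a value before embedding it, as suggested above, gives the second behavior.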

wjones127 · 2025-11-14T18:07:12Z

+
+            for fragment in dataset.get_fragments() {
+                if fragment_bitmap.contains(fragment.id() as u32) {
+                    rows_indexed += fragment.fast_physical_rows()? as u64;


Should we subtract the number of deleted rows in that fragment? Otherwise I worry this metric will be confusing for users.

Imagine:

1. Write 1000 rows
2. Create index
3. Delete every other row

ds.count_rows() will report 500 rows. index.rows_indexed() will report 1000 rows.

We do document that this overcounts. Do we record the number of deleted rows per fragment in the manifest? I didn't want to have to load the deletion vector.

Ah, yep, it is in the deletion file metadata. I can fix this up.

Now includes deletion count in the row count calculation
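The calculation discussed in this thread can be sketched as below. This is a hedged Python illustration, not the Rust code from the PR: `FragmentMeta`, its fields, and `rows_indexed` are hypothetical names, and the sketch assumes that both the physical row count and the deleted row count are available from manifest metadata without loading the deletion vector.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FragmentMeta:
    # Hypothetical view of the per-fragment metadata stored in the manifest.
    id: int
    physical_rows: Optional[int]  # None if written without a stored count
    num_deleted_rows: int = 0     # assumed to come from deletion file metadata


def rows_indexed(fragments, fragment_bitmap):
    """Sum live rows over the fragments covered by the index."""
    total = 0
    for frag in fragments:
        if frag.id not in fragment_bitmap:
            continue
        if frag.physical_rows is None:
            # Mirrors the choice in this PR to error rather than do extra I/O.
            raise ValueError(f"fragment {frag.id} has no stored physical row count")
        total += frag.physical_rows - frag.num_deleted_rows
    return total


# The scenario from the review: 1000 rows written, indexed, then half deleted.
fragments = [
    FragmentMeta(id=0, physical_rows=1000, num_deleted_rows=500),
    FragmentMeta(id=1, physical_rows=200),  # appended later, not in the index
]
live_rows = rows_indexed(fragments, fragment_bitmap={0})
```

Under these assumptions the result is 500 for the scenario above, which matches what `ds.count_rows()` reports for the indexed fragment, instead of the 1000 that the physical count alone would give.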

wjones127

Excellent work! Thank you.

github-actions Bot added enhancement New feature or request python labels Nov 12, 2025

chatgpt-codex-connector Bot reviewed Nov 12, 2025

rpgreen reviewed Nov 12, 2025

westonpace mentioned this pull request Nov 12, 2025

Refactor: we should move DatasetIndexExt into the lance crate #5222

Open

wjones127 reviewed Nov 12, 2025

westonpace changed the title ~~feat: add describe_indexes function, deprecate list_indices~~ feat: add describe_indices function Nov 13, 2025

westonpace mentioned this pull request Nov 14, 2025

Deprecate list_indices #5237

Closed

westonpace force-pushed the feat/no-io-describe-indexes branch from 3940f84 to f727d5a Compare November 14, 2025 00:04

westonpace requested review from rpgreen and wjones127 November 14, 2025 04:54

wjones127 requested changes Nov 14, 2025

westonpace added 7 commits November 17, 2025 05:53

Add describe_indexes function

973a908

Address PR feedback

a419983

Address clippy suggestions

1f691f2

Rename describe_indexes to describe_indices

a533cc0

shard->segment

5289ebf

Add license header

e942829

Address PR review comments

9bdbe65

westonpace force-pushed the feat/no-io-describe-indexes branch from c70b201 to 9bdbe65 Compare November 17, 2025 13:53

westonpace requested a review from wjones127 November 17, 2025 13:53

wjones127 approved these changes Nov 17, 2025

wjones127 mentioned this pull request Nov 17, 2025

refactor: write bitmap index statistics in file instead #5251

Merged

Fix unit test after api change

5e05a7f

westonpace merged commit 1024091 into lance-format:main Nov 18, 2025
27 of 29 checks passed

andrea-reale mentioned this pull request Mar 30, 2026

emilk/fix write starvation rerun-io/lance#12

Closed
