feat: add describe_indices function#5221

Merged
westonpace merged 8 commits intolance-format:mainfrom
westonpace:feat/no-io-describe-indexes
Nov 18, 2025

Member

@westonpace westonpace commented Nov 12, 2025

The list_indices function relies on APIs from the index objects themselves, so the indices must be loaded to populate the information. In addition, the Python function uses the index statistics, which can be slow.

Rather than modify the existing method (which may introduce a breaking change), this creates a new method, describe_indices. This method only uses information available in the dataset manifest. As a result, the only I/O required is loading the manifest (if it hasn't been loaded already), and the call shouldn't be slow.

@github-actions github-actions Bot added the enhancement (New feature or request) and python labels Nov 12, 2025

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment thread rust/lance/src/index.rs
Comment on lines +373 to +402
struct IndexDescriptionImpl {
    metadata: IndexMetadata,
    details: IndexDetails,
    rows_indexed: u64,
}

impl IndexDescriptionImpl {
    fn try_new(metadata: IndexMetadata, dataset: &Dataset) -> Result<Self> {
        // This should not fail as we should have already filtered out indexes without index details.
        let index_details = metadata.index_details.as_ref().ok_or(Error::Index {
            message:
                "Index details are required for index description. This index must be retrained to support this method."
                    .to_string(),
            location: location!(),
        })?;
        let fragment_bitmap = metadata
            .fragment_bitmap
            .as_ref()
            .ok_or_else(|| Error::Index {
                message: "Fragment bitmap is required for index description. This index must be retrained to support this method.".to_string(),
                location: location!(),
            })?;
        let details = IndexDetails(index_details.clone());
        let mut rows_indexed = 0;

        for fragment in dataset.get_fragments() {
            if fragment_bitmap.contains(fragment.id() as u32) {
                rows_indexed += fragment.fast_physical_rows()? as u64;
            }
        }

P1: Handle missing physical row counts in describe_indexes

The new index description path sums fragment.fast_physical_rows() for every fragment referenced by an index. fast_physical_rows deliberately errors unless both the dataset manifest and fragment metadata contain a stored physical row count (see its implementation in dataset/fragment.rs). For datasets or indexes written by older Lance versions this metadata is absent, so calling describe_indexes() will immediately return an error even though the index itself is still usable. This makes the API unusable for existing data instead of gracefully degrading. Consider falling back to physical_rows().await, skipping the count, or returning None when the metadata is missing.


Member Author

This was intentional. I consider this to be a new feature and want to keep the API as synchronous as possible to avoid accidentally introducing slow I/O.
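
The trade-off discussed in this thread can be sketched in a few lines of Python. This is only an illustration: `FragmentMeta`, `rows_indexed_strict`, and `rows_indexed_lenient` are hypothetical names, not Lance APIs. The strict variant mirrors the fail-fast behavior the author chose, and the lenient variant shows the "return None" alternative the bot suggested.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class FragmentMeta:
    """Hypothetical stand-in for the per-fragment metadata stored in a manifest."""
    id: int
    physical_rows: Optional[int]  # None models data written without a stored count


def rows_indexed_strict(fragments: Iterable[FragmentMeta], indexed_ids: Set[int]) -> int:
    """Sum stored row counts for indexed fragments; raise if any count is missing."""
    total = 0
    for frag in fragments:
        if frag.id not in indexed_ids:
            continue
        if frag.physical_rows is None:
            raise ValueError(f"fragment {frag.id} has no stored physical row count")
        total += frag.physical_rows
    return total


def rows_indexed_lenient(fragments: Iterable[FragmentMeta], indexed_ids: Set[int]) -> Optional[int]:
    """Same sum, but report None instead of failing when a count is missing."""
    try:
        return rows_indexed_strict(fragments, indexed_ids)
    except ValueError:
        return None
```

Neither variant performs I/O; they differ only in how missing metadata is surfaced to the caller.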

Comment thread python/python/lance/lance/__init__.pyi Outdated
def index_statistics(self, index_name: str) -> str: ...
def serialized_manifest(self) -> bytes: ...
def load_indices(self) -> List[Index]: ...
def describe_indexes(self) -> List[IndexDescription]: ...

Contributor

Naming nit: is there a reason we're going with "indexes" vs "indices"?

We probably should choose a plural form and be consistent everywhere. My preference would be "indices"

Member Author

@westonpace westonpace Nov 12, 2025

Agree we should pick one and be consistent. I was on team "indices" for a while but was persuaded a while back that "indices" is the math/stats usage and "indexes" is the more traditional database usage.

I don't have strong feelings here.

I'll make an issue / discussion for this.

Member Author

I've gone back to describe_indices as I think we use indices in too many places to try and change this.

Comment thread rust/lance/src/index.rs Outdated
Comment on lines +539 to +542
async fn describe_indexes<'a, 'b>(
    &'a self,
    criteria: Option<IndexCriteria<'b>>,
) -> Result<Vec<Arc<dyn IndexDescription>>> {

Contributor

One thing I wanted to do with a new list indexes API is aggregate the index deltas, so we return just one entry per index name. What would you think of doing that here?

Member Author

Yes, I think that is a good idea. I'll do that.

Contributor

This would change the IndexDescription to something like:

struct IndexDescriptionImpl {
    name: String,
    fragments: Vec<IndexMetadata>,
    details: IndexDetails,
    rows_indexed: u64,
}

Member Author

@westonpace westonpace Nov 14, 2025

Updated. Now we have:

struct IndexDescriptionImpl {
    name: String,
    field_ids: Vec<u32>,
    segments: Vec<IndexMetadata>,
    index_type: String,
    details: IndexDetails,
    rows_indexed: u64,
}

Member Author

@westonpace westonpace Nov 14, 2025

I'm assuming that the field_ids and index_type will be consistent across all segments.

Member Author

I'm also assuming the details will be consistent across all segments. This may not be true at some point in the future. However, when we get there, I think we will have "common details" (for the whole index) and segment details (for individual segments) so I think it's ok to still have details at the index-level.
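
As a rough illustration of the aggregation discussed in this thread (one entry per index name, with the deltas kept as segments), here is a minimal Python sketch. `SegmentMeta`, `IndexSummary`, and `aggregate_by_name` are hypothetical names for illustration and do not reflect the actual Rust implementation. The sketch also makes the consistency assumption explicit by raising when two segments of the same index disagree on field ids.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class SegmentMeta:
    """Hypothetical per-segment (delta) metadata as recorded in a manifest."""
    name: str
    uuid: str
    field_ids: List[int]


@dataclass
class IndexSummary:
    """Hypothetical aggregated view: one entry per index name."""
    name: str
    field_ids: List[int]
    segments: List[SegmentMeta]


def aggregate_by_name(segments: Iterable[SegmentMeta]) -> List[IndexSummary]:
    """Group segments by index name, preserving first-seen order."""
    grouped: Dict[str, IndexSummary] = {}
    for seg in segments:
        summary = grouped.get(seg.name)
        if summary is None:
            grouped[seg.name] = IndexSummary(seg.name, list(seg.field_ids), [seg])
            continue
        # Assumption from the thread: field ids agree across all segments.
        if summary.field_ids != seg.field_ids:
            raise ValueError(f"inconsistent field ids for index {seg.name!r}")
        summary.segments.append(seg)
    return list(grouped.values())
```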

@westonpace westonpace changed the title from "feat: add describe_indexes function, deprecate list_indices" to "feat: add describe_indices function" Nov 13, 2025
@westonpace westonpace force-pushed the feat/no-io-describe-indexes branch from 3940f84 to f727d5a Compare November 14, 2025 00:04

codecov-commenter commented Nov 14, 2025

Codecov Report

❌ Patch coverage is 30.82192% with 202 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.12%. Comparing base (975c59c) to head (5e05a7f).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/index.rs | 0.00% | 117 Missing ⚠️ |
| rust/lance-index/src/scalar/inverted/tokenizer.rs | 0.00% | 20 Missing ⚠️ |
| rust/lance/src/dataset/fragment.rs | 0.00% | 17 Missing ⚠️ |
| rust/lance-index/src/scalar/json.rs | 6.25% | 15 Missing ⚠️ |
| rust/lance-index/src/scalar/inverted.rs | 0.00% | 8 Missing ⚠️ |
| rust/lance-index/src/registry.rs | 92.15% | 4 Missing ⚠️ |
| rust/lance-index/src/scalar/bitmap.rs | 0.00% | 3 Missing ⚠️ |
| rust/lance-index/src/scalar/bloomfilter.rs | 0.00% | 3 Missing ⚠️ |
| rust/lance-index/src/scalar/btree.rs | 0.00% | 3 Missing ⚠️ |
| rust/lance-index/src/scalar/label_list.rs | 0.00% | 3 Missing ⚠️ |

... and 4 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5221      +/-   ##
==========================================
- Coverage   82.21%   82.12%   -0.10%     
==========================================
  Files         344      345       +1     
  Lines      144901   145054     +153     
  Branches   144901   145054     +153     
==========================================
- Hits       119135   119125      -10     
- Misses      21836    21994     +158     
- Partials     3930     3935       +5     
Flag Coverage Δ
unittests 82.12% <30.82%> (-0.10%) ⬇️


Contributor

@wjones127 wjones127 left a comment

Thanks for working on this, it's an excellent refactor of the API.

Two main concerns I'd like to see addressed:

  • Handling of deletions in rows_indexed
  • Double-encoding of index details in JSON


class IndexSegmentDescription:
    uuid: str
    dataset_version: int

Contributor

Do we want a better name than this? As-is, I think it's unclear which version this is supposed to be.

Member Author

Probably a good idea since I don't even know the answer. I think it's the version the index was trained against? I'll look it up.

Contributor

Yeah I think so too.

class IndexSegmentDescription:
    uuid: str
    dataset_version: int
    fragment_ids: list[int]

Contributor

nit: should this be a set? I say this because I made the fragment bitmap a set over here:

https://github.com/lancedb/lance/blob/f48bbd9cd0885b9b96b578eb967cc7eaa270b409/python/python/lance/dataset.py#L3698

Member Author

I don't see any harm in it but I also don't see the reasoning? Is it because we assume users will want to do set-like operations on this?

Contributor

Yeah, that's my thinking. I don't feel strongly about this.

Member Author

Ok, switched to set
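
To illustrate the kind of set-like operation mentioned above, here is a small sketch that finds fragments not covered by any segment of an index. `unindexed_fragments` and its inputs are hypothetical and stand in for whatever fragment ids the dataset and the segment descriptions expose.

```python
from typing import Iterable, Set


def unindexed_fragments(all_fragment_ids: Set[int], segment_fragment_ids: Iterable[Set[int]]) -> Set[int]:
    """Return the fragment ids that no index segment covers."""
    # Union of every segment's fragment set; an empty iterable yields an empty set.
    covered: Set[int] = set().union(*segment_fragment_ids)
    return all_fragment_ids - covered
```

With sets, coverage checks like this are a single difference or intersection rather than a loop over lists.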

fields: list[int]
field_names: list[str]
segments: list[IndexSegmentDescription]
details: str

Contributor

Are details a string? Or bytes?

Contributor

Oh it's JSON. I wonder if we should parse it eagerly for them or if that's a bad idea.

Member Author

Parse it into what? Python dicts?

Contributor

Yeah Python dicts. Maybe a bad idea, but could be nice to do for them. Or at least specify in the docstring or something it is JSON.

Member Author

I went ahead and converted it to a Python dict.

Comment on lines +1399 to +1400
# This is currently Unknown because vector indices are not yet handled by plugins
assert info.index_type == "Unknown"

Contributor

This seems unfortunate. Hopefully we can fix this very soon!

Member Author

Yeah, the current implementation (list_indices) can actually determine the type for vector indexes so this is a bit of a regression but I think it involves opening the index and I'd like to be able to do it from the details / manifest only.

Comment thread rust/lance-index/src/scalar/json.rs Outdated
Comment on lines +844 to +847
Ok(serde_json::json!({
    "path": json_details.path,
    "target_details": target_details_json,
})

Contributor

The nested details means that just one json.loads() later won't be enough, which is annoying.

Could avoid this by parsing the target_details_json into serde_json::Value before putting it in the target_details field.

Member Author

That's a good suggestion. Will do.

Member Author

Done
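
The double-encoding problem raised in this thread can be reproduced from the consumer side with Python's standard json module. The function names and the sample details dict below are hypothetical; the sketch only contrasts embedding the inner details as a JSON string with embedding them as a parsed value.

```python
import json
from typing import Any, Dict


def describe_double_encoded(path: str, target_details: Dict[str, Any]) -> str:
    """Embed the inner details as a JSON *string* (the problematic shape)."""
    return json.dumps({"path": path, "target_details": json.dumps(target_details)})


def describe_nested(path: str, target_details: Dict[str, Any]) -> str:
    """Embed the inner details as a nested JSON value (the suggested shape)."""
    return json.dumps({"path": path, "target_details": target_details})


inner = {"kind": "example"}  # hypothetical inner index details

# Double-encoded: the first loads() still leaves a string behind.
first_pass = json.loads(describe_double_encoded("a.b", inner))
needs_second_pass = isinstance(first_pass["target_details"], str)

# Nested: a single loads() yields the full structure.
single_pass = json.loads(describe_nested("a.b", inner))
```

In the nested form, `single_pass["target_details"]` is already a dict, which is what a caller doing one `json.loads()` would expect.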

Comment thread rust/lance/src/index.rs Outdated

for fragment in dataset.get_fragments() {
    if fragment_bitmap.contains(fragment.id() as u32) {
        rows_indexed += fragment.fast_physical_rows()? as u64;

Contributor

Should we subtract the number of deleted rows in that fragment? Otherwise I worry this metric will be confusing for users.

Imagine:

  • Write 1000 rows
  • Create index
  • Delete every other row

ds.count_rows() will report 500 rows. index.rows_indexed() will report 1000 rows.

Member Author

We do document that this overcounts. Do we record the number of deleted rows per fragment in the manifest? I didn't want to have to load the deletion vector.

Member Author

Ah, yep, it is in the deletion file metadata. I can fix this up.

Member Author

Now includes deletion count in the row count calculation
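
A minimal sketch of the adjusted calculation, assuming each fragment's metadata exposes a physical row count and a deleted-row count (the names `FragmentCounts`, `num_deleted_rows`, and `rows_indexed` are hypothetical, not the actual Lance fields). It uses the scenario from the review: 1000 rows written, indexed, then every other row deleted.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class FragmentCounts:
    """Hypothetical per-fragment counts available without loading deletion vectors."""
    id: int
    physical_rows: int
    num_deleted_rows: int = 0


def rows_indexed(fragments: Iterable[FragmentCounts], indexed_ids: Set[int]) -> int:
    """Count live rows (physical minus deleted) in fragments covered by the index."""
    return sum(
        frag.physical_rows - frag.num_deleted_rows
        for frag in fragments
        if frag.id in indexed_ids
    )


# 1000 rows written and indexed, then every other row deleted.
fragments = [FragmentCounts(id=0, physical_rows=1000, num_deleted_rows=500)]
live_rows_indexed = rows_indexed(fragments, {0})
```

Under these assumptions the reported count matches the 500 live rows rather than the 1000 physical rows.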

@westonpace westonpace force-pushed the feat/no-io-describe-indexes branch from c70b201 to 9bdbe65 Compare November 17, 2025 13:53
@westonpace westonpace requested a review from wjones127 November 17, 2025 13:53
Contributor

@wjones127 wjones127 left a comment

Excellent work! Thank you.

@westonpace westonpace merged commit 1024091 into lance-format:main Nov 18, 2025
27 of 29 checks passed
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026

4 participants