feat: add describe_indices function#5221
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| struct IndexDescriptionImpl { | ||
| metadata: IndexMetadata, | ||
| details: IndexDetails, | ||
| rows_indexed: u64, | ||
| } | ||
|
|
||
| impl IndexDescriptionImpl { | ||
| fn try_new(metadata: IndexMetadata, dataset: &Dataset) -> Result<Self> { | ||
| // This should not fail as we should have already filtered out indexes without index details. | ||
| let index_details = metadata.index_details.as_ref().ok_or(Error::Index { | ||
| message: | ||
| "Index details are required for index description. This index must be retrained to support this method." | ||
| .to_string(), | ||
| location: location!(), | ||
| })?; | ||
| let fragment_bitmap = metadata | ||
| .fragment_bitmap | ||
| .as_ref() | ||
| .ok_or_else(|| Error::Index { | ||
| message: "Fragment bitmap is required for index description. This index must be retrained to support this method.".to_string(), | ||
| location: location!(), | ||
| })?; | ||
| let details = IndexDetails(index_details.clone()); | ||
| let mut rows_indexed = 0; | ||
|
|
||
| for fragment in dataset.get_fragments() { | ||
| if fragment_bitmap.contains(fragment.id() as u32) { | ||
| rows_indexed += fragment.fast_physical_rows()? as u64; | ||
| } | ||
| } |
There was a problem hiding this comment.
Handle missing physical row counts in describe_indexes
The new index description path sums fragment.fast_physical_rows() for every fragment referenced by an index. fast_physical_rows deliberately errors unless both the dataset manifest and fragment metadata contain a stored physical row count (see its implementation in dataset/fragment.rs). For datasets or indexes written by older Lance versions this metadata is absent, so calling describe_indexes() will immediately return an error even though the index itself is still usable. This makes the API unusable for existing data instead of gracefully degrading. Consider falling back to physical_rows().await, skipping the count, or returning None when the metadata is missing.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This was intentional. I consider this to be a new feature and want to keep the API as synchronous as possible to avoid accidentally introducing slow I/O.
| def index_statistics(self, index_name: str) -> str: ... | ||
| def serialized_manifest(self) -> bytes: ... | ||
| def load_indices(self) -> List[Index]: ... | ||
| def describe_indexes(self) -> List[IndexDescription]: ... |
There was a problem hiding this comment.
Naming nit: is there a reason we're going with "indexes" vs "indices"?
We probably should choose a plural form and be consistent everywhere. My preference would be "indices"
There was a problem hiding this comment.
Agree we should pick one and be consistent. I was in team "indices" for a while but was persuaded a while back that "indices" is math/stats and "indexes" is more db historical.
I don't have strong feelings here
I'll make an issue / discussion for this.
There was a problem hiding this comment.
I've gone back to describe_indices as I think we use indices in too many places to try and change this.
| async fn describe_indexes<'a, 'b>( | ||
| &'a self, | ||
| criteria: Option<IndexCriteria<'b>>, | ||
| ) -> Result<Vec<Arc<dyn IndexDescription>>> { |
There was a problem hiding this comment.
One thing I wanted to do with a new list indexes API is aggregate the index deltas, so we return just one entry per index name. What would you think of doing that here?
There was a problem hiding this comment.
Yes, I think that is a good idea. I'll do that.
There was a problem hiding this comment.
This would change the IndexDescription to something like:
struct IndexDescriptionImpl {
name: String,
fragments: Vec<IndexMetadata>,
details: IndexDetails,
rows_indexed: u64,
}There was a problem hiding this comment.
Updated. Now we have:
struct IndexDescriptionImpl {
name: String,
field_ids: Vec<u32>,
segments: Vec<IndexMetadata>,
index_type: String,
details: IndexDetails,
rows_indexed: u64,
}
There was a problem hiding this comment.
I'm assuming that the field_ids and index_type will be consistent across all shards segments.
There was a problem hiding this comment.
I'm also assuming the details will be consistent across all segments. This may not be true at some point in the future. However, when we get there, I think we will have "common details" (for the whole index) and segment details (for individual segments) so I think it's ok to still have details at the index-level.
3940f84 to
f727d5a
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5221 +/- ##
==========================================
- Coverage 82.21% 82.12% -0.10%
==========================================
Files 344 345 +1
Lines 144901 145054 +153
Branches 144901 145054 +153
==========================================
- Hits 119135 119125 -10
- Misses 21836 21994 +158
- Partials 3930 3935 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
wjones127
left a comment
There was a problem hiding this comment.
Thanks for working on this, it's an excellent refactor of the API.
Two main concerns I'd like to see addressed:
- Handling of deletions in
rows_indexed - Double-encoding of index details in JSON
|
|
||
| class IndexSegmentDescription: | ||
| uuid: str | ||
| dataset_version: int |
There was a problem hiding this comment.
Do we want a better name that this? As-is, I think it's unclear which version this is supposed to be.
There was a problem hiding this comment.
Probably a good idea since I don't even know the answer. I think it's the version the index was trained against? I'll look it up.
| class IndexSegmentDescription: | ||
| uuid: str | ||
| dataset_version: int | ||
| fragment_ids: list[int] |
There was a problem hiding this comment.
nit: should this be a set? I say this because I made the fragment bitmap a set over here:
There was a problem hiding this comment.
I don't see any harm in it but I also don't see the reasoning? Is it because we assume users will want to do set-like operations on this?
There was a problem hiding this comment.
Yeah, that's my thinking. I don't feel strongly about this.
| fields: list[int] | ||
| field_names: list[str] | ||
| segments: list[IndexSegmentDescription] | ||
| details: str |
There was a problem hiding this comment.
Are details a string? Or bytes?
There was a problem hiding this comment.
Oh it's JSON. I wonder if we should parse it eagerly for them or if that's a bad idea.
There was a problem hiding this comment.
Parse it into what? Python dicts?
There was a problem hiding this comment.
Yeah Python dicts. Maybe a bad idea, but could be nice to do for them. Or at least specify in the docstring or something it is JSON.
There was a problem hiding this comment.
I went ahead and converted to python dict
| # This is currently Unknown because vector indices are not yet handled by plugins | ||
| assert info.index_type == "Unknown" |
There was a problem hiding this comment.
This seems unfortunate. Hopefully we can fix this very soon!
There was a problem hiding this comment.
Yeah, the current implementation (list_indices) can actually determine the type for vector indexes so this is a bit of a regression but I think it involves opening the index and I'd like to be able to do it from the details / manifest only.
| Ok(serde_json::json!({ | ||
| "path": json_details.path, | ||
| "target_details": target_details_json, | ||
| }) |
There was a problem hiding this comment.
The nested details means that just one json.loads() later won't be enough, which is annoying.
Could avoid this by parsing the target_details_json into serde_json::Value before putting it in the target_details field.
There was a problem hiding this comment.
That's a good suggestion. Will do.
|
|
||
| for fragment in dataset.get_fragments() { | ||
| if fragment_bitmap.contains(fragment.id() as u32) { | ||
| rows_indexed += fragment.fast_physical_rows()? as u64; |
There was a problem hiding this comment.
Should we subtract the number of deleted rows in that fragment? Otherwise I worry this metric will be confusing for users.
Imagine:
- Write 1000 rows
- Create index
- Delete every other row
ds.count_rows() will report 500 rows. index.rows_indexed() will report 1000 rows.
There was a problem hiding this comment.
We do document that this overcounts. Do we record the number of deleted rows per fragment in the manifest? I didn't want to have to load the deletion vector.
There was a problem hiding this comment.
Ah, yep, it is in the deletion file metadata. I can fix this up.
There was a problem hiding this comment.
Now includes deletion count in the row count calculation
c70b201 to
9bdbe65
Compare
wjones127
left a comment
There was a problem hiding this comment.
Excellent work! Thank you.
The list_indices function relies on APIs from the index objects themselves. This means we need to load the indices to populate the information. In addition, the python function uses the index statistics which can be slow. Rather than modify the existing method (which may introduce a breaking change) this creates a new method `describe_indices`. This method only uses information available in the dataset manifest. This ensures that minimal I/O will be required (loading the manifest if it hasn't been loaded) and the call shouldn't be slow.
The list_indices function relies on APIs from the index objects themselves. This means we need to load the indices to populate the information. In addition, the python function uses the index statistics which can be slow.
Rather than modify the existing method (which may introduce a breaking change) this creates a new method
describe_indices. This method only uses information available in the dataset manifest. This ensures that minimal I/O will be required (loading the manifest if it hasn't been loaded) and the call shouldn't be slow.