feat: support IVF_FLAT index#2418
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2418 +/- ##
==========================================
- Coverage 79.78% 79.47% -0.31%
==========================================
Files 205 205
Lines 56547 56975 +428
Branches 56547 56975 +428
==========================================
+ Hits 45114 45279 +165
- Misses 8753 9014 +261
- Partials 2680 2682 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
westonpace
left a comment
There was a problem hiding this comment.
This looks good. There's some dead code that it would be good to remove but nothing too concerning.
| // if self.rows_written == 0 { | ||
| // self.writer.shutdown().await?; | ||
| // return Ok(0); | ||
| // } |
There was a problem hiding this comment.
You can go ahead and remove these lines. It might be good to add a test but I can do that later too.
| fmt::Debug + Clone + Sized + DeepSizeOf + for<'a> Deserialize<'a> + Serialize | ||
| { | ||
| async fn load(reader: &FileReader) -> Result<Self>; | ||
| // fn try_from_json(s: &str) -> Result<Self>; |
|
|
||
| // fn try_from_batch( | ||
| // batch: RecordBatch, | ||
| // distance_type: DistanceType, | ||
| // metadata: &Self::Metadata, | ||
| // ) -> Result<Self>; |
|
|
||
| // fn try_from_json(metadata_str: &str) -> Result<Self> { | ||
| // serde_json::from_str(metadata_str).map_err(|_| Error::Index { | ||
| // message: format!("Failed to parse index metadata: {}", metadata_str), | ||
| // location: location!(), | ||
| // }) | ||
| // } |
| // impl<Q: Quantization> Clone for IvfQuantizationStorage<Q> { | ||
| // fn clone(&self) -> Self { | ||
| // Self { | ||
| // reader: self.reader.clone(), | ||
| // metric_type: self.metric_type, | ||
| // quantizer: self.quantizer.clone(), | ||
| // metadata: self.metadata.clone(), | ||
| // ivf: self.ivf.clone(), | ||
| // } | ||
| // } | ||
| // } |
| let batches = reader | ||
| .read_stream(ReadBatchParams::RangeFull, storage_size as u32, 1)? | ||
| .peekable() | ||
| .try_collect::<Vec<_>>() | ||
| .await?; | ||
| let batch = arrow::compute::concat_batches(&batches[0].schema(), batches.iter())?; |
There was a problem hiding this comment.
We should probably just add a read_batch method at some point. I'll try and do this soon.
| // index_writer.add_schema_metadata( | ||
| // IVF_PARTITION_KEY, | ||
| // serde_json::to_string(&partitions_metadata)?, | ||
| // ); |
There was a problem hiding this comment.
| // index_writer.add_schema_metadata( | |
| // IVF_PARTITION_KEY, | |
| // serde_json::to_string(&partitions_metadata)?, | |
| // ); |
| fn deep_size_of_children(&self, context: &mut deepsize::Context) -> usize { | ||
| self.uuid.deep_size_of_children(context) | ||
| // Skipping session since it is a weak ref | ||
| } |
There was a problem hiding this comment.
What about the other properties? E.g. storage, sub_index_cache, etc.?
There was a problem hiding this comment.
added storage, sub_index_cache is now for passing the the object safe requirement (we can't cache the partition in session's cache now), it will be removed soon
| // match query.key.data_type() { | ||
| // arrow::datatypes::DataType::Float16 => { | ||
|
|
||
| // } | ||
| // arrow::datatypes::DataType::Float32 => { | ||
| // part_index.search(query.key, query.k, (&query).into(), &storage, pre_filter) | ||
| // } | ||
| // arrow::datatypes::DataType::Float64 => { | ||
| // part_index.search(query.key, query.k, (&query).into(), &storage, pre_filter) | ||
| // } | ||
| // _ => Err(Error::Index { | ||
| // message: "unsupported data type".to_string(), | ||
| // location: Location::new(file!(), line!(), column!()), | ||
| // }), |
| async fn calculate_included_frags(&self) -> Result<RoaringBitmap> { | ||
| todo!() | ||
| // let mut frag_ids = RoaringBitmap::default(); | ||
| // let part_ids = 0..self.ivf.num_partitions(); | ||
| // for part_id in part_ids { | ||
| // let part = self.load_partition(part_id, false).await?; | ||
| // frag_ids |= part.calculate_included_frags().await?; | ||
| // } | ||
| // Ok(frag_ids) | ||
| } |
There was a problem hiding this comment.
You can probably (intentionally) skip this with a message (maybe use unimplemented instead of todo so its more obvious). This method was only needed for migrating older manifests that did not correctly track the fragment bitmap. New indices should not have to worry about this.
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
| self.write_pages(encoding_tasks).await?; | ||
|
|
||
| // No data, so don't create a file | ||
| if self.rows_written == 0 { |
There was a problem hiding this comment.
for IVF_FLAT, the FLAT index file is empty, but we still need to store some metadata (IVF related)
| .try_collect::<Vec<_>>() | ||
| .await?; | ||
| let schema = Arc::new(self.reader.schema().as_ref().into()); | ||
| let batch = concat_batches(&schema, batches.iter())?; |
| .await | ||
| } | ||
|
|
||
| (0, 3) => { |
There was a problem hiding this comment.
do we need to bump versino everytime a new index is added?
There was a problem hiding this comment.
the new index is based on v2 lance format, which is not compatible with v1, so have to bump version
| String::from_utf8(storage_ivf_pb.encode_to_vec()).map_err(|e| { | ||
| Error::io(format!("failed to encode IVF metadata: {}", e), location!()) | ||
| })?, | ||
| hex::encode(storage_ivf_pb.encode_to_vec()), |
There was a problem hiding this comment.
why do we use hex encoding here? can we just write some random data? We do have too many dependencies now.
There was a problem hiding this comment.
we need hex for converting the bytes to string cause the schema metadata can accept String only, any recommendations here?
There was a problem hiding this comment.
Can we just write it to the file as an array, and keep track the offset in the metadata instead?
There was a problem hiding this comment.
@westonpace can we do this with v2 format?
exposing object_writer seems tricky, maybe we need some better API to do this
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
| let dist_calc = storage.dist_calculator_from_native(query); | ||
| let dist_calc = storage.dist_calculator(query); | ||
| let filtered_row_ids = prefilter | ||
| .filter_row_ids(Box::new(storage.row_ids())) |
There was a problem hiding this comment.
is it the case even Prefilter is empty, we need to iterate over billions of Row IDs anyway?
| DataType::Float32 => { | ||
| let vector = self.vectors.value(id as usize); | ||
| self.distance_type.func()( | ||
| self.query.as_primitive::<Float32Type>().values(), |
There was a problem hiding this comment.
How does it work with hamming / uint8
| Self::Product(pq) => pq.metadata_key(), | ||
| Self::Scalar(sq) => sq.metadata_key(), | ||
| Self::Flat(_) => FlatQuantizer::metadata_key(), | ||
| Self::Product(_) => ProductQuantizerImpl::<Float32Type>::metadata_key(), |
There was a problem hiding this comment.
do we need to support other float types?
There was a problem hiding this comment.
yeah, but won't be in this PR
| String::from_utf8(storage_ivf_pb.encode_to_vec()).map_err(|e| { | ||
| Error::io(format!("failed to encode IVF metadata: {}", e), location!()) | ||
| })?, | ||
| hex::encode(storage_ivf_pb.encode_to_vec()), |
There was a problem hiding this comment.
Can we just write it to the file as an array, and keep track the offset in the metadata instead?
[](https://renovatebot.com) This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [lance](https://togithub.com/lancedb/lance) | dependencies | patch | `0.12.1` -> `0.12.2` | --- ### Release Notes <details> <summary>lancedb/lance (lance)</summary> ### [`v0.12.2`](https://togithub.com/lancedb/lance/releases/tag/v0.12.2) [Compare Source](https://togithub.com/lancedb/lance/compare/v0.12.1...v0.12.2) <!-- Release notes generated using configuration in .github/release.yml at v0.12.2 --> #### What's Changed ##### New Features 🎉 - feat: support IVF_FLAT index by [@​BubbleCal](https://togithub.com/BubbleCal) in [https://github.com/lancedb/lance/pull/2418](https://togithub.com/lancedb/lance/pull/2418) - feat: add support for arbitrary global buffers to the v2 reader/writer by [@​westonpace](https://togithub.com/westonpace) in [https://github.com/lancedb/lance/pull/2445](https://togithub.com/lancedb/lance/pull/2445) - feat: add pushdown to the read path by [@​westonpace](https://togithub.com/westonpace) in [https://github.com/lancedb/lance/pull/2444](https://togithub.com/lancedb/lance/pull/2444) - feat: scan with stable row id by [@​wjones127](https://togithub.com/wjones127) in [https://github.com/lancedb/lance/pull/2441](https://togithub.com/lancedb/lance/pull/2441) - feat(java): add lance spark connector for read by [@​LuQQiu](https://togithub.com/LuQQiu) in [https://github.com/lancedb/lance/pull/2429](https://togithub.com/lancedb/lance/pull/2429) - feat: pickle the manifest when pickling a dataset by [@​westonpace](https://togithub.com/westonpace) in [https://github.com/lancedb/lance/pull/2459](https://togithub.com/lancedb/lance/pull/2459) - feat: store IVF in global buffer by [@​BubbleCal](https://togithub.com/BubbleCal) in [https://github.com/lancedb/lance/pull/2449](https://togithub.com/lancedb/lance/pull/2449) ##### Bug Fixes 🐛 - fix: lance.torch.LanceDataset spelt readahead wrong by [@​wjones127](https://togithub.com/wjones127) in [https://github.com/lancedb/lance/pull/2453](https://togithub.com/lancedb/lance/pull/2453) - fix: respect storage options on `create_index` by [@​joshua-auchincloss](https://togithub.com/joshua-auchincloss) in [https://github.com/lancedb/lance/pull/2460](https://togithub.com/lancedb/lance/pull/2460) - fix: incorrect logic when decoding manifest from last block by [@​westonpace](https://togithub.com/westonpace) in [https://github.com/lancedb/lance/pull/2464](https://togithub.com/lancedb/lance/pull/2464) ##### Documentation 📚 - docs: add spark-lance-connector readme by [@​LuQQiu](https://togithub.com/LuQQiu) in [https://github.com/lancedb/lance/pull/2456](https://togithub.com/lancedb/lance/pull/2456) ##### Performance Improvements 🚀 - perf: speed up hnsw build by [@​eddyxu](https://togithub.com/eddyxu) in [https://github.com/lancedb/lance/pull/2448](https://togithub.com/lancedb/lance/pull/2448) - perf: better L2 code generation by [@​eddyxu](https://togithub.com/eddyxu) in [https://github.com/lancedb/lance/pull/2450](https://togithub.com/lancedb/lance/pull/2450) ##### Other Changes - refactor: pull row_id and deletion vector out of v1 file reader by [@​wjones127](https://togithub.com/wjones127) in [https://github.com/lancedb/lance/pull/2436](https://togithub.com/lancedb/lance/pull/2436) - refactor: move Prefilter to a trait and move VectorIndex to lance-vector by [@​eddyxu](https://togithub.com/eddyxu) in [https://github.com/lancedb/lance/pull/2442](https://togithub.com/lancedb/lance/pull/2442) - refactor: move hnsw from lance crate to lance-index crate by [@​eddyxu](https://togithub.com/eddyxu) in [https://github.com/lancedb/lance/pull/2446](https://togithub.com/lancedb/lance/pull/2446) - refactor: impl the sub index trait for HNSW and clean code by [@​BubbleCal](https://togithub.com/BubbleCal) in [https://github.com/lancedb/lance/pull/2462](https://togithub.com/lancedb/lance/pull/2462) #### New Contributors - [@​joshua-auchincloss](https://togithub.com/joshua-auchincloss) made their first contribution in [https://github.com/lancedb/lance/pull/2460](https://togithub.com/lancedb/lance/pull/2460) **Full Changelog**: lance-format/lance@v0.12.1...v0.12.2 </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/spiraldb/vortex). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zOTMuMCIsInVwZGF0ZWRJblZlciI6IjM3LjM5My4wIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCIsImxhYmVscyI6W119--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
[](https://renovatebot.com) This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [lance](https://togithub.com/lancedb/lance) | dependencies | patch | `0.12.1` -> `0.12.2` | --- ### Release Notes <details> <summary>lancedb/lance (lance)</summary> ### [`v0.12.2`](https://togithub.com/lancedb/lance/releases/tag/v0.12.2) [Compare Source](https://togithub.com/lancedb/lance/compare/v0.12.1...v0.12.2) <!-- Release notes generated using configuration in .github/release.yml at v0.12.2 --> #### What's Changed ##### New Features 🎉 - feat: support IVF_FLAT index by [@​BubbleCal](https://togithub.com/BubbleCal) in [https://github.com/lancedb/lance/pull/2418](https://togithub.com/lancedb/lance/pull/2418) - feat: add support for arbitrary global buffers to the v2 reader/writer by [@​westonpace](https://togithub.com/westonpace) in [https://github.com/lancedb/lance/pull/2445](https://togithub.com/lancedb/lance/pull/2445) - feat: add pushdown to the read path by [@​westonpace](https://togithub.com/westonpace) in [https://github.com/lancedb/lance/pull/2444](https://togithub.com/lancedb/lance/pull/2444) - feat: scan with stable row id by [@​wjones127](https://togithub.com/wjones127) in [https://github.com/lancedb/lance/pull/2441](https://togithub.com/lancedb/lance/pull/2441) - feat(java): add lance spark connector for read by [@​LuQQiu](https://togithub.com/LuQQiu) in [https://github.com/lancedb/lance/pull/2429](https://togithub.com/lancedb/lance/pull/2429) - feat: pickle the manifest when pickling a dataset by [@​westonpace](https://togithub.com/westonpace) in [https://github.com/lancedb/lance/pull/2459](https://togithub.com/lancedb/lance/pull/2459) - feat: store IVF in global buffer by [@​BubbleCal](https://togithub.com/BubbleCal) in [https://github.com/lancedb/lance/pull/2449](https://togithub.com/lancedb/lance/pull/2449) ##### Bug Fixes 🐛 - fix: lance.torch.LanceDataset spelt readahead wrong by [@​wjones127](https://togithub.com/wjones127) in [https://github.com/lancedb/lance/pull/2453](https://togithub.com/lancedb/lance/pull/2453) - fix: respect storage options on `create_index` by [@​joshua-auchincloss](https://togithub.com/joshua-auchincloss) in [https://github.com/lancedb/lance/pull/2460](https://togithub.com/lancedb/lance/pull/2460) - fix: incorrect logic when decoding manifest from last block by [@​westonpace](https://togithub.com/westonpace) in [https://github.com/lancedb/lance/pull/2464](https://togithub.com/lancedb/lance/pull/2464) ##### Documentation 📚 - docs: add spark-lance-connector readme by [@​LuQQiu](https://togithub.com/LuQQiu) in [https://github.com/lancedb/lance/pull/2456](https://togithub.com/lancedb/lance/pull/2456) ##### Performance Improvements 🚀 - perf: speed up hnsw build by [@​eddyxu](https://togithub.com/eddyxu) in [https://github.com/lancedb/lance/pull/2448](https://togithub.com/lancedb/lance/pull/2448) - perf: better L2 code generation by [@​eddyxu](https://togithub.com/eddyxu) in [https://github.com/lancedb/lance/pull/2450](https://togithub.com/lancedb/lance/pull/2450) ##### Other Changes - refactor: pull row_id and deletion vector out of v1 file reader by [@​wjones127](https://togithub.com/wjones127) in [https://github.com/lancedb/lance/pull/2436](https://togithub.com/lancedb/lance/pull/2436) - refactor: move Prefilter to a trait and move VectorIndex to lance-vector by [@​eddyxu](https://togithub.com/eddyxu) in [https://github.com/lancedb/lance/pull/2442](https://togithub.com/lancedb/lance/pull/2442) - refactor: move hnsw from lance crate to lance-index crate by [@​eddyxu](https://togithub.com/eddyxu) in [https://github.com/lancedb/lance/pull/2446](https://togithub.com/lancedb/lance/pull/2446) - refactor: impl the sub index trait for HNSW and clean code by [@​BubbleCal](https://togithub.com/BubbleCal) in [https://github.com/lancedb/lance/pull/2462](https://togithub.com/lancedb/lance/pull/2462) #### New Contributors - [@​joshua-auchincloss](https://togithub.com/joshua-auchincloss) made their first contribution in [https://github.com/lancedb/lance/pull/2460](https://togithub.com/lancedb/lance/pull/2460) **Full Changelog**: lance-format/lance@v0.12.1...v0.12.2 </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/spiraldb/vortex). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zOTMuMCIsInVwZGF0ZWRJblZlciI6IjM3LjM5My4wIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCIsImxhYmVscyI6W119--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
No description provided.