feat(java): support building vector index distributively#5664
feat(java): support building vector index distributively#5664yanghua merged 5 commits intolance-format:mainfrom
Conversation
2f84c6d to
49b415e
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
3bef5b6 to
ee05fa4
Compare
ee05fa4 to
9e2273f
Compare
22cba7b to
0007d9e
Compare
0007d9e to
8735390
Compare
| let store = LanceIndexStore::from_dataset_for_new(self, index_uuid)?; | ||
| let index_dir = self.indices_dir().child(index_uuid); | ||
| log::info!( | ||
| "merge_index_metadata called with index_type={})", |
There was a problem hiding this comment.
If we want to add some logging, make it more readable?
There was a problem hiding this comment.
Sorry.
This is just copied from original python dataset.rs. I agree to delete this info. WDYT
| // descriptive error if the column is invalid. | ||
| let _dim = get_vector_dim(dataset.schema(), &column)?; | ||
|
|
||
| // Drop dataset_guard at the end of this block before reusing env. |
PTAL when you have time @yanghua |
yanghua
left a comment
There was a problem hiding this comment.
Just two comments, otherwise LGTM.
| import static org.junit.jupiter.api.Assertions.assertFalse; | ||
| import static org.junit.jupiter.api.Assertions.assertTrue; | ||
|
|
||
| public class VectorIndexTest { |
There was a problem hiding this comment.
Maybe we can compare the recall performance between a single-machine built index and a distributedly built index? Just like Python. Currently, all the test cases do not measure the metrics for the recall.
There was a problem hiding this comment.
I think we could verify the accuracy of the recall implementation just once — since both Java and Python call the same underlying rust code. I‘d like to keep Java tests more functional-only
What do you think? @yanghua
There was a problem hiding this comment.
since both Java and Python call the same underlying rust code
Maybe we can choose one case to verify? Although the underlying code is the same, the upper layer still needs to do some data type change or parameters pass. They may cause some issues?
yanghua
left a comment
There was a problem hiding this comment.
I have discussed with @majin1102. And we have agreed that we can add more detailed verification about recall in the following PR. So, LGTM about this PR!
…t#5664) Note: 1. Reposition merge_index_metadata so that Java dataset could directly call mergeIndexMetadata for vector indexes. 2. Add `centroids` to IvfBuilderParams and `codebook` to PqBuilderParams to make distrubitively creating vector indexes work 3. Add a new index package under tests to have index tests. 4. This PR update Cargo.lock due to the intruduction of lance-format#5650.
…t#5664) Note: 1. Reposition merge_index_metadata so that Java dataset could directly call mergeIndexMetadata for vector indexes. 2. Add `centroids` to IvfBuilderParams and `codebook` to PqBuilderParams to make distrubitively creating vector indexes work 3. Add a new index package under tests to have index tests. 4. This PR update Cargo.lock due to the intruduction of lance-format#5650.
This PR should be rebased after #5664 --------- Co-authored-by: majin.nathan <majin.nathan@bytedance.com>
Note:
centroidsto IvfBuilderParams andcodebookto PqBuilderParams to make distrubitively creating vector indexes work