feat(java): supports building scalar indices distributedly in java module#4961
feat(java): supports building scalar indices distributedly in java module#4961jackye1995 merged 4 commits intolance-format:mainfrom
Conversation
|
This PR is ready for review, PTAL, thanks! @jackye1995 |
| Optional<String> name, | ||
| IndexParams params, | ||
| boolean replace) { | ||
| IndexOptions options) { |
There was a problem hiding this comment.
I feel this has become quite a long list of things that we might want to just have a CreateIndexBuilder which has all the available options including columns, index type, name, params. That seems like the best way to ensure we don't keep changing our public APIs again and again, what do you think?
There was a problem hiding this comment.
Thanks for your advise! I think you are right, this interface might be evolving. I've wrapped all params into IndexOptions class with required options and optional options. I think CreateIndexBuilder might be more rust style, so I kept the IndexOptions class. The original interface is also kept for backwards compatibility.
| index_builder = index_builder.fragments(fragment_ids); | ||
| } | ||
|
|
||
| if let Some(fragment_uuid) = fragment_uuid { |
There was a problem hiding this comment.
not a problem for you, but more for @chenghao-guo for this fragment_uuid thing. I get this is shared by the fragment level index builder, but isn't this just an index UUID? Why do we call it fragment UUID? It is confusing to me.
There was a problem hiding this comment.
Your understanding is correct. I agree that using index_uuid improves clarity and reduces ambiguity. Once this get refracted, I will update that in lance-ray project.
There was a problem hiding this comment.
Looks like java api already used the index_uuid in the method, I think we can go ahead and refractor it in this PR. I will update lance-ray/daft method parameter to uniform it as well.
There was a problem hiding this comment.
@chenghao-guo ok, thanks for your reply. I'll rename it in this pr.
There was a problem hiding this comment.
@jackye1995 I've refactored this. PTAL, thanks!
| private final List<Integer> fields; | ||
| private final String name; | ||
| private final long datasetVersion; | ||
| private final byte[] fragmentBitmap; |
There was a problem hiding this comment.
I think we probably want to keep this, since for a created index we would prefer to return the bitmap. Also we should make sure bitmap and list of fragments should not be set at the same time
There was a problem hiding this comment.
@jackye1995 Thanks for your review! I followed the python api, see python index. Python Index is also directly converted from (and would be also directly converted to) Rust IndexMetadata struct, see python conversion.
Current java api stores serialized rust RoaringBitmap in a byte array, which is absolutely opaque to java user. A List makes fragments info visible, which might be useful in some situations. For example, in distributed scan scenerios, we could configure more resources for unindexed fragments.
Do you think we need to have both a List and a byte[] in java side ? Or just follow the Python implementation. And we can also introduce RoaringBitmap dependency if you think a List object is too large.
There was a problem hiding this comment.
Sounds good I think we can have it as a materialized for now, but in long term I think we should introduce roaring bitmap to java sdk but that can be a separated task.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4961 +/- ##
==========================================
- Coverage 81.77% 81.74% -0.04%
==========================================
Files 341 341
Lines 140933 140933
Branches 140933 140933
==========================================
- Hits 115255 115202 -53
- Misses 21862 21910 +48
- Partials 3816 3821 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@jackye1995 Friendly reminder to review this when convenient! |
8dcb63b to
b394c1c
Compare
jackye1995
left a comment
There was a problem hiding this comment.
looks good to me, sorry I missed reviewing this after the update. Let me know when it is rebased!
| private final List<Integer> fields; | ||
| private final String name; | ||
| private final long datasetVersion; | ||
| private final byte[] fragmentBitmap; |
There was a problem hiding this comment.
Sounds good I think we can have it as a materialized for now, but in long term I think we should introduce roaring bitmap to java sdk but that can be a separated task.
011a410 to
a308ab1
Compare
@jackye1995 Thanks for your review! I've rebased onto the main branch and all stable tests are passed. |
…dule (lance-format#4961) This PR is about to bring the distributed index creation functionality (see lance-format#4667, lance-format#4578) to java module, which is aligned with the python implementation. --------- Co-authored-by: 喆宇 <wxy407679@antgroup.com>
This PR is about to bring the distributed index creation functionality (see #4667, #4578) to java module, which is aligned with the python implementation.