feat(java): supports building scalar indices distributedly in java module #4961

Merged
jackye1995 merged 4 commits into lance-format:main from steFaiz:java_dist_index
Nov 5, 2025

Conversation

Collaborator

@steFaiz steFaiz commented Oct 15, 2025

This PR brings the distributed index creation functionality (see #4667, #4578) to the Java module, aligned with the Python implementation.

@github-actions github-actions Bot added enhancement New feature or request java labels Oct 15, 2025
@jackye1995 jackye1995 self-requested a review October 15, 2025 21:07
Collaborator Author

steFaiz commented Oct 17, 2025

This PR is ready for review, PTAL, thanks! @jackye1995

Optional<String> name,
IndexParams params,
boolean replace) {
IndexOptions options) {
Contributor

I feel this has become quite a long list of parameters, so we might want to just have a CreateIndexBuilder that holds all the available options, including columns, index type, name, and params. That seems like the best way to ensure we don't keep changing our public APIs again and again. What do you think?

Collaborator Author

Thanks for your advice! I think you are right, this interface is likely to keep evolving. I've wrapped all the parameters into an IndexOptions class with required and optional options. I think CreateIndexBuilder might be more Rust-style, so I kept the IndexOptions class. The original interface is also kept for backwards compatibility.
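To make the "required options plus optional options" idea concrete, here is a minimal sketch of how such an options class could be shaped. It is not the IndexOptions class from this PR: the class name `IndexOptionsSketch`, its fields, the string index type, and the method names are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

// Hypothetical sketch only; the real IndexOptions class in the PR may differ.
final class IndexOptionsSketch {
  // Required options: must be supplied when the builder is created.
  private final List<String> columns;
  private final String indexType;
  // Optional options: fall back to defaults when not set.
  private final Optional<String> indexName;
  private final boolean replace;

  private IndexOptionsSketch(Builder b) {
    this.columns = Collections.unmodifiableList(new ArrayList<>(b.columns));
    this.indexType = b.indexType;
    this.indexName = b.indexName;
    this.replace = b.replace;
  }

  // Required options are constructor-style arguments, so they cannot be forgotten.
  static Builder builder(List<String> columns, String indexType) {
    return new Builder(columns, indexType);
  }

  List<String> getColumns() { return columns; }
  String getIndexType() { return indexType; }
  Optional<String> getIndexName() { return indexName; }
  boolean isReplace() { return replace; }

  static final class Builder {
    private final List<String> columns;
    private final String indexType;
    private Optional<String> indexName = Optional.empty();
    private boolean replace = false;

    private Builder(List<String> columns, String indexType) {
      this.columns = Objects.requireNonNull(columns, "columns");
      this.indexType = Objects.requireNonNull(indexType, "indexType");
      if (columns.isEmpty()) {
        throw new IllegalArgumentException("at least one column is required");
      }
    }

    // New optional settings can be added here without changing existing callers.
    Builder withIndexName(String name) {
      this.indexName = Optional.of(name);
      return this;
    }

    Builder replace(boolean replace) {
      this.replace = replace;
      return this;
    }

    IndexOptionsSketch build() {
      return new IndexOptionsSketch(this);
    }
  }
}
```

With a shape like this, a caller would write something like `IndexOptionsSketch.builder(columns, "BTREE").withIndexName("id_idx").build()`, and adding a further optional setting later would not break the method signature that accepts the options object.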

@steFaiz steFaiz requested a review from jackye1995 October 20, 2025 06:29
Comment thread java/lance-jni/src/blocking_dataset.rs Outdated
index_builder = index_builder.fragments(fragment_ids);
}

if let Some(fragment_uuid) = fragment_uuid {
Contributor

Not a problem for you, but more a question for @chenghao-guo about this fragment_uuid thing. I get that this is shared by the fragment-level index builder, but isn't this just an index UUID? Why do we call it fragment UUID? It is confusing to me.

Contributor

Your understanding is correct. I agree that using index_uuid improves clarity and reduces ambiguity. Once this gets refactored, I will update it in the lance-ray project.

Contributor

Looks like the Java API already uses index_uuid in the method, so I think we can go ahead and refactor it in this PR. I will update the lance-ray/daft method parameter to make it consistent as well.

Collaborator Author

@chenghao-guo OK, thanks for your reply. I'll rename it in this PR.

Collaborator Author

@jackye1995 I've refactored this. PTAL, thanks!
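For readers following the naming discussion, the sketch below illustrates why one shared index UUID per build makes sense when the fragment-level builders each handle a subset of fragments. It is only a planning helper under stated assumptions: `DistributedIndexPlanSketch`, `Task`, and `plan` are hypothetical names, the round-robin split is an arbitrary choice, and no Lance API is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical planning helper; it does not call any Lance API.
final class DistributedIndexPlanSketch {

  // One unit of work: a subset of fragments tagged with the shared index UUID.
  static final class Task {
    final String indexUuid;
    final List<Integer> fragmentIds;

    Task(String indexUuid, List<Integer> fragmentIds) {
      this.indexUuid = indexUuid;
      this.fragmentIds = fragmentIds;
    }
  }

  // Splits fragment ids round-robin across workers. Every task carries the
  // same UUID because all partial builds belong to one logical index.
  static List<Task> plan(List<Integer> fragmentIds, int numWorkers) {
    if (numWorkers <= 0) {
      throw new IllegalArgumentException("numWorkers must be positive");
    }
    String indexUuid = UUID.randomUUID().toString();

    List<List<Integer>> groups = new ArrayList<>();
    for (int i = 0; i < numWorkers; i++) {
      groups.add(new ArrayList<>());
    }
    for (int i = 0; i < fragmentIds.size(); i++) {
      groups.get(i % numWorkers).add(fragmentIds.get(i));
    }

    List<Task> tasks = new ArrayList<>();
    for (List<Integer> group : groups) {
      if (!group.isEmpty()) { // skip workers that received no fragments
        tasks.add(new Task(indexUuid, group));
      }
    }
    return tasks;
  }
}
```

In this picture the UUID identifies the index being built rather than any single fragment, which is the point the reviewers make in favor of the index_uuid name.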

private final List<Integer> fields;
private final String name;
private final long datasetVersion;
private final byte[] fragmentBitmap;
Contributor

I think we probably want to keep this, since for a created index we would prefer to return the bitmap. We should also make sure the bitmap and the list of fragments are not set at the same time.
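A minimal sketch of the "not both at the same time" check suggested here, assuming a holder that can carry either representation. The class `FragmentCoverageSketch` and its members are hypothetical and are not taken from the PR.

```java
import java.util.List;

// Hypothetical holder used only to illustrate the mutual-exclusion check.
final class FragmentCoverageSketch {
  private final byte[] fragmentBitmap;   // serialized bitmap, or null if unset
  private final List<Integer> fragments; // materialized fragment ids, or null if unset

  FragmentCoverageSketch(byte[] fragmentBitmap, List<Integer> fragments) {
    // Reject ambiguous input: two representations of the same information
    // could disagree with each other.
    if (fragmentBitmap != null && fragments != null) {
      throw new IllegalArgumentException(
          "set either fragmentBitmap or fragments, not both");
    }
    this.fragmentBitmap = fragmentBitmap;
    this.fragments = fragments;
  }

  boolean hasBitmap() { return fragmentBitmap != null; }

  boolean hasFragmentList() { return fragments != null; }
}
```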

Collaborator Author

@jackye1995 Thanks for your review! I followed the Python API, see python index. The Python Index is also directly converted from (and would also be directly converted to) the Rust IndexMetadata struct, see python conversion.
The current Java API stores the serialized Rust RoaringBitmap in a byte array, which is completely opaque to Java users. A List makes the fragment info visible, which might be useful in some situations. For example, in distributed scan scenarios, we could configure more resources for unindexed fragments.
Do you think we need both a List and a byte[] on the Java side, or should we just follow the Python implementation? We could also introduce a RoaringBitmap dependency if you think a List object is too large.
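As a rough illustration of the distributed scan use case mentioned above, a visible fragment list lets a planner work out which fragments an index does not cover. The helper below is a sketch with hypothetical names (`UnindexedFragmentsSketch`, `unindexed`) and plain Java collections. It assumes the caller already has the dataset's full list of fragment ids from elsewhere.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of the Lance Java API.
final class UnindexedFragmentsSketch {

  // Returns the fragment ids that are in the dataset but not covered by the
  // index, preserving the order of allFragmentIds.
  static List<Integer> unindexed(List<Integer> allFragmentIds, List<Integer> indexedFragmentIds) {
    Set<Integer> indexed = new HashSet<>(indexedFragmentIds);
    List<Integer> result = new ArrayList<>();
    for (Integer id : allFragmentIds) {
      if (!indexed.contains(id)) {
        result.add(id);
      }
    }
    return result;
  }
}
```

A scheduler could then give the returned fragments more resources, since scans over them cannot use the index. With an opaque byte[] this would require deserializing the bitmap on the Java side first.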

Contributor

Sounds good. I think we can keep it as a materialized list for now, but in the long term I think we should introduce roaring bitmap to the Java SDK. That can be a separate task.

@steFaiz steFaiz requested a review from jackye1995 October 21, 2025 05:08

codecov-commenter commented Oct 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.74%. Comparing base (e2a9079) to head (011a410).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4961      +/-   ##
==========================================
- Coverage   81.77%   81.74%   -0.04%     
==========================================
  Files         341      341              
  Lines      140933   140933              
  Branches   140933   140933              
==========================================
- Hits       115255   115202      -53     
- Misses      21862    21910      +48     
- Partials     3816     3821       +5     
Flag Coverage Δ
unittests 81.74% <100.00%> (-0.04%) ⬇️

Collaborator Author

steFaiz commented Oct 28, 2025

@jackye1995 Friendly reminder to review this when convenient!

Contributor

@jackye1995 jackye1995 left a comment

looks good to me, sorry I missed reviewing this after the update. Let me know when it is rebased!

Collaborator Author

steFaiz commented Nov 5, 2025

> looks good to me, sorry I missed reviewing this after the update. Let me know when it is rebased!

@jackye1995 Thanks for your review! I've rebased onto the main branch and all stable tests pass.

@jackye1995 jackye1995 merged commit a5b754b into lance-format:main Nov 5, 2025
26 of 27 checks passed
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
…dule (lance-format#4961)

This PR is about to bring the distributed index creation functionality
(see lance-format#4667,
lance-format#4578) to java module, which is
aligned with the python implementation.

---------

Co-authored-by: 喆宇 <wxy407679@antgroup.com>
Labels

enhancement New feature or request java python
