Skip to content

Conversation

@noahfrn
Copy link
Contributor

@noahfrn noahfrn commented Apr 11, 2023

I'd like to add support for multi-file datasets (in different directories) in the Java bindings.
This PR adds to the existing Datasets API to allow an array of URIs to be passed in. The logic for whether the URIs passed are valid or not exists in the JNI layer.

@noahfrn noahfrn requested a review from lidavidm as a code owner April 11, 2023 10:28
@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@noahfrn noahfrn changed the title 35033: [Java] [Datasets] Add support for multi-file datasets from Java GH-35033: [Java] [Datasets] Add support for multi-file datasets from Java Apr 11, 2023
@github-actions
Copy link

@noahfrn
Copy link
Contributor Author

noahfrn commented Apr 11, 2023

Unsure why the flight-core test is failing, doesn't seem related to my change.

@lidavidm
Copy link
Member

@davisusanibar

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 11, 2023
[] (const auto& p1, const auto& p2) {
return p1.first->type_name() == p2.first->type_name();
}) - filesystems.begin() != 1) {
JniThrow("Different filesystems are not supported in a multi-file dataset.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be possible to add the message that contains detail about what different filesystem the user are trying to use?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be using #34420

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on both counts, I'll wait for #34420 to clean this up a bit. and then create a better exception.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noahfournier FYI #34420 is merged now


AutoCloseables.close(datum);

}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be possible to add testing for exception throws: (1) Invalid URI , (2) ensure that they all share a FileSystem type

[] (const auto& p1, const auto& p2) {
return p1.first->type_name() == p2.first->type_name();
}) - filesystems.begin() != 1) {
JniThrow("Different filesystems are not supported in a multi-file dataset.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be using #34420

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 12, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 13, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 2, 2023
@noahfrn
Copy link
Contributor Author

noahfrn commented Jun 2, 2023

Sorry for the delay on this one. I've added the test as requested, and cleaned up the implementation using #34420

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jun 2, 2023
@lidavidm lidavidm merged commit cd42895 into apache:main Jun 2, 2023
@ursabot
Copy link

ursabot commented Jun 3, 2023

Benchmark runs are scheduled for baseline = 018e7d3 and contender = cd42895. cd42895 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.62% ⬆️0.09%] test-mac-arm
[Failed] ursa-i9-9960x
[Finished ⬇️0.6% ⬆️0.06%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] cd42895a ec2-t3-xlarge-us-east-2
[Finished] cd42895a test-mac-arm
[Failed] cd42895a ursa-i9-9960x
[Finished] cd42895a ursa-thinkcentre-m75q
[Finished] 018e7d3f ec2-t3-xlarge-us-east-2
[Finished] 018e7d3f test-mac-arm
[Finished] 018e7d3f ursa-i9-9960x
[Finished] 018e7d3f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@noahfrn noahfrn deleted the multifile-datasets-java branch July 5, 2023 14:17
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
… from Java (apache#35034)

I'd like to add support for multi-file datasets (in different directories) in the Java bindings.
This PR adds to the existing Datasets API to allow an array of URIs to be passed in. The logic for whether the URIs passed are valid or not exists in the JNI layer.
* Closes: apache#35033

Lead-authored-by: NoahFournier <fourniernoah6@gmail.com>
Co-authored-by: NoahFournier <63198198+NoahFournier@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Java] [Datasets] Add support for multi-file datasets from Java

5 participants