@zhztheplayer
Member

@zhztheplayer zhztheplayer commented Aug 5, 2021

Added a simple utility API for sharing data between C++ and Java code. The methods call the C Data Interface API directly.

Updated the Java dataset code to use the new API instead of passing buffer pointers over JNI.

This is also a dependency of ARROW-11776 (PR #10201).

@github-actions

github-actions bot commented Aug 5, 2021

@zhztheplayer
Member Author

Hi @kiszk, @emkornfield, @fsaintjacques. Would you be willing to help review this PR, as it is a dependency of the ongoing ARROW-11776 fix? Thanks a lot. The major code changes are in the second commit, 1f4805d; the nit fixes are in a separate commit.

Member

We would like to ask you to avoid this kind of cleanup in such a large PR. Increasing the number of changed lines makes the PR harder for us to review.

What do you think?

Member Author

@zhztheplayer zhztheplayer Aug 24, 2021

Agreed. The nit changes have now been removed from this PR. I'll open another PR after this one gets merged, since for now this PR should follow the old style.

Member

Ditto for the name change.

@zhztheplayer zhztheplayer force-pushed the ARROW-7272 branch 4 times, most recently from fd07e19 to 4f09d18 Compare August 24, 2021 06:53
@kiszk
Member

kiszk commented Aug 29, 2021

Would it be good to add tests for DeserializeUnsafeFromJava and SerializeUnsafeFromNative to jni_util_test.cc?
Since we use a few JNI methods (GetArrayLength, GetByteArrayElements, NewByteArray, SetByteArrayRegion, ...), we could create a mock JNI environment in the test case.

Test cases make it clear what these methods are supposed to do.

@zhztheplayer
Member Author

zhztheplayer commented Aug 31, 2021

> Would it be good to add tests for DeserializeUnsafeFromJava and SerializeUnsafeFromNative to jni_util_test.cc?
> Since we use a few JNI methods (GetArrayLength, GetByteArrayElements, NewByteArray, SetByteArrayRegion, ...), we could create a mock JNI environment in the test case.
>
> Test cases make it clear what these methods are supposed to do.

I can try adding mock-based tests on the C++ side.

That said, doesn't the round-trip test already cover this? https://github.com/apache/arrow/pull/10883/files#diff-e1e74b2b4a2812b4e273413ec2beb7f89bb62f2a4172f832805d68a87a773107R100-R101

@kiszk
Member

kiszk commented Sep 5, 2021

Thank you for giving it a try. We look forward to seeing it.

The end-to-end test looks good for coverage. IMHO, unit tests help the review and help narrow down potential (future) problems.

@zhztheplayer
Member Author

I completely agree that we should add more tests. However, I was thinking this part of the code could stay in the Dataset code for the short term (which is also why I didn't put it in a common module), because I would prefer to integrate the Java C ABI, which is pending on another proposal, to replace this half-serialization implementation. Anyway, as long as we think the C++ mock test is needed here, I will try adding it to this PR (once I can free up some time; I'm a bit busy these days, sorry for the wait).

@zhztheplayer
Member Author

zhztheplayer commented Sep 6, 2021

@kiszk

I have added a mock test for the batch-exporting (unsafe serialization) code on the C++ side: 0c1d58f

The unsafe importing part is trickier, so I didn't add a test case for it. The function arrow::dataset::jni::DeserializeUnsafeFromJava requires a byte array that was serialized by the Java-side utility method, so we would end up adding magic byte inputs if we wanted to unit test this part of the code in C++. Do you have any thoughts on this?

@zhztheplayer
Member Author

zhztheplayer commented Oct 9, 2021

@kiszk Any thoughts about the current code?

@emkornfield Would you like to take a look at this too? I think it would be great if we could bring the Java dataset write feature (#10201) into version 6.0.0.

@pitrou
Member

pitrou commented Oct 14, 2021

ARROW-12965 is now solved, so this should be simplified to reuse the C data interface.

@zhztheplayer
Member Author

> ARROW-12965 is now solved, so this should be simplified to reuse the C data interface.

Agreed. I'll update the code.

@emkornfield
Contributor

@zhztheplayer sorry for the delay here, did you get a chance to update this to the C-FFI usage?

@zhztheplayer
Member Author

> @zhztheplayer sorry for the delay here, did you get a chance to update this to the C-FFI usage?

Sorry, I haven't had a chance to work on it yet. I will try to put together a patch next week. Thanks.

@zhztheplayer
Member Author

@emkornfield I have updated the PR. I think CI might be broken since I haven't changed the scripts yet, but you can review the code if you want to. I will update the CI scripts later in this PR. Thanks.

@zhztheplayer
Member Author

The Travis failure doesn't seem to be related to the changes. The other jobs passed.

@zhztheplayer zhztheplayer changed the title from "ARROW-7272: [C++][Java] JNI bridge between RecordBatch and VectorSchemaRoot" to "ARROW-7272: [C++][Java][Dataset] JNI bridge between RecordBatch and VectorSchemaRoot" Apr 25, 2022
@zhztheplayer
Member Author

@kiszk Can we bring the patch into release 8.0.0? Hopefully it's not too late. Thanks.

@kszucs
Member

kszucs commented Apr 25, 2022

@github-actions crossbow submit java*

@github-actions

Revision: 5fdb38b369d5f7daa8b2e483ae9f1b932460c0ee

Submitted crossbow builds: ursacomputing/crossbow @ actions-1945

Task Status
java-jars Github Actions

@pitrou
Member

pitrou commented Apr 25, 2022

@zhztheplayer Can you update the PR description?

@kiszk
Member

kiszk commented Apr 25, 2022

cc @kszucs

@kszucs
Member

kszucs commented Apr 25, 2022

@kiszk can you approve the Java changes? Then we can include it in the release.

@kiszk
Member

kiszk commented Apr 25, 2022

Sure, let me see.

@pitrou
Member

pitrou commented Apr 25, 2022

@lwhite1 You may want to take a look here.

@pitrou
Member

pitrou commented Apr 25, 2022

Also @zhztheplayer please rebase on master so that we can fix Travis-CI builds.

@zhztheplayer
Member Author

Travis is fixed, but some new errors have emerged that seem to be related to network issues. @pitrou can you help retrigger the failed jobs?

@pitrou
Member

pitrou commented Apr 25, 2022

It seems someone else did it for me :-)
(but, yes, Homebrew is often unreliable...)

Comment on lines +170 to +172
env->ExceptionDescribe();
env->ExceptionClear();
return Status::Invalid("Error during calling Java code from native code");

Member

Instead of simply dumping the exception, is it possible to add its description to the Status instance that is being returned?

Member

This can be deferred to a later JIRA btw.

Member Author

Sure. I'll open another ticket for that.

Member

Member Author

Assigned to me. Thanks.

@pitrou pitrou requested a review from lidavidm April 25, 2022 14:05
Member

@lidavidm lidavidm left a comment

LGTM, some minor comments

* ScanTask is meant to be a unit of work to be dispatched. The implementation
* must be thread and concurrent safe.
*/
public interface ScanTask extends AutoCloseable {

Member

N.B. we should eventually remove this class entirely, as the corresponding interface no longer exists in C++ (ARROW-15745).

Member Author

I can assign it to myself. Thanks.

for (int i = 0; i < schema->num_fields(); ++i) {
std::vector<std::shared_ptr<arrow::Array>> offset_zeroed_arrays;
for (int i = 0; i < record_batch->num_columns(); ++i) {
// TODO: If the array has an offset then we need to de-offset the array

Member

Is this TODO still relevant? It seems it's not a TODO anymore but an explanatory note

Member Author

@zhztheplayer zhztheplayer Apr 25, 2022

It's still valid. The code still copies and rebuilds offset buffers because the Java C Data Interface doesn't implement exporting/importing for offset buffers either. I believe that doing so would require a systematic change to the Java code, since Arrow Java doesn't implement the same offset semantics as C++.
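
To illustrate what "de-offsetting" can mean for a variable-width array, here is a minimal sketch in plain Java. It is not the C++ code from this PR, and the class and method names (`OffsetRebase`, `rebase`) are hypothetical. It assumes the usual Arrow layout in which a slice shares its parent's offsets buffer, so the slice's first offset is generally non-zero; the sketch rebuilds the offsets so that they start at 0 and copies only the bytes the slice references.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration of zeroing the offset of a sliced array; not the PR's code. */
public final class OffsetRebase {
  /** Zero-based offsets plus the data bytes they reference. */
  public static final class Rebased {
    public final int[] offsets;
    public final byte[] data;

    Rebased(int[] offsets, byte[] data) {
      this.offsets = offsets;
      this.data = data;
    }
  }

  /**
   * Rebuilds the offsets of the slice [sliceOffset, sliceOffset + length) so they start at 0.
   * The parent offsets array has one more entry than the parent has values.
   */
  public static Rebased rebase(int[] offsets, byte[] data, int sliceOffset, int length) {
    int base = offsets[sliceOffset];
    int[] newOffsets = new int[length + 1];
    for (int i = 0; i <= length; i++) {
      // Shift every offset so that the first value of the slice starts at byte 0.
      newOffsets[i] = offsets[sliceOffset + i] - base;
    }
    // Copy only the bytes that the slice actually references.
    byte[] newData = Arrays.copyOfRange(data, base, offsets[sliceOffset + length]);
    return new Rebased(newOffsets, newData);
  }

  public static void main(String[] args) {
    // Parent array: ["ab", "cde", "f", "gh"]
    int[] offsets = {0, 2, 5, 6, 8};
    byte[] data = "abcdefgh".getBytes(StandardCharsets.US_ASCII);
    Rebased r = rebase(offsets, data, 1, 2); // slice ["cde", "f"]
    System.out.println(Arrays.toString(r.offsets));
    System.out.println(new String(r.data, StandardCharsets.US_ASCII));
  }
}
```

The copy is what makes this approach cost more than a zero-copy export: each sliced column is rebuilt before it crosses the language boundary.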

final List<ArrowRecordBatch> ret = stream(scanner.scan())
.flatMap(t -> stream(t.execute()))
.collect(Collectors.toList());
try {

Member

Seems this should actually be a try-with-resources or a try-finally?

Member Author

I tried to refactor this as a try-finally, but it seems that doing it properly requires some common changes to AutoCloseables.java.

See my draft:
zhztheplayer@c460f67

Do you think we can open another ticket for those changes too?

Member

Ah, no worries then. Thanks for looking.
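
For readers following this thread, the try-finally shape being discussed can be sketched with the standard library alone. This is only an illustration under stated assumptions: `CloseAllExample`, `Resource`, `closeAll`, and `useAll` are hypothetical names, `Resource` stands in for something like a record batch, and none of this is Arrow's `AutoCloseables` or the draft commit linked above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of releasing a collected list of resources in a finally block. */
public final class CloseAllExample {
  /** Stand-in for a closeable resource such as a record batch. */
  public static final class Resource implements AutoCloseable {
    public boolean closed = false;

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Closes every resource; failures are gathered as suppressed exceptions on one error. */
  public static void closeAll(List<? extends AutoCloseable> resources) {
    RuntimeException failure = null;
    for (AutoCloseable r : resources) {
      try {
        r.close();
      } catch (Exception e) {
        if (failure == null) {
          failure = new RuntimeException("failed to close one or more resources");
        }
        failure.addSuppressed(e);
      }
    }
    if (failure != null) {
      throw failure;
    }
  }

  /** Collects resources into out, uses them, and releases them even if the body throws. */
  public static int useAll(List<Resource> out, int count) {
    try {
      for (int i = 0; i < count; i++) {
        out.add(new Resource());
      }
      return out.size(); // the real test would run its assertions here
    } finally {
      closeAll(out);
    }
  }

  public static void main(String[] args) {
    List<Resource> resources = new ArrayList<>();
    System.out.println(useAll(resources, 3));
    System.out.println(resources.stream().allMatch(r -> r.closed));
  }
}
```

A plain try-with-resources does not fit directly because the resources are produced as a list rather than declared one by one, which is presumably why a shared helper came up in the discussion.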

Contributor

@lwhite1 lwhite1 left a comment

Looks good to me.

@pitrou pitrou closed this in dc97883 Apr 26, 2022
@zhztheplayer
Member Author

Thanks everyone!

@ursabot

ursabot commented May 2, 2022

Benchmark runs are scheduled for baseline = d464425 and contender = dc97883. dc97883 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.08% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.59% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] dc97883d ec2-t3-xlarge-us-east-2
[Finished] dc97883d test-mac-arm
[Finished] dc97883d ursa-i9-9960x
[Finished] dc97883d ursa-thinkcentre-m75q
[Finished] d4644254 ec2-t3-xlarge-us-east-2
[Finished] d4644254 test-mac-arm
[Finished] d4644254 ursa-i9-9960x
[Finished] d4644254 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…ectorSchemaRoot

Added simple utility API to share data between C++ and Java codes. The methods are directly calling C Data Interface API.

Updated Java dataset codes to use the new API instead of passing buffer pointers over JNI.

This is also a dependency of ARROW-11776 (PR apache#10201).

Closes apache#10883 from zhztheplayer/ARROW-7272

Authored-by: Hongze Zhang <hongze.zhang@intel.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
9 participants