-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-7272: [C++][Java][Dataset] JNI bridge between RecordBatch and VectorSchemaRoot #10883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
53ed02e to
28742ab
Compare
01b6d3d to
1f4805d
Compare
|
Hi @kiszk, @emkornfield, @fsaintjacques. Would you like to help review this PR as a dependency of the ongoing ARROW-11776 fix? Thanks a lot. The major code changes are in second commit 1f4805d without the Nit stuffs which is in another one. |
cpp/src/jni/dataset/jni_util.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would like to ask you to avoid these types of the cleanup in this huge PR. Increasing # of changed lines makes us harder to review this PR.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. The Nit changes was now removed from this PR. I'll make another after this one get merged, because now in this PR we should follow the old style
cpp/src/jni/dataset/jni_util.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto for changing name.
fd07e19 to
4f09d18
Compare
|
Would it be good to add tests for DeserializeUnsafeFromJava and SerializeUnsafeFromNative to jni_util_test.cc? Test cases make us clear to do things in the methods. |
I can give it a try to add mock based tests in C++ side. Actually the round-trip test already covered this? https://github.com/apache/arrow/pull/10883/files#diff-e1e74b2b4a2812b4e273413ec2beb7f89bb62f2a4172f832805d68a87a773107R100-R101 |
|
Thank you for your try. We are looking forward to seeing it. The end-to-end test looks good for the coverage. IMHO, unit tests help the review and narrow down the (future) potential problems. |
|
I completely agree that we should add more tests though I was thinking this part of code could stay here in Dataset code for short term (that is also why I didn't put these codes to a common module), as I preferred to integrate Java C ABI which is pending on another proposal, to replace this half-serialization implementation. Anyway as long as we think the C++ mock test is needed here then I would try adding them to this PR. (Once I get free some time to do it. A bit busy these days. Sorry for letting you wait) |
4f09d18 to
c004e81
Compare
|
I have added a mock test against batch exporting (unsafe serialization) code from C++ side. 0c1d58f Unsafe importing part will be even tricky so I didn't add test case for that. The function |
c004e81 to
0c1d58f
Compare
9508565 to
51e235f
Compare
|
@kiszk Any thoughts about the current code? @emkornfield Would you like to take a look on this too? I think It would be great if we can bring the Java dataset write feature #10201 to version 6.0.0. |
|
ARROW-12965 is now solved, so this should be simplified to reuse the C data interface. |
Agreed. I'll update the code. |
|
@zhztheplayer sorry for the delay here, did you get a chance to update this to the C-FFI usage? |
Sorry, haven't got chance to work on it yet. I will try to see if I can make a patch in next week. Thanks. |
51e235f to
f469192
Compare
|
@emkornfield I have updated the PR. I think the CI might be broken since I haven't changed the scripts yet. But you can review the codes if you want to. I will update CI scripts later some time in this PR. Thanks. |
|
Travis failure doesn't seem to be related to the changes. Other jobs were passed. |
|
@kiszk Can we bring the patch into release 8.0.0? Hopefully it's not too late. Thanks. |
|
@github-actions crossbow submit java* |
|
Revision: 5fdb38b369d5f7daa8b2e483ae9f1b932460c0ee Submitted crossbow builds: ursacomputing/crossbow @ actions-1945
|
|
@zhztheplayer Can you update the PR description? |
|
cc @kszucs |
|
@kiszk can you approve the Java changes? Then we can include it in the release. |
|
Sure, let me see. |
|
@lwhite1 You may want to take a look here. |
|
Also @zhztheplayer please rebase on master so that we can fix Travis-CI builds. |
5fdb38b to
879ccb0
Compare
|
Travis is fixed but some new errors emerged which seems to be related to network issues. @pitrou can you help retrigger the failed jobs? |
|
It seems someone else did it for me :-) |
| env->ExceptionDescribe(); | ||
| env->ExceptionClear(); | ||
| return Status::Invalid("Error during calling Java code from native code"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of simply dumping the exception, is it possible to add its description to the Status instance that is being returned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be deferred to a later JIRA btw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I'll open another ticket for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I opened https://issues.apache.org/jira/browse/ARROW-16329
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assigned to me. Thanks.
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, some minor comments
| * ScanTask is meant to be a unit of work to be dispatched. The implementation | ||
| * must be thread and concurrent safe. | ||
| */ | ||
| public interface ScanTask extends AutoCloseable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N.B. we should eventually remove this class entirely as the corresponding interface no longer exists in C++ ARROW-15745
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can assign to myself. Thanks.
| for (int i = 0; i < schema->num_fields(); ++i) { | ||
| std::vector<std::shared_ptr<arrow::Array>> offset_zeroed_arrays; | ||
| for (int i = 0; i < record_batch->num_columns(); ++i) { | ||
| // TODO: If the array has an offset then we need to de-offset the array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this TODO still relevant? It seems it's not a TODO anymore but an explanatory note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still valid. The codes still copy and rebuild offset-buffers because Java C Data Interface doesn't implement exporing/importing for offset-buffers too. And I believe to do that we should make systematic change to Java code since Arrow Java didn't implement the same offset semantic comparing to C++.
| final List<ArrowRecordBatch> ret = stream(scanner.scan()) | ||
| .flatMap(t -> stream(t.execute())) | ||
| .collect(Collectors.toList()); | ||
| try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this should actually be a try-with-resources or a try-finally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to factor out as try-finally but it seems that it better requires some common changes to AutoCloseables.java.
See my draft:
zhztheplayer@c460f67
Do you think we can have another ticket for the changes too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, no worries then. Thanks for looking.
lwhite1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
|
Thanks everyone! |
|
Benchmark runs are scheduled for baseline = d464425 and contender = dc97883. dc97883 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…ectorSchemaRoot Added simple utility API to share data between C++ and Java codes. The methods are directly calling C Data Interface API. Updated Java dataset codes to use the new API instead of passing buffer pointers over JNI. This is also a dependency of ARROW-11776 (PR apache#10201). Closes apache#10883 from zhztheplayer/ARROW-7272 Authored-by: Hongze Zhang <hongze.zhang@intel.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Added simple utility API to share data between C++ and Java codes. The methods are directly calling C Data Interface API.
Updated Java dataset codes to use the new API instead of passing buffer pointers over JNI.
This is also a dependency of ARROW-11776 (PR #10201).