-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-4714: [C++][JAVA] Providing JNI interface to Read ORC file via Arrow C++ #4348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
yuruiz
commented
May 20, 2019
- setup necessary dev environment for JNI development on JAVA and C++ codebase
- implemented JNI interface to enable reading arrow record batch from ORC files
- implemented a naive arrow buffer reference manager to ensure c++ memory release
java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java
Outdated
Show resolved
Hide resolved
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some initial comments, I'm not super familiar with JNI so a I apologize if some of them don't make sense.
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcReader.java
Outdated
Show resolved
Hide resolved
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcReader.java
Outdated
Show resolved
Hide resolved
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcReader.java
Outdated
Show resolved
Hide resolved
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcReaderJniWrapper.java
Outdated
Show resolved
Hide resolved
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcReaderJniWrapper.java
Outdated
Show resolved
Hide resolved
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcReaderJniWrapper.java
Outdated
Show resolved
Hide resolved
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcStripeReader.java
Outdated
Show resolved
Hide resolved
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcStripeReader.java
Outdated
Show resolved
Hide resolved
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a couple questions about JNI code organization and build flags for C++
|
Hi @praveenbingo @emkornfield , all your comments have been addressed. Do you guys have any other comments regarding the latest changes? |
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took another pass through, this seems much cleaner. Will try to take a more careful look tomorrow.
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcJniUtils.java
Outdated
Show resolved
Hide resolved
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcMemoryJniWrapper.java
Outdated
Show resolved
Hide resolved
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcMemoryJniWrapper.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be good to throw when we truncate memory size (at least have a debug mode that does this). Maybe not for this CL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I fully understand you here. Under what circumstances would we truncate the memory size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if getSize() returns a value greater than MAX_INT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should solve such an issue here. If an arrow buffer size could go beyond MAX_INT, it will also cause issue on IPC ser/deser.
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcJniUtils.java
Outdated
Show resolved
Hide resolved
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcJniUtils.java
Outdated
Show resolved
Hide resolved
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcMemoryJniWrapper.java
Outdated
Show resolved
Hide resolved
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcReader.java
Outdated
Show resolved
Hide resolved
|
@praveenbingo or @pravindra can one of you take another look at the JNI stuff? |
cpp/src/jni/orc/concurrent_map.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the change back, the kInitiModuleId is what should be used for constants (also static shouldn't be required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-static data member cannot be constexpr
and here is just changing the name to make it comply with Arrow C++ naming convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which naming convention are you referring to? As as I know constants are generally of the form kInitModuleId (non-static members follow the convention you have here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little confused about the naming convention in ARROW c++. In type.h I can find a lot static constexpr members follow my current convention like "type_id" so I thought this maybe a more consistent convention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think at times were weren't consistent with our own conventions.
Theoretically we follow the google style guide with only a few exceptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least for now let's follow the way like init_module_id_ to be consistent with rest of the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have confused the point. Old code has both styles, New code should (and mostly does) follow the kInitModuleId style. At this point it would be counter-productive to change it back (I'd like to merge once CI passes) but in the future please use what is proscribed in the style guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! Thank you!
will do a pass tomorrow my time. |
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcStripeReader.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think this is right implementation of this method..can you please check the impl in BufferLedger..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I know, currently this is just a naive implementation. Currently the bufferLedger implementation is relied on the allocator and allocationManager. Some minor refactor is required to have the native allocated memory fully integrated into the current system. I would make the changes in separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An unsupported operation might be better then? Till we implement fully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we throw exception here. We couldn't pass the basic unittests since retain() will be called when loading orc memory into ArrowRecordBatch. So I would prefer to leave it here to allow basic validation and someone who interested to play around. I can add a warning here stating the api here is not stable and will subject to change in future.
java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcReferenceManager.java
Outdated
Show resolved
Hide resolved
Codecov Report
@@ Coverage Diff @@
## master #4348 +/- ##
=========================================
+ Coverage 88.42% 89.2% +0.78%
=========================================
Files 784 696 -88
Lines 100287 91686 -8601
Branches 1251 0 -1251
=========================================
- Hits 88678 81793 -6885
+ Misses 11373 9893 -1480
+ Partials 236 0 -236
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should not these methods declare that they can throw exceptions (since it looks like we are throwing exceptions on invalid id).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exceptions are declared in OrcReader.java. I am not sure if I can define throw exception signature on native methods.
In any case, this methods are package private and should not be directly invoked by external user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Native methods can declare exceptions, i am not sure of the behavior when a native method throws a checked exception when it does not declare in its signature, again i am ok to follow up on this later with a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah looks like you are throwing runtime exceptions now - which is better, thanks
cpp/src/jni/orc/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are a lot of local references being created to java objects - can you please test with a large file say of 100 columns or more and see if this goes through..from what i remember jvm needed native methods to inform it if more than 16 references were needed using ensureLocalCapacity()..
Reference : https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html
I am ok if you want to create a followup JIRA for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praveenbingo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks a ton @yuruiz for all of the work !
@emkornfield Please go ahead if this looks good to you.
|
@yuruiz this looks ok to me with two small comments. Could you also open up JIRAs to add this code to CI? |
|
setup basic development for native orc reader
- refactor JNI interface to make the underlying resource id explicit - add concurrent map for resource management - hold reference to class and method field to avoid unnecessary overhead
Convert OrcReaderJniWrapper to singleton model to allow client handle library loading exception
|
+1, LGTM. thank you . |
…Arrow C++ - setup necessary dev environment for JNI development on JAVA and C++ codebase - implemented JNI interface to enable reading arrow record batch from ORC files - implemented a naive arrow buffer reference manager to ensure c++ memory release Author: Yurui Zhou <yurui.zyr@alibaba-inc.com> Closes apache#4348 from yuruiz/JniOrcReader and squashes the following commits: 41592bf <Yurui Zhou> minor doc fix 44b5420 <Yurui Zhou> make sure lookup operation are performed under lock 706c8dc <Yurui Zhou> resolve comments de8529c <Yurui Zhou> resolve comments fc80175 <Yurui Zhou> resolve comments 9b04b76 <Yurui Zhou> fix style issues and add proper docs 9b13d7f <Yurui Zhou> replace nullptr with NULLPTR macro dd981af <Yurui Zhou> fix lint and clang-format 44505df <Yurui Zhou> Fix cmake format f2a0c04 <Yurui Zhou> destruct schema reader when finish reading 4f89e34 <Yurui Zhou> Make sure resources are properly released. 26d74db <Yurui Zhou> fix minor style check error ce30933 <Yurui Zhou> Add Arrow Jni Reader Unittests 7a80fbd <Yurui Zhou> Minor refactor e4c0630 <Yurui Zhou> remove redundant code e932aa8 <Yurui Zhou> Move jni code to src/jni and change build flag to arrow_jni 1b6a704 <Yurui Zhou> Interface refactor and performance optimization 3604c24 <Yurui Zhou> Resolve merge conflicts 1c0e0b2 <Yurui Zhou> Fix minor build errors e0d9c1f <Yurui Zhou> implement JNI interface on both size a1e80a6 <Yurui Zhou> Add arrow-orc setup