-
Notifications
You must be signed in to change notification settings - Fork 4k
WIP: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI #10201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield
I managed to migrate to existing flatbuffers API from protobuf.
previous comment #10108 (comment)
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spaces are not necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! But I don't see we always make consistent practices of Javadoc alignment in Arrow Java code. E.g.
arrow/java/gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/JniWrapper.java
Lines 34 to 40 in 78c88a9
| * @param schemaBuf The schema serialized as a protobuf. See Types.proto | |
| * to see the protobuf specification | |
| * @param exprListBuf The serialized protobuf of the expression vector. Each | |
| * expression is created using TreeBuilder::MakeExpression. | |
| * @param selectionVectorType type of selection vector | |
| * @param configId Configuration to gandiva. | |
| * @return A moduleId that is passed to the evaluateProjector() and closeProjector() methods |
Maybe what we need here is to add a relevant rule to checkstyle config and do a complete clean up?
cpp/src/jni/dataset/jni_util.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you prefix org with "::"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks but did you mean to use namespace flatbuf = ::org::apache::arrow::flatbuf;? I did a search in our C++ code base using RegEx namespace \w* = , and it seems to be a common practice not adding the leading ::
[root@localhost arrow]# grep -R "namespace [a-z A-Z][a-z A-Z]* =.*" cpp/src/
cpp/src/parquet/column_reader.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/column_writer.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/encoding_test.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/arrow/reader.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/arrow/reader_internal.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/encoding.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/column_writer_test.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/statistics_test.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/jni/dataset/jni_util.cc:namespace flatbuf = org::apache::arrow::flatbuf;
cpp/src/arrow/ipc/json_simple.cc:namespace rj = arrow::rapidjson;
cpp/src/arrow/ipc/metadata_internal.cc:namespace flatbuf = org::apache::arrow::flatbuf;
cpp/src/arrow/ipc/reader.cc:namespace flatbuf = org::apache::arrow::flatbuf;
cpp/src/arrow/ipc/metadata_internal.h:namespace flatbuf = org::apache::arrow::flatbuf;
cpp/src/arrow/compute/kernels/aggregate_benchmark.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/arrow/filesystem/s3_test_util.h:namespace bp = boost::process;
cpp/src/arrow/filesystem/s3fs_test.cc:namespace bp = boost::process;
cpp/src/arrow/gpu/cuda_arrow_ipc.cc:namespace flatbuf = org::apache::arrow::flatbuf;
cpp/src/arrow/flight/flight_benchmark.cc:namespace perf = arrow::flight::perf;
cpp/src/arrow/flight/test_util.cc:namespace bp = boost::process;
cpp/src/arrow/flight/test_util.cc:namespace fs = boost::filesystem;
cpp/src/arrow/flight/test_util.cc: namespace fs = boost::filesystem;
cpp/src/arrow/flight/internal.h:namespace pb = arrow::flight::protocol;
cpp/src/arrow/flight/server.cc:namespace pb = arrow::flight::protocol;
cpp/src/arrow/flight/serialization_internal.cc:namespace pb = arrow::flight::protocol;
cpp/src/arrow/flight/client.cc:namespace pb = arrow::flight::protocol;
cpp/src/arrow/flight/client.cc: namespace ge = GRPC_NAMESPACE_FOR_TLS_CREDENTIALS_OPTIONS;
cpp/src/arrow/flight/perf_server.cc:namespace perf = arrow::flight::perf;
cpp/src/arrow/flight/perf_server.cc:namespace proto = arrow::flight::protocol;
cpp/src/arrow/flight/flight_test.cc:namespace pb = arrow::flight::protocol;
cpp/src/arrow/adapters/orc/adapter.cc:namespace liborc = orc;
cpp/src/arrow/adapters/orc/adapter_util.h:namespace liborc = orc;
cpp/src/arrow/adapters/orc/adapter_test.cc:namespace liborc = orc;
cpp/src/arrow/adapters/orc/adapter_util.cc:namespace liborc = orc;
cpp/src/arrow/json/object_parser.cc:namespace rj = arrow::rapidjson;
cpp/src/arrow/json/test_common.h:namespace rj = arrow::rapidjson;
cpp/src/arrow/json/chunker.cc:namespace rj = arrow::rapidjson;
cpp/src/arrow/json/parser.cc:namespace rj = arrow::rapidjson;
cpp/src/arrow/json/object_writer.cc:namespace rj = arrow::rapidjson;
cpp/src/arrow/testing/json_internal.h:namespace rj = arrow::rapidjson;
cpp/src/plasma/store.cc:namespace fb = plasma::flatbuf;
cpp/src/plasma/client.cc:namespace fb = plasma::flatbuf;
cpp/src/plasma/common.cc:namespace fb = plasma::flatbuf;
cpp/src/plasma/plasma.cc:namespace fb = plasma::flatbuf;
cpp/src/plasma/test/serialization_tests.cc:namespace fb = plasma::flatbuf;
cpp/src/plasma/protocol.cc:namespace fb = plasma::flatbuf;
cpp/src/jni/dataset/jni_util.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please try to avoid these types of cleanups with large functional changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that. :( Will make split changes next time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @emkornfield. To split these types of changes into another PR make us easier to review this PR at this time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or you think we should have a individual PR rather than a commit to fix the style stuffs? I am okay to either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me take a look.
cpp/src/jni/dataset/jni_util.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
including the message with Invalid could aid users. Status's have a member for arbitrary details. it might pay to create a new extension of this class that can keep a reference to the exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in commit 922d6dd143c12984daa788d9d737a75914e8a678.
cpp/src/jni/dataset/jni_util.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the performance overhead of this. Could another approach be to use to record batches. One that contains the data, and one that contains all the reference pointers?
Or at the very least only one metadata entry and encode all the buffer references as some sort of flatbuffer, protobuf or json list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed commit 8ef07be7b42cc9a4e83326f6a785457d664678ee.
Here I chose to encode the references within json arrays. It might be possible to use extra arrow arrays/recordbatch to store the reference pointers but we still have to manage the reference pointers to the ref arrays/recordbatch themselves. Let's have a rework on this within implementing C data interface for Java in future.
cpp/src/jni/dataset/jni_wrapper.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @lidavidm @westonpace what is the current status of changes in cardinality for ScanTask? is this a use that should be supported long term?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ScanTask is going away. However, from a quick glance, what we want here is a Scanner from a RecordBatchIterator. Use the existing Scanner::FromRecordBatchReader:
arrow/cpp/src/arrow/dataset/scanner.h
Lines 323 to 329 in 8b0e049
| /// \brief Make a scanner from a record batch reader. | |
| /// | |
| /// The resulting scanner can be scanned only once. This is intended | |
| /// to support writing data from streaming sources or other sources | |
| /// that can be iterated only once. | |
| static std::shared_ptr<ScannerBuilder> FromRecordBatchReader( | |
| std::shared_ptr<RecordBatchReader> reader); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 71cba25. Thanks for adding the facility.
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like there might be two changes in one here, that potentially could be split up and ease reviewing.
- Refactoring to allow passing native java buffers to C++ code. For this, another option that might be simpler/more reliable is to try to use the Arrow C ABI to translate the data across java and C++. Did you consider this?
- Allowing writes.
Is this right? would it be hard to make these changes separately.
Once we had a short discussion around this https://issues.apache.org/jira/browse/ARROW-7272?focusedCommentId=16983849&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16983849 C data interface does sound to be better. Although I believe a Java implementation will require a design that probably tightly relies on JNI which makes the implementation slightly more complex than in other language's case. I think we can open a separate JIRA ticket for that to do the work in future as we don't yet have the implementation in Java. So far I have extracted the commit that refactored current code within flatbuffers and marked it to resolve the issue ARROW-7272 (if it's OK I may open an individual PR within the commit for 7272). What do you think? |
|
JNI CI failure is related to an existing issue https://issues.apache.org/jira/browse/ARROW-12838 |
9b9319b to
cc13ac2
Compare
|
Hi @emkornfield, do you have any other comments on the current code? Sorry for pinging you again as I would like to see if we can make this into 5.0.0. To ease review, the changes was split into two parts:
Thanks! |
cc13ac2 to
0476994
Compare
0476994 to
fcf1e14
Compare
b7fcd97 to
0acf50d
Compare
0acf50d to
cc5ea4f
Compare
|
Part of this work around ARROW-7272 has been moved to another PR #10883. |
|
Marked title as WIP to let #10883 to be reviewed first. |
|
@zhztheplayer shall we close this as stale? |
…ectorSchemaRoot Added simple utility API to share data between C++ and Java codes. The methods are directly calling C Data Interface API. Updated Java dataset codes to use the new API instead of passing buffer pointers over JNI. This is also a dependency of ARROW-11776 (PR #10201). Closes #10883 from zhztheplayer/ARROW-7272 Authored-by: Hongze Zhang <hongze.zhang@intel.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
I suggest to keep it open for a while, as its dependency #10108 has been merged. I'm going to update this PR before 9.0.0. |
|
Also cc @lwhite1 for information and in case he's interested. |
|
Sorry still not ready to rework this. Will reopen before I get started. |
…ectorSchemaRoot Added simple utility API to share data between C++ and Java codes. The methods are directly calling C Data Interface API. Updated Java dataset codes to use the new API instead of passing buffer pointers over JNI. This is also a dependency of ARROW-11776 (PR apache#10201). Closes apache#10883 from zhztheplayer/ARROW-7272 Authored-by: Hongze Zhang <hongze.zhang@intel.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Sorry for messing up PR list but the previous PR is accidentally closed and not able to reopen. Let's use this one instead.
https://issues.apache.org/jira/browse/ARROW-11776