@zhztheplayer (Member) commented Apr 29, 2021

Sorry for cluttering the PR list, but the previous PR was accidentally closed and cannot be reopened. Let's use this one instead.

https://issues.apache.org/jira/browse/ARROW-11776

Member Author

@emkornfield I managed to migrate from protobuf to the existing flatbuffers API.

Previous comment: #10108 (comment)

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

Member

nit: spaces are not necessary

Member Author

Thanks! But I don't see that we follow a consistent practice for Javadoc alignment in the Arrow Java code. E.g.:

* @param schemaBuf The schema serialized as a protobuf. See Types.proto
* to see the protobuf specification
* @param exprListBuf The serialized protobuf of the expression vector. Each
* expression is created using TreeBuilder::MakeExpression.
* @param selectionVectorType type of selection vector
* @param configId Configuration to gandiva.
* @return A moduleId that is passed to the evaluateProjector() and closeProjector() methods

Maybe what we need here is to add a relevant rule to the checkstyle config and do a complete cleanup?

Contributor

nit: can you prefix org with "::"

Member Author

Thanks, but did you mean to use namespace flatbuf = ::org::apache::arrow::flatbuf;? I searched our C++ code base with the regex namespace \w* = , and it seems to be common practice not to add the leading ::

[root@localhost arrow]# grep -R "namespace [a-z A-Z][a-z A-Z]* =.*" cpp/src/
cpp/src/parquet/column_reader.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/column_writer.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/encoding_test.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/arrow/reader.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/arrow/reader_internal.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/encoding.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/column_writer_test.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/parquet/statistics_test.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/jni/dataset/jni_util.cc:namespace flatbuf = org::apache::arrow::flatbuf;
cpp/src/arrow/ipc/json_simple.cc:namespace rj = arrow::rapidjson;
cpp/src/arrow/ipc/metadata_internal.cc:namespace flatbuf = org::apache::arrow::flatbuf;
cpp/src/arrow/ipc/reader.cc:namespace flatbuf = org::apache::arrow::flatbuf;
cpp/src/arrow/ipc/metadata_internal.h:namespace flatbuf = org::apache::arrow::flatbuf;
cpp/src/arrow/compute/kernels/aggregate_benchmark.cc:namespace BitUtil = arrow::BitUtil;
cpp/src/arrow/filesystem/s3_test_util.h:namespace bp = boost::process;
cpp/src/arrow/filesystem/s3fs_test.cc:namespace bp = boost::process;
cpp/src/arrow/gpu/cuda_arrow_ipc.cc:namespace flatbuf = org::apache::arrow::flatbuf;
cpp/src/arrow/flight/flight_benchmark.cc:namespace perf = arrow::flight::perf;
cpp/src/arrow/flight/test_util.cc:namespace bp = boost::process;
cpp/src/arrow/flight/test_util.cc:namespace fs = boost::filesystem;
cpp/src/arrow/flight/test_util.cc:  namespace fs = boost::filesystem;
cpp/src/arrow/flight/internal.h:namespace pb = arrow::flight::protocol;
cpp/src/arrow/flight/server.cc:namespace pb = arrow::flight::protocol;
cpp/src/arrow/flight/serialization_internal.cc:namespace pb = arrow::flight::protocol;
cpp/src/arrow/flight/client.cc:namespace pb = arrow::flight::protocol;
cpp/src/arrow/flight/client.cc:          namespace ge = GRPC_NAMESPACE_FOR_TLS_CREDENTIALS_OPTIONS;
cpp/src/arrow/flight/perf_server.cc:namespace perf = arrow::flight::perf;
cpp/src/arrow/flight/perf_server.cc:namespace proto = arrow::flight::protocol;
cpp/src/arrow/flight/flight_test.cc:namespace pb = arrow::flight::protocol;
cpp/src/arrow/adapters/orc/adapter.cc:namespace liborc = orc;
cpp/src/arrow/adapters/orc/adapter_util.h:namespace liborc = orc;
cpp/src/arrow/adapters/orc/adapter_test.cc:namespace liborc = orc;
cpp/src/arrow/adapters/orc/adapter_util.cc:namespace liborc = orc;
cpp/src/arrow/json/object_parser.cc:namespace rj = arrow::rapidjson;
cpp/src/arrow/json/test_common.h:namespace rj = arrow::rapidjson;
cpp/src/arrow/json/chunker.cc:namespace rj = arrow::rapidjson;
cpp/src/arrow/json/parser.cc:namespace rj = arrow::rapidjson;
cpp/src/arrow/json/object_writer.cc:namespace rj = arrow::rapidjson;
cpp/src/arrow/testing/json_internal.h:namespace rj = arrow::rapidjson;
cpp/src/plasma/store.cc:namespace fb = plasma::flatbuf;
cpp/src/plasma/client.cc:namespace fb = plasma::flatbuf;
cpp/src/plasma/common.cc:namespace fb = plasma::flatbuf;
cpp/src/plasma/plasma.cc:namespace fb = plasma::flatbuf;
cpp/src/plasma/test/serialization_tests.cc:namespace fb = plasma::flatbuf;
cpp/src/plasma/protocol.cc:namespace fb = plasma::flatbuf;
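
For context on the nit itself: the leading :: only changes which namespace the alias names when an enclosing namespace also contains a nested namespace called org. The sketch below is a self-contained illustration of that lookup rule. The namespaces are stand-ins, and the nested arrow::org namespace is hypothetical (assumed here purely to provoke the ambiguity); nothing in the thread says such a namespace exists in the Arrow code base.

```cpp
#include <cassert>
#include <string>

// Stand-in for the generated flatbuffers namespace at global scope.
namespace org { namespace apache { namespace arrow { namespace flatbuf {
inline std::string Origin() { return "global org"; }
}}}}  // namespace org::apache::arrow::flatbuf

namespace arrow {

// Hypothetical nested namespace that also happens to be called `org`.
namespace org { namespace apache { namespace arrow { namespace flatbuf {
inline std::string Origin() { return "arrow::org"; }
}}}}  // namespace arrow::org::apache::arrow::flatbuf

// Without the leading ::, lookup of `org` starts in the enclosing
// namespace `arrow`, so this alias names arrow::org::apache::arrow::flatbuf.
namespace unqualified = org::apache::arrow::flatbuf;

// With the leading ::, lookup starts at the global namespace.
namespace qualified = ::org::apache::arrow::flatbuf;

inline std::string UnqualifiedOrigin() { return unqualified::Origin(); }
inline std::string QualifiedOrigin() { return qualified::Origin(); }

// Self-check showing that the two aliases resolve differently.
inline void CheckAliasLookup() {
  assert(UnqualifiedOrigin() == "arrow::org");
  assert(QualifiedOrigin() == "global org");
}

}  // namespace arrow
```

When no such shadowing namespace exists, as in the grep results above, both spellings name the same namespace, so the prefix is purely defensive.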

Contributor

Please try to avoid mixing these types of cleanups with large functional changes.

Member Author

Sorry about that. :( I will split such changes out next time.

@kiszk (Member) commented Aug 23, 2021

I agree with @emkornfield. Splitting these types of changes into another PR would make this PR easier to review.

Member Author

Hello @kiszk, this part of the code (cross-JNI data sharing) was moved to #10883 as a dependency of this PR. Would you like to have a look? Thanks a lot. I have added a separate style-fix commit in that one.

Member Author

Or do you think we should have an individual PR rather than a commit to fix the style issues? I am okay with either way.

Member

Sure, let me take a look.

Contributor

Including the message with Invalid could aid users. Status objects have a member for arbitrary details. It might pay to create a new extension of this class that can keep a reference to the exception?
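
A rough sketch of what this suggestion could look like follows. To stay self-contained it uses simplified stand-ins named Status and StatusDetail rather than Arrow's real classes, and JavaErrorDetail and StatusFromJavaException are hypothetical names. It also keeps only the exception message as a string; holding an actual reference to the Java exception object over JNI is left out.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-in for a status-detail base class (not Arrow's real API).
class StatusDetail {
 public:
  virtual ~StatusDetail() = default;
  virtual std::string ToString() const = 0;
};

// Hypothetical detail type that keeps the message of a Java exception.
class JavaErrorDetail : public StatusDetail {
 public:
  explicit JavaErrorDetail(std::string exception_message)
      : exception_message_(std::move(exception_message)) {}

  std::string ToString() const override {
    return "Java exception: " + exception_message_;
  }

 private:
  std::string exception_message_;
};

// Simplified stand-in for a status object with a message and optional detail.
class Status {
 public:
  static Status OK() { return Status(true, "", nullptr); }

  static Status Invalid(std::string msg,
                        std::shared_ptr<StatusDetail> detail = nullptr) {
    return Status(false, std::move(msg), std::move(detail));
  }

  bool ok() const { return ok_; }
  const std::shared_ptr<StatusDetail>& detail() const { return detail_; }

  // Appends the detail text, if any, to the error message.
  std::string ToString() const {
    if (ok_) return "OK";
    std::string out = "Invalid: " + message_;
    if (detail_) out += ". Detail: " + detail_->ToString();
    return out;
  }

 private:
  Status(bool ok, std::string msg, std::shared_ptr<StatusDetail> detail)
      : ok_(ok), message_(std::move(msg)), detail_(std::move(detail)) {
    // Error statuses are expected to carry a message.
    assert(ok_ || !message_.empty());
  }

  bool ok_;
  std::string message_;
  std::shared_ptr<StatusDetail> detail_;
};

// Hypothetical helper: wrap the message of a caught Java exception.
inline Status StatusFromJavaException(const std::string& exception_message) {
  return Status::Invalid("Error occurred in Java callback",
                         std::make_shared<JavaErrorDetail>(exception_message));
}
```

With this shape, a caller that only prints the status still sees the Java-side message, while a caller that knows about the detail type can inspect it through detail().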

Member Author

Fixed in commit 922d6dd143c12984daa788d9d737a75914e8a678.

Contributor

What is the performance overhead of this? Could another approach be to use two record batches: one that contains the data, and one that contains all the reference pointers?

Or, at the very least, use only one metadata entry and encode all the buffer references as some sort of flatbuffer, protobuf, or JSON list?

Member Author

Fixed in commit 8ef07be7b42cc9a4e83326f6a785457d664678ee.

Here I chose to encode the references as JSON arrays. It might be possible to use extra Arrow arrays or a record batch to store the reference pointers, but we would still have to manage the reference pointers to those arrays or that record batch themselves. Let's rework this when implementing the C data interface for Java in the future.
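
As a minimal sketch of the idea of packing all buffer references into one metadata value as a JSON array: the function names EncodeRefs and DecodeRefs are hypothetical, the encoding is hand-rolled for flat integer arrays only, and the actual layout used in the referenced commit is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Encode reference ids as a JSON array of integers, e.g. "[1,2,3]".
inline std::string EncodeRefs(const std::vector<int64_t>& refs) {
  std::ostringstream out;
  out << '[';
  for (std::size_t i = 0; i < refs.size(); ++i) {
    if (i > 0) out << ',';
    out << refs[i];
  }
  out << ']';
  return out.str();
}

// Decode the output of EncodeRefs. Only flat integer arrays without
// whitespace are handled; this is not a general JSON parser.
inline std::vector<int64_t> DecodeRefs(const std::string& json) {
  assert(json.size() >= 2 && json.front() == '[' && json.back() == ']');
  std::vector<int64_t> refs;
  std::istringstream in(json.substr(1, json.size() - 2));
  std::string token;
  while (std::getline(in, token, ',')) {
    refs.push_back(std::stoll(token));
  }
  return refs;
}
```

The encoded string can then be stored under a single metadata key, so the metadata carries one entry per batch instead of one entry per buffer.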

Contributor

CC @lidavidm @westonpace: what is the current status of changes in cardinality for ScanTask? Is this a use that should be supported long term?

Member

ScanTask is going away. However, from a quick glance, what we want here is a Scanner from a RecordBatchIterator. Use the existing Scanner::FromRecordBatchReader:

/// \brief Make a scanner from a record batch reader.
///
/// The resulting scanner can be scanned only once. This is intended
/// to support writing data from streaming sources or other sources
/// that can be iterated only once.
static std::shared_ptr<ScannerBuilder> FromRecordBatchReader(
    std::shared_ptr<RecordBatchReader> reader);

Member Author

Fixed in 71cba25. Thanks for adding the facility.

@emkornfield (Contributor) left a comment

It looks like there might be two changes in one here that could potentially be split up to ease reviewing.

  1. Refactoring to allow passing native Java buffers to C++ code. For this, another option that might be simpler/more reliable is to try to use the Arrow C ABI to translate the data between Java and C++. Did you consider this?
  2. Allowing writes.

Is this right? Would it be hard to make these changes separately?

@zhztheplayer (Member Author)

@emkornfield

Refactoring to allow passing native Java buffers to C++ code. For this, another option that might be simpler/more reliable is to try to use the Arrow C ABI to translate the data between Java and C++. Did you consider this?

We once had a short discussion about this: https://issues.apache.org/jira/browse/ARROW-7272?focusedCommentId=16983849&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16983849

The C data interface does sound better, although I believe a Java implementation will require a design that relies tightly on JNI, which makes it slightly more complex than in other languages. Since we don't yet have a Java implementation, I think we can open a separate JIRA ticket and do that work in the future. So far I have extracted the commit that refactored the current code to use flatbuffers and marked it as resolving ARROW-7272 (if it's OK, I may open an individual PR with that commit for ARROW-7272). What do you think?

@zhztheplayer (Member Author)

The JNI CI failure is related to an existing issue: https://issues.apache.org/jira/browse/ARROW-12838

@zhztheplayer zhztheplayer marked this pull request as draft May 25, 2021 01:43
@zhztheplayer zhztheplayer marked this pull request as ready for review May 27, 2021 02:32
@zhztheplayer zhztheplayer force-pushed the ARROW-11776 branch 2 times, most recently from 9b9319b to cc13ac2 Compare June 28, 2021 07:31
@zhztheplayer zhztheplayer requested a review from emkornfield June 28, 2021 08:32
@zhztheplayer (Member Author) commented Jun 28, 2021

Hi @emkornfield, do you have any other comments on the current code? Sorry for pinging you again; I would like to see if we can get this into 5.0.0.

To ease review, the changes were split into two parts:

  • Refactoring to allow passing native Java buffers to C++ code:
    4a0ddbe (the first commit)
  • Allowing writes:
    Other commits. diff

Thanks!

@zhztheplayer (Member Author)

Part of this work around ARROW-7272 has been moved to another PR #10883.

@zhztheplayer zhztheplayer changed the title from "ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI" to "WIP: ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JNI" Aug 24, 2021
@zhztheplayer (Member Author)

Marked the title as WIP to let #10883 be reviewed first.

@kszucs (Member) commented Apr 21, 2022

@zhztheplayer shall we close this as stale?

pitrou pushed a commit that referenced this pull request Apr 26, 2022
…ectorSchemaRoot

Added simple utility API to share data between C++ and Java codes. The methods are directly calling C Data Interface API.

Updated Java dataset codes to use the new API instead of passing buffer pointers over JNI.

This is also a dependency of ARROW-11776 (PR #10201).

Closes #10883 from zhztheplayer/ARROW-7272

Authored-by: Hongze Zhang <hongze.zhang@intel.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

@zhztheplayer (Member Author)

@zhztheplayer shall we close this as stale?

I suggest keeping it open for a while, as its dependency #10108 has been merged. I'm going to update this PR before 9.0.0.

@pitrou (Member) commented May 4, 2022

Also cc @lwhite1 for information and in case he's interested.

@zhztheplayer (Member Author) commented Sep 9, 2022

Sorry, I'm still not ready to rework this. I will reopen it before I get started.
