
Conversation

@JkSelf commented Sep 16, 2022

This PR aims to support Parquet write from ArrowReader to file.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@JkSelf JkSelf changed the title Support parquet write from scanner to file [ARROW-17752] [C++][JAVA][Parquet] Support parquet write from scanner to file Sep 16, 2022
@zhztheplayer (Member)

@JkSelf Can we just use the existing ticket ARROW-11776?

Comment on lines 122 to 133
Member

Is it necessary to conduct this kind of encoding?

Member

Let's add some unit tests.

You can refer to the previous PR https://github.com/apache/arrow/pull/10201/files

Member

Is it needed to create this class?

IteratorImpl seems to already handle everything.

Member

Maybe we can rename this to CRecordBatchIterator or something, to make things clearer.

Comment on lines 256 to 257
Member

There might be a better way to distinguish these two variables than just naming them iter and itr?

Member

arrowArray may need to be closed eventually, or memory may leak.
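For readers unfamiliar with the C Data Interface lifecycle, here is a minimal sketch of the close pattern being asked for, assuming the org.apache.arrow.c classes; the class and method names are illustrative, not this PR's actual code:

```java
import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;

class ExportExample {
  // Sketch: release the exported ArrowArray deterministically once the native
  // side has imported it, instead of leaving it to allocator leak detection.
  static void exportBatch(BufferAllocator allocator, VectorSchemaRoot root) {
    try (ArrowArray arrowArray = ArrowArray.allocateNew(allocator)) {
      Data.exportVectorSchemaRoot(allocator, root, /* dictionaryProvider= */ null, arrowArray);
      // ... pass arrowArray.memoryAddress() to the JNI call here; the importer
      // takes ownership of the exported data via the release callback ...
    } // try-with-resources closes the Java-side wrapper struct
  }
}
```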

@lidavidm (Member)

Why isn't this just using the C Data Interface instead of doing things like serializing schemas and manually adapting iterators?

@lidavidm (Member)

In particular you can export a stream now and that will arrive in C++ as a RecordBatchReader
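For context, a rough sketch of that suggestion on the Java side, assuming the org.apache.arrow.c APIs; the method name and hand-off are illustrative, and the C++ side would import the same struct (e.g. with arrow::ImportRecordBatchReader):

```java
import org.apache.arrow.c.ArrowArrayStream;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;

class StreamExportExample {
  // Sketch: export an entire ArrowReader as one ArrowArrayStream; the native
  // side imports it as a RecordBatchReader, so no custom JNI iterator is needed.
  static void writeViaStream(BufferAllocator allocator, ArrowReader reader) {
    try (ArrowArrayStream stream = ArrowArrayStream.allocateNew(allocator)) {
      Data.exportArrayStream(allocator, reader, stream);
      // ... pass stream.memoryAddress() to a native write function here ...
    }
  }
}
```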

@JkSelf JkSelf changed the title [ARROW-17752] [C++][JAVA][Parquet] Support parquet write from scanner to file [ARROW-11776] [C++][JAVA][Parquet] Support parquet write from scanner to file Sep 19, 2022
@JkSelf (Author) commented Sep 23, 2022

@zhztheplayer @lidavidm
Sorry for the delayed response. I have resolved the comments. Please help review again. Thanks.

Member

Same question: why are we manually bridging data to C++, and why do we have CRecordBatchIterator at all, when we could use ArrowArrayStream instead? This should just take an ArrowArrayStream as the data source.

Author

We already use ArrowArrayStream to share data with native code. Here CRecordBatchIterator is used to iterate over the ArrowArrayStream object. I have renamed CRecordBatchIterator to CArrowArrayStreamIterator. Sorry for the misleading name.

Member

No, what I mean is that you do not need the iterator at all. Implement the ArrowReader API for Scanner or whatever the Java-side data source is, then export that once and import it once on the C++ side. The JNI side should not need to know about any new Java-side classes or methods.
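A skeleton of what that suggestion amounts to; the Scanner method names below are assumptions based on the dataset module, and the class the PR eventually adds is ArrowScannerReader:

```java
import java.io.IOException;

import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;

// Skeleton of the suggested shape: an ArrowReader facade over the Java-side
// Scanner, exported once to native code. The scanner-handling details are
// placeholders; the real class added by this PR is ArrowScannerReader.
public class ScannerBackedReader extends ArrowReader {
  private final Scanner scanner;
  private ArrowReader currentTaskReader; // reader for the scan task in progress

  public ScannerBackedReader(Scanner scanner, BufferAllocator allocator) {
    super(allocator);
    this.scanner = scanner;
  }

  @Override
  public boolean loadNextBatch() throws IOException {
    // Walk the scanner's tasks, delegating to each task's reader and loading
    // its batch into this reader's VectorSchemaRoot (details omitted here).
    return false;
  }

  @Override
  public long bytesRead() {
    return 0; // byte counting is not meaningful for a scanner-backed source
  }

  @Override
  protected void closeReadSource() throws IOException {
    try {
      if (currentTaskReader != null) {
        currentTaskReader.close();
      }
      scanner.close();
    } catch (Exception e) {
      throw new IOException(e);
    }
  }

  @Override
  protected Schema readSchema() throws IOException {
    return scanner.schema();
  }
}
```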

Author

Thanks for your detailed explanation. I have removed the iterator and now use the ArrowArrayStream as the data source.

@lidavidm (Member)

CC @lwhite1 @davisusanibar

@lidavidm lidavidm changed the title [ARROW-11776] [C++][JAVA][Parquet] Support parquet write from scanner to file ARROW-11776: [C++][Java] Support parquet write from scanner to file Sep 30, 2022
@github-actions

⚠️ Ticket has no components in JIRA, make sure you assign one.

Comment on lines 54 to 64
Member

I don't think this is right. We should implement ArrowReader over Scanner, then write the entire dataset in one go. And then this method can be used to write any ArrowReader, not just Scanners.

Author

Added ArrowScannerReader over Scanner.
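Roughly how the result is meant to be used, as a sketch; the DatasetFileWriter overload and the ArrowScannerReader argument order here are assumptions and may differ from the merged code:

```java
import org.apache.arrow.dataset.file.DatasetFileWriter;
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.scanner.ArrowScannerReader;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.memory.BufferAllocator;

class WriteExample {
  // Sketch: wrap a Scanner in the new ArrowScannerReader, then hand it to
  // DatasetFileWriter like any other ArrowReader to write Parquet files.
  static void writeScannerToParquet(Scanner scanner, BufferAllocator allocator, String uri)
      throws Exception {
    try (ArrowScannerReader reader = new ArrowScannerReader(scanner, allocator)) {
      DatasetFileWriter.write(allocator, reader, FileFormat.PARQUET, uri);
    }
  }
}
```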

@JkSelf JkSelf changed the title ARROW-11776: [C++][Java] Support parquet write from scanner to file ARROW-11776: [C++][Java] Support parquet write from ArrowReader to file Oct 19, 2022
@JkSelf JkSelf force-pushed the parquet_write_from_datasets branch from 85ffbd5 to 7578bca on October 20, 2022 13:48
Member

Is there an issue if we potentially double-close these?

Author

If it is already closed, the close method returns immediately and performs no further operation.
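In other words, the close is guarded so a second call is a no-op; schematically (the class name is illustrative, not the PR's code):

```java
class NativeResource implements AutoCloseable {
  private boolean closed = false;

  // Sketch of the idempotent-close pattern: the guard flag makes a second
  // close() return immediately instead of releasing resources twice.
  @Override
  public void close() {
    if (closed) {
      return;
    }
    closed = true;
    // ... release the underlying native handle exactly once ...
  }
}
```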

@lidavidm (Member)

There are lint errors: https://github.com/apache/arrow/actions/runs/3290191393/jobs/5426029066#step:6:7990


[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:28: Wrong order for 'java.io.IOException' import. [ImportOrder]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:31: Missing a Javadoc comment. [JavadocType]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:39:3: Missing a Javadoc comment. [JavadocMethod]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:61: 'if' construct must use '{}'s. [NeedBraces]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:86:5: WhitespaceAround: 'try' is not followed by whitespace. Empty blocks may only be represented as {} when not part of a multi-block statement (4.1.3) [WhitespaceAround]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/file/JniWrapper.java:60:50: Parameter name 'stream_address' must match pattern '^[a-z][a-zA-Z0-9]*$'. [ParameterName]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:20:8: Unused import: java.io.ByteArrayOutputStream. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:22:8: Unused import: java.io.IOException. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:23:8: Unused import: java.nio.channels.Channels. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:38:8: Unused import: org.apache.arrow.flatbuf.RecordBatch. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:44:8: Unused import: org.apache.arrow.vector.ipc.WriteChannel. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:46:8: Unused import: org.apache.arrow.vector.ipc.message.MessageSerializer. [UnusedImports]
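For anyone hitting the same checkstyle rules, the fixes are mechanical; an illustrative example (not the PR's actual file, class and method names are made up):

```java
// ImportOrder: java.* imports come first, in their own group, before org.* imports.
import java.io.IOException;

import org.apache.arrow.vector.ipc.ArrowReader;

/** JavadocType: public classes need a Javadoc comment. */
public class LintExample {
  private ArrowReader currentReader;

  /** JavadocMethod: public methods need a Javadoc comment too. */
  public boolean hasMoreBatches() throws IOException {
    // NeedBraces: even single-statement if bodies get braces.
    if (currentReader == null) {
      return false;
    }
    // ParameterName / UnusedImports: use camelCase parameter names
    // (streamAddress, not stream_address) and delete imports that are unused.
    return currentReader.loadNextBatch();
  }
}
```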

@JkSelf (Author) commented Oct 26, 2022

@lidavidm
Do you have any further comments?

@lidavidm (Member)

Can you rebase to see if the CI issue is fixed?

@JkSelf JkSelf force-pushed the parquet_write_from_datasets branch from 7988012 to 5f86c71 on October 27, 2022 02:04
@JkSelf (Author) commented Oct 27, 2022

> Can you rebase to see if the CI issue is fixed?

Rebased.

@lidavidm (Member)

There's still a lint error here:

Warning:  src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:[43,3] (javadoc) JavadocMethod: Missing a Javadoc comment.

@lidavidm (Member) left a comment

Thank you!

@lidavidm lidavidm merged commit dddf38f into apache:master Oct 28, 2022
@ursabot commented Oct 29, 2022

Benchmark runs are scheduled for baseline = a2881a1 and contender = dddf38f. dddf38f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️25.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.56% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.11% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] dddf38f5 ec2-t3-xlarge-us-east-2
[Failed] dddf38f5 test-mac-arm
[Finished] dddf38f5 ursa-i9-9960x
[Finished] dddf38f5 ursa-thinkcentre-m75q
[Finished] a2881a12 ec2-t3-xlarge-us-east-2
[Failed] a2881a12 test-mac-arm
[Finished] a2881a12 ursa-i9-9960x
[Finished] a2881a12 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot commented Oct 29, 2022

['Python', 'R'] benchmarks have a high level of regressions on test-mac-arm.

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…le (apache#14151)

This PR aims to support Parquet write from ArrowReader to file.

Authored-by: Jia Ke <ke.a.jia@intel.com>
Signed-off-by: David Li <li.davidm96@gmail.com>