
Conversation

@JkSelf commented Sep 16, 2022

This PR aims to support Parquet write from ArrowReader to file.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@JkSelf JkSelf changed the title Support parquet write from scanner to file [ARROW-17752] [C++][JAVA][Parquet] Support parquet write from scanner to file Sep 16, 2022
@zhztheplayer (Member)

@JkSelf Can we just use the existing ticket ARROW-11776?

Comment on lines 122 to 133
Member

Is it necessary to conduct this kind of encoding?

Member

Let's add some unit tests.

You can refer to the previous PR https://github.com/apache/arrow/pull/10201/files

Member

Is it needed to create this class?

IteratorImpl seems to already handle everything.

Member

Maybe we can rename this to CRecordBatchIterator or something, to make things clearer.

Comment on lines 256 to 257
Member

There might be a better way to distinguish these two variables than just naming them iter and itr?

Member

arrowArray may need to be closed eventually, or memory may leak.
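For readers unfamiliar with the C Data Interface lifecycle, here is a minimal sketch of the close pattern being asked for, assuming the org.apache.arrow.c classes; the class and method names are illustrative, not this PR's actual code:

```java
import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;

class ExportExample {
  // Sketch: release the exported ArrowArray deterministically once the native
  // side has imported it, instead of leaving it to allocator leak detection.
  static void exportBatch(BufferAllocator allocator, VectorSchemaRoot root) {
    try (ArrowArray arrowArray = ArrowArray.allocateNew(allocator)) {
      Data.exportVectorSchemaRoot(allocator, root, /* dictionaryProvider= */ null, arrowArray);
      // ... pass arrowArray.memoryAddress() to the JNI call here; the importer
      // takes ownership of the exported data via the release callback ...
    } // try-with-resources closes the Java-side wrapper struct
  }
}
```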

@lidavidm (Member)

Why isn't this just using the C Data Interface instead of doing things like serializing schemas and manually adapting iterators?

@lidavidm (Member)

In particular you can export a stream now and that will arrive in C++ as a RecordBatchReader
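For context, a rough sketch of that suggestion on the Java side, assuming the org.apache.arrow.c APIs; the method name and hand-off are illustrative, and the C++ side would import the same struct (e.g. with arrow::ImportRecordBatchReader):

```java
import org.apache.arrow.c.ArrowArrayStream;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;

class StreamExportExample {
  // Sketch: export an entire ArrowReader as one ArrowArrayStream; the native
  // side imports it as a RecordBatchReader, so no custom JNI iterator is needed.
  static void writeViaStream(BufferAllocator allocator, ArrowReader reader) {
    try (ArrowArrayStream stream = ArrowArrayStream.allocateNew(allocator)) {
      Data.exportArrayStream(allocator, reader, stream);
      // ... pass stream.memoryAddress() to a native write function here ...
    }
  }
}
```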

@JkSelf JkSelf changed the title [ARROW-17752] [C++][JAVA][Parquet] Support parquet write from scanner to file [ARROW-11776] [C++][JAVA][Parquet] Support parquet write from scanner to file Sep 19, 2022
@JkSelf (Author) commented Sep 23, 2022

@zhztheplayer @lidavidm
Sorry for the delayed response. I have resolved the comments. Please help review again. Thanks.

Member

Same question: why are we manually bridging data to C++, and why do we have CRecordBatchIterator at all, when we could use ArrowArrayStream instead? This should just take an ArrowArrayStream as the data source.

Author

We already use ArrowArrayStream to share data with native code. Here CRecordBatchIterator is used to iterate over the ArrowArrayStream object. I have renamed CRecordBatchIterator to CArrowArrayStreamIterator. Sorry for the misleading name.

Member

No, what I mean is that you do not need the iterator at all. Implement the ArrowReader API for Scanner or whatever the Java-side data source is, then export that once and import it once on the C++ side. The JNI side should not need to know about any new Java-side classes or methods.
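A skeleton of what that suggestion amounts to; the Scanner method names below are assumptions based on the dataset module, and the class the PR eventually adds is ArrowScannerReader:

```java
import java.io.IOException;

import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;

// Skeleton of the suggested shape: an ArrowReader facade over the Java-side
// Scanner, exported once to native code. The scanner-handling details are
// placeholders; the real class added by this PR is ArrowScannerReader.
public class ScannerBackedReader extends ArrowReader {
  private final Scanner scanner;
  private ArrowReader currentTaskReader; // reader for the scan task in progress

  public ScannerBackedReader(Scanner scanner, BufferAllocator allocator) {
    super(allocator);
    this.scanner = scanner;
  }

  @Override
  public boolean loadNextBatch() throws IOException {
    // Walk the scanner's tasks, delegating to each task's reader and loading
    // its batch into this reader's VectorSchemaRoot (details omitted here).
    return false;
  }

  @Override
  public long bytesRead() {
    return 0; // byte counting is not meaningful for a scanner-backed source
  }

  @Override
  protected void closeReadSource() throws IOException {
    try {
      if (currentTaskReader != null) {
        currentTaskReader.close();
      }
      scanner.close();
    } catch (Exception e) {
      throw new IOException(e);
    }
  }

  @Override
  protected Schema readSchema() throws IOException {
    return scanner.schema();
  }
}
```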

Author

Thanks for your detailed explanation. I have removed the iterator and now use the ArrowArrayStream as the data source.

@lidavidm (Member)

CC @lwhite1 @davisusanibar

@lidavidm lidavidm changed the title [ARROW-11776] [C++][JAVA][Parquet] Support parquet write from scanner to file ARROW-11776: [C++][Java] Support parquet write from scanner to file Sep 30, 2022
@github-actions

⚠️ Ticket has no components in JIRA, make sure you assign one.

Comment on lines 54 to 64
Member

I don't think this is right. We should implement ArrowReader over Scanner, then write the entire dataset in one go. And then this method can be used to write any ArrowReader, not just Scanners.

Author

Added ArrowScannerReader over Scanner.
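Roughly how the result is meant to be used, as a sketch; the DatasetFileWriter overload and the ArrowScannerReader argument order here are assumptions and may differ from the merged code:

```java
import org.apache.arrow.dataset.file.DatasetFileWriter;
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.scanner.ArrowScannerReader;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.memory.BufferAllocator;

class WriteExample {
  // Sketch: wrap a Scanner in the new ArrowScannerReader, then hand it to
  // DatasetFileWriter like any other ArrowReader to write Parquet files.
  static void writeScannerToParquet(Scanner scanner, BufferAllocator allocator, String uri)
      throws Exception {
    try (ArrowScannerReader reader = new ArrowScannerReader(scanner, allocator)) {
      DatasetFileWriter.write(allocator, reader, FileFormat.PARQUET, uri);
    }
  }
}
```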

@JkSelf JkSelf changed the title ARROW-11776: [C++][Java] Support parquet write from scanner to file ARROW-11776: [C++][Java] Support parquet write from ArrowReader to file Oct 19, 2022
@JkSelf JkSelf force-pushed the parquet_write_from_datasets branch from 85ffbd5 to 7578bca on October 20, 2022 13:48
Member

Is there an issue if we potentially double-close these?

Author

If it is already closed, the close method returns immediately and performs no further operation.
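In other words, the close is guarded so a second call is a no-op; schematically (the class name is illustrative, not the PR's code):

```java
class NativeResource implements AutoCloseable {
  private boolean closed = false;

  // Sketch of the idempotent-close pattern: the guard flag makes a second
  // close() return immediately instead of releasing resources twice.
  @Override
  public void close() {
    if (closed) {
      return;
    }
    closed = true;
    // ... release the underlying native handle exactly once ...
  }
}
```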

@lidavidm (Member)

There are lint errors: https://github.com/apache/arrow/actions/runs/3290191393/jobs/5426029066#step:6:7990


[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:28: Wrong order for 'java.io.IOException' import. [ImportOrder]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:31: Missing a Javadoc comment. [JavadocType]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:39:3: Missing a Javadoc comment. [JavadocMethod]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:61: 'if' construct must use '{}'s. [NeedBraces]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:86:5: WhitespaceAround: 'try' is not followed by whitespace. Empty blocks may only be represented as {} when not part of a multi-block statement (4.1.3) [WhitespaceAround]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/file/JniWrapper.java:60:50: Parameter name 'stream_address' must match pattern '^[a-z][a-zA-Z0-9]*$'. [ParameterName]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:20:8: Unused import: java.io.ByteArrayOutputStream. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:22:8: Unused import: java.io.IOException. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:23:8: Unused import: java.nio.channels.Channels. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:38:8: Unused import: org.apache.arrow.flatbuf.RecordBatch. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:44:8: Unused import: org.apache.arrow.vector.ipc.WriteChannel. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:46:8: Unused import: org.apache.arrow.vector.ipc.message.MessageSerializer. [UnusedImports]
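For anyone hitting the same checkstyle rules, the fixes are mechanical; an illustrative example (not the PR's actual file, class and method names are made up):

```java
// ImportOrder: java.* imports come first, in their own group, before org.* imports.
import java.io.IOException;

import org.apache.arrow.vector.ipc.ArrowReader;

/** JavadocType: public classes need a Javadoc comment. */
public class LintExample {
  private ArrowReader currentReader;

  /** JavadocMethod: public methods need a Javadoc comment too. */
  public boolean hasMoreBatches() throws IOException {
    // NeedBraces: even single-statement if bodies get braces.
    if (currentReader == null) {
      return false;
    }
    // ParameterName / UnusedImports: use camelCase parameter names
    // (streamAddress, not stream_address) and delete imports that are unused.
    return currentReader.loadNextBatch();
  }
}
```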

@JkSelf (Author) commented Oct 26, 2022

@lidavidm
Do you have any further comments?

@lidavidm (Member)

Can you rebase to see if the CI issue is fixed?

@JkSelf JkSelf force-pushed the parquet_write_from_datasets branch from 7988012 to 5f86c71 on October 27, 2022 02:04
@JkSelf (Author) commented Oct 27, 2022

> Can you rebase to see if the CI issue is fixed?

Rebased.

@lidavidm (Member)

There's still a lint error here:

Warning:  src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:[43,3] (javadoc) JavadocMethod: Missing a Javadoc comment.

@lidavidm (Member) left a comment

Thank you!

@lidavidm lidavidm merged commit dddf38f into apache:master Oct 28, 2022
@ursabot commented Oct 29, 2022

Benchmark runs are scheduled for baseline = a2881a1 and contender = dddf38f. dddf38f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️25.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.56% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.11% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] dddf38f5 ec2-t3-xlarge-us-east-2
[Failed] dddf38f5 test-mac-arm
[Finished] dddf38f5 ursa-i9-9960x
[Finished] dddf38f5 ursa-thinkcentre-m75q
[Finished] a2881a12 ec2-t3-xlarge-us-east-2
[Failed] a2881a12 test-mac-arm
[Finished] a2881a12 ursa-i9-9960x
[Finished] a2881a12 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot commented Oct 29, 2022

['Python', 'R'] benchmarks have a high level of regressions on test-mac-arm.

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…le (apache#14151)

This PR aims to support Parquet write from ArrowReader to file.

Authored-by: Jia Ke <ke.a.jia@intel.com>
Signed-off-by: David Li <li.davidm96@gmail.com>