-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-11776: [C++][Java] Support parquet write from ArrowReader to file #14151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
|
@JkSelf Can we just use the existing ticket ARROW-11776? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to conduct this kind of encoding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add some unit tests.
You can refer to the previous PR https://github.com/apache/arrow/pull/10201/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it needed to create this class?
IteratorImpl seems to already handle everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be we can rename this as CRecordBatchIterator or something to make things more clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be some better solution to distinguish on the 2 variables than just using iter and itr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arrowArray may need to be eventually closed or memory may leak
|
Why isn't this just using the C Data Interface instead of doing things like serializing schemas and manually adapting iterators? |
|
In particular you can export a stream now and that will arrive in C++ as a RecordBatchReader |
|
@zhztheplayer @lidavidm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question - why are we manually bridging data to C++, why do we have CRecordBatchIterator at all, when we could use ArrowArrayStream instead? This should just take an ArrowArrayStream as the data source
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have already used the ArrowArrayStream sharing data to native. Here the CRecordBatchIterator is to iterate the ArrowArrayStream object. I have changed the CRecordBatchIterator name to CArrowArrayStreamIterator. Sorry for the wrong naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, what I mean is that you do not need the iterator at all. Implement the ArrowReader API for Scanner or whatever the Java-side data source is, then export that once and import it once on the C++ side. The JNI side should not need to know about any new Java-side classes or methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your detailed explanation. Already removed the iterator and use the ArrowArrayStream as the data source .
|
|
java/dataset/src/main/java/org/apache/arrow/dataset/file/JniWrapper.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is right. We should implement ArrowReader over Scanner, then write the entire dataset in one go. And then this method can be used to write any ArrowReader, not just Scanners.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added ArrowScannerReader over Scanner.
java/dataset/src/main/java/org/apache/arrow/dataset/file/DatasetFileWriter.java
Outdated
Show resolved
Hide resolved
java/dataset/src/main/java/org/apache/arrow/dataset/file/DatasetFileWriter.java
Outdated
Show resolved
Hide resolved
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java
Outdated
Show resolved
Hide resolved
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java
Outdated
Show resolved
Hide resolved
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java
Outdated
Show resolved
Hide resolved
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java
Outdated
Show resolved
Hide resolved
java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java
Outdated
Show resolved
Hide resolved
java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java
Outdated
Show resolved
Hide resolved
java/dataset/src/main/java/org/apache/arrow/dataset/file/DatasetFileWriter.java
Outdated
Show resolved
Hide resolved
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java
Outdated
Show resolved
Hide resolved
java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java
Outdated
Show resolved
Hide resolved
java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java
Outdated
Show resolved
Hide resolved
java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java
Outdated
Show resolved
Hide resolved
85ffbd5 to
7578bca
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an issue if we potentially double-close these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If already closed, the close method will directly return and no further operator.
java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java
Outdated
Show resolved
Hide resolved
|
There are lint errors https://github.com/apache/arrow/actions/runs/3290191393/jobs/5426029066#step:6:7990 |
|
@lidavidm |
|
Can you rebase to see if the CI issue is fixed? |
7988012 to
5f86c71
Compare
Rebased. |
|
There's still a lint error here |
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
|
Benchmark runs are scheduled for baseline = a2881a1 and contender = dddf38f. dddf38f is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
['Python', 'R'] benchmarks have high level of regressions. |
…le (apache#14151) This PR is aim to support parquet write from ArrowReader to file. Authored-by: Jia Ke <ke.a.jia@intel.com> Signed-off-by: David Li <li.davidm96@gmail.com>
This PR is aim to support parquet write from ArrowReader to file.