ARROW-6720: [C++] Add Parquet Reader/Writer Adapter for JNI Bridge #5719
My original idea was to follow how the ORC JNI code is organized: put only jni_wrapper under jni/orc, and put the adapter code under arrow/adapters/orc.
I'm not sure I understand. I think arrow_jni should rely directly on the parquet shared library. The reason there is adapter code for ORC is that we depend on an external library. In the case of Parquet, there is already code to read Arrow data directly (cpp/src/parquet/arrow/).
Hi @emkornfield, yes, I agree with you, so I moved the adapter here under jni/parquet. The reason for adding this adapter instead of calling parquet/reader and parquet/writer directly from jni_wrapper is that we still need an instance to store connections, file handles, etc.
Codecov Report
@@ Coverage Diff @@
## master #5719 +/- ##
=========================================
Coverage ? 89.4%
=========================================
Files ? 809
Lines ? 120245
Branches ? 0
=========================================
Hits ? 107502
Misses ? 12743
Partials ? 0
@xuechendi could you clarify which PRs need to be re-reviewed? Would you also mind creating sub-tasks for this work, so each PR can correspond to one JIRA item (it makes change log tracking easier)? Thanks for your patience going through the reviews.
This PR contains all the remaining work for adding the Parquet Reader/Writer Java APIs. It needs to be rebased onto #5820, which may require code modifications, so can I ping you once my rebase is done? I am occupied with other work and will start on this by the end of this week. I was planning to split this PR, but since it adds both the JNI and Java parts, I can't find a proper angle to split it apart.
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Java classes submitted in this commit are used by jni_wrapper to construct recordBatchBuilder. Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Includes two parts: common code such as ArrowRecordBatchBuilder and AdaptorReferenceManager will be put in common; Parquet-related code will be put in Parquet. Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Hi @emkornfield, I just rebased this PR onto #5820, and it passed the CI checks.
|
/// \param[in] column_indices indexes of columns expected to be read
/// \param[in] start_pos start position of row_groups expected to be read
/// \param[in] end_pos end position of row_groups expected to be read
Status InitRecordBatchReader(const std::vector<int>& column_indices, int64_t start_pos,
I'm not sure I understand the use case for this method. Why isn't working with row group indices sufficient?
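To make the question above concrete, here is a minimal sketch of how a caller could resolve a position range to row group indices up front, so that only the index-based overload would be needed. It assumes that start_pos and end_pos are byte offsets and that a row group belongs to the range when its start offset falls inside it; neither assumption is confirmed by the PR. `RowGroupInfo` and `RowGroupsInRange` are hypothetical names, not Parquet or Arrow APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for per-row-group metadata; not a Parquet type.
struct RowGroupInfo {
  int64_t file_offset;  // byte offset at which the row group starts
};

// Returns the indices of all row groups whose start offset lies in
// [start_pos, end_pos). The membership rule is an assumption.
std::vector<int> RowGroupsInRange(const std::vector<RowGroupInfo>& row_groups,
                                  int64_t start_pos, int64_t end_pos) {
  assert(start_pos <= end_pos);
  std::vector<int> indices;
  for (int i = 0; i < static_cast<int>(row_groups.size()); ++i) {
    const int64_t offset = row_groups[i].file_offset;
    if (offset >= start_pos && offset < end_pos) {
      indices.push_back(i);
    }
  }
  return indices;
}
```

The resulting vector could then be passed as row_group_indices to the index-based InitRecordBatchReader overload.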
///
/// \param[in] column_indices indexes of columns expected to be read
/// \param[in] row_group_indices indexes of row_groups expected to be read
Status InitRecordBatchReader(const std::vector<int>& column_indices,
It's a more common pattern to hide these implementations as private and have Create methods that do the Open and initialization together. Is there a reason they need to be exposed?
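The pattern described in this comment can be sketched roughly as follows. This is an illustration only: `ReaderAdapter`, its members, and the placeholder bodies of `Open` and `Init` are hypothetical and do not reflect the PR's actual classes or any real file I/O.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical adapter; the two-step Open/Init sequence is hidden behind Create.
class ReaderAdapter {
 public:
  // The only way to obtain an instance: opens and initializes in one step.
  // Returns nullptr if either step fails.
  static std::unique_ptr<ReaderAdapter> Create(const std::string& path,
                                               std::vector<int> column_indices,
                                               std::vector<int> row_group_indices) {
    std::unique_ptr<ReaderAdapter> reader(new ReaderAdapter());
    if (!reader->Open(path)) return nullptr;
    if (!reader->Init(std::move(column_indices), std::move(row_group_indices))) {
      return nullptr;
    }
    return reader;
  }

  bool initialized() const { return initialized_; }
  const std::vector<int>& column_indices() const { return column_indices_; }

 private:
  ReaderAdapter() = default;

  // Placeholder: a real implementation would open the file here.
  bool Open(const std::string& path) {
    path_ = path;
    return !path_.empty();
  }

  // Placeholder: a real implementation would build the record batch reader here.
  bool Init(std::vector<int> column_indices, std::vector<int> row_group_indices) {
    assert(!path_.empty());  // Open must have succeeded first
    column_indices_ = std::move(column_indices);
    row_group_indices_ = std::move(row_group_indices);
    initialized_ = true;
    return true;
  }

  std::string path_;
  std::vector<int> column_indices_;
  std::vector<int> row_group_indices_;
  bool initialized_ = false;
};
```

With this shape, callers can never observe a reader that has been opened but not initialized, which is the main benefit of keeping the two steps private.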
/// \param[in] pool a MemoryPool to use for buffer allocations
/// \param[out] reader the returned reader object
/// \return Status
static Status Open(std::shared_ptr<RandomAccessFile>& file, MemoryPool* pool,
It would be better to return Result<std::unique_ptr<...>> instead of using the out parameter, unless there is a good reason not to.
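As a rough illustration of this suggestion, the sketch below returns the reader inside a result object rather than through an out parameter. To stay self-contained it uses a deliberately tiny, hypothetical `Result<T>` in place of Arrow's own Result type, and `Reader` and `OpenReader` are likewise made-up names.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Minimal hypothetical stand-in for a value-or-error type; this is NOT Arrow's
// implementation, only enough to show the calling convention.
template <typename T>
class Result {
 public:
  Result(T value) : ok_(true), value_(std::move(value)) {}

  static Result Error(std::string message) {
    Result r;
    r.error_ = std::move(message);
    return r;
  }

  bool ok() const { return ok_; }
  T& value() {
    assert(ok_);  // accessing the value of a failed result is a bug
    return value_;
  }
  const std::string& error() const { return error_; }

 private:
  Result() : ok_(false) {}

  bool ok_;
  T value_;
  std::string error_;
};

// Hypothetical reader type used only for this example.
struct Reader {
  std::string path;
};

// The reader is the return value; failure travels in the same object.
Result<std::unique_ptr<Reader>> OpenReader(const std::string& path) {
  if (path.empty()) {
    return Result<std::unique_ptr<Reader>>::Error("empty path");
  }
  std::unique_ptr<Reader> reader(new Reader{path});
  return Result<std::unique_ptr<Reader>>(std::move(reader));
}
```

Compared with an out parameter, the caller cannot forget to check for failure before touching the reader, and ownership of the returned object is explicit in the signature.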
/// \param[in] column_indices indexes of columns expected to be read
/// \param[in] row_group_indices indexes of row_groups expected to be read
Status InitRecordBatchReader(const std::vector<int>& column_indices,
                             const std::vector<int>& row_group_indices);
Is there something in the dataset API that provides equivalent functionality to this?
|
Left some high-level comments. Will try to take a closer look within a week or so. Also, #5973 might allow for simplification of the Java layer.
|
Is there interest in rebasing and addressing the comments? |
|
Closing due to inactivity. I think the write path is still of interest, but with the integration of datasets the read path is probably less relevant?
|
Hi, I'd be excited to see write support for Parquet with a Java wrapper. @wesm, @emkornfield Is there a chance to have this PR merged? The comments were mostly about visibility of methods and code style but no blockers from what I can see. Or was the review not completed back then? |
|
@xuechendi can probably comment. The work wrapping the Datasets API probably seems more promising from a maintainability standpoint?
|
@dgloeckner I agree with Wes. I think the wrapping of dataset for reading will hopefully get merged first. For writing, we might be able to resurrect some of this code. I would need to reread, but I'm not sure I actually code-reviewed this in depth (mostly high-level design/style issues).
Hi @emkornfield and @wesm, sorry for my late response. I thought this PR was abandoned long ago. Yes, dataset read is more promising and makes sense. Thanks.
We previously worked on JIRA ARROW-6720, PR #5522, to add a Parquet Reader/Writer Java API to Arrow.
Following previous comments, I split that PR into three smaller ones.
For this PR, I added a Parquet adapter under arrow/adapters/parquet that wraps the existing Parquet code and provides an adapter for the JNI wrapper (which I will submit in another PR). This PR also uses the filesystemfactory submitted in PR #5717 to open different types of filesystems.
Signed-off-by: Chendi Xue chendi.xue@intel.com