ARROW-6720: [C++] Add Parquet Reader/Writer Adapter for JNI Bridge #5719

xuechendi · 2019-10-23T15:00:22Z

We previously worked on JIRA ARROW-6720 (PR #5522) to add a Parquet Reader/Writer Java API to Arrow.
Following the earlier review comments, I split that PR into three smaller ones.

For this PR, I added a Parquet adapter under arrow/adapters/parquet that wraps the existing Parquet code and provides an adapter for the JNI wrapper (which I will submit in another PR). This PR also uses the FileSystemFactory submitted in PR #5717 to open different types of filesystems.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

github-actions · 2019-10-23T15:07:42Z

https://issues.apache.org/jira/browse/ARROW-6720

xuechendi · 2019-10-23T15:26:05Z

My original idea was to follow what the ORC JNI code did: put only jni_wrapper under jni/orc, and put the adapter code under arrow/adapters/orc.
But I noticed that if I put the Parquet adapter code under the arrow folder, there would be a circular dependency between parquet_shared and arrow_shared (parquet_shared needs to dynamically link to arrow_shared, and arrow/adapters/parquet depends on parquet_shared).
So I will move the current code under arrow/adapters/parquet to jni/parquet, as PR #5522 did.

emkornfield · 2019-10-24T05:26:09Z

> My original idea was to follow what the ORC JNI code did: put only jni_wrapper under jni/orc, and put the adapter code under arrow/adapters/orc.

I'm not sure I understand. I think arrow_jni should rely directly on the parquet shared library. The reason there is adapter code for ORC is that we depend on an external library. In the case of Parquet, there is already code to read Arrow data directly (cpp/src/parquet/arrow/).

xuechendi · 2019-10-24T08:09:51Z

> My original idea was to follow what the ORC JNI code did: put only jni_wrapper under jni/orc, and put the adapter code under arrow/adapters/orc.

> I'm not sure I understand. I think arrow_jni should rely directly on the parquet shared library. The reason there is adapter code for ORC is that we depend on an external library. In the case of Parquet, there is already code to read Arrow data directly (cpp/src/parquet/arrow/).

Hi @emkornfield, yes, I agree with you, so I moved the adapter here under jni/parquet. The reason for adding this adapter, instead of calling parquet/reader and parquet/writer directly from jni_wrapper, is that we still need an instance to hold the connections, file handles, etc.
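
To illustrate the point about needing an instance that holds state between JNI calls, here is a minimal sketch of one way such state could be kept on the native side. It assumes the Java side only passes an integer handle back and forth; `ReaderStateSketch` and `HandleRegistrySketch` are hypothetical names, not the classes in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical per-reader state that must outlive a single native call.
struct ReaderStateSketch {
  std::string path;
  int64_t rows_read = 0;
};

// Hypothetical registry: maps an integer handle to native state, so that
// each native call can look up the instance it belongs to.
class HandleRegistrySketch {
 public:
  // Stores the state and returns a fresh handle for it.
  int64_t Insert(std::shared_ptr<ReaderStateSketch> state) {
    std::lock_guard<std::mutex> lock(mutex_);
    const int64_t id = next_id_++;
    states_[id] = std::move(state);
    return id;
  }

  // Returns the state for a handle, or nullptr if the handle is unknown.
  std::shared_ptr<ReaderStateSketch> Lookup(int64_t id) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = states_.find(id);
    if (it == states_.end()) return nullptr;
    return it->second;
  }

  // Drops the state for a handle; returns false if it was not present.
  bool Erase(int64_t id) {
    std::lock_guard<std::mutex> lock(mutex_);
    return states_.erase(id) > 0;
  }

 private:
  std::mutex mutex_;
  int64_t next_id_ = 1;
  std::unordered_map<int64_t, std::shared_ptr<ReaderStateSketch>> states_;
};
```

In this sketch, an open call would `Insert` the state and return the handle, later calls would `Lookup` it, and a close call would `Erase` it.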

codecov-io · 2019-10-24T12:33:09Z

Codecov Report

❗ No coverage uploaded for pull request base (master@ffdf35d).
The diff coverage is 55.5%.

@@            Coverage Diff            @@
##             master    #5719   +/-   ##
=========================================
  Coverage          ?    89.4%           
=========================================
  Files             ?      809           
  Lines             ?   120245           
  Branches          ?        0           
=========================================
  Hits              ?   107502           
  Misses            ?    12743           
  Partials          ?        0

Impacted Files	Coverage Δ
cpp/src/arrow/filesystem/filesystem_utils.h	`100% <100%> (ø)`
cpp/src/arrow/filesystem/hdfs.h	`100% <100%> (ø)`
cpp/src/arrow/filesystem/hdfs.cc	`22.06% <22.06%> (ø)`
cpp/src/arrow/filesystem/hdfs_test.cc	`48.57% <48.57%> (ø)`
cpp/src/arrow/filesystem/filesystem_utils.cc	`72.97% <72.97%> (ø)`
cpp/src/arrow/filesystem/filesystem_utils_test.cc	`98.3% <98.3%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ffdf35d...8b8b57a.

emkornfield · 2019-11-27T08:15:05Z

@xuechendi could you clarify which PRs need to be re-reviewed? Would you also mind creating sub-tasks for this work, so each PR can correspond to one JIRA item (it makes changelog tracking easier)?

Thanks for your patience going through the reviews.

xuechendi · 2019-11-27T08:22:59Z

> @xuechendi could you clarify which PRs need to be re-reviewed? Would you also mind creating sub-tasks for this work, so each PR can correspond to one JIRA item (it makes changelog tracking easier)?
>
> Thanks for your patience going through the reviews.

This PR contains all the remaining work for adding the Parquet Reader/Writer Java APIs. It needs to be rebased onto #5820, which may require code modifications, so can I ping you once my rebase is done? I am occupied with other work and will start on this by the end of this week. I was planning to split this PR, but since it adds both the JNI and Java parts, I can't find a proper angle to split it apart.

xuechendi · 2019-12-04T07:52:01Z

Hi @emkornfield, I just rebased this PR onto #5820, and it passed the CI checks.
This PR adds a Parquet reader/writer Java API to Arrow.

The first commit adds a jni_wrapper for the Parquet reader and writer. The adapter is a small wrapper around the existing Parquet code, so the jni_wrapper file won't look too heavy.
The second commit adds some Java objects that jni_wrapper will use to pass RecordBatch buffers_ref to Java.
The third commit contains the Parquet Java APIs; unit tests are also included.

emkornfield · 2019-12-12T08:36:56Z

cpp/src/jni/parquet/adapter.h

+  /// \param[in] column_indices indexes of columns expected to be read
+  /// \param[in] start_pos start position of row_groups expected to be read
+  /// \param[in] end_pos end position of row_groups expected to be read
+  Status InitRecordBatchReader(const std::vector<int>& column_indices, int64_t start_pos,


I'm not sure I understand the use case for this method. Why isn't working with row group indices sufficient?

emkornfield · 2019-12-12T08:38:41Z

cpp/src/jni/parquet/adapter.h

+  ///
+  /// \param[in] column_indices indexes of columns expected to be read
+  /// \param[in] row_group_indices indexes of row_groups expected to be read
+  Status InitRecordBatchReader(const std::vector<int>& column_indices,


It's a more common pattern to hide these implementations as private, and to have Create methods that do the Open and Initialization together. Is there a reason they need to be exposed?
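
A minimal sketch of that pattern, for illustration only: the constructor, `Open`, and `InitRecordBatchReader` are private, and a static `Create` performs both steps. `ParquetReaderSketch` and its members are hypothetical stand-ins rather than the PR's code, and failure is reduced to a null return to keep the example short.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the adapter in this PR; illustrative only.
class ParquetReaderSketch {
 public:
  // The only public way to get a reader: open and initialize in one step.
  static std::unique_ptr<ParquetReaderSketch> Create(
      const std::string& path, const std::vector<int>& column_indices,
      const std::vector<int>& row_group_indices) {
    // make_unique cannot reach the private constructor, so use new here.
    std::unique_ptr<ParquetReaderSketch> reader(new ParquetReaderSketch());
    if (!reader->Open(path)) return nullptr;
    if (!reader->InitRecordBatchReader(column_indices, row_group_indices)) {
      return nullptr;
    }
    return reader;
  }

  bool ready() const { return opened_ && initialized_; }
  const std::vector<int>& columns() const { return column_indices_; }

 private:
  ParquetReaderSketch() = default;

  // Placeholder for opening the file; it only validates the path here.
  bool Open(const std::string& path) {
    if (path.empty()) return false;
    path_ = path;
    opened_ = true;
    return true;
  }

  // Placeholder for building the record batch reader from the selections.
  bool InitRecordBatchReader(const std::vector<int>& column_indices,
                             const std::vector<int>& row_group_indices) {
    if (!opened_ || column_indices.empty()) return false;
    column_indices_ = column_indices;
    row_group_indices_ = row_group_indices;
    initialized_ = true;
    return true;
  }

  std::string path_;
  std::vector<int> column_indices_;
  std::vector<int> row_group_indices_;
  bool opened_ = false;
  bool initialized_ = false;
};
```

With this shape, a caller can never hold a reader that is opened but not initialized: `ParquetReaderSketch::Create("file.parquet", {0, 2}, {1})` either returns a ready reader or nothing.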

emkornfield · 2019-12-12T08:39:36Z

cpp/src/jni/parquet/adapter.h

+  /// \param[in] pool a MemoryPool to use for buffer allocations
+  /// \param[out] reader the returned reader object
+  /// \return Status
+  static Status Open(std::shared_ptr<RandomAccessFile>& file, MemoryPool* pool,


It would be better to use Result<std::unique_ptr<...>> instead of the out parameter unless there is a good reason not to.
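
For illustration, a rough sketch of the Result-returning shape. To stay self-contained it uses a tiny stand-in, `ResultSketch`, instead of the real `arrow::Result`; the stand-in's methods, `FileSketch`, and `ReaderSketch` are all hypothetical and only mimic the idea of returning either a value or an error.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Minimal stand-in for a Result type: holds either a value or an error text.
template <typename T>
class ResultSketch {
 public:
  ResultSketch(T value) : value_(std::move(value)), ok_(true) {}

  static ResultSketch Error(std::string message) {
    ResultSketch r;
    r.message_ = std::move(message);
    return r;
  }

  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }
  // Hands the value to the caller; only meaningful when ok() is true.
  T MoveValue() { return std::move(value_); }

 private:
  ResultSketch() = default;
  T value_{};
  bool ok_ = false;
  std::string message_;
};

// Hypothetical stand-in for RandomAccessFile.
struct FileSketch {
  std::string path;
};

class ReaderSketch {
 public:
  // The reader is the return value, so there is no out parameter to fill.
  static ResultSketch<std::unique_ptr<ReaderSketch>> Open(
      std::shared_ptr<FileSketch> file) {
    if (file == nullptr) {
      return ResultSketch<std::unique_ptr<ReaderSketch>>::Error("file is null");
    }
    return std::unique_ptr<ReaderSketch>(new ReaderSketch(std::move(file)));
  }

  const std::string& path() const { return file_->path; }

 private:
  explicit ReaderSketch(std::shared_ptr<FileSketch> file)
      : file_(std::move(file)) {}
  std::shared_ptr<FileSketch> file_;
};
```

The caller checks `ok()` and then takes ownership of the reader, so success and failure travel through the same return value.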

emkornfield · 2019-12-12T08:40:13Z

cpp/src/jni/parquet/adapter.h

+  /// \param[in] column_indices indexes of columns expected to be read
+  /// \param[in] row_group_indices indexes of row_groups expected to be read
+  Status InitRecordBatchReader(const std::vector<int>& column_indices,
+                               const std::vector<int>& row_group_indices);


Is there something in the dataset API that provides equivalent functionality to this?

emkornfield · 2019-12-12T08:44:15Z

Left some high-level comments. Will try to take a closer look within a week or so. Also, #5973 might allow for simplification of the Java layer.

wesm · 2020-04-01T22:19:11Z

Is there interest in rebasing and addressing the comments?

emkornfield · 2020-05-15T04:44:20Z

Closing due to inactivity. I think the write path is still of interest, but with the integration of datasets the read path is probably less relevant?

dgloeckner · 2020-07-12T17:47:39Z

Hi,

I'd be excited to see write support for Parquet with a Java wrapper. @wesm, @emkornfield Is there a chance to have this PR merged? The comments were mostly about visibility of methods and code style but no blockers from what I can see. Or was the review not completed back then?

wesm · 2020-07-12T22:12:39Z

@xuechendi can probably comment, but the work wrapping the Datasets API seems more promising from a maintainability standpoint?

emkornfield · 2020-07-18T21:14:13Z

@dgloeckner I agree with Wes. I think the wrapping of dataset for reading will hopefully get merged first. For writing, we might be able to resurrect some of this code. I would need to reread it, but I'm not sure I actually reviewed this code in depth (mostly high-level design/style issues).

xuechendi · 2020-07-20T00:52:46Z

> @dgloeckner I agree with Wes. I think the wrapping of dataset for reading will hopefully get merged first. For writing, we might be able to resurrect some of this code. I would need to reread it, but I'm not sure I actually reviewed this code in depth (mostly high-level design/style issues).

Hi @emkornfield and @wesm, sorry for my late response. I thought this PR was abandoned long ago. Yes, the dataset read path is more promising and makes more sense. Thanks.

xuechendi force-pushed the wip_upstream_parquet_cpp branch 2 times, most recently from b8d0f84 to e91deca Compare October 23, 2019 15:04

xuechendi mentioned this pull request Oct 23, 2019

ARROW-6720:[JAVA][C++]Support Parquet Read and Write in Java #5522

Closed

fsaintjacques changed the title ~~ARROW-6720:[C++]Add Parquet Reader/Writer Adapter for JNI Bridge~~ ARROW-6720:[C++] Add Parquet Reader/Writer Adapter for JNI Bridge Oct 23, 2019

xuechendi force-pushed the wip_upstream_parquet_cpp branch from e91deca to ff2773b Compare October 24, 2019 01:06

xuechendi force-pushed the wip_upstream_parquet_cpp branch from ff2773b to 077caf6 Compare October 24, 2019 07:31

xuechendi force-pushed the wip_upstream_parquet_cpp branch from 077caf6 to f8f0e56 Compare October 24, 2019 08:12

fsaintjacques changed the title ~~ARROW-6720:[C++] Add Parquet Reader/Writer Adapter for JNI Bridge~~ ARROW-6720: [C++] Add Parquet Reader/Writer Adapter for JNI Bridge Oct 24, 2019

xuechendi mentioned this pull request Oct 24, 2019

ARROW-6720: [C++] Add FileSystemFactory and HadoopFileSystem under arrow::fs #5717

Closed

xuechendi force-pushed the wip_upstream_parquet_cpp branch 4 times, most recently from 0ea8eec to 8fd4efb Compare October 30, 2019 12:37

xuechendi added 3 commits December 4, 2019 14:41

[C++]Add a parquet reader/writer adapter for jni bridge

be486ac

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

[JAVA]java classes created by jni_wrapper

dd71748

Java classes submitted in this commit are used by jni_wrapper to construct recordBatchBuilder.
Signed-off-by: Chendi Xue <chendi.xue@intel.com>

[JAVA]Parquet Reader Writer Java Codes

8b8b57a

Includes two parts: common code such as ArrowRecordBatchBuilder and AdaptorReferenceManager will be put in common; Parquet-related code will be put in Parquet.
Signed-off-by: Chendi Xue <chendi.xue@intel.com>

xuechendi force-pushed the wip_upstream_parquet_cpp branch from 8fd4efb to 8b8b57a Compare December 4, 2019 07:04

emkornfield reviewed Dec 12, 2019

wesm force-pushed the master branch from 5fe5b88 to aa55967 Compare April 19, 2020 22:47

kszucs force-pushed the master branch from 1b71ca7 to 5093b80 Compare April 20, 2020 19:21

emkornfield closed this May 15, 2020

asfimport mentioned this pull request Nov 26, 2024

[JAVA][C++]Support Parquet Read and Write in Java apache/arrow-java#279

Open
