Skip to content

Conversation

@xuechendi
Copy link

We previously work on JIRA Arrow-6720, PR#5522 to add a Parquet Reader/Writer Java API to Arrow
According to previous comments, I splitted that PR into three small ones.

For this PR, I added a parquet adapter under arrow/adapters/parquet to wrap current parquet and provides adapter for jni wrapper(will submit another PR), this PR also used filesystemfactory submitted in PR#5717 to open different types of filesystem.

Signed-off-by: Chendi Xue chendi.xue@intel.com

@xuechendi xuechendi force-pushed the wip_upstream_parquet_cpp branch 2 times, most recently from b8d0f84 to e91deca Compare October 23, 2019 15:04
@github-actions
Copy link

@xuechendi
Copy link
Author

My original idea is to follow how ORC jni did, only put jni_wrapper under jni/orc, and put an adapter codes under arrow/adapters/orc.
But I noticed that if I put parquet adapter codes under arrow folder, there will be an interdependent issue between parquet_shared and arrow_shared (parquet_shared also need to dynamically link to arrow_shared, and arrow/adapters/parquet depends on parquet_shared).
So I will move current codes under arrow/adapters/parquet to jni/parquet as PR #5522 did.

@fsaintjacques fsaintjacques changed the title ARROW-6720:[C++]Add Parquet Reader/Writer Adapter for JNI Bridge ARROW-6720:[C++] Add Parquet Reader/Writer Adapter for JNI Bridge Oct 23, 2019
@xuechendi xuechendi force-pushed the wip_upstream_parquet_cpp branch from e91deca to ff2773b Compare October 24, 2019 01:06
@emkornfield
Copy link
Contributor

My original idea is to follow how ORC jni did, only put jni_wrapper under jni/orc, and put an adapter codes under arrow/adapters/orc.

I'm not sure I understand. I think arrow_jni should rely directly on the parquet shared library. the reason why there is adapter code for Orc is we depend on an external library. In the case of parquet, there is already code to read arrow data directly (cpp/src/parquet/arrow/)

@xuechendi xuechendi force-pushed the wip_upstream_parquet_cpp branch from ff2773b to 077caf6 Compare October 24, 2019 07:31
@xuechendi
Copy link
Author

xuechendi commented Oct 24, 2019

My original idea is to follow how ORC jni did, only put jni_wrapper under jni/orc, and put an adapter codes under arrow/adapters/orc.

I'm not sure I understand. I think arrow_jni should rely directly on the parquet shared library. the reason why there is adapter code for Orc is we depend on an external library. In the case of parquet, there is already code to read arrow data directly (cpp/src/parquet/arrow/)

Hi, @emkornfield , yes, I agree with you, so I moved adapter here under jni/parquet, the reason of adding this adapter instead of calling parquet/reader parquet/writer directly from jni_wrapper is because we still need an instance to store connections and file handler, etc.

@xuechendi xuechendi force-pushed the wip_upstream_parquet_cpp branch from 077caf6 to f8f0e56 Compare October 24, 2019 08:12
@codecov-io
Copy link

codecov-io commented Oct 24, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@ffdf35d). Click here to learn what that means.
The diff coverage is 55.5%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #5719   +/-   ##
=========================================
  Coverage          ?    89.4%           
=========================================
  Files             ?      809           
  Lines             ?   120245           
  Branches          ?        0           
=========================================
  Hits              ?   107502           
  Misses            ?    12743           
  Partials          ?        0
Impacted Files Coverage Δ
cpp/src/arrow/filesystem/filesystem_utils.h 100% <100%> (ø)
cpp/src/arrow/filesystem/hdfs.h 100% <100%> (ø)
cpp/src/arrow/filesystem/hdfs.cc 22.06% <22.06%> (ø)
cpp/src/arrow/filesystem/hdfs_test.cc 48.57% <48.57%> (ø)
cpp/src/arrow/filesystem/filesystem_utils.cc 72.97% <72.97%> (ø)
cpp/src/arrow/filesystem/filesystem_utils_test.cc 98.3% <98.3%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ffdf35d...8b8b57a. Read the comment docs.

@fsaintjacques fsaintjacques changed the title ARROW-6720:[C++] Add Parquet Reader/Writer Adapter for JNI Bridge ARROW-6720: [C++] Add Parquet Reader/Writer Adapter for JNI Bridge Oct 24, 2019
@xuechendi xuechendi force-pushed the wip_upstream_parquet_cpp branch 4 times, most recently from 0ea8eec to 8fd4efb Compare October 30, 2019 12:37
@emkornfield
Copy link
Contributor

@xuechendi could you clarify which PRs need to be rereviewed? would you also mind creating sub-tasks for this work, so each PR can correspond to one JIRA item (it makes change log tracking easier).

Thanks, for your patience going through the reviews.

@xuechendi
Copy link
Author

@xuechendi could you clarify which PRs need to be rereviewed? would you also mind creating sub-tasks for this work, so each PR can correspond to one JIRA item (it makes change log tracking easier).

Thanks, for your patience going through the reviews.

This PR contains all remain works for adding parquet Reader/Writer Java APIs, and this PR need to do a rebase to #5820 may requires codes modification, so can I ping you once my rebase work done. I am occupied by my other works, will start to work on this by end of this week. And I was planning to split this PR, but I figured this PR is to add jni and java, I can't find a proper angle to split it apart.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Java classes submitted in this commit are used by jni_wrapper to construct recordBatchBuilder.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Includes two parts
common codes such as ArrowRecordBatchBuilder and AdaptorReferenceManager will be put in common
Parquet related codes will be put in Parquet

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the wip_upstream_parquet_cpp branch from 8fd4efb to 8b8b57a Compare December 4, 2019 07:04
@xuechendi
Copy link
Author

hi, @emkornfield , I just rebased this PR to commit #5820, and passed CI checks.
This PR is to add parquet reader/writer java API to Arrow.

  1. first commit is to add a jni_wrapper for parquet reader and writer, adapter is a small wrapper of current parquet codes, so jni_wrapper file won't looks too heavy.
  2. second commit is to some java objects will be used by jni_wrapper to pass RecordBatch buffers_ref to java.
  3. the third commit is the parquet java apis, unittest is also included.

/// \param[in] column_indices indexes of columns expected to be read
/// \param[in] start_pos start position of row_groups expected to be read
/// \param[in] end_pos end position of row_groups expected to be read
Status InitRecordBatchReader(const std::vector<int>& column_indices, int64_t start_pos,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the use-case for this method? Why isn't working with row group indices sufficient?

///
/// \param[in] column_indices indexes of columns expected to be read
/// \param[in] row_group_indices indexes of row_groups expected to be read
Status InitRecordBatchReader(const std::vector<int>& column_indices,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its more common pattern to hide these implementations as private, and have Create methods that do the Open and Initialization together. is there a reason they need to bexposed?

/// \param[in] pool a MemoryPool to use for buffer allocations
/// \param[out] reader the returned reader object
/// \return Status
static Status Open(std::shared_ptr<RandomAccessFile>& file, MemoryPool* pool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be better to use Result<std::unique_ptr<...>> instead of the out parameter unless there is a good reason not to.

/// \param[in] column_indices indexes of columns expected to be read
/// \param[in] row_group_indices indexes of row_groups expected to be read
Status InitRecordBatchReader(const std::vector<int>& column_indices,
const std::vector<int>& row_group_indices);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there something in the dataset api that provides equivilant functionality to this?

@emkornfield
Copy link
Contributor

Left some high level comments. Will try to take a closer look within a week or so. Also #5973 might allow for simplification of the java layer.

@wesm
Copy link
Member

wesm commented Apr 1, 2020

Is there interest in rebasing and addressing the comments?

@emkornfield
Copy link
Contributor

Closing due to inactivity, I think the write path is still of interest but with integration of datasets the read path is probably less relevant?

@dgloeckner
Copy link

Hi,

I'd be excited to see write support for Parquet with a Java wrapper. @wesm, @emkornfield Is there a chance to have this PR merged? The comments were mostly about visibility of methods and code style but no blockers from what I can see. Or was the review not completed back then?

@wesm
Copy link
Member

wesm commented Jul 12, 2020

@xuechendi can probably comment, the work wrapping the Datasets API seems probably more promising? From a maintainability standpoint

@emkornfield
Copy link
Contributor

@dgloeckner I agree with Wes, I think the wrapping of dataset for reading will hopefully get merged first. For writing, we might be able to resurrect some of this, code. I would need to reread, but I'm not sure I actually code reviewed this in depth (mostly high level design/style issues).

@xuechendi
Copy link
Author

@dgloeckner I agree with Wes, I think the wrapping of dataset for reading will hopefully get merged first. For writing, we might be able to resurrect some of this, code. I would need to reread, but I'm not sure I actually code reviewed this in depth (mostly high level design/style issues).

hi, @emkornfield and @wesm , sorry for my late response, I thought this PR was abandoned long ago, yes, dataset read is more promising and making sense. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants