Conversation

@hatemhelal (Contributor) commented Feb 20, 2019

This patch addresses the following JIRAs:

  • ARROW-3769: refactored the record reader logic to toggle between the different builders depending on the column type (String or Binary) and the requested array type (chunked "dense" or dictionary). These changes are covered by unit tests and benchmarks.
  • PARQUET-1537: fixed the increment logic; covered by unit tests.

Also included is an experimental class ArrowReaderProperties that can be used to select which columns are read directly as an arrow::DictionaryArray. I think some more work is needed to fully address the requests in ARROW-3772, namely the ability to automatically infer which columns in a parquet file should be read as DictionaryArray. My current thinking is that this would be solved by introducing optional Arrow type metadata to files written with parquet::arrow::FileWriter. There are some limitations with this approach, but it would seem to satisfy the requests of users working with parquet files within the supported Arrow ecosystem.

Note that with this patch, incremental reading of a parquet file will not resolve the global dictionary across all of the row groups. There are a few possible solutions for this:

  • Introduce a concept of an "unknown" dictionary. This would enable concatenating multiple row groups so long as unknown dictionaries are defined as equal (assuming the indices have the same data type).
  • Add an API for merging the schemas from multiple tables together. This could be used after reading multiple row groups to enable concatenating the tables together into one.
  • Add an API for inferring the global dictionary for the entire file. This could be an expensive operation so ideally would be made optional.
  • Allow a user-specified dictionary. This could be useful in the limited case where a caller already knows the global dictionary list (computed through some other mechanism).
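To make the merging problem concrete, here is a minimal, Arrow-free sketch (all type names are invented for illustration, not Arrow's actual classes) of unifying two dictionary-encoded chunks under one merged dictionary with remapped indices:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Toy dictionary-encoded column chunk: logical value i is dict[indices[i]].
struct DictChunk {
  std::vector<std::string> dict;
  std::vector<int> indices;
};

// Concatenate two chunks under one merged ("global") dictionary,
// remapping both sets of indices into the merged dictionary.
DictChunk Unify(const DictChunk& a, const DictChunk& b) {
  DictChunk out;
  std::unordered_map<std::string, int> lookup;
  auto append = [&](const DictChunk& in) {
    std::vector<int> mapping(in.dict.size());
    for (size_t i = 0; i < in.dict.size(); ++i) {
      auto it = lookup.find(in.dict[i]);
      if (it == lookup.end()) {
        it = lookup.emplace(in.dict[i], static_cast<int>(out.dict.size())).first;
        out.dict.push_back(in.dict[i]);
      }
      mapping[i] = it->second;
    }
    for (int idx : in.indices) out.indices.push_back(mapping[idx]);
  };
  append(a);
  append(b);
  return out;
}
```

Each of the options listed above is a different answer to when (and whether) this unification step runs; Arrow's C++ library eventually grew built-in dictionary unification roughly along these lines.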

@@ -163,4 +172,132 @@ static void BM_DictDecodingInt64_literals(benchmark::State& state) {

BENCHMARK(BM_DictDecodingInt64_literals)->Range(1024, 65536);

std::shared_ptr<::arrow::Array> MakeRandomStringsWithRepeats(size_t num_unique,
@kevingurney (Member) Feb 22, 2019:
Consider adding a comment here which provides an example of what the output of MakeRandomStringsWithRepeats might look like for example values of num_unique and num_values.

Member:

Feel free to disregard this if you don't think it is helpful, but you could consider including alphabet (i.e. a list of letters from which to draw randomly) as one of the input arguments to MakeRandomStringsWithRepeats, rather than using ::arrow::random_ascii inside of the function body. This might help facilitate greater flexibility/re-usability of this function in other contexts. For example, there might be a case in the future where a client might want to generate a random string containing Unicode characters, rather than just ASCII.
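A rough stdlib-only sketch of that suggestion (the real helper returns a std::shared_ptr<::arrow::Array>; the signature here, including the length and seed parameters, is invented for illustration):

```cpp
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Draw num_values strings (with repeats) from a pool of num_unique random
// strings whose characters come from a caller-supplied alphabet.
std::vector<std::string> MakeRandomStringsWithRepeats(
    size_t num_unique, size_t num_values, const std::string& alphabet,
    size_t length = 8, uint32_t seed = 42) {
  std::default_random_engine gen(seed);
  std::uniform_int_distribution<size_t> pick_char(0, alphabet.size() - 1);

  // Build the pool of unique strings.
  std::vector<std::string> pool(num_unique);
  for (auto& s : pool) {
    s.resize(length);
    for (auto& c : s) c = alphabet[pick_char(gen)];
  }

  // Sample from the pool with repeats.
  std::uniform_int_distribution<size_t> pick_string(0, num_unique - 1);
  std::vector<std::string> values(num_values);
  for (auto& v : values) v = pool[pick_string(gen)];
  return values;
}
```

With num_unique much smaller than num_values, the output contains at most num_unique distinct strings repeated many times, which is exactly the shape that exercises dictionary encoding.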

@hatemhelal (Contributor Author):

I like this idea and will take this up in ARROW-4661

@kevingurney (Member):

@hatemhelal I did an initial review of your changes, and overall they look good!

Almost all of my feedback is minor and primarily stylistic in nature. I am new to this area of the code base, so you can take my comments with a grain of salt. Many of them may stem more from my own lack of knowledge in this area, and my attempt to learn more, than from actual issues with the code.

Let me know if you have any questions regarding any of my feedback.

Thanks!

@hatemhelal (Contributor Author):

Thanks @kevingurney and @emkornfield for the code review! Let me know if you think of anything else.

@rdmello (Contributor) left a comment:

I only have some minor comments, I think this looks really good overall!

@@ -941,6 +948,12 @@ class DictByteArrayDecoder : public DictDecoderImpl<ByteArrayType>,
return result;
}

int DecodeArrowNonNull(int num_values, ::arrow::BinaryDictionaryBuilder* out) override {
Contributor:

I might be getting confused by the diff engine on GitHub, but is this the same code as on lines 725-730? If so, is there a common function both these methods could call?

@hatemhelal (Contributor Author):

They are the same but I don't see an easy way to share an implementation. Let me think on this one.

@hatemhelal (Contributor Author):

I refactored this slightly in the latest commit. @rdmello let me know how this looks to you.
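One common way to avoid that kind of duplication (a sketch of the general technique, with toy types; not necessarily what the commit does) is to funnel both public overloads into a single private helper templated on the builder type:

```cpp
#include <string>
#include <vector>

// Two builder types that happen to share an Append interface.
struct VectorBuilder {
  std::vector<std::string> values;
  void Append(const std::string& v) { values.push_back(v); }
};
struct LengthSumBuilder {
  size_t total = 0;
  void Append(const std::string& v) { total += v.size(); }
};

class Decoder {
 public:
  // Both overloads forward to one templated implementation, so the
  // decode loop is written exactly once.
  int DecodeArrowNonNull(int num_values, VectorBuilder* out) {
    return DecodeImpl(num_values, out);
  }
  int DecodeArrowNonNull(int num_values, LengthSumBuilder* out) {
    return DecodeImpl(num_values, out);
  }

 private:
  template <typename Builder>
  int DecodeImpl(int num_values, Builder* out) {
    // Stand-in for the real decode loop; appends a synthetic value per row.
    for (int i = 0; i < num_values; ++i) out->Append("value" + std::to_string(i));
    return num_values;
  }
};
```

This works whenever the two builders expose a compatible Append surface; if they don't, a thin adapter per builder is usually still less code than two copies of the decode loop.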

@hatemhelal (Contributor Author):

The latest commit adds a test case for PARQUET-1537. Just needed to have a null in the input data to reproduce the reported issue.

@wesm (Member) commented Feb 27, 2019:

We've largely stopped using MT in unit tests because it's slower than the alternatives (cc @pitrou)

@pitrou (Member) commented Feb 27, 2019:

What does MT mean in this context?

edit: ah, Mersenne Twister. Yes, you wouldn't believe it, but it made the tests significantly slower.

@hatemhelal (Contributor Author):

Would you recommend swapping back to the default random engine or another engine entirely?

@pitrou (Member) commented Feb 27, 2019:

The default random engine, or anything else that's fast (for example a xorshift-like PRNG). Are you worried about poor quality of the default engine?

@hatemhelal (Contributor Author):

I think the default engine should be good; reproducibility is the main concern for the benchmark. Probably just an overreaction to "implementation defined"...
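For what it's worth, reproducibility is orthogonal to the engine choice: any standard engine is deterministic for a fixed seed within a given standard library (std::default_random_engine's concrete type is implementation-defined, so sequences can differ across platforms, but repeated runs on one platform are identical). A minimal sketch:

```cpp
#include <random>
#include <vector>

// With an explicit seed, repeated runs produce identical draws on the
// same standard library, whichever engine type is used underneath.
std::vector<int> Draw(unsigned seed, int n) {
  std::default_random_engine gen(seed);
  std::uniform_int_distribution<int> dist(0, 99);
  std::vector<int> out(n);
  for (auto& v : out) v = dist(gen);
  return out;
}
```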

@hatemhelal (Contributor Author):

Switched back to std::default_random_engine. I'd like to move this utility into testing/random.h as part of ARROW-4661.

@wesm (Member) commented Feb 27, 2019:

I can review this soon. Is this still WIP?

@hatemhelal hatemhelal marked this pull request as ready for review February 28, 2019 08:18
@hatemhelal (Contributor Author):

> I can review this soon. Is this still WIP?

I think this is ready for review. This PR should resolve:

  • ARROW-3769
  • PARQUET-1537

My plan is to use ARROW-3772 to build on this change and plumb it through to the parquet -> arrow reader.

@hatemhelal (Contributor Author):

The most recent commit adds changes that go towards resolving ARROW-3772:

  • An ArrowReaderProperties object, used by parquet::arrow::FileReader to select whether to use_threads and whether to read a particular column as a DictionaryArray.
  • The read_dictionary flag is plumbed down through the column-reading logic to toggle between constructing a DictionaryArray or a chunked dense array.
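As a rough, Arrow-free illustration of that toggle (toy types only, not Arrow's actual builders): the same decoded value stream is accumulated either densely or as dictionary + indices, selected by a read_dictionary flag.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for the read_dictionary toggle: decoded strings land
// either in a dense vector or in a dictionary + indices pair.
class StringOutput {
 public:
  explicit StringOutput(bool read_dictionary)
      : read_dictionary_(read_dictionary) {}

  void Append(const std::string& v) {
    if (!read_dictionary_) {
      dense_.push_back(v);
      return;
    }
    auto it = memo_.find(v);
    if (it == memo_.end()) {
      it = memo_.emplace(v, static_cast<int>(dictionary_.size())).first;
      dictionary_.push_back(v);
    }
    indices_.push_back(it->second);
  }

  const std::vector<std::string>& dense() const { return dense_; }
  const std::vector<std::string>& dictionary() const { return dictionary_; }
  const std::vector<int>& indices() const { return indices_; }

 private:
  bool read_dictionary_;
  std::vector<std::string> dense_;
  std::vector<std::string> dictionary_;
  std::unordered_map<std::string, int> memo_;
  std::vector<int> indices_;
};
```

In the PR itself the flag lives on ArrowReaderProperties and the two paths build an arrow::DictionaryArray or a chunked dense array respectively; this sketch only shows the shape of the branch.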

@wesm can you let me know if this is heading in the right direction? I'm working on some tests to accompany these changes.

@codecov-io:

Codecov Report

Merging #3721 into master will increase coverage by 0.82%.
The diff coverage is 96.88%.


@@            Coverage Diff             @@
##           master    #3721      +/-   ##
==========================================
+ Coverage   87.81%   88.64%   +0.82%     
==========================================
  Files         727      594     -133     
  Lines       89504    80194    -9310     
  Branches     1252        0    -1252     
==========================================
- Hits        78600    71089    -7511     
+ Misses      10788     9105    -1683     
+ Partials      116        0     -116
Impacted Files Coverage Δ
cpp/src/arrow/testing/random.h 100% <ø> (ø) ⬆️
cpp/src/parquet/arrow/reader.h 100% <100%> (ø)
cpp/src/parquet/encoding.cc 93.84% <100%> (+6.03%) ⬆️
cpp/src/parquet/encoding-test.cc 100% <100%> (ø) ⬆️
cpp/src/parquet/arrow/arrow-reader-writer-test.cc 95.47% <100%> (+0.12%) ⬆️
cpp/src/parquet/encoding.h 98.46% <100%> (+0.63%) ⬆️
cpp/src/parquet/arrow/record_reader.cc 87.97% <90.21%> (-0.6%) ⬇️
cpp/src/arrow/testing/random.cc 97.56% <93.93%> (-2.44%) ⬇️
cpp/src/parquet/arrow/reader.cc 84.15% <97.87%> (+0.48%) ⬆️
cpp/src/plasma/thirdparty/ae/ae.c 70.75% <0%> (-0.95%) ⬇️
... and 139 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9d73e0a...f644fff.

@wesm (Member) left a comment:

+1, will merge this once the build passes. Thanks @hatemhelal for your patience and for addressing my comments

@wesm (Member) commented Mar 18, 2019:

Ah well, the build has passed so merging now

@wesm wesm closed this in fd0b90a Mar 18, 2019
@xhochy (Member) commented Mar 18, 2019:

Huge thanks for tackling this @hatemhelal !

@hatemhelal hatemhelal deleted the arrow-3769 branch July 11, 2019 07:46
kou added a commit that referenced this pull request Dec 7, 2020
As per a TODO left in ARROW-3769 / #3721 we can now use the `GTEST_SKIP` macro in `parquet/encoding-test.cpp`. `GTEST_SKIP` was added in gtest 1.10.0, so this involves bumping our minimal gtest version from 1.8.1 to 1.10.0.

Closes #8782 from arw2019/ARROW-10746-GTEST_SKIP

Lead-authored-by: Andrew Wieteska <andrew.r.wieteska@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>