Conversation

@hatemhelal (Contributor) commented Feb 20, 2019

This patch addresses the following JIRAs:

  • ARROW-3769: refactored the record reader logic to toggle between the different builders depending on the column type (String or Binary) and the requested array type (chunked "dense" or dictionary). These changes are covered by unit tests and benchmarks.
  • PARQUET-1537: fixed the increment logic; covered by unit tests.

Also included is an experimental class ArrowReaderProperties that can be used to select which columns are read directly as an arrow::DictionaryArray. I think some more work is needed to fully address the requests in ARROW-3772, namely the ability to automatically infer which columns in a parquet file should be read as DictionaryArray. My current thinking is that this would be solved by introducing optional Arrow type metadata to files written with parquet::arrow::FileWriter. There are some limitations with this approach, but it would seem to satisfy the requests of users working with parquet files within the supported Arrow ecosystem.

Note that with this patch, incremental reading of a parquet file will not resolve the global dictionary across all of the row groups. There are a few possible solutions for this:

  • Introduce a concept of an "unknown" dictionary. This would enable concatenating multiple row groups so long as unknown dictionaries are defined as equal (assuming the indices have the same data type).
  • Add an API for merging the schemas from multiple tables together. This could be used after reading multiple row groups to enable concatenating the tables together into one.
  • Add an API for inferring the global dictionary for the entire file. This could be an expensive operation so ideally would be made optional.
  • Allow a user-specified dictionary. This could be useful in the limited case where a caller already knows the global dictionary list (computed through some other mechanism).
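To make the merging problem concrete, here is a minimal, Arrow-free sketch (all type names are invented for illustration, not Arrow's actual classes) of unifying two dictionary-encoded chunks under one merged dictionary with remapped indices:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Toy dictionary-encoded column chunk: logical value i is dict[indices[i]].
struct DictChunk {
  std::vector<std::string> dict;
  std::vector<int> indices;
};

// Concatenate two chunks under one merged ("global") dictionary,
// remapping both sets of indices into the merged dictionary.
DictChunk Unify(const DictChunk& a, const DictChunk& b) {
  DictChunk out;
  std::unordered_map<std::string, int> lookup;
  auto append = [&](const DictChunk& in) {
    std::vector<int> mapping(in.dict.size());
    for (size_t i = 0; i < in.dict.size(); ++i) {
      auto it = lookup.find(in.dict[i]);
      if (it == lookup.end()) {
        it = lookup.emplace(in.dict[i], static_cast<int>(out.dict.size())).first;
        out.dict.push_back(in.dict[i]);
      }
      mapping[i] = it->second;
    }
    for (int idx : in.indices) out.indices.push_back(mapping[idx]);
  };
  append(a);
  append(b);
  return out;
}
```

Each of the options listed above is a different answer to when (and whether) this unification step runs; Arrow's C++ library eventually grew built-in dictionary unification roughly along these lines.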

@@ -163,4 +172,132 @@ static void BM_DictDecodingInt64_literals(benchmark::State& state) {

BENCHMARK(BM_DictDecodingInt64_literals)->Range(1024, 65536);

std::shared_ptr<::arrow::Array> MakeRandomStringsWithRepeats(size_t num_unique,
@kevingurney (Member) Feb 22, 2019:
Consider adding a comment here which provides an example of what the output of MakeRandomStringsWithRepeats might look like for example values of num_unique and num_values.

Member:

Feel free to disregard this if you don't think it is helpful, but you could consider including alphabet (i.e. a list of letters from which to draw randomly) as one of the input arguments to MakeRandomStringsWithRepeats, rather than using ::arrow::random_ascii inside of the function body. This might help facilitate greater flexibility/re-usability of this function in other contexts. For example, there might be a case in the future where a client might want to generate a random string containing Unicode characters, rather than just ASCII.
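A rough stdlib-only sketch of that suggestion (the real helper returns a std::shared_ptr<::arrow::Array>; the signature here, including the length and seed parameters, is invented for illustration):

```cpp
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Draw num_values strings (with repeats) from a pool of num_unique random
// strings whose characters come from a caller-supplied alphabet.
std::vector<std::string> MakeRandomStringsWithRepeats(
    size_t num_unique, size_t num_values, const std::string& alphabet,
    size_t length = 8, uint32_t seed = 42) {
  std::default_random_engine gen(seed);
  std::uniform_int_distribution<size_t> pick_char(0, alphabet.size() - 1);

  // Build the pool of unique strings.
  std::vector<std::string> pool(num_unique);
  for (auto& s : pool) {
    s.resize(length);
    for (auto& c : s) c = alphabet[pick_char(gen)];
  }

  // Sample from the pool with repeats.
  std::uniform_int_distribution<size_t> pick_string(0, num_unique - 1);
  std::vector<std::string> values(num_values);
  for (auto& v : values) v = pool[pick_string(gen)];
  return values;
}
```

With num_unique much smaller than num_values, the output contains at most num_unique distinct strings repeated many times, which is exactly the shape that exercises dictionary encoding.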

@hatemhelal (Contributor Author):

I like this idea and will take this up in ARROW-4661

@kevingurney (Member):

@hatemhelal I did an initial review of your changes, and overall they look good!

Almost all of my feedback is minor and primarily stylistic in nature. I am new to this area of the code base, so you can take my comments with a grain of salt. Many of them may stem more from my own lack of knowledge in this area, and my attempt to learn more, than from actual issues with the code.

Let me know if you have any questions regarding any of my feedback.

Thanks!

@hatemhelal (Contributor Author):

Thanks @kevingurney and @emkornfield for the code review! Let me know if you think of anything else.

@rdmello (Contributor) left a comment:

I only have some minor comments, I think this looks really good overall!

@@ -941,6 +948,12 @@ class DictByteArrayDecoder : public DictDecoderImpl<ByteArrayType>,
return result;
}

int DecodeArrowNonNull(int num_values, ::arrow::BinaryDictionaryBuilder* out) override {
Contributor:

I might be getting confused by the diff engine on GitHub, but is this the same code as on lines 725-730? If so, is there a common function both these methods could call?

@hatemhelal (Contributor Author):

They are the same but I don't see an easy way to share an implementation. Let me think on this one.

@hatemhelal (Contributor Author):

I refactored this slightly in the latest commit. @rdmello let me know how this looks to you.
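One common way to avoid that kind of duplication (a sketch of the general technique, with toy types; not necessarily what the commit does) is to funnel both public overloads into a single private helper templated on the builder type:

```cpp
#include <string>
#include <vector>

// Two builder types that happen to share an Append interface.
struct VectorBuilder {
  std::vector<std::string> values;
  void Append(const std::string& v) { values.push_back(v); }
};
struct LengthSumBuilder {
  size_t total = 0;
  void Append(const std::string& v) { total += v.size(); }
};

class Decoder {
 public:
  // Both overloads forward to one templated implementation, so the
  // decode loop is written exactly once.
  int DecodeArrowNonNull(int num_values, VectorBuilder* out) {
    return DecodeImpl(num_values, out);
  }
  int DecodeArrowNonNull(int num_values, LengthSumBuilder* out) {
    return DecodeImpl(num_values, out);
  }

 private:
  template <typename Builder>
  int DecodeImpl(int num_values, Builder* out) {
    // Stand-in for the real decode loop; appends a synthetic value per row.
    for (int i = 0; i < num_values; ++i) out->Append("value" + std::to_string(i));
    return num_values;
  }
};
```

This works whenever the two builders expose a compatible Append surface; if they don't, a thin adapter per builder is usually still less code than two copies of the decode loop.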

@hatemhelal (Contributor Author):

The latest commit adds a test case for PARQUET-1537. Just needed to have a null in the input data to reproduce the reported issue.

@wesm (Member) commented Feb 27, 2019:

We've largely stopped using MT in unit tests because it's slower than the alternatives (cc @pitrou)

@pitrou (Member) commented Feb 27, 2019:

What does MT mean in this context?

edit: ah, Mersenne Twister. Yes, you wouldn't believe it, but it made the tests significantly slower.

@hatemhelal (Contributor Author):

Would you recommend swapping back to the default random engine or another engine entirely?

@pitrou (Member) commented Feb 27, 2019:

The default random engine, or anything else that's fast (for example a xorshift-like PRNG). Are you worried about poor quality of the default engine?

@hatemhelal (Contributor Author):

I think the default engine should be good; reproducibility is the main concern for the benchmark. Probably just an overreaction to "implementation defined"...
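For what it's worth, reproducibility is orthogonal to the engine choice: any standard engine is deterministic for a fixed seed within a given standard library (std::default_random_engine's concrete type is implementation-defined, so sequences can differ across platforms, but repeated runs on one platform are identical). A minimal sketch:

```cpp
#include <random>
#include <vector>

// With an explicit seed, repeated runs produce identical draws on the
// same standard library, whichever engine type is used underneath.
std::vector<int> Draw(unsigned seed, int n) {
  std::default_random_engine gen(seed);
  std::uniform_int_distribution<int> dist(0, 99);
  std::vector<int> out(n);
  for (auto& v : out) v = dist(gen);
  return out;
}
```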

@hatemhelal (Contributor Author):

Switched back to std::default_random_engine. I'd like to move this utility into testing/random.h as part of ARROW-4661.

@wesm (Member) commented Feb 27, 2019:

I can review this soon. Is this still WIP?

@hatemhelal hatemhelal marked this pull request as ready for review February 28, 2019 08:18
@hatemhelal (Contributor Author):

> I can review this soon. Is this still WIP?

I think this is ready for review. This PR should resolve:

  • ARROW-3769
  • PARQUET-1537

My plan is to use ARROW-3772 to build on this change and plumb it through to the parquet -> arrow reader.

@hatemhelal (Contributor Author):

The most recent commit adds changes that go towards resolving ARROW-3772:

  • An ArrowReaderProperties object, used by parquet::arrow::FileReader to select whether to use_threads and whether to read a particular column as a DictionaryArray.
  • The read_dictionary flag is plumbed down through the column-reading logic to toggle between constructing a DictionaryArray or a chunked dense array.
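As a rough, Arrow-free illustration of that toggle (toy types only, not Arrow's actual builders): the same decoded value stream is accumulated either densely or as dictionary + indices, selected by a read_dictionary flag.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for the read_dictionary toggle: decoded strings land
// either in a dense vector or in a dictionary + indices pair.
class StringOutput {
 public:
  explicit StringOutput(bool read_dictionary)
      : read_dictionary_(read_dictionary) {}

  void Append(const std::string& v) {
    if (!read_dictionary_) {
      dense_.push_back(v);
      return;
    }
    auto it = memo_.find(v);
    if (it == memo_.end()) {
      it = memo_.emplace(v, static_cast<int>(dictionary_.size())).first;
      dictionary_.push_back(v);
    }
    indices_.push_back(it->second);
  }

  const std::vector<std::string>& dense() const { return dense_; }
  const std::vector<std::string>& dictionary() const { return dictionary_; }
  const std::vector<int>& indices() const { return indices_; }

 private:
  bool read_dictionary_;
  std::vector<std::string> dense_;
  std::vector<std::string> dictionary_;
  std::unordered_map<std::string, int> memo_;
  std::vector<int> indices_;
};
```

In the PR itself the flag lives on ArrowReaderProperties and the two paths build an arrow::DictionaryArray or a chunked dense array respectively; this sketch only shows the shape of the branch.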

@wesm can you let me know if this is heading in the right direction? I'm working on some tests to accompany these changes.

@codecov-io:

Codecov Report

Merging #3721 into master will increase coverage by 0.82%.
The diff coverage is 96.88%.


@@            Coverage Diff             @@
##           master    #3721      +/-   ##
==========================================
+ Coverage   87.81%   88.64%   +0.82%     
==========================================
  Files         727      594     -133     
  Lines       89504    80194    -9310     
  Branches     1252        0    -1252     
==========================================
- Hits        78600    71089    -7511     
+ Misses      10788     9105    -1683     
+ Partials      116        0     -116
Impacted Files Coverage Δ
cpp/src/arrow/testing/random.h 100% <ø> (ø) ⬆️
cpp/src/parquet/arrow/reader.h 100% <100%> (ø)
cpp/src/parquet/encoding.cc 93.84% <100%> (+6.03%) ⬆️
cpp/src/parquet/encoding-test.cc 100% <100%> (ø) ⬆️
cpp/src/parquet/arrow/arrow-reader-writer-test.cc 95.47% <100%> (+0.12%) ⬆️
cpp/src/parquet/encoding.h 98.46% <100%> (+0.63%) ⬆️
cpp/src/parquet/arrow/record_reader.cc 87.97% <90.21%> (-0.6%) ⬇️
cpp/src/arrow/testing/random.cc 97.56% <93.93%> (-2.44%) ⬇️
cpp/src/parquet/arrow/reader.cc 84.15% <97.87%> (+0.48%) ⬆️
cpp/src/plasma/thirdparty/ae/ae.c 70.75% <0%> (-0.95%) ⬇️
... and 139 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9d73e0a...f644fff.

@wesm (Member) left a comment:

+1, will merge this once the build passes. Thanks @hatemhelal for your patience and for addressing my comments

@wesm (Member) commented Mar 18, 2019:

Ah well, the build has passed so merging now

@wesm wesm closed this in fd0b90a Mar 18, 2019
@xhochy (Member) commented Mar 18, 2019:

Huge thanks for tackling this @hatemhelal !

@hatemhelal hatemhelal deleted the arrow-3769 branch July 11, 2019 07:46
kou added a commit that referenced this pull request Dec 7, 2020
As per a TODO left in ARROW-3769 / #3721 we can now use the `GTEST_SKIP` macro in `parquet/encoding-test.cpp`. `GTEST_SKIP` was added in gtest 1.10.0, so this involves bumping our minimal gtest version from 1.8.1 to 1.10.0.

Closes #8782 from arw2019/ARROW-10746-GTEST_SKIP

Lead-authored-by: Andrew Wieteska <andrew.r.wieteska@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>