ARROW-3769: [C++] Add support for reading non-dictionary encoded binary Parquet columns directly as DictionaryArray #3721
Conversation
Force-pushed from 07fca5d to 9d08e36
@@ -163,4 +172,132 @@ static void BM_DictDecodingInt64_literals(benchmark::State& state) {
BENCHMARK(BM_DictDecodingInt64_literals)->Range(1024, 65536);

std::shared_ptr<::arrow::Array> MakeRandomStringsWithRepeats(size_t num_unique,
Consider adding a comment here which provides an example of what the output of MakeRandomStringsWithRepeats might look like for example values of num_unique and num_values.
Feel free to disregard this if you don't think it is helpful, but you could consider including alphabet (i.e., a list of letters from which to draw randomly) as one of the input arguments to MakeRandomStringsWithRepeats, rather than using ::arrow::random_ascii inside of the function body. This might facilitate greater flexibility and re-usability of this function in other contexts. For example, there might be a case in the future where a client wants to generate random strings containing Unicode characters, rather than just ASCII.
I like this idea and will take this up in ARROW-4661.
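As an illustration of both suggestions, here is a rough sketch of what a `MakeRandomStringsWithRepeats` helper with a caller-supplied alphabet could look like. This is not the PR's actual implementation: the parameter list, default lengths, fixed seed, and error handling are assumptions. For example, with `num_unique = 3` and `num_values = 6` the output might look like `["qx", "abc", "qx", "abc", "zz", "zz"]`.

```cpp
#include <memory>
#include <random>
#include <string>
#include <vector>

#include "arrow/array.h"
#include "arrow/builder.h"
#include "arrow/status.h"

// Sketch only: build `num_values` strings drawn (with replacement) from a pool of
// `num_unique` random strings composed from `alphabet`, so repeats are guaranteed
// whenever num_values > num_unique.
std::shared_ptr<::arrow::Array> MakeRandomStringsWithRepeats(
    size_t num_unique, size_t num_values, const std::string& alphabet,
    size_t min_length = 2, size_t max_length = 10) {
  std::default_random_engine gen(42);  // fixed seed keeps benchmark inputs reproducible
  std::uniform_int_distribution<size_t> length_dist(min_length, max_length);
  std::uniform_int_distribution<size_t> char_dist(0, alphabet.size() - 1);

  // Build the pool of unique values.
  std::vector<std::string> pool(num_unique);
  for (auto& s : pool) {
    s.resize(length_dist(gen));
    for (auto& c : s) c = alphabet[char_dist(gen)];
  }

  // Sample from the pool with replacement to introduce repeats.
  std::uniform_int_distribution<size_t> index_dist(0, num_unique - 1);
  ::arrow::StringBuilder builder;
  std::shared_ptr<::arrow::Array> result;
  for (size_t i = 0; i < num_values; ++i) {
    const std::string& s = pool[index_dist(gen)];
    if (!builder.Append(s.data(), static_cast<int32_t>(s.size())).ok()) {
      return nullptr;  // simplistic error handling for a sketch
    }
  }
  if (!builder.Finish(&result).ok()) return nullptr;
  return result;
}
```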
@hatemhelal I did an initial review of your changes, and overall they look good! Almost all of my feedback is minor and primarily stylistic in nature. I am new to this area of the code base, so you can take my comments with a grain of salt. Many of them may stem more from my own lack of knowledge in this area, and my attempt to learn more, than from actual issues with the code. Let me know if you have any questions regarding any of my feedback. Thanks!

Thanks @kevingurney and @emkornfield for the code review! Let me know if you think of anything else.
Force-pushed from dedbb9c to 61f22c5
rdmello left a comment:
I only have some minor comments; I think this looks really good overall!
cpp/src/parquet/encoding.cc (outdated)
@@ -941,6 +948,12 @@ class DictByteArrayDecoder : public DictDecoderImpl<ByteArrayType>,
    return result;
  }

  int DecodeArrowNonNull(int num_values, ::arrow::BinaryDictionaryBuilder* out) override {
I might be getting confused by the diff engine on GitHub, but is this the same code as on lines 725-730? If so, is there a common function both these methods could call?
They are the same but I don't see an easy way to share an implementation. Let me think on this one.
I refactored this slightly in the latest commit. @rdmello let me know how this looks to you.
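For context, the shape of the refactoring being discussed looks roughly like the following. This is a sketch rather than the actual parquet-cpp code: the template parameters and member names stand in for the decoder's real RLE index decoder and dictionary.

```cpp
// Sketch: pull the duplicated index-materialization loop out of
// DecodeArrow/DecodeArrowNonNull into one shared helper. `IndexSource` stands in
// for the RLE index decoder (anything exposing bool Get(int*)), `Dictionary` for
// the decoded dictionary values, and `Builder` for a dictionary builder with an
// Append() method. None of these names come from the actual codebase, and error
// handling is omitted.
template <typename IndexSource, typename Dictionary, typename Builder>
int AppendDictionaryIndices(int num_values, IndexSource* indices,
                            const Dictionary& dictionary, Builder* out) {
  int values_decoded = 0;
  for (int i = 0; i < num_values; ++i) {
    int index = 0;
    if (!indices->Get(&index)) break;  // stop when the index stream is exhausted
    out->Append(dictionary[index]);    // append the referenced dictionary value
    ++values_decoded;
  }
  return values_decoded;
}

// Both overloads then reduce to calls like:
//   int DecodeArrowNonNull(int num_values, ::arrow::BinaryDictionaryBuilder* out) override {
//     return AppendDictionaryIndices(num_values, &idx_decoder_, dictionary_, out);
//   }
```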
The latest commit adds a test case for PARQUET-1537. Just needed to have a null in the input data to reproduce the reported issue.

We've largely stopped using MT in unit tests because it's slower than the alternatives (cc @pitrou)

What does MT mean in this context? Edit: ah, Mersenne Twister. Yes, you wouldn't believe it, but it made the tests significantly slower.

Would you recommend swapping back to the default random engine or another engine entirely?

The default random engine, or anything else that's fast (for example a xorshift-like PRNG). Are you worried about poor quality of the default engine?

I think the default engine should be good; reproducibility is the main concern for the benchmark. Probably just an overreaction to "implementation defined"...
Switched back to the default random engine.
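For reference, a minimal illustration of the engine swap under a fixed seed (the helper name and seed value here are made up); the point is that reproducibility comes from seeding, not from the particular engine.

```cpp
#include <random>

// Hypothetical helper: any fast engine with a fixed seed gives reproducible
// benchmark inputs; std::default_random_engine replaces the slower std::mt19937
// discussed above.
inline std::default_random_engine MakeBenchmarkRng() {
  constexpr unsigned int kSeed = 1337;  // arbitrary but fixed seed
  return std::default_random_engine(kSeed);
}
```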
I can review this soon. Is this still WIP?
I think this is ready to review. This PR should resolve ARROW-3769. My plan is to use ARROW-3772 to build on this change and plumb this through to the parquet -> arrow reader.
The most recent commit adds changes that go towards resolving ARROW-3772. @wesm can you let me know if this is heading in the right direction? I'm working on some tests to accompany these changes.
Codecov Report
@@ Coverage Diff @@
## master #3721 +/- ##
==========================================
+ Coverage 87.81% 88.64% +0.82%
==========================================
Files 727 594 -133
Lines 89504 80194 -9310
Branches 1252 0 -1252
==========================================
- Hits 78600 71089 -7511
+ Misses 10788 9105 -1683
+ Partials 116 0 -116
Continue to review full report at Codecov.
wesm left a comment:
+1, will merge this once the build passes. Thanks @hatemhelal for your patience and for addressing my comments.

Ah well, the build has passed, so merging now.

Huge thanks for tackling this, @hatemhelal!
As per a TODO left in ARROW-3769 / #3721, we can now use the `GTEST_SKIP` macro in `parquet/encoding-test.cpp`. `GTEST_SKIP` was added in gtest 1.10.0, so this involves bumping our minimal gtest version from 1.8.1.

Closes #8782 from arw2019/ARROW-10746-GTEST_SKIP

Lead-authored-by: Andrew Wieteska <andrew.r.wieteska@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
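For anyone unfamiliar with the macro, `GTEST_SKIP` is used roughly like this (a generic illustration, not the actual test in `parquet/encoding-test.cpp`):

```cpp
#include <gtest/gtest.h>

// Hypothetical predicate standing in for whatever build/runtime condition the
// real test checks before deciding to skip.
static bool OptionalFeatureAvailable() { return false; }

TEST(EncodingTest, SkipsWhenFeatureUnavailable) {
  if (!OptionalFeatureAvailable()) {
    GTEST_SKIP() << "optional feature not available in this build";  // requires gtest >= 1.10.0
  }
  // ... the rest of the test runs only when the feature is available ...
}
```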
This patch addresses the following JIRAs: ARROW-3769 and, partially, ARROW-3772.
Also included is an experimental class `ArrowReaderProperties` that can be used to select which columns are read directly as an `arrow::DictionaryArray` (a usage sketch is shown below).
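A rough usage sketch follows; the API was experimental at the time, and the property setter and headers shown here follow later Arrow releases, so treat this as illustrative rather than definitive.

```cpp
#include <memory>
#include <string>

#include "arrow/api.h"
#include "parquet/arrow/reader.h"
#include "parquet/file_reader.h"
#include "parquet/properties.h"

// Read a Parquet file, asking for column 0 to be decoded directly into an
// arrow::DictionaryArray instead of a dense binary array.
arrow::Status ReadWithDictionaryColumn(const std::string& path,
                                       std::shared_ptr<arrow::Table>* out) {
  parquet::ArrowReaderProperties props;
  props.set_read_dictionary(/*column_index=*/0, /*read_dict=*/true);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::FileReader::Make(
      arrow::default_memory_pool(), parquet::ParquetFileReader::OpenFile(path),
      props, &reader));
  return reader->ReadTable(out);
}
```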
I think some more work is needed to fully address the requests in ARROW-3772, namely the ability to automatically infer which columns in a parquet file should be read as `DictionaryArray`. My current thinking is that this would be solved by introducing optional arrow type metadata to files written with the `parquet::arrow::FileWriter`. There are some limitations with this approach, but it would seem to satisfy the requests for users working with parquet files within the supported arrow ecosystem.

Note that the behavior with this patch is that incremental reading of a parquet file will not resolve the global dictionary for all of the row groups. There are a few possible solutions for this: