ARROW-9810: [C++] Generalized nested reconstruction helpers #8156

emkornfield · 2020-09-10T05:23:14Z

This adds helper methods for reconstructing all necessary metadata
for Arrow types. For now this doesn't handle null_slot_usage (i.e.
children of FixedSizeList); it throws exceptions when nulls are
encountered in this case.

The unit tests demonstrate how to use the helper methods in combination
with LevelInfo (generated from parquet/arrow/schema.h) to reconstruct
the metadata.

- Refactors necessary APIs to use LevelInfo and makes use of them in
  column_reader.
- Adds an implementation for reconstructing list validity bitmaps using rep/def levels.
- Adds an implementation for reconstructing list lengths (from rep/def levels).
- Adds dynamic dispatch for level comparison algorithms for AVX2 and BMI2.
- Adds a pextract alternative that uses BitRunReader that can be
  used as a fallback.

This adds helper methods for reconstructing all necessary metadata for Arrow types. For now this doesn't handle null_slot_usage (i.e. children of FixedSizeList); it throws exceptions when nulls are encountered in this case. The unit tests demonstrate how to use the helper methods in combination with LevelInfo (generated from parquet/arrow/schema.h) to reconstruct the metadata.

- Refactors necessary APIs to use LevelInfo and makes use of them in column_reader.
- Adds implementations for reconstructing list validity bitmaps (one uses rep/def levels; one uses a greater-than bitmap generated from rep/def levels).
- Adds implementations for reconstructing list lengths (one uses rep/def levels; one uses a greater-than bitmap generated from rep/def levels).
- Adds dynamic dispatch for level comparison algorithms for AVX2 and BMI2.
- Adds a pextract alternative that uses BitRunReader that can be used as a fallback.

github-actions · 2020-09-10T05:32:14Z

https://issues.apache.org/jira/browse/ARROW-9810

pitrou · 2020-09-10T13:02:56Z

cpp/src/parquet/level_comparison.h

+  return ::arrow::BitUtil::ToLittleEndian(mask);
+}
+
+/// Builds a  bitmap where each set bit indicates the corresponding level is greater


I don't understand why you're doing this. Integer comparisons are cheap, while writing and reading bitmaps is expensive. Why go through an intermediate bitmap (and then spend some time trying to optimize it?)?

This rearranges code from a prior PR, which on my box showed a 20% improvement in end-to-end parquet-to-Arrow read performance on existing benchmarks. That indicates that the comparisons, while individually cheap, take a considerable share of the decoding time.

Why generalize in this way?
The generation of bitmaps can be made cheaper than it currently is. Right now parquet RLE-encodes rep/def levels. We fully decode these levels into int16_t and then try to reconstruct nested metadata from them. For places where there are actually runs, the bitmaps become much less expensive to generate (it is a simple mask of N bits). For cases where the max def/rep level is one, I don't think a comparison would even be necessary for non-RLE-encoded data (the data is already in bitmap form).

Working at the bitmap level allows taking advantage of bit-level parallelism, at the cost of doing extra work for each column (the scalar version of the algorithm I include also does the same extra work when compared with the current reconstruction algorithm).
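To make the point about runs concrete, here is a minimal scalar sketch. It is not the code in this PR: `GreaterThanBitmapScalar` and `RunBitmap` are hypothetical names, each call covers at most 64 levels, and the little-endian conversion applied by the real helper is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bit i of the result is set when levels[i] > rhs.
// One 64-bit word covers at most 64 decoded levels.
inline uint64_t GreaterThanBitmapScalar(const int16_t* levels, size_t num_levels,
                                        int16_t rhs) {
  assert(num_levels <= 64);
  uint64_t mask = 0;
  for (size_t i = 0; i < num_levels; ++i) {
    mask |= static_cast<uint64_t>(levels[i] > rhs) << i;
  }
  return mask;
}

// Hypothetical helper: for an RLE run of `run_length` identical levels a
// single comparison decides the whole run, so the bitmap is either empty or
// a mask with the low `run_length` bits set (capped at 64 bits here).
inline uint64_t RunBitmap(int16_t run_level, int run_length, int16_t rhs) {
  if (run_level <= rhs || run_length <= 0) return 0;
  if (run_length >= 64) return ~uint64_t{0};
  return (uint64_t{1} << run_length) - 1;
}
```

The first function pays one comparison per level; the second pays one comparison per run, which is where the savings for RLE-encoded data would come from.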

It is also worth pointing out there are roughly three cases to consider:

No nested lists (in which case using this method will directly generate validity bitmaps and be a strict performance win).

Single list (when BMI2 is usable I believe this approach will also be faster than any other approach, I'm a little bit less confident when BMI2 isn't usable, but I would expect at least equal performance).

Nested lists. At a certain point using something similar to the batch approach will start to win versus the bitmap approach. This could be as early as 2 lists, but I would guess it is likely >= 3 lists.
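For reference, the single-list case can be reconstructed directly from the decoded levels with a scalar loop. The sketch below is an illustration under stated assumptions, not the helper added in this PR: `ListLayout`, `ReconstructSingleList`, and the `def_level_if_empty` parameter are hypothetical, there is exactly one list level, a rep level of 0 starts a new list, a def level at or above `def_level_if_empty` means the list is non-null, and a def level above it means the entry occupies an element slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: Arrow-style list offsets plus per-list validity.
struct ListLayout {
  std::vector<int32_t> offsets;  // number of lists + 1 entries
  std::vector<bool> valid;       // one entry per list
};

// Hypothetical scalar reconstruction for a single list level.
inline ListLayout ReconstructSingleList(const std::vector<int16_t>& def_levels,
                                        const std::vector<int16_t>& rep_levels,
                                        int16_t def_level_if_empty) {
  assert(def_levels.size() == rep_levels.size());
  ListLayout out;
  out.offsets.push_back(0);
  int32_t num_elements = 0;
  for (size_t i = 0; i < def_levels.size(); ++i) {
    if (rep_levels[i] == 0) {
      // A new list starts here; close the previous one first.
      if (i != 0) out.offsets.push_back(num_elements);
      out.valid.push_back(def_levels[i] >= def_level_if_empty);
    }
    // Entries defined past the "empty list" level occupy an element slot
    // (the element itself may still be null).
    if (def_levels[i] > def_level_if_empty) ++num_elements;
  }
  if (!def_levels.empty()) out.offsets.push_back(num_elements);
  return out;
}
```

With def levels {3, 2, 0, 1, 3}, rep levels {0, 1, 0, 0, 0}, and `def_level_if_empty` set to 1, this describes four lists: one with two slots, one null, one empty, and one with a single slot.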

pitrou · 2020-09-10T13:04:45Z

Hmm... I don't know if this is easy to tear apart, but I'd appreciate if the PR concentrated on the functionality, and micro-optimizations (SIMD, BMI or otherwise) separated into later PRs. Would that be reasonably doable?

emkornfield · 2020-09-10T15:41:47Z

> Hmm... I don't know if this is easy to tear apart, but I'd appreciate if the PR concentrated on the functionality, and micro-optimizations (SIMD, BMI or otherwise) separated into later PRs. Would that be reasonably doable?

How about I remove the net-new bitmap-related code? The SIMD/BMI code was already present, hidden behind compiler flags rather than dispatched at runtime. The runtime dispatch cleans up technical debt so that users can benefit from these code paths in prepackaged binaries (the PR that introduced them showed a 20% improvement on our end-to-end parquet->Arrow reading benchmarks).
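As a rough illustration of the difference between compile-time flags and runtime dispatch, the sketch below picks a bit-extraction routine once, on first use. It is not the dispatch mechanism in this PR: all names are hypothetical, the CPU check uses the GCC/Clang builtin `__builtin_cpu_supports` as a stand-in for whatever detection the library uses, and the BMI2 path is only compiled on x86-64 with those compilers.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAVE_BMI2_PATH 1
#endif

using ExtractBitsFn = uint64_t (*)(uint64_t value, uint64_t mask);

// Portable fallback: packs the bits of `value` selected by `mask` into the
// low bits of the result, one selected bit per loop iteration.
static uint64_t ExtractBitsScalar(uint64_t value, uint64_t mask) {
  uint64_t result = 0;
  int out_pos = 0;
  while (mask != 0) {
    const uint64_t lowest = mask & (~mask + 1);  // isolate lowest set bit
    if (value & lowest) result |= uint64_t{1} << out_pos;
    ++out_pos;
    mask &= mask - 1;  // clear lowest set bit
  }
  return result;
}

#ifdef SKETCH_HAVE_BMI2_PATH
// Built with BMI2 enabled for this function only, independent of the
// flags used for the rest of the translation unit.
__attribute__((target("bmi2"))) static uint64_t ExtractBitsBmi2(uint64_t value,
                                                                uint64_t mask) {
  return _pext_u64(value, mask);
}
#endif

// Chooses an implementation based on what the running CPU reports.
static ExtractBitsFn ResolveExtractBits() {
#ifdef SKETCH_HAVE_BMI2_PATH
  if (__builtin_cpu_supports("bmi2")) return ExtractBitsBmi2;
#endif
  return ExtractBitsScalar;
}

// Public entry point: resolves the implementation once, then reuses it.
static uint64_t ExtractBits(uint64_t value, uint64_t mask) {
  static const ExtractBitsFn impl = ResolveExtractBits();
  assert(impl != nullptr);
  return impl(value, mask);
}
```

A binary built this way contains both paths, so a prepackaged build can still use BMI2 where the CPU supports it and fall back to the scalar loop elsewhere.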

emkornfield · 2020-09-11T04:01:02Z

@pitrou I've removed the code for reconstructing list information based on bitmaps. I would appreciate keeping the rest of the changes in this PR.

emkornfield · 2020-09-13T05:00:15Z

Closing in favor of: #8177

emkornfield requested a review from pitrou September 10, 2020 05:23

pitrou reviewed Sep 10, 2020

remove bitmap code

6be229c

emkornfield closed this Sep 13, 2020
