-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-9810: [C++] Generalized nested reconstruction helpers #8156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This adds helper methods for reconstructing all necessary metadata for arrow types. For now this doesn't handle null_slot_usage (i.e. children of FixedSizeList), it throws exceptions when nulls are encountered in this case. The unit tests demonstrate how to use the helper methods in combination with LevelInfo (generated from parquet/arrow/schema.h) to reconstruct the metadata. - Refactors necessary APIs to use LevelInfo and makes use of them in column_reader - Adds implementations for reconstructing list validity bitmaps (one uses rep/def levels. one uses greater then bitmap generated from rep/def levels). - Adds implementations for reconstruction list lengths (one uses rep/def levels. one uses greater then bitmap generated from rep/def levels). - Adds dynamic dispatch for level comparison algorithms for AVX2 and BMI2. - Adds a pextract alternative that uses BitRunReader that can be used as a fallback.
| return ::arrow::BitUtil::ToLittleEndian(mask); | ||
| } | ||
|
|
||
| /// Builds a bitmap where each set bit indicates the corresponding level is greater |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why you're doing this. Integer comparisons are cheap, while writing and reading bitmaps is expensive. Why go through an intermediate bitmap (and then spend some time trying to optimize it?)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is rearranging code from a prior which on my box showed 20% end-to-end parquet to arrow read performance on existing benchmarks, indicating that while cheap the comparison take a considerable amount of time in decoding.
Why generalize in this way?
The generation of bitmaps can be made cheaper than it currently is. Right now parquet RLE encodes rep/def levels. We fully decode these levels into int16_t and then try to reconstruct nested metadata from them. For places where there are actually runs, the bitmaps become much less expensive to generate (its a simple mask of N-bits). For cases when max def/rep-level is one, I don't think comparison would even be necessary for non-RLE encoded data (the data is already in bitmap form).
Working at the bitmaps level allows taking advantage of bit-level parallelism at the cost of doing extra for each column (The scalar version of the algorithm I include actually also does the same extra work when compared with the current reconstruction algorithm).
It is also worth pointing out there are roughly three cases to consider:
- No-nested lists (in which case using this method will directly generate validity bitmaps and be a strict performance win).
- Single list (when BMI2 is usable I believe this approach will also be faster than any other approach, I'm a little bit less confident when BMI2 isn't usable, but I would expect at least equal performance).
- Nested lists. At a certain point using something similar to the batch approach will start to win versus the bitmap approach. This could be as early as 2 lists, but I would guess it is likely >= 3 lists.
|
Hmm... I don't know if this is easy to tear apart, but I'd appreciate if the PR concentrated on the functionality, and micro-optimizations (SIMD, BMI or otherwise) separated into later PRs. Would that be reasonably doable? |
How about I remove the net new bitmap related code? The SIMD/BMI code was already present hidden behind compiler flags instead of runtime dispatched. The runtime dispatch is cleaning up technical debt to allow users to make use of these in prepackaged binaries (the PR that introduced them showed a 20% improvement on our end to end parquet->Arrow reading benchmarks). |
|
@pitrou I've removed the code for reconstructing list information based on bitmaps, I would appreciate keeping the rest of the in this PR. |
|
Closing in favor of: #8177 |
This adds helper methods for reconstructing all necessary metadata
for arrow types. For now this doesn't handle null_slot_usage (i.e.
children of FixedSizeList), it throws exceptions when nulls are
encountered in this case.
The unit tests demonstrate how to use the helper methods in combination
with LevelInfo (generated from parquet/arrow/schema.h) to reconstruct
the metadata.
column_reader
used as a fallback.