ARROW-8494: [C++][Parquet] Full support for reading mixed list and structs #8177
Conversation
cpp/src/parquet/level_conversion.cc
// filtered out above).
if (lengths != nullptr) {
  ++lengths;
  *lengths = (def_levels[x] >= level_info.def_level) ? 1 : 0;
we should consider doing the sum of the previous element here. Originally I did not because I thought at some point getting raw lengths would make it easier to handle chunked arrays in reader.cc, but I think that case is esoteric enough that removing the need to touch this data twice will be better.
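To illustrate the suggestion with a simplified, self-contained sketch (not the library code; the helper name is hypothetical): turning per-list lengths into Arrow offsets is a prefix sum, and folding that sum into the writing pass avoids touching the buffer twice.

#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: converts per-list lengths into Arrow-style offsets.
// Doing the sum while lengths are produced removes this second pass.
std::vector<int32_t> LengthsToOffsets(const std::vector<int32_t>& lengths) {
  std::vector<int32_t> offsets(lengths.size() + 1, 0);
  std::partial_sum(lengths.begin(), lengths.end(), offsets.begin() + 1);
  return offsets;
}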
python/pyarrow/tests/test_parquet.py
@@ -692,14 +692,6 @@ def test_pandas_can_write_nested_data(tempdir):
    # This succeeds under V2
    _write_table(arrow_table, imos)

    # Under V1 it fails.
this was meant for my other PR, I will revert it.
pitrou
left a comment
Ok, thanks a lot for this PR. I think I am understanding the implementation (I skipped parquet/arrow/reader.cc for now, though). Some of the implementation details are still confusing me a bit. In any case, here are some comments.
// Arrow schema: struct(a: list(int32 not null) not null) not null
SetParquetSchema(GroupNode::Make(
    "a", Repetition::REQUIRED,
    "a", Repetition::REQUIRED,  // this
Hmm... is the comment pointing to some particular detail? It seems a bit cryptic.
sorry. removed.
const int max_rep_level = 1;
LevelInfo level_info;
level_info.def_level = 3;
level_info.rep_level = 1;
For the record, is rep_level useful in this test?
Yes. I added a comment about this in level_conversion.cc.
// It is simpler to rely on rep_level here until PARQUET-1899 is done and the code
// is deleted in a follow-up release.
Once this is cleaned up, it is not required.
uint64_t validity_output;
ValidityBitmapInputOutput validity_io;
validity_io.values_read_upper_bound = 4;
validity_io.valid_bits = reinterpret_cast<uint8_t*>(&validity_output);
Please make validity_output a uint8_t or a uint8_t[1]. We don't want to encourage endianness issues (I realize this wouldn't happen here because we don't actually test the value of validity_output?).
Ping here.
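For reference, a minimal sketch of the requested change, mirroring the test snippet above (ValidityBitmapInputOutput as in level_conversion.h):

uint8_t validity_output = 0;  // byte-sized output: no endianness ambiguity
ValidityBitmapInputOutput validity_io;
validity_io.values_read_upper_bound = 4;
validity_io.valid_bits = &validity_output;  // no reinterpret_cast needed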
typename TypeParam::ListLengthType* next_position = this->converter_.ComputeListInfo(
    this->test_data_, level_info, &validity_io, lengths.data());

EXPECT_THAT(next_position, lengths.data() + 4);
Hmm... is this supposed to be EXPECT_EQ? I'm curious why/how this line works.
that is a good question, maybe an implicit Eq matcher. Fixed.
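For the record, this works because gMock implicitly wraps a plain value in an Eq matcher, so the two spellings below are equivalent; EXPECT_EQ states the intent more directly:

EXPECT_THAT(next_position, testing::Eq(lengths.data() + 4));
EXPECT_EQ(next_position, lengths.data() + 4);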
    this->test_data_, level_info, &validity_io, lengths.data());

EXPECT_THAT(next_position, lengths.data() + 4);
EXPECT_THAT(lengths, testing::ElementsAre(0, 3, 7, 7, 7));
I was a bit miffed here. Can lengths be renamed offsets?
yes, it should be. sorry about that, I changed my mind on semantics late in this PR and didn't rename.
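For context, (0, 3, 7, 7, 7) reads as Arrow-style offsets rather than lengths: list 0 spans elements [0, 3), list 1 spans [3, 7), and lists 2 and 3 are empty.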
// repeated_ancestor_def_level
uint64_t present_bitmap = internal::GreaterThanBitmap(
    def_levels, batch_size, level_info.repeated_ancestor_def_level - 1);
uint64_t selected_bits = ExtractBits(defined_bitmap, present_bitmap);
FTR, I think that if BMI isn't available, you can still use a batch size of 5 or 6 bits and use a fast lookup table for ExtractBits (rather than the probably slow emulation code).
I would need to think about this algorithm a little bit more, and my expectation is that we should still be seeing runs of 0s or 1s in most cases. As noted before, if this emulation doesn't work well on an AMD box we can revert to the scalar version.
Yeah, we can think about that for another PR anyway. Will try to run benchmarks.
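For reference, a minimal scalar sketch of what a non-BMI2 fallback could look like (the name ExtractBitsScalar is hypothetical): a PEXT-style bit gather done one select bit at a time. The lookup-table idea above would process 5-6 select bits per step instead of one.

#include <cstdint>

inline uint64_t ExtractBitsScalar(uint64_t bitmap, uint64_t select_bitmap) {
  uint64_t result = 0;
  int output_position = 0;
  while (select_bitmap != 0) {
    uint64_t lowest_set = select_bitmap & (~select_bitmap + 1);
    if (bitmap & lowest_set) {
      result |= uint64_t{1} << output_position;
    }
    ++output_position;
    select_bitmap &= select_bitmap - 1;  // clear the lowest set bit
  }
  return result;
}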
Are there any benchmarks worth running here?
Please let me know if there is more confusion, I will attempt to add clarifying comments. I think I addressed all your comments except for some in level_conversion_test.cc; I'll address those tomorrow (I assume there will be more comments in reader.cc as well).
parquet-level-conversion-benchmark
}

inline uint64_t ExtractBits(uint64_t bitmap, uint64_t select_bitmap) {
#if defined(ARROW_HAVE_BMI2) && !defined(__MINGW32__)
Comment why MinGW is left out?
done.
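For context, the BMI2 fast path is essentially the PEXT instruction: _pext_u64 gathers the bits of bitmap selected by select_bitmap into the low bits of the result. A sketch (ExtractBitsBmi2 is a hypothetical name, and the translation unit must be compiled with -mbmi2):

#include <cstdint>
#include <immintrin.h>

inline uint64_t ExtractBitsBmi2(uint64_t bitmap, uint64_t select_bitmap) {
  return _pext_u64(bitmap, select_bitmap);
}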
cpp/src/parquet/level_conversion.h
// Converts def_levels to validity bitmaps for non-list arrays and structs that have
// at least one member that is not a list and has no list descendants.
// For lists, use DefRepLevelsToList; for structs where all descendants contain
// a list, use DefRepLevelsToBitmap.
Fun :-)
cpp/src/parquet/level_conversion.h
// the next offset. See documentation of DefLevelsToBitmap for when to use this
// method vs the other ones in this file for reconstruction.
//
// Offsets must be sized to 1 + values_read_upper_bound.
Great comments, thank you.
cpp/src/parquet/level_conversion.h
@@ -19,6 +19,7 @@

#include <cstdint>

#include "arrow/util/bitmap.h"
Looks like this include is not used after all?
nope.
cpp/src/parquet/level_conversion.cc
    : standard::DefLevelsToBitmapSimd</*has_repeated_parent=*/true>;
fn(def_levels, num_def_levels, level_info, output);
#else
// This indicates we are likely on a big-endian platformat which don't have a
"platform"
But the comment is mistaken: ARM is little-endian most of the time (technically it supports both, but Linux runs it in little-endian mode AFAIK).
Also, I don't understand why DefLevelsToBitmapScalar is preferred here but DefLevelsToBitmapSimd is preferred below? Don't the same arguments apply?
Added a specific case for little endian.
I added a comment below, but when there is no repeated parent, all platforms should have good SIMD options for converting to bitmap.
pitrou
left a comment
Some comments on reader.cc now.
cpp/src/parquet/arrow/reader.cc
}
ARROW_ASSIGN_OR_RAISE(
    std::shared_ptr<Buffer> lengths_buffer,
    AllocateBuffer(sizeof(IndexType) * std::max(int64_t{2}, number_of_slots + 1),
Why 2?
I was thinking there was some issue with list arrays that always required two elements. I couldn't find the issue though.
cpp/src/parquet/arrow/reader.cc
    validity_io.valid_bits = validity_buffer->mutable_data();
}
ARROW_ASSIGN_OR_RAISE(
    std::shared_ptr<Buffer> lengths_buffer,
"offsets" rather than "lengths"
yep. done.
cpp/src/parquet/arrow/reader.cc
    ctx_->pool));
// ensure zero initialization in case we have reached a zero length list (and
// because first entry is always zero).
IndexType* length_data = reinterpret_cast<IndexType*>(lengths_buffer->mutable_data());
"offset_data" or "offsets_data"
done.
cpp/src/parquet/arrow/reader.cc
// ensure we've set all the bits here.
if (validity_io.values_read < number_of_slots) {
  // + 1 because arrays lengths are values + 1
  std::fill(length_data + validity_io.values_read + 1,
Why do you need to write some values past the offsets end?
changed to resize.
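A minimal sketch of the resize approach under standard Arrow buffer APIs (AllocateResizableBuffer, ResizableBuffer::Resize); the function name and parameters here are illustrative only:

#include "arrow/buffer.h"
#include "arrow/memory_pool.h"
#include "arrow/result.h"
#include "arrow/status.h"

// Allocate for the upper bound, then shrink to what was actually read
// instead of zero-filling the unused tail.
arrow::Result<std::shared_ptr<arrow::ResizableBuffer>> MakeOffsetsBuffer(
    int64_t values_read_upper_bound, int64_t values_read,
    arrow::MemoryPool* pool) {
  ARROW_ASSIGN_OR_RAISE(
      std::shared_ptr<arrow::ResizableBuffer> buffer,
      arrow::AllocateResizableBuffer(
          sizeof(int32_t) * (values_read_upper_bound + 1), pool));
  ARROW_RETURN_NOT_OK(buffer->Resize(sizeof(int32_t) * (values_read + 1),
                                     /*shrink_to_fit=*/false));
  return buffer;
}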
cpp/src/parquet/arrow/reader.cc
::parquet::internal::DefRepLevelsToList(def_levels, rep_levels, num_levels,
                                        level_info_, &validity_io, length_data);
END_PARQUET_CATCH_EXCEPTIONS
// We might return less then the requested slot (i.e. reaching an end of a file)
I'm not sure what this means, shouldn't you know up front the number of values? Do you mean the file was truncated before the row group end (is that supported)?
Or is number_of_slots just an upper bound?
Removed. It is an upper bound; renamed it and removed the comment.
cpp/src/parquet/arrow/reader.cc
      children_(std::move(children)) {
// There could be a mix of children some might be repeated some might not be.
// If possible use one that isn't since that will be guaranteed to have the least
// number of rep/def levels.
You mean "of rep levels"? You could have arbitrarily nested structs with a lot of def levels?
rephrased a little bit. There is always an equal number of repetition and definition levels for any particular leaf.
cpp/src/parquet/arrow/reader.cc
  *length = 0;
  return Status::OK();
return Status::Invalid("StructReader had no childre");
"children"
done.
cpp/src/parquet/arrow/reader.cc
if (!found_nullable_child) {
  if (*data == nullptr) {
    // Only happens if there are actually 0 rows available.
    *data = nullptr;
Isn't this redundant with your if condition above? Also, why does length need to be filled out explicitly below?
(doesn't def_rep_level_child_->GetDefLevels do it?).
Yes I think it is, removed.
cpp/src/parquet/arrow/reader.cc
}
if (data == nullptr) {
  // Only happens if there are actually 0 rows available.
  *data = nullptr;
You're dereferencing a null pointer (see if condition above).
yeah, removed. we shouldn't need this.
cpp/src/parquet/arrow/reader.cc
  DefLevelsToBitmap(def_levels, num_levels, level_info_, &validity_io);
}

// Ensure all values are initialized.
Why? Shouldn't you resize the buffer instead?
I didn't think about using resizable buffers. Changed all of the places that were filled to use them.
Impressive!
No changes on
No, I don't think so, since we were already using SIMD for non-nested types.
@pitrou unfortunately, I was missing an "info.rep_level = 1;" in the benchmark, so it is likely not as impressive on AMD. Would you mind running again? (Working on addressing the rest of the feedback.)
emkornfield
left a comment
still need to refactor level_conversion_test but I think I forgot to respond to the last review here.
@@ -50,7 +50,7 @@ if(ARROW_CPU_FLAG STREQUAL "x86")
  # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
  set(ARROW_AVX512_FLAG "-march=skylake-avx512 -mbmi2")
  # Append the avx2/avx512 subset option also, fix issue ARROW-9877 for homebrew-cpp
  set(ARROW_AVX2_FLAG "${ARROW_AVX2_FLAG} -mavx2")
  set(ARROW_AVX2_FLAG "${ARROW_AVX2_FLAG} -mavx2 -mbmi2")
OK, I removed it and placed -mbmi2 specifically only for the parquet file. I think this should be safe because it is guarded via runtime dispatch.
I might not have been clear, but MSVC doesn't have any way of distinguishing these things so if we ever turn on AVX2 by default we have the same issue.
this->test_data_.rep_levels_ = std::vector<int16_t>{0};

std::vector<typename TypeParam::ListLengthType> lengths(
    2, std::numeric_limits<typename TypeParam::ListLengthType>::max());
done.
cpp/src/parquet/arrow/reader.cc
    std::vector<std::shared_ptr<ArrayData>>{item_chunk}, validity_io.null_count);

std::shared_ptr<Array> result = ::arrow::MakeArray(data);
RETURN_NOT_OK(result->Validate());
I moved this above to avoid recursively calling things multiple times, but I think we should be validating at least for structs (and not validating full), since rep/def level information could be inconsistent within them. It felt easier to call Validate (and not too expensive) than writing custom logic for this.
I'm open to removing them, but it feels like there should be a contract here that somewhere in this code we validate consistency for an underlying library.
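(For context, if my reading of the Arrow API is right: Array::Validate() runs only cheap structural checks, while ValidateFull() also scans buffer contents in O(length), which is why calling the former on this path is inexpensive.)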
cpp/src/parquet/arrow/reader.cc
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ArrayData> item_chunk, ChunksToSingle(**out));

std::vector<std::shared_ptr<Buffer>> buffers{
    validity_io.null_count > 0 ? validity_buffer : std::shared_ptr<Buffer>(),
yes, I thought this was causing problems with type inference at some point.
@pitrou I think I responded to all review comments at this point, apologies if I missed something. level_conversion_test.cc is now refactored to a point where the remaining duplicate code, I think, adds to test understandability, though there is still some redundancy. Also, please see my note about the level_conversion benchmark having a bug in it on your prior run.
I think the macOS failure is fixed by #8196 but the Appveyor failure looks legit:
👍 we've reduced the failures to Flight (i.e. surely unrelated) issues.
I'll take a look again on Monday, if that's ok with you.
SGTM. If I have time I might get one or two CLs out based on this one, but I can rebase afterwards.
Running
Things are a bit more balanced if the scalar version is used:
Just for the record, apart from FixedSizeList, is there anything remaining for full nested Parquet -> Arrow reading?
Add export to RunBasedExtract
add PARQUET_EXPORT to GreaterThanBitmap
This reverts commit d19ebc02fec16bc363ba610833ee76d8e9b02668.
Other than that, I see a ~20% improvement on
I have no remaining concern over the code other than the AVX2 / BMI2 split. Congratulations for this PR, this is really a huge improvement! That said, I seem to get a test error on the Python side (pasted below). Let's see if it reproduces on CI: (beware: I rebased your branch on git master)
@pitrou thank you for the thoughtful review. Let me know if you still have issues with the AVX/BMI2 split after I added the comment (perhaps I didn't revert some compilation) or if my analysis is wrong. I think the BMI2 check will be difficult/impossible at compile time for Windows, so I'm not sure if it is worth the effort on Linux. I also removed the failing test for parquet (which I should have removed in a prior PR; it's strange it showed up again).
We need to support LargeList and Map, which should be a smaller change (I'm working on a PR) at the schema inference level. There are a few other JIRAs still open about benchmarking and randomized testing. Past that, there are some open JIRAs about performance improvements:
There is also an unrelated bug on the write side #8219 which I asked @wesm to review (it is based on some changes in this PR).
Do we also need ad hoc nested tests as a separate JIRA / PR? Randomized testing is nice to find corner cases, but it's always easier to diagnose hand-written test cases :-)
Also we only have one-level nested benchmarks for now, I suppose we should add a bit more (two-level nesting may be enough).
I think the internal tests you wrote have pretty good coverage. After #8219 is merged I was planning on making some of the one-way tests (the ones I wrote for write and the ones you wrote for read) fully round-trip. If you think there are gaps, by all means we should add tests.
My main concern is getting some data that reflects real workloads. @jorisvandenbossche it sounded like you had Geo data with multiple levels of nesting; I wonder if there is a canonical dataset we could make use of for benchmarking.
pitrou
left a comment
+1
Also: ARROW-9810 (generalize rep/def level conversion to list lengths/bitmaps)
This adds helper methods for reconstructing all necessary metadata
for arrow types. For now this doesn't handle null_slot_usage (i.e.
children of FixedSizeList); it throws exceptions when nulls are
encountered in this case. These helpers are used for generic
reconstruction. The unit tests demonstrate how to use them in
combination with LevelInfo (generated from parquet/arrow/schema.h) to
reconstruct the metadata. The arrow reader.cc is now rewritten to use
these methods.