
Conversation

@emkornfield (Contributor)

Also: ARROW-9810 (generalize rep/def level conversion to list lengths/bitmaps)

This adds helper methods for reconstructing all necessary metadata
for Arrow types. For now this doesn't handle null_slot_usage (i.e.
children of FixedSizeList); it throws exceptions when nulls are
encountered in this case. These helpers are then used for generic reconstruction.

The unit tests demonstrate how to use the helper methods in combination
with LevelInfo (generated from parquet/arrow/schema.h) to reconstruct
the metadata. The Arrow reader.cc is now rewritten to use these methods.

  • Refactors necessary APIs to use LevelInfo and makes use of them in
    column_reader
  • Adds implementations for reconstructing list validity bitmaps
    (uses rep/def levels)
  • Adds implementations for reconstructing list lengths
    (uses rep/def levels).
  • Adds dynamic dispatch for level comparison algorithms for AVX2
    and BMI2.
  • Adds a PEXT alternative that uses BitRunReader, which can be
    used as a fallback.
  • Fixes some bugs in the detailed reconstruction-to-array tests.

@github-actions

// filtered out above).
if (lengths != nullptr) {
++lengths;
*lengths = (def_levels[x] >= level_info.def_level) ? 1 : 0;
Contributor Author:

We should consider doing the sum with the previous element here. Originally I did not because I thought at some point getting raw lengths would make it easier to handle chunked arrays in reader.cc, but I think that case is esoteric enough that removing the need to touch this data twice will be better.

@emkornfield emkornfield changed the title ARROW-8494: [C++] Full support for mixed lista and structs ARROW-8494: [C++][Parquet] Full support for reading mixed lista and structs Sep 14, 2020
@@ -692,14 +692,6 @@ def test_pandas_can_write_nested_data(tempdir):
# This succeeds under V2
_write_table(arrow_table, imos)

# Under V1 it fails.
Contributor Author:

This was meant for my other PR; I will revert it.

@pitrou (Member) left a comment:

Ok, thanks a lot for this PR. I think I am understanding the implementation (I skipped parquet/arrow/reader.cc for now, though). Some of the implementation details are still confusing me a bit. In any case, here are some comments.

// Arrow schema: struct(a: list(int32 not null) not null) not null
SetParquetSchema(GroupNode::Make(
"a", Repetition::REQUIRED,
"a", Repetition::REQUIRED, // this
Member:

Hmm... is the comment pointing to some particular detail? It seems a bit cryptic.

Contributor Author:

sorry. removed.

const int max_rep_level = 1;
LevelInfo level_info;
level_info.def_level = 3;
level_info.rep_level = 1;
Member:

For the record, is rep_level useful in this test?

Contributor Author:

Yes. I added a comment about this in level_conversion.cc.

// It is simpler to rely on rep_level here until PARQUET-1899 is done and the code
// is deleted in a follow-up release.

Once this is cleaned up it is not required.

uint64_t validity_output;
ValidityBitmapInputOutput validity_io;
validity_io.values_read_upper_bound = 4;
validity_io.valid_bits = reinterpret_cast<uint8_t*>(&validity_output);
Member:

Please make validity_output a uint8_t or a uint8_t[1]. We don't want to encourage endianness issues (I realize this wouldn't happen here because we don't actually test the value of validity_output?).

Member:

Ping here.

typename TypeParam::ListLengthType* next_position = this->converter_.ComputeListInfo(
this->test_data_, level_info, &validity_io, lengths.data());

EXPECT_THAT(next_position, lengths.data() + 4);
Member:

Hmm... is this supposed to be EXPECT_EQ? I'm curious why/how this line works.

Contributor Author:

That is a good question, maybe an implicit EQ. Fixed.

this->test_data_, level_info, &validity_io, lengths.data());

EXPECT_THAT(next_position, lengths.data() + 4);
EXPECT_THAT(lengths, testing::ElementsAre(0, 3, 7, 7, 7));
Member:

I was a bit miffed here. Can lengths be renamed offsets?

Contributor Author:

yes, it should be. sorry about that, I changed my mind on semantics late in this PR and didn't rename.

// repeated_ancenstor_def_level
uint64_t present_bitmap = internal::GreaterThanBitmap(
def_levels, batch_size, level_info.repeated_ancestor_def_level - 1);
uint64_t selected_bits = ExtractBits(defined_bitmap, present_bitmap);
Member:

FTR, I think that if BMI isn't available, you can still use a batch size of 5 or 6 bits and use a fast lookup table for ExtractBits (rather than the probably slow emulation code).

Contributor Author:

I would need to think about this algorithm a little bit more, and my expectation is that we should still be seeing runs of 0s or 1s in most cases. As noted before, if this emulation doesn't work well on an AMD box we can revert to the scalar version.

Member:

Yeah, we can think about that for another PR anyway. Will try to run benchmarks.

@pitrou (Member) commented Sep 14, 2020

Are there any benchmarks worth running here?

@emkornfield (Contributor Author)

Ok, thanks a lot for this PR. I think I am understanding the implementation (I skipped parquet/arrow/reader.cc for now, though). Some of the implementation details are still confusing me a bit. In any case, here are some comments.

Please let me know if there is more confusion, I will attempt to add clarifying comments. I think I addressed all your comments except for some in level_conversion_test.cc I'll address those tomorrow (I assume there will be more comments in reader.cc as well).

Are there any benchmarks worth running here?

parquet-level-conversion-benchmark
parquet-arrow-reader-writer-benchmark (this won't cover the nested cases, though). There is an open JIRA under ARROW-1644 to add benchmarks for nested cases. @npr mentioned there might be some example datasets that we wanted to try this on.

@pitrou pitrou changed the title ARROW-8494: [C++][Parquet] Full support for reading mixed lista and structs ARROW-8494: [C++][Parquet] Full support for reading mixed list and structs Sep 15, 2020
}

inline uint64_t ExtractBits(uint64_t bitmap, uint64_t select_bitmap) {
#if defined(ARROW_HAVE_BMI2) && !defined(__MINGW32__)
@pitrou (Member), Sep 15, 2020:

Comment why MinGW is left out?

Contributor Author:

done.

// Converts def_levels to validity bitmaps for non-list arrays and structs that have
// at least one member that is not a list and has no list descendents.
// For lists use DefRepLevelsToList and structs where all descendants contain
// a list use DefRepLevelsToBitmap.
Member:

Fun :-)

// the next offset. See documentation onf DefLevelsToBitmap for when to use this
// method vs the other ones in this file for reconstruction.
//
// Offsets must be size to 1 + values_read_upper_bound.
Member:

Great comments, thank you.

@@ -19,6 +19,7 @@

#include <cstdint>

#include "arrow/util/bitmap.h"
Member:

Looks like this include is not used after all?

Contributor Author:

nope.

: standard::DefLevelsToBitmapSimd</*has_repeated_parent=*/true>;
fn(def_levels, num_def_levels, level_info, output);
#else
// This indicates we are likely on a big-endian platformat which don't have a
Member:

"platform"

But the comment is mistaken: ARM is little-endian most of the time (technically it supports both, but Linux runs it in little-endian mode AFAIK).

Also, I don't understand why DefLevelsToBitmapScalar is preferred here but DefLevelsToBitmapSimd is preferred below? Don't the same arguments apply?

Contributor Author:

Added a specific case for little endian.

I added a comment below, but when there is no repeated parent, all platforms should have good SIMD options for converting to bitmap.

@pitrou (Member) left a comment:

Some comments on reader.cc now.

}
ARROW_ASSIGN_OR_RAISE(
std::shared_ptr<Buffer> lengths_buffer,
AllocateBuffer(sizeof(IndexType) * std::max(int64_t{2}, number_of_slots + 1),
Member:

Why 2?

Contributor Author:

I was thinking there was some issue with list arrays that always required two elements. I couldn't find the issue though.

validity_io.valid_bits = validity_buffer->mutable_data();
}
ARROW_ASSIGN_OR_RAISE(
std::shared_ptr<Buffer> lengths_buffer,
Member:

"offsets" rather than "lengths"

Contributor Author:

yep. done.

ctx_->pool));
// ensure zero initialization in case we have reached a zero length list (and
// because first entry is always zero).
IndexType* length_data = reinterpret_cast<IndexType*>(lengths_buffer->mutable_data());
Member:

"offset_data" or "offsets_data"

Contributor Author:

done.

// ensure we've set all the bits here.
if (validity_io.values_read < number_of_slots) {
// + 1 because arrays lengths are values + 1
std::fill(length_data + validity_io.values_read + 1,
Member:

Why do you need to write some values past the offsets end?

Contributor Author:

changed to resize.

::parquet::internal::DefRepLevelsToList(def_levels, rep_levels, num_levels,
level_info_, &validity_io, length_data);
END_PARQUET_CATCH_EXCEPTIONS
// We might return less then the requested slot (i.e. reaching an end of a file)
Member:

I'm not sure what this means, shouldn't you know up front the number of values? Do you mean the file was truncated before the row group end (is that supported)?

Member:

Or is number_of_slots just an upper bound?

Contributor Author:

removed. it is an upper bound. renamed it. and removed comment.

children_(std::move(children)) {
// There could be a mix of children some might be repeated some might not be.
// If possible use one that isn't since that will be guaranteed to have the least
// number of rep/def levels.
Member:

You mean "of rep levels"? You could have arbitrarily nested structs with a lot of def levels?

Contributor Author:

rephrased a little bit. There is always an equal number of repetition and definition levels for any particular leaf.

*length = 0;
return Status::OK();
return Status::Invalid("StructReader had no childre");
Member:

"children"

Contributor Author:

done.


if (!found_nullable_child) {
if (*data == nullptr) {
// Only happens if there are actually 0 rows available.
*data = nullptr;
Member:

Isn't this redundant with your if condition above? Also, why does length need to be filled out explicitly below?
(doesn't def_rep_level_child_->GetDefLevels do it?).

Contributor Author:

Yes I think it is, removed.

}
if (data == nullptr) {
// Only happens if there are actually 0 rows available.
*data = nullptr;
Member:

You're dereferencing a null pointer (see if condition above).

Contributor Author:

yeah, removed. we shouldn't need this.

DefLevelsToBitmap(def_levels, num_levels, level_info_, &validity_io);
}

// Ensure all values are initialized.
Member:

Why? Shouldn't you resize the buffer instead?

Contributor Author:

I didn't think about using Resizable buffers. changed all of the places that were filled to it.

@pitrou (Member) commented Sep 15, 2020

parquet-level-conversion-benchmark (AMD Zen 2, clang 10, Ubuntu 20.04):

  • before:
BM_DefinitionLevelsToBitmapRepeatedAllMissing         966 ns          966 ns       714328 bytes_per_second=1.9753G/s
BM_DefinitionLevelsToBitmapRepeatedAllPresent        1626 ns         1626 ns       426358 bytes_per_second=1.173G/s
BM_DefinitionLevelsToBitmapRepeatedMostPresent       1850 ns         1850 ns       367809 bytes_per_second=1055.83M/s
  • after:
BM_DefinitionLevelsToBitmapRepeatedAllMissing         560 ns          560 ns      1239730 bytes_per_second=3.40515G/s
BM_DefinitionLevelsToBitmapRepeatedAllPresent         562 ns          562 ns      1244256 bytes_per_second=3.39684G/s
BM_DefinitionLevelsToBitmapRepeatedMostPresent        562 ns          562 ns      1244691 bytes_per_second=3.39515G/s

Impressive!

@pitrou (Member) commented Sep 15, 2020

No changes on parquet-arrow-reader-writer-benchmark. I suspect it doesn't trigger any of the updated code?

@emkornfield (Contributor Author)

No changes on parquet-arrow-reader-writer-benchmark. I suspect it doesn't trigger any of the updated code?

No, I don't think so, since we were already using SIMD for non-nested types.

@emkornfield (Contributor Author)

@pitrou unfortunately, I was missing an "info.rep_level = 1;" in the benchmark, so it is likely not as impressive on AMD; would you mind running again? (Working on addressing the rest of the feedback.)

@emkornfield (Contributor Author) left a comment:

Still need to refactor level_conversion_test, but I think I forgot to respond to the last review here.

@@ -50,7 +50,7 @@ if(ARROW_CPU_FLAG STREQUAL "x86")
# skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
set(ARROW_AVX512_FLAG "-march=skylake-avx512 -mbmi2")
# Append the avx2/avx512 subset option also, fix issue ARROW-9877 for homebrew-cpp
set(ARROW_AVX2_FLAG "${ARROW_AVX2_FLAG} -mavx2")
set(ARROW_AVX2_FLAG "${ARROW_AVX2_FLAG} -mavx2 -mbmi2")
Contributor Author:

OK, I removed it and placed -mbmi2 specifically only for the Parquet file. I think this should be safe because it is guarded via runtime dispatch.

I might not have been clear, but MSVC doesn't have any way of distinguishing these things, so if we ever turn on AVX2 by default we have the same issue.

this->test_data_.rep_levels_ = std::vector<int16_t>{0};

std::vector<typename TypeParam::ListLengthType> lengths(
2, std::numeric_limits<typename TypeParam::ListLengthType>::max());
Contributor Author:

done.

std::vector<std::shared_ptr<ArrayData>>{item_chunk}, validity_io.null_count);

std::shared_ptr<Array> result = ::arrow::MakeArray(data);
RETURN_NOT_OK(result->Validate());
Contributor Author:

I moved this above to avoid recursively calling things multiple times, but I think we should be validating at least for structs (and not doing a full validation), since rep/def level information could be inconsistent within them. It felt easier to call Validate (and not too expensive) than to write custom logic for this.

I'm open to removing them, but it feels like there should be a contract somewhere in this code that, as an underlying library, we validate consistency.


ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ArrayData> item_chunk, ChunksToSingle(**out));

std::vector<std::shared_ptr<Buffer>> buffers{
validity_io.null_count > 0 ? validity_buffer : std::shared_ptr<Buffer>(),
Contributor Author:

Yes, I thought this was causing problems with type inference at some point.

@emkornfield (Contributor Author)

@pitrou I think I responded to all review comments at this point; apologies if I missed something. level_conversion_test.cc is now refactored to a point where I think the remaining duplicate code adds to test understandability, though there is still some redundancy. Also, please see my note about the level-conversion benchmark having a bug in it on your prior run.

@nealrichardson (Member)

I think the macOS failure is fixed by #8196 but the Appveyor failure looks legit:

C:/projects/arrow/cpp/src/parquet/level_conversion_test.cc(295): error C2220: warning treated as error - no 'object' file generated
C:/projects/arrow/cpp/src/parquet/level_conversion_test.cc(278): note: while compiling class template member function 'void parquet::internal::NestedListTest_SimpleLongList_Test<T>::TestBody(void)'
        with
        [
            T=Type
        ]
googletest_ep-prefix\include\gtest/internal/gtest-internal.h(665): note: see reference to class template instantiation 'parquet::internal::NestedListTest_SimpleLongList_Test<T>' being compiled
        with
        [
            T=Type
        ]
googletest_ep-prefix\include\gtest/internal/gtest-internal.h(657): note: while compiling class template member function 'bool testing::internal::TypeParameterizedTest<parquet::internal::NestedListTest,testing::internal::TemplateSel<parquet::internal::NestedListTest_SimpleLongList_Test>,parquet::internal::gtest_type_params_NestedListTest_>::Register(const char *,const testing::internal::CodeLocation &,const char *,const char *,int,const std::vector<std::string,std::allocator<_Ty>> &)'
        with
        [
            _Ty=std::string
        ]
C:/projects/arrow/cpp/src/parquet/level_conversion_test.cc(278): note: see reference to function template instantiation 'bool testing::internal::TypeParameterizedTest<parquet::internal::NestedListTest,testing::internal::TemplateSel<parquet::internal::NestedListTest_SimpleLongList_Test>,parquet::internal::gtest_type_params_NestedListTest_>::Register(const char *,const testing::internal::CodeLocation &,const char *,const char *,int,const std::vector<std::string,std::allocator<_Ty>> &)' being compiled
        with
        [
            _Ty=std::string
        ]
C:/projects/arrow/cpp/src/parquet/level_conversion_test.cc(278): note: see reference to class template instantiation 'testing::internal::TypeParameterizedTest<parquet::internal::NestedListTest,testing::internal::TemplateSel<parquet::internal::NestedListTest_SimpleLongList_Test>,parquet::internal::gtest_type_params_NestedListTest_>' being compiled
C:/projects/arrow/cpp/src/parquet/level_conversion_test.cc(295): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
[212/245] Linking CXX shared library release\parquet.dll
   Creating library release\parquet.lib and object release\parquet.exp
[213/245] Building CXX object src\parquet\CMakeFiles\parquet-arrow-test.dir\Unity\unity_0_cxx.cxx.obj
ninja: build stopped: subcommand failed.

@nealrichardson (Member)

👍 we've reduced the failures to Flight (i.e. surely unrelated) issues.

@emkornfield emkornfield requested a review from pitrou September 17, 2020 16:24
@pitrou (Member) commented Sep 17, 2020

I'll take a look again on Monday, if that's ok with you.

@emkornfield (Contributor Author)

SGTM. If I have time I might get one or two CLs out based on this one, but I can rebase afterwards.

@pitrou (Member) commented Sep 21, 2020

Running parquet-level-conversion-benchmark again, the results are much less impressive:

BM_DefinitionLevelsToBitmapRepeatedAllMissing         878 ns          878 ns       816435 bytes_per_second=2.17265G/s
BM_DefinitionLevelsToBitmapRepeatedAllPresent         967 ns          966 ns       713185 bytes_per_second=1.9736G/s
BM_DefinitionLevelsToBitmapRepeatedMostPresent       3258 ns         3258 ns       214817 bytes_per_second=599.563M/s

@pitrou (Member) commented Sep 21, 2020

Things are a bit more balanced if the scalar version is used:

BM_DefinitionLevelsToBitmapRepeatedAllMissing         660 ns          660 ns      1019688 bytes_per_second=2.8884G/s
BM_DefinitionLevelsToBitmapRepeatedAllPresent        1946 ns         1946 ns       358335 bytes_per_second=1003.9M/s
BM_DefinitionLevelsToBitmapRepeatedMostPresent       2024 ns         2023 ns       346078 bytes_per_second=965.229M/s

@pitrou (Member) commented Sep 21, 2020

Just for the record, apart from FixedSizeList, is there anything remaining for full nested Parquet -> Arrow reading?

@pitrou (Member) commented Sep 21, 2020

Other than that, I see a ~20% improvement on BM_ReadStructColumn and BM_ReadListColumn

@pitrou (Member) commented Sep 21, 2020

I have no remaining concern over the code other than the AVX2 / BMI2 split. Congratulations for this PR, this is really a huge improvement!

That said, I seem to get a test error on the Python side (pasted below). Let's see if it reproduces on CI:

Traceback (most recent call last):
  File "/home/antoine/arrow/dev/python/pyarrow/tests/test_parquet.py", line 700, in test_pandas_can_write_nested_data
    _write_table(arrow_table, imos)
  File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/_pytest/python_api.py", line 747, in __exit__
    fail(self.message)
  File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/_pytest/outcomes.py", line 128, in fail
    raise Failed(msg=msg, pytrace=pytrace)
Failed: DID NOT RAISE <class 'ValueError'>

(beware: I rebased your branch on git master)

@emkornfield (Contributor Author)

I have no remaining concern over the code other than the AVX2 / BMI2 split. Congratulations for this PR, this is really a huge improvement!

@pitrou thank you for the thoughtful review. Let me know if you still have issues with the AVX2/BMI2 split after I added the comment (perhaps I didn't revert some compilation), or if my analysis is wrong. I think the BMI2 check will be difficult/impossible at compile time for Windows, so I'm not sure if it is worth the effort on Linux.

I also removed the failing test for parquet (which I should have removed in a prior PR; it's strange it showed up again).

@emkornfield (Contributor Author) commented Sep 21, 2020

Just for the record, apart from FixedSizeList, is there anything remaining for full nested Parquet -> Arrow reading?

We need to support LargeList and Map at the schema-inference level, which should be a smaller change (I'm working on a PR). There are a few other JIRAs still open about benchmarking and randomized testing. Past that, there are some open JIRAs about performance improvements:

  • Computing all offsets/bitmaps together (the JIRA is about the
    non-vectorized case). I would expect that for deeply nested structures
    containing lists this would start to show performance improvements.
  • Using the bitmap-based code that was removed from this PR. For non-list
    types I think it can be a big performance win (potentially another 20% on
    our benchmarks) on all platforms, and at least for shallowly nested lists
    I expect it to be better on native Intel.

There is also an unrelated bug on the write side #8219 which I asked for @wesm to review (it is based on some changes in this PR).

@pitrou (Member) commented Sep 21, 2020

Do we also need ad hoc nested tests as a separate JIRA / PR? Randomized testing is nice to find corner cases, but it's always easier to diagnose hand-written test cases :-)

@pitrou (Member) commented Sep 21, 2020

Also we only have one-level nested benchmarks for now, I suppose we should add a bit more (two-level nesting may be enough).

@emkornfield (Contributor Author)

Do we also need ad hoc nested tests as a separate JIRA / PR? Randomized testing is nice to find corner cases, but it's always easier to diagnose hand-written test cases :-)

I think the internal tests you wrote have pretty good coverage. After #8219 is merged I was planning on making some of the one-way tests (the ones I wrote for write and the ones you wrote for read) fully round-trip. If you think there are gaps, by all means we should add tests.

@emkornfield (Contributor Author)

Also we only have one-level nested benchmarks for now, I suppose we should add a bit more (two-level nesting may be enough).

My main concern is getting some data that reflects real workloads. @jorisvandenbossche it sounded like you had geo data with multiple levels of nesting; I wonder if there is a canonical dataset we could make use of for benchmarking.

@pitrou (Member) left a comment:

+1
