ARROW-8794: [C++] Expand performance coverage of parquet to arrow reading #7175

emkornfield · 2020-05-14T04:34:48Z

No description provided.

github-actions · 2020-05-14T04:46:36Z

https://issues.apache.org/jira/browse/ARROW-8794

pitrou

Thanks for doing this!

pitrou · 2020-05-14T12:50:50Z

cpp/src/parquet/arrow/reader_writer_benchmark.cc

  std::shared_ptr<::arrow::DataType> type = std::make_shared<ArrowType<ParquetType>>();
  NumericBuilder<ArrowType<ParquetType>> builder;
  if (nullable) {
    std::vector<uint8_t> valid_bytes(BENCHMARK_SIZE, 0);
-    int n = {0};
-    std::generate(valid_bytes.begin(), valid_bytes.end(), [&n] { return n++ % 2; });
+    if (null_percentage == -1) {


Should this be kAlternatingOrNa?

pitrou · 2020-05-14T12:51:44Z

cpp/src/parquet/arrow/reader_writer_benchmark.cc

-    int n = {0};
-    std::generate(valid_bytes.begin(), valid_bytes.end(),
-                  [&n] { return (n++ % 2) != 0; });
+    if (null_percentage == -1) {


Perhaps nulls generation can be factored out?
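One way to read the suggestion above is a single helper that both call sites share. The following is a minimal sketch under stated assumptions, not the code that was merged: `MakeValidBytes` and its `seed` parameter are hypothetical names, and only `kAlternatingOrNa` and the alternating 0, 1, 0, 1 pattern come from the diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Sentinel used in the PR: "alternate valid/null" for nullable columns,
// "not applicable" otherwise.
constexpr int64_t kAlternatingOrNa = -1;

// Hypothetical helper (not from the PR): returns one validity byte per value,
// where 1 means "valid" and 0 means "null".
std::vector<uint8_t> MakeValidBytes(int64_t size, int64_t null_percentage,
                                    uint32_t seed = 42) {
  std::vector<uint8_t> valid_bytes(static_cast<size_t>(size), 0);
  if (null_percentage == kAlternatingOrNa) {
    // Same pattern as the removed std::generate call: 0, 1, 0, 1, ...
    for (size_t i = 0; i < valid_bytes.size(); ++i) {
      valid_bytes[i] = static_cast<uint8_t>(i % 2);
    }
    return valid_bytes;
  }
  // Otherwise each value is null with probability null_percentage / 100.
  std::mt19937 rng(seed);
  std::bernoulli_distribution is_null(static_cast<double>(null_percentage) / 100.0);
  for (auto& v : valid_bytes) {
    v = is_null(rng) ? 0 : 1;
  }
  return valid_bytes;
}
```

With a helper of this shape, both benchmark helpers could call `MakeValidBytes(BENCHMARK_SIZE, null_percentage)` instead of repeating the branch on the sentinel.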

pitrou · 2020-05-14T12:51:58Z

cpp/src/parquet/arrow/reader_writer_benchmark.cc

 template <typename ParquetType>
 std::shared_ptr<::arrow::Table> TableFromVector(
-    const std::vector<typename ParquetType::c_type>& vec, bool nullable) {
+    const std::vector<typename ParquetType::c_type>& vec, bool nullable,
+    int null_percentage = kAlternatingOrNa) {


int64_t above

pitrou · 2020-05-14T12:54:54Z

cpp/src/parquet/arrow/reader_writer_benchmark.cc

+    ->Args({/*null_percentage=*/10, /*first_value_percentage=*/50})
+    ->Args({/*null_percentage=*/25, /*first_value_percentage=*/25})
+    ->Args({/*null_percentage=*/30, /*first_value_percentage=*/25})
+    ->Args({/*null_percentage=*/35, /*first_value_percentage=*/25})


Do we need such a granularity in null_percentage values?

Maybe not permanently, but this uncovered an interesting pattern for #7143

I adjusted these slightly for more consistency and a bias towards runs.
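To make "bias towards runs" concrete, assuming it refers to stretches of consecutive valid or consecutive null values, the number of runs in a validity vector can be counted as below. `CountRuns` is a hypothetical helper for illustration and is not part of the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: counts maximal runs of identical validity bytes.
// Fewer runs for the same length means longer runs on average.
int64_t CountRuns(const std::vector<uint8_t>& valid_bytes) {
  if (valid_bytes.empty()) return 0;
  int64_t runs = 1;
  for (size_t i = 1; i < valid_bytes.size(); ++i) {
    if (valid_bytes[i] != valid_bytes[i - 1]) ++runs;
  }
  return runs;
}
```

The alternating pattern is the extreme case: every value is its own run, so a vector of N values has N runs.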

pitrou · 2020-05-14T12:55:44Z

cpp/src/parquet/arrow/reader_writer_benchmark.cc

+
+BENCHMARK_TEMPLATE2(BM_ReadColumn, false, DoubleType)
+    ->Args({kAlternatingOrNa, 0})
+    ->Args({1, 20});


pitrou

Some more comments. Thanks for the update!

pitrou · 2020-05-18T12:32:23Z

cpp/src/parquet/arrow/reader_writer_benchmark.cc

+    const std::vector<typename ParquetType::c_type>& vec, bool nullable,
+    int64_t null_percentage = kAlternatingOrNa) {
+  if (!nullable) {
+    DCHECK(null_percentage = kAlternatingOrNa);


Use ARROW_CHECK_EQ. There's an error in your statement (= is an assignment, not a comparison). Also, DCHECKs are compiled out in non-debug mode, which is usually the case for benchmarks...

thanks, good point.
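For context on the two problems raised here: `null_percentage = kAlternatingOrNa` assigns -1, which is nonzero, so a debug check on that expression would be expected to pass regardless of the argument, and in a release build a debug-only check does nothing at all. The sketch below uses a hypothetical always-on macro, `SKETCH_CHECK_EQ`, as a stand-in; it is not Arrow's `ARROW_CHECK_EQ`, and it throws rather than aborts so the behaviour is easy to observe.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>

constexpr int64_t kAlternatingOrNa = -1;

// Hypothetical always-on equality check (a stand-in, not Arrow's macro).
// It stays active in release builds because it does not depend on NDEBUG.
template <typename A, typename B>
void SketchCheckEq(const A& lhs, const B& rhs, const char* expr) {
  if (!(lhs == rhs)) {
    std::ostringstream msg;
    msg << "Check failed: " << expr << " (" << lhs << " vs " << rhs << ")";
    throw std::runtime_error(msg.str());
  }
}
#define SKETCH_CHECK_EQ(lhs, rhs) SketchCheckEq((lhs), (rhs), #lhs " == " #rhs)

// Mirrors the shape of the reviewed code: a non-nullable column must be
// given the "not applicable" sentinel.
void ValidateArgs(bool nullable, int64_t null_percentage) {
  if (!nullable) {
    // A two-argument macro cannot silently turn into an assignment.
    SKETCH_CHECK_EQ(null_percentage, kAlternatingOrNa);
  }
}
```

Calling `ValidateArgs(false, 10)` throws, while `ValidateArgs(false, kAlternatingOrNa)` and `ValidateArgs(true, 10)` return normally.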

pitrou · 2020-05-18T12:34:17Z

cpp/src/parquet/arrow/reader_writer_benchmark.cc

-    std::generate(valid_bytes.begin(), valid_bytes.end(), [&n] { return n++ % 2; });
+    // Note true values select index 1 of sample_values
+    auto valid_bytes = RandomVector<uint8_t>(/*true_percengate=*/null_percentage,
+                                             BENCHMARK_SIZE, /*sample_values=*/{1, 0});


Does this mean the valid bitmap only contains bytes 0x00 and 0x01?

Good catch, the bitmap only has 0b00000001 and 0b00000000 as possible words, or more-or-less one bit every 8th position.

I do not think that is true. It is a confusing contract (maybe taking bool* would be better?), but I read this as converting each 1 and 0 byte to the corresponding bit. Under the covers, if I traced correctly, this calls ArrayBuilder::UnsafeAppendToBitmap, which ultimately calls GenerateBitsUnrolled, which converts bytes to bits.

Ah, I see. Pity.
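The conversion described in this exchange can be illustrated with a small stand-alone sketch. `PackBytesToBits` is a hypothetical stand-in and not Arrow's implementation; the only property it borrows from Arrow is that validity bitmaps are packed least-significant bit first.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the byte-to-bit conversion: each nonzero input
// byte sets exactly one bit in the output, packed least-significant bit first.
std::vector<uint8_t> PackBytesToBits(const std::vector<uint8_t>& valid_bytes) {
  std::vector<uint8_t> bitmap((valid_bytes.size() + 7) / 8, 0);
  for (size_t i = 0; i < valid_bytes.size(); ++i) {
    if (valid_bytes[i] != 0) {
      bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return bitmap;
}
```

Under this reading, eight input bytes `{1, 0, 1, 0, 1, 0, 1, 0}` become the single bitmap byte `0x55`, so the resulting bitmap is not limited to the words `0x00` and `0x01`.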

pitrou · 2020-05-18T12:37:35Z

cpp/src/parquet/arrow/reader_writer_benchmark.cc

+constexpr int64_t kAlternatingOrNa = -1;
+
+template <typename T>
+std::vector<T> RandomVector(int64_t true_percentage, int64_t vector_size,


Can't this be factored out in testing/random.h?

It'll need to depend on libarrow_testing.so, not sure if this is a problem.

If it is OK, I'd prefer to save this refactoring for later, in case more changes are needed here.

fsaintjacques

Did you check a profile of the benchmark? I just noted that creating/opening the reader is in the benchmark loop. If the number of rows is small enough, deserializing the thrift metadata might hide what we're interested in benchmarking.

fsaintjacques · 2020-05-19T00:12:14Z

cpp/src/parquet/arrow/reader_writer_benchmark.cc

+constexpr int64_t kAlternatingOrNa = -1;
+
+template <typename T>
+std::vector<T> RandomVector(int64_t true_percentage, int64_t vector_size,


It'll need to depend on libarrow_testing.so, not sure if this is a problem.

fsaintjacques · 2020-05-19T00:15:05Z

cpp/src/parquet/arrow/reader_writer_benchmark.cc

-    std::generate(valid_bytes.begin(), valid_bytes.end(), [&n] { return n++ % 2; });
+    // Note true values select index 1 of sample_values
+    auto valid_bytes = RandomVector<uint8_t>(/*true_percengate=*/null_percentage,
+                                             BENCHMARK_SIZE, /*sample_values=*/{1, 0});


Good catch, the bitmap only has 0b00000001 and 0b00000000 as possible words, or more-or-less one bit every 8th position.

fsaintjacques · 2020-05-19T00:49:12Z

I just validated, and opening the reader barely shows up in the profile. On a more interesting note, the profile of BM_ReadColumn<true,Int32Type> closely matches the one I get with a real-life dataset (the NYC taxi dataset), in case this can guide you in further performance validation.

emkornfield · 2020-05-19T04:17:21Z

the profile of BM_ReadColumn<true,Int32Type> closely matches the one I get with a real-life dataset (the NYC taxi dataset), in case this can guide you in further performance validation.

I don't think I'm going to be doing much more performance-related work past #7143 (if you don't mind trying it out, it would be good to see whether it improves performance on real-world data). The last potential easy performance win is pushing the all-null/no-nulls-remaining checks directly into the loops (for small batch sizes I wouldn't expect a huge difference there). My main goal is to get full nested functionality working, and I got a little distracted.

Other changes will probably require a bigger refactoring than I want to take on right now.

pitrou

+1. Thank you @emkornfield

pitrou · 2020-05-20T12:45:41Z

Rebased.

…ding

Closes apache#7175 from emkornfield/ARROW-8794-benchmark

Lead-authored-by: Micah Kornfield <emkornfield@gmail.com>
Co-authored-by: emkornfield <emkornfield@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

pitrou reviewed May 14, 2020

emkornfield requested a review from pitrou May 16, 2020 06:33

emkornfield mentioned this pull request May 17, 2020

ARROW-8504: [C++] Add BitRunReader and use it in parquet #7143

Closed

4 tasks

pitrou reviewed May 18, 2020

fsaintjacques reviewed May 19, 2020

pitrou reviewed May 19, 2020

pitrou approved these changes May 20, 2020

emkornfield added 6 commits May 20, 2020 14:45

expand performance test coverage

da0b1a8

add more points to int64

6c62950

remove range for boolean

0e6aa53

address review comments

fc009ec

ARROW_CHECK_EQ

7f9c6dd

fix typo

e3fadad

pitrou force-pushed the ARROW-8794-benchmark branch from ea913f7 to e3fadad Compare May 20, 2020 12:45

pitrou closed this in a70b4a0 May 20, 2020

asfimport mentioned this pull request May 20, 2020

[C++] Expand benchmark coverage for arrow from parquet reading #24939

Closed
