-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-8794: [C++] Expand performance coverage of parquet to arrow reading #7175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this!
| std::shared_ptr<::arrow::DataType> type = std::make_shared<ArrowType<ParquetType>>(); | ||
| NumericBuilder<ArrowType<ParquetType>> builder; | ||
| if (nullable) { | ||
| std::vector<uint8_t> valid_bytes(BENCHMARK_SIZE, 0); | ||
| int n = {0}; | ||
| std::generate(valid_bytes.begin(), valid_bytes.end(), [&n] { return n++ % 2; }); | ||
| if (null_percentage == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be kAlternatingOrNa?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
| int n = {0}; | ||
| std::generate(valid_bytes.begin(), valid_bytes.end(), | ||
| [&n] { return (n++ % 2) != 0; }); | ||
| if (null_percentage == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps nulls generation can be factored out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
| template <typename ParquetType> | ||
| std::shared_ptr<::arrow::Table> TableFromVector( | ||
| const std::vector<typename ParquetType::c_type>& vec, bool nullable) { | ||
| const std::vector<typename ParquetType::c_type>& vec, bool nullable, | ||
| int null_percentage = kAlternatingOrNa) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int64_t above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
| ->Args({/*null_percentage=*/10, /*first_value_percentage=*/50}) | ||
| ->Args({/*null_percentage=*/25, /*first_value_percentage=*/25}) | ||
| ->Args({/*null_percentage=*/30, /*first_value_percentage=*/25}) | ||
| ->Args({/*null_percentage=*/35, /*first_value_percentage=*/25}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need such a granularity in null_percentage values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not permanently, but this uncovered an interesting pattern for #7143
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adjusted these a little bit to have a little bit more consistency and bias towards runs.
|
|
||
| BENCHMARK_TEMPLATE2(BM_ReadColumn, false, DoubleType) | ||
| ->Args({kAlternatingOrNa, 0}) | ||
| ->Args({1, 20}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mistake.
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments. Thanks for the update!
| const std::vector<typename ParquetType::c_type>& vec, bool nullable, | ||
| int64_t null_percentage = kAlternatingOrNa) { | ||
| if (!nullable) { | ||
| DCHECK(null_percentage = kAlternatingOrNa); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ARROW_CHECK_EQ. There's an error in your statement. Also, DCHECKs are compiled out in non-debug mode, which is usually the case for benchmarks...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, good point.
| std::generate(valid_bytes.begin(), valid_bytes.end(), [&n] { return n++ % 2; }); | ||
| // Note true values select index 1 of sample_values | ||
| auto valid_bytes = RandomVector<uint8_t>(/*true_percengate=*/null_percentage, | ||
| BENCHMARK_SIZE, /*sample_values=*/{1, 0}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean the valid bitmap only contains bytes 0x00 and 0x01?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, the bitmap only has 0b00000001 and 0b00000000 as possible words, or more-or-less one bit every 8th position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that is true it is a confusing contract (maybe taking bool* would be better?) but I read this as converting 1 and 0 to corresponding bits (Under the covers if I traced correctly this calls ArrayBuilder::UnsafeAppendToBitmap which ultimately calls GenerateBitsUnrolled which coverts bytes to bits )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Pity.
| constexpr int64_t kAlternatingOrNa = -1; | ||
|
|
||
| template <typename T> | ||
| std::vector<T> RandomVector(int64_t true_percentage, int64_t vector_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this be factored out in testing/random.h?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll need to depend on libarrow_testing.so, not sure if this is a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is OK, I'd prefer to save this refactoring for a later point in time, in case more changes are need to this.
fsaintjacques
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you check a profile of the benchmark? I just noted that creating/opening the reader is in the benchmark loop. If the number of rows is small enough, deserializing the thrift metadata might hide what we're interested in benchmarking.
| constexpr int64_t kAlternatingOrNa = -1; | ||
|
|
||
| template <typename T> | ||
| std::vector<T> RandomVector(int64_t true_percentage, int64_t vector_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll need to depend on libarrow_testing.so, not sure if this is a problem.
| std::generate(valid_bytes.begin(), valid_bytes.end(), [&n] { return n++ % 2; }); | ||
| // Note true values select index 1 of sample_values | ||
| auto valid_bytes = RandomVector<uint8_t>(/*true_percengate=*/null_percentage, | ||
| BENCHMARK_SIZE, /*sample_values=*/{1, 0}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, the bitmap only has 0b00000001 and 0b00000000 as possible words, or more-or-less one bit every 8th position.
|
I just validated and the reader opening doesn't show up in profile much. On a more interesting note, |
I don't think I'm going to be doing much more performance related work past #7143 (which if you don't mind trying out it would be good to see if that improves performance on real world data). The last potential easy performance win is pushing the all null/no nulls remaining checks directly into the loops (for small batch sizes I wouldn't expect a huge difference there). My main goal is to get full nested functionality working, and I got a little distracted Other changes will probably require a bigger refactoring then I want to take on right now. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Thank you @emkornfield
ea913f7 to
e3fadad
Compare
|
Rebased. |
…ding Closes apache#7175 from emkornfield/ARROW-8794-benchmark Lead-authored-by: Micah Kornfield <emkornfield@gmail.com> Co-authored-by: emkornfield <emkornfield@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
No description provided.