@emkornfield
Contributor

No description provided.

Member

@pitrou left a comment

Thanks for doing this!

std::shared_ptr<::arrow::DataType> type = std::make_shared<ArrowType<ParquetType>>();
NumericBuilder<ArrowType<ParquetType>> builder;
if (nullable) {
std::vector<uint8_t> valid_bytes(BENCHMARK_SIZE, 0);
int n = {0};
std::generate(valid_bytes.begin(), valid_bytes.end(), [&n] { return n++ % 2; });
if (null_percentage == -1) {
Member

Should this -1 be kAlternatingOrNa?

Contributor Author

yes

int n = {0};
std::generate(valid_bytes.begin(), valid_bytes.end(),
[&n] { return (n++ % 2) != 0; });
if (null_percentage == -1) {
Member

Perhaps the null generation can be factored out?

Contributor Author

done.

template <typename ParquetType>
std::shared_ptr<::arrow::Table> TableFromVector(
const std::vector<typename ParquetType::c_type>& vec, bool nullable) {
const std::vector<typename ParquetType::c_type>& vec, bool nullable,
int null_percentage = kAlternatingOrNa) {
Member

null_percentage should be int64_t in the line above.

Contributor Author

done.

->Args({/*null_percentage=*/10, /*first_value_percentage=*/50})
->Args({/*null_percentage=*/25, /*first_value_percentage=*/25})
->Args({/*null_percentage=*/30, /*first_value_percentage=*/25})
->Args({/*null_percentage=*/35, /*first_value_percentage=*/25})
Member

Do we need such fine granularity in null_percentage values?

Contributor Author

Maybe not permanently, but this uncovered an interesting pattern for #7143

Contributor Author

I adjusted these slightly for more consistency and a bias towards runs.


BENCHMARK_TEMPLATE2(BM_ReadColumn, false, DoubleType)
->Args({kAlternatingOrNa, 0})
->Args({1, 20});
Member

Why 1?

Contributor Author

mistake.

Member

@pitrou left a comment

Some more comments. Thanks for the update!

const std::vector<typename ParquetType::c_type>& vec, bool nullable,
int64_t null_percentage = kAlternatingOrNa) {
if (!nullable) {
DCHECK(null_percentage = kAlternatingOrNa);
Member

Use ARROW_CHECK_EQ here. There's an error in your statement: `=` assigns rather than compares. Also, DCHECKs are compiled out in non-debug mode, which is usually the case for benchmarks...

Contributor Author

thanks, good point.
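
As a rough illustration of the point raised in this thread, the C++ sketch below contrasts a check that stays active in release builds with a debug-only one, and shows why the `=` typo goes unnoticed. All names here (`SKETCH_CHECK_EQ`, `ValidateNullPercentage`, `DebugOnlyValidate`, `BuggyCheckPasses`) are hypothetical and chosen for illustration; the macro is a stand-in and not Arrow's ARROW_CHECK_EQ.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for an always-on equality check. This is NOT Arrow's
// ARROW_CHECK_EQ; the point is only that it stays active when NDEBUG is defined.
#define SKETCH_CHECK_EQ(lhs, rhs)                                   \
  do {                                                              \
    if (!((lhs) == (rhs))) {                                        \
      std::fprintf(stderr, "Check failed: %s == %s\n", #lhs, #rhs); \
      std::abort();                                                 \
    }                                                               \
  } while (false)

constexpr int64_t kAlternatingOrNa = -1;

// Always-on validation: aborts if a non-nullable column is given a real
// percentage. Returns the percentage unchanged otherwise.
int64_t ValidateNullPercentage(bool nullable, int64_t null_percentage) {
  if (!nullable) {
    SKETCH_CHECK_EQ(null_percentage, kAlternatingOrNa);
  }
  return null_percentage;
}

// Debug-only validation: with NDEBUG defined, assert() expands to nothing,
// so an optimized benchmark build would never run this check.
void DebugOnlyValidate(bool nullable, int64_t null_percentage) {
  assert(nullable || null_percentage == kAlternatingOrNa);
  (void)nullable;         // silence unused-parameter warnings under NDEBUG
  (void)null_percentage;
}

// Reproduces the typo: `value = kAlternatingOrNa` assigns -1 and then tests
// it for truthiness, so the "check" passes for every input.
bool BuggyCheckPasses(int64_t null_percentage) {
  int64_t value = null_percentage;
  const bool passes = static_cast<bool>(value = kAlternatingOrNa);
  return passes && value == kAlternatingOrNa;
}
```

Under these assumptions, `BuggyCheckPasses(50)` returns true even though 50 is not the sentinel, whereas `ValidateNullPercentage(false, 50)` would abort in any build mode.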

std::generate(valid_bytes.begin(), valid_bytes.end(), [&n] { return n++ % 2; });
// Note true values select index 1 of sample_values
auto valid_bytes = RandomVector<uint8_t>(/*true_percentage=*/null_percentage,
BENCHMARK_SIZE, /*sample_values=*/{1, 0});
Member

Does this mean the valid bitmap only contains bytes 0x00 and 0x01?

Contributor

Good catch, the bitmap only has 0b00000001 and 0b00000000 as possible words, or more-or-less one bit every 8th position.

Contributor Author

I do not think that is true. It is a confusing contract (maybe taking bool* would be better?), but I read this as converting 1 and 0 to the corresponding bits. Under the covers, if I traced correctly, this calls ArrayBuilder::UnsafeAppendToBitmap, which ultimately calls GenerateBitsUnrolled, which converts bytes to bits.

Member

Ah, I see. Pity.
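
The byte-to-bit conversion described in this thread can be pictured with the sketch below. It is a simplified, hypothetical helper (`PackValidBytes` is not an Arrow function and is not how GenerateBitsUnrolled is written); it only assumes the usual least-significant-bit-first layout of Arrow validity bitmaps, with one input byte per value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not Arrow's implementation: takes one validity byte per
// value (0 = null, nonzero = valid) and packs it into an LSB-first bitmap that
// holds one bit per value.
std::vector<uint8_t> PackValidBytes(const std::vector<uint8_t>& valid_bytes) {
  std::vector<uint8_t> bitmap((valid_bytes.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < valid_bytes.size(); ++i) {
    if (valid_bytes[i] != 0) {
      // Value i lands in byte i / 8, at bit position i % 8.
      bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return bitmap;
}
```

With this reading, ten input bytes `{1,0,1,0,1,0,1,0,1,1}` become two bitmap bytes, so the resulting bitmap is not restricted to the values 0x00 and 0x01.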

constexpr int64_t kAlternatingOrNa = -1;

template <typename T>
std::vector<T> RandomVector(int64_t true_percentage, int64_t vector_size,
Member

Can't this be factored out into testing/random.h?

Contributor

It would then need to depend on libarrow_testing.so; not sure whether that is a problem.

Contributor Author

If that is OK, I'd prefer to save this refactoring for later, in case more changes are needed here.
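
For readers following along, here is one way a helper with this shape could work. It is a sketch under assumptions, not the code from this PR: only the parameter names `true_percentage`, `vector_size`, and `sample_values`, plus the rule that a true draw selects index 1, come from the excerpts above. The function name `RandomVectorSketch`, the `std::array` parameter type, the `seed` argument, and the use of `std::bernoulli_distribution` are all illustrative choices.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Sketch only. Each element is sample_values[1] with probability
// true_percentage / 100, otherwise sample_values[0].
// Assumes 0 <= true_percentage <= 100; a sentinel such as -1 must be handled
// by the caller before getting here.
template <typename T>
std::vector<T> RandomVectorSketch(int64_t true_percentage, int64_t vector_size,
                                  const std::array<T, 2>& sample_values,
                                  uint32_t seed = 500) {
  std::vector<T> values(static_cast<std::size_t>(vector_size));
  std::mt19937 engine(seed);  // fixed seed keeps benchmark inputs reproducible
  std::bernoulli_distribution pick_true(static_cast<double>(true_percentage) /
                                        100.0);
  for (auto& value : values) {
    // A "true" draw selects index 1 of sample_values, a "false" draw index 0.
    value = sample_values[pick_true(engine) ? 1 : 0];
  }
  return values;
}
```

Called as `RandomVectorSketch<uint8_t>(null_percentage, size, {1, 0})`, a true draw yields 0 (a null) and a false draw yields 1 (a valid value), which matches the "true values select index 1" note in the excerpt.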

Contributor

@fsaintjacques left a comment

Did you check a profile of the benchmark? I just noticed that creating and opening the reader happens inside the benchmark loop. If the number of rows is small enough, deserializing the Thrift metadata might hide what we're interested in benchmarking.

@fsaintjacques
Contributor

I just checked, and opening the reader barely shows up in the profile. On a more interesting note, BM_ReadColumn<true,Int32Type> closely reflects the profile I get with a real-life dataset (the NYC taxi dataset), in case this can guide you in further performance validation.

@emkornfield
Contributor Author

> BM_ReadColumn<true,Int32Type> closely reflects the profile I get with a real-life dataset (the NYC taxi dataset), in case this can guide you in further performance validation.

I don't think I'm going to do much more performance-related work past #7143 (if you don't mind trying it out, it would be good to see whether it improves performance on real-world data). The last potential easy performance win is pushing the all-null/no-nulls-remaining checks directly into the loops (for small batch sizes I wouldn't expect a huge difference there). My main goal is to get full nested functionality working, and I got a little distracted.

Other changes will probably require a bigger refactoring than I want to take on right now.

Member

@pitrou left a comment

+1. Thank you @emkornfield

@pitrou force-pushed the ARROW-8794-benchmark branch from ea913f7 to e3fadad on May 20, 2020 12:45
@pitrou
Member

pitrou commented May 20, 2020

Rebased.

@pitrou pitrou closed this in a70b4a0 May 20, 2020
pprudhvi pushed a commit to pprudhvi/arrow that referenced this pull request May 26, 2020
…ding

Closes apache#7175 from emkornfield/ARROW-8794-benchmark

Lead-authored-by: Micah Kornfield <emkornfield@gmail.com>
Co-authored-by: emkornfield <emkornfield@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>