Add parquet-filter and sort benchmarks to dfbench #7120

alamb · 2023-07-27T21:39:17Z

Note: this looks like a large change, but it is mostly moving code around rather than changing any logic.

Which issue does this PR close?

Part of #7052

Rationale for this change

see #7052

TLDR is that making benchmarks easier to run means more people will find them and run them :)

What changes are included in this PR?

Combine / consolidate the parquet filter pushdown and sort benchmarks
Update documentation
Inline the help text into the tool

Like #7054, this PR maintains the old entrypoint (parquet) as well

So these two commands do the same thing (run the filter pushdown benchmark):

# New
cargo run  --bin dfbench -- parquet-filter --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp
# Old 
cargo run  --bin parquet filter --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp

Likewise for the sort benchmark:

# New
cargo run  --bin dfbench -- sort --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp
# Old
cargo run  --bin parquet sort --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp
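
The equivalence between the old and new entrypoints can be pictured as a small lookup from (binary, subcommand) to a benchmark. The sketch below is hypothetical and uses only the standard library: the `Benchmark` enum and the `resolve` function are names invented for illustration, and it does not reproduce how dfbench actually parses its arguments.

```rust
/// Benchmarks named in this PR description (hypothetical enum, for illustration only).
#[derive(Debug, PartialEq)]
enum Benchmark {
    ParquetFilter,
    Sort,
}

/// Map a binary name and its first argument to a benchmark.
/// `binary` is "dfbench" (new entrypoint) or "parquet" (old entrypoint).
/// Returns `None` for any combination not listed in the PR description.
fn resolve(binary: &str, subcommand: &str) -> Option<Benchmark> {
    match (binary, subcommand) {
        // New and old spellings of the filter pushdown benchmark.
        ("dfbench", "parquet-filter") | ("parquet", "filter") => Some(Benchmark::ParquetFilter),
        // New and old spellings of the sort benchmark.
        ("dfbench", "sort") | ("parquet", "sort") => Some(Benchmark::Sort),
        _ => None,
    }
}
```

With this mapping, `resolve("dfbench", "parquet-filter")` and `resolve("parquet", "filter")` return the same value, which mirrors the "these two commands do the same thing" statement above.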

The inlined help text looks like this:

cargo run  --bin dfbench -- parquet-filter --help

dfbench-parquet-filter 28.0.0
Test performance of parquet filter pushdown

The queries are executed on a synthetic dataset generated during
the benchmark execution and designed to simulate web server access
logs.

Example

dfbench parquet-filter  --path ./data --scale-factor 1.0

generates the synthetic dataset at `./data/logs.parquet`. The size
of the dataset can be controlled through the `size_factor`
(with the default value of `1.0` generating a ~1GB parquet file).

For each filter we will run the query using different
`ParquetScanOption` settings.

Example output:

Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data",
batch_size: 8192, scale_factor: 1.0 }
Generated test dataset with 10699521 rows
Executing with filter 'request_method = Utf8("GET")'
Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
Iteration 0 returned 10699521 rows in 1303 ms
Iteration 1 returned 10699521 rows in 1288 ms
Iteration 2 returned 10699521 rows in 1266 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1970 ms
Iteration 1 returned 1781686 rows in 2002 ms
Iteration 2 returned 1781686 rows in 1988 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: false, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1940 ms
Iteration 1 returned 1781686 rows in 1986 ms
Iteration 2 returned 1781686 rows in 1947 ms
...
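
To compare the scan option settings, it can help to average the per-iteration timings in the example output. The following is a minimal sketch that assumes the exact line shapes shown above ("Using scan options ..." and "Iteration N returned R rows in T ms"). The `Group` struct and `parse_groups` function are hypothetical helpers, not part of dfbench.

```rust
/// Timings collected for one scan option setting (hypothetical helper type).
#[derive(Debug)]
struct Group {
    /// Text following "Using scan options ", kept verbatim.
    options: String,
    /// Row counts reported by each iteration.
    rows: Vec<u64>,
    /// Elapsed time reported by each iteration, in milliseconds.
    millis: Vec<u64>,
}

impl Group {
    /// Mean elapsed time in milliseconds, or `None` if no iterations were parsed.
    fn mean_ms(&self) -> Option<f64> {
        if self.millis.is_empty() {
            return None;
        }
        Some(self.millis.iter().sum::<u64>() as f64 / self.millis.len() as f64)
    }
}

/// Group "Iteration ..." lines under the most recent "Using scan options ..." line.
/// Lines that match neither shape are ignored.
fn parse_groups(output: &str) -> Vec<Group> {
    let mut groups: Vec<Group> = Vec::new();
    for line in output.lines() {
        let line = line.trim();
        if let Some(options) = line.strip_prefix("Using scan options ") {
            groups.push(Group {
                options: options.to_string(),
                rows: Vec::new(),
                millis: Vec::new(),
            });
        } else if line.starts_with("Iteration ") {
            // Expected shape: Iteration <n> returned <rows> rows in <ms> ms
            let words: Vec<&str> = line.split_whitespace().collect();
            if words.len() == 8 {
                if let (Ok(rows), Ok(ms), Some(group)) = (
                    words[3].parse::<u64>(),
                    words[6].parse::<u64>(),
                    groups.last_mut(),
                ) {
                    group.rows.push(rows);
                    group.millis.push(ms);
                }
            }
        }
    }
    groups
}
```

Applied to the example output above, this yields three groups whose mean times are roughly 1286 ms, 1987 ms, and 1958 ms. These figures describe this one sample run only.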

Are these changes tested?

I tested them manually, both alone and with bench.sh

Are there any user-facing changes?

No, this is a development tool

alamb · 2023-08-14T10:28:16Z

Thanks @Dandandan 🙏

alamb added 2 commits July 27, 2023 17:32

Add parquet-filter and sort benchmarks to dfbench

6fff778

fix

339b1cb

alamb marked this pull request as ready for review July 27, 2023 21:52

alamb marked this pull request as draft July 28, 2023 10:06

alamb added 2 commits July 28, 2023 06:07

fix docs

221e3e5

fix ci bench

7ff4f9a

alamb marked this pull request as ready for review July 28, 2023 14:53

Merge remote-tracking branch 'apache/main' into alamb/parquet_and_sort

8b096c2

alamb mentioned this pull request Aug 5, 2023

Add H2O.ai Database-like Ops benchmark to dfbench #7209

Closed

Update docs

4aea3a7

Dandandan approved these changes Aug 14, 2023

Merge remote-tracking branch 'apache/main' into alamb/parquet_and_sort

c524f9f

alamb merged commit 2ec0bc1 into apache:main Aug 14, 2023

alamb deleted the alamb/parquet_and_sort branch August 17, 2023 16:08
