@alamb alamb commented Jul 27, 2023

Note: this looks like a large change, but it is mostly moving code around rather than any logic changes.

Which issue does this PR close?

Part of #7052

Rationale for this change

See #7052.

TLDR is that making benchmarks easier to run means more people will find them and run them :)

What changes are included in this PR?

  1. Combine / consolidate the parquet filter pushdown and sort benchmarks
  2. Update documentation
  3. Inline the help text into the tool

Like #7054, this PR maintains the old entrypoint (parquet) as well

So these two commands do the same thing (run the filter pushdown benchmark):

# New
cargo run  --bin dfbench -- parquet-filter --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp
# Old 
cargo run  --bin parquet filter --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp

Likewise for the sort benchmark:

# New
cargo run  --bin dfbench -- sort --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp
# Old
cargo run  --bin parquet sort --iterations=5 --partitions=1 --scale-factor=0.01 --path=/tmp
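
As a rough illustration of how an old entry point can be kept alongside a consolidated one, the sketch below routes both styles of subcommand to one shared function. It is not the code from this PR: it uses only the standard library, and the names `Benchmark`, `parse_new`, `parse_old`, and `run` are hypothetical.

```rust
// Hypothetical sketch: two entry points sharing one benchmark dispatcher.
// None of these names are taken from the PR itself.

#[derive(Debug, PartialEq)]
enum Benchmark {
    ParquetFilter,
    Sort,
}

/// Map a subcommand of the consolidated `dfbench` binary to a benchmark.
fn parse_new(subcommand: &str) -> Option<Benchmark> {
    match subcommand {
        "parquet-filter" => Some(Benchmark::ParquetFilter),
        "sort" => Some(Benchmark::Sort),
        _ => None,
    }
}

/// Map a subcommand of the older `parquet` binary to the same benchmarks.
fn parse_old(subcommand: &str) -> Option<Benchmark> {
    match subcommand {
        "filter" => Some(Benchmark::ParquetFilter),
        "sort" => Some(Benchmark::Sort),
        _ => None,
    }
}

/// Shared implementation that both entry points would call.
fn run(benchmark: &Benchmark) -> &'static str {
    match benchmark {
        Benchmark::ParquetFilter => "running filter pushdown benchmark",
        Benchmark::Sort => "running sort benchmark",
    }
}

fn main() {
    // `dfbench -- parquet-filter` and `parquet filter` resolve to the same variant.
    let new = parse_new("parquet-filter").expect("known subcommand");
    let old = parse_old("filter").expect("known subcommand");
    assert_eq!(new, old);
    println!("{}", run(&new));
}
```

Keeping the benchmark logic behind a single function like `run` means that the old binary stays a thin wrapper and cannot drift from the new one.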

The help output looks like this:

cargo run  --bin dfbench -- parquet-filter --help

dfbench-parquet-filter 28.0.0
Test performance of parquet filter pushdown

The queries are executed on a synthetic dataset generated during
the benchmark execution and designed to simulate web server access
logs.

Example

dfbench parquet-filter  --path ./data --scale-factor 1.0

generates the synthetic dataset at `./data/logs.parquet`. The size
of the dataset can be controlled through the `size_factor`
(with the default value of `1.0` generating a ~1GB parquet file).

For each filter we will run the query using different
`ParquetScanOption` settings.

Example output:

Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data",
batch_size: 8192, scale_factor: 1.0 }
Generated test dataset with 10699521 rows
Executing with filter 'request_method = Utf8("GET")'
Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
Iteration 0 returned 10699521 rows in 1303 ms
Iteration 1 returned 10699521 rows in 1288 ms
Iteration 2 returned 10699521 rows in 1266 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1970 ms
Iteration 1 returned 1781686 rows in 2002 ms
Iteration 2 returned 1781686 rows in 1988 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: false, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1940 ms
Iteration 1 returned 1781686 rows in 1986 ms
Iteration 2 returned 1781686 rows in 1947 ms
...
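
When comparing scan options, it can help to reduce the per-iteration lines above to a mean time. The following is a minimal sketch, assuming the line format shown in the example output (`Iteration N returned R rows in T ms`); the helpers `elapsed_ms` and `mean_ms` are hypothetical and are not part of the benchmark tool.

```rust
/// Extract the elapsed milliseconds from a line such as
/// "Iteration 0 returned 10699521 rows in 1303 ms".
/// Returns None when the line does not have that shape.
fn elapsed_ms(line: &str) -> Option<u64> {
    let words: Vec<&str> = line.split_whitespace().collect();
    match words.as_slice() {
        ["Iteration", _, "returned", _, "rows", "in", ms, "ms"] => ms.parse().ok(),
        _ => None,
    }
}

/// Mean elapsed time over all matching lines; non-matching lines are skipped.
/// Returns None when no line matches.
fn mean_ms(lines: &[&str]) -> Option<f64> {
    let times: Vec<u64> = lines.iter().filter_map(|l| elapsed_ms(l)).collect();
    if times.is_empty() {
        return None;
    }
    Some(times.iter().sum::<u64>() as f64 / times.len() as f64)
}

fn main() {
    // Lines copied from the first scan-option group in the example output.
    let lines = [
        "Iteration 0 returned 10699521 rows in 1303 ms",
        "Iteration 1 returned 10699521 rows in 1288 ms",
        "Iteration 2 returned 10699521 rows in 1266 ms",
    ];
    if let Some(mean) = mean_ms(&lines) {
        println!("mean: {:.1} ms", mean);
    }
}
```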

Are these changes tested?

I tested them manually, both alone and with bench.sh

Are there any user-facing changes?

No, this is a development tool

@alamb alamb marked this pull request as ready for review July 27, 2023 21:52
@alamb alamb marked this pull request as draft July 28, 2023 10:06
@alamb alamb marked this pull request as ready for review July 28, 2023 14:53

alamb commented Aug 14, 2023

Thanks @Dandandan 🙏

@alamb alamb merged commit 2ec0bc1 into apache:main Aug 14, 2023
@alamb alamb deleted the alamb/parquet_and_sort branch August 17, 2023 16:08