TEST: enable pushdown_filters and reorder_filters by default #18873
|
🤖 |
|
🤖: Benchmark completed Details
|
|
I am also testing with just
I am going to focus my efforts on profiling the queries that seem to have slowed down the most:
Here is the query:
set datafusion.execution.parquet.binary_as_string = true
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
Basically my next steps are to profile these queries and see what is slower (and if it is related to filter representation, I will go focus on apache/arrow-rs#8902) |
Looks like we are very close! FYI, there are a couple more that are slower than query 24: |
|
I did some more analysis:
The idea is to isolate why filter pushdown is slowing down clickbench q24
See more details here #18873
This is after upgrading to arrow 57.1.0
The only difference in the two binaries is if filter pushdown is on by default:
-rwxr-xr-x@ 1 andrewlamb staff 81331152 Nov 23 07:31 datafusion-cli-alamb_upgrade_arrow_57.1.0
-rwxr-xr-x@ 1 andrewlamb staff 81331152 Nov 22 07:57 datafusion-cli-almab_pushdown_no_reorder
Using hits partitioned dataset
ln -s ~/Software/datafusion/benchmarks/data/hits_partitioned ./hits
Here is q24.sql
set datafusion.execution.parquet.binary_as_string = true;
-- turn on pushdown (is hard coded)
-- set datafusion.execution.parquet.pushdown_filters = true;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
You can see the pushdown is slightly slower
./datafusion-cli-almab_pushdown_no_reorder -f q24.sql | grep Elapsed
Elapsed 0.000 seconds.
Elapsed 0.183 seconds.
Elapsed 0.154 seconds.
Elapsed 0.155 seconds.
Elapsed 0.153 seconds.
Elapsed 0.154 seconds.
Elapsed 0.154 seconds.
Elapsed 0.150 seconds.
Elapsed 0.154 seconds.
Elapsed 0.156 seconds.
Elapsed 0.152 seconds.
./datafusion-cli-alamb_upgrade_arrow_57.1.0 -f q24.sql | grep Elapsed
Elapsed 0.002 seconds.
Elapsed 0.164 seconds.
Elapsed 0.137 seconds.
Elapsed 0.137 seconds.
Elapsed 0.133 seconds.
Elapsed 0.132 seconds.
Elapsed 0.135 seconds.
Elapsed 0.131 seconds.
Elapsed 0.137 seconds.
Elapsed 0.137 seconds.
Elapsed 0.133 seconds.
So let's profile what the pushdown one is doing
So more than 5% of the time is being spent converting filters back and forth. Thus, this gives me more motivation to keep working on |
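As a rough check on "slightly slower", here is a minimal sketch that averages the `Elapsed` values listed above and reports the relative slowdown. The numbers are copied from the two runs in this comment, leaving out the first `Elapsed` line of each run (the `set` statement). The helper names `mean` and `relative_slowdown` are made up for illustration and are not part of DataFusion.

```rust
/// Arithmetic mean of a slice of timings, in seconds.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Relative slowdown of `candidate` versus `baseline` as a fraction
/// (0.10 means the candidate is 10% slower on average).
fn relative_slowdown(baseline: &[f64], candidate: &[f64]) -> f64 {
    mean(candidate) / mean(baseline) - 1.0
}

// Timings from ./datafusion-cli-almab_pushdown_no_reorder (pushdown on).
const PUSHDOWN: [f64; 10] = [
    0.183, 0.154, 0.155, 0.153, 0.154, 0.154, 0.150, 0.154, 0.156, 0.152,
];

// Timings from ./datafusion-cli-alamb_upgrade_arrow_57.1.0 (pushdown off).
const NO_PUSHDOWN: [f64; 10] = [
    0.164, 0.137, 0.137, 0.133, 0.132, 0.135, 0.131, 0.137, 0.137, 0.133,
];

fn main() {
    println!("pushdown mean:    {:.4} s", mean(&PUSHDOWN));
    println!("no pushdown mean: {:.4} s", mean(&NO_PUSHDOWN));
    println!(
        "relative slowdown: {:.1}%",
        relative_slowdown(&NO_PUSHDOWN, &PUSHDOWN) * 100.0
    );
}
```

On these ten runs the gap works out to a bit under 14%. That is ten runs of one query on one machine, so it is an indication rather than a measurement with error bars.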
8a06d63 to
35f137c
|
run benchmarks |
|
show benchmark queue |
|
🤖 Hi @alamb, you asked to view the benchmark queue (#18873 (comment)).
|
|
show benchmark queue |
|
🤖 Hi @alamb, you asked to view the benchmark queue (#18873 (comment)).
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
@alamb I think that calls to |
|
|
show benchmark queue |
|
run benchmark tpch |
|
🤖 |
|
🤖 Hi @alamb, you asked to view the benchmark queue (#18873 (comment)).
|
|
🤖: Benchmark completed Details
|
thanks |
|
You piqued my interest in why this is slow. A couple of questions:
ideas:
|
Thanks @rluvaton
I think the sizes are typically the batch size (8192 rows).
The masks come from https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/trait.ArrowPredicate.html (which DataFusion provides).
I think the reason it is currently slower is that the BooleanArrays are always converted back to RowSelections, specifically in https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowSelection.html#method.from_filters
For patterns with many small selections, this is much worse and takes a lot of time.
This is basically what I am working on avoiding in apache/arrow-rs#8902
The ideas are good. I will try and incorporate them in apache/arrow-rs#8902 |
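To make the "many small selections" point concrete, here is a simplified, std-only model of turning a boolean mask into run-length select/skip runs. It is not the parquet crate's implementation: `Run` and `mask_to_runs` are hypothetical names, and `Run` only loosely mirrors the idea of a selector with a row count and a skip flag. The sketch only shows how the number of runs grows with how fragmented the mask is.

```rust
/// One run of consecutive rows that are either all selected or all skipped.
/// Hypothetical stand-in for illustration, not the parquet crate's type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Convert a boolean filter mask into run-length encoded select/skip runs.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &selected in mask {
        let skip = !selected;
        // Extend the current run if this row has the same select/skip state.
        if let Some(last) = runs.last_mut() {
            if last.skip == skip {
                last.row_count += 1;
                continue;
            }
        }
        // Otherwise start a new run of length one.
        runs.push(Run { row_count: 1, skip });
    }
    runs
}

fn main() {
    // Batch size taken from the comment above (8192 rows).
    const BATCH_SIZE: usize = 8192;

    // First half selected, second half skipped: one contiguous selection.
    let contiguous: Vec<bool> = (0..BATCH_SIZE).map(|i| i < BATCH_SIZE / 2).collect();
    // Every other row selected: the worst case for a run-length encoding.
    let alternating: Vec<bool> = (0..BATCH_SIZE).map(|i| i % 2 == 0).collect();

    println!("contiguous mask:  {} runs", mask_to_runs(&contiguous).len());
    println!("alternating mask: {} runs", mask_to_runs(&alternating).len());
}
```

Both masks select the same number of rows, but in this model the alternating mask needs one run per row while the contiguous mask needs two. That is the kind of pattern where converting every mask to a run-length selection adds the most overhead.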

(I am using this PR to test; I don't intend to merge it yet.)
Which issue does this PR close?
filter_pushdown) by default #3463
Rationale for this change
We have made non-trivial progress in filter representation in Parquet. Let's see where performance is now.
What changes are included in this PR?
arrow, parquet to 57.1.0 #18820
pushdown_filters and reorder_filters
Are these changes tested?
By CI tests
Are there any user-facing changes?