-
Notifications
You must be signed in to change notification settings - Fork 4k
Open
Description
Right now some early runs with Arrowbench and the OT PR (#12100) shows that we spend a fair amount of time in TPC-H queries on filter nodes. There are a few improvements we know could be made to our filtering approach at the moment. I'm creating this parent issue to help categorize and track those:
We can use a selection vector in our filters to reduce the amount of materialization needed. While long term we may want to support a selection vector throughout the exec plan a good start would be to use it when we encounter a chain of filters to avoid excess materialization (e.g. x < 10 && x > 5 && y < 20)- If a filter if very selective then we may end up outputting a lot of very small batches. We could probably hold onto the data at the filter node until we've accumulated enough rows for a decent sized batch.
- The filter node is currently creating new thread tasks instead of appending its work onto an existing thread task.
- If we have a chain of filters we could potentially use runtime selectivity statistics / estimates to reorder our filters so that the most selective filters are evaluated first.
Reporter: Weston Pace / @westonpace
Subtasks:
- [C++] Investigate batching filter node output
- [C++] Investigate reporting filter selectivity for filter order optimization
Related issues:
Note: This issue was originally created as ARROW-15519. Please see the migration documentation for further details.