Skip to content

[C++] Investigate potential performance improvements for the filter node #30992

@asfimport

Description

@asfimport

Right now some early runs with Arrowbench and the OT PR (#12100) shows that we spend a fair amount of time in TPC-H queries on filter nodes. There are a few improvements we know could be made to our filtering approach at the moment. I'm creating this parent issue to help categorize and track those:

  • We can use a selection vector in our filters to reduce the amount of materialization needed. While long term we may want to support a selection vector throughout the exec plan a good start would be to use it when we encounter a chain of filters to avoid excess materialization (e.g. x < 10 && x > 5 && y < 20)
  • If a filter if very selective then we may end up outputting a lot of very small batches. We could probably hold onto the data at the filter node until we've accumulated enough rows for a decent sized batch.
  • The filter node is currently creating new thread tasks instead of appending its work onto an existing thread task.
  • If we have a chain of filters we could potentially use runtime selectivity statistics / estimates to reorder our filters so that the most selective filters are evaluated first.

Reporter: Weston Pace / @westonpace

Subtasks:

Related issues:

Note: This issue was originally created as ARROW-15519. Please see the migration documentation for further details.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions