Skip to content

[EPIC] Support Exprs as equality join predicates (constants) rather than just Columns in physical joins #9056

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

TLDR is that it would be nice to more easily use DataFusion as an execution engine for engines like Spark (e.g. the comet project apache/datafusion-comet#1), where operators directly take general expressions as join keys.

As @viirya said on #8991:

Currently the join keys of join operators like SortMergeJoin are restricted to be Column. But it is commonly we use expressions (e.g., l_col + 1 = r_col + 2) other than simply columns as join keys. From the query plan, DataFusion seems to add additional Project under join operator which projects the expressions into columns. So the above join operators take join keys as columns.

However, in other query engines, e.g., Spark, its query plan doesn't have the additional projection but its join operators directly take general expressions as join keys. (note that by adding additional projection before join in Spark it means more data to be shuffled/sorted which can be bad for performance)

That means if we cannot delegate such join operators to DataFusion physical join operators which require join keys must be columns.

This patch tries to relax this join keys constraint of physical join operators. So we can construct DataFusion physical join operator using general expressions as join keys.

This patch doesn't change how DataFusion plans the join operators. I.e., DataFusion still plans a join operation using non-column join keys into projection + join operator with columns. (We probably can remove this additional projection later if it also adds additional cost to DataFusion. Currently I'm not sure if/how DataFusion plans partitioning for the join operators.)

Describe the solution you'd like

No response

Describe alternatives you've considered

We could potentially still require joins to take only columns, and require other parts of the system to insert ProjectionExecs

In fact it seems like It seems like substrait represents equi joins as columns (not expressions): https://substrait.io/relations/physical_relations/#hash-equijoin-properties

It might be possible for engines like comet to insert ProjectionExec in the appropriate places in the plan using an optimizer pass

For example,

HashJoinExec(exprs=(l_col + 1, r_col + 2))

Could be rewritten to

HashJoinExec(exprs=(x, y))
  ProjectionExec(exprs=[l_col + 1 as "x", r_col + 2 as "y"])

Known Tasks

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions