Merged
4 changes: 3 additions & 1 deletion datafusion/core/src/physical_optimizer/join_selection.rs
@@ -157,7 +157,9 @@ fn swap_join_projection(
}

/// This function swaps the inputs of the given join operator.
Contributor

Suggested change
- /// This function swaps the inputs of the given join operator.
+ /// This function swaps the inputs of the given join operator, to construct `HashJoinExec` with right side as the build side

-fn swap_hash_join(
+/// This function is public so other downstream projects can use it
Contributor

Suggested change
- /// This function is public so other downstream projects can use it

Member Author

Without an explicit comment, someone could easily remove it from the public API by mistake.

Contributor

I agree, but doesn't the pub modifier itself already imply that third-party users may call the method? 🤔

+/// to construct `HashJoinExec` with right side as the build side.
Contributor

Suggested change
- /// to construct `HashJoinExec` with right side as the build side.

Member

I think there is a benefit in keeping the documentation that explains why this function is public.

+pub fn swap_hash_join(
Member Author

Spark's HashJoin operator supports building on the right side internally. During planning, Spark decides which side is the build side.

DataFusion's HashJoin planning takes a different approach. The DataFusion HashJoin operator only supports building on the left side. During optimization, if DataFusion finds that building on the right side is better, it calls this function to swap the HashJoin inputs, which achieves the same effect as building on the right.

So, for Comet to use DataFusion's HashJoin to support Spark's build-right HashJoin, we only need this function to be public so that Comet can call it directly.
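
To make the swap described above concrete, here is a minimal sketch using toy types. `ToyHashJoin`, its fields, and the four-variant `JoinType` are assumptions made for illustration and are not DataFusion's real `HashJoinExec` or `JoinType`; the helper names merely echo the ones mentioned in this thread. The idea is that a join which always builds from its left input can get a build-right effect by exchanging its inputs, flipping the join type, and reordering the output columns back to the original layout.

```rust
// Toy model only: none of these types are DataFusion's real API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinType {
    Inner,
    Left,
    Right,
    Full,
}

/// A simplified hash join that always builds its hash table from `left`.
/// Inputs are represented only by their column names.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyHashJoin {
    pub left: Vec<String>,  // build side
    pub right: Vec<String>, // probe side
    pub join_type: JoinType,
}

impl ToyHashJoin {
    /// Output columns are the left columns followed by the right columns.
    pub fn output_columns(&self) -> Vec<String> {
        self.left.iter().chain(self.right.iter()).cloned().collect()
    }
}

/// Flip the join type so the swapped join keeps the original semantics.
pub fn swap_join_type(join_type: JoinType) -> JoinType {
    match join_type {
        JoinType::Inner => JoinType::Inner,
        JoinType::Left => JoinType::Right,
        JoinType::Right => JoinType::Left,
        JoinType::Full => JoinType::Full,
    }
}

/// Swap the inputs so the original right side becomes the build side.
/// Also returns a projection (indices into the swapped output) that
/// restores the original column order.
pub fn swap_hash_join(join: &ToyHashJoin) -> (ToyHashJoin, Vec<usize>) {
    let swapped = ToyHashJoin {
        left: join.right.clone(),
        right: join.left.clone(),
        join_type: swap_join_type(join.join_type),
    };
    let (l, r) = (join.left.len(), join.right.len());
    // Swapped output is [right cols, left cols]; pick the left cols first.
    let projection: Vec<usize> = (r..r + l).chain(0..r).collect();
    (swapped, projection)
}
```

Applying the returned projection to the swapped join's output columns yields the same column order as the original join, which is why callers downstream of the swap do not need to change.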

Contributor

@alamb May 29, 2024

Given that this function takes &HashJoinExec as input, it might be easier to find, and make more sense, as a method on HashJoinExec itself 🤔

So maybe we could update the code so that this function becomes HashJoinExec::swap_inputs or something similar.
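
A rough sketch of the two API shapes being compared, using a hypothetical `ToyHashJoinExec` stand-in (its fields are assumptions, not the real `HashJoinExec` layout). The method form is discoverable from the type itself, while the free function has to be found in whichever module hosts it.

```rust
// Hypothetical stand-in for `HashJoinExec`; the fields are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyHashJoinExec {
    pub build_side: String,
    pub probe_side: String,
}

impl ToyHashJoinExec {
    /// Method form: shows up in the type's own documentation.
    pub fn swap_inputs(&self) -> ToyHashJoinExec {
        ToyHashJoinExec {
            build_side: self.probe_side.clone(),
            probe_side: self.build_side.clone(),
        }
    }
}

/// Free-function form: same behavior, but it lives outside the type.
pub fn swap_hash_join(join: &ToyHashJoinExec) -> ToyHashJoinExec {
    join.swap_inputs()
}
```

Both forms do the same work here; the difference is only where a downstream user would look for the entry point.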

Contributor

+1

Member Author

Sounds good. 👍
Let me update it accordingly.

Member Author

Hmm, some of the functions used in swap_hash_join are also used by other functions, such as swap_nl_join, so I cannot simply move them into HashJoinExec as private functions.

The only thing I can do is move these functions to joins/utils.rs in the physical-plan crate and make them public APIs, so that both join_selection.rs and HashJoinExec can use them.

That actually makes more APIs public, including swap_join_filter, swap_reverting_projection and swap_join_type.

I don't think it is worth making more APIs public just for swap_hash_join.

The current approach seems better?

WDYT?
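
The visibility trade-off can be sketched as follows. Modules stand in for crates here, and every name is a toy assumption rather than DataFusion's real layout. Within a single crate, `pub(crate)` could keep shared helpers hidden, but once the callers live in different crates, as the comment above describes, the shared helper has to be fully `pub`.

```rust
// Toy layout: each module stands in for a file in a separate crate.

/// Stand-in for a shared utilities file such as `joins/utils.rs`.
pub mod join_utils {
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum JoinType {
        Inner,
        Left,
        Right,
        Full,
    }

    /// Shared helper. It must be `pub` so that both callers below can
    /// reach it, which is what enlarges the public API surface.
    pub fn swap_join_type(join_type: JoinType) -> JoinType {
        match join_type {
            JoinType::Inner => JoinType::Inner,
            JoinType::Left => JoinType::Right,
            JoinType::Right => JoinType::Left,
            JoinType::Full => JoinType::Full,
        }
    }
}

/// Stand-in for the hash join operator's module.
pub mod hash_join {
    use super::join_utils::{swap_join_type, JoinType};

    pub struct ToyHashJoin {
        pub join_type: JoinType,
    }

    impl ToyHashJoin {
        /// Method form of the swap, relying on the shared helper.
        pub fn swap_inputs(&self) -> ToyHashJoin {
            ToyHashJoin {
                join_type: swap_join_type(self.join_type),
            }
        }
    }
}

/// Stand-in for the optimizer module, which has a second caller.
pub mod join_selection {
    use super::join_utils::{swap_join_type, JoinType};

    /// A second user of the helper (think of a nested-loop join swap).
    pub fn swap_nl_join_type(join_type: JoinType) -> JoinType {
        swap_join_type(join_type)
    }
}
```

Keeping the swap as a single public free function next to its private helpers exposes one item; moving it onto the operator exposes every helper that the other callers still need.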

hash_join: &HashJoinExec,
partition_mode: PartitionMode,
) -> Result<Arc<dyn ExecutionPlan>> {