-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10585: [Rust] [DataFusion] Add join support to DataFrame and LogicalPlan #8720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
31f7b89 to
def6c52
Compare
|
|
||
| /// The on clause of the join, as vector of (left, right) columns. | ||
| pub type JoinOn<'a> = [(&'a str, &'a str)]; | ||
| pub type JoinOn = [(String, String)]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into ownership issues that I couldn't figure out. I am happy to change this back if someone can show me how.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too small compared to the others ops to justify the effort at this point, IMO.
| JoinType::Inner => { | ||
| // inner: all fields are there | ||
| let on_right = &on.iter().map(|on| on.1.to_string()).collect::<HashSet<_>>(); | ||
| // remove right-side join keys if they have the same names as the left-side |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rustdoc test for DataFrame.join was failing until I made this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this is finished, but I gave it a quick look and it looks good so far. Left some optional comments.
I do not think we need SQL on this PR.
|
|
||
| /// The on clause of the join, as vector of (left, right) columns. | ||
| pub type JoinOn<'a> = [(&'a str, &'a str)]; | ||
| pub type JoinOn = [(String, String)]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too small compared to the others ops to justify the effort at this point, IMO.
| left_keys: Vec<String>, | ||
| /// Columns in right input to use for join keys | ||
| right_keys: Vec<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use Vec<(left_i, right_i)> because it automatically enforces the invariant that left_keys.len() == right_keys.len(). We can still keep the public interface left_keys, right_keys and perform the check before passing them to the builder.
Atm, when we use left.zip(right), we take the shortest vector, which may hide a bug in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I agree and I have made this change. The user-facing DataFrame method is now the only plan that accepts the two separate lists of column names and we verify they are the same length when creating the logical plan.
| let join_rows = join.collect().await?; | ||
| assert_eq!(100, left_rows.len()); | ||
| assert_eq!(100, right_rows.len()); | ||
| assert_eq!(1000, join_rows.len()); //TODO determine expected number but should be > 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test currently fails and I need to debug why
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test was invalid. Fixed now.
|
I will go ahead and merge once CI is green and then start implementing TPC-H queries to really test this out. @alamb fyi |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @andygrove -- I am catching up with reviews and this is super exciting to see
This PR adds
DataFrame.joinand plumbs it through to the physical join plan.