Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Nov 20, 2020

This PR adds DataFrame.join and plumbs it through to the physical join plan.

@andygrove
Copy link
Member Author

@jorgecarleitao @alamb fyi

@github-actions
Copy link


/// The on clause of the join, as vector of (left, right) columns.
pub type JoinOn<'a> = [(&'a str, &'a str)];
pub type JoinOn = [(String, String)];
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into ownership issues that I couldn't figure out. I am happy to change this back if someone can show me how.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too small compared to the others ops to justify the effort at this point, IMO.

JoinType::Inner => {
// inner: all fields are there
let on_right = &on.iter().map(|on| on.1.to_string()).collect::<HashSet<_>>();
// remove right-side join keys if they have the same names as the left-side
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rustdoc test for DataFrame.join was failing until I made this change.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is finished, but I gave it a quick look and it looks good so far. Left some optional comments.

I do not think we need SQL on this PR.


/// The on clause of the join, as vector of (left, right) columns.
pub type JoinOn<'a> = [(&'a str, &'a str)];
pub type JoinOn = [(String, String)];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too small compared to the others ops to justify the effort at this point, IMO.

Comment on lines 111 to 113
left_keys: Vec<String>,
/// Columns in right input to use for join keys
right_keys: Vec<String>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use Vec<(left_i, right_i)> because it automatically enforces the invariant that left_keys.len() == right_keys.len(). We can still keep the public interface left_keys, right_keys and perform the check before passing them to the builder.

Atm, when we use left.zip(right), we take the shortest vector, which may hide a bug in the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I agree and I have made this change. The user-facing DataFrame method is now the only plan that accepts the two separate lists of column names and we verify they are the same length when creating the logical plan.

let join_rows = join.collect().await?;
assert_eq!(100, left_rows.len());
assert_eq!(100, right_rows.len());
assert_eq!(1000, join_rows.len()); //TODO determine expected number but should be > 100
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test currently fails and I need to debug why

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test was invalid. Fixed now.

@andygrove
Copy link
Member Author

I will go ahead and merge once CI is green and then start implementing TPC-H queries to really test this out.

@alamb fyi

@andygrove andygrove closed this in 208b258 Nov 21, 2020
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @andygrove -- I am catching up with reviews and this is super exciting to see

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants