GH-14907: [R] right_join() function does not produce the expected outcome #15077
wjones127
left a comment
Thanks for creating this. This works, but I think the underlying cause is elsewhere, so it might be better to fix it there.
r/tests/testthat/test-dplyr-join.R
Outdated
  }
})

test_that("right_join correctly coalesces keys", {
Instead of another test, could we modify the existing left and to_join to have keys that don't fully overlap?
arrow/r/tests/testthat/test-dplyr-join.R
Lines 20 to 27 in 785ab5f
left <- example_data
left$some_grouping <- rep(c(1, 2), 5)
to_join <- tibble::tibble(
  some_grouping = c(1, 2),
  capital_letters = c("A", "B"),
  another_column = TRUE
)
Then we get coverage from:
arrow/r/tests/testthat/test-dplyr-join.R
Lines 143 to 152 in 785ab5f
test_that("right_join", {
  for (keep in c(TRUE, FALSE)) {
    compare_dplyr_binding(
      .input %>%
        right_join(to_join, by = "some_grouping", keep = !!keep) %>%
        collect(),
      left
    )
  }
})
Good call, done now.
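To illustrate why partially overlapping keys matter for coverage, here is a minimal base R sketch. The data frames `left_demo` and `to_join_demo` are hypothetical stand-ins, not the fixtures from the test file, and `merge(..., all.y = TRUE)` is used as a base R approximation of a right join rather than the arrow or dplyr code path.

```r
# Hypothetical stand-ins for the test fixtures (not the real example_data).
left_demo <- data.frame(
  some_grouping = c(1, 2, 1, 2),
  value = c(10, 20, 30, 40)
)
# Key 3 has no match on the left, so a right join must emit an unmatched row.
to_join_demo <- data.frame(
  some_grouping = c(1, 3),
  capital_letters = c("A", "C")
)

# Base R approximation of a right join: keep every row of the right-hand table.
joined <- merge(left_demo, to_join_demo, by = "some_grouping", all.y = TRUE)

# The unmatched right-hand key should survive in the single key column.
unmatched <- joined[is.na(joined$value), ]
print(unmatched$some_grouping)
```

When every right-hand key also appears on the left, as in the original fixtures, no unmatched rows are produced, so a bug in how the key column is chosen would go unnoticed.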
r/R/dplyr-join.R
Outdated
# Initially keep join keys so we can coalesce them after when keep=FALSE
query <- do_join(x, y, by, copy, suffix, ..., keep = TRUE, join_type = "RIGHT_OUTER")

# If we are doing a right outer join and not keeping the join keys of
# both sides, we need to coalesce. Otherwise, rows that exist in the
# RHS will have NAs for the join keys.
if (!keep) {
  query$selected_columns <- post_join_projection(names(x), names(y), handle_join_by(by, x, y), suffix)
}
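The coalescing step described in the comments above can be pictured with a small base R sketch. The data frame `joined_keep` is a hand-written, hypothetical picture of a right outer join result with both key columns kept, and `coalesce_keys()` is an illustrative helper; neither reflects the actual `post_join_projection()` implementation in the arrow package.

```r
# Hypothetical result of a right outer join with keep = TRUE: the second row
# exists only on the right-hand side, so its left-hand key is NA.
joined_keep <- data.frame(
  some_grouping.x = c(1, NA),
  some_grouping.y = c(1, 3),
  capital_letters = c("A", "C")
)

# Illustrative helper: take the left key where present, else the right key.
coalesce_keys <- function(left_key, right_key) {
  ifelse(is.na(left_key), right_key, left_key)
}

# Collapse the two key columns into one, as keep = FALSE is expected to do.
result <- data.frame(
  some_grouping = coalesce_keys(
    joined_keep$some_grouping.x,
    joined_keep$some_grouping.y
  ),
  capital_letters = joined_keep$capital_letters
)
print(result)
```

Without the coalesce, selecting only `some_grouping.x` would leave an NA key on the row that has no left-hand match.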
This works, but it might be straightforward to fix here:
Lines 37 to 43 in 785ab5f
# can coalesce them afterwards.
left_output <- names(x)
right_output <- if (keep || join_type == "FULL_OUTER") {
  names(y)
} else {
  setdiff(names(y), by)
}
In the case of a right join, we need to apply the setdiff to produce left_output and leave right_output as names(y).
I tried the way you suggested, but I think I still need the post_join_projection() call; if I revert what I've done and just make the changes suggested above, I get an error in my test as the schema expects some_grouping.x and some_grouping.y columns to have been created, but they have not.
Try this (works for me locally):
left_output <- if (!keep && join_type == "RIGHT_OUTER") {
  setdiff(names(x), by)
} else {
  names(x)
}
right_output <- if (keep || join_type %in% c("FULL_OUTER", "RIGHT_OUTER")) {
  names(y)
} else {
  setdiff(names(y), by)
}
Thanks, I was missing the !keep from the left_output!
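For readers following the thread, the suggested selection logic can be exercised on its own with plain character vectors. This is a sketch only: the wrapper `select_join_outputs()` and its argument names are assumptions made for illustration, and the column names in the usage example are hypothetical.

```r
# Illustrative wrapper around the selection logic suggested in the review;
# the function name and argument names are assumptions for this sketch.
select_join_outputs <- function(x_names, y_names, by, keep, join_type) {
  # For a right join without keep, drop the keys from the left side so that
  # the right-hand keys, populated for every output row, are the ones kept.
  left_output <- if (!keep && join_type == "RIGHT_OUTER") {
    setdiff(x_names, by)
  } else {
    x_names
  }
  # Keep the right-hand keys for full and right outer joins, or when keep = TRUE.
  right_output <- if (keep || join_type %in% c("FULL_OUTER", "RIGHT_OUTER")) {
    y_names
  } else {
    setdiff(y_names, by)
  }
  list(left_output = left_output, right_output = right_output)
}

# Hypothetical column names for a right join with keep = FALSE.
cols <- select_join_outputs(
  x_names = c("some_grouping", "int"),
  y_names = c("some_grouping", "capital_letters"),
  by = "some_grouping",
  keep = FALSE,
  join_type = "RIGHT_OUTER"
)
str(cols)
```

The `!keep` condition on the left side is what prevents both key columns from being dropped or duplicated: with `keep = TRUE` each side keeps its own keys, and only a right join with `keep = FALSE` removes them from the left.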
wjones127
left a comment
Looks good. Thanks!
…ed outcome (apache#15077)
* Closes: apache#14907
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
Benchmark runs are scheduled for baseline = 53d73f8 and contender = c45ce81. c45ce81 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have high level of regressions.