Skip to content

Conversation

@thisisnic
Copy link
Member

@thisisnic thisisnic commented Dec 23, 2022

@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #14907 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this. This works, but I think the underlying is cause is elsewhere, so might be better to fix there.

}
})

test_that("right_join correctly coalesces keys", {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of another test, could we modify the existing left and to_join to have keys that don't fully overlap?

left <- example_data
left$some_grouping <- rep(c(1, 2), 5)
to_join <- tibble::tibble(
some_grouping = c(1, 2),
capital_letters = c("A", "B"),
another_column = TRUE
)

Then we get coverage from:

test_that("right_join", {
for (keep in c(TRUE, FALSE)) {
compare_dplyr_binding(
.input %>%
right_join(to_join, by = "some_grouping", keep = !!keep) %>%
collect(),
left
)
}
})

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, done now.

r/R/dplyr-join.R Outdated
Comment on lines 76 to 84
# Initially keep join keys so we can coalesce them after when keep=FALSE
query <- do_join(x, y, by, copy, suffix, ..., keep = TRUE, join_type = "RIGHT_OUTER")

# If we are doing a right outer join and not keeping the join keys of
# both sides, we need to coalesce. Otherwise, rows that exist in the
# RHS will have NAs for the join keys.
if (!keep) {
query$selected_columns <- post_join_projection(names(x), names(y), handle_join_by(by, x, y), suffix)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works, but it might be straightforward to fix here:

arrow/r/R/dplyr-join.R

Lines 37 to 43 in 785ab5f

# can coalesce them afterwards.
left_output <- names(x)
right_output <- if (keep || join_type == "FULL_OUTER") {
names(y)
} else {
setdiff(names(y), by)
}

in the case of right join, we need to do the setdiff to produce the left_output and leave right_output as names(y).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the way you suggested, but I think I still need the post_join_projection() call; if I revert what I've done and just make the changes suggested above, I get an error in my test as the schema expects some_grouping.x and some_grouping.y columns to have been created, but they have not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try this (works for me locally):

  left_output <- if (!keep && join_type == "RIGHT_OUTER") {
    setdiff(names(x), by)
  } else {
    names(x)
  }
  right_output <- if (keep || join_type %in% c("FULL_OUTER", "RIGHT_OUTER")) {
    names(y)
  } else {
    setdiff(names(y), by)
  }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I was missing the !keep! from the left_output!

@thisisnic thisisnic requested a review from wjones127 January 4, 2023 11:24
Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

@wjones127 wjones127 merged commit c45ce81 into apache:master Jan 4, 2023
EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this pull request Jan 5, 2023
…ed outcome (apache#15077)

* Closes: apache#14907

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@ursabot
Copy link

ursabot commented Jan 5, 2023

Benchmark runs are scheduled for baseline = 53d73f8 and contender = c45ce81. c45ce81 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.56% ⬆️0.0%] test-mac-arm
[Finished ⬇️11.48% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️2.86% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] c45ce810 ec2-t3-xlarge-us-east-2
[Failed] c45ce810 test-mac-arm
[Finished] c45ce810 ursa-i9-9960x
[Failed] c45ce810 ursa-thinkcentre-m75q
[Finished] 53d73f8f ec2-t3-xlarge-us-east-2
[Failed] 53d73f8f test-mac-arm
[Finished] 53d73f8f ursa-i9-9960x
[Failed] 53d73f8f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Jan 5, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[R] right_join() function does not produce the expected outcome

3 participants