-
Notifications
You must be signed in to change notification settings - Fork 4k
Description
Describe the enhancement requested
As described in #14907 (comment). Reprex copied below.
library(dplyr)
library(arrow)
tbl1 <- tibble::tibble(
a = 1:3,
b = c("a", "b", NA),
d = c(letters[4:6])
)
tbl2 <- tibble::tibble(
b = c("b", NA),
c = c("a should be 2", "a should be 3")
)
# Left join tibbles, NAs matched
left_join(tbl2, tbl1)
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#> b c a d
#> <chr> <chr> <int> <chr>
#> 1 b a should be 2 2 e
#> 2 <NA> a should be 3 3 f
# Left join arrow table & tibble, NAs NOT matched
left_join(as_arrow_table(tbl2), tbl1) %>% collect()
#> # A tibble: 2 × 4
#> b c a d
#> <chr> <chr> <int> <chr>
#> 1 b a should be 2 2 e
#> 2 <NA> a should be 3 NA <NA>
# Left join arrow table & arrow table, NAs NOT matched
left_join(as_arrow_table(tbl2), as_arrow_table(tbl1)) %>% collect()
#> # A tibble: 2 × 4
#> b c a d
#> <chr> <chr> <int> <chr>
#> 1 b a should be 2 2 e
#> 2 <NA> a should be 3 NA <NA>The problem here is the question of "can we join on NA values in arrow?"
Not right now! But, here are a couple of workarounds. The first uses extra code, and the second passes the data to duckdb and back.
library(arrow)
library(dplyr)
tbl1 <- tibble::tibble(
a = 1:3,
b = c("a", "b", NA),
d = c(letters[4:6])
)
tbl2 <- tibble::tibble(
b = c("b", NA),
c = c("a should be 2", "a should be 3")
)
as_arrow_table(tbl2) |>
# replace NAs in tbl2 with alternative value
mutate(b = ifelse(is.na(b), "temp_value", b)) |>
left_join(
as_arrow_table(tbl1) |>
# replace NAs in tbl1 with alternative value
mutate(b = ifelse(is.na(b), "temp_value", b))
) |>
# replace alternative value in results with NA
mutate(b = ifelse(b == "temp_value", NA, b)) |>
collect()
#> # A tibble: 2 × 4
#> b c a d
#> <chr> <chr> <int> <chr>
#> 1 b a should be 2 2 e
#> 2 <NA> a should be 3 3 fOr with DuckDB:
tbl1_duckdb <- as_arrow_table(tbl1) |>
to_duckdb()
as_arrow_table(tbl2) |>
to_duckdb() |>
left_join(tbl1_duckdb, na_matches = "na") |>
collect()
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#> b c a d
#> <chr> <chr> <int> <chr>
#> 1 b a should be 2 2 e
#> 2 <NA> a should be 3 3 f
Note that we have to pass in na_matches = "na" explicitly in the example there as the default value when working with duckdb/dbplyr is "never" - basically reflecting that in SQL we can't join on NULL (NA) values.
I also had a look to try to work out whether we can implement this in Arrow or not. dbplyr implements a function sql_expr_matches() which is what allows matching on NAs, and here's a snippet from it:
sql_expr_matches.DBIConnection <- function(con, x, y, ...) {
glue_sql2(
con,
"CASE WHEN ({x} = {y}) OR ({x} IS NULL AND {y} IS NULL) ",
"THEN 0 ",
"ELSE 1 ",
"END = 0"
)
}
The C++ changes in #11579 may allow us to implement something like this in arrow - the unit tests in that PR certainly make it look feasible, though non-trivial.
Component(s)
R