Skip to content

[R] Support joining using NA as join key #37902

@thisisnic

Description

@thisisnic

Describe the enhancement requested

As described in #14907 (comment). Reprex copied below.

library(dplyr)
library(arrow)

tbl1 <- tibble::tibble(
  a = 1:3,
  b = c("a", "b", NA),
  d = c(letters[4:6])
)

tbl2 <- tibble::tibble(
  b = c("b", NA),
  c = c("a should be 2", "a should be 3")
)


# Left join tibbles, NAs matched
left_join(tbl2, tbl1)
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3     3 f


# Left join arrow table & tibble, NAs NOT matched
left_join(as_arrow_table(tbl2), tbl1) %>% collect()
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3    NA <NA>


# Left join arrow table & arrow table, NAs NOT matched
left_join(as_arrow_table(tbl2), as_arrow_table(tbl1)) %>% collect()
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3    NA <NA>

The problem here is the question of "can we join on NA values in arrow?"

Not right now! But, here are a couple of workarounds. The first uses extra code, and the second passes the data to duckdb and back.

library(arrow)
library(dplyr)

tbl1 <- tibble::tibble(
  a = 1:3,
  b = c("a", "b", NA),
  d = c(letters[4:6])
)

tbl2 <- tibble::tibble(
  b = c("b", NA),
  c = c("a should be 2", "a should be 3")
)

as_arrow_table(tbl2) |>
  # replace NAs in tbl2 with alternative value
  mutate(b = ifelse(is.na(b), "temp_value", b)) |>
  left_join(
    as_arrow_table(tbl1) |>
      # replace NAs in tbl1 with alternative value
      mutate(b = ifelse(is.na(b), "temp_value", b))
  ) |>
  # replace alternative value in results with NA
  mutate(b = ifelse(b == "temp_value", NA, b)) |>
  collect()
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3     3 f

Or with DuckDB:

tbl1_duckdb <- as_arrow_table(tbl1) |>
  to_duckdb()

as_arrow_table(tbl2) |>
  to_duckdb() |>
  left_join(tbl1_duckdb, na_matches = "na") |>
  collect()
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3     3 f

Note that we have to pass in na_matches = "na" explicitly in the example there as the default value when working with duckdb/dbplyr is "never" - basically reflecting that in SQL we can't join on NULL (NA) values.

I also had a look to try to work out whether we can implement this in Arrow or not. dbplyr implements a function sql_expr_matches() which is what allows matching on NAs, and here's a snippet from it:

sql_expr_matches.DBIConnection <- function(con, x, y, ...) {
  glue_sql2(
    con,
    "CASE WHEN ({x} = {y}) OR ({x} IS NULL AND {y} IS NULL) ",
    "THEN 0 ",
    "ELSE 1 ",
    "END = 0"
  )
}

The C++ changes in #11579 may allow us to implement something like this in arrow - the unit tests in that PR certainly make it look feasible, though non-trivial.

Component(s)

R

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions