data.table-style coalesce #3424

@MichaelChirico

Did a bunch of coalescing today and was sorely missing an efficient version.

@HughParsonage, would you be happy to add hutils::coalesce to data.table? I have the discussion in #2677 in mind...

I tinkered with hutils::coalesce to come up with:

coalesce <- function(x, ...) {
  if (!any(na_idx <- is.na(x)) || missing(..1)) return(x)
  values <- list(...)
  
  lx <- length(x)
  lengths <- c(lx, vapply(values, length, FUN.VALUE = 0L))
  lengthsn1 <- lengths != 1L
  if (any(lengthsn1 & lengths != lx)) {
    wrong_len_i <- which(lengthsn1 & lengths != lx)
    stop("Argument ", wrong_len_i[1], " had length ", lengths[wrong_len_i[1]], ", ",
         "but length(x) = ", lx, ". ",
         "The only permissible lengths in ... are 1 or the length of `x` (", lx, ").")
  }
  
  typeof_x <- typeof(x)
  x_not_factor <- !inherits(x, what = 'factor')
  lv <- length(values)
  
  for (i in seq_len(lv)) {
    vi <- values[[i]]
    if (typeof(vi) != typeof_x) {
      stop("Argument ", i + 1L, " had type '", typeof(vi), "' but ",
           "typeof(x) was ", typeof_x, ". All types ",
           "in `...` must be the same type.")
    }
    
    if (inherits(vi, what = "factor") && x_not_factor) {
      stop("Argument ", i + 1L, " was a factor, but `x` was not. ",
           "All `...` must be the same type.")
    }
    
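    if (lengthsn1[i + 1L]) {
      # vector argument: fill only the positions of x that are still NA
      has_value_idx <- !is.na(vi[na_idx])
      x[na_idx][has_value_idx] <- vi[na_idx][has_value_idx]
      if (all(has_value_idx)) break
      # na_idx has length(x) but has_value_idx only sum(na_idx); map back via which()
      na_idx[which(na_idx)[has_value_idx]] <- FALSE
    } else {
      # scalar argument: a non-NA scalar fills every remaining NA, so we are done
      if (is.na(vi)) next
      x[na_idx] <- vi
      break
    }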
  }
  x
}

The main difference is that it skips running anyNA on every iteration and instead focuses on "whittling down" the is.na(x) vector.
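To make the "whittling down" idea concrete, here is a stripped-down sketch. It is not the function above: `coalesce_whittle` is a hypothetical name, it assumes every argument has the same length as `x`, it does no type or length checking, and it tracks integer positions rather than a logical mask.

```r
# Hypothetical, minimal illustration of whittling down the missing positions.
# Assumes all arguments in ... have the same length as x; no validation.
coalesce_whittle <- function(x, ...) {
  todo <- which(is.na(x))          # positions still missing; x is scanned once
  for (v in list(...)) {
    if (!length(todo)) break       # nothing left to fill
    filled <- !is.na(v[todo])      # which open positions does v resolve?
    x[todo[filled]] <- v[todo[filled]]
    todo <- todo[!filled]          # shrink the work list instead of rescanning x
  }
  x
}

x <- c(NA, 2L, NA, NA)
y <- c(1L, NA, NA, 9L)
z <- c(7L, 7L, 7L, 7L)
res <- coalesce_whittle(x, y, z)
print(res)  # 1 2 7 9
```

Each pass only touches the positions that are still open, so later arguments get cheaper as more of `x` is filled.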

I benchmarked it against hmisc::coalesce and the results are hit or miss. Maybe more replications are needed (a function evaluation takes at most around 2 seconds, so this is doable), or there's some extra optimization I'm missing...
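For the "more replications" point, a rough base-R harness along these lines might help reduce the noise. Everything here is an assumption for illustration: `time_reps`, `via_ifelse` and `via_index` are hypothetical names, and the two candidates are simple stand-ins, not the functions compared above.

```r
# Hypothetical helper: median elapsed seconds over `reps` runs of f(...).
time_reps <- function(f, ..., reps = 20L) {
  elapsed <- vapply(
    seq_len(reps),
    function(i) system.time(f(...))[["elapsed"]],
    FUN.VALUE = numeric(1)
  )
  median(elapsed)
}

set.seed(1)
n <- 1e5
x <- sample(c(NA, 1:10), n, replace = TRUE)
y <- sample(1:10, n, replace = TRUE)

# Two stand-in candidates that fill NAs in x from y.
via_ifelse <- function(x, y) ifelse(is.na(x), y, x)
via_index  <- function(x, y) { i <- is.na(x); x[i] <- y[i]; x }

# Check that both give the same answer before comparing timings.
stopifnot(identical(via_ifelse(x, y), via_index(x, y)))

timings <- c(
  ifelse = time_reps(via_ifelse, x, y),
  index  = time_reps(via_index, x, y)
)
print(timings)
```

Taking the median over many runs should be less sensitive to a single slow run (garbage collection, for example) than a one-off timing.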

Anyway, the same logic would probably be faster in C...
