Skip to content

na.fill LOCF, NOCB, #854#3341

Merged
mattdowle merged 23 commits intomasterfrom
nafill
Apr 16, 2019
Merged

na.fill LOCF, NOCB, #854#3341
mattdowle merged 23 commits intomasterfrom
nafill

Conversation

@jangorecki
Copy link
Copy Markdown
Member

@jangorecki jangorecki commented Jan 31, 2019

Closes #854
It has much bigger scope but this PR I think address most of the user use cases.
For integer and double should be fully functional. There is some redundancy there in the process, declaring pointers to integer and double, even if only one of those has been used. That allows to use more generic struct (types.h file), which can point to int and double as well.
Looking forward for feedback.

@jangorecki
Copy link
Copy Markdown
Member Author

jangorecki commented Jan 31, 2019

Below optimistic scenario for data.table where 20 columns has to be adjusted with LOCF on a 20 cores machine, vs single threaded R implemented zoo.

set.seed(108)
library(data.table)
library(zoo)
packageVersion("data.table")
#[1] ‘1.12.1’
packageVersion("zoo")
#[1] ‘1.8.4’
N = 1e8
x = rnorm(N)
random_na = function(x, k) {
  nx = length(x)
  if (k > nx/2) k = nx/2 # dont need more than half of NAs
  i = sample(nx, k)
  x[i] = NA
  x
}
d = lapply(1:20, function(i) random_na(x=x+i, k=i*1e5))
sapply(d[1:3], function(x) sum(is.na(x)))
#[1] 100000 200000 300000
setDT(d)
d[1:3, 1:3]
#          V1       V2       V3
#       <num>    <num>    <num>
#1: 0.8871079 1.887108 2.887108
#2: 0.6181806 1.618181 2.618181
#3: 0.9083082 1.908308 2.908308
prettyNum(dim(d), big.mark="'")
#[1] "100'000'000" "20"

system.time(setDF(ans1<-fnafill(d, "locf")))
#   user  system elapsed
#  8.834   7.456   0.890

setDF(d)
system.time(ans2<-na.locf(d, na.rm=FALSE))
#   user  system elapsed 
# 56.552  36.852  93.403

all.equal(ans1, ans2)
#[1] TRUE

@mattdowle mattdowle added this to the 1.12.4 milestone Feb 1, 2019
@jangorecki jangorecki changed the title na.fill LOCF, NOCB, #854 [WIP] na.fill LOCF, NOCB, #854 Feb 3, 2019
@jangorecki jangorecki changed the title [WIP] na.fill LOCF, NOCB, #854 na.fill LOCF, NOCB, #854 Feb 3, 2019
@jangorecki
Copy link
Copy Markdown
Member Author

jangorecki commented Feb 3, 2019

This PR provides simple API to one the very popular problems: https://stackoverflow.com/a/54504456/2490497

@mattdowle
Copy link
Copy Markdown
Member

mattdowle commented Feb 4, 2019

This PR seems to export two new functions: fnafill and setnafill. What's the difference between them? My first thought is that there are two operations: i) replacing NAs with another value (like 0) and ii) filling NAs forwards with the last observation. Then there's nomatch= (and the pending issue to move LOCF() into nomatch=) and roll=. And similar function names in other package like na.locf.

The point of data.table is to keep data irregular. Then join a regular series to it using roll= (or in future nomatch=) to fill forward or backwards or limit staleness. I wonder if the people wanting to fill NAs realize that it's not supposed to be done that way.

@jangorecki
Copy link
Copy Markdown
Member Author

jangorecki commented Feb 5, 2019

Both functions can replace with constant value and locf/nocb, the difference that setnafill makes that in-place, which is actually much more common scenario than creating new columns. nomatch will be designed to be used during joins, and will likely carry some overhead because of that.


DONE Do not merge this PR yet, I will add cols argument to setnafill function so it will be more convenient to pass whole dt to it, and fill only particular columns, like

setnafill(bigdt, "locf", cols=c("v3","v5"))

instead of

# counter intuitive to update in place without `:=`
bigdt[, setnafill(list(v3, v5), "locf")]
# or
setnafill(list(bigdt$v3, bigdt$v5), "locf")

@jangorecki
Copy link
Copy Markdown
Member Author

jangorecki commented Feb 6, 2019

I added mentioned functionality so it is easy to fill only selected columns from data.table. New argument cols accept int, double or character.

setnafill(dt, fill=0, cols=c("V1","V2"))
setnafill(dt, fill=0, cols=c(1L,2L))
setnafill(dt, fill=0, cols=c(1,2))

Processing of that argument has been isolated into own function SEXP colnamesInt(SEXP x, SEXP cols); where x is a data.table to match columns against to, and return value is INTSXP of column positions. It handle exceptions, like NA, numbers outside of range of data.table columns, duplicates. There is not exported R wrapper to it colnamesInt(x, cols) so that functional can be more easily tested. We should be able to re-use it in few places, both in R and C, but lets keep scope of this PR narrow.

@jangorecki
Copy link
Copy Markdown
Member Author

jangorecki commented Feb 7, 2019

Added verbose argument to print to console time spent on each column, and all of them. Processing time of each column is measured in parallel region, but printing is deferred so it is thread safe. Example on 40 thread machine and 40 columns data.table.

library(data.table)
N = 1e8
dt = setDT(c(
  replicate(20, sample(c(seq_len(N*0.9),rep(NA_integer_, N*0.1))), simplify=FALSE),
  replicate(20, sample(c(seq_len(N*0.9)/2,rep(NA_real_, N*0.1))), simplify=FALSE)
))
system.time(ans<-nafill(dt, "locf", verbose=TRUE))
nafillInteger: took 2.095s
nafillInteger: took 2.017s
nafillInteger: took 2.051s
nafillInteger: took 2.028s
nafillInteger: took 1.861s
nafillInteger: took 2.075s
nafillInteger: took 1.907s
nafillInteger: took 1.872s
nafillInteger: took 1.974s
nafillInteger: took 1.798s
nafillInteger: took 2.107s
nafillInteger: took 1.864s
nafillInteger: took 2.023s
nafillInteger: took 1.730s
nafillInteger: took 2.056s
nafillInteger: took 1.866s
nafillInteger: took 2.024s
nafillInteger: took 1.892s
nafillInteger: took 1.978s
nafillInteger: took 1.956s
nafillDouble: took 3.140s
nafillDouble: took 2.987s
nafillDouble: took 3.148s
nafillDouble: took 3.003s
nafillDouble: took 3.828s
nafillDouble: took 3.060s
nafillDouble: took 3.135s
nafillDouble: took 3.209s
nafillDouble: took 3.090s
nafillDouble: took 3.069s
nafillDouble: took 3.052s
nafillDouble: took 3.200s
nafillDouble: took 3.163s
nafillDouble: took 3.113s
nafillDouble: took 3.220s
nafillDouble: took 3.167s
nafillDouble: took 3.129s
nafillDouble: took 3.010s
nafillDouble: took 3.008s
nafillDouble: took 2.497s
nafillR: parallel processing of 40 column(s) took 3.979s
   user  system elapsed 
 23.375  75.110   4.138

@MichaelChirico
Copy link
Copy Markdown
Member

MichaelChirico commented Feb 8, 2019

nafillInteger: took 2.095s
nafillInteger: took 2.017s
nafillInteger: took 2.051s
nafillInteger: took 2.028s
nafillInteger: took 1.861s
nafillInteger: took 2.075s
nafillInteger: took 1.907s
nafillInteger: took 1.872s

Such output doesn't look all that helpful. Everything is subsumed by nafillR: parallel processing of 40 column(s) took 3.979s, no?

Is it possible to tag the output with the column name, e.g.

nafillInteger: V1 took 2.095s
nafillInteger: V2 took 2.017s
nafillInteger: V3 took 2.051s
nafillInteger: V4 took 2.028s
...

Then the column-by-column output has some value add.

What about a case of 100s of columns, do we still want to print everything?

@jangorecki
Copy link
Copy Markdown
Member Author

jangorecki commented Feb 8, 2019

Will fix Codacy fail


@MichaelChirico

Is it possible to tag the output with the column name, e.g.

Inner functions that populate that message are not aware of column name. It is like passing a list to lapply, you cannot access the name of each list element.
This is actually good suggestion to extend ans_t struct to carry a tag. This raise problem how to define such tag, column name is obvious, but why not column index? for froll it wouldn't be just column name but concatenated column name and window size. Using index would be more consistent then, but we actually don't need to carry index as we store array of ans_t which is already indexed and retain order. And because it is a low level structure it would be best to have it well and strictly defined.

What about a case of 100s of columns, do we still want to print everything?

Yes, there is verbose=FALSE to relief that pain. The only reasonable solution would be to allow verbose to be int, decremented with each inner function call. So for more details you call verbose=2, for seeing just messages from top called function verbose=1.

@MichaelChirico
Copy link
Copy Markdown
Member

there is verbose=FALSE to relief that pain

True (not TRUE 😉), OK.

why not column index?

Also would be fine. Something to make the output there actionable... "Oh, Column 45 took 10x as long as all the other columns, I should investigate..."

@jangorecki
Copy link
Copy Markdown
Member Author

@MichaelChirico added index in printed message, it is indexing from 0 btw.

@codecov
Copy link
Copy Markdown

codecov bot commented Feb 8, 2019

Codecov Report

Merging #3341 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3341      +/-   ##
==========================================
+ Coverage   96.68%   96.72%   +0.03%     
==========================================
  Files          65       66       +1     
  Lines       12281    12424     +143     
==========================================
+ Hits        11874    12017     +143     
  Misses        407      407
Impacted Files Coverage Δ
src/init.c 100% <ø> (ø) ⬆️
src/nafill.c 100% <100%> (ø)
R/shift.R 96.15% <100%> (+2.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 66166b3...c429b3b. Read the comment docs.

@MichaelChirico
Copy link
Copy Markdown
Member

Switched to one-based unless there's something I'm missing?

@mattdowle
Copy link
Copy Markdown
Member

I added a brief news item. Can someone add a good brief example to the news item please as follow up PR.
I edited Jan's original comment to say that this PR closes #854. If that wasn't right thing to do, could anything not addressed be added as new follow up issues please.

@mattdowle mattdowle merged commit c30e161 into master Apr 16, 2019
@mattdowle mattdowle deleted the nafill branch April 16, 2019 22:28
Comment thread src/nafill.c

for (R_len_t i=0; i<nx; i++) {
if (bverbose && (vans[i].message[0][0] != '\0')) Rprintf("%s: %d: %s", __func__, i+1, vans[i].message[0]);
if (vans[i].message[1][0] != '\0') REprintf("%s: %d: %s", __func__, i+1, vans[i].message[1]); // # nocov start
Copy link
Copy Markdown
Member Author

@jangorecki jangorecki Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this cannot be suppressed with suppressMessages, but lower level funs are not producing any messages as of now.

Comment thread src/nafill.c
ifill = INTEGER(fill)[0];
dfill = INTEGER(fill)[0]==NA_INTEGER ? NA_REAL : (double)INTEGER(fill)[0];
} else if (isReal(fill)) {
ifill = ISNA(REAL(fill)[0]) ? NA_INTEGER : (int32_t)REAL(fill)[0];
Copy link
Copy Markdown
Member Author

@jangorecki jangorecki Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattdowle is this type of casting safe? and similar in line 152. I should always use coerceVector in such cases, correct?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks that coerceVector is not necessary: 6383823#diff-74714f9d6425434f378360350d1a4c78R41

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fill in missing data

3 participants