na.fill LOCF, NOCB, #854 by jangorecki · Pull Request #3341 · Rdatatable/data.table

jangorecki · 2019-01-31T13:46:22Z

Closes #854
It has much bigger scope but this PR I think address most of the user use cases.
For integer and double should be fully functional. There is some redundancy there in the process, declaring pointers to integer and double, even if only one of those has been used. That allows to use more generic struct (types.h file), which can point to int and double as well.
Looking forward for feedback.

jangorecki · 2019-01-31T15:00:29Z

Below optimistic scenario for data.table where 20 columns has to be adjusted with LOCF on a 20 cores machine, vs single threaded R implemented zoo.

set.seed(108)
library(data.table)
library(zoo)
packageVersion("data.table")
#[1] ‘1.12.1’
packageVersion("zoo")
#[1] ‘1.8.4’
N = 1e8
x = rnorm(N)
random_na = function(x, k) {
  nx = length(x)
  if (k > nx/2) k = nx/2 # dont need more than half of NAs
  i = sample(nx, k)
  x[i] = NA
  x
}
d = lapply(1:20, function(i) random_na(x=x+i, k=i*1e5))
sapply(d[1:3], function(x) sum(is.na(x)))
#[1] 100000 200000 300000
setDT(d)
d[1:3, 1:3]
#          V1       V2       V3
#       <num>    <num>    <num>
#1: 0.8871079 1.887108 2.887108
#2: 0.6181806 1.618181 2.618181
#3: 0.9083082 1.908308 2.908308
prettyNum(dim(d), big.mark="'")
#[1] "100'000'000" "20"

system.time(setDF(ans1<-fnafill(d, "locf")))
#   user  system elapsed
#  8.834   7.456   0.890

setDF(d)
system.time(ans2<-na.locf(d, na.rm=FALSE))
#   user  system elapsed 
# 56.552  36.852  93.403

all.equal(ans1, ans2)
#[1] TRUE

jangorecki · 2019-02-03T15:36:08Z

This PR provides simple API to one the very popular problems: https://stackoverflow.com/a/54504456/2490497

mattdowle · 2019-02-04T23:32:37Z

This PR seems to export two new functions: fnafill and setnafill. What's the difference between them? My first thought is that there are two operations: i) replacing NAs with another value (like 0) and ii) filling NAs forwards with the last observation. Then there's nomatch= (and the pending issue to move LOCF() into nomatch=) and roll=. And similar function names in other package like na.locf.

The point of data.table is to keep data irregular. Then join a regular series to it using roll= (or in future nomatch=) to fill forward or backwards or limit staleness. I wonder if the people wanting to fill NAs realize that it's not supposed to be done that way.

jangorecki · 2019-02-05T05:37:07Z

Both functions can replace with constant value and locf/nocb, the difference that setnafill makes that in-place, which is actually much more common scenario than creating new columns. nomatch will be designed to be used during joins, and will likely carry some overhead because of that.

DONE ~~Do not merge this PR yet, I will add cols argument to setnafill function~~ so it will be more convenient to pass whole dt to it, and fill only particular columns, like

setnafill(bigdt, "locf", cols=c("v3","v5"))

instead of

# counter intuitive to update in place without `:=`
bigdt[, setnafill(list(v3, v5), "locf")]
# or
setnafill(list(bigdt$v3, bigdt$v5), "locf")

…e column

jangorecki · 2019-02-06T16:25:33Z

I added mentioned functionality so it is easy to fill only selected columns from data.table. New argument cols accept int, double or character.

setnafill(dt, fill=0, cols=c("V1","V2"))
setnafill(dt, fill=0, cols=c(1L,2L))
setnafill(dt, fill=0, cols=c(1,2))

Processing of that argument has been isolated into own function SEXP colnamesInt(SEXP x, SEXP cols); where x is a data.table to match columns against to, and return value is INTSXP of column positions. It handle exceptions, like NA, numbers outside of range of data.table columns, duplicates. There is not exported R wrapper to it colnamesInt(x, cols) so that functional can be more easily tested. We should be able to re-use it in few places, both in R and C, but lets keep scope of this PR narrow.

jangorecki · 2019-02-07T12:19:21Z

Added verbose argument to print to console time spent on each column, and all of them. Processing time of each column is measured in parallel region, but printing is deferred so it is thread safe. Example on 40 thread machine and 40 columns data.table.

library(data.table)
N = 1e8
dt = setDT(c(
  replicate(20, sample(c(seq_len(N*0.9),rep(NA_integer_, N*0.1))), simplify=FALSE),
  replicate(20, sample(c(seq_len(N*0.9)/2,rep(NA_real_, N*0.1))), simplify=FALSE)
))
system.time(ans<-nafill(dt, "locf", verbose=TRUE))

nafillInteger: took 2.095s
nafillInteger: took 2.017s
nafillInteger: took 2.051s
nafillInteger: took 2.028s
nafillInteger: took 1.861s
nafillInteger: took 2.075s
nafillInteger: took 1.907s
nafillInteger: took 1.872s
nafillInteger: took 1.974s
nafillInteger: took 1.798s
nafillInteger: took 2.107s
nafillInteger: took 1.864s
nafillInteger: took 2.023s
nafillInteger: took 1.730s
nafillInteger: took 2.056s
nafillInteger: took 1.866s
nafillInteger: took 2.024s
nafillInteger: took 1.892s
nafillInteger: took 1.978s
nafillInteger: took 1.956s
nafillDouble: took 3.140s
nafillDouble: took 2.987s
nafillDouble: took 3.148s
nafillDouble: took 3.003s
nafillDouble: took 3.828s
nafillDouble: took 3.060s
nafillDouble: took 3.135s
nafillDouble: took 3.209s
nafillDouble: took 3.090s
nafillDouble: took 3.069s
nafillDouble: took 3.052s
nafillDouble: took 3.200s
nafillDouble: took 3.163s
nafillDouble: took 3.113s
nafillDouble: took 3.220s
nafillDouble: took 3.167s
nafillDouble: took 3.129s
nafillDouble: took 3.010s
nafillDouble: took 3.008s
nafillDouble: took 2.497s
nafillR: parallel processing of 40 column(s) took 3.979s
   user  system elapsed 
 23.375  75.110   4.138

MichaelChirico · 2019-02-08T10:59:36Z

nafillInteger: took 2.095s
nafillInteger: took 2.017s
nafillInteger: took 2.051s
nafillInteger: took 2.028s
nafillInteger: took 1.861s
nafillInteger: took 2.075s
nafillInteger: took 1.907s
nafillInteger: took 1.872s

Such output doesn't look all that helpful. Everything is subsumed by nafillR: parallel processing of 40 column(s) took 3.979s, no?

Is it possible to tag the output with the column name, e.g.

nafillInteger: V1 took 2.095s
nafillInteger: V2 took 2.017s
nafillInteger: V3 took 2.051s
nafillInteger: V4 took 2.028s
...

Then the column-by-column output has some value add.

What about a case of 100s of columns, do we still want to print everything?

jangorecki · 2019-02-08T11:41:26Z

Will fix Codacy fail

@MichaelChirico

Is it possible to tag the output with the column name, e.g.

Inner functions that populate that message are not aware of column name. It is like passing a list to lapply, you cannot access the name of each list element.
This is actually good suggestion to extend ans_t struct to carry a tag. This raise problem how to define such tag, column name is obvious, but why not column index? for froll it wouldn't be just column name but concatenated column name and window size. Using index would be more consistent then, but we actually don't need to carry index as we store array of ans_t which is already indexed and retain order. And because it is a low level structure it would be best to have it well and strictly defined.

What about a case of 100s of columns, do we still want to print everything?

Yes, there is verbose=FALSE to relief that pain. The only reasonable solution would be to allow verbose to be int, decremented with each inner function call. So for more details you call verbose=2, for seeing just messages from top called function verbose=1.

MichaelChirico · 2019-02-08T11:56:48Z

there is verbose=FALSE to relief that pain

True (not TRUE 😉), OK.

why not column index?

Also would be fine. Something to make the output there actionable... "Oh, Column 45 took 10x as long as all the other columns, I should investigate..."

jangorecki · 2019-02-08T12:11:02Z

@MichaelChirico added index in printed message, it is indexing from 0 btw.

codecov · 2019-02-08T12:23:27Z

Codecov Report

Merging #3341 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3341      +/-   ##
==========================================
+ Coverage   96.68%   96.72%   +0.03%     
==========================================
  Files          65       66       +1     
  Lines       12281    12424     +143     
==========================================
+ Hits        11874    12017     +143     
  Misses        407      407

Impacted Files	Coverage Δ
src/init.c	`100% <ø> (ø)`	⬆️
src/nafill.c	`100% <100%> (ø)`
R/shift.R	`96.15% <100%> (+2.03%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 66166b3...c429b3b. Read the comment docs.

MichaelChirico · 2019-02-08T12:30:06Z

Switched to one-based unless there's something I'm missing?

mattdowle · 2019-04-16T22:18:25Z

I added a brief news item. Can someone add a good brief example to the news item please as follow up PR.
I edited Jan's original comment to say that this PR closes #854. If that wasn't right thing to do, could anything not addressed be added as new follow up issues please.

jangorecki · 2019-04-19T16:49:47Z

+
+  for (R_len_t i=0; i<nx; i++) {
+    if (bverbose && (vans[i].message[0][0] != '\0')) Rprintf("%s: %d: %s", __func__, i+1, vans[i].message[0]);
+    if (vans[i].message[1][0] != '\0') REprintf("%s: %d: %s", __func__, i+1, vans[i].message[1]); // # nocov start


this cannot be suppressed with suppressMessages, but lower level funs are not producing any messages as of now.

jangorecki · 2019-06-05T13:46:58Z

+      ifill = INTEGER(fill)[0];
+      dfill = INTEGER(fill)[0]==NA_INTEGER ? NA_REAL : (double)INTEGER(fill)[0];
+    } else if (isReal(fill)) {
+      ifill = ISNA(REAL(fill)[0]) ? NA_INTEGER : (int32_t)REAL(fill)[0];


@mattdowle is this type of casting safe? and similar in line 152. I should always use coerceVector in such cases, correct?

it looks that coerceVector is not necessary: 6383823#diff-74714f9d6425434f378360350d1a4c78R41

mattdowle added this to the 1.12.4 milestone Feb 1, 2019

jangorecki force-pushed the nafill branch from ff33361 to 19b59a6 Compare February 2, 2019 15:49

jangorecki added 4 commits February 3, 2019 10:42

na.fill LOCF, NOCB, #854

2988c09

minor redesign and add in-place nafill

aaeee23

push down input validation, in-place tests, mv tests

7836444

fnafill tests included in pkg check

08639be

jangorecki force-pushed the nafill branch from bb65277 to 08639be Compare February 3, 2019 05:19

jangorecki changed the title ~~na.fill LOCF, NOCB, #854~~ [WIP] na.fill LOCF, NOCB, #854 Feb 3, 2019

proper exception handling thanks to Codacy, although not used here

89d880e

jangorecki changed the title ~~[WIP] na.fill LOCF, NOCB, #854~~ na.fill LOCF, NOCB, #854 Feb 3, 2019

fix proper coercion of fill argument

a047fff

jangorecki added 5 commits February 6, 2019 17:10

rename fnafill to nafill as name is still free

ce06101

test exception handling

a394f54

no need for new set function, reuse existing passing pointer to sourc…

6a0b891

…e column

also remove def of removed funs

c7ae262

setnafill gets cols argument to update only selected columns

3f6a0a1

jangorecki mentioned this pull request Feb 6, 2019

[R-Forge #2352] Columnwise set()? #610

Closed

jangorecki added 2 commits February 6, 2019 22:28

add entry to init.c for new .Call

d58ff64

add verbose arg and reporting timing

17a2212

missing verbose in usage manual

c5439f9

jangorecki mentioned this pull request Feb 7, 2019

platform independent timing in C #3358

Closed

fix osx pdf generation from Rd

6187464

fix codacy fail, amend feedback to verbose messages

39190c3

one-based indexing

c8b6c18

jangorecki and others added 6 commits February 9, 2019 15:50

omp for schedule auto in nafill

02aeb82

nocov can be smaller nafill

b7d1f5c

Merge branch 'master' into nafill

415aec6

chmatch now has one less argument in master since this PR last tested

e79d494

added news item

ec48840

coverage

c429b3b

mattdowle merged commit c30e161 into master Apr 16, 2019

mattdowle deleted the nafill branch April 16, 2019 22:28

jangorecki commented Apr 19, 2019

View reviewed changes

jangorecki commented Jun 5, 2019

View reviewed changes

Conversation

jangorecki commented Jan 31, 2019 • edited by mattdowle Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jangorecki commented Jan 31, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jangorecki commented Feb 3, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mattdowle commented Feb 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jangorecki commented Feb 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jangorecki commented Feb 6, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jangorecki commented Feb 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MichaelChirico commented Feb 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jangorecki commented Feb 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MichaelChirico commented Feb 8, 2019

Uh oh!

jangorecki commented Feb 8, 2019

Uh oh!

codecov bot commented Feb 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

MichaelChirico commented Feb 8, 2019

Uh oh!

mattdowle commented Apr 16, 2019

Uh oh!

jangorecki Apr 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jangorecki Jun 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jangorecki Jun 6, 2019

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jangorecki commented Jan 31, 2019 •

edited by mattdowle

Loading

jangorecki commented Jan 31, 2019 •

edited

Loading

jangorecki commented Feb 3, 2019 •

edited

Loading

mattdowle commented Feb 4, 2019 •

edited

Loading

jangorecki commented Feb 5, 2019 •

edited

Loading

jangorecki commented Feb 6, 2019 •

edited

Loading

jangorecki commented Feb 7, 2019 •

edited

Loading

MichaelChirico commented Feb 8, 2019 •

edited

Loading

jangorecki commented Feb 8, 2019 •

edited

Loading

codecov bot commented Feb 8, 2019 •

edited

Loading

jangorecki Apr 19, 2019 •

edited

Loading

jangorecki Jun 5, 2019 •

edited

Loading