Below is an optimistic scenario for data.table, where 20 columns have to be filled with LOCF on a 20-core machine, versus the single-threaded, R-implemented zoo.
set.seed(108)
library(data.table)
library(zoo)
packageVersion("data.table")
#[1] ‘1.12.1’
packageVersion("zoo")
#[1] ‘1.8.4’
N = 1e8
x = rnorm(N)
random_na = function(x, k) {
  nx = length(x)
  if (k > nx/2) k = nx/2 # don't need more than half of NAs
  i = sample(nx, k)
  x[i] = NA
  x
}
d = lapply(1:20, function(i) random_na(x=x+i, k=i*1e5))
sapply(d[1:3], function(x) sum(is.na(x)))
#[1] 100000 200000 300000
setDT(d)
d[1:3, 1:3]
#          V1       V2       V3
#       <num>    <num>    <num>
#1: 0.8871079 1.887108 2.887108
#2: 0.6181806 1.618181 2.618181
#3: 0.9083082 1.908308 2.908308
prettyNum(dim(d), big.mark="'")
#[1] "100'000'000" "20"
system.time(setDF(ans1<-fnafill(d, "locf")))
#   user  system elapsed
#  8.834   7.456   0.890
setDF(d)
system.time(ans2<-na.locf(d, na.rm=FALSE))
#   user  system elapsed
# 56.552  36.852  93.403
all.equal(ans1, ans2)
#[1] TRUE |
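For anyone who wants to check the LOCF semantics without 1e8 rows, here is a minimal sketch of the same equivalence test at toy scale. It assumes the released function name `nafill` (data.table 1.12.4 or later) rather than the `fnafill` name used in the benchmark, and `locf_ref` is a hypothetical base R helper written only for this comparison, not something from zoo or data.table.

```r
library(data.table)

# Hypothetical base R reference: for each position, find the index of the
# most recent non-NA value and pick the value at that index
locf_ref = function(x) {
  idx = cummax(seq_along(x) * (!is.na(x)))  # 0 while no value has been seen yet
  idx[idx == 0L] = NA                       # leading NAs stay NA
  x[idx]
}

x = c(NA, 1.5, NA, NA, 4, NA)
ref = locf_ref(x)
ans = nafill(x, "locf")   # assumes data.table >= 1.12.4
all.equal(ref, ans)
```

The leading `NA` is left in place because there is no earlier observation to carry forward, which is the behaviour `na.locf(..., na.rm=FALSE)` is asked for in the benchmark.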
|
This PR provides a simple API for one of the very popular problems: https://stackoverflow.com/a/54504456/2490497 |
|
This PR seems to export two new functions: The point of data.table is to keep data irregular. Then join a regular series to it using roll= (or in future nomatch=) to fill forward or backwards or limit staleness. I wonder if the people wanting to fill NAs realize that it's not supposed to be done that way. |
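To make the rolling-join approach concrete, here is a small sketch of filling forward, filling backward, and limiting staleness with `roll=`. The tables `irregular` and `regular` and their columns are made up for illustration; only `roll=` and `on=` are existing data.table arguments.

```r
library(data.table)

# Irregular observations: values exist only at times 1, 4 and 8
irregular = data.table(t = c(1L, 4L, 8L), v = c(10, 40, 80))
# Regular series to align the observations to
regular = data.table(t = 1:10)

# roll=TRUE carries the last observation forward (LOCF), no staleness limit
filled = irregular[regular, on = "t", roll = TRUE]

# roll=-Inf carries the next observation backward (NOCB)
backward = irregular[regular, on = "t", roll = -Inf]

# A finite roll limits staleness: a value more than 2 time units old
# is not carried forward, so t = 7 (3 units after t = 4) stays NA
limited = irregular[regular, on = "t", roll = 2]
```

The irregular data stays as it is; the fill only happens in the result of the join, which is the design point being made here.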
|
Both functions can replace with constant value and locf/nocb, the difference that DONE
setnafill(bigdt, "locf", cols=c("v3","v5"))
instead of
# counter intuitive to update in place without `:=`
bigdt[, setnafill(list(v3, v5), "locf")]
# or
setnafill(list(bigdt$v3, bigdt$v5), "locf") |
|
I added the mentioned functionality, so it is easy to fill only selected columns of a data.table. New argument
setnafill(dt, fill=0, cols=c("V1","V2"))
setnafill(dt, fill=0, cols=c(1L,2L))
setnafill(dt, fill=0, cols=c(1,2))
Processing of that argument has been isolated into its own function |
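A small sketch of how the column selection behaves on a toy table may help here. It assumes the released signatures of `nafill` and `setnafill` (data.table 1.12.4 or later); the table `dt` and its values are made up for illustration.

```r
library(data.table)

dt = data.table(V1 = c(1, NA, 3), V2 = c(NA, 2, NA), V3 = c(NA, NA, 9))

# Constant fill by reference, restricted to V1 and V2; V3 is not touched
setnafill(dt, fill = 0, cols = c("V1", "V2"))

# Columns can also be selected by position; here NOCB on the third column
setnafill(dt, type = "nocb", cols = 3L)

# nafill returns a filled copy instead of updating in place
v = c(1, NA, 3)
v_filled = nafill(v, "locf")
```

`setnafill` follows the usual `set*` convention of modifying its input by reference, so no assignment is needed, while `nafill` leaves `v` unchanged.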
|
Added
library(data.table)
N = 1e8
dt = setDT(c(
replicate(20, sample(c(seq_len(N*0.9),rep(NA_integer_, N*0.1))), simplify=FALSE),
replicate(20, sample(c(seq_len(N*0.9)/2,rep(NA_real_, N*0.1))), simplify=FALSE)
))
system.time(ans<-nafill(dt, "locf", verbose=TRUE)) |
Such output doesn't look all that helpful. Everything is subsumed by Is it possible to tag the output with the column name, e.g. Then the column-by-column output has some value add. What about a case of 100s of columns, do we still want to print everything? |
|
Will fix Codacy fail
Inner functions that populate that message are not aware of column name. It is like passing a list to
Yes, there is |
True (not
Also would be fine. Something to make the output there actionable... "Oh, Column 45 took 10x as long as all the other columns, I should investigate..." |
|
@MichaelChirico I added the index to the printed message; it is indexed from 0, btw. |
Codecov Report
@@            Coverage Diff             @@
##           master    #3341      +/-   ##
==========================================
+ Coverage   96.68%   96.72%   +0.03%
==========================================
  Files          65       66       +1
  Lines       12281    12424     +143
==========================================
+ Hits        11874    12017     +143
  Misses        407      407
|
|
Switched to one-based unless there's something I'm missing? |
|
I added a brief news item. Can someone please add a good brief example to the news item as a follow-up PR? |
|
|
for (R_len_t i=0; i<nx; i++) {
  if (bverbose && (vans[i].message[0][0] != '\0')) Rprintf("%s: %d: %s", __func__, i+1, vans[i].message[0]);
  if (vans[i].message[1][0] != '\0') REprintf("%s: %d: %s", __func__, i+1, vans[i].message[1]); // # nocov start
This cannot be suppressed with suppressMessages, but the lower-level functions are not producing any messages as of now.
  ifill = INTEGER(fill)[0];
  dfill = INTEGER(fill)[0]==NA_INTEGER ? NA_REAL : (double)INTEGER(fill)[0];
} else if (isReal(fill)) {
  ifill = ISNA(REAL(fill)[0]) ? NA_INTEGER : (int32_t)REAL(fill)[0];
@mattdowle is this type of casting safe? The same applies to the similar cast in line 152. I should always use coerceVector in such cases, correct?
It looks like coerceVector is not necessary: 6383823#diff-74714f9d6425434f378360350d1a4c78R41
Closes #854
That issue has a much bigger scope, but I think this PR addresses most of the user use cases.
For integer and double it should be fully functional. There is some redundancy in the process: pointers to both integer and double are declared even if only one of them is used. That allows the use of a more generic struct (types.h file), which can point to int as well as double.
Looking forward to feedback.