retain timezones on CJ() by RoyalTS · Pull Request #2150 · Rdatatable/data.table

RoyalTS · 2017-05-08T06:27:20Z

fixes #2029

codecov-io · 2017-05-08T06:41:13Z

Codecov Report

Merging #2150 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2150      +/-   ##
==========================================
+ Coverage   90.75%   90.78%   +0.03%     
==========================================
  Files          59       59              
  Lines       11584    11585       +1     
==========================================
+ Hits        10513    10518       +5     
+ Misses       1071     1067       -4

Impacted Files	Coverage Δ
R/setkey.R	`93.51% <100%> (+0.02%)`	⬆️
src/forder.c	`94.47% <0%> (+0.52%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

mattdowle

Great spot and thanks for the pull. Comments inline.

mattdowle · 2017-05-11T00:32:19Z

    dups = FALSE # fix for #1513
    # using rep.int instead of rep speeds things up considerably (but attributes are dropped).
    j = lapply(l, class)  # changed "vapply" to avoid errors with "ordered" "factor" input
+    tzones = lapply(l, function(col) attr(col, 'tzone'))


This can be tzones = lapply(l, attr, 'tzone') please and can it be moved down inside the if() so that it is only calculated when needed?

Makes sense. Was trying to mirror the code that copies over classes. Any reason the j = ... line above my edit isn't inside the else if?
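
A small base-R sketch of the simplification requested above: extra arguments to lapply() are passed on to the function, so the anonymous wrapper is not needed. The list `l` here is an illustrative stand-in for the list that CJ() builds from its arguments, not the package's own data.

```r
# Illustrative input: one POSIXct column with a timezone, one plain integer column.
l <- list(week = as.POSIXct("2016-01-01", tz = "UTC"), id = 1:10)

# Original form from the diff: anonymous function wrapping attr().
tz_verbose <- lapply(l, function(col) attr(col, "tzone"))

# Suggested form: lapply() forwards "tzone" as the second argument of attr().
tz_short <- lapply(l, attr, "tzone")

# Both forms give the same list; columns without a tzone yield NULL.
same_result <- identical(tz_verbose, tz_short)
```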

mattdowle · 2017-05-11T00:34:21Z

                l[[i]] = rep.int(rep.int(y, times = rep.int(x[i], n[i])), times = nrow/(x[i]*n[i]))
            if (any(class(l[[i]]) != j[[i]]))
                setattr(l[[i]], 'class', j[[i]]) # reset "Date" class - rep.int coerces to integer
+            if (any(!sapply(tzones, is.null))) {


Is the tzone of every column being checked here, inside a loop over the columns? That doesn't seem quite right to me, but I haven't debugged or run it. See the next comment on tests.

I think this if-statement can simply be removed.
It serves no real purpose. If the tzones are all NULL, let them be NULL and set them anyway.

I suppose the check could make sense if is.null is significantly faster than setting the attributes even when they're NULL? In any case, wouldn't the same apply to the lines immediately above that reset the class?

The difference from the class line is that the class check tests each column i individually.
So, if we want to keep the tzone if-statement, it should read:
if(!is.null(tzones[[i]])){...}. That would make more sense in my opinion.
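
To make the point of this thread concrete, here is a base-R sketch contrasting the list-wide check from the diff with the per-column check proposed above. The `cols` list is a made-up example; only the two conditions are taken from the discussion.

```r
# Made-up input: a POSIXct column with a tzone and an integer column without.
cols <- list(
  when = as.POSIXct("2016-01-01", tz = "UTC"),
  id   = 1:3
)
tzones <- lapply(cols, attr, "tzone")

# rep.int() returns a bare vector: the class and tzone of `when` are gone.
bare <- rep.int(cols$when, times = 2L)
attrs_after_rep <- attributes(bare)

# List-wide check from the diff: a single value, the same on every loop iteration.
global_check <- any(!sapply(tzones, is.null))

# Per-column check: one logical per column, so only `when` would be touched.
per_column_check <- !vapply(tzones, is.null, logical(1))
```

With the list-wide condition the branch is entered for every column as soon as any column has a tzone, which is why the per-column form reads more naturally inside the loop.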

mattdowle · 2017-05-11T00:38:20Z


+# CJ should retain timezone information, #2029
+df <- CJ(week=as.POSIXct('2016-01-01', tz = 'UTC'), id=1:10)
+test(1762, attr(df$week, 'tzone'), 'UTC')


More tests please including multi-column data.table with 1 and 2 columns of POSIXct with other non-POSIXct (say plain integer) columns at the beginning, middle and end. Enough variations to test that the code is actually retaining the tzone on the desired columns and not adding tzones to columns that shouldn't, because I'm not sure it's right currently just by glancing at it. The test is just a single column.

You mean something like CJ(dt1=as.POSIXct('2016-01-01', tz = 'UTC'), id=1:10, dt2=as.POSIXct(c('2016-01-01', '2016-01-02'), tz = 'UTC')) (behaves as expected) and various permutations of the order of these columns?
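
One way the requested variations could be checked in a single call is sketched below. It assumes data.table is installed in a version that contains this fix (the PR is on the v1.10.6 milestone); the column names are invented for illustration, and the block does nothing if the package is missing.

```r
# Sketch only: assumes data.table is installed and already includes this fix.
tz_by_col <- NULL
if (requireNamespace("data.table", quietly = TRUE)) {
  # Integer columns at the beginning, middle and end; POSIXct columns in between.
  res <- data.table::CJ(
    id1 = 1:2,
    dt1 = as.POSIXct("2016-01-01", tz = "UTC"),
    id2 = 3:4,
    dt2 = as.POSIXct(c("2016-01-01", "2016-01-02"), tz = "UTC"),
    id3 = 5:6
  )
  # One entry per column: "UTC" is expected for dt1 and dt2, NULL elsewhere.
  tz_by_col <- lapply(res, attr, "tzone")
}
```

Checking the whole list of tzones at once covers both directions of the concern: the POSIXct columns keep their tzone and the integer columns do not acquire one.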

MarkusBonsch · 2017-05-17T16:32:33Z

Dear RoyalTS,

Thanks a lot for taking care of this. One suggestion:
couldn't we extend this to all attributes by simply using
attribs <- lapply(l, attributes)
instead of
tzones = lapply(l, function(col) attr(col, 'tzone'))
and
attributes(l[[i]]) <- attribs[[i]]
instead of
setattr(l[[i]], 'tzone', tzones[[i]])
? This would be a consistent solution for any attributes that might come up in the future.

I would be happy to contribute, if you like.

Kind regards, Markus
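
The proposal above can be tried on a single vector in base R. This is a minimal sketch of the save-and-restore idea, not the code that ended up in CJ(); the custom attribute name `test` is just an example.

```r
# A POSIXct vector carries two attributes: class and tzone.
x <- as.POSIXct(c("2016-01-01", "2017-01-01"), tz = "UTC")
saved <- attributes(x)                 # remember everything, not only tzone

repeated <- rep.int(x, times = 2L)     # rep.int() drops all attributes
lost <- is.null(attributes(repeated))

attributes(repeated) <- saved          # restore class and tzone in one step

# The same pattern covers a hand-made attribute on a character vector.
g <- c("a", "b")
attr(g, "test") <- "testval"
g_rep <- rep.int(g, times = 3L)
attributes(g_rep) <- attributes(g)
```

Restoring the full attribute list means no per-attribute code is needed for class, tzone, levels or user-defined attributes.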

MarkusBonsch · 2017-05-18T07:27:02Z

Below I have posted a CJ implementation that consistently retains all attributes and addresses Matt's concerns. Additionally, I have proposed a comprehensive set of tests for the new implementation that also addresses Matt's comments.

I benchmarked a ten-column CJ input (2 rows in each column) to test whether the if-statement speeds things up:

expr	min	lq	mean	median	uq	max	neval
CJ_master	0.878273	1.209456	6.242917	1.762042	3.674923	185.8787	10000
CJ_attributes_with_if	0.871674	1.215139	6.244461	1.940190	3.676939	192.1245	10000
CJ_attributes_no_if	0.872408	1.209823	6.735057	1.850016	3.686286	184.1207	10000

Both implementations are as fast as master. I would therefore propose keeping the if, because it is a little safer: attributes are only updated when necessary.

There is, however, a speed issue when all ten columns have an attribute that needs to be replaced:

expr	min	lq	mean	median	uq	max	neval
CJ_master	0.905031	1.389070	4.556453	1.690564	3.792404	185.8065	10000
CJ_attributes_with_if	1.306778	2.585514	8.374277	4.480982	4.905639	187.9216	10000
CJ_attributes_no_if	1.275621	2.599443	8.327741	4.498394	4.911871	184.9495	10000

Again, there is no difference between the implementations with and without the if. It is not surprising that retaining attributes comes at a cost. The good thing is that this cost only arises when attributes are actually present.

Kind regards,
Markus

Here is the code:

# CJ implementation that retains attributes
CJ <- function(..., sorted = TRUE, unique = FALSE){
    # Pass in a list of unique values, e.g. ids and dates
    # Cross Join will then produce a join table with the combination of all values (cross product).
    # The last vector is varied the quickest in the table, so dates should be last for roll for example
    l = list(...)
    if (unique) l = lapply(l, unique)
    dups = FALSE # fix for #1513
    if (length(l)==1L && sorted && length(o <- forderv(l[[1L]])))
        l[[1L]] = l[[1L]][o]
    else if (length(l) > 1L) {
        # using rep.int instead of rep speeds things up considerably (but attributes are dropped).
        attribs = lapply(l, attributes)  # remember attributes for resetting after rep.int
        n = vapply(l, length, 0L)
        nrow = prod(n)
        x = c(rev(take(cumprod(rev(n)))), 1L)
        for (i in seq_along(x)) {
            y = l[[i]]
            # fix for #1513
            if (sorted) {
                if (length(o <- forderv(y, retGrp=TRUE))) y = y[o]
                if (!dups) dups = attr(o, 'maxgrpn') > 1L 
            }
            if (i == 1L) 
                l[[i]] = rep.int(y, times = rep.int(x[i], n[i]))   # i.e. rep(y, each=x[i])
            else if (i == length(n))
                l[[i]] = rep.int(y, times = nrow/(x[i]*n[i]))
            else
                l[[i]] = rep.int(rep.int(y, times = rep.int(x[i], n[i])), times = nrow/(x[i]*n[i]))
            if (!is.null(attribs[[i]])){
                attributes(l[[i]]) <- attribs[[i]] # reset all attributes that were destroyed by rep.int
            }
        }
    }
    setattr(l, "row.names", .set_row_names(length(l[[1L]])))
    setattr(l, "class", c("data.table", "data.frame"))

    if (is.null(vnames <- names(l))) 
        vnames = vector("character", length(l)) 
    if (any(tt <- vnames == "")) {
        vnames[tt] = paste("V", which(tt), sep="")
        setattr(l, "names", vnames)
    }
    l <- alloc.col(l)  # a tiny bit wasteful to over-allocate a fixed join table (column slots only), doing it anyway for consistency, and it's possible a user may wish to use SJ directly outside a join and would expect consistent over-allocation.
    if (sorted) {
        if (!dups) setattr(l, 'sorted', names(l)) 
        else setkey(l) # fix #1513
    }
    l
}
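
The repetition counts in the loop above can be followed more easily in a stripped-down base-R sketch. This is not data.table's implementation: `cross_columns` is a hypothetical helper, `head(., -1L)` is assumed to stand in for the internal take(), and the single general formula below replaces the three branches (for the first column the outer repeat count is 1, for the last column the inner one is).

```r
# Base-R sketch of the repetition arithmetic; not data.table's implementation.
cross_columns <- function(cols) {
  n <- vapply(cols, length, 0L)   # number of values per input column
  nrow <- prod(n)                 # rows in the full cross product
  # x[i] is the product of the lengths of all columns after column i.
  # head(., -1L) is assumed here to stand in for the internal take() helper.
  x <- c(rev(head(cumprod(rev(n)), -1L)), 1)
  for (i in seq_along(cols)) {
    # Repeat each value x[i] times, then repeat that block to fill nrow rows.
    each <- rep.int(cols[[i]], times = rep.int(x[i], n[i]))
    cols[[i]] <- rep.int(each, times = nrow / (x[i] * n[i]))
  }
  cols
}

# Lengths 2, 3 and 2 give x = (6, 2, 1) and 12 rows; the last column varies fastest.
res <- cross_columns(list(a = 1:2, b = c("x", "y", "z"), c = c(TRUE, FALSE)))
```

Sorting, duplicate detection and attribute restoration are left out on purpose so that only the rep.int() arithmetic remains.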

And here are the tests

# CJ retains attributes and classes, #2150

l <- list(a = as.POSIXct(c("2016-01-01", "2017-01-01"), tz = "UTC"),
          b = as.POSIXct(c("2016-01-01", "2017-01-01")),
          c = as.Date("2015-01-01"), ## according to comment about CJ losing Date class
          d = factor(c("a", "b", "c"), ordered = TRUE), ## according to comment about bug with ordered factors
          e = factor(c("a", "b", "c"), ordered = FALSE),
          f = c(1,2),
          g = c("a", "b"),
          h = c(TRUE, FALSE))
setattr(l$g, "test", "testval")## add hand-made attribute

test(1762.1, lapply(l, attributes), lapply(do.call(CJ, l), attributes))
test(1762.2, lapply(l, class), lapply(do.call(CJ, l), class))

l <- list(a = factor(c("a", "b", "c"), ordered = TRUE),
          b = as.POSIXct(c("2016-01-01", "2017-01-01")),
          c = as.Date("2015-01-01"),
          d = factor(c("a", "b", "c"), ordered = TRUE),
          e = as.POSIXct(c("2016-01-01", "2017-01-01"), tz = "UTC"),
          f = c(1,2),
          g = c("a", "b"),
          h = c(TRUE, FALSE))


test(1762.3, lapply(l, attributes), lapply(do.call(CJ, l), attributes))
test(1762.4, lapply(l, class), lapply(do.call(CJ, l), class))

l <- list(a = factor(c("a", "b", "c"), ordered = TRUE),
          b = as.POSIXct(c("2016-01-01", "2017-01-01")),
          c = as.Date("2015-01-01"),
          d = factor(c("a", "b", "c"), ordered = TRUE),
          e = c(TRUE, FALSE),
          f = c(1,2),
          g = c("a", "b"),
          h = as.POSIXct(c("2016-01-01", "2017-01-01"), tz = "UTC"))

test(1762.5, lapply(l, attributes), lapply(do.call(CJ, l), attributes))
test(1762.6, lapply(l, class), lapply(do.call(CJ, l), class))

l <- list(a = NA,
          c = c(1,2),
          d = as.POSIXct("2016-01-01", tz = "UTC"))

test(1762.7, lapply(l, attributes), lapply(do.call(CJ, l), attributes))
test(1762.8, lapply(l, class), lapply(do.call(CJ, l), class))

RoyalTS · 2017-05-18T18:17:24Z

This is great! How should we do this? Close this PR and you'll create one of your own?

MarkusBonsch · 2017-05-19T08:19:52Z

I am not familiar with GitHub. Can't you give me push rights for your PR? Or you could just insert the suggestions yourself and mention in the news that I also contributed.
As you prefer.

RoyalTS · 2017-05-19T17:31:29Z

Just made the changes, all tests are passing. Can we get another reviewer pass on this?

MarkusBonsch · 2017-05-29T12:09:17Z

Do you know why this AppVeyor check fails? It leaves a nasty red cross on the project.

RoyalTS · 2017-05-29T18:11:25Z

It was just more recent patches creating merge conflicts. All resolved; this is now mergeable again in principle.

retain timezones on CJ()

ee19734

mattdowle requested changes May 11, 2017

updated patch

375b8f6

due to MarkusBonsch

Merge branch 'master' into patch-2029

dd0e8e8

Merge branch 'master' into patch-2029

57db682

mattdowle added 2 commits August 4, 2017 15:45

Moved news item to end to workaround diff

7614d62

Merge branch 'master' into patch-2029

aa6efea

mattdowle approved these changes Aug 4, 2017

mattdowle added this to the v1.10.6 milestone Aug 4, 2017

Moved test to the end

9288e7e

mattdowle merged commit 93f2ce8 into Rdatatable:master Aug 4, 2017
