retain timezones on CJ()#2150

Merged
mattdowle merged 7 commits into Rdatatable:master from RoyalTS:patch-2029
Aug 4, 2017

Conversation

Contributor

@RoyalTS RoyalTS commented May 8, 2017

fixes #2029

codecov-io commented May 8, 2017

Codecov Report

Merging #2150 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2150      +/-   ##
==========================================
+ Coverage   90.75%   90.78%   +0.03%     
==========================================
  Files          59       59              
  Lines       11584    11585       +1     
==========================================
+ Hits        10513    10518       +5     
+ Misses       1071     1067       -4
Impacted Files Coverage Δ
R/setkey.R 93.51% <100%> (+0.02%) ⬆️
src/forder.c 94.47% <0%> (+0.52%) ⬆️

Member

@mattdowle mattdowle left a comment

Great spot and thanks for the pull. Comments inline.

Comment thread R/setkey.R Outdated
dups = FALSE # fix for #1513
# using rep.int instead of rep speeds things up considerably (but attributes are dropped).
j = lapply(l, class) # changed "vapply" to avoid errors with "ordered" "factor" input
tzones = lapply(l, function(col) attr(col, 'tzone'))

Member

This can be tzones = lapply(l, attr, 'tzone') please and can it be moved down inside the if() so that it is only calculated when needed?
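
As an illustration of the suggested simplification: `lapply()` forwards its extra arguments to the function it applies, so `lapply(l, attr, 'tzone')` should return the same list as the anonymous-function version. A minimal base-R sketch follows; the list `l` is made-up example data, not taken from the PR.

```r
# Hypothetical input list; only the first element carries a 'tzone' attribute.
l <- list(
  week = as.POSIXct("2016-01-01", tz = "UTC"),
  id   = 1:10
)

# Original form: anonymous function wrapping attr().
tzones_long <- lapply(l, function(col) attr(col, "tzone"))

# Suggested form: lapply() passes "tzone" on as the second argument of attr().
tzones_short <- lapply(l, attr, "tzone")

# Both forms produce the same named list (NULL for columns without a tzone).
identical(tzones_long, tzones_short)
```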

Contributor Author

Makes sense. Was trying to mirror the code that copies over classes. Any reason the j = ... line above my edit isn't inside the else if?

Comment thread R/setkey.R Outdated
l[[i]] = rep.int(rep.int(y, times = rep.int(x[i], n[i])), times = nrow/(x[i]*n[i]))
if (any(class(l[[i]]) != j[[i]]))
setattr(l[[i]], 'class', j[[i]]) # reset "Date" class - rep.int coerces to integer
if (any(!sapply(tzones, is.null))) {

Member

Are the tzone for all columns being looked at here, within a loop through columns? It doesn't seem quite right to me but I haven't debugged or run it. See next comment on tests.

Contributor

I think that this if-statement can be simply removed.
It serves no real purpose. If the tzones are all NULL, well, let them all be NULL and set them anyways.

Contributor Author

I suppose this could make sense if is.null is significantly faster than just setting the attributes even if they're NULL? In any case, the same would go for the lines that reset class immediately above?

Contributor

@MarkusBonsch MarkusBonsch May 17, 2017

The difference from the line about the class is that the latter tests each i individually.
So if we want to retain the tzone if-statement, it should read:
if(!is.null(tzones[[i]])){...}. That would make more sense in my opinion.
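
The point about testing per column can be sketched in base R. The global condition `any(!sapply(tzones, is.null))` has the same value on every loop iteration, whereas `!is.null(tzones[[i]])` depends on the column being processed. This is only a sketch under assumptions: the column list is made up, and base `attr<-` stands in for data.table's `setattr()` so that the snippet is self-contained.

```r
# Hypothetical columns: one POSIXct with a timezone, one plain integer.
l <- list(
  week = as.POSIXct(c("2016-01-01", "2016-01-08"), tz = "UTC"),
  id   = 1:3
)
tzones <- lapply(l, attr, "tzone")

for (i in seq_along(l)) {
  cls <- class(l[[i]])
  col <- rep.int(l[[i]], 2L)                           # rep.int drops class and tzone
  if (!identical(class(col), cls)) class(col) <- cls   # restore e.g. the POSIXct class
  if (!is.null(tzones[[i]]))                           # per-column check, as proposed
    attr(col, "tzone") <- tzones[[i]]
  l[[i]] <- col
}

attr(l$week, "tzone")   # the POSIXct column keeps its timezone
attr(l$id, "tzone")     # the integer column gets no tzone attribute
```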

Comment thread inst/tests/tests.Rraw Outdated

# CJ should retain timezone information, #2029
df <- CJ(week=as.POSIXct('2016-01-01', tz = 'UTC'), id=1:10)
test(1762, attr(df$week, 'tzone'), 'UTC')

Member

More tests please including multi-column data.table with 1 and 2 columns of POSIXct with other non-POSIXct (say plain integer) columns at the beginning, middle and end. Enough variations to test that the code is actually retaining the tzone on the desired columns and not adding tzones to columns that shouldn't, because I'm not sure it's right currently just by glancing at it. The test is just a single column.

Contributor Author

You mean something like CJ(dt1=as.POSIXct('2016-01-01', tz = 'UTC'), id=1:10, dt2=as.POSIXct(c('2016-01-01', '2016-01-02'), tz = 'UTC')) (behaves as expected) and various permutations of the order of these columns?

Contributor

MarkusBonsch commented May 17, 2017

Dear RoyalTS,

thanks a lot for taking care of this. One suggestion / proposal:
can't we extend this to all attributes by simply
attribs <- lapply(l, attributes)
instead of
tzones = lapply(l, function(col) attr(col, 'tzone'))
and
attributes(l[[i]]) <- attribs[[i]]
instead of
setattr(l[[i]], 'tzone', tzones[[i]])
? This would be a consistent solution for all attributes that might pop up in the future.

I would be happy to contribute, if you like.

Kind regards, Markus
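
The save-and-restore idea proposed above can be sketched on a single vector in base R. The example data and the hand-made `test` attribute are assumptions for illustration only; this is not the code that was merged.

```r
# Hypothetical vector with a class, a timezone and a hand-made attribute.
x <- as.POSIXct(c("2016-01-01", "2017-01-01"), tz = "UTC")
attr(x, "test") <- "testval"

saved <- attributes(x)    # remember every attribute before repeating
y <- rep.int(x, 2L)       # rep.int is fast but does not carry attributes over
attributes(y) <- saved    # restore class, tzone and the custom attribute at once

attr(y, "tzone")
attr(y, "test")
```

One caveat for this approach: attributes tied to the vector length, such as `names` or `dim`, would not fit the lengthened vector and would need separate handling, which is outside this sketch.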

Contributor

MarkusBonsch commented May 18, 2017

Below I have posted a CJ implementation that consistently retains all attributes and addresses Matt's concerns. I have also proposed a comprehensive set of tests for the new implementation, which addresses Matt's comments on testing.

I ran a benchmark on a ten-column CJ input (2 rows in each column) to test whether the if-statement speeds things up:

expr min lq mean median uq max neval
CJ_master 0.878273 1.209456 6.242917 1.762042 3.674923 185.8787 10000
CJ_attributes_with_if 0.871674 1.215139 6.244461 1.940190 3.676939 192.1245 10000
CJ_attributes_no_if 0.872408 1.209823 6.735057 1.850016 3.686286 184.1207 10000

Both implementations are just as fast as the master. Therefore, I would propose to keep the if because it is a little safer since attributes are only updated if necessary.

There is, however, a speed issue when all ten columns have an attribute that needs to be replaced:

expr min lq mean median uq max neval
CJ_master 0.905031 1.389070 4.556453 1.690564 3.792404 185.8065 10000
CJ_attributes_with_if 1.306778 2.585514 8.374277 4.480982 4.905639 187.9216 10000
CJ_attributes_no_if 1.275621 2.599443 8.327741 4.498394 4.911871 184.9495 10000

Again, there is no difference between the implementations with and without the if. It is not surprising that retaining attributes comes at a cost. The good thing is that this cost only arises when attributes are actually present.

Kind regards,
Markus

Here is the code:

# CJ implementation that retains attributes
CJ <- function(..., sorted = TRUE, unique = FALSE){
    # Pass in a list of unique values, e.g. ids and dates
    # Cross Join will then produce a join table with the combination of all values (cross product).
    # The last vector is varied the quickest in the table, so dates should be last for roll for example
    l = list(...)
    if (unique) l = lapply(l, unique)
    dups = FALSE # fix for #1513
    if (length(l)==1L && sorted && length(o <- forderv(l[[1L]])))
        l[[1L]] = l[[1L]][o]
    else if (length(l) > 1L) {
        # using rep.int instead of rep speeds things up considerably (but attributes are dropped).
        attribs = lapply(l, attributes)  # remember attributes for resetting after rep.int
        n = vapply(l, length, 0L)
        nrow = prod(n)
        x = c(rev(take(cumprod(rev(n)))), 1L)
        for (i in seq_along(x)) {
            y = l[[i]]
            # fix for #1513
            if (sorted) {
                if (length(o <- forderv(y, retGrp=TRUE))) y = y[o]
                if (!dups) dups = attr(o, 'maxgrpn') > 1L 
            }
            if (i == 1L) 
                l[[i]] = rep.int(y, times = rep.int(x[i], n[i]))   # i.e. rep(y, each=x[i])
            else if (i == length(n))
                l[[i]] = rep.int(y, times = nrow/(x[i]*n[i]))
            else
                l[[i]] = rep.int(rep.int(y, times = rep.int(x[i], n[i])), times = nrow/(x[i]*n[i]))
            if (!is.null(attribs[[i]])){
                attributes(l[[i]]) <- attribs[[i]] # reset all attributes that were destroyed by rep.int
            }
        }
    }
    setattr(l, "row.names", .set_row_names(length(l[[1L]])))
    setattr(l, "class", c("data.table", "data.frame"))

    if (is.null(vnames <- names(l))) 
        vnames = vector("character", length(l)) 
    if (any(tt <- vnames == "")) {
        vnames[tt] = paste("V", which(tt), sep="")
        setattr(l, "names", vnames)
    }
    l <- alloc.col(l)  # a tiny bit wasteful to over-allocate a fixed join table (column slots only), doing it anyway for consistency, and it's possible a user may wish to use SJ directly outside a join and would expect consistent over-allocation.
    if (sorted) {
        if (!dups) setattr(l, 'sorted', names(l)) 
        else setkey(l) # fix #1513
    }
    l
}

And here are the tests

# CJ retains attributes and classes, #2150

l <- list(a = as.POSIXct(c("2016-01-01", "2017-01-01"), tz = "UTC"),
          b = as.POSIXct(c("2016-01-01", "2017-01-01")),
          c = as.Date("2015-01-01"), ## according to comment about CJ losing date class
          d = factor(c("a", "b", "c"), ordered = TRUE), ## according to comment about bug with ordered factors
          e = factor(c("a", "b", "c"), ordered = FALSE),
          f = c(1,2),
          g = c("a", "b"),
          h = c(TRUE, FALSE))
setattr(l$g, "test", "testval")## add hand-made attribute

test(1762.1, lapply(l, attributes), lapply(do.call(CJ, l), attributes))
test(1762.2, lapply(l, class), lapply(do.call(CJ, l), class))

l <- list(a = factor(c("a", "b", "c"), ordered = TRUE),
          b = as.POSIXct(c("2016-01-01", "2017-01-01")),
          c = as.Date("2015-01-01"),
          d = factor(c("a", "b", "c"), ordered = TRUE),
          e = as.POSIXct(c("2016-01-01", "2017-01-01"), tz = "UTC"),
          f = c(1,2),
          g = c("a", "b"),
          h = c(TRUE, FALSE))


test(1762.3, lapply(l, attributes), lapply(do.call(CJ, l), attributes))
test(1762.4, lapply(l, class), lapply(do.call(CJ, l), class))

l <- list(a = factor(c("a", "b", "c"), ordered = TRUE),
          b = as.POSIXct(c("2016-01-01", "2017-01-01")),
          c = as.Date("2015-01-01"),
          d = factor(c("a", "b", "c"), ordered = TRUE),
          e = c(TRUE, FALSE),
          f = c(1,2),
          g = c("a", "b"),
          h = as.POSIXct(c("2016-01-01", "2017-01-01"), tz = "UTC"))

test(1762.5, lapply(l, attributes), lapply(do.call(CJ, l), attributes))
test(1762.6, lapply(l, class), lapply(do.call(CJ, l), class))

l <- list(a = NA,
          c = c(1,2),
          d = as.POSIXct("2016-01-01", tz = "UTC"))

test(1762.7, lapply(l, attributes), lapply(do.call(CJ, l), attributes))
test(1762.8, lapply(l, class), lapply(do.call(CJ, l), class))
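
The repetition counts used in the CJ implementation above (`x` and `nrow/(x[i]*n[i])`) can be illustrated with a small base-R sketch. The helper name `cross_columns` is hypothetical, and `head(., -1L)` is assumed to play the role of data.table's internal `take()` helper, which is not available outside the package. Attribute handling is left out here to keep the focus on how the columns are laid out.

```r
# Hypothetical helper: builds cross-join columns from a list of vectors,
# using the same repetition counts as the implementation above.
cross_columns <- function(l) {
  n    <- vapply(l, length, 0L)   # length of each input vector
  nrow <- prod(n)                 # number of rows in the cross product
  # each[i] is the product of the lengths of all later columns;
  # head(., -1L) is an assumed stand-in for the internal take().
  each <- c(rev(head(cumprod(rev(n)), -1L)), 1L)
  for (i in seq_along(l)) {
    # Repeat every value each[i] times, i.e. rep(y, each = each[i]) ...
    block <- rep.int(l[[i]], times = rep.int(each[i], n[i]))
    # ... then repeat the whole block until the column has nrow entries.
    l[[i]] <- rep.int(block, times = nrow / (each[i] * n[i]))
  }
  l
}

res <- cross_columns(list(a = 1:2, b = c("x", "y", "z")))
res$a   # 1 1 1 2 2 2
res$b   # "x" "y" "z" "x" "y" "z"
```

The last column varies fastest, which matches the comment in the implementation about placing dates last.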

Contributor Author

RoyalTS commented May 18, 2017

This is great! How should we do this? Close this PR and you'll create one of your own?

@MarkusBonsch
Contributor

I am not familiar with GitHub. Can you give me push rights for your PR? Or you could insert the suggestions yourself and mention in the NEWS that I also contributed.
As you prefer.

due to MarkusBonsch
Contributor Author

RoyalTS commented May 19, 2017

Just made the changes, all tests are passing. Can we get another reviewer pass on this?

@MarkusBonsch
Contributor

Do you know why this AppVeyor check fails? It leaves this nasty red cross on the project.

Contributor Author

RoyalTS commented May 29, 2017

Just more recent patches creating merge conflicts. All resolved, this is now mergeable again in principle.

@mattdowle mattdowle added this to the v1.10.6 milestone Aug 4, 2017
@mattdowle mattdowle merged commit 93f2ce8 into Rdatatable:master Aug 4, 2017

Development

Successfully merging this pull request may close these issues.

[Bug] CJ() looses timezone of POSIXct vector

4 participants