Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Jun 27, 2022

This adds lubridate-like qday function. Counts number of days elapsed since beginning of the quarter.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@rok rok requested a review from paleolimbot June 27, 2022 18:45
@paleolimbot
Copy link
Member

(I'm on vacation this week but look forward to taking a look on Monday!)

@rok rok requested a review from thisisnic June 28, 2022 07:00
thisisnic
thisisnic previously approved these changes Jun 28, 2022
Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just one comment needs adding to clarify things, but otherwise looks fine to me. Thanks!

@rok rok force-pushed the ARROW-14819 branch 2 times, most recently from 409b3e1 to b1926a3 Compare June 29, 2022 00:11
@thisisnic thisisnic dismissed their stale review June 29, 2022 08:06

More updates were needed - review no longer valid

@rok
Copy link
Member Author

rok commented Jun 29, 2022

@dragosmg
I added 2 suggestions.

Merged, thank you!

@rok rok requested a review from thisisnic June 30, 2022 20:06
@rok rok force-pushed the ARROW-14819 branch 2 times, most recently from 4cb6f6b to 37ee4c1 Compare July 5, 2022 18:20
@rok
Copy link
Member Author

rok commented Jul 5, 2022

@thisisnic after the rebase this is now (almost) c++ free :).

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than test_df, it might make sense for this binding to use a sequence of dates (or datetimes) that spans a year (or maybe that spans a year and a leap year if that matters here). Maybe something like

tibble::tibble(date = seq(as.Date("2000-01-01"), as.Date("2000-12-31"), by = "day"))
#> # A tibble: 366 × 1
#>    date      
#>    <date>    
#>  1 2000-01-01
#>  2 2000-01-02
#>  3 2000-01-03
#>  4 2000-01-04
#>  5 2000-01-05
#>  6 2000-01-06
#>  7 2000-01-07
#>  8 2000-01-08
#>  9 2000-01-09
#> 10 2000-01-10
#> # … with 356 more rows
tibble::tibble(datetime = seq(as.POSIXct("2000-01-01 00:00:00"), as.POSIXct("2000-12-31 23:00:00"), by = "day"))
#> # A tibble: 366 × 1
#>    datetime           
#>    <dttm>             
#>  1 2000-01-01 00:00:00
#>  2 2000-01-02 00:00:00
#>  3 2000-01-03 00:00:00
#>  4 2000-01-04 00:00:00
#>  5 2000-01-05 00:00:00
#>  6 2000-01-06 00:00:00
#>  7 2000-01-07 00:00:00
#>  8 2000-01-08 00:00:00
#>  9 2000-01-09 00:00:00
#> 10 2000-01-10 00:00:00
#> # … with 356 more rows

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the idea of more complete tests! I would actually propose a development-time test suite (it seems overkill for CI) that tests every moment over the past century.

The test you're proposing however hits this bug where rounding kernels interpret 32 bit arrays as 64 bit ones (ARROW-16142) so I suppose we really need to fix this now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could try that and see how long it takes...it might only be a few ms and then I'd say keep it in the regular test suite. We do have a mechanism for running extra tests but right now it's limited to the large memory tests (via the env var ARROW_LARGE_MEMORY_TESTS). Given that a single real-world poke at this exposed an error, I'd say at least a year is a must in our normal test suite.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, why not. I'll give it a try.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, ARROW-16142 was resolved so this is now ready.

@rok rok force-pushed the ARROW-14819 branch 2 times, most recently from e372aec to c0a8da9 Compare July 11, 2022 13:45
@rok rok requested a review from paleolimbot July 11, 2022 13:48
@rok rok requested a review from jonkeane July 15, 2022 10:23
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. One very minor suggestion for even more comment 🚲 🏠 ing (do feel free to ignore it thought, it is really minor)

Comment on lines 213 to 217
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely minor, and take it or leave it: but it did take me a second to think through why we are adding 1 here. Maybe we could add it to the comment up above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# We calculate day of quarter by flooring timestamp to beginning of quarter and
# calculating days between beginning of quarter and timestamp/date in question.
floored_x <- build_expr("floor_temporal", x, options = list(unit = 9L))
build_expr("days_between", floored_x, x) + Expression$scalar(1L)
# We calculate day of quarter by flooring timestamp to beginning of quarter and
# calculating days between beginning of quarter and timestamp/date in question.
# Since we use one one-based numbering we add one.
floored_x <- build_expr("floor_temporal", x, options = list(unit = 9L))
build_expr("days_between", floored_x, x) + Expression$scalar(1L)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonkeane that's a fair question especially since documentation on qday is not really available.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one test will be better if you make the timestamp UTC but other than that this looks good to me!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would seq(as.POSIXct("1999-12-31", tz = "UTC"), as.POSIXct("2001-01-01", tz = "UTC"), by = "day") let you drop the ignore_attr bit below? As a new reader of this code I'm wondering why the timezone needs to be ignored.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point @paleolimbot! It's as if calling qday with mutate in compare_dplyr_binding returns an int64 with tzone = "UTC". Meanwhile calling it with transmute returns correctly.

Would this be considered a sharp edge?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that's an error in restoring the R metadata (this PR doesn't touch R attributes to my reading). It would be helpful to file a reprex() in a JIRA so that we can fix this later (an int64 should never have a timezone attribute!), but it's definitely not this PR's problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I'll open a Jira and merge this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rok rok force-pushed the ARROW-14819 branch 3 times, most recently from 3a9fba4 to cbfb599 Compare July 19, 2022 16:27
@rok rok requested a review from paleolimbot July 19, 2022 16:28
@rok rok merged commit 0330353 into apache:master Jul 20, 2022
@rok rok deleted the ARROW-14819 branch July 20, 2022 07:54
@ursabot
Copy link

ursabot commented Jul 21, 2022

Benchmark runs are scheduled for baseline = 39980dc and contender = 0330353. 0330353 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.1% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.36% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] 0330353a ec2-t3-xlarge-us-east-2
[Failed] 0330353a test-mac-arm
[Failed] 0330353a ursa-i9-9960x
[Finished] 0330353a ursa-thinkcentre-m75q
[Failed] 39980dcd ec2-t3-xlarge-us-east-2
[Failed] 39980dcd test-mac-arm
[Failed] 39980dcd ursa-i9-9960x
[Finished] 39980dcd ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants