-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13138: [C++][R] Implement extract temporal components (year, month, day, etc) from date32/64 types #11075
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this, this looks like a great start. I think my main questions are 1) whether we should support strftime, hour, minute, etc. on dates as I don't think they're meaningful, and 2) whether we can share the implementation of ISOCalendar instead of duplicating it.
| @@ -466,8 +492,90 @@ struct Nanosecond { | |||
| // Convert timestamps to a string representation with an arbitrary format | |||
|
|
|||
| #ifndef _WIN32 | |||
| template <typename Duration> | |||
| template <typename Duration, typename InType> | |||
| struct Strftime { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: since this is for dates only, maybe have the base struct Strftime have no implementation, and implement both dates and timestamps as specializations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, given that the regular strftime kernel won't format timestamps without timezones, I wonder if it's meaningful to have strftime work on dates. I would say if you want a string representation of a date, you should cast, since strftime would let you do misleading things like trying to extract the hour from a date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right. Do you know if we have tests for casting date32/64 to strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do, they're a bit thin though:
| TEST(Cast, DateToString) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, given that the regular strftime kernel won't format timestamps without timezones, I wonder if it's meaningful to have strftime work on dates.
There's work to have Strftime work on timestamps without timezones (assuming UTC)
#10998. It's a philosophical discussion if it's correct or not but it will be available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the back-and-forth @aucahuasi. However, it might be prudent to split strftime support into a separate JIRA to minimize conflicts with the existing PR and so we can evaluate how best to share the implementation (to avoid too much code duplication).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries @lidavidm :)
Indeed, it makes sense to have a new jira. I can create the ticket if there is none.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's a JIRA so please do file one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the jira ticket for strftime, thanks @lidavidm @rok
https://issues.apache.org/jira/browse/ARROW-13916
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Strftime ticket was merged yesterday so maybe something has changed here. Dunno.
| @@ -588,8 +701,58 @@ inline std::array<int64_t, 3> GetIsoCalendar(int64_t arg, Localizer&& localizer) | |||
| static_cast<int64_t>(weekday(ymd).iso_encoding())}; | |||
| } | |||
|
|
|||
| template <typename Duration> | |||
| template <typename Duration, typename InType> | |||
| struct ISOCalendar { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to avoid specialization by instead making the helper functions used work with timestamps and dates? For instance, you could modify GetInputTimezone to be GetInputTimezone<InType> where it would always return an empty string for InType == Date32Type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I suppose you'd also need to change functions to templates to be able to specialize things like that...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lidavidm !
I think the provided implementation (using specialization) is good enough.
Please note that this function is returning a const ref string:
const std::string& GetInputTimezone
So the compiler will throw an error if we want to return local empty strings.
And if I change the declaration to return a string (without const ref) we will be copying the timezones instead of using the refs.
Also, for the case of ISOCalendar with TimestampType we had this code:
string timezone = GetInputTimezone
and we had been performing a copy.
So I changed that to
const auto& timezone = GetInputTimezone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could get around that with a static string somewhere, or by returning const std::string*. Or the actual implementation could be factored out, and the date kernel could call it with a literal string. At the very least, I'm not super enthused about essentially copy-pasting this kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that's a good idea, I can try to have a single implementation.
Where should I put the empty static string? in which file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would go in this file, though FWIW, it would probably be easiest to just factor things into functions since these are all static methods anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lidavidm I think this last version reuse the kernel logic and it specializes only the parts related to the timestamp logic. Please let me know if it is good enough.
TODO: I need to fix the some R tests (I'll tackle this tomorrow)
| @@ -670,26 +833,36 @@ std::shared_ptr<ScalarFunction> MakeTemporal( | |||
| auto func = | |||
| std::make_shared<ScalarFunction>(name, Arity::Unary(), doc, default_options); | |||
|
|
|||
| { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Frankly I think kernels like hour should just not work on dates - it's not meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense, pandas doesn't has that behavior too
pd.to_datetime("2019-01-03", format='%Y-%m-%d', errors='coerce').date().hour()
AttributeError: 'datetime.date' object has no attribute 'hour'I'll remove the support of date32/64 for hour, min, sec ...
| class ExecTemplate, | ||
| typename OutType> | ||
| std::shared_ptr<ScalarFunction> MakeTemporal( | ||
| std::shared_ptr<ScalarFunction> MakeTemporal(bool enable_date, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a bool, I think it'd be cleaner to split the date kernel registration into a separate function:
template <
template <typename...> class Op,
template <template <typename...> class OpExec, typename Duration, typename InType, typename OutType>
class ExecTemplate,
typename OutType>
std::shared_ptr<ScalarFunction> AddDateKernels(ScalarFunction* func, const std::shared_ptr<arrow::DataType>& out_type, KernelInit init = nullptr) {
{
auto exec = ExecTemplate<Op, days, Date32Type, OutType>::Exec;
DCHECK_OK(func->AddKernel({date32()}, out_Type, std::move(exec), init);
}
{
auto exec = ExecTemplate<Op, std::chrono::milliseconds, Date64Type, OutType>::Exec;
DCHECK_OK(func->AddKernel({date64()}, out_Type, std::move(exec), init);
}
}Then you can just call (for instance) AddDateKernels<Year, TemporalComponentExtract, Int64Type>(year.get(), int64()); below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, but I suggest to keep the provided implementation in this PR:
- It requires less code/It change less parts.
- It integrates well into the
MakeTemporal*functions. - It is setting up the date kernels (optionally) at the moment of creating the function.
Also, I think that if I add AddDateKernels we would need 2 versions: one for MakeTemporal and other for MakeSimpleUnaryTemporal.
So I think it makes sense to keep the current code. Let me know what do you think here @lidavidm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it's reasonable. In that case, below, can we do the following, so it's clear what the bare boolean parameter is for?
auto quarter = MakeTemporal<Quarter, TemporalComponentExtract, Int64Type>(
/*enable_date=*/true, "quarter", int64(), &quarter_doc);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will note though, a struct like this can collapse MakeTemporal and MakeSimpleUnaryTemporal into one function:
template <template <typename...> class Op, typename Duration, typename InType>
struct SimpleUnaryTemporal {
static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
return SimpleUnary<Op<Duration, InType>>(ctx, batch, out);
}
};You would then replace OutType with typename... Args in MakeTemporal and you could replace MakeSimpleUnaryTemporal<Foo> with MakeTemporal<Foo, SimpleUnaryTemporal>. Then you wouldn't need two versions of AddDateKernels.
Also note that the function creation is only done once on initialization, so it doesn't really matter whether it's done in the same function or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I added /*enable_date=*/ to indicate the intention of the boolean parameter.
Regarding the SimpleUnaryTemporal struct, sounds like a nice idea for a future refactor, however if you feel strongly about carrying out these sort of changes in this PR, let me know please!
| @@ -588,8 +701,58 @@ inline std::array<int64_t, 3> GetIsoCalendar(int64_t arg, Localizer&& localizer) | |||
| static_cast<int64_t>(weekday(ymd).iso_encoding())}; | |||
| } | |||
|
|
|||
| template <typename Duration> | |||
| template <typename Duration, typename InType> | |||
| struct ISOCalendar { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could get around that with a static string somewhere, or by returning const std::string*. Or the actual implementation could be factored out, and the date kernel could call it with a literal string. At the very least, I'm not super enthused about essentially copy-pasting this kernel.
| class ExecTemplate, | ||
| typename OutType> | ||
| std::shared_ptr<ScalarFunction> MakeTemporal( | ||
| std::shared_ptr<ScalarFunction> MakeTemporal(bool enable_date, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it's reasonable. In that case, below, can we do the following, so it's clear what the bare boolean parameter is for?
auto quarter = MakeTemporal<Quarter, TemporalComponentExtract, Int64Type>(
/*enable_date=*/true, "quarter", int64(), &quarter_doc);
| @@ -588,20 +614,85 @@ inline std::array<int64_t, 3> GetIsoCalendar(int64_t arg, Localizer&& localizer) | |||
| static_cast<int64_t>(weekday(ymd).iso_encoding())}; | |||
| } | |||
|
|
|||
| template <typename Duration, typename InType> | |||
| struct ISOCalendarWrapper { | |||
| static Status get(const Scalar& in, std::array<int64_t, 3>& iso_calendar) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this factorization is good, however, just a nit: methods should use UpperCamelCase unless they're const getters so this should be called Get.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| class ExecTemplate, | ||
| typename OutType> | ||
| std::shared_ptr<ScalarFunction> MakeTemporal( | ||
| std::shared_ptr<ScalarFunction> MakeTemporal(bool enable_date, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will note though, a struct like this can collapse MakeTemporal and MakeSimpleUnaryTemporal into one function:
template <template <typename...> class Op, typename Duration, typename InType>
struct SimpleUnaryTemporal {
static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
return SimpleUnary<Op<Duration, InType>>(ctx, batch, out);
}
};You would then replace OutType with typename... Args in MakeTemporal and you could replace MakeSimpleUnaryTemporal<Foo> with MakeTemporal<Foo, SimpleUnaryTemporal>. Then you wouldn't need two versions of AddDateKernels.
Also note that the function creation is only done once on initialization, so it doesn't really matter whether it's done in the same function or not.
|
For the failed R test, it looks like it expected arrow/r/tests/testthat/test-dplyr-lubridate.R Lines 37 to 48 in 2ef0edc
|
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
One last thing that I forgot earlier, sorry: we should update compute.rst as well: https://github.com/apache/arrow/blob/f40856a768f2c397082da70af034080994587807/docs/source/cpp/compute.rst#temporal-component-extraction
|
Oh, it also seems we need to rebase here. |
8f366d7 to
19b1cf6
Compare
19b1cf6 to
a839acd
Compare
@lidavidm Thank you! I fixed the issue with the R test and added some additional R tests! |
a839acd to
f7c9e3a
Compare
Done! |
130ea03 to
41d6beb
Compare
|
It seems there are some build issues on windows. Tomorrow I'll check that! |
|
Note the |
41d6beb to
49ee568
Compare
…functions Reuse most of the iso_calendar algorithm for different types cleaning & minor changes fix and add R tests for year, month, etc functions on date types update cpp docs for temporal compute functions c++ format format all fix windows build fix TestNonexistentTimezone test and clang errors
49ee568 to
a47b78a
Compare
Thanks, fixed! |
|
Looks good. Appveyor fail doesn't seem related. It would probably be good to also add Python tests. |
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you.
I filed https://issues.apache.org/jira/browse/ARROW-13957 for the flaky AppVeyor test.
I'm not sure if we need explicit Python tests, maybe comparing against Pandas would be nice, but unlike R we don't have higher level bindings we're trying to test.
|
|
||
| template <typename Duration, typename InType> | ||
| struct ISOCalendarWrapper { | ||
| static Status Get(const Scalar& in, std::array<int64_t, 3>& iso_calendar) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For later: either pass a const reference or a non-const pointer. Non-const references require more care from the caller.
| struct ISOCalendarVisitValueFunction { | ||
| static Status Get(std::vector<BuilderType*>& field_builders, const ArrayData&, | ||
| StructBuilder* struct_builder, | ||
| std::function<Status(typename InType::c_type arg)>& visit_value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: visit_value should be passed by pointer.
|
Thanks @pitrou, I filed ARROW-13960 |
…nth, day, etc) from date32/64 types https://issues.apache.org/jira/browse/ARROW-13138 Closes apache#11075 from aucahuasi/temporal-functions-date32 Authored-by: Percy Camilo Triveño Aucahuasi <percy.camilo.ta@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
https://issues.apache.org/jira/browse/ARROW-13138