Skip to content

Conversation

@aucahuasi
Copy link
Contributor

@github-actions
Copy link

github-actions bot commented Sep 3, 2021

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@aucahuasi aucahuasi changed the title Implement extract temporal components (year, month, day, etc) from date types Implement extract temporal components (year, month, day, etc) from date32/64 types Sep 3, 2021
@lidavidm lidavidm self-requested a review September 3, 2021 12:37
@lidavidm lidavidm changed the title Implement extract temporal components (year, month, day, etc) from date32/64 types ARROW-13138: [C++] Implement extract temporal components (year, month, day, etc) from date32/64 types Sep 3, 2021
@github-actions
Copy link

github-actions bot commented Sep 3, 2021

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this, this looks like a great start. I think my main questions are 1) whether we should support strftime, hour, minute, etc. on dates as I don't think they're meaningful, and 2) whether we can share the implementation of ISOCalendar instead of duplicating it.

@@ -466,8 +492,90 @@ struct Nanosecond {
// Convert timestamps to a string representation with an arbitrary format

#ifndef _WIN32
template <typename Duration>
template <typename Duration, typename InType>
struct Strftime {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since this is for dates only, maybe have the base struct Strftime have no implementation, and implement both dates and timestamps as specializations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, given that the regular strftime kernel won't format timestamps without timezones, I wonder if it's meaningful to have strftime work on dates. I would say if you want a string representation of a date, you should cast, since strftime would let you do misleading things like trying to extract the hour from a date.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right. Do you know if we have tests for casting date32/64 to strings?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do, they're a bit thin though:

TEST(Cast, DateToString) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, given that the regular strftime kernel won't format timestamps without timezones, I wonder if it's meaningful to have strftime work on dates.

There's work to have Strftime work on timestamps without timezones (assuming UTC)
#10998. It's a philosophical discussion if it's correct or not but it will be available.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the back-and-forth @aucahuasi. However, it might be prudent to split strftime support into a separate JIRA to minimize conflicts with the existing PR and so we can evaluate how best to share the implementation (to avoid too much code duplication).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries @lidavidm :)
Indeed, it makes sense to have a new jira. I can create the ticket if there is none.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a JIRA so please do file one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the jira ticket for strftime, thanks @lidavidm @rok
https://issues.apache.org/jira/browse/ARROW-13916

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Strftime ticket was merged yesterday so maybe something has changed here. Dunno.

@@ -588,8 +701,58 @@ inline std::array<int64_t, 3> GetIsoCalendar(int64_t arg, Localizer&& localizer)
static_cast<int64_t>(weekday(ymd).iso_encoding())};
}

template <typename Duration>
template <typename Duration, typename InType>
struct ISOCalendar {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to avoid specialization by instead making the helper functions used work with timestamps and dates? For instance, you could modify GetInputTimezone to be GetInputTimezone<InType> where it would always return an empty string for InType == Date32Type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I suppose you'd also need to change functions to templates to be able to specialize things like that...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lidavidm !

I think the provided implementation (using specialization) is good enough.
Please note that this function is returning a const ref string:
const std::string& GetInputTimezone
So the compiler will throw an error if we want to return local empty strings.
And if I change the declaration to return a string (without const ref) we will be copying the timezones instead of using the refs.

Also, for the case of ISOCalendar with TimestampType we had this code:
string timezone = GetInputTimezone
and we had been performing a copy.
So I changed that to
const auto& timezone = GetInputTimezone

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could get around that with a static string somewhere, or by returning const std::string*. Or the actual implementation could be factored out, and the date kernel could call it with a literal string. At the very least, I'm not super enthused about essentially copy-pasting this kernel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's a good idea, I can try to have a single implementation.
Where should I put the empty static string? in which file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would go in this file, though FWIW, it would probably be easiest to just factor things into functions since these are all static methods anyways.

Copy link
Contributor Author

@aucahuasi aucahuasi Sep 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm I think this last version reuse the kernel logic and it specializes only the parts related to the timestamp logic. Please let me know if it is good enough.
TODO: I need to fix the some R tests (I'll tackle this tomorrow)

@@ -670,26 +833,36 @@ std::shared_ptr<ScalarFunction> MakeTemporal(
auto func =
std::make_shared<ScalarFunction>(name, Arity::Unary(), doc, default_options);

{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly I think kernels like hour should just not work on dates - it's not meaningful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, pandas doesn't has that behavior too

pd.to_datetime("2019-01-03", format='%Y-%m-%d', errors='coerce').date().hour()
AttributeError: 'datetime.date' object has no attribute 'hour'

I'll remove the support of date32/64 for hour, min, sec ...

@aucahuasi aucahuasi marked this pull request as ready for review September 3, 2021 15:53
class ExecTemplate,
typename OutType>
std::shared_ptr<ScalarFunction> MakeTemporal(
std::shared_ptr<ScalarFunction> MakeTemporal(bool enable_date,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a bool, I think it'd be cleaner to split the date kernel registration into a separate function:

template <
    template <typename...> class Op,
    template <template <typename...> class OpExec, typename Duration, typename InType, typename OutType>
    class ExecTemplate,
    typename OutType>
std::shared_ptr<ScalarFunction> AddDateKernels(ScalarFunction* func, const std::shared_ptr<arrow::DataType>& out_type, KernelInit init = nullptr) {
  {
    auto exec = ExecTemplate<Op, days, Date32Type, OutType>::Exec;
    DCHECK_OK(func->AddKernel({date32()}, out_Type, std::move(exec), init);
  }
  {
    auto exec = ExecTemplate<Op, std::chrono::milliseconds, Date64Type, OutType>::Exec;
    DCHECK_OK(func->AddKernel({date64()}, out_Type, std::move(exec), init);
  }
}

Then you can just call (for instance) AddDateKernels<Year, TemporalComponentExtract, Int64Type>(year.get(), int64()); below.

Copy link
Contributor Author

@aucahuasi aucahuasi Sep 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but I suggest to keep the provided implementation in this PR:

  • It requires less code/It change less parts.
  • It integrates well into the MakeTemporal* functions.
  • It is setting up the date kernels (optionally) at the moment of creating the function.

Also, I think that if I add AddDateKernels we would need 2 versions: one for MakeTemporal and other for MakeSimpleUnaryTemporal.
So I think it makes sense to keep the current code. Let me know what do you think here @lidavidm

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it's reasonable. In that case, below, can we do the following, so it's clear what the bare boolean parameter is for?

  auto quarter = MakeTemporal<Quarter, TemporalComponentExtract, Int64Type>(
      /*enable_date=*/true, "quarter", int64(), &quarter_doc);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will note though, a struct like this can collapse MakeTemporal and MakeSimpleUnaryTemporal into one function:

template <template <typename...> class Op, typename Duration, typename InType>
struct SimpleUnaryTemporal {
  static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
    return SimpleUnary<Op<Duration, InType>>(ctx, batch, out);
  }
};

You would then replace OutType with typename... Args in MakeTemporal and you could replace MakeSimpleUnaryTemporal<Foo> with MakeTemporal<Foo, SimpleUnaryTemporal>. Then you wouldn't need two versions of AddDateKernels.

Also note that the function creation is only done once on initialization, so it doesn't really matter whether it's done in the same function or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I added /*enable_date=*/ to indicate the intention of the boolean parameter.
Regarding the SimpleUnaryTemporal struct, sounds like a nice idea for a future refactor, however if you feel strongly about carrying out these sort of changes in this PR, let me know please!

@@ -588,8 +701,58 @@ inline std::array<int64_t, 3> GetIsoCalendar(int64_t arg, Localizer&& localizer)
static_cast<int64_t>(weekday(ymd).iso_encoding())};
}

template <typename Duration>
template <typename Duration, typename InType>
struct ISOCalendar {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could get around that with a static string somewhere, or by returning const std::string*. Or the actual implementation could be factored out, and the date kernel could call it with a literal string. At the very least, I'm not super enthused about essentially copy-pasting this kernel.

class ExecTemplate,
typename OutType>
std::shared_ptr<ScalarFunction> MakeTemporal(
std::shared_ptr<ScalarFunction> MakeTemporal(bool enable_date,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it's reasonable. In that case, below, can we do the following, so it's clear what the bare boolean parameter is for?

  auto quarter = MakeTemporal<Quarter, TemporalComponentExtract, Int64Type>(
      /*enable_date=*/true, "quarter", int64(), &quarter_doc);

@@ -588,20 +614,85 @@ inline std::array<int64_t, 3> GetIsoCalendar(int64_t arg, Localizer&& localizer)
static_cast<int64_t>(weekday(ymd).iso_encoding())};
}

template <typename Duration, typename InType>
struct ISOCalendarWrapper {
static Status get(const Scalar& in, std::array<int64_t, 3>& iso_calendar) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this factorization is good, however, just a nit: methods should use UpperCamelCase unless they're const getters so this should be called Get.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks @lidavidm @rok !

class ExecTemplate,
typename OutType>
std::shared_ptr<ScalarFunction> MakeTemporal(
std::shared_ptr<ScalarFunction> MakeTemporal(bool enable_date,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will note though, a struct like this can collapse MakeTemporal and MakeSimpleUnaryTemporal into one function:

template <template <typename...> class Op, typename Duration, typename InType>
struct SimpleUnaryTemporal {
  static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
    return SimpleUnary<Op<Duration, InType>>(ctx, batch, out);
  }
};

You would then replace OutType with typename... Args in MakeTemporal and you could replace MakeSimpleUnaryTemporal<Foo> with MakeTemporal<Foo, SimpleUnaryTemporal>. Then you wouldn't need two versions of AddDateKernels.

Also note that the function creation is only done once on initialization, so it doesn't really matter whether it's done in the same function or not.

@lidavidm
Copy link
Member

lidavidm commented Sep 7, 2021

For the failed R test, it looks like it expected year(date) to fail but of course that's no longer the case. Seems this can be changed to expect_dplyr_equal:

# We can support this feature when ARROW-13138 is resolved
test_that("date32 objects are not supported", {
date <- ymd("2017-01-01")
df <- tibble::tibble(date = date)
expect_error(
Table$create(df) %>%
mutate(x = year(date)) %>%
collect(),
"Function year has no kernel matching input types"
)
})

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm
Copy link
Member

lidavidm commented Sep 7, 2021

Oh, it also seems we need to rebase here.

@aucahuasi aucahuasi force-pushed the temporal-functions-date32 branch 2 times, most recently from 8f366d7 to 19b1cf6 Compare September 7, 2021 17:50
@aucahuasi aucahuasi changed the title ARROW-13138: [C++] Implement extract temporal components (year, month, day, etc) from date32/64 types ARROW-13138: [C++][R] Implement extract temporal components (year, month, day, etc) from date32/64 types Sep 7, 2021
@aucahuasi aucahuasi force-pushed the temporal-functions-date32 branch from 19b1cf6 to a839acd Compare September 7, 2021 23:45
@aucahuasi
Copy link
Contributor Author

For the failed R test, it looks like it expected year(date) to fail but of course that's no longer the case.
Seems this can be changed to expect_dplyr_equal:

@lidavidm Thank you! I fixed the issue with the R test and added some additional R tests!

@aucahuasi aucahuasi force-pushed the temporal-functions-date32 branch from a839acd to f7c9e3a Compare September 7, 2021 23:54
@aucahuasi
Copy link
Contributor Author

One last thing that I forgot earlier, sorry: we should update compute.rst as well:

Done!

@aucahuasi aucahuasi force-pushed the temporal-functions-date32 branch 2 times, most recently from 130ea03 to 41d6beb Compare September 8, 2021 03:09
@aucahuasi
Copy link
Contributor Author

aucahuasi commented Sep 8, 2021

It seems there are some build issues on windows. Tomorrow I'll check that!

@lidavidm
Copy link
Member

lidavidm commented Sep 8, 2021

Note the TestNonexistentTimezone test is also failing on Linux.

@aucahuasi aucahuasi force-pushed the temporal-functions-date32 branch from 41d6beb to 49ee568 Compare September 8, 2021 22:18
…functions

Reuse most of the iso_calendar algorithm for different types

cleaning & minor changes

fix and add R tests for year, month, etc functions on date types

update cpp docs for temporal compute functions

c++ format

format all

fix windows build

fix TestNonexistentTimezone test and clang errors
@aucahuasi aucahuasi force-pushed the temporal-functions-date32 branch from 49ee568 to a47b78a Compare September 8, 2021 23:22
@aucahuasi
Copy link
Contributor Author

Note the TestNonexistentTimezone test is also failing on Linux.

Thanks, fixed!

@rok
Copy link
Member

rok commented Sep 9, 2021

Looks good. Appveyor fail doesn't seem related.

It would probably be good to also add Python tests.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you.

I filed https://issues.apache.org/jira/browse/ARROW-13957 for the flaky AppVeyor test.

I'm not sure if we need explicit Python tests, maybe comparing against Pandas would be nice, but unlike R we don't have higher level bindings we're trying to test.

@lidavidm lidavidm closed this in 4b5ed4e Sep 9, 2021

template <typename Duration, typename InType>
struct ISOCalendarWrapper {
static Status Get(const Scalar& in, std::array<int64_t, 3>& iso_calendar) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For later: either pass a const reference or a non-const pointer. Non-const references require more care from the caller.

struct ISOCalendarVisitValueFunction {
static Status Get(std::vector<BuilderType*>& field_builders, const ArrayData&,
StructBuilder* struct_builder,
std::function<Status(typename InType::c_type arg)>& visit_value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: visit_value should be passed by pointer.

@lidavidm
Copy link
Member

lidavidm commented Sep 9, 2021

Thanks @pitrou, I filed ARROW-13960

@aucahuasi
Copy link
Contributor Author

Thanks @lidavidm @rok @pitrou !

ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
…nth, day, etc) from date32/64 types

https://issues.apache.org/jira/browse/ARROW-13138

Closes apache#11075 from aucahuasi/temporal-functions-date32

Authored-by: Percy Camilo Triveño Aucahuasi <percy.camilo.ta@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants