@edponce
Contributor

@edponce edponce commented May 17, 2021

This PR adds rounding compute functions, namely "round" and "round_to_multiple".

  • round(x, RoundOptions(ndigits, round_mode)) - round x to the precision indicated by ndigits
  • round_to_multiple(x, RoundToMultipleOptions(multiple, round_mode)) - round x to a multiple of the value given by multiple

Rounding modes supported are: DOWN, UP, TOWARDS_ZERO, TOWARDS_INFINITY, HALF_DOWN, HALF_UP, HALF_TOWARDS_ZERO, HALF_TOWARDS_INFINITY, HALF_TO_EVEN, HALF_TO_ODD.
The tie-breaking (HALF_*) modes round to the nearest integer and differ only in how they resolve ties. The default mode is HALF_TO_EVEN.

The rounding functions expect floating-point inputs and return output of the same type. Integral inputs are implicitly cast, and the output is float64.
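
As a rough illustration of the description above, here is a minimal pure-Python sketch of the ten rounding modes and of how ndigits and multiple scale the value before rounding. It is not the Arrow kernel code: the function names (round_to_integer, round_digits, round_to_multiple) are hypothetical, and the meaning of each mode is an assumption inferred from its name (DOWN/UP as floor/ceil, TOWARDS_INFINITY as away from zero).

```python
import math
from enum import Enum


class RoundMode(Enum):
    # Assumed semantics, inferred from the mode names in the PR description.
    DOWN = "down"                                    # toward -infinity (floor)
    UP = "up"                                        # toward +infinity (ceil)
    TOWARDS_ZERO = "towards_zero"                    # truncate
    TOWARDS_INFINITY = "towards_infinity"            # away from zero
    HALF_DOWN = "half_down"                          # nearest, ties toward -infinity
    HALF_UP = "half_up"                              # nearest, ties toward +infinity
    HALF_TOWARDS_ZERO = "half_towards_zero"          # nearest, ties toward zero
    HALF_TOWARDS_INFINITY = "half_towards_infinity"  # nearest, ties away from zero
    HALF_TO_EVEN = "half_to_even"                    # nearest, ties to even integer
    HALF_TO_ODD = "half_to_odd"                      # nearest, ties to odd integer


def round_to_integer(x: float, mode: RoundMode = RoundMode.HALF_TO_EVEN) -> float:
    """Round x to an integral value (hypothetical helper, not Arrow's kernel)."""
    if not math.isfinite(x):
        return x  # NaN and infinities pass through unchanged
    lo, hi = math.floor(x), math.ceil(x)
    if mode is RoundMode.DOWN:
        return float(lo)
    if mode is RoundMode.UP:
        return float(hi)
    if mode is RoundMode.TOWARDS_ZERO:
        return float(math.trunc(x))
    if mode is RoundMode.TOWARDS_INFINITY:
        return float(lo if x < 0 else hi)
    # The HALF_* modes round to nearest and differ only on exact ties.
    frac = x - lo
    if frac != 0.5:
        return float(lo if frac < 0.5 else hi)
    pick_hi = {
        RoundMode.HALF_DOWN: False,
        RoundMode.HALF_UP: True,
        RoundMode.HALF_TOWARDS_ZERO: x < 0,
        RoundMode.HALF_TOWARDS_INFINITY: x > 0,
        RoundMode.HALF_TO_EVEN: hi % 2 == 0,
        RoundMode.HALF_TO_ODD: hi % 2 == 1,
    }[mode]
    return float(hi if pick_hi else lo)


def round_digits(x: float, ndigits: int = 0,
                 mode: RoundMode = RoundMode.HALF_TO_EVEN) -> float:
    """Round x to ndigits decimal places; negative ndigits rounds to tens, hundreds, ..."""
    pow10 = 10.0 ** abs(ndigits)  # only non-negative exponents are used
    if ndigits >= 0:
        return round_to_integer(x * pow10, mode) / pow10
    return round_to_integer(x / pow10, mode) * pow10


def round_to_multiple(x: float, multiple: float = 1.0,
                      mode: RoundMode = RoundMode.HALF_TO_EVEN) -> float:
    """Round x to an integral multiple of `multiple`."""
    return round_to_integer(x / multiple, mode) * multiple
```

For example, round_digits(1234.0, -2, RoundMode.DOWN) scales by 100 before flooring, and round_to_multiple(7.5, 5.0) treats 1.5 as a tie and resolves it with the mode passed in.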

@edponce edponce marked this pull request as ready for review May 17, 2021 19:38
@edponce edponce marked this pull request as draft May 17, 2021 20:26
@edponce edponce force-pushed the ARROW-12744-Add-rounding-kernel branch from 6a6e01e to 49f232b Compare June 25, 2021 10:48
@edponce
Contributor Author

edponce commented Jun 25, 2021

@bkietz @jorisvandenbossche Need feedback on this PR. Specifically, the rounding options provided and kernel implementations.

@edponce edponce marked this pull request as ready for review June 28, 2021 15:03
@edponce edponce requested a review from bkietz June 28, 2021 15:05
@edponce edponce force-pushed the ARROW-12744-Add-rounding-kernel branch 2 times, most recently from 402ab23 to 0240f37 Compare July 21, 2021 07:11
@edponce
Contributor Author

edponce commented Jul 21, 2021

There are two round functions (Round and MRound). They use different Options classes but share the same enum RoundMode, so I defined RoundMode at global scope in api_scalar.h. Based on the recent FunctionOptions changes, I added EnumTraits<RoundMode> to api_scalar.cc along with the necessary type and registration code.

For the tests, I wanted to use the values() method of EnumTraits to iterate through the enum values, but I am not sure how to use EnumTraits<RoundMode> since it is not exposed in a header file. My solution was to create an array (kRoundModes) holding the enum values at global scope in the tests.

Also, I could not find a way to create the generator dispatchers without explicitly using the enum RoundMode values as template parameters (I do not think we can do this in C++11 because the value depends on the ty loop variable).

@lidavidm @bkietz Any comments or suggestions would be greatly appreciated.

@edponce edponce force-pushed the ARROW-12744-Add-rounding-kernel branch from e7f5d07 to bf4d80d Compare July 21, 2021 20:28
@lidavidm
Member

There seem to be some Windows-specific test failures :/

@edponce edponce force-pushed the ARROW-12744-Add-rounding-kernel branch from d58bc48 to 8c3943c Compare August 19, 2021 03:44
Member

@lidavidm lidavidm left a comment

Thanks, this looks good to me. I left two small comments.

Member

@lidavidm lidavidm left a comment

Thanks, this LGTM.

@edponce edponce force-pushed the ARROW-12744-Add-rounding-kernel branch from 7ef8e65 to dfe6de2 Compare August 24, 2021 06:23
@pitrou
Member

pitrou commented Sep 7, 2021

Should you undraft it or is it still WIP?

@edponce
Contributor Author

edponce commented Sep 7, 2021

It is basically complete and undrafted. I left a few minor comments about doubts that I have.

@edponce edponce requested a review from lidavidm September 7, 2021 14:56
@edponce edponce force-pushed the ARROW-12744-Add-rounding-kernel branch from 14d6244 to 3e82fd0 Compare September 7, 2021 15:33
Member

@pitrou pitrou left a comment

Nice! A bunch of comments below.

@edponce edponce force-pushed the ARROW-12744-Add-rounding-kernel branch from 3e82fd0 to 14382ff Compare September 8, 2021 04:23
@edponce edponce force-pushed the ARROW-12744-Add-rounding-kernel branch from 14382ff to dd001b9 Compare September 9, 2021 00:57
@edponce
Contributor Author

edponce commented Sep 9, 2021

Ready for review cc @pitrou

@rok
Member

rok commented Sep 9, 2021

Looks great!
Would we consider pandas-like time unit rounding once we have this in?

@edponce
Contributor Author

edponce commented Sep 9, 2021

@rok It is a possibility, if we have the semantics pinned down w.r.t. how to shift specific timestamps (forward, backward, delta). This would be a new set of compute functions: "round_time" and "round_time_to_multiple", where the latter could be a quaternary/varargs function to support multiples for hour, min, sec, ms/ns.

@rok
Member

rok commented Sep 9, 2021

> @rok It is a possibility, if we have the semantics pinned down w.r.t. how to shift specific timestamps (forward, backward, delta).

Indeed that would probably need some work.

Does this currently support rounding timestamps to an arbitrary multiple? If so, it might be good to limit it to timezoneless and UTC timestamps to avoid issues with ambiguous and nonexistent timestamps.

Member

@lidavidm lidavidm left a comment

Looks fine to me. One minor comment.

Comment on lines 979 to 985
Member

Should this just return OptionsWrapper::Init like below?

Contributor Author

@edponce edponce Sep 9, 2021

No, it is RoundOptionsWrapper so that its constructor is invoked via make_unique, which initializes the pow10 data member. If we use OptionsWrapper, then pow10 is not available. I tried invoking OptionsWrapper::Init, but it returns a std::unique_ptr, which would require "casting" to RoundOptionsWrapper first and then to KernelState to match the return type. The unique_ptr casting caused too many issues, so I reverted to mimicking the OptionsWrapper::Init method.

Contributor Author

The way I used these OptionsWrapper classes:

  • Init is used for options validation, because Init can return a Status::Invalid
  • the constructor initializes non-options state, which kernels can then access in Call via ctx->state()
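
The split described in the two points above can be sketched in a language-neutral way. The Python below is only an analogy, not Arrow's C++ API: RoundOptions, RoundState, init and call are hypothetical stand-ins, a raised ValueError plays the role of Status::Invalid, and the rounding in call is a plain floor chosen to keep the example short.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RoundOptions:
    """Hypothetical stand-in for the C++ RoundOptions; only ndigits is modelled."""
    ndigits: int = 0


class RoundState:
    """Hypothetical kernel state: the options plus a value derived from them."""

    def __init__(self, options: RoundOptions) -> None:
        self.options = options
        # Non-options state, computed once here rather than once per element.
        self.pow10 = 10.0 ** abs(options.ndigits)

    @classmethod
    def init(cls, options: Optional[RoundOptions]) -> "RoundState":
        # Validation happens before any state is built; raising ValueError
        # stands in for returning Status::Invalid.
        if options is None:
            raise ValueError("RoundOptions must be provided")
        # cls(...) constructs whichever subclass init() was called on.
        return cls(options)


def call(state: RoundState, values: List[float]) -> List[float]:
    """Kernel body: reads the precomputed pow10 from the state (floor only)."""
    p = state.pow10
    if state.options.ndigits >= 0:
        return [math.floor(v * p) / p for v in values]
    return [math.floor(v / p) * p for v in values]
```

Because init is a classmethod, a subclass inherits it and gets its own constructor invoked without redefining init, which is the effect the wrapper classes are after.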

Contributor Author

I templatized the Init method of the KernelState-derived classes so that derived constructors can be invoked without duplicating the Init definition. This lets KernelState support both options validation and extending kernel state via a constructor, which would most likely benefit other scalar kernels as well.

For example, by extending OptionsWrapper as follows:

template <typename OptionsType>
struct OptionsWrapper : public KernelState {
  template <typename KernelStateType = OptionsWrapper>
  static Result<std::unique_ptr<KernelState>> Init(KernelContext* ctx, const KernelInitArgs& args) {
    if (auto options = static_cast<const OptionsType*>(args.options)) {
      return ::arrow::internal::make_unique<KernelStateType>(*options);
    }
    ...
  }
  ...
};

now we can extend custom states as follows:

template <>
struct RoundOptionsWrapper<RoundOptions> : public OptionsWrapper<RoundOptions> {
  using OptionsType = RoundOptions;
  using State = RoundOptionsWrapper<OptionsType>;
  double pow10;

  explicit RoundOptionsWrapper(OptionsType options) : OptionsWrapper(std::move(options)) {
    pow10 = RoundUtil::Pow10(std::abs(options.ndigits));
  }

  static Result<std::unique_ptr<KernelState>> Init(KernelContext* ctx, const KernelInitArgs& args) {
    return OptionsWrapper<OptionsType>::Init<State>(ctx, args);
  }
};

Contributor Author

OK, I tried the template variant of OptionsWrapper and, although it made the code cleaner, it failed to compile on some systems, so I reverted to duplicating the OptionsWrapper::Init definition. I think KernelState and related parts need refactoring to support validating kernel options and extending kernel state in a simpler manner. Different patterns are currently used in the code to achieve these. But this is a separate issue from this PR.

Member

Just for the record, it feels a bit weird to call GenerateArithmeticRound at kernel execution time. That said, a quick benchmark in Python shows there doesn't seem to be any large overhead:

>>> import pyarrow as pa, pyarrow.compute as pc
>>> floor = pc.get_function("floor")
>>> round = pc.get_function("round")
>>> arr = pa.array([None], type=pa.float64())
>>> %timeit floor.call([arr])
2.57 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit floor.call([arr])
2.58 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit round.call([arr])
2.65 µs ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit round.call([arr])
2.53 µs ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Contributor Author

I agree. The purpose of doing this here is to generate dispatchers once prior to kernel invocation. Previously, I tried 2 solutions:

  • Have a vector of precomputed GenerateArithmeticFloatingPoint dispatchers and use options.round_mode to index it. But this required the vector to be ordered identically to the round modes in enum RoundMode.
  • Similar to the above, but using an unordered_map keyed by options.round_mode. This requires adding hash support for the RoundMode data type. This approach does not constrain the ordering of RoundMode.

Which one do you think is best?
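
To make the trade-off between the two options concrete, here is a small Python analogy. It assumes a reduced, hypothetical RoundMode with three members and uses math.floor, math.ceil and math.trunc in place of the generated kernel dispatchers; none of these names come from the Arrow code. Python enum members are hashable out of the box, so the sketch does not show the extra hash support that the C++ unordered_map variant would need.

```python
import math
from enum import IntEnum


class RoundMode(IntEnum):
    # Hypothetical, reduced enum; the integer values double as list indices.
    DOWN = 0
    UP = 1
    TOWARDS_ZERO = 2


# Option 1: a sequence indexed by the enum value. The entries must stay in
# exactly the same order as the enum members, or lookups silently go wrong.
KERNELS_BY_INDEX = [math.floor, math.ceil, math.trunc]

# Option 2: a mapping keyed by the enum member. Entry order is irrelevant,
# at the cost of requiring the key type to be hashable.
KERNELS_BY_MODE = {
    RoundMode.TOWARDS_ZERO: math.trunc,
    RoundMode.DOWN: math.floor,
    RoundMode.UP: math.ceil,
}


def dispatch_by_index(mode: RoundMode, x: float) -> int:
    """Look up the rounding function by position."""
    return KERNELS_BY_INDEX[mode](x)


def dispatch_by_mode(mode: RoundMode, x: float) -> int:
    """Look up the rounding function by key."""
    return KERNELS_BY_MODE[mode](x)
```

Both lookups return the same function for every mode here. The difference only shows up when the enum is reordered or extended: the list has to be edited in step with it, while the mapping only needs a new entry.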

Contributor Author

Actually, it is doing the same thing as the other kernels. Other kernels pass exec (an ArrayKernelExec) to the AddKernel method without invoking it. exec is invoked during kernel dispatching because it requires KernelContext, ExecBatch, and Datum parameters.

@edponce
Contributor Author

edponce commented Sep 9, 2021

@rok This PR only supports rounding for basic arithmetic data types (unsigned/signed int and floating-point).

@lidavidm
Member

lidavidm commented Sep 9, 2021

@edponce looks like you need to rebase again here as well.

@edponce edponce force-pushed the ARROW-12744-Add-rounding-kernel branch 2 times, most recently from d6b909d to 65cb707 Compare September 10, 2021 02:32
@edponce
Contributor Author

edponce commented Sep 10, 2021

This PR is ready for a (hopefully) final review. cc @lidavidm @pitrou

@edponce edponce force-pushed the ARROW-12744-Add-rounding-kernel branch from 0496356 to 302c5f1 Compare September 10, 2021 15:49
@edponce
Contributor Author

edponce commented Sep 10, 2021

Are there any additional comments/reviews? cc @pitrou @bkietz @jorisvandenbossche

Member

@pitrou pitrou left a comment

Thanks for the update! I'll push a few minor changes and will merge.

@pitrou pitrou force-pushed the ARROW-12744-Add-rounding-kernel branch from 302c5f1 to f91a1a5 Compare September 13, 2021 16:29
@pitrou pitrou force-pushed the ARROW-12744-Add-rounding-kernel branch from f91a1a5 to ae740e3 Compare September 13, 2021 16:35
@pitrou pitrou closed this in 376cb45 Sep 13, 2021
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
Closes apache#10349 from edponce/ARROW-12744-Add-rounding-kernel

Authored-by: Eduardo Ponce <edponce00@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>