ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size #7506
```cpp
  static std::shared_ptr<Scalar> Box(T val, const std::shared_ptr<DataType>& type) {
    return std::make_shared<ScalarType>(val, type);
  }
  static void Box(T val, Scalar* out) { checked_cast<ScalarType*>(out)->value = val; }
};

template <typename Type>
struct BoxScalar<Type, enable_if_base_binary<Type>> {
  using T = typename GetOutputType<Type>::T;
  using ScalarType = typename TypeTraits<Type>::ScalarType;
  static std::shared_ptr<Scalar> Box(T val, const std::shared_ptr<DataType>&) {
    return std::make_shared<ScalarType>(val);
  }
  static void Box(T val, Scalar* out) {
    checked_cast<ScalarType*>(out)->value = std::make_shared<Buffer>(val);
  }
};

template <>
struct BoxScalar<Decimal128Type> {
  using T = Decimal128;
  using ScalarType = Decimal128Scalar;
  static std::shared_ptr<Scalar> Box(T val, const std::shared_ptr<DataType>& type) {
    return std::make_shared<ScalarType>(val, type);
  }
  static void Box(T val, Scalar* out) { checked_cast<ScalarType*>(out)->value = val; }
};
```
The prior implementation was causing a lot of compiled code to be generated for some reason, FYI
AMD64 Ubuntu 18.04 C++ Benchmark (#113872) builder has succeeded. Revision: 1f1f553

=================================== ================== ================== ========
benchmark                           baseline           contender          change
=================================== ================== ================== ========
- CastDoubleToInt32Safe/262144/10 293.547m items/sec 268.901m items/sec -8.396%
CastDoubleToInt32Unsafe/262144/10 1.923b items/sec 1.882b items/sec -2.133%
CastInt64ToDoubleUnsafe/32768/1 1.160b items/sec 1.108b items/sec -4.457%
CastDoubleToInt32Safe/262144/1 955.014m items/sec 1.649b items/sec 72.694%
- CastInt64ToInt32Safe/32768/0 1.113b items/sec 775.276m items/sec -30.322%
CastDoubleToInt32Unsafe/262144/0 1.896b items/sec 1.952b items/sec 2.951%
CastDoubleToInt32Safe/32768/2 146.420m items/sec 149.509m items/sec 2.110%
CastInt64ToDoubleUnsafe/262144/1 1.370b items/sec 1.353b items/sec -1.289%
CastDoubleToInt32Unsafe/262144/1 1.881b items/sec 1.960b items/sec 4.211%
CastDoubleToInt32Unsafe/262144/2 1.903b items/sec 1.953b items/sec 2.613%
CastDoubleToInt32Unsafe/32768/1 1.530b items/sec 1.478b items/sec -3.414%
CastInt64ToInt32Unsafe/32768/0 1.587b items/sec 1.589b items/sec 0.117%
CastUInt32ToInt32Safe/32768/1 591.934m items/sec 1.828b items/sec 208.740%
CastInt64ToDoubleUnsafe/262144/2 1.368b items/sec 1.354b items/sec -1.026%
CastInt64ToInt32Safe/32768/1 759.162m items/sec 1.392b items/sec 83.315%
CastInt64ToInt32Safe/32768/2 158.228m items/sec 186.136m items/sec 17.638%
CastInt64ToInt32Safe/32768/10 362.633m items/sec 408.551m items/sec 12.662%
CastUInt32ToInt32Safe/32768/2 158.148m items/sec 192.534m items/sec 21.743%
CastUInt32ToInt32Safe/262144/1 624.409m items/sec 2.186b items/sec 250.028%
CastUInt32ToInt32Safe/32768/10 368.245m items/sec 431.468m items/sec 17.169%
CastDoubleToInt32Unsafe/262144/1000 1.921b items/sec 1.881b items/sec -2.094%
- CastUInt32ToInt32Safe/262144/0 1.709b items/sec 1.034b items/sec -39.490%
CastInt64ToDoubleSafe/32768/10 300.990m items/sec 363.788m items/sec 20.864%
- CastInt64ToDoubleUnsafe/32768/0 1.167b items/sec 1.101b items/sec -5.689%
- CastInt64ToInt32Safe/262144/0 1.344b items/sec 904.358m items/sec -32.718%
CastDoubleToInt32Safe/32768/1000 397.097m items/sec 501.618m items/sec 26.321%
CastUInt32ToInt32Safe/262144/2 159.841m items/sec 169.776m items/sec 6.216%
CastInt64ToInt32Unsafe/262144/2 2.067b items/sec 2.069b items/sec 0.110%
CastInt64ToDoubleSafe/262144/0 532.308m items/sec 718.648m items/sec 35.006%
- CastInt64ToDoubleUnsafe/32768/1000 1.125b items/sec 1.067b items/sec -5.151%
CastDoubleToInt32Safe/32768/1 873.356m items/sec 1.374b items/sec 57.276%
CastDoubleToInt32Unsafe/32768/10 1.558b items/sec 1.483b items/sec -4.815%
CastInt64ToDoubleSafe/32768/2 146.671m items/sec 176.282m items/sec 20.188%
CastInt64ToInt32Safe/262144/1000 614.881m items/sec 728.407m items/sec 18.463%
CastInt64ToInt32Unsafe/262144/10 2.066b items/sec 2.068b items/sec 0.069%
CastInt64ToDoubleSafe/32768/1000 435.454m items/sec 537.717m items/sec 23.484%
CastInt64ToInt32Unsafe/32768/1000 1.567b items/sec 1.561b items/sec -0.361%
CastInt64ToDoubleSafe/32768/0 477.517m items/sec 627.189m items/sec 31.344%
CastDoubleToInt32Unsafe/32768/0 1.545b items/sec 1.481b items/sec -4.144%
CastInt64ToDoubleUnsafe/262144/0 1.372b items/sec 1.353b items/sec -1.401%
CastInt64ToDoubleUnsafe/32768/10 1.128b items/sec 1.092b items/sec -3.203%
CastInt64ToInt32Safe/32768/1000 562.735m items/sec 649.690m items/sec 15.452%
- CastDoubleToInt32Safe/32768/10 291.876m items/sec 259.543m items/sec -11.078%
- CastInt64ToInt32Safe/262144/10 372.755m items/sec 341.696m items/sec -8.332%
CastInt64ToInt32Unsafe/262144/1000 2.065b items/sec 2.067b items/sec 0.075%
CastInt64ToInt32Unsafe/32768/10 1.568b items/sec 1.558b items/sec -0.604%
CastInt64ToDoubleUnsafe/262144/1000 1.366b items/sec 1.319b items/sec -3.452%
CastInt64ToInt32Unsafe/32768/1 1.575b items/sec 1.576b items/sec 0.100%
CastInt64ToInt32Unsafe/32768/2 1.559b items/sec 1.562b items/sec 0.226%
CastUInt32ToInt32Safe/262144/1000 612.542m items/sec 815.338m items/sec 33.107%
CastInt64ToInt32Safe/262144/1 814.460m items/sec 1.737b items/sec 113.312%
- CastDoubleToInt32Safe/262144/0 798.460m items/sec 542.248m items/sec -32.088%
CastInt64ToDoubleSafe/262144/10 315.922m items/sec 309.061m items/sec -2.172%
- CastUInt32ToInt32Safe/32768/0 1.396b items/sec 986.071m items/sec -29.353%
CastInt64ToInt32Unsafe/262144/1 2.075b items/sec 2.078b items/sec 0.162%
- CastDoubleToInt32Unsafe/32768/1000 1.552b items/sec 1.417b items/sec -8.674%
CastUInt32ToInt32Safe/32768/1000 546.409m items/sec 752.927m items/sec 37.796%
CastInt64ToDoubleUnsafe/262144/10 1.369b items/sec 1.327b items/sec -3.040%
CastInt64ToDoubleSafe/262144/2 149.308m items/sec 159.159m items/sec 6.598%
CastInt64ToDoubleSafe/262144/1000 486.532m items/sec 604.187m items/sec 24.182%
CastDoubleToInt32Safe/262144/2 145.876m items/sec 151.442m items/sec 3.816%
- CastDoubleToInt32Safe/32768/0 707.547m items/sec 511.325m items/sec -27.733%
CastDoubleToInt32Unsafe/32768/2 1.517b items/sec 1.479b items/sec -2.458%
CastInt64ToDoubleUnsafe/32768/2 1.147b items/sec 1.095b items/sec -4.530%
CastInt64ToDoubleSafe/32768/1 593.325m items/sec 999.989m items/sec 68.540%
CastInt64ToDoubleSafe/262144/1 625.563m items/sec 1.168b items/sec 86.742%
- CastUInt32ToInt32Safe/262144/10 375.829m items/sec 351.531m items/sec -6.465%
CastInt64ToInt32Safe/262144/2 159.038m items/sec 167.729m items/sec 5.464%
CastInt64ToInt32Unsafe/262144/0 2.079b items/sec 2.081b items/sec 0.103%
CastDoubleToInt32Safe/262144/1000 421.370m items/sec 543.970m items/sec 29.096%
=================================== ================== ================== ========

I'll investigate the perf regressions.
Wow, really crazy performance investigation. The old code performs badly with clang-8 but actually very well with gcc-8 (maybe better vectorization?). By contrast, the new code has a 4-5x speedup on clang but slight regressions in some cases on gcc. Taking the smaller code size into consideration, I think we should accept the gcc perf regressions. I'll also look at the difference with MSVC.
@ursabot benchmark --help
@kszucs @fsaintjacques is there a way to ask the benchmark differ to use clang-8?
Archery does via `--cc` and `--cxx`, but ursabot doesn't support it. It's probably just a matter of forwarding argv correctly.
Yeah, I'm trapped between two things that don't do what I need.
Pretty massive speedups with MSVC 2017 (mimalloc allocator): https://gist.github.com/wesm/c7efa656ab0a4bd789e6029e5f791417/revisions?diff=split The second revision shows the performance after applying this patch's changes.
@nealrichardson @romainfrancois check out the
The gcc vs. clang performance difference has come up a few times. On the SIMD thread on the mailing list, I mentioned trying to standardise on a compiler, at least on Linux, so we have a common rubric for evaluating benchmark results. Otherwise I think we are going to spin our wheels on optimization work. Autovectorization choices seem to vary a lot.
@emkornfield I agree. Realistically we're going to have to look at them both. FWIW, in this particular case it seems that the clang performance is the most representative of how things are behaving across platforms. Stuff that autovectorizes well in gcc may not do much at all on MSVC. There's also the question of how

Hmm, I'm not sure I understand this completely; I'll start a discussion thread for this on the ML.
@emkornfield I meant that we should treat the results of "ursabot benchmark" as informational only and certainly not authoritative.
Here's the benchmark comparison with clang-8
pitrou left a comment
Thank you, this looks good. Just a couple comments/questions below.
Out of curiosity, here are the same benchmarks on my laptop with gcc-8 (using the new output formatting from ARROW-9201). In the cases where there is a regression, the code went from "very very fast" to "very fast", so it's not likely to make a difference in real-world workloads.
+1. Appveyor build https://ci.appveyor.com/project/wesm/arrow/builds/33669506
… diffs, add repetitions argument, don't build unit tests

This uses pandas to generate a sorted text table when using `archery benchmark diff`. Example: #7506 (comment)

There are some other incidental changes:

* pandas is required for `archery benchmark diff`. I don't think there's value in reimplementing the stuff that pandas can do in a few lines of code (read JSON, create a sorted table, and print it nicely for us).
* The default number of benchmark repetitions has been changed from 10 to 1 (see ARROW-9155 for context). IMHO more interactive benchmark results are more useful than higher precision. If you need higher precision you can pass `--repetitions=10` on the command line.
* `archery benchmark` was building the unit tests unnecessarily. This also occluded a bug, ARROW-9209, which is fixed here.

Closes #7516 from wesm/ARROW-9201

Authored-by: Wes McKinney <wesm@apache.org>
Signed-off-by: Wes McKinney <wesm@apache.org>
Bunch of stuff in this PR:

* Added `CheckIntegersInRange` for fast range-checking of integer arrays.
* Casts now work on `arrow::Scalar` values, too. I disabled these casts for the types where they aren't supported: decimal (see ARROW-9194), string (ARROW-9198), dictionary, lists (ARROW-9199), and temporal types (ARROW-9196).
* Some `nm` sleuthing showed that our code in `BoxScalar` in codegen_internal.h was generating a lot of binary for some reason, so this has been fixed.