Conversation

@wesm
Member

@wesm wesm commented Jun 21, 2020

Bunch of stuff in this PR:

  • Speed up safe integer/floating<->integer/floating casts, especially when they are mostly not null
  • Compiled size of scalar_cast_numeric.cc.o is down to 736KB from 1210KB. There's about 70KB of new code in util/int_util.cc.o for some of the stuff below. Net reduction of 400KB in libarrow.so on Linux/clang-8
  • Start overdue casting benchmark suite. There are benchmarks for some of the casts I worked on in this PR to help show the before/after performance
  • General purpose CheckIntegersInRange for fast range-checking of integer arrays
  • Augment scalar-cast-test to check casting arrow::Scalar values, too. I disabled these casts for the types where they aren't supported: decimal (see ARROW-9194), string (ARROW-9198), dictionary, lists (ARROW-9199), and temporal types (ARROW-9196).
  • I discovered by nm sleuthing that our code in BoxScalar in codegen_internal.h was generating a lot of binary for some reason, so this has been fixed.
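
As an illustration of the range-checking idea in the list above, here is a minimal sketch. It is not the `CheckIntegersInRange` implementation from this PR: the function name `AllValuesInRange`, the byte-per-value validity vector, and the block size of 64 are all assumptions made for the example. It checks values one block at a time and accumulates failures with bitwise OR, which keeps the inner loop free of per-value branches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not Arrow's API: returns true when every valid
// (non-null) value lies in [lower, upper]. An empty `valid` vector means
// "no nulls"; otherwise valid[i] != 0 marks values[i] as non-null.
template <typename T>
bool AllValuesInRange(const std::vector<T>& values,
                      const std::vector<uint8_t>& valid, T lower, T upper) {
  assert(valid.empty() || valid.size() == values.size());
  constexpr std::size_t kBlock = 64;  // assumed block size for this sketch
  for (std::size_t start = 0; start < values.size(); start += kBlock) {
    const std::size_t end = std::min(start + kBlock, values.size());
    // Accumulate failures with bitwise OR instead of branching per value,
    // then test the flag once per block.
    bool out_of_range = false;
    if (valid.empty()) {
      for (std::size_t i = start; i < end; ++i) {
        out_of_range |= (values[i] < lower) | (values[i] > upper);
      }
    } else {
      for (std::size_t i = start; i < end; ++i) {
        // Null slots are ignored regardless of the value stored in them.
        out_of_range |=
            (valid[i] != 0) & ((values[i] < lower) | (values[i] > upper));
      }
    }
    if (out_of_range) return false;
  }
  return true;
}
```

The no-nulls path is kept as a separate loop so the common "mostly not null" case does not pay for validity lookups.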

@wesm
Member Author

wesm commented Jun 21, 2020

@ursabot benchmark --benchmark-filter=Cast 6538173

Comment on lines -218 to +234
-  static std::shared_ptr<Scalar> Box(T val, const std::shared_ptr<DataType>& type) {
-    return std::make_shared<ScalarType>(val, type);
-  }
+  static void Box(T val, Scalar* out) { checked_cast<ScalarType*>(out)->value = val; }
 };

 template <typename Type>
 struct BoxScalar<Type, enable_if_base_binary<Type>> {
   using T = typename GetOutputType<Type>::T;
   using ScalarType = typename TypeTraits<Type>::ScalarType;
-  static std::shared_ptr<Scalar> Box(T val, const std::shared_ptr<DataType>&) {
-    return std::make_shared<ScalarType>(val);
+  static void Box(T val, Scalar* out) {
+    checked_cast<ScalarType*>(out)->value = std::make_shared<Buffer>(val);
   }
 };

 template <>
 struct BoxScalar<Decimal128Type> {
   using T = Decimal128;
   using ScalarType = Decimal128Scalar;
-  static std::shared_ptr<Scalar> Box(T val, const std::shared_ptr<DataType>& type) {
-    return std::make_shared<ScalarType>(val, type);
-  }
+  static void Box(T val, Scalar* out) { checked_cast<ScalarType*>(out)->value = val; }
Member Author

The prior implementation was causing a lot of compiled code to be generated for some reason, FYI
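
The thread does not say why the earlier version produced so much code. The sketch below only contrasts the two styles visible in the diff, using simplified stand-in types (`Scalar`, `Int32Scalar`, `BoxAllocating`, and `BoxInPlace` are hypothetical names for this example, not Arrow's classes). One plausible contributor, offered as an assumption, is that each instantiation of the allocating style carries its own `std::make_shared` machinery, while the in-place style is little more than a store.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Simplified stand-ins for illustration only; not Arrow's Scalar types.
struct Scalar {
  virtual ~Scalar() = default;
};
struct Int32Scalar : Scalar {
  int32_t value = 0;
};

// Allocating style (like the removed code): each instantiation allocates
// a new scalar via std::make_shared and returns ownership to the caller.
template <typename ScalarType, typename T>
std::shared_ptr<Scalar> BoxAllocating(T val) {
  auto out = std::make_shared<ScalarType>();
  out->value = val;
  return out;
}

// In-place style (like the added code): the caller already owns the
// scalar, so the helper only writes the value into it.
template <typename ScalarType, typename T>
void BoxInPlace(T val, Scalar* out) {
  assert(out != nullptr);
  static_cast<ScalarType*>(out)->value = val;
}
```

Comparing symbol sizes with `nm`, as described in the PR summary, is one way to check whether a change like this actually shrinks the object file.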

@ursabot

ursabot commented Jun 21, 2020

AMD64 Ubuntu 18.04 C++ Benchmark (#113872) builder has succeeded.

Revision: 1f1f553

  ===================================  ==================  ==================  ========
  benchmark                            baseline            contender           change
  ===================================  ==================  ==================  ========
- CastDoubleToInt32Safe/262144/10      293.547m items/sec  268.901m items/sec  -8.396%
  CastDoubleToInt32Unsafe/262144/10    1.923b items/sec    1.882b items/sec    -2.133%
  CastInt64ToDoubleUnsafe/32768/1      1.160b items/sec    1.108b items/sec    -4.457%
  CastDoubleToInt32Safe/262144/1       955.014m items/sec  1.649b items/sec    72.694%
- CastInt64ToInt32Safe/32768/0         1.113b items/sec    775.276m items/sec  -30.322%
  CastDoubleToInt32Unsafe/262144/0     1.896b items/sec    1.952b items/sec    2.951%
  CastDoubleToInt32Safe/32768/2        146.420m items/sec  149.509m items/sec  2.110%
  CastInt64ToDoubleUnsafe/262144/1     1.370b items/sec    1.353b items/sec    -1.289%
  CastDoubleToInt32Unsafe/262144/1     1.881b items/sec    1.960b items/sec    4.211%
  CastDoubleToInt32Unsafe/262144/2     1.903b items/sec    1.953b items/sec    2.613%
  CastDoubleToInt32Unsafe/32768/1      1.530b items/sec    1.478b items/sec    -3.414%
  CastInt64ToInt32Unsafe/32768/0       1.587b items/sec    1.589b items/sec    0.117%
  CastUInt32ToInt32Safe/32768/1        591.934m items/sec  1.828b items/sec    208.740%
  CastInt64ToDoubleUnsafe/262144/2     1.368b items/sec    1.354b items/sec    -1.026%
  CastInt64ToInt32Safe/32768/1         759.162m items/sec  1.392b items/sec    83.315%
  CastInt64ToInt32Safe/32768/2         158.228m items/sec  186.136m items/sec  17.638%
  CastInt64ToInt32Safe/32768/10        362.633m items/sec  408.551m items/sec  12.662%
  CastUInt32ToInt32Safe/32768/2        158.148m items/sec  192.534m items/sec  21.743%
  CastUInt32ToInt32Safe/262144/1       624.409m items/sec  2.186b items/sec    250.028%
  CastUInt32ToInt32Safe/32768/10       368.245m items/sec  431.468m items/sec  17.169%
  CastDoubleToInt32Unsafe/262144/1000  1.921b items/sec    1.881b items/sec    -2.094%
- CastUInt32ToInt32Safe/262144/0       1.709b items/sec    1.034b items/sec    -39.490%
  CastInt64ToDoubleSafe/32768/10       300.990m items/sec  363.788m items/sec  20.864%
- CastInt64ToDoubleUnsafe/32768/0      1.167b items/sec    1.101b items/sec    -5.689%
- CastInt64ToInt32Safe/262144/0        1.344b items/sec    904.358m items/sec  -32.718%
  CastDoubleToInt32Safe/32768/1000     397.097m items/sec  501.618m items/sec  26.321%
  CastUInt32ToInt32Safe/262144/2       159.841m items/sec  169.776m items/sec  6.216%
  CastInt64ToInt32Unsafe/262144/2      2.067b items/sec    2.069b items/sec    0.110%
  CastInt64ToDoubleSafe/262144/0       532.308m items/sec  718.648m items/sec  35.006%
- CastInt64ToDoubleUnsafe/32768/1000   1.125b items/sec    1.067b items/sec    -5.151%
  CastDoubleToInt32Safe/32768/1        873.356m items/sec  1.374b items/sec    57.276%
  CastDoubleToInt32Unsafe/32768/10     1.558b items/sec    1.483b items/sec    -4.815%
  CastInt64ToDoubleSafe/32768/2        146.671m items/sec  176.282m items/sec  20.188%
  CastInt64ToInt32Safe/262144/1000     614.881m items/sec  728.407m items/sec  18.463%
  CastInt64ToInt32Unsafe/262144/10     2.066b items/sec    2.068b items/sec    0.069%
  CastInt64ToDoubleSafe/32768/1000     435.454m items/sec  537.717m items/sec  23.484%
  CastInt64ToInt32Unsafe/32768/1000    1.567b items/sec    1.561b items/sec    -0.361%
  CastInt64ToDoubleSafe/32768/0        477.517m items/sec  627.189m items/sec  31.344%
  CastDoubleToInt32Unsafe/32768/0      1.545b items/sec    1.481b items/sec    -4.144%
  CastInt64ToDoubleUnsafe/262144/0     1.372b items/sec    1.353b items/sec    -1.401%
  CastInt64ToDoubleUnsafe/32768/10     1.128b items/sec    1.092b items/sec    -3.203%
  CastInt64ToInt32Safe/32768/1000      562.735m items/sec  649.690m items/sec  15.452%
- CastDoubleToInt32Safe/32768/10       291.876m items/sec  259.543m items/sec  -11.078%
- CastInt64ToInt32Safe/262144/10       372.755m items/sec  341.696m items/sec  -8.332%
  CastInt64ToInt32Unsafe/262144/1000   2.065b items/sec    2.067b items/sec    0.075%
  CastInt64ToInt32Unsafe/32768/10      1.568b items/sec    1.558b items/sec    -0.604%
  CastInt64ToDoubleUnsafe/262144/1000  1.366b items/sec    1.319b items/sec    -3.452%
  CastInt64ToInt32Unsafe/32768/1       1.575b items/sec    1.576b items/sec    0.100%
  CastInt64ToInt32Unsafe/32768/2       1.559b items/sec    1.562b items/sec    0.226%
  CastUInt32ToInt32Safe/262144/1000    612.542m items/sec  815.338m items/sec  33.107%
  CastInt64ToInt32Safe/262144/1        814.460m items/sec  1.737b items/sec    113.312%
- CastDoubleToInt32Safe/262144/0       798.460m items/sec  542.248m items/sec  -32.088%
  CastInt64ToDoubleSafe/262144/10      315.922m items/sec  309.061m items/sec  -2.172%
- CastUInt32ToInt32Safe/32768/0        1.396b items/sec    986.071m items/sec  -29.353%
  CastInt64ToInt32Unsafe/262144/1      2.075b items/sec    2.078b items/sec    0.162%
- CastDoubleToInt32Unsafe/32768/1000   1.552b items/sec    1.417b items/sec    -8.674%
  CastUInt32ToInt32Safe/32768/1000     546.409m items/sec  752.927m items/sec  37.796%
  CastInt64ToDoubleUnsafe/262144/10    1.369b items/sec    1.327b items/sec    -3.040%
  CastInt64ToDoubleSafe/262144/2       149.308m items/sec  159.159m items/sec  6.598%
  CastInt64ToDoubleSafe/262144/1000    486.532m items/sec  604.187m items/sec  24.182%
  CastDoubleToInt32Safe/262144/2       145.876m items/sec  151.442m items/sec  3.816%
- CastDoubleToInt32Safe/32768/0        707.547m items/sec  511.325m items/sec  -27.733%
  CastDoubleToInt32Unsafe/32768/2      1.517b items/sec    1.479b items/sec    -2.458%
  CastInt64ToDoubleUnsafe/32768/2      1.147b items/sec    1.095b items/sec    -4.530%
  CastInt64ToDoubleSafe/32768/1        593.325m items/sec  999.989m items/sec  68.540%
  CastInt64ToDoubleSafe/262144/1       625.563m items/sec  1.168b items/sec    86.742%
- CastUInt32ToInt32Safe/262144/10      375.829m items/sec  351.531m items/sec  -6.465%
  CastInt64ToInt32Safe/262144/2        159.038m items/sec  167.729m items/sec  5.464%
  CastInt64ToInt32Unsafe/262144/0      2.079b items/sec    2.081b items/sec    0.103%
  CastDoubleToInt32Safe/262144/1000    421.370m items/sec  543.970m items/sec  29.096%
  ===================================  ==================  ==================  ========
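
For reading the table, the change column is consistent with the relative difference in throughput, (contender - baseline) / baseline. The rows flagged with a leading `-` are consistent with a 5% regression threshold; that threshold is inferred from the data here, not taken from the tool's documentation. A minimal sketch with hypothetical helper names:

```cpp
#include <cassert>

// Relative change in throughput, in percent. Positive means the
// contender is faster than the baseline.
double PercentChange(double baseline, double contender) {
  assert(baseline > 0.0);
  return (contender - baseline) / baseline * 100.0;
}

// Flags a regression when throughput drops by more than `threshold_pct`.
// The 5% default is an assumption inferred from the flagged rows above.
bool IsRegression(double change_pct, double threshold_pct = 5.0) {
  return change_pct < -threshold_pct;
}
```

For example, the first row (293.547m to 268.901m items/sec) gives roughly -8.4%, which is flagged, while rows in the -2% to -4% range are not.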

@wesm
Member Author

wesm commented Jun 21, 2020

I'll investigate the perf regressions

@wesm
Member Author

wesm commented Jun 21, 2020

Wow, really crazy performance investigation. The old code performs badly with clang-8 but actually very well with gcc-8 (maybe thanks to better vectorization?). By contrast, the new code is 4-5x faster on clang but regresses slightly in some cases on gcc.

Taking into consideration the smaller code size I think we should accept the gcc perf regressions. I'll also look to see the difference with MSVC

@wesm
Member Author

wesm commented Jun 21, 2020

@ursabot benchmark --help

@ursabot

ursabot commented Jun 21, 2020

Usage: @ursabot benchmark [OPTIONS] [<baseline>]

  Run the benchmark suite in comparison mode.

  This command will run the benchmark suite for tip of the branch commit
  against `<baseline>` (or master if not provided).

  Examples:

  # Run all the benchmarks
  @ursabot benchmark

  # Compare only benchmarks where the name matches the /^Sum/ regex
  @ursabot benchmark --benchmark-filter=^Sum

  # Compare only benchmarks where the suite matches the /compute-/ regex.
  # A suite is the C++ binary.
  @ursabot benchmark --suite-filter=compute-

  # Sometimes a new optimization requires the addition of new benchmarks to
  # quantify the performance increase. When doing this be sure to add the
  # benchmark in a separate commit before introducing the optimization.
  #
  # Note that specifying the baseline is the only way to compare using a new
  # benchmark, since master does not contain the new benchmark and no
  # comparison is possible.
  #
  # The following command compares the results of matching benchmarks,
  # compiling against HEAD and the provided baseline commit, e.g. eaf8302.
  # You can use this to quantify the performance improvement of new
  # optimizations or to check for regressions.
  @ursabot benchmark --benchmark-filter=MyBenchmark eaf8302

Options:
  --suite-filter <regex>      Regex filtering benchmark suites.
  --benchmark-filter <regex>  Regex filtering benchmarks.
  --help                      Show this message and exit.

@wesm
Member Author

wesm commented Jun 21, 2020

@kszucs @fsaintjacques is there a way to ask the benchmark differ to use clang-8?

@fsaintjacques
Contributor

Archery does via --cc and --cxx, but ursabot doesn't support it. It's probably just a matter of forwarding argv correctly.

@wesm
Member Author

wesm commented Jun 21, 2020

Yeah, I'm trapped between two things that don't do what I need. archery benchmark diff doesn't print the results (AFAICT?) in a presentable way -- I opened ARROW-9201 about that

@wesm
Member Author

wesm commented Jun 21, 2020

Pretty massive speedups with MSVC 2017 (mimalloc allocator):

https://gist.github.com/wesm/c7efa656ab0a4bd789e6029e5f791417/revisions?diff=split

The 2nd revision is the performance after applying this patch's changes

@wesm
Member Author

wesm commented Jun 21, 2020

@nealrichardson @romainfrancois check out the arrow::internal::IntegersCanFit function within to help with int64->int32 narrowing in R
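
To make the int64->int32 narrowing use case concrete, here is a minimal sketch of a check-then-narrow helper. It is not `arrow::internal::IntegersCanFit` and makes no claim about that function's signature; `NarrowToInt32` is a hypothetical name, and the sketch ignores nulls for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: narrows int64 values to int32 only if every value
// fits. Returns false and leaves `out` untouched when any value would
// overflow, so the caller can keep the data as int64 instead.
bool NarrowToInt32(const std::vector<int64_t>& values,
                   std::vector<int32_t>* out) {
  assert(out != nullptr);
  constexpr int64_t kMin = std::numeric_limits<int32_t>::min();
  constexpr int64_t kMax = std::numeric_limits<int32_t>::max();

  // Pass 1: range check only, with no early exit inside the loop.
  bool fits = true;
  for (int64_t v : values) {
    fits &= (v >= kMin) & (v <= kMax);
  }
  if (!fits) return false;

  // Pass 2: the cast itself needs no per-value checks any more.
  std::vector<int32_t> narrowed;
  narrowed.reserve(values.size());
  for (int64_t v : values) {
    narrowed.push_back(static_cast<int32_t>(v));
  }
  *out = std::move(narrowed);
  return true;
}
```

Splitting the work into a check pass and an unchecked cast pass mirrors the idea of validating the whole array up front rather than testing each value during conversion.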

@emkornfield
Contributor

The gcc vs clang performance has come up a few times. On the SIMD thread on the mailing list, I mentioned trying to standardise on a compiler, at least on Linux, so we can have a common rubric for evaluating benchmark results. Otherwise, I think we are going to spin our wheels on optimization work. Auto-vectorization choices seem to vary a lot.

@wesm
Member Author

wesm commented Jun 22, 2020

@emkornfield I agree. Realistically we're going to have to look at them both. FWIW, in this particular case it seems that the Clang performance is the most representative of how things are behaving across platforms. Stuff that autovectorizes well in gcc may not do much at all on MSVC. There's also the question of how __builtin_expect impacts optimizations. But in general I don't think we should be using the "ursabot benchmark" results (which use gcc) to make conclusions about what perf optimizations are working
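
For context on the `__builtin_expect` question, here is a minimal sketch of a checked cast loop that hints the overflow branch as unlikely. The macro name `SKETCH_UNLIKELY` and the function `SafeCastInt64ToInt32` are hypothetical and exist only for this example. The hint is available on GCC and Clang; on other compilers such as MSVC the macro falls back to the plain condition, which is one reason its effect can differ across toolchains.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical macro for this sketch: forwards to __builtin_expect on
// GCC/Clang and expands to the bare condition elsewhere (e.g. MSVC).
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

// Checked int64 -> int32 cast. Returns false at the first value that
// does not fit; values before it have already been written to `out`.
bool SafeCastInt64ToInt32(const int64_t* in, std::size_t n, int32_t* out) {
  assert(n == 0 || (in != nullptr && out != nullptr));
  constexpr int64_t kMin = std::numeric_limits<int32_t>::min();
  constexpr int64_t kMax = std::numeric_limits<int32_t>::max();
  for (std::size_t i = 0; i < n; ++i) {
    // The hint tells the compiler to lay out the overflow path as cold.
    if (SKETCH_UNLIKELY(in[i] < kMin || in[i] > kMax)) return false;
    out[i] = static_cast<int32_t>(in[i]);
  }
  return true;
}
```

Whether the hint helps or hurts depends on the compiler and on how it interacts with vectorization, so it is worth measuring with each toolchain rather than assuming a benefit.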

@emkornfield
Contributor

> But in general I don't think we should be using the "ursabot benchmark" results (which use gcc) to make conclusions about what perf optimizations are working

Hmm, I'm not sure I understand this completely. I'll start a discussion thread about this on the ML.

@wesm
Member Author

wesm commented Jun 22, 2020

@emkornfield I meant that we should treat the results of "ursabot benchmark" as informational only and certainly not authoritative

@wesm
Member Author

wesm commented Jun 22, 2020

Here's the benchmark comparison with clang-8

$ archery benchmark diff --cc=clang-8 --cxx=clang++-8 2db48b4 653817301 --benchmark-filter=Cast
                              benchmark            baseline           contender  change %  regression
41       CastUInt32ToInt32Safe/262144/1  769.933m items/sec    4.474b items/sec   481.076       False
62       CastUInt32ToInt32Safe/262144/0  409.277m items/sec    2.189b items/sec   434.792       False
15        CastUInt32ToInt32Safe/32768/0  399.127m items/sec    2.089b items/sec   423.357       False
7         CastUInt32ToInt32Safe/32768/1  742.226m items/sec    3.721b items/sec   401.307       False
18        CastInt64ToInt32Safe/262144/0  341.403m items/sec    1.569b items/sec   359.706       False
55    CastUInt32ToInt32Safe/262144/1000  351.335m items/sec    1.612b items/sec   358.917       False
16     CastUInt32ToInt32Safe/32768/1000  334.147m items/sec    1.484b items/sec   344.139       False
47         CastInt64ToInt32Safe/32768/0  328.491m items/sec    1.414b items/sec   330.412       False
30        CastInt64ToInt32Safe/262144/1  742.497m items/sec    3.131b items/sec   321.638       False
51     CastInt64ToInt32Safe/262144/1000  304.244m items/sec    1.226b items/sec   302.928       False
42      CastInt64ToInt32Safe/32768/1000  288.976m items/sec    1.101b items/sec   280.942       False
32         CastInt64ToInt32Safe/32768/1  706.339m items/sec    2.552b items/sec   261.273       False
45       CastDoubleToInt32Safe/262144/1  924.369m items/sec    2.997b items/sec   224.214       False
54       CastInt64ToDoubleSafe/262144/0  419.319m items/sec    1.324b items/sec   215.783       False
11        CastInt64ToDoubleSafe/32768/0  408.425m items/sec    1.216b items/sec   197.672       False
49       CastDoubleToInt32Safe/262144/2  207.799m items/sec  614.658m items/sec   195.795       False
9         CastDoubleToInt32Safe/32768/2  202.480m items/sec  584.558m items/sec   188.699       False
23    CastInt64ToDoubleSafe/262144/1000  375.572m items/sec    1.078b items/sec   186.948       False
50        CastDoubleToInt32Safe/32768/1  869.447m items/sec    2.445b items/sec   181.248       False
21       CastInt64ToDoubleSafe/262144/1  790.625m items/sec    2.222b items/sec   181.101       False
59    CastDoubleToInt32Safe/262144/1000  360.792m items/sec    1.013b items/sec   180.714       False
44     CastInt64ToDoubleSafe/32768/1000  360.492m items/sec  988.897m items/sec   174.319       False
48     CastDoubleToInt32Safe/32768/1000  349.576m items/sec  932.771m items/sec   166.829       False
58       CastDoubleToInt32Safe/262144/0  407.159m items/sec    1.067b items/sec   162.086       False
63        CastInt64ToDoubleSafe/32768/1  746.561m items/sec    1.893b items/sec   153.520       False
8         CastDoubleToInt32Safe/32768/0  395.857m items/sec  990.704m items/sec   150.268       False
67      CastDoubleToInt32Safe/262144/10  275.237m items/sec  612.002m items/sec   122.354       False
10       CastDoubleToInt32Safe/32768/10  266.596m items/sec  583.346m items/sec   118.813       False
69        CastInt64ToInt32Safe/32768/10  232.545m items/sec  449.883m items/sec    93.461       False
64       CastUInt32ToInt32Safe/32768/10  256.845m items/sec  482.636m items/sec    87.909       False
61       CastInt64ToInt32Safe/262144/10  243.012m items/sec  435.232m items/sec    79.099       False
0       CastUInt32ToInt32Safe/262144/10  264.244m items/sec  466.981m items/sec    76.723       False
53       CastInt64ToDoubleSafe/32768/10  278.548m items/sec  441.752m items/sec    58.591       False
1       CastInt64ToDoubleSafe/262144/10  283.181m items/sec  431.990m items/sec    52.549       False
14         CastInt64ToInt32Safe/32768/2  170.844m items/sec  224.195m items/sec    31.228       False
37        CastUInt32ToInt32Safe/32768/2  182.246m items/sec  238.051m items/sec    30.621       False
27       CastUInt32ToInt32Safe/262144/2  187.277m items/sec  231.385m items/sec    23.553       False
28        CastInt64ToInt32Safe/262144/2  175.893m items/sec  216.887m items/sec    23.306       False
26        CastInt64ToDoubleSafe/32768/2  189.465m items/sec  228.996m items/sec    20.864       False
3        CastInt64ToDoubleSafe/262144/2  192.523m items/sec  219.324m items/sec    13.921       False
36       CastInt64ToInt32Unsafe/32768/0    2.993b items/sec    3.227b items/sec     7.800       False
35       CastInt64ToInt32Unsafe/32768/2    2.937b items/sec    3.154b items/sec     7.367       False
68       CastInt64ToInt32Unsafe/32768/1    2.966b items/sec    3.176b items/sec     7.088       False
43      CastInt64ToInt32Unsafe/32768/10    2.940b items/sec    3.142b items/sec     6.899       False
65    CastInt64ToInt32Unsafe/32768/1000    2.943b items/sec    3.139b items/sec     6.647       False
24      CastInt64ToInt32Unsafe/262144/0    3.836b items/sec    4.073b items/sec     6.170       False
2       CastInt64ToInt32Unsafe/262144/2    3.810b items/sec    4.034b items/sec     5.890       False
25      CastInt64ToInt32Unsafe/262144/1    3.837b items/sec    4.061b items/sec     5.843       False
39   CastInt64ToInt32Unsafe/262144/1000    3.789b items/sec    4.009b items/sec     5.806       False
13     CastInt64ToInt32Unsafe/262144/10    3.798b items/sec    4.008b items/sec     5.525       False
57  CastInt64ToDoubleUnsafe/262144/1000    2.477b items/sec    2.586b items/sec     4.386       False
20     CastInt64ToDoubleUnsafe/262144/0    2.503b items/sec    2.606b items/sec     4.145       False
60     CastInt64ToDoubleUnsafe/262144/2    2.487b items/sec    2.587b items/sec     4.005       False
38    CastInt64ToDoubleUnsafe/262144/10    2.484b items/sec    2.580b items/sec     3.874       False
29     CastDoubleToInt32Unsafe/262144/2    3.531b items/sec    3.665b items/sec     3.795       False
34     CastInt64ToDoubleUnsafe/262144/1    2.503b items/sec    2.597b items/sec     3.755       False
52  CastDoubleToInt32Unsafe/262144/1000    3.540b items/sec    3.670b items/sec     3.680       False
33     CastDoubleToInt32Unsafe/262144/0    3.560b items/sec    3.689b items/sec     3.617       False
4     CastDoubleToInt32Unsafe/262144/10    3.546b items/sec    3.667b items/sec     3.390       False
22     CastDoubleToInt32Unsafe/262144/1    3.561b items/sec    3.677b items/sec     3.260       False
6    CastInt64ToDoubleUnsafe/32768/1000    2.102b items/sec    2.168b items/sec     3.140       False
46     CastInt64ToDoubleUnsafe/32768/10    2.106b items/sec    2.169b items/sec     3.006       False
12      CastInt64ToDoubleUnsafe/32768/2    2.111b items/sec    2.173b items/sec     2.935       False
19      CastInt64ToDoubleUnsafe/32768/0    2.138b items/sec    2.191b items/sec     2.475       False
5    CastDoubleToInt32Unsafe/32768/1000    2.874b items/sec    2.944b items/sec     2.432       False
56      CastInt64ToDoubleUnsafe/32768/1    2.134b items/sec    2.183b items/sec     2.293       False
31      CastDoubleToInt32Unsafe/32768/1    2.884b items/sec    2.944b items/sec     2.101       False
40     CastDoubleToInt32Unsafe/32768/10    2.879b items/sec    2.926b items/sec     1.641       False
17      CastDoubleToInt32Unsafe/32768/2    2.884b items/sec    2.926b items/sec     1.453       False
66      CastDoubleToInt32Unsafe/32768/0    2.919b items/sec    2.959b items/sec     1.393       False

Member

@pitrou pitrou left a comment

Thank you, this looks good. Just a couple comments/questions below.

@wesm
Member Author

wesm commented Jun 22, 2020

Out of curiosity, here are the same benchmarks on my laptop with gcc-8 (using the new output formatting from ARROW-9201)

                              benchmark            baseline           contender  change %  regression
46       CastUInt32ToInt32Safe/262144/1  994.591m items/sec    4.487b items/sec   351.118       False
31        CastUInt32ToInt32Safe/32768/1  932.612m items/sec    3.622b items/sec   288.383       False
52        CastInt64ToInt32Safe/262144/1    1.258b items/sec    3.393b items/sec   169.693       False
38         CastInt64ToInt32Safe/32768/1    1.170b items/sec    2.703b items/sec   130.930       False
26    CastUInt32ToInt32Safe/262144/1000  742.665m items/sec    1.217b items/sec    63.901       False
54     CastUInt32ToInt32Safe/32768/1000  696.762m items/sec    1.132b items/sec    62.468       False
43       CastDoubleToInt32Safe/262144/1    1.493b items/sec    2.333b items/sec    56.229       False
66        CastDoubleToInt32Safe/32768/1    1.366b items/sec    2.001b items/sec    46.571       False
61       CastInt64ToDoubleSafe/262144/1    1.489b items/sec    2.162b items/sec    45.245       False
67        CastInt64ToDoubleSafe/32768/1    1.370b items/sec    1.854b items/sec    35.316       False
5         CastUInt32ToInt32Safe/32768/2  204.642m items/sec  253.501m items/sec    23.876       False
40    CastInt64ToDoubleSafe/262144/1000  802.584m items/sec  987.873m items/sec    23.087       False
25       CastUInt32ToInt32Safe/32768/10  451.556m items/sec  546.766m items/sec    21.085       False
24    CastDoubleToInt32Safe/262144/1000  744.363m items/sec  899.943m items/sec    20.901       False
3      CastInt64ToDoubleSafe/32768/1000  753.040m items/sec  906.901m items/sec    20.432       False
53     CastInt64ToInt32Safe/262144/1000  968.574m items/sec    1.158b items/sec    19.578       False
49     CastDoubleToInt32Safe/32768/1000  697.915m items/sec  829.688m items/sec    18.881       False
55      CastInt64ToInt32Safe/32768/1000  893.465m items/sec    1.051b items/sec    17.641       False
57        CastInt64ToDoubleSafe/32768/2  215.963m items/sec  238.114m items/sec    10.257       False
15       CastUInt32ToInt32Safe/262144/2  212.126m items/sec  232.094m items/sec     9.414       False
12       CastInt64ToDoubleSafe/32768/10  443.363m items/sec  483.006m items/sec     8.942       False
18         CastInt64ToInt32Safe/32768/2  219.356m items/sec  236.947m items/sec     8.020       False
7        CastInt64ToDoubleSafe/262144/2  220.790m items/sec  225.300m items/sec     2.043       False
69    CastInt64ToInt32Unsafe/32768/1000    3.262b items/sec    3.304b items/sec     1.278       False
14      CastInt64ToInt32Unsafe/262144/0    4.271b items/sec    4.306b items/sec     0.814       False
22       CastInt64ToInt32Unsafe/32768/2    3.272b items/sec    3.290b items/sec     0.546       False
64        CastInt64ToInt32Safe/262144/2  228.532m items/sec  229.766m items/sec     0.540       False
56      CastUInt32ToInt32Safe/262144/10  467.505m items/sec  469.887m items/sec     0.509       False
21       CastInt64ToInt32Unsafe/32768/0    3.332b items/sec    3.343b items/sec     0.332       False
13      CastInt64ToInt32Unsafe/262144/1    4.292b items/sec    4.300b items/sec     0.195       False
9       CastInt64ToInt32Unsafe/32768/10    3.284b items/sec    3.290b items/sec     0.178       False
20       CastInt64ToInt32Unsafe/32768/1    3.324b items/sec    3.328b items/sec     0.125       False
28    CastInt64ToDoubleUnsafe/262144/10    2.507b items/sec    2.510b items/sec     0.103       False
36   CastInt64ToInt32Unsafe/262144/1000    4.273b items/sec    4.274b items/sec     0.037       False
68      CastInt64ToDoubleUnsafe/32768/1    2.125b items/sec    2.122b items/sec    -0.153       False
39      CastInt64ToDoubleUnsafe/32768/0    2.137b items/sec    2.132b items/sec    -0.208       False
19     CastInt64ToDoubleUnsafe/262144/2    2.511b items/sec    2.505b items/sec    -0.234       False
33     CastInt64ToInt32Unsafe/262144/10    4.288b items/sec    4.271b items/sec    -0.395       False
2       CastInt64ToDoubleUnsafe/32768/2    2.119b items/sec    2.110b items/sec    -0.455       False
58      CastInt64ToInt32Unsafe/262144/2    4.301b items/sec    4.277b items/sec    -0.549       False
48     CastInt64ToDoubleUnsafe/262144/1    2.545b items/sec    2.528b items/sec    -0.651       False
0   CastInt64ToDoubleUnsafe/262144/1000    2.525b items/sec    2.501b items/sec    -0.941       False
45     CastInt64ToDoubleUnsafe/262144/0    2.546b items/sec    2.518b items/sec    -1.098       False
30     CastInt64ToDoubleUnsafe/32768/10    2.131b items/sec    2.097b items/sec    -1.615       False
50   CastInt64ToDoubleUnsafe/32768/1000    2.135b items/sec    2.072b items/sec    -2.954       False
60        CastDoubleToInt32Safe/32768/0  934.302m items/sec  888.511m items/sec    -4.901       False
62        CastInt64ToDoubleSafe/32768/0    1.132b items/sec    1.073b items/sec    -5.285        True
35       CastDoubleToInt32Safe/262144/0    1.012b items/sec  951.534m items/sec    -5.960        True
17        CastInt64ToInt32Safe/32768/10  518.073m items/sec  485.375m items/sec    -6.311        True
37       CastInt64ToDoubleSafe/262144/0    1.229b items/sec    1.141b items/sec    -7.110        True
47      CastInt64ToDoubleSafe/262144/10  455.081m items/sec  418.319m items/sec    -8.078        True
16        CastDoubleToInt32Safe/32768/2  223.095m items/sec  202.291m items/sec    -9.325        True
23       CastDoubleToInt32Safe/262144/2  227.086m items/sec  203.714m items/sec   -10.292        True
59       CastDoubleToInt32Safe/32768/10  463.260m items/sec  389.040m items/sec   -16.021        True
41      CastDoubleToInt32Safe/262144/10  479.642m items/sec  401.058m items/sec   -16.384        True
32       CastInt64ToInt32Safe/262144/10  542.675m items/sec  442.477m items/sec   -18.464        True
4          CastInt64ToInt32Safe/32768/0    1.648b items/sec    1.316b items/sec   -20.173        True
42        CastInt64ToInt32Safe/262144/0    1.894b items/sec    1.444b items/sec   -23.787        True
65   CastDoubleToInt32Unsafe/32768/1000    2.954b items/sec    2.248b items/sec   -23.882        True
44     CastDoubleToInt32Unsafe/32768/10    2.955b items/sec    2.233b items/sec   -24.439        True
51      CastDoubleToInt32Unsafe/32768/0    2.916b items/sec    2.202b items/sec   -24.494        True
63      CastDoubleToInt32Unsafe/32768/1    2.962b items/sec    2.227b items/sec   -24.821        True
8       CastDoubleToInt32Unsafe/32768/2    2.974b items/sec    2.225b items/sec   -25.194        True
6      CastDoubleToInt32Unsafe/262144/2    3.759b items/sec    2.673b items/sec   -28.877        True
1   CastDoubleToInt32Unsafe/262144/1000    3.765b items/sec    2.671b items/sec   -29.063        True
34    CastDoubleToInt32Unsafe/262144/10    3.773b items/sec    2.649b items/sec   -29.802        True
11     CastDoubleToInt32Unsafe/262144/0    3.778b items/sec    2.641b items/sec   -30.089        True
27     CastDoubleToInt32Unsafe/262144/1    3.778b items/sec    2.578b items/sec   -31.763        True
29        CastUInt32ToInt32Safe/32768/0    2.307b items/sec    1.410b items/sec   -38.873        True
10       CastUInt32ToInt32Safe/262144/0    2.728b items/sec    1.521b items/sec   -44.266        True

In the cases where there is a regression, the code went from "very very fast" to "very fast", so it is not likely to make a difference in real-world workloads.

@wesm
Member Author

wesm commented Jun 22, 2020

@wesm wesm closed this in 7038533 Jun 22, 2020
@wesm wesm deleted the ARROW-9197 branch June 22, 2020 18:26
wesm added a commit that referenced this pull request Jun 23, 2020
… diffs, add repetitions argument, don't build unit tests

This uses pandas to generate a sorted text table when using `archery benchmark diff`. Example:

#7506 (comment)

There are some other incidental changes

* pandas is required for `archery benchmark diff`. I don't think there's value in reimplementing the stuff that pandas can do in a few lines of code (read JSON, create a sorted table and print it nicely for us).
* The default # of benchmark repetitions has been changed from 10 to 1 (see ARROW-9155 for context). IMHO more interactive benchmark results are more useful than higher precision. If you need higher precision you can pass `--repetitions=10` on the command line
* `archery benchmark` was building the unit tests unnecessarily. This also occluded a bug, ARROW-9209, which is fixed here

Closes #7516 from wesm/ARROW-9201

Authored-by: Wes McKinney <wesm@apache.org>
Signed-off-by: Wes McKinney <wesm@apache.org>
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021