@Qiyu8 (Contributor) commented Oct 12, 2020

Introduction

Inspired by #2887, this PR takes advantage of universal intrinsics. Compared with the develop branch, the new implementation is about 13x faster with AVX2, about 4x faster with SSE2, and about 6x faster with NEON.
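As a rough illustration of why a vectorized sum is faster, here is a minimal portable C sketch, not the PR's actual code. It keeps four independent partial sums, which mirrors how a 4-lane float register (as in SSE2 or NEON) accumulates, and then reduces them at the end. The function name `sum_f32_lanes4` is hypothetical, and the real kernel uses the universal intrinsics layer rather than plain arrays.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only (not taken from the PR).
 * Sums n floats with four independent accumulators, imitating the
 * lanes of a 4-wide SIMD register. */
float sum_f32_lanes4(const float *x, size_t n)
{
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    /* Main loop: consume one "vector" of 4 elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc[0] += x[i + 0];
        acc[1] += x[i + 1];
        acc[2] += x[i + 2];
        acc[3] += x[i + 3];
    }

    /* Horizontal reduction of the four lanes into one scalar. */
    float total = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; i++)
        total += x[i];

    return total;
}
```

Because the additions happen in a different order than in a plain left-to-right loop, the floating-point result can differ slightly in the last bits from the scalar version.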

Benchmark

Each test ran for 10 loops and the results were averaged. The ratio is the MFlops of the optimized branch divided by the MFlops of the baseline (the larger the ratio, the better the performance).
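The ratio column can be recomputed from the reported numbers. The sketch below assumes MFlops = 2 * n / seconds / 10^6; that formula is inferred from the figures in the tables (for example 2 * 4000 * 10^3 / 0.014063 sec gives roughly 568.87 MFlops) and is not stated in this PR. Both function names are hypothetical.

```c
/* Assumption inferred from the tables, not stated in the PR:
 * the benchmark counts 2 floating-point operations per element. */
double mflops(double n_elements, double seconds)
{
    return 2.0 * n_elements / seconds * 1e-6;
}

/* Ratio as described in the text: optimized MFlops over baseline MFlops. */
double speedup_ratio(double optimized_mflops, double baseline_mflops)
{
    return optimized_mflops / baseline_mflops;
}
```

For instance, `speedup_ratio(7447.40, 568.87)` reproduces the AVX2 entry of about 13.09 for the 4000 x 10^3 case. Some table entries look truncated rather than rounded to two decimals, so small differences in the last digit are expected.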

  • X86 baseline

    data size (10^3)   develop (float)
    4000               568.87 MFlops    0.014063 sec
    8000               564.95 MFlops    0.028321 sec

  • X86 loop unrolling

    data size (10^3)   develop (float)                  ratio
    4000               1820.37 MFlops   0.004395 sec    3.19
    8000               1883.19 MFlops   0.008496 sec    3.33

  • X86-SSE2 enabled

    data size (10^3)   usimd-sum (float)                ratio
    4000               2642.44 MFlops   0.003027 sec    4.64
    8000               2642.40 MFlops   0.006055 sec    4.67

  • X86-AVX2 enabled

    data size (10^3)   usimd-sum (float)                ratio
    4000               7447.40 MFlops   0.001074 sec    13.09
    8000               5649.32 MFlops   0.002832 sec    9.99

  • ARM baseline

    data size (10^3)   develop (float)
    4000               338.97 MFlops    0.023601 sec
    8000               338.20 MFlops    0.047309 sec

  • ARM loop unrolling

    data size (10^3)   usimd-sum (float)                ratio
    4000               1007.34 MFlops   0.007942 sec    2.97
    8000               1009.55 MFlops   0.015849 sec    2.98

  • ARM-NEON enabled

    data size (10^3)   usimd-sum (float)                ratio
    4000               2151.17 MFlops   0.003719 sec    6.34
    8000               2154.59 MFlops   0.007426 sec    6.37

System Info

            Arm                                                                   x86
Hardware    KunPeng
Processor   ARMv8 2.6 GHz, 8 processors                                           Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz
OS          Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64         Windows Server 2008 R2 Enterprise
Compiler    gcc (GCC) 7.3.0                                                       MSVC14.06

@martin-frbg (Collaborator)

Looks good to me, thanks

@martin-frbg merged commit cb4274e into OpenMathLib:develop on Oct 12, 2020

@martin-frbg (Collaborator)

Oops - that code is single-precision only, right? arm/sum.c gets built for both SSUM and DSUM (where -DDOUBLE automagically changes the FLOAT type to double instead of float).
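To make the concern concrete, here is a minimal sketch of one way a single source file could serve both precisions. It is an assumption for illustration, not the fix that was applied: the `typedef` stands in for the FLOAT definition that OpenBLAS provides in its own headers, and `sum_kernel` is a hypothetical name. The float-only fast path is compiled only when DOUBLE is not defined, and the generic loop handles everything else.

```c
#include <stddef.h>

/* Stand-in for the precision switch described above; in OpenBLAS the
 * FLOAT type comes from the library's own headers, not from this file. */
#ifdef DOUBLE
typedef double FLOAT;
#else
typedef float FLOAT;
#endif

/* Hypothetical kernel name, for illustration only. */
FLOAT sum_kernel(const FLOAT *x, size_t n)
{
    FLOAT total = 0;
    size_t i = 0;

#if !defined(DOUBLE)
    /* Single-precision-only fast path: four float accumulators.
     * Guarded so it is never compiled into the double build. */
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (; i + 4 <= n; i += 4) {
        acc[0] += x[i + 0];
        acc[1] += x[i + 1];
        acc[2] += x[i + 2];
        acc[3] += x[i + 3];
    }
    total = (acc[0] + acc[1]) + (acc[2] + acc[3]);
#endif

    /* Generic path: the whole array when DOUBLE is defined,
     * otherwise only the leftover tail elements. */
    for (; i < n; i++)
        total += x[i];

    return total;
}
```

Compiling the same file once without and once with `-DDOUBLE` would then yield a float and a double variant, with the double variant falling back to the plain loop.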
