Optimize the performance of sum by using universal intrinsics #2888

Qiyu8 · 2020-10-12T12:24:27Z

Introduction

Inspired by #2887, we can take advantage of universal intrinsics. Compared with the develop branch, the new implementation is about 13x faster with AVX2, about 4x faster with SSE2, and about 6x faster with NEON.

Benchmark

Each test was run for 10 loops and the results were averaged. The ratio is the MFlops of the optimized branch divided by the MFlops of the baseline (the larger the ratio, the better the performance).
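As a rough check of how the figures in the tables relate to each other, the arithmetic can be sketched in C. The helper names `mflops` and `speedup_ratio` are hypothetical, and the factor of 2 flops per element is an assumption inferred from the reported numbers (2 * 4,000,000 / 0.014063 s gives roughly 568.87 MFlops), not something stated in this thread.

```c
/* Assumption: the benchmark counts 2 flops per element; this reproduces
   the reported baseline (2 * 4e6 / 0.014063 s is about 568.87 MFlops). */
static double mflops(double n_elements, double seconds)
{
    return 2.0 * n_elements / seconds * 1e-6;
}

/* Speedup ratio as used in the tables: optimized throughput divided by
   baseline throughput, so larger means faster. */
static double speedup_ratio(double baseline_mflops, double optimized_mflops)
{
    return optimized_mflops / baseline_mflops;
}
```

For example, `speedup_ratio(568.87, 2642.44)` comes out at about 4.64, matching the X86-SSE2 row for a data size of 4000.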

X86 BaseLine

data size(10^3)	develop(float)
4000	568.87 MFlops 0.014063 sec
8000	564.95 MFlops 0.028321 sec

X86 loop unrolling

data size(10^3)	develop(float)	ratio
4000	1820.37 MFlops 0.004395 sec	3.19
8000	1883.19 MFlops 0.008496 sec	3.33
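For readers unfamiliar with the loop-unrolling variant measured above, here is a minimal sketch of the general technique. It is not the code from this PR; `sum_unrolled4` is a hypothetical name, and the unroll factor of 4 is an arbitrary choice for illustration.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from the PR: sums n floats using four
   independent accumulators so consecutive additions do not depend on
   each other. */
static float sum_unrolled4(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    float total = (s0 + s1) + (s2 + s3);

    /* Tail loop: the remaining 0 to 3 elements. */
    for (; i < n; i++)
        total += x[i];

    return total;
}
```

Because the additions are reordered, the result can differ from a strictly sequential sum in the last few bits for general floating-point input.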

X86-SSE2 enabled

data size(10^3)	usimd-sum(float)	ratio
4000	2642.44 MFlops 0.003027 sec	4.64
8000	2642.40 MFlops 0.006055 sec	4.67

X86-AVX2 enabled

data size(10^3)	usimd-sum(float)	ratio
4000	7447.40 MFlops 0.001074 sec	13.09
8000	5649.32 MFlops 0.002832 sec	9.99

ARM BaseLine

data size(10^3)	develop(float)
4000	338.97 MFlops 0.023601 sec
8000	338.20 MFlops 0.047309 sec

ARM loop unrolling

data size(10^3)	usimd-sum(float)	ratio
4000	1007.34 MFlops 0.007942 sec	2.97
8000	1009.55 MFlops 0.015849 sec	2.98

ARM-Neon enabled

data size(10^3)	usimd-sum(float)	ratio
4000	2151.17 MFlops 0.003719 sec	6.34
8000	2154.59 MFlops 0.007426 sec	6.37

System Info

	Arm	x86
Hardware	KunPeng
Processor	ARMv8 2.6 GHz, 8 processors	Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz
OS	Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64	Windows Server 2008 R2 Enterprise
Compiler	gcc (GCC) 7.3.0	MSVC14.06

martin-frbg · 2020-10-12T21:21:55Z

Looks good to me, thanks

martin-frbg · 2020-10-14T14:09:59Z

Oops - that code is single-precision only, right? arm/sum.c gets built for both SSUM and DSUM (where the -DDOUBLE automagically changes the FLOAT type to double instead of float)
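To illustrate the concern, here is a hedged sketch of a kernel that stays correct whether or not `DOUBLE` is defined. It uses plain SSE2 intrinsics rather than OpenBLAS's universal intrinsics, falls back to scalar code on other targets, and the names `sum_kernel` and `HAVE_SSE2_SKETCH` are hypothetical. The fix actually applied to arm/sum.c is not shown in this thread.

```c
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_SKETCH 1   /* hypothetical guard macro for this sketch */
#endif

/* Mirrors the behaviour described above: -DDOUBLE switches FLOAT. */
#ifdef DOUBLE
typedef double FLOAT;
#else
typedef float FLOAT;
#endif

static FLOAT sum_kernel(const FLOAT *x, size_t n)
{
    size_t i = 0;
    FLOAT total = 0;

#if defined(HAVE_SSE2_SKETCH) && !defined(DOUBLE)
    /* Single precision: 4 floats per 128-bit vector. */
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#elif defined(HAVE_SSE2_SKETCH) && defined(DOUBLE)
    /* Double precision: 2 doubles per 128-bit vector. */
    __m128d acc = _mm_setzero_pd();
    for (; i + 2 <= n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(x + i));
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    total = lanes[0] + lanes[1];
#endif

    /* Scalar tail, and the whole loop when no SIMD path is compiled in. */
    for (; i < n; i++)
        total += x[i];

    return total;
}
```

The point of the sketch is that the float-only vector path is only compiled when `FLOAT` really is `float`; a `DOUBLE` build gets its own path or the scalar loop instead of reinterpreting doubles as floats.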

Optimize the performance of sum by using universal intrinsics

0ed1f07

martin-frbg mentioned this pull request Oct 12, 2020

Use C kernels for x86_64 SUM on Windows #2887

Closed

martin-frbg added this to the 0.3.11 milestone Oct 12, 2020

martin-frbg merged commit cb4274e into OpenMathLib:develop Oct 12, 2020
