HW intrinsic Avx2/Fma accelerated NBody benchmark based on C++ g++ #3 implementation #20184
4creators wants to merge 1 commit into
|
Thanks for the work. Do you have detailed perf data (like VTune)? |
|
We already have a C# implementation of the n-body algorithm here: https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/BenchmarksGame/n-body/n-body-3.cs Ideally, you would first submit an updated version to the benchmark games site and then we could pull it back into the repo from there. |
I know this implementation and have compared results against it. However, I need to move it outside of the coreclr benchmark harness to make more reliable comparisons.
Yes, I tuned the implementation with the help of VTune. I will post detailed info soon.
That would be impossible for this implementation because it is based on Avx2/Fma, and it seems the Benchmarks Game processor does not support anything higher than Sse41/Sse3. |
|
I forgot to add that I will now work on an Sse2/Sse3 implementation for submission to the Benchmarks Game. In that case it may be beneficial to implement the SoA form of the algorithm instead of AoS. |
…tnet#3 implementation
Implementation of the NBody algorithm with some data layout modifications introduced in C++ g++ #3. The benchmark is based on a hand-tuned procedural implementation of the AoS form of the algorithm. Because the data set is small (5 objects only) and the required data layout changes during the calculations, an SoA implementation may provide only limited benefits, 15-18 % at most. On the Haswell architecture the Avx2/Fma vectorized benchmark is almost 2x faster than the partially Sse2/Sse vectorized C++ #3 benchmark. The speedup should be significantly higher on any architecture with more than 16 ymm registers, as some register spills currently impact performance.
0820a10 to
7b6b490
Compare
|
Performance difference in milliseconds between nbody-3 (the current benchmark) and NBodySimdAvxFma. 11 measurements were taken, with 50 000 000 integration steps for each run.
On Windows 10 x64
On WSL Ubuntu 18.04
Hardware i7-4700MQ, COMPlus_TieredCompilation=0, Microsoft.NETCore.App 3.0.0-preview1-26928-03, Windows 10 Pro. |
|
Hi @4creators, could you please close this PR and add the new benchmark to our dotnet/performance repository? This is where we keep all the benchmarks now. Thanks. |
|
@adamsitnik Closing; I will add a PR to the dotnet/performance repo. |
Related issue #16854
Implementation of the NBody algorithm with some data layout modifications
introduced in C++ g++ #3. The benchmark is based on a hand-tuned procedural
implementation of the AoS form of the algorithm. Because the data set is
small (5 objects only) and the required data layout changes during the
calculations, an SoA implementation may provide only limited benefits,
15-18 % at most.
On the Haswell architecture the Avx2/Fma vectorized benchmark is almost 2x
faster than the partially Sse2/Sse vectorized C++ #3 benchmark. The speedup
should be significantly higher on any architecture with more than 16 ymm
registers, as some register spills currently impact performance.
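
For readers unfamiliar with the AoS and SoA terms used above, the two layouts can be sketched roughly as follows. This is a scalar C++ illustration only, not the code from this PR (which is C# using hardware intrinsics); every name in it (`BodyAoS`, `BodiesSoA`, `to_soa`, `distance_squared_aos`, `distance_squared_soa`) is hypothetical. `std::fma` stands in for the fused multiply-add that the Fma instructions provide.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// The benchmark simulates 5 bodies (per the description above).
constexpr std::size_t kBodies = 5;

// AoS (array of structures): one struct per body, so the x, y, z
// components of a single body sit next to each other in memory.
// Hypothetical type, for illustration only.
struct BodyAoS {
    double x, y, z;
    double vx, vy, vz;
    double mass;
};

// SoA (structure of arrays): one array per component, so the x values
// of all bodies sit next to each other in memory.
// Hypothetical type, for illustration only.
struct BodiesSoA {
    std::array<double, kBodies> x{}, y{}, z{};
    std::array<double, kBodies> vx{}, vy{}, vz{};
    std::array<double, kBodies> mass{};
};

// Copy an AoS body array into the SoA layout.
BodiesSoA to_soa(const std::array<BodyAoS, kBodies>& aos) {
    BodiesSoA soa{};
    for (std::size_t i = 0; i < kBodies; ++i) {
        soa.x[i] = aos[i].x;   soa.y[i] = aos[i].y;   soa.z[i] = aos[i].z;
        soa.vx[i] = aos[i].vx; soa.vy[i] = aos[i].vy; soa.vz[i] = aos[i].vz;
        soa.mass[i] = aos[i].mass;
    }
    return soa;
}

// Squared distance between bodies i and j in the AoS layout,
// accumulated with fused multiply-add: dx*dx + (dy*dy + dz*dz).
double distance_squared_aos(const std::array<BodyAoS, kBodies>& b,
                            std::size_t i, std::size_t j) {
    assert(i < kBodies && j < kBodies);
    const double dx = b[i].x - b[j].x;
    const double dy = b[i].y - b[j].y;
    const double dz = b[i].z - b[j].z;
    return std::fma(dx, dx, std::fma(dy, dy, dz * dz));
}

// The same computation in the SoA layout; only the indexing changes.
double distance_squared_soa(const BodiesSoA& b,
                            std::size_t i, std::size_t j) {
    assert(i < kBodies && j < kBodies);
    const double dx = b.x[i] - b.x[j];
    const double dy = b.y[i] - b.y[j];
    const double dz = b.z[i] - b.z[j];
    return std::fma(dx, dx, std::fma(dy, dy, dz * dz));
}
```

In AoS form the three coordinates of one body can be loaded into a single vector register, while in SoA form one load yields the same coordinate for several bodies. With only 5 bodies the SoA vectors cannot be filled evenly, which is one plausible reading of why the description expects limited gains from SoA here.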