[GPU/Full System Test] GPU standalone benchmarking #6484
Conversation
davidrohr left a comment:
looks quite good, some comments inline.
Hi @davidrohr, the error in fullCI seems genuine. Unfortunately, I cannot reproduce it on my two Ubuntu machines; everything runs just fine there.
@mconcas: I think there is no good solution. I believe what happens is that you link to O2CommonUtils, which pulls in some symbols with a mismatched GLIBCXX version. I think the best way to proceed is to get rid of the O2CommonUtils dependency, if that is not too much effort.
Understood. That dependency is there to use TTreeStreamer to save results in a tree; I'll try to do something manually.
@davidrohr: this is currently building on the EPN; I will run tests afterwards. This round can be merged, and I will iterate on it, adding the remaining tests and possible improvements.
OK, good with me. Do you want to partially squash to keep some history, or shall I just squash-merge?
You can squash, it's fine.
Hi @davidrohr,
In this draft I propose a possible implementation of dedicated executables to run generic GPU benchmarks.
Notably, this PR includes:
- `o2-gpu-memory-benchmark-{cuda,hip}` executables, automatically produced based on "GPU autodetection"
- HIP sources automatically generated via the `hipify-perl` script upon changes in the corresponding `.cu` CUDA files
- benchmarks running on the free GPU resident memory

Overall, for the moment it's very simple, but I wanted to be sure I am going in the right direction.
Please let me know what you think.
Cheers,
Matteo
Summary
Kernels are always launched in a one-dimensional fashion, assigning `gridDim.x` to the number of multiprocessors and `blockDim.x` to the number of available threads per block. The names can be imprecise and misleading, so let me explain what I mean:
- `seq`: kernels are launched sequentially, each one starting after the previous one finishes, one per 1 GB (default) partition of scratch, so that each runs with no others in parallel on other partitions. Each kernel assigns each block (`blockIdx.x`) to a different 1 GB (default) partition of scratch and spans it ergodically, iterating with a stride equal to `blockDim.x` (`stride = blockDim * gridDim`).
- `conc`: kernels run at the same time on different slices of scratch, and the benchmarks measure per-slice performance. Each kernel assigns each block (`blockIdx.x`) to a different sub-buffer of scratch (regardless of partitions) and spans it ergodically, iterating with a stride equal to `blockDim.x` (`stride = blockDim * gridDim`).