Collaborator

@mconcas mconcas commented Jun 21, 2021

Hi @davidrohr,
In this draft I propose a possible implementation of dedicated executables to run generic GPU benchmarks.
Notably, this PR has:

  1. Two dedicated binaries, o2-gpu-memory-benchmark-{cuda,hip}, built automatically depending on the result of "GPU autodetection"
  2. Automatic (re-)generation of the HIP code on the fly by means of the hipify-perl script, whenever the corresponding .cu CUDA files change
  3. Automatic allocation of the scratch buffer, sized as a fraction (default 95%) of the free GPU-resident memory
  4. A templated benchmarking class, to select the desired type for the backend, and templated benchmarking functions
  5. A dummy example of a read kernel (to be heavily revised; it is only there to complete a full benchmarking "workflow")
  6. Configurable CLI parameters
  7. Results stored in a ROOT file
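
As a rough illustration of item 3, the sizing arithmetic could look like the sketch below. This is not the code from the PR: `scratchBytes` and `nPartitions` are hypothetical names, and the free-memory figure is passed in as a plain number, whereas on a real device it would presumably come from a runtime query such as `cudaMemGetInfo` or `hipMemGetInfo`. The 1 GB partition size is taken from the description of the tests; treating it as 2^30 bytes is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of bytes to reserve as scratch, given the free
// device memory and the fraction of it to use (0.95 mirrors the stated default).
std::size_t scratchBytes(std::size_t freeBytes, double fraction = 0.95)
{
  assert(fraction > 0.0 && fraction <= 1.0);
  return static_cast<std::size_t>(static_cast<double>(freeBytes) * fraction);
}

// Hypothetical helper: number of whole partitions that fit in the scratch buffer.
// The default partition size of 2^30 bytes is an assumption for "1 GB".
std::size_t nPartitions(std::size_t scratch,
                        std::size_t partitionBytes = std::size_t{1} << 30)
{
  assert(partitionBytes > 0);
  return scratch / partitionBytes; // the incomplete tail is left unused
}
```

For example, with 16 GiB reported as free, `nPartitions(scratchBytes(std::size_t{16} << 30))` yields 15 whole partitions; the remaining fraction of a partition is simply not benchmarked in this sketch.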

Overall, it is still very simple for the moment, but I wanted to be sure I am going in the right direction.

Please let me know what you think.

Cheers,
Matteo

Summary
Kernels are always launched in a one-dimensional fashion, with gridDim.x set to the number of multiprocessors and blockDim.x set to the number of available threads per block.

| | seq, single-block | seq, multi-block | conc, single-block | conc, multi-block |
|---|---|---|---|---|
| read | ✔️ | ✔️ | ✔️ | ✔️ |
| write | ✔️ | ✔️ | ✔️ | ✔️ |
| copy | ✔️ | ✔️ | ✔️ | ✔️ |
| random read | TODO | TODO | TODO | TODO |
| random write | TODO | TODO | TODO | TODO |
| random copy | TODO | TODO | TODO | TODO |

The names may be imprecise or misleading, so let me explain what I mean:

  • seq: kernels are launched one after another, each starting only once the previous one has finished, one per 1 GB (default) partition of scratch, so that each kernel runs with no others in parallel on other partitions.

    • single-block: each block is pinned (by its index, blockIdx.x) to a different 1 GB (default) partition of scratch. To cover the whole partition, its threads iterate with a stride equal to blockDim.x.
    • multi-block: all blocks work on each segment, but in an ordered and strided way (stride = blockDim * gridDim).
  • conc: kernels run at the same time on different slices of scratch. The benchmarks measure per-slice performance.

    • single-block: each block is pinned (by its index, blockIdx.x) to a different sub-buffer of scratch (regardless of partitions). To cover the whole sub-buffer, its threads iterate with a stride equal to blockDim.x.
    • multi-block: the idea is to partition the pool of blocks so that each subset can access a sub-buffer of scratch. All blocks work on each segment, but in an ordered and strided way (stride = blockDim * gridDim).
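
The two access patterns described above can be illustrated with a host-side emulation. The sketch below is only one reading of the description, not the kernels from the PR: the function names are hypothetical, the loops over `blockId` and `threadId` stand in for what the GPU would run in parallel, and each element is incremented so that coverage can be checked afterwards.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// single-block: block `blockId` owns partition `blockId`; thread t of that block
// visits elements t, t + blockDim, t + 2 * blockDim, ... of its partition.
void touchSingleBlock(std::vector<int>& scratch, std::size_t partitionSize,
                      std::size_t gridDim, std::size_t blockDim)
{
  assert(scratch.size() >= gridDim * partitionSize);
  for (std::size_t blockId = 0; blockId < gridDim; ++blockId) {
    int* partition = scratch.data() + blockId * partitionSize;
    for (std::size_t threadId = 0; threadId < blockDim; ++threadId) {
      for (std::size_t i = threadId; i < partitionSize; i += blockDim) {
        ++partition[i];
      }
    }
  }
}

// multi-block: all blocks cooperate on every partition; a thread starts at its
// global index and advances by the total number of threads (blockDim * gridDim).
void touchMultiBlock(std::vector<int>& scratch, std::size_t partitionSize,
                     std::size_t nParts, std::size_t gridDim, std::size_t blockDim)
{
  assert(scratch.size() >= nParts * partitionSize);
  const std::size_t stride = blockDim * gridDim;
  for (std::size_t p = 0; p < nParts; ++p) {
    int* partition = scratch.data() + p * partitionSize;
    for (std::size_t blockId = 0; blockId < gridDim; ++blockId) {
      for (std::size_t threadId = 0; threadId < blockDim; ++threadId) {
        for (std::size_t i = blockId * blockDim + threadId; i < partitionSize; i += stride) {
          ++partition[i];
        }
      }
    }
  }
}

// True if every element was visited exactly once.
bool allTouchedOnce(const std::vector<int>& scratch)
{
  return std::all_of(scratch.begin(), scratch.end(), [](int n) { return n == 1; });
}
```

In both variants every element of a partition is visited exactly once, even when the partition size is not a multiple of the stride; what differs is how many threads share a partition, and therefore how much of the device works on the same memory region at a time.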

Collaborator

@davidrohr davidrohr left a comment

looks quite good, some comments inline.

@mconcas mconcas marked this pull request as ready for review June 23, 2021 15:39
@mconcas mconcas changed the title from "[GPU/Full System Test] Skeleton for GPU standalone benchmarking" to "[GPU/Full System Test] GPU standalone benchmarking" Jul 1, 2021
@mconcas
Collaborator Author

mconcas commented Jul 9, 2021

Hi @davidrohr, the error in fullCI seems genuine. Unfortunately, I cannot reproduce it on my two Ubuntu machines, where everything runs just fine.
Shall I add -lstdc++fs to the hipcc flags, or do you have a better approach?

@davidrohr
Collaborator

@mconcas : I think there is no good solution. I believe what happens is that you link to O2CommonUtils which pulls in some symbols with GLIBCXX version GLIBCXX_3.4.26 corresponding to GCC 9.1 or higher, but hipcc compiles using the system gcc (the one used to build hip itself), which is only GCC 8, so this symbol is not available. You could just pull in the newer cxx library, but that should be quite dangerous since then you have c++ implementations with different ABI.

I think the best way to proceed is to get rid of the O2CommonUtils dependency, if that is not too much effort.

@mconcas
Collaborator Author

mconcas commented Jul 10, 2021

> @mconcas : I think there is no good solution. I believe what happens is that you link to O2CommonUtils which pulls in some symbols with GLIBCXX version GLIBCXX_3.4.26 corresponding to GCC 9.1 or higher, but hipcc compiles using the system gcc (the one used to build hip itself), which is only GCC 8, so this symbol is not available. You could just pull in the newer cxx library, but that should be quite dangerous since then you have c++ implementations with different ABI.
>
> I think the best way to proceed is to get rid of the O2CommonUtils dependency, if that is not too much effort.

Understood. That dependency is there in order to use TTreeStreamer to save the results in a tree. I'll try to do something manually, provided the O2::ROOT dependency does not suffer from the same issue.

@mconcas
Collaborator Author

mconcas commented Jul 14, 2021

@davidrohr : this is currently building on EPN, and I will run the tests afterwards. This round can be merged. I will iterate on this, adding the remaining tests and possible improvements.

@davidrohr
Collaborator

ok, good with me, do you want to partially squash to keep some history, or shall I just squash-merge?

@mconcas
Collaborator Author

mconcas commented Jul 14, 2021

You can squash, it's fine

@davidrohr davidrohr merged commit 4c52d2d into AliceO2Group:dev Jul 14, 2021