[GPU/Full System Test] GPU standalone benchmarking #6484
Conversation
davidrohr left a comment:
looks quite good, some comments inline.
Hi @davidrohr, the error in fullCI seems genuine. Unfortunately, I cannot reproduce it on my two Ubuntu machines; everything runs just fine there.
@mconcas: I think there is no good solution. I believe what happens is that you link to O2CommonUtils, which pulls in some symbols with a mismatched GLIBCXX version. I think the best way to proceed is to get rid of the O2CommonUtils dependency, if that is not too much effort.
Understood. That dependency is there to use TTreeStreamer to save results in a tree; I'll try to do something manually.
@davidrohr: this is currently building on the EPN; I will run tests afterwards. This round can be merged, and I will iterate on it, adding the remaining tests and possible improvements.
OK, good with me. Do you want to partially squash to keep some history, or shall I just squash-merge?
You can squash, it's fine.
Hi @davidrohr,
In this draft I propose a possible implementation of dedicated executables to run generic GPU benchmarks.
Notably, this PR includes:
- `o2-gpu-memory-benchmark-{cuda,hip}` executables, automatically produced based on "GPU autodetection"
- HIP sources automatically generated via the `hipify-perl` script upon changes in the corresponding `.cu` CUDA files
- benchmarks running on the free GPU resident memory

Overall, for the moment it's very simple, but I wanted to be sure I am going in the right direction.
Please let me know what you think.
Cheers,
Matteo
Summary
Kernels are always launched in a one-dimensional fashion, assigning `gridDim.x` to the number of multiprocessors and `blockDim.x` to the number of available threads per block. The names can be imprecise and misleading, so let me explain what I mean:
- `seq`: kernels are launched sequentially, each one starting after the previous one finishes, one per 1 GB (default) partition of scratch, so that each runs with no others in parallel on other partitions. Each kernel assigns each block (`blockIdx.x`) to a different 1 GB (default) partition of scratch and spans it ergodically, iterating with a stride equal to `blockDim.x` (`stride = blockDim * gridDim`).
- `conc`: kernels run at the same time on different slices of scratch, and the benchmarks measure per-slice performance. Each kernel assigns each block (`blockIdx.x`) to a different sub-buffer of scratch (regardless of partitions) and spans it ergodically, iterating with a stride equal to `blockDim.x` (`stride = blockDim * gridDim`).