xor: encode with cuda #51
Merged
This is a first pass to offload redundancy encoding to the GPU.
MPI applications running on systems with GPUs often run a single rank per GPU. This means that only a small number of ranks are available for encoding on each node. To improve performance, this executes the encode logic on the GPU. The current implementation requires a CUDA-aware MPI, because intermediate buffers are sent and received directly from GPU memory, without copies to the host, for higher performance.
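For illustration, here is a minimal sketch of what a CUDA-aware exchange looks like; the function and buffer names are hypothetical, not the PR's actual code:

```c
#include <mpi.h>

/* Hypothetical sketch, not the PR's actual code: with a CUDA-aware MPI,
 * device pointers (from cudaMalloc) can be passed directly to MPI calls,
 * and the library moves the data (e.g., via GPUDirect) without staging
 * copies through host memory. */
void exchange_partial(void* d_send, void* d_recv, int nbytes,
                      int dest, int src, MPI_Comm comm)
{
    MPI_Sendrecv(d_send, nbytes, MPI_BYTE, dest, 0,
                 d_recv, nbytes, MPI_BYTE, src, 0,
                 comm, MPI_STATUS_IGNORE);
}
```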
This adds a new `-DENABLE_CUDA=ON` CMake option to compile with CUDA support. It requires `nvcc` to be detectable by CMake.

The performance improvement is notable for both `XOR` and `RS`. Running 4 procs/node, where each writes a 1GB checkpoint file, the encode time is reduced by about 20x using Nvidia V100s compared to using the CPU.

The changes support both encode and (scalable) decode for `XOR` and `RS`.

The `RS` decode implementation could likely be improved by moving the full Gaussian solve to a kernel to reduce the number of kernel launches. However, the current version is at least functional.

For a multiply, the kernel looks up the log of each of the two operands and then does a third lookup to exponentiate the sum of the logs, so each multiply requires 3 memory loads. For a 1024-thread block, that is 3*1024 memory loads.
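Roughly, that multiply looks like the following device function. This is a sketch, not the PR's exact code; the table names `gf_log` and `gf_exp` are illustrative:

```cuda
// Sketch of the log/exp multiply described above; table names are
// illustrative, not the PR's identifiers. gf_log and gf_exp are
// device-resident GF(2^8) lookup tables, with gf_exp sized to 512
// entries so the summed index needs no modulo.
__device__ unsigned char gf_mult(const unsigned char* gf_log,
                                 const unsigned char* gf_exp,
                                 unsigned char a, unsigned char b)
{
    if (a == 0 || b == 0) {
        return 0; // zero has no logarithm, so handle it explicitly
    }
    // Three memory loads per multiply: log(a), log(b), exp(sum).
    int sum = (int)gf_log[a] + (int)gf_log[b];
    return gf_exp[sum];
}
```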
To scale a set of values by a constant when using GF(2^8), this could be improved by precomputing the full 256-entry multiplication table for that constant, so that each multiply needs only a single memory load. This table could be staged in CUDA shared memory, which would require just 256 memory loads per block, as sketched below.
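A sketch of that idea, with hypothetical names and assuming the 256-entry table row for the constant has already been precomputed:

```cuda
// Sketch of the shared-memory idea (hypothetical names): for a fixed
// constant c, mult_row points to the 256-entry row for c in a
// precomputed GF(2^8) multiplication table. Each block stages the row
// with 256 loads, then each multiply is a single shared-memory load.
__global__ void gf_scale_accum(unsigned char* y, const unsigned char* x,
                               const unsigned char* mult_row, size_t n)
{
    __shared__ unsigned char row[256];
    for (int j = threadIdx.x; j < 256; j += blockDim.x) {
        row[j] = mult_row[j];
    }
    __syncthreads();

    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] ^= row[x[i]]; // addition in GF(2^8) is XOR
    }
}
```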
Adding `nvcc` to GitHub Actions seems a bit complicated:
https://github.com/ptheywood/cuda-cmake-github-actions
There's got to be an easier way, right? If it's this hard, why doesn't Nvidia maintain something official? Good question to pose to Nvidia.
IBM MPI `-pthread` workaround for `nvcc`
The IBM MPI compiler wrappers add `-pthread`, which leads to a fatal error with `nvcc`. As a workaround, this flag can be dropped with a search/replace, as done in: https://github.com/LLNL/blt/blob/aea5fbf046e122bd72888dad0a7f97a07b9ff08d/cmake/thirdparty/SetupMPI.cmake#L111-L119
A cleaner workaround might be to replace `-pthread` with `-Xcompiler -pthread` when building with CUDA: https://stackoverflow.com/questions/43911802/does-nvcc-support-pthread-option-internally
Alternative implementation
An alternative to GPU offloading would be to spawn threads on each MPI process and use more CPU cores. That would require either `MPI_THREAD_MULTIPLE` or at least thread synchronization when executing MPI operations. The benefit is that it would not require memory on the GPU, and it could be used on systems where people are not using all cores (for some reason) but have no GPUs.
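For reference, requesting full thread support at startup is a standard MPI call (generic MPI usage, not code from this PR):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        // Fall back to serializing MPI calls from the encoding threads,
        // e.g., by wrapping each call in a mutex.
        fprintf(stderr, "got thread level %d\n", provided);
    }
    /* ... spawn encoding threads here ... */
    MPI_Finalize();
    return 0;
}
```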