xor: encode with cuda #51
Merged
This is a first pass to offload redundancy encoding to the GPU.
MPI applications running on systems with GPUs often run a single rank per GPU. This means that only a small number of ranks are available for encoding on each node. To improve performance, this executes the encode logic on the GPU. The current implementation requires a CUDA-aware MPI, because intermediate buffers are sent and received directly from GPU memory, without copies to the host, for higher performance.
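For illustration, here is a minimal sketch of what a CUDA-aware exchange looks like; the function and buffer names are hypothetical, not the PR's actual code:

```c
#include <mpi.h>

/* Hypothetical sketch, not the PR's actual code: with a CUDA-aware MPI,
 * device pointers (from cudaMalloc) can be passed directly to MPI calls,
 * and the library moves the data (e.g., via GPUDirect) without staging
 * copies through host memory. */
void exchange_partial(void* d_send, void* d_recv, int nbytes,
                      int dest, int src, MPI_Comm comm)
{
    MPI_Sendrecv(d_send, nbytes, MPI_BYTE, dest, 0,
                 d_recv, nbytes, MPI_BYTE, src, 0,
                 comm, MPI_STATUS_IGNORE);
}
```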
This adds a new `-DENABLE_CUDA=ON` CMake option to compile with CUDA support. It requires `nvcc` to be detectable by CMake.

The performance improvement is notable for both `XOR` and `RS`. Running 4 procs/node, where each writes a 1GB checkpoint file, the encode time is reduced by about 20x using Nvidia V100s compared to using the CPU.

The changes support both encode and (scalable) decode for `XOR` and `RS`.

The `RS` decode implementation could likely be improved by moving the full Gaussian solve to a kernel to reduce the number of kernel launches. However, the current version is at least functional.

For a multiply, the kernel looks up the log of each of the two operands and then does a third lookup to exponentiate the sum of the logs, so each multiply requires 3 memory loads. For a 1024-thread block, that is 3*1024 memory loads.
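Roughly, that multiply looks like the following device function. This is a sketch, not the PR's exact code; the table names `gf_log` and `gf_exp` are illustrative:

```cuda
// Sketch of the log/exp multiply described above; table names are
// illustrative, not the PR's identifiers. gf_log and gf_exp are
// device-resident GF(2^8) lookup tables, with gf_exp sized to 512
// entries so the summed index needs no modulo.
__device__ unsigned char gf_mult(const unsigned char* gf_log,
                                 const unsigned char* gf_exp,
                                 unsigned char a, unsigned char b)
{
    if (a == 0 || b == 0) {
        return 0; // zero has no logarithm, so handle it explicitly
    }
    // Three memory loads per multiply: log(a), log(b), exp(sum).
    int sum = (int)gf_log[a] + (int)gf_log[b];
    return gf_exp[sum];
}
```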
To scale a set of values by a constant when using GF(2^8), this could be improved by precomputing the full 256-entry multiplication table for that constant, so that each multiply needs only a single memory load. This table could be staged in CUDA shared memory, which would require just 256 memory loads per block, as sketched below.
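A sketch of that idea, with hypothetical names and assuming the 256-entry table row for the constant has already been precomputed:

```cuda
// Sketch of the shared-memory idea (hypothetical names): for a fixed
// constant c, mult_row points to the 256-entry row for c in a
// precomputed GF(2^8) multiplication table. Each block stages the row
// with 256 loads, then each multiply is a single shared-memory load.
__global__ void gf_scale_accum(unsigned char* y, const unsigned char* x,
                               const unsigned char* mult_row, size_t n)
{
    __shared__ unsigned char row[256];
    for (int j = threadIdx.x; j < 256; j += blockDim.x) {
        row[j] = mult_row[j];
    }
    __syncthreads();

    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] ^= row[x[i]]; // addition in GF(2^8) is XOR
    }
}
```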
Adding `nvcc` to GitHub Actions seems a bit complicated:
https://github.com/ptheywood/cuda-cmake-github-actions
There's got to be an easier way, right? If it's this hard, why doesn't Nvidia maintain something official? Good question to pose to Nvidia.
IBM MPI `-pthread` workaround for `nvcc`
The IBM MPI compiler wrappers add `-pthread`, which leads to a fatal error with `nvcc`. As a workaround, this flag can be dropped with a search/replace, as done in: https://github.com/LLNL/blt/blob/aea5fbf046e122bd72888dad0a7f97a07b9ff08d/cmake/thirdparty/SetupMPI.cmake#L111-L119
A cleaner workaround might be to replace `-pthread` with `-Xcompiler -pthread` when building with CUDA: https://stackoverflow.com/questions/43911802/does-nvcc-support-pthread-option-internally
Alternative implementation
An alternative to GPU offloading would be to spawn threads on each MPI process and use more CPU cores. That would require either `MPI_THREAD_MULTIPLE` or at least thread synchronization when executing MPI operations. The benefit is that it would not require memory on the GPU, and it could be used on systems where people are not using all cores (for some reason) but have no GPUs.
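For reference, requesting full thread support at startup is a standard MPI call (generic MPI usage, not code from this PR):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        // Fall back to serializing MPI calls from the encoding threads,
        // e.g., by wrapping each call in a mutex.
        fprintf(stderr, "got thread level %d\n", provided);
    }
    /* ... spawn encoding threads here ... */
    MPI_Finalize();
    return 0;
}
```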