@romerojosh
Collaborator

This PR splits compilation of the CUDA kernels in cuDecomp into two files (cudecomp_kernels.cu and cudecomp_kernels_rdc.cu) so that -rdc=true is applied only to the kernels that require it (e.g. kernels that use NVSHMEM device functions). We observed that combining -rdc=true with __launch_bounds__ on our batched memcopy kernels caused an unexpectedly large increase in register usage. Compiling those kernels without -rdc=true resolves the issue and improves memcopy kernel occupancy and performance.
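
For anyone who wants to check a similar effect in their own code, here is a minimal sketch of how register usage and occupancy could be compared for a kernel built with and without -rdc=true. It is not the cuDecomp implementation: the kernel, the `CopyParams` struct, the block size of 256, and the helper name `report_copy_kernel_resources` are all hypothetical and chosen only for illustration. A toy kernel this small may not reproduce the register increase described above.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical block size; not taken from cuDecomp.
constexpr int kThreadsPerBlock = 256;

// Hypothetical descriptor for one copy in a batch. The real cuDecomp
// parameter layout is not shown in this PR.
struct CopyParams {
  const float* src;
  float* dst;
  size_t count;
};

// Illustrative batched copy: blockIdx.y selects the copy, and the threads
// of the x-dimension walk its elements with a grid-stride loop.
__global__ void __launch_bounds__(kThreadsPerBlock)
batched_copy_kernel(const CopyParams* params) {
  const CopyParams p = params[blockIdx.y];
  const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
  for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < p.count; i += stride) {
    p.dst[i] = p.src[i];
  }
}

// Queries registers per thread and the number of resident blocks per SM
// for the kernel as it was actually compiled and linked.
cudaError_t report_copy_kernel_resources(int* num_regs, int* blocks_per_sm) {
  cudaFuncAttributes attr;
  cudaError_t err = cudaFuncGetAttributes(&attr, batched_copy_kernel);
  if (err != cudaSuccess) {
    return err;
  }
  *num_regs = attr.numRegs;
  // No dynamic shared memory is used by this kernel, hence the trailing 0.
  return cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      blocks_per_sm, batched_copy_kernel, kThreadsPerBlock, 0);
}
```

Calling `report_copy_kernel_resources` from a small host program and building it once with plain `nvcc` and once with `nvcc -rdc=true` gives two register counts to compare. Adding `-Xptxas -v` to either build also prints per-kernel register usage at compile time.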

@romerojosh romerojosh merged commit 6e17527 into main Mar 14, 2024
@romerojosh romerojosh deleted the limit_rdc branch March 14, 2024 22:28

2 participants