Enable copies between different devices #3135
CHECK_CUDA(cudaMemcpy(dest.device_write_only(), src.device()+src_offset, num*sizeof(float), cudaMemcpyDeviceToDevice));
{
    if (dest.device_id() != src.device_id())
        CHECK_CUDA(cudaMemcpyPeer(dest.device_write_only(), dest.device_id(), src.device()+src_offset, src.device_id(), num*sizeof(float)));
The CUDA docs are a little unclear about how, or whether, this differs from cudaMemcpy(). What's the difference? My read of the manual is that cudaMemcpy() blocks the host thread until the copy is done but cudaMemcpyPeer() doesn't, and that otherwise they are the same. At least, the manual doesn't call out any other difference.
In any case, if we do this, the docs in gpu_data_abstract.h for this function should be updated, since this change would make them wrong with regard to the blocking behavior.
I see what you mean from that particular doc link. The latest version of the docs seems to clear this up a bit. In the "API synchronization behavior" section, the description of "synchronous" memory copies includes: "For transfers from device memory to device memory, no host-side synchronization is performed."
What I understand from this is that cudaMemcpy with cudaMemcpyDeviceToDevice will not block the host thread, but will still synchronize the device work. The function descriptions for cudaMemcpy and cudaMemcpyPeer both include a note stating they exhibit "synchronous" behavior for most use cases.
I think the asynchronous note in the description of cudaMemcpyPeer is a reminder that the function falls in the same category of behavior as cudaMemcpyDeviceToDevice, and that fully asynchronous behavior requires the async variant of the function.
The docs in gpu_data_abstract.h may already be inaccurate if cudaMemcpyDeviceToDevice doesn't block the host.
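For reference, here is a minimal sketch of what the fully asynchronous variant could look like, using cudaMemcpyPeerAsync on an explicit stream so the host chooses when to wait. The device ids 0 and 1, the buffer size, and the CHECK_SKETCH macro are assumptions made for illustration; none of this is taken from the PR or from dlib.

```cuda
// Sketch: peer copy issued on a stream, so the host decides when to wait.
// Device ids 0 and 1, the buffer size, and CHECK_SKETCH are illustrative
// assumptions, not code from the PR.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK_SKETCH(call)                                            \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    int device_count = 0;
    CHECK_SKETCH(cudaGetDeviceCount(&device_count));
    if (device_count < 2) {
        std::printf("This sketch needs at least two CUDA devices.\n");
        return 0;
    }

    const size_t num = 1 << 20;
    const size_t bytes = num * sizeof(float);
    float* src = nullptr;
    float* dst = nullptr;

    // Source buffer on device 0, zero-filled so the copy reads defined data.
    CHECK_SKETCH(cudaSetDevice(0));
    CHECK_SKETCH(cudaMalloc(reinterpret_cast<void**>(&src), bytes));
    CHECK_SKETCH(cudaMemset(src, 0, bytes));
    CHECK_SKETCH(cudaDeviceSynchronize());  // make sure the fill has finished

    // Destination buffer and stream on device 1.
    CHECK_SKETCH(cudaSetDevice(1));
    CHECK_SKETCH(cudaMalloc(reinterpret_cast<void**>(&dst), bytes));
    cudaStream_t stream;
    CHECK_SKETCH(cudaStreamCreate(&stream));

    // Enqueue the copy; the call returns without waiting for completion.
    CHECK_SKETCH(cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream));

    // ... the host could do unrelated work here ...

    // Explicit wait: only after this is dst known to hold the copied data.
    CHECK_SKETCH(cudaStreamSynchronize(stream));

    CHECK_SKETCH(cudaStreamDestroy(stream));
    CHECK_SKETCH(cudaFree(dst));
    CHECK_SKETCH(cudaSetDevice(0));
    CHECK_SKETCH(cudaFree(src));
    return 0;
}
```

The explicit cudaStreamSynchronize is the point of the sketch: with the async call the wait becomes the caller's responsibility, which is exactly the behavior the header docs would need to describe if the function ever moved to it.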
When peer access is enabled between two devices, kernel launches remain the same and the pointers to different devices can be dereferenced. However, memory copies require cudaMemcpyPeer.
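A small standalone sketch of that pattern follows: query and enable peer access, then copy with cudaMemcpyPeer and verify the result on the host. Device ids 0 and 1, the test values, and the CHECK_SKETCH macro are assumptions for illustration. As far as I understand the runtime, cudaMemcpyPeer also works when peer access cannot be enabled (the copy is then staged rather than direct), so the sketch treats enabling it as optional.

```cuda
// Sketch: copy a buffer from device 0 to device 1 with cudaMemcpyPeer.
// Device ids, sizes, and CHECK_SKETCH are illustrative assumptions.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

#define CHECK_SKETCH(call)                                            \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    int device_count = 0;
    CHECK_SKETCH(cudaGetDeviceCount(&device_count));
    if (device_count < 2) {
        std::printf("This sketch needs at least two CUDA devices.\n");
        return 0;
    }

    const size_t num = 1024;
    const size_t bytes = num * sizeof(float);
    std::vector<float> host_in(num), host_out(num, 0.0f);
    for (size_t i = 0; i < num; ++i) host_in[i] = static_cast<float>(i);

    // Source buffer lives on device 0 and is filled from the host.
    float* src = nullptr;
    CHECK_SKETCH(cudaSetDevice(0));
    CHECK_SKETCH(cudaMalloc(reinterpret_cast<void**>(&src), bytes));
    CHECK_SKETCH(cudaMemcpy(src, host_in.data(), bytes, cudaMemcpyHostToDevice));

    // Destination buffer lives on device 1.
    float* dst = nullptr;
    CHECK_SKETCH(cudaSetDevice(1));
    CHECK_SKETCH(cudaMalloc(reinterpret_cast<void**>(&dst), bytes));

    // Optionally let device 1 access device 0's memory directly.
    int can_access = 0;
    CHECK_SKETCH(cudaDeviceCanAccessPeer(&can_access, 1, 0));
    if (can_access) {
        CHECK_SKETCH(cudaDeviceEnablePeerAccess(0, 0));  // flags must be 0
    }

    // Arguments: destination pointer and device, source pointer and device, byte count.
    CHECK_SKETCH(cudaMemcpyPeer(dst, 1, src, 0, bytes));

    // Device-to-host cudaMemcpy blocks the host, so host_out is ready afterwards.
    CHECK_SKETCH(cudaMemcpy(host_out.data(), dst, bytes, cudaMemcpyDeviceToHost));

    bool ok = true;
    for (size_t i = 0; i < num; ++i) ok = ok && (host_out[i] == host_in[i]);
    std::printf("peer copy %s\n", ok ? "matched" : "MISMATCHED");

    CHECK_SKETCH(cudaFree(dst));
    CHECK_SKETCH(cudaSetDevice(0));
    CHECK_SKETCH(cudaFree(src));
    return ok ? 0 : 1;
}
```

The argument order mirrors the call in the diff: destination pointer and destination device first, then source pointer and source device, then the size in bytes.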