
Enable copies between different devices #3135

Open

kSkip wants to merge 2 commits into davisking:master from kSkip:enable-peer-memcpy

Conversation

@kSkip (Contributor) commented Feb 10, 2026

When peer access is enabled between two devices, kernel launches remain the same and pointers to memory on a different device can be dereferenced. However, memory copies between devices require cudaMemcpyPeer.
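For context, here is a minimal standalone sketch (not dlib code; the device IDs, buffer size, and check helper are placeholders) of how peer access is enabled between two devices and how a cross-device copy is issued with cudaMemcpyPeer:

// Minimal sketch: enable peer access so kernels on device 0 can dereference
// device 1 pointers, then move data across devices with cudaMemcpyPeer.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

static void check(cudaError_t err)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main()
{
    const size_t num = 1024;

    // Peer access is directional and not strictly required for cudaMemcpyPeer
    // itself, but it is what lets a kernel on device 0 read device 1 memory.
    int can_access = 0;
    check(cudaDeviceCanAccessPeer(&can_access, 0, 1));
    if (can_access) {
        check(cudaSetDevice(0));
        cudaError_t err = cudaDeviceEnablePeerAccess(1, 0);
        if (err != cudaErrorPeerAccessAlreadyEnabled)
            check(err);
    }

    float* src = nullptr;  // lives on device 1
    float* dst = nullptr;  // lives on device 0
    check(cudaSetDevice(1));
    check(cudaMalloc(&src, num*sizeof(float)));
    check(cudaSetDevice(0));
    check(cudaMalloc(&dst, num*sizeof(float)));

    // Same-device copies can keep using cudaMemcpy with
    // cudaMemcpyDeviceToDevice; a cross-device copy names both device IDs.
    check(cudaMemcpyPeer(dst, /*dstDevice=*/0, src, /*srcDevice=*/1, num*sizeof(float)));

    check(cudaSetDevice(1));
    check(cudaFree(src));
    check(cudaSetDevice(0));
    check(cudaFree(dst));
    return 0;
}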

-    CHECK_CUDA(cudaMemcpy(dest.device_write_only(), src.device()+src_offset, num*sizeof(float), cudaMemcpyDeviceToDevice));
+    {
+        if (dest.device_id() != src.device_id())
+            CHECK_CUDA(cudaMemcpyPeer(dest.device_write_only(), dest.device_id(), src.device()+src_offset, src.device_id(), num*sizeof(float)));
@davisking (Owner) commented Feb 15, 2026

The CUDA docs are a little unclear about how, or whether, this differs from cudaMemcpy(). What's the difference? My read of the manual is that cudaMemcpy() blocks the host thread until the copy is done but cudaMemcpyPeer() doesn't; other than that they appear to be the same, or at least the manual doesn't call out any other difference.

If we do this, the docs in gpu_data_abstract.h for this function should be updated in any case, since this change would make them wrong with regard to the blocking behavior.

@kSkip (Contributor, Author) commented Feb 15, 2026

I see what you mean from that particular doc link. The latest version of the docs seems to clear this up a bit: under "API synchronization behavior", the description of "Synchronous" memory copies includes "For transfers from device memory to device memory, no host-side synchronization is performed."

What I understand from this is that cudaMemcpy with cudaMemcpyDeviceToDevice will not block the host thread, but it will still synchronize with respect to other device work. The function descriptions for cudaMemcpy and cudaMemcpyPeer both include a note stating that they exhibit "synchronous" behavior for most use cases.

I think the asynchronous note in the description of cudaMemcpyPeer is just a reminder that the function falls in the same category of behavior as cudaMemcpy with cudaMemcpyDeviceToDevice, and that achieving fully asynchronous behavior requires the async variant, cudaMemcpyPeerAsync.
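To illustrate the fully asynchronous path (this is a sketch, not part of this PR; the function, stream, and buffer names are placeholders):

// A cross-device copy that returns to the host immediately; it is ordered
// only with respect to other work queued on `stream`.
#include <cuda_runtime.h>

void async_peer_copy(float* dst, int dst_device,
                     const float* src, int src_device,
                     size_t num, cudaStream_t stream)
{
    cudaMemcpyPeerAsync(dst, dst_device, src, src_device,
                        num*sizeof(float), stream);
}

// Usage: queue the copy and any dependent work on the same stream, then
// synchronize only when the host actually needs the result:
//   cudaStream_t stream; cudaStreamCreate(&stream);
//   async_peer_copy(dst, 0, src, 1, num, stream);
//   cudaStreamSynchronize(stream);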

The docs in gpu_data_abstract.h may already be inaccurate if cudaMemcpyDeviceToDevice doesn't block the host.
