
Enable copies between different devices #3135

Open

kSkip wants to merge 2 commits into davisking:master from kSkip:enable-peer-memcpy

Conversation

@kSkip (Contributor) commented Feb 10, 2026

When peer access is enabled between two devices, kernel launches remain the same and pointers to memory on a different device can be dereferenced. However, memory copies between devices require cudaMemcpyPeer.
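For context, here is a minimal standalone sketch (not dlib code; the device IDs, buffer size, and check helper are placeholders) of how peer access is enabled between two devices and how a cross-device copy is issued with cudaMemcpyPeer:

// Minimal sketch: enable peer access so kernels on device 0 can dereference
// device 1 pointers, then move data across devices with cudaMemcpyPeer.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

static void check(cudaError_t err)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main()
{
    const size_t num = 1024;

    // Peer access is directional and not strictly required for cudaMemcpyPeer
    // itself, but it is what lets a kernel on device 0 read device 1 memory.
    int can_access = 0;
    check(cudaDeviceCanAccessPeer(&can_access, 0, 1));
    if (can_access) {
        check(cudaSetDevice(0));
        cudaError_t err = cudaDeviceEnablePeerAccess(1, 0);
        if (err != cudaErrorPeerAccessAlreadyEnabled)
            check(err);
    }

    float* src = nullptr;  // lives on device 1
    float* dst = nullptr;  // lives on device 0
    check(cudaSetDevice(1));
    check(cudaMalloc(&src, num*sizeof(float)));
    check(cudaSetDevice(0));
    check(cudaMalloc(&dst, num*sizeof(float)));

    // Same-device copies can keep using cudaMemcpy with
    // cudaMemcpyDeviceToDevice; a cross-device copy names both device IDs.
    check(cudaMemcpyPeer(dst, /*dstDevice=*/0, src, /*srcDevice=*/1, num*sizeof(float)));

    check(cudaSetDevice(1));
    check(cudaFree(src));
    check(cudaSetDevice(0));
    check(cudaFree(dst));
    return 0;
}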

-    CHECK_CUDA(cudaMemcpy(dest.device_write_only(), src.device()+src_offset, num*sizeof(float), cudaMemcpyDeviceToDevice));
+    {
+        if (dest.device_id() != src.device_id())
+            CHECK_CUDA(cudaMemcpyPeer(dest.device_write_only(), dest.device_id(), src.device()+src_offset, src.device_id(), num*sizeof(float)));
@davisking (Owner) commented Feb 15, 2026

The CUDA docs are a little unclear about how, or whether, this differs from cudaMemcpy(). What's the difference? My read of the manual is that cudaMemcpy() blocks the host thread until the copy is done but cudaMemcpyPeer() doesn't; other than that they appear to be the same, or at least the manual doesn't call out any other difference.

If we do this, the docs in gpu_data_abstract.h for this function should be updated in any case, since this change would make them wrong with regard to the blocking behavior.

@kSkip (Contributor, Author) commented Feb 15, 2026

I see what you mean from that particular doc link. The latest version of the docs seems to clear this up a bit: under "API synchronization behavior", the description of "Synchronous" memory copies includes "For transfers from device memory to device memory, no host-side synchronization is performed."

What I understand from this is that cudaMemcpy with cudaMemcpyDeviceToDevice will not block the host thread, but it will still synchronize with respect to other device work. The function descriptions for cudaMemcpy and cudaMemcpyPeer both include a note stating that they exhibit "synchronous" behavior for most use cases.

I think the asynchronous note in the description of cudaMemcpyPeer is just a reminder that the function falls in the same category of behavior as cudaMemcpy with cudaMemcpyDeviceToDevice, and that achieving fully asynchronous behavior requires the async variant, cudaMemcpyPeerAsync.
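To illustrate the fully asynchronous path (this is a sketch, not part of this PR; the function, stream, and buffer names are placeholders):

// A cross-device copy that returns to the host immediately; it is ordered
// only with respect to other work queued on `stream`.
#include <cuda_runtime.h>

void async_peer_copy(float* dst, int dst_device,
                     const float* src, int src_device,
                     size_t num, cudaStream_t stream)
{
    cudaMemcpyPeerAsync(dst, dst_device, src, src_device,
                        num*sizeof(float), stream);
}

// Usage: queue the copy and any dependent work on the same stream, then
// synchronize only when the host actually needs the result:
//   cudaStream_t stream; cudaStreamCreate(&stream);
//   async_peer_copy(dst, 0, src, 1, num, stream);
//   cudaStreamSynchronize(stream);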

The docs in gpu_data_abstract.h may already be inaccurate if cudaMemcpyDeviceToDevice doesn't block the host.
