Add handle cache for AMD platform #698

Binyang2014 · 2025-12-04T19:25:30Z

Introduce a handle cache for the AMD platform.
This avoids hitting the handle limit when we open too many IPC handles.

NVIDIA does not need this feature, since it counts handle references internally and reuses the same handle if it is already open.

Copilot

Pull request overview

This PR introduces handle caching for the AMD platform to prevent reaching IPC handle limits when multiple processes open the same handles. NVIDIA GPUs handle reference counting internally, so this optimization is AMD-specific. The implementation uses a thread-safe cache with weak pointers to automatically reuse and release handles.

Key Changes:

Adds custom hash and equality operators for cudaIpcMemHandle_t to enable use in std::unordered_map
Implements getPeerMemoryHandle() function with AMD-specific caching using weak pointers and mutex protection
Refactors RegisteredMemory::Impl to use std::shared_ptr for automatic IPC handle lifetime management
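
The caching idea in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: `FakeIpcHandle` stands in for `cudaIpcMemHandle_t` (an opaque 64-byte blob), `fakeOpen`/`fakeClose` stand in for the driver's open and close calls, and `getPeerMemoryHandleSketch` and the `sketch` namespace are hypothetical names. The hash and equality functors live in a custom namespace and are passed to `std::unordered_map` as template parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <mutex>
#include <unordered_map>

namespace sketch {

// Hypothetical stand-in for cudaIpcMemHandle_t: an opaque 64-byte blob.
struct FakeIpcHandle {
  char reserved[64];
};

// FNV-1a over the raw handle bytes; any reasonable byte hash would do here.
struct FakeIpcHandleHash {
  std::size_t operator()(const FakeIpcHandle& h) const noexcept {
    std::uint64_t hash = 14695981039346656037ULL;
    for (unsigned char c : h.reserved) {
      hash ^= c;
      hash *= 1099511628211ULL;
    }
    return static_cast<std::size_t>(hash);
  }
};

// Byte-wise equality, kept out of the std and global namespaces.
struct FakeIpcHandleEqual {
  bool operator()(const FakeIpcHandle& a, const FakeIpcHandle& b) const noexcept {
    return std::memcmp(a.reserved, b.reserved, sizeof(a.reserved)) == 0;
  }
};

}  // namespace sketch

// Counters standing in for real driver calls, so reuse is observable.
static int g_openCalls = 0;
static int g_closeCalls = 0;

// Assumption: replaces an open call such as cudaIpcOpenMemHandle.
static void* fakeOpen(const sketch::FakeIpcHandle&) {
  ++g_openCalls;
  return new char[1];
}

// Assumption: replaces a close call such as cudaIpcCloseMemHandle.
static void fakeClose(void* ptr) {
  assert(ptr != nullptr);
  ++g_closeCalls;
  delete[] static_cast<char*>(ptr);
}

// Returns a shared mapping for `handle`, opening it only if no live mapping exists.
std::shared_ptr<void> getPeerMemoryHandleSketch(const sketch::FakeIpcHandle& handle) {
  static std::mutex mutex;
  static std::unordered_map<sketch::FakeIpcHandle, std::weak_ptr<void>, sketch::FakeIpcHandleHash,
                            sketch::FakeIpcHandleEqual>
      cache;

  std::lock_guard<std::mutex> lock(mutex);
  auto it = cache.find(handle);
  if (it != cache.end()) {
    if (auto alive = it->second.lock()) return alive;  // still open: reuse it
    cache.erase(it);                                   // expired entry: drop it
  }
  // The deleter closes the mapping when the last shared_ptr owner goes away.
  std::shared_ptr<void> opened(fakeOpen(handle), [](void* p) { fakeClose(p); });
  cache[handle] = opened;
  return opened;
}
```

Because the cache stores only `std::weak_ptr`, it never keeps a mapping alive by itself. In this sketch an expired entry stays in the map until the next lookup for the same handle.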

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
src/registered_memory.cc	Adds hash/equality operators for cudaIpcMemHandle_t, implements getPeerMemoryHandle with AMD-specific caching, updates constructor to use cached handles, and removes manual IPC handle cleanup from destructor
src/include/registered_memory.hpp	Adds peerHandle field to RegisteredMemory::Impl for managing IPC handle lifetime via shared_ptr
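
As a rough illustration of why the destructor no longer needs manual cleanup, the sketch below holds the mapping in a `std::shared_ptr` with a custom deleter. `ImplSketch`, `makeFakeMapping`, and `fakeCloseMapping` are hypothetical names, and the `int` allocation merely stands in for an opened IPC mapping; the real `RegisteredMemory::Impl` has more fields and may differ.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Counts how many times the (fake) mapping has been closed.
static int g_closed = 0;

// Assumption: stands in for a close call such as cudaIpcCloseMemHandle.
static void fakeCloseMapping(void* ptr) {
  assert(ptr != nullptr);
  ++g_closed;
  delete static_cast<int*>(ptr);
}

// Creates a fake mapping whose deleter performs the close.
static std::shared_ptr<void> makeFakeMapping() {
  return std::shared_ptr<void>(new int(0), fakeCloseMapping);
}

// Simplified stand-in for RegisteredMemory::Impl.
struct ImplSketch {
  std::shared_ptr<void> peerHandle;  // owns (a share of) the mapping
  void* data = nullptr;              // non-owning view of the mapped address

  explicit ImplSketch(std::shared_ptr<void> handle)
      : peerHandle(std::move(handle)), data(peerHandle.get()) {}
  // No user-written destructor: releasing peerHandle is enough, and the
  // mapping is closed only when the last owner is destroyed.
};
```

With this shape, several `ImplSketch` objects can share one mapping, and the close runs exactly once, after the last of them is gone.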

src/registered_memory.cc

src/include/registered_memory.hpp

src/registered_memory.cc

Copilot · 2025-12-04T19:34:23Z

@Binyang2014 I've opened a new pull request, #699, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot · 2025-12-04T19:34:38Z

@Binyang2014 I've opened a new pull request, #700, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- [x] Move hash specialization and equality operator from std/global namespace to custom namespace
- [x] Update unordered_map to use custom hash and equality as template parameters
- [x] Add noexcept to equality operator
- [x] Verify the changes build correctly
- [x] Run code review and security checks

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>

Binyang2014 · 2025-12-04T19:46:28Z

/azp run

azure-pipelines · 2025-12-04T19:46:49Z

Azure Pipelines successfully started running 3 pipeline(s).

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

src/registered_memory.cc

src/include/registered_memory.hpp

Binyang2014 · 2025-12-04T23:47:09Z

/azp run

azure-pipelines · 2025-12-04T23:47:28Z

Azure Pipelines successfully started running 3 pipeline(s).

src/registered_memory.cc

test/mp_unit/executor_tests.cc

Binyang2014 · 2025-12-12T07:58:23Z

/azp run

azure-pipelines · 2025-12-12T07:58:42Z

Azure Pipelines successfully started running 3 pipeline(s).

Binyang2014 · 2025-12-13T12:04:25Z

/azp run

azure-pipelines · 2025-12-13T12:04:43Z

Azure Pipelines successfully started running 3 pipeline(s).

Binyang2014 · 2025-12-15T03:44:43Z

/azp run

azure-pipelines · 2025-12-15T03:45:03Z

Azure Pipelines successfully started running 3 pipeline(s).

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

src/registered_memory.cc

python/test/mscclpp_mpi.py

python/test/conftest.py

src/include/logger.hpp

python/test/mscclpp_mpi.py

python/test/conftest.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

python/test/mscclpp_mpi.py

python/test/conftest.py

python/mscclpp/utils.py

python/test/mscclpp_mpi.py

Copilot · 2025-12-18T04:35:46Z

@Binyang2014 I've opened a new pull request, #710, to work on those changes. Once the pull request is ready, I'll request review from you.

Binyang2014 · 2025-12-18T04:40:42Z

/azp run

azure-pipelines · 2025-12-18T04:41:00Z

Azure Pipelines successfully started running 3 pipeline(s).

Binyang2014 · 2025-12-19T02:19:34Z

/azp run

azure-pipelines · 2025-12-19T02:19:55Z

Azure Pipelines successfully started running 3 pipeline(s).

Binyang2014 added 2 commits December 4, 2025 19:20

add ipc cache

70c1d4d

WIP

1739f5a

Binyang2014 requested a review from Copilot December 4, 2025 19:25

Binyang2014 marked this pull request as ready for review December 4, 2025 19:25

Copilot started reviewing on behalf of Binyang2014 December 4, 2025 19:26

Copilot finished reviewing on behalf of Binyang2014 December 4, 2025 19:30

Copilot AI reviewed Dec 4, 2025

Copilot AI mentioned this pull request Dec 4, 2025

[WIP] Address feedback on handle cache implementation for AMD platform #699

Closed

6 tasks

Copilot AI mentioned this pull request Dec 4, 2025

Move cudaIpcMemHandle_t hash and equality to custom namespace #700

Merged

Binyang2014 and others added 3 commits December 4, 2025 11:36

Update src/registered_memory.cc

4ebe37e

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

WIP

2137325

Binyang2014 requested a review from Copilot December 4, 2025 19:44

Copilot started reviewing on behalf of Binyang2014 December 4, 2025 19:44

Copilot finished reviewing on behalf of Binyang2014 December 4, 2025 19:48

Copilot AI reviewed Dec 4, 2025

src/registered_memory.cc

src/registered_memory.cc (outdated)

src/include/registered_memory.hpp

Binyang2014 requested review from chhwang, mahdiehghazim and seagater December 4, 2025 21:24

fix ut

b1029b9

mahdiehghazim reviewed Dec 8, 2025

src/registered_memory.cc

test/mp_unit/executor_tests.cc

Merge branch 'main' into binyli/handle_cache

d97d230

Binyang2014 added 2 commits December 13, 2025 09:35

fix ci

09d6b70

address comment

01fcb32

update

4acf3a9

update for log

e283c5d

chhwang requested a review from Copilot December 16, 2025 22:22

Copilot started reviewing on behalf of chhwang December 16, 2025 22:22

Copilot AI reviewed Dec 16, 2025

Binyang2014 and others added 2 commits December 18, 2025 10:56

Update python/test/conftest.py

a1581e6

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

address comments

9d508a0

Binyang2014 requested a review from Copilot December 18, 2025 04:24

Merge branch 'main' into binyli/handle_cache

3dad149

Copilot started reviewing on behalf of Binyang2014 December 18, 2025 04:24

Copilot AI reviewed Dec 18, 2025

python/test/mscclpp_mpi.py

python/test/conftest.py

python/mscclpp/utils.py

python/test/mscclpp_mpi.py

Copilot AI mentioned this pull request Dec 18, 2025

Clarify automated review feedback on MpiGroup.__del__ implementation #710

Closed

Binyang2014 requested review from chhwang and mahdiehghazim December 18, 2025 04:40

Merge branch 'main' into binyli/handle_cache

98f18b8

chhwang approved these changes Dec 19, 2025

Merge branch 'main' into binyli/handle_cache

60b6fd6

Binyang2014 merged commit eda74a7 into main Dec 22, 2025
14 checks passed

Binyang2014 deleted the binyli/handle_cache branch December 22, 2025 02:39
