Skip to content

[Cuda Ipc] Add barrier at the end of IpcHandleCache::exchangeHandles#4308

Merged
samnordmann merged 4 commits intomainfrom
ipc_fix_bug_key_not_present_in_store
Apr 27, 2025
Merged

[Cuda Ipc] Add barrier at the end of IpcHandleCache::exchangeHandles#4308
samnordmann merged 4 commits intomainfrom
ipc_fix_bug_key_not_present_in_store

Conversation

@samnordmann
Copy link
Collaborator

A synchronization is needed at the IpcHandleCache::exchangeHandles to avoid the exporter exporting twice before the importer delete the key to the store.

Should fix a bug observed in the CI.

cc Team thread

@github-actions
Copy link

github-actions bot commented Apr 24, 2025

Review updated until commit 8ce05dd

Description

  • Added barrier in IpcHandleCache::exchangeHandles to synchronize ranks

  • Re-enabled P2PCommunicationTest.CudaComm test

  • Re-enabled MultiDeviceTest.ShareIpcMemHandles test


Changes walkthrough 📝

Relevant files
Bug fix
ipc_handle.cpp
Add barrier for synchronization                                                   

csrc/multidevice/ipc_handle.cpp

  • Added a barrier at the end of IpcHandleCache::exchangeHandles to
    ensure synchronization
  • +6/-0     
    Tests
    test_multidevice_communications.cpp
    Re-enable CudaComm test                                                                   

    tests/cpp/test_multidevice_communications.cpp

    • Re-enabled P2PCommunicationTest.CudaComm test
    +1/-1     
    test_multidevice_host_ir.cpp
    Re-enable ShareIpcMemHandles test                                               

    tests/cpp/test_multidevice_host_ir.cpp

    • Re-enabled MultiDeviceTest.ShareIpcMemHandles test
    +1/-1     

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Barrier Efficiency

    The added barrier ensures synchronization but may introduce performance overhead. Consider optimizing the barrier to involve only necessary ranks.

    // a second barrier is needed here to ensure all ranks have received the
    // memhandles and the keys are deleted from the store before the next call to
    // exchangeHandles
    // TODO: precisely select what ranks need to wait on that barrier.
    communicator->barrier();
    Test Re-enablement

    The CudaComm test was re-enabled. Ensure that the test is stable and passes consistently before merging.

    TEST_F(P2PCommunicationTest, CudaComm) {
    Test Re-enablement

    The ShareIpcMemHandles test was re-enabled. Ensure that the test is stable and passes consistently before merging.

    TEST_F(MultiDeviceTest, ShareIpcMemHandles) {

    @samnordmann samnordmann requested a review from wujingyue April 24, 2025 13:40
    Copy link
    Collaborator

    @wujingyue wujingyue left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Can you also reenable the tests that I disabled?

    @samnordmann
    Copy link
    Collaborator Author

    !test

    @samnordmann
    Copy link
    Collaborator Author

    !test

    @wujingyue
    Copy link
    Collaborator

    I disabled two: #4283

    (Sorry, I wrote my previous comment on my phone. I should have given you the code pointers)

    @samnordmann
    Copy link
    Collaborator Author

    !test

    @samnordmann samnordmann merged commit 3fe1c32 into main Apr 27, 2025
    53 checks passed
    @samnordmann samnordmann deleted the ipc_fix_bug_key_not_present_in_store branch April 27, 2025 08:58
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants