Skip to content

Conversation

@nusislam
Copy link
Contributor

@nusislam nusislam commented Dec 10, 2025

Details

Experimental enablement of GDA-based alltoall in RCCL via rocshmem integration. Tested on MI300 with Thor2 NICs.

Work item: AICOMRCCL-332

What were the changes?
One sentence describing the work done.

Why were the changes made?
Explain the motivation behind the work. Provide any publicly-available historical context.

How was the outcome achieved?
Technical details behind the work. Explain any publicly-available hardware peculiarities.

Additional Documentation:
What else should the reviewer know?

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

@nusislam nusislam marked this pull request as draft December 10, 2025 22:12
@nusislam nusislam force-pushed the rccl-rocshmem-alltoall-new branch from c8d1d16 to 26126ef Compare December 17, 2025 18:26
@nusislam nusislam marked this pull request as ready for review December 19, 2025 22:37
@thomas-huber thomas-huber self-requested a review December 31, 2025 18:17
@nusislam nusislam requested a review from wenkaidu January 2, 2026 15:59
Comment on lines +19 to +20
all_pipeline = ["0", "1"]
pipelined_types = ["bf16"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is all_pipeline just a placeholder for now? I do not see it being used. OR is it a dup of line 14?
pipelined_types is also defined again below on line 162. Duplicate?

@@ -0,0 +1,39 @@
/*************************************************************************
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the diff between alltoall_gda and alltoall_gda1 impl?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same question.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alltoall_gda1 is a kernel for large messages where we use multiple CUs for the data copies.


#ifdef ENABLE_ROCSHMEM
#include <rocshmem/rocshmem.hpp>
#define NUM_SYM_BUF 8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems potentially dangerous. Can we get this from ROCSHMEM?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a rccl defined number to control the number of symmetric buffers used in rccl.

src/init.cc Outdated
#ifdef ENABLE_ROCSHMEM
/* --- sanity-check print statement for development purposes --- */
if (rcclParamRocshmemEnabled()) { // @TODO - This doesn't seem to disable when I set ROCSHMEM_ENABLE=0 on command line
printf("Initializing rocSHMEM inside of RCCL\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INFO debug print instead?

Comment on lines +2146 to +2147
comm->sourceRshmem[i] = (void *)rocshmem::rocshmem_malloc((size_t)(1*1024*1024));
comm->destRshmem[i] = (void *)rocshmem::rocshmem_malloc((size_t)(1*1024*1024));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constant looks like it should be pulled out and given a name. What are the implications of this sizing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. This size should be >= threshold. I will change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if threshold is set to greater than 1MB? do you clamp it?

url = https://github.com/nlohmann/json.git
ignore = dirty
shallow = true
[submodule "ext-src/rocSHMEM"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I don't hate this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the question I had is about how we are planning on rolling out rocSHMEM changes. Are we planning on using a pinned version of rocSHMEM or running part of RCCL CI tests as part of rocSHMEM's release? I think we need something like that for safety if we're planning on enabling this code path by default as part of our release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We plan to use a pinned version of rocSHMEM.

@alex-breslow-amd
Copy link
Contributor

Can you comment on how this affects compilation time for gfx942 and gfx950?

func_pattern = func_pattern[0]
else:
func_pattern = "AllGather|AllReduce|AllToAllPivot|Broadcast|Reduce|ReduceScatter|SendRecv"
func_pattern = "AllGather|AllReduce|AllToAllPivot|AllToAllGda|AllToAllGda1|Broadcast|Reduce|ReduceScatter|SendRecv"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these new functions built if rocshmem is disabled? I see the guards in the source code, but will this take care of avoiding the build?

find_package_handle_standard_args(rocshmem_static DEFAULT_MSG ROCSHMEM_INCLUDE_DIR ROCSHMEM_LIBRARY)
## mark_as_advanced(MSCCLPP_INCLUDE_DIRS MSCCLPP_NCCL_STATIC_LIB) add this for Rocshmem?

## --- TODO --- Remove this, just use for testing purposes -- ###
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a HARD fatal error. Do you want to still keep it?

src/enqueue.cc Outdated
bool specialized;
};

static int symId = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a race. It is shared across all communicators, and can err in multi-threaded mode.

src/init.cc Outdated
rocshmem::rocshmem_free(comm->sourceRshmem[i]);
rocshmem::rocshmem_free(comm->destRshmem[i]);
}
rocshmem::rocshmem_finalize();
Copy link
Contributor

@mustafabar mustafabar Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is rocshmem::rocshmem_finalize(); safe when multiple comms call finalize?

Comment on lines +2146 to +2147
comm->sourceRshmem[i] = (void *)rocshmem::rocshmem_malloc((size_t)(1*1024*1024));
comm->destRshmem[i] = (void *)rocshmem::rocshmem_malloc((size_t)(1*1024*1024));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if threshold is set to greater than 1MB? do you clamp it?

@nusislam nusislam force-pushed the rccl-rocshmem-alltoall-new branch from 693c20e to 4e57ace Compare January 5, 2026 21:37
@thomas-huber
Copy link
Contributor

thomas-huber commented Jan 6, 2026

Can you comment on how this affects compilation time for gfx942 and gfx950?

Hey @alex-breslow-amd!

On gfx942, a traditional build (./install.sh -l) takes 603.045s and the "with rocshmem" build takes 605.864s (./install.sh -l --rocshmem).

@alex-breslow-amd
Copy link
Contributor

Can you comment on how this affects compilation time for gfx942 and gfx950?

Hey @alex-breslow-amd!

On gfx942, a traditional build (./install.sh -l) takes 603.045s and the "with rocshmem" build takes 605.864s (./install.sh -l --rocshmem).

Great, that is good to hear @thomas-huber and @nusislam!

@alex-breslow-amd
Copy link
Contributor

Sorry for closing by accident.

@nusislam nusislam force-pushed the rccl-rocshmem-alltoall-new branch from 4e57ace to b343b6c Compare January 6, 2026 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants