GDA support for alltoall via rocshmem integration #2099
base: develop
all_pipeline = ["0", "1"]
pipelined_types = ["bf16"]
Is all_pipeline just a placeholder for now? I do not see it being used. Or is it a duplicate of line 14?
pipelined_types is also defined again below on line 162. Is that a duplicate?
@@ -0,0 +1,39 @@
/*************************************************************************
What is the difference between the alltoall_gda and alltoall_gda1 implementations?
I had the same question.
alltoall_gda1 is a kernel for large messages where we use multiple CUs for the data copies.
#ifdef ENABLE_ROCSHMEM
#include <rocshmem/rocshmem.hpp>
#define NUM_SYM_BUF 8
This seems potentially dangerous. Can we get this from ROCSHMEM?
This is an RCCL-defined number that controls how many symmetric buffers RCCL uses.
src/init.cc
Outdated
#ifdef ENABLE_ROCSHMEM
/* --- sanity-check print statement for development purposes --- */
if (rcclParamRocshmemEnabled()) { // @TODO - This doesn't seem to disable when I set ROCSHMEM_ENABLE=0 on command line
  printf("Initializing rocSHMEM inside of RCCL\n");
INFO debug print instead?
comm->sourceRshmem[i] = (void *)rocshmem::rocshmem_malloc((size_t)(1*1024*1024));
comm->destRshmem[i] = (void *)rocshmem::rocshmem_malloc((size_t)(1*1024*1024));
This constant looks like it should be pulled out and given a name. What are the implications of this sizing?
This is a good point. This size should be >= threshold. I will change it.
What happens if the threshold is set to greater than 1 MB? Do you clamp it?
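One way the two points in this thread could be addressed together is sketched below: the buffer size gets a name, and the threshold is clamped to it. This is only an illustration, not the PR's code. `kRshmemBufBytes`, `clampGdaThreshold`, and `fitsGdaPath` are hypothetical names, and the sketch assumes the threshold is the largest message size staged through one symmetric buffer.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical name for the size in bytes of each symmetric staging buffer
// (the diff hard-codes 1*1024*1024 at the rocshmem_malloc call sites).
constexpr std::size_t kRshmemBufBytes = std::size_t{1} << 20;

// Hypothetical helper: clamp a requested threshold so that any message taking
// the GDA path is guaranteed to fit in one staging buffer.
inline std::size_t clampGdaThreshold(std::size_t requestedBytes) {
  return std::min(requestedBytes, kRshmemBufBytes);
}

// Hypothetical helper: true when a message of msgBytes may use the GDA path
// under the (clamped) threshold.
inline bool fitsGdaPath(std::size_t msgBytes, std::size_t requestedThreshold) {
  return msgBytes <= clampGdaThreshold(requestedThreshold);
}
```

The alternative to clamping would be to size the allocation from the threshold at init time; which one is appropriate depends on how much symmetric heap the integration is willing to reserve.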
url = https://github.com/nlohmann/json.git
ignore = dirty
shallow = true
[submodule "ext-src/rocSHMEM"]
To be honest, I don't hate this.
I guess the question I had is about how we are planning on rolling out rocSHMEM changes. Are we planning on using a pinned version of rocSHMEM or running part of RCCL CI tests as part of rocSHMEM's release? I think we need something like that for safety if we're planning on enabling this code path by default as part of our release.
We plan to use a pinned version of rocSHMEM.
Can you comment on how this affects compilation time for gfx942 and gfx950?
src/device/generate.py
Outdated
  func_pattern = func_pattern[0]
else:
  func_pattern = "AllGather|AllReduce|AllToAllPivot|Broadcast|Reduce|ReduceScatter|SendRecv"
  func_pattern = "AllGather|AllReduce|AllToAllPivot|AllToAllGda|AllToAllGda1|Broadcast|Reduce|ReduceScatter|SendRecv"
Are these new functions built if rocshmem is disabled? I see the guards in the source code, but will they also keep these functions out of the build?
find_package_handle_standard_args(rocshmem_static DEFAULT_MSG ROCSHMEM_INCLUDE_DIR ROCSHMEM_LIBRARY)
## mark_as_advanced(MSCCLPP_INCLUDE_DIRS MSCCLPP_NCCL_STATIC_LIB) add this for Rocshmem?

## --- TODO --- Remove this, just use for testing purposes -- ###
It is a HARD fatal error. Do you want to still keep it?
src/enqueue.cc
Outdated
  bool specialized;
};

static int symId = 0;
Looks like a race. It is shared across all communicators, and can err in multi-threaded mode.
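A minimal sketch of one way to avoid the race, assuming symId is used to pick the next of the NUM_SYM_BUF symmetric buffers in round-robin order (the thread does not confirm this). `SymBufCursor` and `kNumSymBuf` are hypothetical names; the idea is to hold the counter per communicator and advance it atomically instead of using a file-scope `static int`.

```cpp
#include <atomic>

// Mirrors the value of the NUM_SYM_BUF define; renamed to avoid clashing with the macro.
constexpr unsigned kNumSymBuf = 8;

// Hypothetical per-communicator cursor replacing the file-scope static symId.
struct SymBufCursor {
  std::atomic<unsigned> next{0};

  // Returns the next buffer index in round-robin order. fetch_add is a single
  // atomic read-modify-write, so two threads can never receive the same ticket.
  // Unsigned wraparound is harmless because 2^32 is a multiple of kNumSymBuf.
  unsigned acquire() {
    return next.fetch_add(1, std::memory_order_relaxed) % kNumSymBuf;
  }
};
```

This only makes index selection safe. Whether a buffer is still in flight when its index comes around again is a separate question that the counter alone does not answer.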
src/init.cc
Outdated
  rocshmem::rocshmem_free(comm->sourceRshmem[i]);
  rocshmem::rocshmem_free(comm->destRshmem[i]);
}
rocshmem::rocshmem_finalize();
Is rocshmem::rocshmem_finalize(); safe when multiple comms call finalize?
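If finalize turns out not to be safe to call once per communicator, a common pattern is a process-wide reference count so that only the last communicator tears the backend down. The sketch below is an assumption-laden illustration: `RshmemRefCount`, `backendInit`, and `backendFinalize` are hypothetical stand-ins (the stand-ins only count calls so the example compiles without rocSHMEM), and it does not address whether the library can be re-initialized after finalize.

```cpp
#include <mutex>

// Stand-ins for the real init/finalize calls; they only count invocations.
static int g_initCalls = 0;
static int g_finalizeCalls = 0;
static void backendInit() { ++g_initCalls; }
static void backendFinalize() { ++g_finalizeCalls; }

// Hypothetical process-wide guard: the first communicator initializes the
// backend and only the last one to be destroyed finalizes it.
class RshmemRefCount {
 public:
  void retain() {
    std::lock_guard<std::mutex> lock(mu_);
    if (count_++ == 0) backendInit();
  }
  void release() {
    std::lock_guard<std::mutex> lock(mu_);
    // Ignore unbalanced releases instead of finalizing twice.
    if (count_ > 0 && --count_ == 0) backendFinalize();
  }

 private:
  std::mutex mu_;
  int count_ = 0;
};
```

Each communicator would call `retain()` during init and `release()` during destroy, in place of calling init and finalize directly.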
Hey @alex-breslow-amd! On gfx942, a traditional build (
Great, that is good to hear @thomas-huber and @nusislam!
Sorry for closing by accident.
Details
Experimental enablement of GDA-based alltoall in RCCL via rocshmem integration. Tested on MI300 with Thor2 NICs.
Work item: AICOMRCCL-332