Fix shared memory offset not reset between CUDA kernels #442
Merged
Conversation
hughperkins
reviewed
Mar 31, 2026
Collaborator
Could you explain step by step:
Contributor
Author
Done (see PR description and unit test description).
Contributor
Author
No AI was involved in this PR, including its description, code comments, and conversations. I take full responsibility for the lines added and removed in this PR. I'm confident that this PR is rock solid and does not introduce any bug that was not preexisting. I won't blame any issue on anybody or anything but me.
Collaborator
Could you update the PR title please?
hughperkins
reviewed
Apr 1, 2026
hughperkins
approved these changes
Apr 1, 2026
Collaborator
hughperkins left a comment
Awesome. Thank you.
gpinkert added a commit to ROCm/quadrants that referenced this pull request Apr 28, 2026
…PU branch)

Ports just the Python entry point of upstream commit e98b7a9 ("Fix shared memory offset not reset between CUDA kernels (Genesis-Embodied-AI#442)") so that Genesis perf/upstream-pulls (which calls it at scene-build time via rigid_solver.py::_build_static_config) doesn't AttributeError against amd-integration quadrants.

Why a shim and not the full commit:
- Genesis only ever calls this with `is_lowerbound_ok=True`.
- For AMDGPU under that flag, the function returns a hardcoded constant (64 KiB LDS, valid for RDNA 2+ and MI300X).
- The CUDA branch of the upstream commit also rewires codegen + GpuGraphManager + shared-array tests; none of that machinery is needed for the AMDGPU `is_lowerbound_ok=True` path.

If someone later needs the exact value (`is_lowerbound_ok=False`) on AMDGPU, the full commit would have to be cherry-picked and adapted to AMDGPU codegen. This branch's sole purpose is to unblock end-to-end testing of the Genesis perf/upstream-pulls bundle against AMD quadrants. See docs/optimization_catalog.md section B.

Co-Authored-By: Alexis DUBURCQ <alexis.duburcq@gmail.com>
This PR fixes two independent bugs related to large shared memory:
[1] CUDA Graph does not support large shared memory at all.
[2] All CUDA kernels of the same compilation unit (i.e. part of the same Quadrants kernel) share the same pool of singleton tensor types for efficiency. The current version was mutating the shape of the tensor types corresponding to shared memory in place, thereby corrupting other tasks being compiled at the same time. As a result of this corruption, the amount of shared memory requested by the other tasks becomes 0 (i.e. `shared_array_bytes = 0` because `tensor_type->get_num_elements() == 0` from then on). Consequently, no shared memory is available to any other task, leading to illegal memory accesses at runtime.
To address bug [1], the flag `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` needed to be toggled on in the kernel context, similar to what we already do for "classical" CUDA kernels.
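For context, here is a minimal sketch of that opt-in using the CUDA driver API (the helper name is mine, not the actual Quadrants codegen; the attribute is the one named above). By default a kernel may request at most 48 KiB of dynamic shared memory, and anything larger must be enabled explicitly per function:

```cpp
// Minimal sketch, not the actual Quadrants codegen: opting a compiled
// kernel into large dynamic shared memory with the CUDA driver API.
#include <cuda.h>

// `func` is a CUfunction previously loaded via cuModuleGetFunction;
// `smem_bytes` is the dynamic shared memory the kernel will request.
CUresult enable_large_dynamic_smem(CUfunction func, int smem_bytes) {
    // Without this opt-in, launches (and graph instantiation) requesting
    // more than the default 48 KiB of dynamic shared memory fail.
    return cuFuncSetAttribute(
        func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, smem_bytes);
}
```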
To address bug [2], a new type instance with the correct dtype and size 0 is created specifically for large shared memory (so-called "dynamically allocated shared memory").
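To illustrate the aliasing problem and the shape of the fix, here is a simplified sketch; `TensorType` is a stand-in for the pooled singleton tensor types, not the real Quadrants class:

```cpp
// Simplified illustration of bug [2]; TensorType is a stand-in for the
// pooled singleton tensor types, not the actual Quadrants class.
#include <memory>
#include <vector>

struct TensorType {
    std::vector<int> shape;
    size_t get_num_elements() const {
        size_t n = 1;
        for (int d : shape) n *= static_cast<size_t>(d);
        return n;
    }
};

// BUGGY: every task compiled from this unit aliases `pooled`, so zeroing
// its shape makes all of them compute shared_array_bytes == 0.
void mark_dynamic_smem_in_place(TensorType* pooled) {
    pooled->shape = {0};  // in-place mutation of the shared singleton
}

// FIXED: leave the pooled singleton untouched and give the dynamically
// allocated shared-memory path its own size-0 instance of the same dtype.
std::unique_ptr<TensorType> make_dynamic_smem_type(const TensorType& pooled) {
    auto fresh = std::make_unique<TensorType>(pooled);  // private copy
    fresh->shape = {0};  // size 0 only on the copy, never on the pool
    return fresh;
}
```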
Note that I just discovered these limitations:
I suggest addressing them at a later stage.