
Fix shared memory offset not reset between CUDA kernels.#442

Merged
duburcqa merged 19 commits into main from duburcqa/fix_shared_mem_offset
Apr 1, 2026

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented Mar 31, 2026

This PR fixes two independent bugs related to large shared memory:
[1] CUDA Graph does not support large shared memory at all.
[2] For efficiency, all CUDA kernels in the same compilation unit (i.e. part of the same Quadrants kernel) share the same pool of singleton tensor types. The current version was mutating in place the shape of the tensor types corresponding to shared memory, thereby corrupting other tasks being compiled at the same time. The effect of the corruption is that the amount of shared memory requested by those other tasks drops to 0 (i.e. shared_array_bytes = 0, because tensor_type->get_num_elements() == 0 from then on). As a result, no shared memory is available for any other task, leading to illegal memory accesses at runtime.
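The corruption in [2] is a shared-mutable-state issue. A minimal Python sketch (hypothetical names, not the actual Quadrants types) of how mutating a pooled singleton in place corrupts a sibling kernel:

```python
# Hypothetical sketch of the aliasing bug: two kernels in the same
# compilation unit receive the SAME pooled TensorType instance, so
# mutating its shape in place corrupts the sibling kernel's view of
# its own shared-memory size.
class TensorType:
    def __init__(self, shape):
        self.shape = list(shape)

    def get_num_elements(self):
        n = 1
        for s in self.shape:
            n *= s
        return n

# One singleton instance per shape, shared across the whole
# compilation unit (the pool of tensor types described above).
pool = {}

def get_tensor_type(shape):
    key = tuple(shape)
    if key not in pool:
        pool[key] = TensorType(shape)
    return pool[key]

kernel_a_type = get_tensor_type([16384])
kernel_b_type = get_tensor_type([16384])
assert kernel_a_type is kernel_b_type  # same pooled instance

# Buggy behavior: kernel A switches its array to dynamic shared memory
# by zeroing the shape of the pooled type IN PLACE...
kernel_a_type.shape = [0]

# ...so kernel B now computes shared_array_bytes = 0 and gets no
# shared memory at runtime.
print(kernel_b_type.get_num_elements())  # prints 0
```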

To address bug [1], the flag CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES needed to be toggled on in the kernel context of the CUDA Graph, similar to what we already do for "classical" CUDA kernels.

To address bug [2], a new type instance with the correct dtype and size 0 is created specifically for large shared memory (so-called "dynamically allocated shared memory"), instead of mutating the pooled singleton in place.
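A minimal sketch of the shape of the fix, again with hypothetical names: allocate a fresh zero-sized type instance for the dynamically allocated shared array and leave the pooled singleton untouched:

```python
# Hypothetical sketch of the fix for bug [2]: build a FRESH size-0
# instance for the dynamic shared array instead of mutating the
# pooled singleton shared by every kernel in the compilation unit.
class TensorType:
    def __init__(self, shape):
        self.shape = list(shape)

    def get_num_elements(self):
        n = 1
        for s in self.shape:
            n *= s
        return n

# Pooled singleton shared by every kernel in the compilation unit.
pooled = TensorType([16384])

def make_dynamic_shared_type(pooled_type):
    # Fixed behavior: a dedicated zero-sized instance for the
    # dynamically allocated shared memory; the pool is left alone.
    return TensorType([0])

dyn = make_dynamic_shared_type(pooled)

print(dyn.get_num_elements())     # prints 0: nothing allocated statically
print(pooled.get_num_elements())  # prints 16384: other kernels unaffected
```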


Note that I just discovered these limitations:

```cpp
// By default, CUDA could allocate up to 48KB static shared arrays.
// It requires dynamic shared memory to allocate a larger array.
// Therefore, when one shared array request for size greater than 48KB,
// we switch it to dynamic allocation.
// In current version, only one dynamic instance is allowed.
// TODO: remove the limit.
constexpr std::size_t cuda_dynamic_shared_array_threshold_bytes = 49152;

// use for auto mesh_local to determine shared-mem size per block (in bytes)
// TODO: get this at runtime
constexpr std::size_t default_shared_mem_size = 65536;
```

I suggest addressing them at a later stage.
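For reference, the decision these constants encode reduces to a simple threshold check. A Python sketch mirroring the quoted C++ values (not the actual codegen logic):

```python
# Simplified sketch of the quoted thresholds: shared arrays larger than
# 48 KiB cannot be allocated statically and must switch to dynamically
# allocated shared memory.
CUDA_DYNAMIC_SHARED_ARRAY_THRESHOLD_BYTES = 49152  # 48 KiB static limit
DEFAULT_SHARED_MEM_SIZE = 65536                    # 64 KiB, hardcoded (TODO: query at runtime)

def needs_dynamic_allocation(shared_array_bytes: int) -> bool:
    """Return True when the request exceeds the static 48 KiB limit."""
    return shared_array_bytes > CUDA_DYNAMIC_SHARED_ARRAY_THRESHOLD_BYTES

print(needs_dynamic_allocation(32 * 1024))  # prints False: fits statically
print(needs_dynamic_allocation(64 * 1024))  # prints True: dynamic allocation
```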

@hughperkins
Collaborator

Could you explain step by step:

  • what is your hypothesis for what the bug is?
  • how this PR addresses it?

@duburcqa
Contributor Author

> Could you explain step by step:

Done (see PR description and unit test description).

@duburcqa
Contributor Author

duburcqa commented Mar 31, 2026

No AI was involved in this PR, including its description, code comments, and conversations. I take full responsibility for the lines added and removed in this PR. I'm confident that it is rock solid and does not introduce any bug that was not preexisting. I won't blame any issue on anybody or anything but me.

@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch 2 times, most recently from f9848ee to 0670652 on March 31, 2026 at 22:36
@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch 2 times, most recently from be5e613 to ec3a152 on April 1, 2026 at 09:20
@hughperkins
Collaborator

Could you update the PR title please?

@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from e35d351 to 0b852cd on April 1, 2026 at 12:03
@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from e51c6a6 to 2818f9f on April 1, 2026 at 12:11
Collaborator

@hughperkins hughperkins left a comment


Awesome. Thank you.

@duburcqa duburcqa merged commit e98b7a9 into main Apr 1, 2026
47 checks passed
@duburcqa duburcqa deleted the duburcqa/fix_shared_mem_offset branch April 1, 2026 13:32
gpinkert added a commit to ROCm/quadrants that referenced this pull request Apr 28, 2026
…PU branch)

Ports just the Python entry point of upstream commit e98b7a9
("Fix shared memory offset not reset between CUDA kernels (Genesis-Embodied-AI#442)") so
that Genesis perf/upstream-pulls (which calls it at scene-build time
via rigid_solver.py::_build_static_config) doesn't raise an AttributeError
against amd-integration quadrants.

Why a shim and not the full commit:
- Genesis only ever calls this with `is_lowerbound_ok=True`.
- For AMDGPU under that flag, the function returns a hardcoded constant
  (64 KiB LDS, valid for RDNA 2+ and MI300X).
- The CUDA branch of the upstream commit also rewires codegen +
  GpuGraphManager + shared-array tests; none of that machinery is
  needed for the AMDGPU is_lowerbound_ok=True path.

If someone later needs the exact value (is_lowerbound_ok=False) on
AMDGPU, the full commit would have to be cherry-picked + adapted to
AMDGPU codegen.

This branch's sole purpose is to unblock end-to-end testing of the
Genesis perf/upstream-pulls bundle against AMD quadrants. See
docs/optimization_catalog.md section B.

Co-Authored-By: Alexis DUBURCQ <alexis.duburcq@gmail.com>
