
Fix shared memory offset not reset between CUDA kernels.#442

Merged
duburcqa merged 19 commits into main from duburcqa/fix_shared_mem_offset
Apr 1, 2026

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented Mar 31, 2026

This PR fixes two independent bugs related to large shared memory:
[1] CUDA Graph does not support large shared memory at all.
[2] For efficiency, all CUDA kernels in the same compilation unit (i.e. part of the same Quadrants kernel) share the same pool of singleton tensor types. The current version was mutating in place the shape of the tensor types corresponding to shared memory, thereby corrupting other tasks being compiled at the same time. The effect of the corruption is that the amount of shared memory requested by those other tasks drops to 0 (i.e. shared_array_bytes = 0, because tensor_type->get_num_elements() == 0 from then on). As a result, no shared memory is available for any other task, leading to illegal memory accesses at runtime.
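The corruption in [2] is a shared-mutable-state issue. A minimal Python sketch (hypothetical names, not the actual Quadrants types) of how mutating a pooled singleton in place corrupts a sibling kernel:

```python
# Hypothetical sketch of the aliasing bug: two kernels in the same
# compilation unit receive the SAME pooled TensorType instance, so
# mutating its shape in place corrupts the sibling kernel's view of
# its own shared-memory size.
class TensorType:
    def __init__(self, shape):
        self.shape = list(shape)

    def get_num_elements(self):
        n = 1
        for s in self.shape:
            n *= s
        return n

# One singleton instance per shape, shared across the whole
# compilation unit (the pool of tensor types described above).
pool = {}

def get_tensor_type(shape):
    key = tuple(shape)
    if key not in pool:
        pool[key] = TensorType(shape)
    return pool[key]

kernel_a_type = get_tensor_type([16384])
kernel_b_type = get_tensor_type([16384])
assert kernel_a_type is kernel_b_type  # same pooled instance

# Buggy behavior: kernel A switches its array to dynamic shared memory
# by zeroing the shape of the pooled type IN PLACE...
kernel_a_type.shape = [0]

# ...so kernel B now computes shared_array_bytes = 0 and gets no
# shared memory at runtime.
print(kernel_b_type.get_num_elements())  # prints 0
```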

To address bug [1], the flag CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES needed to be toggled on in the kernel context of the CUDA Graph, similar to what we already do for "classical" CUDA kernels.

To address bug [2], a new type instance with the correct dtype and size 0 is created specifically for large shared memory (so-called "dynamically allocated shared memory"), instead of mutating the pooled singleton in place.
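A minimal sketch of the shape of the fix, again with hypothetical names: allocate a fresh zero-sized type instance for the dynamically allocated shared array and leave the pooled singleton untouched:

```python
# Hypothetical sketch of the fix for bug [2]: build a FRESH size-0
# instance for the dynamic shared array instead of mutating the
# pooled singleton shared by every kernel in the compilation unit.
class TensorType:
    def __init__(self, shape):
        self.shape = list(shape)

    def get_num_elements(self):
        n = 1
        for s in self.shape:
            n *= s
        return n

# Pooled singleton shared by every kernel in the compilation unit.
pooled = TensorType([16384])

def make_dynamic_shared_type(pooled_type):
    # Fixed behavior: a dedicated zero-sized instance for the
    # dynamically allocated shared memory; the pool is left alone.
    return TensorType([0])

dyn = make_dynamic_shared_type(pooled)

print(dyn.get_num_elements())     # prints 0: nothing allocated statically
print(pooled.get_num_elements())  # prints 16384: other kernels unaffected
```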


Note that I just discovered these limitations:

```cpp
// By default, CUDA could allocate up to 48KB static shared arrays.
// It requires dynamic shared memory to allocate a larger array.
// Therefore, when one shared array request for size greater than 48KB,
// we switch it to dynamic allocation.
// In current version, only one dynamic instance is allowed.
// TODO: remove the limit.
constexpr std::size_t cuda_dynamic_shared_array_threshold_bytes = 49152;

// use for auto mesh_local to determine shared-mem size per block (in bytes)
// TODO: get this at runtime
constexpr std::size_t default_shared_mem_size = 65536;
```

I suggest addressing them at a later stage.
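For reference, the decision these constants encode reduces to a simple threshold check. A Python sketch mirroring the quoted C++ values (not the actual codegen logic):

```python
# Simplified sketch of the quoted thresholds: shared arrays larger than
# 48 KiB cannot be allocated statically and must switch to dynamically
# allocated shared memory.
CUDA_DYNAMIC_SHARED_ARRAY_THRESHOLD_BYTES = 49152  # 48 KiB static limit
DEFAULT_SHARED_MEM_SIZE = 65536                    # 64 KiB, hardcoded (TODO: query at runtime)

def needs_dynamic_allocation(shared_array_bytes: int) -> bool:
    """Return True when the request exceeds the static 48 KiB limit."""
    return shared_array_bytes > CUDA_DYNAMIC_SHARED_ARRAY_THRESHOLD_BYTES

print(needs_dynamic_allocation(32 * 1024))  # prints False: fits statically
print(needs_dynamic_allocation(64 * 1024))  # prints True: dynamic allocation
```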

@hughperkins
Collaborator

Could you explain step by step:

  • what is your hypothesis for what the bug is?
  • how this PR addresses it?

@duburcqa
Contributor Author

> Could you explain step by step:

Done (see PR description and unit test description).

@duburcqa
Contributor Author

duburcqa commented Mar 31, 2026

No AI was involved in this PR, including its description, code comments, and conversations. I take full responsibility for the lines added and removed in this PR. I'm confident that it is rock solid and does not introduce any bug that was not preexisting. I won't blame any issue on anybody or anything but me.

@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch 2 times, most recently from f9848ee to 0670652 on March 31, 2026 at 22:36
@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch 2 times, most recently from be5e613 to ec3a152 on April 1, 2026 at 09:20
@hughperkins
Collaborator

Could you update the PR title please?

@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from e35d351 to 0b852cd on April 1, 2026 at 12:03
@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from e51c6a6 to 2818f9f on April 1, 2026 at 12:11
Collaborator

@hughperkins hughperkins left a comment


Awesome. Thank you.

@duburcqa duburcqa merged commit e98b7a9 into main Apr 1, 2026
47 checks passed
@duburcqa duburcqa deleted the duburcqa/fix_shared_mem_offset branch April 1, 2026 13:32
gpinkert added a commit to ROCm/quadrants that referenced this pull request Apr 28, 2026
…PU branch)

Ports just the Python entry point of upstream commit e98b7a9
("Fix shared memory offset not reset between CUDA kernels (Genesis-Embodied-AI#442)") so
that Genesis perf/upstream-pulls (which calls it at scene-build time
via rigid_solver.py::_build_static_config) doesn't raise an AttributeError
against amd-integration quadrants.

Why a shim and not the full commit:
- Genesis only ever calls this with `is_lowerbound_ok=True`.
- For AMDGPU under that flag, the function returns a hardcoded constant
  (64 KiB LDS, valid for RDNA 2+ and MI300X).
- The CUDA branch of the upstream commit also rewires codegen +
  GpuGraphManager + shared-array tests; none of that machinery is
  needed for the AMDGPU is_lowerbound_ok=True path.

If someone later needs the exact value (is_lowerbound_ok=False) on
AMDGPU, the full commit would have to be cherry-picked + adapted to
AMDGPU codegen.

This branch's sole purpose is to unblock end-to-end testing of the
Genesis perf/upstream-pulls bundle against AMD quadrants. See
docs/optimization_catalog.md section B.

Co-Authored-By: Alexis DUBURCQ <alexis.duburcq@gmail.com>
