RRTMGP memory reduction #3128
…right-sizing RRTMGP radiation consumed excessive GPU memory because all columns were processed at once with O(ncol*nlay*ngpt) internal temporaries. This commit introduces five changes: (1) right-size the Kokkos memory pool (nvar 300->20, use ngpt not nbnd), (2) initialize RRTMGP once instead of every radiation step, (3) process columns in configurable chunks via erf.rad_ncol_chunk (default 5000), (4) remove dead aerosol and cloud optical depth arrays, (5) conditionally allocate diagnostic flux arrays. Expected peak GPU memory reduction is ~10x for typical workloads. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The clear-sky heating-rate kernels write to sw_clrsky_heating and lw_clrsky_heating via async Kokkos parallel_for; populateDatalogMF() then launches a kernel that reads those arrays. Without a fence, the LW kernel's writes can still be in flight when the datalog kernel launches, so the radqrclw column in the radiation datalog varies run-to-run on multi-GPU MPI configurations (reproduced at 2 ranks on both this branch and development). All other datalog fields are already serialized by prior kernel-launch ordering, which is why only radqrclw was affected. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Aerosol optical property arrays are restored as dormant scaffolding so a future aerosol coupling can populate them without re-adding members. The Radiation constructor aborts if erf.rad_do_aerosol = true since the pipeline is not implemented. Updates RRTMGP_Memory_Reduction.md to cover the aerosol abort, the dormant arrays, and the prior fence fix that made radqrclw deterministic. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
// Number of columns per RRTMGP chunk (controls peak GPU memory)
pp.query("rad_ncol_chunk", m_ncol_chunk);
Could we validate m_ncol_chunk > 0 here? It is later used as the chunk-loop increment, so erf.rad_ncol_chunk = 0 would produce an infinite loop, and negative values would also be invalid.
Since this adds a new user-facing input parameter, could we also add erf.rad_ncol_chunk to the normal inputs documentation with its default and recommended tuning guidance?
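A minimal sketch of the requested check, assuming a standalone helper. The name `validated_ncol_chunk` is hypothetical, and ERF would more likely abort through its own error path than throw; the exception only keeps the sketch self-contained.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of ERF): reject non-positive chunk sizes
// before the value is used as the chunk-loop increment, where 0 would
// never advance the loop and a negative value would be meaningless.
inline int validated_ncol_chunk(int requested)
{
    if (requested <= 0) {
        throw std::invalid_argument(
            "erf.rad_ncol_chunk must be > 0, got " + std::to_string(requested));
    }
    return requested;
}
```

Calling such a check right after the `pp.query("rad_ncol_chunk", m_ncol_chunk)` line would surface a bad input at startup rather than as a hang in the first radiation step.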
// These are static lookup tables that never change.
// Size the memory pool for ncol_chunk (not full ncol) to limit peak memory.
if (!rrtmgp::initialized) {
    int ncol_for_pool = std::min(m_ncol_chunk, m_ncol);
Should the pool be sized for m_ncol_chunk rather than min(m_ncol_chunk, m_ncol)? If a later radiation step has more local columns after regridding/load balancing, the chunk loop could use a larger ncol_c than the pool was initialized for.
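To make the bound concrete, here is a sketch of the chunk arithmetic, assuming the loop advances by the chunk size and clamps the last chunk. The function `chunk_ranges` is hypothetical and only mirrors the `ncol_c` naming used in this thread.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper: returns (begin, ncol_c) for each chunk.
// Assumes ncol_chunk > 0; with ncol_chunk == 0 the loop would not terminate.
inline std::vector<std::pair<int, int>> chunk_ranges(int ncol, int ncol_chunk)
{
    std::vector<std::pair<int, int>> ranges;
    for (int beg = 0; beg < ncol; beg += ncol_chunk) {
        // The last chunk may be shorter than ncol_chunk.
        const int ncol_c = std::min(ncol_chunk, ncol - beg);
        ranges.emplace_back(beg, ncol_c);
    }
    return ranges;
}
```

Under these assumptions `ncol_c` never exceeds `ncol_chunk`, whatever the local column count is, so a pool sized for `ncol_chunk` covers every later step. A pool sized for `min(ncol_chunk, ncol)` at first initialization covers later steps only if the local column count never grows past that first value.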
**File:** `ERF_RRTMGP_Interface.cpp`

The pool multiplier `nvar` was reduced from 300 to 20, and `nbnd` was replaced with `ngpt`
This says nvar was reduced to 20, but the implementation appears to use const size_t nvar = 12 in ERF_RRTMGP_Interface.cpp. Could we either update the doc or confirm that 12 is sufficient?
Thanks for the catch. I originally set 20, and Aaron Lattanzi set 12 in his test:
11c9c65#diff-a65ccaf90702def79896f166e2417f8de34d1a950528b8ec7c13d1fe0c3ed3beR253
I'll update the md file.
populated with real data — see section 4). The flag is preserved so the abort can be
removed once a real aerosol coupling lands.

### 7. Fence Before Datalog Read to Eliminate Cross-Stream Race
I may be missing it, but I do not see the corresponding Kokkos::fence() in the code changes. Could we verify that the fence is actually inserted before populateDatalogMF()? If it is present, maybe mention the exact function/location here.
This has been added by Aaron Lattanzi:
11c9c65#diff-69f75840f3d92b0736398c62970a74839d86b83c267a2f40014bbf22ffe3c479R1215
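The ordering requirement behind the fence can be illustrated with a host-side analogy. This is not Kokkos code and not the change in the linked commit: `std::async` stands in for an asynchronous kernel launch, `future::wait()` stands in for the fence, and the function name `sum_after_fence` is hypothetical.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Host-side analogy only (assumption: std::async models an async kernel
// launch and future::wait() models a fence).
inline double sum_after_fence(std::size_t n)
{
    std::vector<double> heating(n, 0.0);

    // "Producer kernel": fills the array on another thread.
    std::future<void> producer = std::async(std::launch::async, [&heating] {
        for (double& h : heating) { h = 1.0; }
    });

    // "Fence": block until the producer's writes are complete and visible.
    producer.wait();

    // "Consumer": reading before the wait above would be a data race.
    double sum = 0.0;
    for (double h : heating) { sum += h; }
    return sum;
}
```

Dropping the `wait()` call would let the consumer read a partly written array, which is the same class of run-to-run variation reported for `radqrclw`.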
// --- Create chunk gas concentrations by subsetting from full gas_concs ---
gas_concs_t gas_concs_c;
gas_concs_c.init(gas_names_offset, ncol_c, nlay);
for (int igas = 0; igas < m_ngas; ++igas) {
This allocates and fills a full-domain vmr_full for every gas and every chunk. Could we move the full-gas get_vmr() calls outside the chunk loop, or expose a subview path, so chunking does not repeatedly rebuild full-size VMR arrays?
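One way to picture the hoisting suggestion is the sketch below, which uses plain `std::vector` instead of the real gas-concentration types. `GasSource`, `fetch_full`, and `process_hoisted` are hypothetical stand-ins, and the fetch counter exists only to show how often the full-domain array is rebuilt.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a full-domain fetch such as get_vmr().
struct GasSource {
    int ncol;
    int fetch_count;
    explicit GasSource(int n) : ncol(n), fetch_count(0) {}
    std::vector<double> fetch_full(int igas) {
        ++fetch_count;  // count how many full-size arrays get built
        return std::vector<double>(static_cast<std::size_t>(ncol), 1.0 + igas);
    }
};

// Fetch each gas once, then copy per-chunk slices from the cached arrays.
// Returns the number of chunks processed. Assumes ncol_chunk > 0.
inline int process_hoisted(GasSource& src, int ngas, int ncol_chunk)
{
    std::vector<std::vector<double>> full(static_cast<std::size_t>(ngas));
    for (int igas = 0; igas < ngas; ++igas) {
        full[igas] = src.fetch_full(igas);
    }

    int nchunks = 0;
    for (int beg = 0; beg < src.ncol; beg += ncol_chunk) {
        const int end = std::min(beg + ncol_chunk, src.ncol);
        for (int igas = 0; igas < ngas; ++igas) {
            // Chunk-sized copy; real code would pass this to the chunk's gas_concs.
            std::vector<double> vmr_c(full[igas].begin() + beg,
                                      full[igas].begin() + end);
            (void)vmr_c;
        }
        ++nchunks;
    }
    return nchunks;
}
```

In this sketch the number of full-size fetches drops from `ngas * nchunks` to `ngas`, at the cost of keeping all `ngas` full arrays resident during the loop. A subview path, the other option raised above, would avoid both the repeated fetches and the extra resident copies.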
This PR addresses two issues in RRTMGP: (1) out-of-memory (OOM) failures and (2) a race condition, a bug unrelated to the OOM issue. The work was done with the assistance of Claude Code. Details are in Source/PhysicsInterfaces/Radiation/RRTMGP_Memory_Reduction.md. Testing results:
Old:
- 8 GPU: OOM
- 16 GPU: OOM
- 32 GPU: OOM
- 64 GPU: works
New:
- 1 GPU: OOM
- 2 GPU: works
- old 2 GPU vs old 2 GPU (same input, two runs): differ
- new 1 GPU vs old 1 GPU: match
- new 2 GPU vs new 1 GPU: match
- new 2 GPU vs new 2 GPU: match