RRTMGP memory reduction by wang1202 · Pull Request #3128 · erf-model/ERF

wang1202 · 2026-04-22T14:53:52Z

This PR addresses two issues in RRTMGP: (1) the out-of-memory (OOM) failure and (2) a race condition (a bug unrelated to the OOM). The work was done with the assistance of Claude Code. Details are in Source/PhysicsInterfaces/Radiation/RRTMGP_Memory_Reduction.md. Test results follow:

Peak GPU memory (512 x 512 x 42 grid)

Old:
- 8 GPU: OOM
- 16 GPU: OOM
- 32 GPU: OOM
- 64 GPU: works

New:
- 1 GPU: OOM
- 2 GPU: works

Datalog determinism (fence before populateDatalogMF)

- old 2 GPU vs old 2 GPU (same input, two runs): differ
- new 1 GPU vs old 1 GPU: match
- new 2 GPU vs new 1 GPU: match
- new 2 GPU vs new 2 GPU: match

…right-sizing

RRTMGP radiation consumed excessive GPU memory because all columns were processed at once with O(ncol*nlay*ngpt) internal temporaries. This commit introduces five changes:

(1) right-size the Kokkos memory pool (nvar 300->20, use ngpt not nbnd),
(2) initialize RRTMGP once instead of every radiation step,
(3) process columns in configurable chunks via erf.rad_ncol_chunk (default 5000),
(4) remove dead aerosol and cloud optical depth arrays,
(5) conditionally allocate diagnostic flux arrays.

Expected peak GPU memory reduction is ~10x for typical workloads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
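To make the O(ncol*nlay*ngpt) scaling concrete, here is a rough sketch of how the size of a single temporary changes once only one chunk of columns is resident at a time. The function name `temp_bytes` is hypothetical, double precision is assumed, and the value of `ngpt` used in the usage note is an illustrative assumption, not a number taken from this PR.

```cpp
#include <algorithm>
#include <cstddef>

// Bytes needed by one double-precision temporary of shape (ncol, nlay, ngpt).
// With chunking, at most ncol_chunk columns are resident at once, so the
// column dimension is clamped to the chunk size.
std::size_t temp_bytes(std::size_t ncol, std::size_t nlay, std::size_t ngpt,
                       std::size_t ncol_chunk) {
    const std::size_t ncol_resident = std::min(ncol, ncol_chunk);
    return ncol_resident * nlay * ngpt * sizeof(double);
}
```

For example, with 42 layers and an assumed 256 g-points, `temp_bytes(10000, 42, 256, 5000)` gives half the footprint of the unchunked case `temp_bytes(10000, 42, 256, 10000)`. The saving grows with the ratio of local columns to chunk size.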

The clear-sky heating-rate kernels write to sw_clrsky_heating and lw_clrsky_heating via async Kokkos parallel_for; populateDatalogMF() then launches a kernel that reads those arrays. Without a fence, the LW kernel's writes can still be in flight when the datalog kernel launches, so the radqrclw column in the radiation datalog varies run-to-run on multi-GPU MPI configurations (reproduced at 2 ranks on both this branch and development). All other datalog fields are already serialized by prior kernel-launch ordering, which is why only radqrclw was affected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
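The ordering problem described above can be illustrated with a toy model. This is not Kokkos code: `ToyQueue` and `read_heating` are hypothetical names, and the queue simply defers work until `fence()` is called. The model is deterministic, so it always shows the stale read, whereas the real race only shows it on some runs. In the actual code the equivalent synchronization point would be a `Kokkos::fence()` before the datalog read.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for an asynchronous execution queue: launch() only records
// work, and fence() runs everything recorded so far.
class ToyQueue {
public:
    void launch(std::function<void()> kernel) { pending_.push_back(std::move(kernel)); }
    void fence() {
        for (auto& kernel : pending_) kernel();
        pending_.clear();
    }
private:
    std::vector<std::function<void()>> pending_;
};

// Returns the value a reader observes for one heating-rate entry,
// with or without a fence between the write and the read.
double read_heating(bool fence_before_read) {
    std::vector<double> lw_clrsky_heating(1, 0.0);      // stale initial value
    ToyQueue queue;
    queue.launch([&] { lw_clrsky_heating[0] = 2.5; });  // deferred "kernel" write
    if (fence_before_read) queue.fence();
    const double seen = lw_clrsky_heating[0];           // the "datalog" read
    queue.fence();  // drain remaining work before the vector goes out of scope
    return seen;
}
```

In this model `read_heating(true)` observes the written value, while `read_heating(false)` observes the stale initial value.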

Aerosol optical property arrays are restored as dormant scaffolding so a future aerosol coupling can populate them without re-adding members. The Radiation constructor aborts if erf.rad_do_aerosol = true since the pipeline is not implemented. Updates RRTMGP_Memory_Reduction.md to cover the aerosol abort, the dormant arrays, and the prior fence fix that made radqrclw deterministic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

pressel · 2026-05-13T19:56:19Z


+    // Number of columns per RRTMGP chunk (controls peak GPU memory)
+    pp.query("rad_ncol_chunk", m_ncol_chunk);
+


Could we validate m_ncol_chunk > 0 here? It is later used as the chunk-loop increment, so erf.rad_ncol_chunk = 0 would produce an infinite loop, and negative values would also be invalid.

Since this adds a new user-facing input parameter, could we also add erf.rad_ncol_chunk to the normal inputs documentation with its default and recommended tuning guidance?
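A minimal sketch of the requested check, written with the standard library only so it stands alone. The helper names `validated_chunk` and `chunk_ranges` are hypothetical. In ERF itself the check would more likely use an AMReX assertion or abort right after the `pp.query` call.

```cpp
#include <algorithm>
#include <stdexcept>
#include <utility>
#include <vector>

// Reject non-positive chunk sizes before they reach the chunk loop.
int validated_chunk(int ncol_chunk) {
    if (ncol_chunk <= 0) {
        throw std::invalid_argument("erf.rad_ncol_chunk must be > 0");
    }
    return ncol_chunk;
}

// Half-open [begin, end) column ranges that cover [0, ncol).
std::vector<std::pair<int, int>> chunk_ranges(int ncol, int ncol_chunk) {
    const int step = validated_chunk(ncol_chunk);
    std::vector<std::pair<int, int>> ranges;
    int begin = 0;
    while (begin < ncol) {
        // Clamp the last chunk; computing the length first avoids int overflow.
        const int len = std::min(step, ncol - begin);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}
```

With the guard in place, a zero or negative chunk size fails immediately with a clear message instead of hanging in the loop.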

pressel · 2026-05-13T19:57:08Z

+    // These are static lookup tables that never change.
+    // Size the memory pool for ncol_chunk (not full ncol) to limit peak memory.
+    if (!rrtmgp::initialized) {
+        int ncol_for_pool = std::min(m_ncol_chunk, m_ncol);


Should the pool be sized for m_ncol_chunk rather than min(m_ncol_chunk, m_ncol)? If a later radiation step has more local columns after regridding/load balancing, the chunk loop could use a larger ncol_c than the pool was initialized for.
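The concern can be checked with a small model of the two sizing policies. All function names below are hypothetical, and the model assumes the pool is sized once at initialization and never resized, as the `if (!rrtmgp::initialized)` guard suggests.

```cpp
#include <algorithm>

// Pool capacity (in columns) under the two sizing policies discussed.
int pool_cols_min(int ncol_chunk, int ncol_at_init) {
    return std::min(ncol_chunk, ncol_at_init);
}
int pool_cols_chunk(int ncol_chunk) { return ncol_chunk; }

// Largest chunk the loop can request for a given local column count.
int max_chunk_cols(int ncol_chunk, int ncol_now) {
    return std::min(ncol_chunk, ncol_now);
}

// True if a later step could ask for more columns than the pool was sized for.
bool pool_too_small(int pool_cols, int ncol_chunk, int ncol_later) {
    return max_chunk_cols(ncol_chunk, ncol_later) > pool_cols;
}
```

For instance, with a chunk size of 5000, a rank that starts with 3000 columns and later holds 4500 would outgrow a pool sized by `pool_cols_min`, but not one sized by `pool_cols_chunk`.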

pressel · 2026-05-13T19:58:21Z

+
+**File:** `ERF_RRTMGP_Interface.cpp`
+
+The pool multiplier `nvar` was reduced from 300 to 20, and `nbnd` was replaced with `ngpt`


This says nvar was reduced to 20, but the implementation appears to use const size_t nvar = 12 in ERF_RRTMGP_Interface.cpp. Could we either update the doc or confirm that 12 is sufficient?

wang1202 · 2026-05-15

Thanks for the catch. I originally set it to 20, and Aaron Lattanzi set it to 12 in his test:
11c9c65#diff-a65ccaf90702def79896f166e2417f8de34d1a950528b8ec7c13d1fe0c3ed3beR253

I'll update the md file.

pressel · 2026-05-13T20:23:33Z

+populated with real data — see section 4). The flag is preserved so the abort can be
+removed once a real aerosol coupling lands.
+
+### 7. Fence Before Datalog Read to Eliminate Cross-Stream Race


I may be missing it, but I do not see the corresponding Kokkos::fence() in the code changes. Could we verify that the fence is actually inserted before populateDatalogMF()? If it is present, maybe mention the exact function/location here.

wang1202 · 2026-05-15

This has been added by Aaron Lattanzi:
11c9c65#diff-69f75840f3d92b0736398c62970a74839d86b83c267a2f40014bbf22ffe3c479R1215

pressel · 2026-05-13T20:27:14Z

+        // --- Create chunk gas concentrations by subsetting from full gas_concs ---
+        gas_concs_t gas_concs_c;
+        gas_concs_c.init(gas_names_offset, ncol_c, nlay);
+        for (int igas = 0; igas < m_ngas; ++igas) {


This allocates and fills a full-domain vmr_full for every gas and every chunk. Could we move the full-gas get_vmr() calls outside the chunk loop, or expose a subview path, so chunking does not repeatedly rebuild full-size VMR arrays?
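One way to picture the suggested restructuring: fetch each gas's full-domain array once, then copy only the chunk's columns inside the loop. The sketch below uses plain `std::vector` and a hypothetical `ToyGasConcs` type that counts full-domain fetches. It is not the RRTMGP `gas_concs_t` API, and it assumes every gas holds at least `ncol` entries.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a full-domain gas concentration store.
struct ToyGasConcs {
    std::vector<std::vector<double>> vmr;  // vmr[igas][icol]
    mutable std::size_t full_fetches = 0;  // counts full-domain copies
    std::vector<double> get_vmr(std::size_t igas) const {
        ++full_fetches;
        return vmr[igas];
    }
};

// Fetch each gas once (hoisted out of the chunk loop), then copy only the
// chunk's columns per iteration. Returns the sum over all chunk copies so
// the result can be checked against the full-domain sum.
double process_chunks(const ToyGasConcs& gases, std::size_t ncol, std::size_t ncol_chunk) {
    if (ncol_chunk == 0) throw std::invalid_argument("ncol_chunk must be > 0");
    std::vector<std::vector<double>> vmr_full;
    for (std::size_t igas = 0; igas < gases.vmr.size(); ++igas) {
        vmr_full.push_back(gases.get_vmr(igas));
    }
    double total = 0.0;
    for (std::size_t beg = 0; beg < ncol; beg += ncol_chunk) {
        const std::size_t end = std::min(beg + ncol_chunk, ncol);
        for (const auto& full : vmr_full) {
            const auto first = full.begin() + static_cast<std::ptrdiff_t>(beg);
            const auto last = full.begin() + static_cast<std::ptrdiff_t>(end);
            const std::vector<double> chunk(first, last);  // chunk-sized copy only
            total += std::accumulate(chunk.begin(), chunk.end(), 0.0);
        }
    }
    return total;
}
```

The number of full-domain fetches then equals the number of gases, independent of how many chunks the loop runs.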

wang1202 and others added 5 commits April 17, 2026 13:23

Merge branch 'development' into RRTMGP_memory_reduction

96b60e8

Merge branch 'development' into RRTMGP_memory_reduction

6e92a97

AMLattanzi mentioned this pull request Apr 22, 2026

Rad Mem Temp #3132

Merged

AMLattanzi and others added 8 commits April 22, 2026 14:20

Merge branch 'development' into RRTMGP_memory_reduction

73d132b

Merge branch 'development' into RRTMGP_memory_reduction

25905ae

Merge branch 'development' into RRTMGP_memory_reduction

c866cae

Merge branch 'development' into RRTMGP_memory_reduction

a91c39f

Merge branch 'development' into RRTMGP_memory_reduction

8b991a8

Merge branch 'development' into RRTMGP_memory_reduction

996253e

Merge branch 'development' into RRTMGP_memory_reduction

cfd15de

Merge branch 'erf-model:development' into RRTMGP_memory_reduction

708becd

pressel reviewed May 13, 2026


Merge branch 'erf-model:development' into RRTMGP_memory_reduction

5d32d97

wang1202 wants to merge 14 commits into erf-model:development from wang1202:RRTMGP_memory_reduction



