
RRTMGP memory reduction #3128

Open
wang1202 wants to merge 14 commits into erf-model:development from wang1202:RRTMGP_memory_reduction

@wang1202
Contributor

This PR addresses two issues in RRTMGP: (1) out-of-memory (OOM) failures and (2) a race condition (a bug unrelated to OOM). The work was done with the assistance of Claude Code. Details are in Source/PhysicsInterfaces/Radiation/RRTMGP_Memory_Reduction.md. Testing results are below:

  1. Peak GPU memory (512 x 512 x 42 grid)

Old:
  - 8 GPU: OOM
  - 16 GPU: OOM
  - 32 GPU: OOM
  - 64 GPU: works

 New:
  - 1 GPU: OOM
  - 2 GPU: works

  2. Datalog determinism (fence before populateDatalogMF)

 - old 2 GPU vs old 2 GPU (same input, two runs): differ
 - new 1 GPU vs old 1 GPU: match
 - new 2 GPU vs new 1 GPU: match
 - new 2 GPU vs new 2 GPU: match

wang1202 and others added 5 commits April 17, 2026 13:23
…right-sizing

RRTMGP radiation consumed excessive GPU memory because all columns were processed
at once with O(ncol*nlay*ngpt) internal temporaries. This commit introduces five
changes: (1) right-size the Kokkos memory pool (nvar 300->20, use ngpt not nbnd),
(2) initialize RRTMGP once instead of every radiation step, (3) process columns in
configurable chunks via erf.rad_ncol_chunk (default 5000), (4) remove dead aerosol
and cloud optical depth arrays, (5) conditionally allocate diagnostic flux arrays.
Expected peak GPU memory reduction is ~10x for typical workloads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The clear-sky heating-rate kernels write to sw_clrsky_heating and
lw_clrsky_heating via async Kokkos parallel_for; populateDatalogMF()
then launches a kernel that reads those arrays. Without a fence, the
LW kernel's writes can still be in flight when the datalog kernel
launches, so the radqrclw column in the radiation datalog varies
run-to-run on multi-GPU MPI configurations (reproduced at 2 ranks
on both this branch and development). All other datalog fields are
already serialized by prior kernel-launch ordering, which is why
only radqrclw was affected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Aerosol optical property arrays are restored as dormant scaffolding so a
future aerosol coupling can populate them without re-adding members. The
Radiation constructor aborts if erf.rad_do_aerosol = true since the
pipeline is not implemented. Updates RRTMGP_Memory_Reduction.md to cover
the aerosol abort, the dormant arrays, and the prior fence fix that made
radqrclw deterministic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@AMLattanzi mentioned this pull request Apr 22, 2026

// Number of columns per RRTMGP chunk (controls peak GPU memory)
pp.query("rad_ncol_chunk", m_ncol_chunk);

Collaborator

@pressel May 13, 2026


Could we validate m_ncol_chunk > 0 here? It is later used as the chunk-loop increment, so erf.rad_ncol_chunk = 0 would produce an infinite loop, and negative values would also be invalid.

Since this adds a new user-facing input parameter, could we also add erf.rad_ncol_chunk to the normal inputs documentation with its default and recommended tuning guidance?
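A minimal sketch of the check being requested, written as plain C++ so it stands alone. The helper names `validated_ncol_chunk` and `num_chunks` are hypothetical and do not appear in the PR. In ERF itself the failure path would more likely go through AMReX's own abort or assert facilities (for example `amrex::Abort`) rather than an exception.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of ERF): reject non-positive chunk sizes
// right after the input is parsed, before the value reaches the chunk loop.
inline int validated_ncol_chunk(int ncol_chunk)
{
    if (ncol_chunk <= 0) {
        throw std::invalid_argument(
            "erf.rad_ncol_chunk must be > 0, got " + std::to_string(ncol_chunk));
    }
    return ncol_chunk;
}

// Number of passes made by a loop of the assumed form
//   for (int beg = 0; beg < ncol; beg += ncol_chunk) { ... }
// With ncol_chunk == 0 that loop would never advance, hence the check.
inline int num_chunks(int ncol, int ncol_chunk)
{
    assert(ncol >= 0);
    const int step = validated_ncol_chunk(ncol_chunk);
    return (ncol + step - 1) / step;  // ceiling division
}
```

Validating once at parse time keeps the chunk loop itself free of defensive branches.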

// These are static lookup tables that never change.
// Size the memory pool for ncol_chunk (not full ncol) to limit peak memory.
if (!rrtmgp::initialized) {
int ncol_for_pool = std::min(m_ncol_chunk, m_ncol);
Collaborator


Should the pool be sized for m_ncol_chunk rather than min(m_ncol_chunk, m_ncol)? If a later radiation step has more local columns after regridding/load balancing, the chunk loop could use a larger ncol_c than the pool was initialized for.
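To make the concern concrete, here is a small standalone sketch. It assumes the chunk loop computes its per-chunk width as `min(ncol_chunk, ncol - beg)`, which is not shown in full in this thread, and all function names are hypothetical. It compares the pool width chosen once at initialization with the widest chunk a later step could request.

```cpp
#include <algorithm>
#include <cassert>

// Widest chunk requested by a loop of the assumed form
//   for (beg = 0; beg < ncol; beg += ncol_chunk) { ncol_c = min(ncol_chunk, ncol - beg); }
inline int max_chunk_width(int ncol, int ncol_chunk)
{
    assert(ncol_chunk > 0);
    int widest = 0;
    for (int beg = 0; beg < ncol; beg += ncol_chunk) {
        widest = std::max(widest, std::min(ncol_chunk, ncol - beg));
    }
    return widest;
}

// Pool width fixed at first initialization, as in the snippet under review.
inline int pool_width_min(int ncol_at_init, int ncol_chunk)
{
    return std::min(ncol_chunk, ncol_at_init);
}

// True when a later step asks for more columns per chunk than the pool was sized for.
inline bool pool_too_small(int pool_width, int ncol_later, int ncol_chunk)
{
    return max_chunk_width(ncol_later, ncol_chunk) > pool_width;
}
```

For example, with a chunk size of 5000, a rank that starts with 3000 local columns and later holds 8000 would have a pool sized for 3000 but a chunk of 5000. Sizing the pool for `ncol_chunk` directly avoids that case, at the cost of a larger pool on ranks that never reach the chunk size.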


**File:** `ERF_RRTMGP_Interface.cpp`

The pool multiplier `nvar` was reduced from 300 to 20, and `nbnd` was replaced with `ngpt`
Collaborator


This says nvar was reduced to 20, but the implementation appears to use const size_t nvar = 12 in ERF_RRTMGP_Interface.cpp. Could we either update the doc or confirm that 12 is sufficient?

Contributor Author


Thanks for the catch. I originally set 20, and Aaron Lattanzi set 12 in his test:
11c9c65#diff-a65ccaf90702def79896f166e2417f8de34d1a950528b8ec7c13d1fe0c3ed3beR253

I'll update the md file.
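For a rough sense of what the multiplier means, the sketch below assumes the pool is sized as `nvar` temporaries of shape ncol x nlay x ngpt in 8-byte reals. That sizing rule is an assumption inferred from the doc excerpt and commit message, not confirmed against `ERF_RRTMGP_Interface.cpp`, and `pool_bytes` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Assumed sizing rule (not confirmed by the PR text):
//   bytes = nvar * ncol * nlay * ngpt * bytes_per_real
inline std::size_t pool_bytes(std::size_t nvar,
                              std::size_t ncol,
                              std::size_t nlay,
                              std::size_t ngpt,
                              std::size_t bytes_per_real = 8)
{
    assert(bytes_per_real > 0);
    return nvar * ncol * nlay * ngpt * bytes_per_real;
}
```

Under this rule the pool scales linearly in `nvar`, so going from 300 to 20 shrinks it by a factor of 15 and going from 300 to 12 by a factor of 25, whatever the illustrative values of ncol, nlay, and ngpt. Whether 12 leaves enough headroom depends on how many temporaries are live at once, which this arithmetic cannot answer.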

populated with real data — see section 4). The flag is preserved so the abort can be
removed once a real aerosol coupling lands.

### 7. Fence Before Datalog Read to Eliminate Cross-Stream Race
Collaborator


I may be missing it, but I do not see the corresponding Kokkos::fence() in the code changes. Could we verify that the fence is actually inserted before populateDatalogMF()? If it is present, maybe mention the exact function/location here.
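The ordering problem described in the fence commit can be illustrated without Kokkos. The sketch below is a toy model only: `ToyStream` is a hypothetical class that queues work and runs it when `fence()` is called, standing in for asynchronous kernel launches and `Kokkos::fence()`. In this toy the unfenced read deterministically sees stale zeros, whereas on a real GPU the result would vary from run to run.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Toy model of an asynchronous execution queue (NOT Kokkos): launching only
// records the work; nothing is guaranteed to have run until fence() returns.
class ToyStream {
public:
    void launch(std::function<void()> kernel) { pending_.push_back(std::move(kernel)); }
    void fence()
    {
        for (auto& kernel : pending_) { kernel(); }
        pending_.clear();
    }
private:
    std::vector<std::function<void()>> pending_;
};

// Write a heating-rate array asynchronously, then read it for the datalog,
// optionally fencing between the write and the read.
inline double datalog_sum(std::size_t n, double value, bool fence_before_read)
{
    assert(n > 0);
    std::vector<double> heating(n, 0.0);
    ToyStream stream;
    stream.launch([&heating, value] {
        for (double& h : heating) { h = value; }
    });
    if (fence_before_read) { stream.fence(); }  // writes are complete past this point
    const double sum = std::accumulate(heating.begin(), heating.end(), 0.0);
    stream.fence();  // drain remaining work before `heating` goes out of scope
    return sum;
}
```

In the real code the equivalent of `stream.fence()` would sit immediately before the call to `populateDatalogMF()`, which is the location the comment above asks to have confirmed.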

Contributor Author


// --- Create chunk gas concentrations by subsetting from full gas_concs ---
gas_concs_t gas_concs_c;
gas_concs_c.init(gas_names_offset, ncol_c, nlay);
for (int igas = 0; igas < m_ngas; ++igas) {
Collaborator


This allocates and fills a full-domain vmr_full for every gas and every chunk. Could we move the full-gas get_vmr() calls outside the chunk loop, or expose a subview path, so chunking does not repeatedly rebuild full-size VMR arrays?
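A standalone sketch of the two loop shapes, counting full-size fetches in each. `fetch_full_vmr` is a hypothetical stand-in for the full-domain `get_vmr()` call, and plain `std::vector` replaces the real device views. The point is only the call count, not the RRTMGP API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct FetchCount { std::size_t full_fetches = 0; };

// Hypothetical stand-in for a full-domain get_vmr(): returns ncol values
// for one gas and counts the call.
inline std::vector<double> fetch_full_vmr(int igas, int ncol, FetchCount& counter)
{
    ++counter.full_fetches;
    return std::vector<double>(static_cast<std::size_t>(ncol), 1.0e-6 * (igas + 1));
}

// Pattern flagged in review: a full-size fetch for every gas in every chunk.
inline std::size_t fetches_inside_loop(int ngas, int ncol, int ncol_chunk)
{
    assert(ncol_chunk > 0);
    FetchCount counter;
    for (int beg = 0; beg < ncol; beg += ncol_chunk) {
        for (int igas = 0; igas < ngas; ++igas) {
            std::vector<double> vmr_full = fetch_full_vmr(igas, ncol, counter);
            (void)vmr_full;  // the chunk's slice would be copied out of vmr_full here
        }
    }
    return counter.full_fetches;
}

// Suggested alternative: fetch each gas once, then slice per chunk.
inline std::size_t fetches_hoisted(int ngas, int ncol, int ncol_chunk)
{
    assert(ncol_chunk > 0);
    FetchCount counter;
    std::vector<std::vector<double>> vmr_all;
    for (int igas = 0; igas < ngas; ++igas) {
        vmr_all.push_back(fetch_full_vmr(igas, ncol, counter));
    }
    for (int beg = 0; beg < ncol; beg += ncol_chunk) {
        for (int igas = 0; igas < ngas; ++igas) {
            const double* chunk = vmr_all[static_cast<std::size_t>(igas)].data() + beg;
            (void)chunk;  // pointer to the first column of this chunk
        }
    }
    return counter.full_fetches;
}
```

With 8 gases, 12000 columns, and a chunk size of 5000, the first form performs 24 full-size fetches and the hoisted form 8. Hoisting does keep all per-gas arrays resident at once, so in a memory-reduction PR a subview path that avoids the copy may be the better fit.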
