fix for largepages with aggressive decommit logic #126929

Merged: mangod9 merged 3 commits into dotnet:main from mangod9:fix/gc-largepages on Apr 16, 2026
Conversation

@mangod9 (Member) commented Apr 15, 2026

Clear decommitted memory in the large pages scenario. Fixes #126903.

@dotnet-policy-service (Contributor)

Tagging subscribers to this area: @JulieLeeMSFT, @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Copilot AI (Contributor) left a comment

Pull request overview

Fixes a GC heap-corruption scenario when GCLargePages is enabled and an induced Aggressive GC triggers “decommit” bookkeeping that doesn’t actually decommit at the OS level for large pages. The change ensures the memory that is treated as decommitted is explicitly cleared so stale references can’t be observed later.

Changes:

  • In the induced-aggressive path of gc_heap::distribute_free_regions, clear the region tail that would normally be decommitted/zeroed by the OS.
  • Gate the clearing to use_large_pages_p, since only large pages make virtual_decommit a no-op while still updating GC bookkeeping.

@janvorli (Member) left a comment

LGTM, thank you!

@janvorli (Member)

@mangod9 I believe this change should get in as is. But I wonder if it would be better to integrate the clearing of used part of the large page into the virtual_decommit (adding an "end of used data" argument) in the future so that we prevent similar issues to occur due to some changes in the GC. I also wonder if all the other usages of virtual_decommit are fine for large pages w.r.t. the fact the memory is not cleared.

@mangod9 (Member, Author) commented Apr 15, 2026

> @mangod9 I believe this change should get in as is. But I wonder if it would be better to integrate the clearing of used part of the large page into the virtual_decommit (adding an "end of used data" argument) in the future so that we prevent similar issues to occur due to some changes in the GC. I also wonder if all the other usages of virtual_decommit are fine for large pages w.r.t. the fact the memory is not cleared.

Yeah, moved it centrally to virtual_decommit now. I have looked through the other large pages code flows and this looks to be the only case.

@mangod9 (Member, Author) commented Apr 16, 2026

/ba-g downloading artifacts is constantly stuck on macOS

@mangod9 mangod9 merged commit 830b6fe into dotnet:main Apr 16, 2026
109 of 113 checks passed
Comment thread on src/coreclr/gc/memory.cpp:

    // observes leftover object references after the region is reused.
    if (use_large_pages_p && (end_of_data != nullptr) && (end_of_data > address))
    {
        memclr ((uint8_t*)address, (uint8_t*)end_of_data - (uint8_t*)address);
    }
Member:
In other paths, the GC just keeps track of the fact that memory is dirty and clears it right before it is used for allocations again in gc_heap::adjust_limit_clr. Would that be a better option here?

Member Author:

The fix follows the same pattern as this code in decommit_region:

    if (require_clearing_memory_p)
    {
        uint8_t* clear_end = use_large_pages_p ? heap_segment_used (region) : heap_segment_committed (region);
        size_t clear_size = clear_end - page_start;
        memclr (page_start, clear_size);
        heap_segment_used (region) = heap_segment_mem (region);
        dprintf (REGIONS_LOG, ("cleared region %p(%p-%p) (%zu bytes)",
            region,
            page_start,
            clear_end,
            clear_size));
    }
    else
    {
        heap_segment_committed (region) = heap_segment_mem (region);
    }

where memclr clears the full used portion of the region for large pages. Similar cleanup was missing during aggressive decommitting of tail regions.

cshung added a commit to cshung/runtime that referenced this pull request Apr 23, 2026
With large pages, VirtualDecommit is a no-op since large pages cannot be
partially decommitted. PR dotnet#126929 fixed the resulting stale data corruption
by adding memclr in virtual_decommit, but this approach has downsides:
the memory is never returned to the OS, yet we pay for the clearing and
produce misleading committed/used bookkeeping.

Instead, skip the decommit entirely for large pages:

1. distribute_free_regions: skip the aggressive tail-region decommit
   (the committed-but-unallocated tail of in-use regions). This was the
   path that caused the heap corruption in dotnet#126903.

2. decommit_heap_segment: skip the whole-segment decommit used for
   segment hoarding and BGC segment deletion. Same class of issue:
   committed/used are lowered but physical memory retains stale data.

3. decommit_region: bypass virtual_decommit and call
   reduce_committed_bytes directly, since decommit_region already
   handles large pages correctly by clearing memory itself.

4. virtual_decommit: add an assert that it is never called for heap
   memory when large pages are on. This catches any future caller that
   forgets to handle the large pages case. The end_of_data parameter
   and no-op ternary added by dotnet#126929 are removed.

Add GCLargePages=2 mode that simulates large pages using small pages:
sets use_large_pages_p=true but reserves with normal pages and commits
everything upfront. This exercises all large page GC code paths without
requiring OS large page setup or privileges, enabling CI testing.

Fix dotnet#126903
cshung added a commit to cshung/runtime that referenced this pull request Apr 24, 2026 (same commit message as the Apr 23 commit above)
janvorli pushed a commit that referenced this pull request Apr 28, 2026
…7290)

(Commit message identical to the Apr 23 commit above.)
Successfully merging this pull request may close these issues:
GC heap corruption with GCLargePages

4 participants