fix for largepages with agressive decommit logic#126929
Conversation
|
Tagging subscribers to this area: @JulieLeeMSFT, @dotnet/gc |
There was a problem hiding this comment.
Pull request overview
Fixes a GC heap-corruption scenario when GCLargePages is enabled and an induced Aggressive GC triggers “decommit” bookkeeping that doesn’t actually decommit at the OS level for large pages. The change ensures the memory that is treated as decommitted is explicitly cleared so stale references can’t be observed later.
Changes:
- In the induced-aggressive path of
gc_heap::distribute_free_regions, clear the region tail that would normally be decommitted/zeroed by the OS. - Gate the clearing to
use_large_pages_p, since only large pages makevirtual_decommita no-op while still updating GC bookkeeping.
|
@mangod9 I believe this change should get in as is. But I wonder if it would be better to integrate the clearing of used part of the large page into the |
yeah moved it centrally to virtual_decommit now. I have looked through other large_pages code flow and this looks to be the only case. |
|
/ba-g downloading artifacts is constantly stuck on macOS |
| // observes leftover object references after the region is reused. | ||
| if (use_large_pages_p && (end_of_data != nullptr) && (end_of_data > address)) | ||
| { | ||
| memclr ((uint8_t*)address, (uint8_t*)end_of_data - (uint8_t*)address); |
There was a problem hiding this comment.
In other paths, the GC just takes keeps track of the fact that memory is dirty and clears it right before it is used for allocations again in gc_heap::adjust_limit_clr. Would it be a better option here?
There was a problem hiding this comment.
the fix was following the same pattern like this in decommit_region:
runtime/src/coreclr/gc/memory.cpp
Lines 358 to 373 in 75d3e60
where memclr clears the full region for large_pages. Similar cleanup was missing during aggressive decommitting of tail regions.
With large pages, VirtualDecommit is a no-op since large pages cannot be partially decommitted. PR dotnet#126929 fixed the resulting stale data corruption by adding memclr in virtual_decommit, but this approach has downsides: the memory is never returned to the OS, yet we pay for the clearing and produce misleading committed/used bookkeeping. Instead, skip the decommit entirely for large pages: 1. distribute_free_regions: skip the aggressive tail-region decommit (the committed-but-unallocated tail of in-use regions). This was the path that caused the heap corruption in dotnet#126903. 2. decommit_heap_segment: skip the whole-segment decommit used for segment hoarding and BGC segment deletion. Same class of issue: committed/used are lowered but physical memory retains stale data. 3. decommit_region: bypass virtual_decommit and call reduce_committed_bytes directly, since decommit_region already handles large pages correctly by clearing memory itself. 4. virtual_decommit: add an assert that it is never called for heap memory when large pages are on. This catches any future caller that forgets to handle the large pages case. The end_of_data parameter and no-op ternary added by dotnet#126929 are removed. Add GCLargePages=2 mode that simulates large pages using small pages: sets use_large_pages_p=true but reserves with normal pages and commits everything upfront. This exercises all large page GC code paths without requiring OS large page setup or privileges, enabling CI testing. Fix dotnet#126903
With large pages, VirtualDecommit is a no-op since large pages cannot be partially decommitted. PR dotnet#126929 fixed the resulting stale data corruption by adding memclr in virtual_decommit, but this approach has downsides: the memory is never returned to the OS, yet we pay for the clearing and produce misleading committed/used bookkeeping. Instead, skip the decommit entirely for large pages: 1. distribute_free_regions: skip the aggressive tail-region decommit (the committed-but-unallocated tail of in-use regions). This was the path that caused the heap corruption in dotnet#126903. 2. decommit_heap_segment: skip the whole-segment decommit used for segment hoarding and BGC segment deletion. Same class of issue: committed/used are lowered but physical memory retains stale data. 3. decommit_region: bypass virtual_decommit and call reduce_committed_bytes directly, since decommit_region already handles large pages correctly by clearing memory itself. 4. virtual_decommit: add an assert that it is never called for heap memory when large pages are on. This catches any future caller that forgets to handle the large pages case. The end_of_data parameter and no-op ternary added by dotnet#126929 are removed. Add GCLargePages=2 mode that simulates large pages using small pages: sets use_large_pages_p=true but reserves with normal pages and commits everything upfront. This exercises all large page GC code paths without requiring OS large page setup or privileges, enabling CI testing. Fix dotnet#126903
…7290) With large pages, VirtualDecommit is a no-op since large pages cannot be partially decommitted. PR #126929 fixed the resulting stale data corruption by adding memclr in virtual_decommit, but this approach has downsides: the memory is never returned to the OS, yet we pay for the clearing and produce misleading committed/used bookkeeping. Instead, skip the decommit entirely for large pages: 1. distribute_free_regions: skip the aggressive tail-region decommit (the committed-but-unallocated tail of in-use regions). This was the path that caused the heap corruption in #126903. 2. decommit_heap_segment: skip the whole-segment decommit used for segment hoarding and BGC segment deletion. Same class of issue: committed/used are lowered but physical memory retains stale data. 3. decommit_region: bypass virtual_decommit and call reduce_committed_bytes directly, since decommit_region already handles large pages correctly by clearing memory itself. 4. virtual_decommit: add an assert that it is never called for heap memory when large pages are on. This catches any future caller that forgets to handle the large pages case. The end_of_data parameter and no-op ternary added by #126929 are removed. Add GCLargePages=2 mode that simulates large pages using small pages: sets use_large_pages_p=true but reserves with normal pages and commits everything upfront. This exercises all large page GC code paths without requiring OS large page setup or privileges, enabling CI testing. Fix #126903
clear decommitted memory in the largepages scenario. Fixes #126903