[EventPipe] Add EventPipeBuffer corruption diagnostics #118874

mdh1418 · 2025-08-19T01:51:16Z

On rare occasions, EventPipeBuffers may be corrupted during their lifetimes (the reason is currently unknown, but the main suspects are corruption by an external process or some uncaught use-after-free scenario). Once an EventPipeEventInstance is written into an EventPipeBuffer, there are no validation mechanisms guaranteeing that the data is uncorrupted by the time the event instance is read from the buffer. As such, there have been instances where the in-process event listener has been observed to hit an Access Violation due to EventPipeInternal_GetNextEvent returning a non-NULL pointer to corrupted bytes.

To better diagnose whether EventPipeBuffers in these scenarios are being corrupted internally or externally, this PR adds two opt-in EventPipeBuffer guarding mechanisms:

Header and Footer Guard signatures - To detect corruption and overrun
Memory Virtual Protection - To limit the instances where some internals can corrupt the EventPipeBuffer, helping distinguish between internal and external corruption.

Behind an EventPipeBufferGuardLevel config switch (DOTNET_EventPipeBufferGuardLevel envvar), this PR adds two levels of increasing EventPipeBuffer protection.

DOTNET_EventPipeBufferGuardLevel	Behavior
0	No Protection, default
1	EventPipeBuffer Header and Footer guard signatures are active, and RaiseFailFastException triggers upon signature corruption detection. Memory Protection set to ReadOnly once the `EventPipeBuffer` is converted to read-only
2	In addition to level 1 behavior, EventPipeBuffer memory is set to ReadWrite during event writes, and ReadOnly at all other times after allocation
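The table above can be read as a small decision function. The following is a minimal sketch of how a guard level might be parsed and interpreted; the enum, the helper names, and the decimal parsing are assumptions for illustration only, since the runtime reads the switch through its own config plumbing in clrconfigvalues.h.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names: the PR's real enum and config plumbing may differ. */
typedef enum {
    GUARD_LEVEL_NONE = 0,            /* default: no protection */
    GUARD_LEVEL_SIGNATURES = 1,      /* header/footer guards; read-only once converted */
    GUARD_LEVEL_WRITE_PROTECTED = 2  /* level 1 + read-write only during event writes */
} guard_level_t;

/* Map the raw text of DOTNET_EventPipeBufferGuardLevel to a level.
 * Missing, malformed, or out-of-range values fall back to the default (0).
 * Assumption: plain decimal parsing, chosen here for simplicity. */
static guard_level_t
guard_level_parse (const char *text)
{
    if (text == NULL || *text == '\0')
        return GUARD_LEVEL_NONE;

    char *end = NULL;
    long value = strtol (text, &end, 10);
    assert (end != NULL);
    if (*end != '\0' || value < GUARD_LEVEL_NONE || value > GUARD_LEVEL_WRITE_PROTECTED)
        return GUARD_LEVEL_NONE;

    return (guard_level_t)value;
}

/* Read the level from the process environment. */
static guard_level_t
guard_level_from_environment (void)
{
    return guard_level_parse (getenv ("DOTNET_EventPipeBufferGuardLevel"));
}

/* Levels 1 and 2 write and validate header/footer signatures. */
static int
guard_level_uses_signatures (guard_level_t level)
{
    return level >= GUARD_LEVEL_SIGNATURES;
}

/* Only level 2 keeps the buffer read-only between event writes. */
static int
guard_level_protects_between_writes (guard_level_t level)
{
    return level >= GUARD_LEVEL_WRITE_PROTECTED;
}
```

Because each level is a superset of the previous one, the helpers compare with `>=` rather than testing for equality.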

Header and Footer Guard Details

In the buffer header, we inject a magic + data relevant to the EventPipeBuffer's creation (timestamp, writing thread, and event sequence number), so in the event that the header/footer is partially corrupted, the remaining bytes can provide context for that buffer.

In the buffer footer, we inject a magic, its inverse for a quick integrity check, a checksum computed from the header's identifiable bytes with a salt, and finally padding bytes that help detect buffer overrun and double as a quick visual marker.

Note: To maintain the 8-byte alignment of EventPipeEventInstance, 32 bytes for each of the header and footer was determined to be a small enough overhead that still provides a good starting point for diagnosing buffer corruption.
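To make the description above concrete, here is a minimal sketch of what a 32-byte header and footer guard pair could look like. The field order, magic values, salt, padding byte, and checksum mix are all assumptions for illustration and are not taken from the PR's ep-buffer.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* All constants and layouts below are illustrative assumptions. */
#define GUARD_HEADER_MAGIC  UINT64_C(0x4550425546484452) /* ASCII "EPBUFHDR" */
#define GUARD_FOOTER_MAGIC  UINT64_C(0x4550425546465452) /* ASCII "EPBUFFTR" */
#define GUARD_CHECKSUM_SALT UINT64_C(0x9E3779B97F4A7C15)
#define GUARD_PADDING_BYTE  0xEE

typedef struct {
    uint64_t magic;
    uint64_t creation_timestamp;
    uint64_t writer_thread_id;
    uint32_t first_sequence_number;
    uint32_t reserved;
} guard_header_t;

typedef struct {
    uint64_t magic;
    uint64_t magic_inverse; /* ~magic, for a quick integrity check */
    uint64_t checksum;      /* salted mix of the header's identifying fields */
    uint8_t  padding[8];    /* overrun detector and visual marker */
} guard_footer_t;

_Static_assert (sizeof (guard_header_t) == 32, "header guard must be 32 bytes");
_Static_assert (sizeof (guard_footer_t) == 32, "footer guard must be 32 bytes");

/* Simple rotate-and-xor mix; a stand-in for whatever checksum the PR uses. */
static uint64_t
guard_checksum (const guard_header_t *header)
{
    uint64_t sum = GUARD_CHECKSUM_SALT;
    sum ^= header->creation_timestamp;
    sum = (sum << 13) | (sum >> 51);
    sum ^= header->writer_thread_id;
    sum = (sum << 13) | (sum >> 51);
    sum ^= header->first_sequence_number;
    return sum;
}

/* Fill in both guards at buffer creation time. */
static void
guard_init (guard_header_t *header, guard_footer_t *footer,
            uint64_t timestamp, uint64_t thread_id, uint32_t sequence_number)
{
    header->magic = GUARD_HEADER_MAGIC;
    header->creation_timestamp = timestamp;
    header->writer_thread_id = thread_id;
    header->first_sequence_number = sequence_number;
    header->reserved = 0;

    footer->magic = GUARD_FOOTER_MAGIC;
    footer->magic_inverse = ~GUARD_FOOTER_MAGIC;
    footer->checksum = guard_checksum (header);
    memset (footer->padding, GUARD_PADDING_BYTE, sizeof (footer->padding));
}

/* Return true only if every signature, the checksum, and the padding match. */
static bool
guard_is_intact (const guard_header_t *header, const guard_footer_t *footer)
{
    if (header->magic != GUARD_HEADER_MAGIC || footer->magic != GUARD_FOOTER_MAGIC)
        return false;
    if (footer->magic_inverse != ~footer->magic)
        return false;
    if (footer->checksum != guard_checksum (header))
        return false;
    for (size_t i = 0; i < sizeof (footer->padding); ++i)
        if (footer->padding[i] != GUARD_PADDING_BYTE)
            return false;
    return true;
}
```

In this sketch a reader would call `guard_is_intact` before handing out an event pointer and fail fast when it returns false. If only part of a guard is overwritten, the surviving header fields still identify which buffer was hit.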

Testing

Performed manual testing with a debugger, corrupting an EventPipeBuffer's guards before buffer_manager_move_next_event_any_thread during a call to ep_buffer_manager_get_next_event for an in-proc EventListener.

Previously, EventPipeBuffers were not truly read-only when converted, as metadata IDs would be generated on the fly and written into the EventPipeEventInstance during event block writing. This PR instead passes in the computed metadata ID separately, allowing buffers to be protected with ClrVirtualProtect.
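The metadata ID change can be illustrated with a small sketch. The struct, the function name, and the serialized layout below are invented for illustration and do not reflect the real nettrace block format; the point is only that the stored event is taken by const pointer and the ID travels as a separate argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical stand-in for an event already stored in a buffer. */
typedef struct {
    uint64_t timestamp;
    uint32_t payload_size;
    const uint8_t *payload;
} stored_event_t;

/* Serialize one event as [metadata_id][timestamp][payload_size][payload].
 * The metadata ID arrives as an argument instead of being patched into the
 * stored event, so the memory holding the event can stay read-only.
 * Returns the number of bytes written, or 0 if out lacks room. */
static size_t
block_write_event (uint8_t *out, size_t capacity,
                   const stored_event_t *event, uint32_t metadata_id)
{
    assert (out != NULL && event != NULL);

    size_t needed = sizeof (metadata_id) + sizeof (event->timestamp)
                  + sizeof (event->payload_size) + event->payload_size;
    if (needed > capacity)
        return 0;

    uint8_t *cursor = out;
    memcpy (cursor, &metadata_id, sizeof (metadata_id));
    cursor += sizeof (metadata_id);
    memcpy (cursor, &event->timestamp, sizeof (event->timestamp));
    cursor += sizeof (event->timestamp);
    memcpy (cursor, &event->payload_size, sizeof (event->payload_size));
    cursor += sizeof (event->payload_size);
    if (event->payload_size > 0)
        memcpy (cursor, event->payload, event->payload_size);

    return needed;
}
```

With the ID passed this way, nothing on the write-to-file path needs write access to the buffer that holds the event.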

Copilot

Pull Request Overview

This PR adds diagnostic mechanisms to EventPipeBuffer to detect and diagnose memory corruption issues. The implementation introduces configurable buffer guard levels with header/footer signatures and memory protection to help distinguish between internal and external corruption sources.

Key changes include:

Addition of configurable EventPipeBufferGuardLevel with three protection levels (0-2)
Implementation of header and footer guard structures with magic values and checksums
Integration of memory protection using virtual memory APIs when guards are enabled

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

File	Description
ep-types-forward.h	Adds forward declarations for guard structures and enums for protection levels
ep-rt.h	Adds runtime interface declarations for memory protection and fatal error handling
ep-file.c	Refactors to pass metadata_id parameter directly instead of storing in event instance
ep-event-instance.h/.c	Removes metadata_id field from EventPipeEventInstance structure
ep-buffer.h/.c	Core implementation of buffer guards with header/footer structures and validation
ep-buffer-manager.h/.c	Integrates guard level configuration and passes it to buffer allocation
ep-block.h/.c	Updates to accept metadata_id as parameter instead of extracting from event instance
ep-rt-*.h	Platform-specific implementations of memory protection and fatal error functions
clrconfigvalues.h	Adds configuration value for EventPipeBufferGuardLevel

src/native/eventpipe/ep-buffer.c

src/coreclr/vm/eventing/eventpipe/ep-rt-coreclr.h

src/native/eventpipe/ep-buffer.c

noahfalk

Looks good to me. A few nits inline, and then the main item is that we should make sure this works for NativeAOT too. I'm hoping we do that by shifting some of the runtime-specific callouts into new minipal APIs.

noahfalk · 2025-10-14T22:28:51Z

src/coreclr/nativeaot/Runtime/eventpipe/ep-rt-aot.h

+    size_t length,
+    EventPipePageProtection protection)
+{
+    return true;


@jkotas - is there some guiding principle on what we add to minipal and what should stay out?

Today EventPipe code relies on runtime-specific callouts for various OS functionality but I'm hoping we can shift that trajectory to depend more on minipal directly. This spot seems like an appealing example where we'd like to invoke an OS API but the existing pattern takes us through runtime-specific wrappers first. Adding VirtualProtect to minipal and using it feels like a good approach to me but I want to make sure I'm not abusing the intent of minipal.

Good candidates for minipal are unambiguous, trivial methods used from a number of different places. "get current time stamp" is a perfect example.

minipal is not meant to wrap everything with platform specific implementation.

I am not sure whether mprotect is a good candidate for minipal. Most places that call mprotect tend to come with their own unique requirements.

This spot seems like an appealing example where we'd like to invoke an OS API but the existing pattern takes us through runtime-specific wrappers first.

This was set up years ago to work around build system limitations. It does not make sense. It would make a lot more sense for EventPipe to have OS-specific PALs and avoid trying to fit into runtime-specific wrappers.

Thanks Jan! We'll create our own EventPipe PAL as it sounds like these aren't appropriate for minipal.

The EventPipe runtime shim was originally put in place to make the majority of EventPipe code runtime agnostic while still reusing each runtime's unique implementation of low-level artifacts such as IO, locks, threading, and atomics. The main goal was to keep the EventPipe<->runtime interaction stable for CoreCLR while the code was ported from C++ to C and integrated with Mono. Originally it even reused each runtime's container implementation, but I have since broken that part out into native/containers. If we don't think there is much value in continuing to use runtime-specific implementations of these low-level artifacts shimmed by the EventPipe runtime layer, then we should probably move towards one EventPipe OS PAL source file shared by Mono/CoreCLR/NAOT.

There will still be some runtime-specific things that stay in the runtime shim layer, but it will be smaller, and it will be simpler to run EventPipe standalone, making it simpler to get our low-level native runtime tests running outside of the runtime. I believe some of these things would potentially end up in minipal. We already have some artifacts in minipal that could be used by EventPipe, like mutex, high-res timers, and utf8-ucs2 conversions, and I had some volatile/atomics functions in another PR that would suit minipal and EventPipe as well.

I had an ambition for a long time to get our low-level native EventPipe and container tests currently under Mono, https://github.com/dotnet/runtime/tree/main/src/mono/mono/eventpipe/test, running as a runtime test (as a shared native library or separate binary executed from test) running on all runtimes, part of that was to make EventPipe less dependent on runtime specific artifacts, moving towards one EventPipe OS PAL shared by all runtimes. If we believe that is a good direction, then doing that work could start moving us towards an EventPipe OS PAL.

noahfalk · 2025-10-14T22:52:13Z

src/coreclr/nativeaot/Runtime/eventpipe/ep-rt-aot.h

+void
+ep_rt_fatal_error_with_message (const ep_char8_t *message)
+{
+    /* Not implemented, no-op */


We should make sure NativeAOT has an implementation of this too. Ideally similar to above we could have a shared minipal_raise_fatal_error() function that no longer needs runtime-specific callouts. Doing that depends on the fatal error handling staying simple, runtime agnostic, and directly aligning with underlying OS APIs.

fatal error handling staying simple, runtime agnostic, and directly aligning with underlying OS APIs.

Fatal error handling includes runtime-specific crash dump and watson logic currently...

As fatal error handling is runtime-specific, I stuck with just implementing this method in ep-rt-aot.h by using RhFailFast(). I think it worked in a nativeAOT app when I forced the consistency check to fail, and saw RhFailFast in the dump's stacktrace.

src/native/eventpipe/ep-buffer-manager.c

src/native/eventpipe/ep-buffer.c

src/mono/mono/eventpipe/ep-rt-mono.h

jkotas · 2025-10-15T04:56:41Z

On rare occasions, EventPipeBuffers may be corrupted during their lifetimes (the reason is currently unknown, but the main suspects are corruption by an external process or some uncaught use-after-free scenario).

I would bet on a race condition that leads to use-after-free.

We should have an issue that describes the observed problem and what we have found about the nature of the corruption so far. Once we trace it down, we may consider deleting this instrumentation or at least simplifying it - depending on what we find.

With user_events support added in #115265, this PR looks to test a few end-to-end user_events scenarios.

## Alternative testing approaches considered

### **Existing EventPipe runtime tests**

Existing EventPipe tests under `src/tests/tracing/eventpipe` are incompatible with testing the user_events scenario due to:

1. Starting EventPipeSessions through DiagnosticClient ❌ DiagnosticClient does not have the support to send the IPC command to start a user_events based EventPipe session, because it requires the user_events_data file descriptor to be sent using SCM_RIGHTS (see https://github.com/dotnet/diagnostics/blob/main/documentation/design-docs/ipc-protocol.md#passing_file_descriptor).
2. Using an EventPipeEventSource to validate events streamed through EventPipe ❌ User_events based EventPipe sessions do not stream events. Instead, events are written to configured TraceFS tracepoints, and currently only RecordTrace from https://github.com/microsoft/one-collect/ is capable of generating `.nettrace` traces from tracepoint user_events.

### **Native EventPipe Unit Tests**

There are Mono Native EventPipe tests under `src/mono/mono/eventpipe/test` that are not hooked up to CI. These unit tests are built through linking the shared EventPipe interface library against [Mono's EventPipe runtime shims](https://github.com/dotnet/runtime/tree/main/src/mono/mono/eventpipe) and using [Mono's test runner](https://github.com/dotnet/runtime/tree/main/src/mono/mono/eglib/test). To update these unit tests into the [standard runtime tests structure](https://github.com/dotnet/runtime/tree/main/src/tests), a **larger investment** is needed to either migrate EventPipe from using runtime shims to an OS PAL source shared by coreclr/nativeaot/mono (see #118874 (comment)) or build an EventPipe shared library specifically for the runtime test using a runtime-agnostic shim.

As existing mono unit tests don't currently test IPC commands, coupled with no existing runtime infrastructure to read events from tracepoints, there would be even more work on top of updating mono native eventpipe unit tests to even test the user_events scenario.

## End-to-End Testing Added

A low-cost approach to testing .NET Runtime's user_events functionality leverages RecordTrace from https://github.com/microsoft/one-collect/, which is already capable of starting user_events based EventPipe sessions and generating `.nettrace`s. (Note: [dotnet-trace wraps around RecordTrace](dotnet/diagnostics#5570))

Despite adding an external dependency which allows RecordTrace failures to fail the end-to-end test, user_events was initially added with the intent to depend on RecordTrace for the end-to-end scenario, and there are no other ways to functionally test a user_events based eventpipe session.

### Approach

Each scenario uses the same pattern:

1. **Scenario invokes the shared test runner** User events scenarios can differ in their tracee logic, the events expected in the .nettrace, the record-trace script used to collect those events, and how long it takes for the tracee to emit them and for record-trace to resolve symbols and write the .nettrace. To handle this variance, UserEventsTestRunner lets each scenario pass in its scenario-specific record-trace script path, the path to its test assembly (used to spawn the tracee process), a validator that checks for the expected events from the tracee, and optional timeouts for both the tracee and record-trace to exit gracefully.
2. **`UserEventsTestRunner` orchestrates tracing and validation** Using this configuration, UserEventsTestRunner first checks whether user events are supported. It then starts record-trace with the scenario’s script and launches the tracee process so it can emit events. After the run completes, the runner stops both the tracee and record-trace, opens the resulting .nettrace with EventPipeEventSource, and applies the scenario’s validator to confirm that the expected events were recorded. Finally, it returns an exit code indicating whether the scenario passed or failed.

### Dependencies:

- Environment with a kernel 6.4+, .NET 10, glibc 2.35+
- Microsoft.OneCollect.RecordTrace (transitively resolved through a dotnet diagnostics public feed)
- Microsoft.Diagnostics.Tracing.TraceEvent 3.1.24+ (to read [NetTrace V6](https://github.com/microsoft/perfview/blob/main/src/TraceEvent/EventPipe/NetTraceFormat.md))

## Helix Nuances

UserEvents functional runtime tests differ from other runtime tests because they depend on OneCollect's Record-Trace tool to enable a userevents-based eventpipe session and to collect events. By design, Record-Trace requires elevated privileges, so these tests invoke a record-trace executable with sudo. When tests run on Helix, test artifacts are stripped of their permissions, so the test infrastructure was modified to give record-trace execute permissions (helix-extra-executables.list). Moreover, to avoid having one copy of record-trace per scenario, which in turn requires re-adding execute permissions for each, more modifications were added to copy over a single record-trace executable that would be used by all scenarios (OutOfProcess marker).

Additionally, in Helix environments, TMPDIR is set to a helix specific temporary directory like /datadisks/disk1/work/<id>/t, and at this time, record-trace only scans /tmp/ for the runtime's diagnostic ports. So as a workaround, the tracee apps are spawned with TMPDIR set to /tmp.

Lastly, the job steps to run tests on AzDO prevent restoring individual runtime test projects. Because record-trace is currently only resolvable through the dotnet-diagnostics-tests source, userevents_common.csproj was added to the group of projects restored at the beginning of copying native test components to restore Microsoft.OneCollect.RecordTrace.

mdh1418 · 2025-12-18T06:07:54Z

Created the issue at #122630

mdh1418 added 6 commits August 18, 2025 20:43

[ClrConfig] Add EventPipeBufferGuardLevel Config Switch

fb9b360

[EventPipeBuffer] Add Buffer Guard macros

6df06bc

[EventPipeBuffer] Inject buffer guards

a16ece5

[EventPipeBuffer] Ensure buffer guard consistent or FailFast

6a74500

[EventPipeBuffer] Add buffer ReadOnly protection

af7729e

mdh1418 added this to the 11.0.0 milestone Aug 19, 2025

mdh1418 requested review from Copilot and noahfalk August 19, 2025 01:51

mdh1418 requested review from MichalStrehovsky, lateralusX, steveisok and vitek-karas as code owners August 19, 2025 01:51

github-actions bot added the area-Tracing-coreclr label Aug 19, 2025

dotnet-policy-service bot assigned mdh1418 Aug 19, 2025

Copilot AI reviewed Aug 19, 2025

src/native/eventpipe/ep-buffer.c (outdated, resolved)

src/native/eventpipe/ep-buffer.c (outdated, resolved)

src/coreclr/vm/eventing/eventpipe/ep-rt-coreclr.h (resolved)

src/native/eventpipe/ep-buffer.c (outdated, resolved)

mdh1418 added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Aug 19, 2025

build-analysis bot mentioned this pull request Aug 19, 2025

System.Data.OleDb.Tests timeout in net48 x86 Release leg #87783

Closed

mdh1418 removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Oct 7, 2025

noahfalk reviewed Oct 14, 2025

jkotas reviewed Oct 15, 2025

src/mono/mono/eventpipe/ep-rt-mono.h (outdated, resolved)

mdh1418 mentioned this pull request Nov 3, 2025

[UserEvents] Add end-to-end runtime test #121316

Closed

mdh1418 mentioned this pull request Dec 3, 2025

[Tests][UserEvents] Add userevents functional runtime tests #122134

Merged

mdh1418 mentioned this pull request Dec 11, 2025

[Tests][UserEvents] Add userevents functional runtime tests #122430

Open

mdh1418 added 2 commits December 16, 2025 04:46

Soft code expected padding bytes

d2e2712

Address Feedback

c4338a2

build-analysis bot mentioned this pull request Dec 18, 2025

Unable to pull image from mcr.microsoft.com #117164

Open

build-analysis bot mentioned this pull request Dec 18, 2025

[wasm/browser] Failed to connect to socket /run/dbus/system_bus_socket #120176

Open

4 participants
