
[cDAC] Fix EEClass validation corner case#124780

Merged
max-charlamb merged 1 commit into dotnet:main from max-charlamb:cdac-fix-eeclass-validation
Feb 27, 2026


@max-charlamb
Member

@max-charlamb max-charlamb commented Feb 24, 2026

Looked into the persistent CI failure and think I found the issue. It looks like SOS is calling GetMethodTableData on a random address that happens to pass validation because it has a pointer going back to the MethodTable. However, when we try to read the full EEClass it isn't available and we throw a different error.

This change should make sure the EEClass is validated and readable. Added unit test to verify.
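The failure mode described above can be modeled in a few lines. This is a minimal Python sketch, not the cDAC code: `FakeTarget`, `ReadError`, and the two-field EEClass layout are hypothetical names and assumptions made up for illustration. It shows how a pointer-loop check alone accepts an address whose EEClass is only partially readable, while eagerly reading every field rejects it during validation.

```python
class ReadError(Exception):
    """Stands in for a failed read of target memory."""


class FakeTarget:
    """Hypothetical target: maps readable addresses to pointer-sized values."""

    def __init__(self, memory):
        self.memory = dict(memory)

    def read_pointer(self, address):
        if address not in self.memory:
            raise ReadError(hex(address))
        return self.memory[address]


# Assumed layout for this sketch only: a MethodTable stores its EEClass
# pointer at offset 0; an EEClass stores its MethodTable pointer at offset 0
# and one further field at offset 8.
EECLASS_FIELD_OFFSETS = (0, 8)


def loop_only_validation(target, mt):
    """Check only the MethodTable -> EEClass -> MethodTable pointer loop."""
    try:
        eeclass = target.read_pointer(mt)
        return target.read_pointer(eeclass) == mt
    except ReadError:
        return False


def eager_validation(target, mt):
    """Also require every EEClass field to be readable before accepting."""
    try:
        eeclass = target.read_pointer(mt)
        for offset in EECLASS_FIELD_OFFSETS:
            target.read_pointer(eeclass + offset)
        return target.read_pointer(eeclass) == mt
    except ReadError:
        return False


# A random address whose "EEClass" happens to point back, but whose second
# field lies in unreadable memory.
partial = FakeTarget({0x1000: 0x2000, 0x2000: 0x1000})
# A fully readable MethodTable/EEClass pair.
complete = FakeTarget({0x1000: 0x2000, 0x2000: 0x1000, 0x2008: 0})
```

With `partial`, `loop_only_validation` accepts `0x1000` and a later full read would fail; `eager_validation` rejects it up front, which is the behavior the change aims for.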

CI Failure
        STDIN: 00:00.374: !runcommand !clrstack
        00:00.683: OS Thread Id: 0xb08 (0)
        00:00.692:         Child SP               IP Call Site
        00:00.692: 0000002EEDD7E9E0 00007ff99863d280 [InlinedCallFrame: 0000002eedd7e9e0] VarargPInvokeInteropMD.Interop.printf(System.String, ...)
        00:00.697: 0000002EEDD7E9E0 00007ff8e620021a [InlinedCallFrame: 0000002eedd7e9e0] VarargPInvokeInteropMD.Interop.printf(System.String, ...)
        00:00.697: 0000002EEDD7E9B0 00007FF8E620021A ILStubClass.IL_STUB_PInvoke(System.String, Int32, Double, ...)
        00:00.745: 0000002EEDD7EAD0 00007FF8E61218B0 VarargPInvokeInteropMD.Program.Main() [/_/src/tests/SOS.UnitTests/Debuggees/VarargPInvokeInteropMD/Program.cs @ 16]
        00:00.751: <END_COMMAND_OUTPUT>
        00:00.751: 0:000> 
        STDIN: 00:00.752: !runcommand !IP2MD 00007FF8E620021A
        00:00.754: MethodDesc:   00007ff8e61e7b38
        00:00.754: Method Name:          ILStubClass.IL_STUB_PInvoke(System.String, Int32, Double, ...)
        00:00.754: Class:                00007ff8e61e7ac8
        00:00.754: MethodTable:          00007ff8e61e7ac8
        00:00.754: mdToken:              0000000006000000
        00:00.754: Module:               00007ff8e61e1b00
        00:00.754: IsJitted:             yes
        00:00.754: Current CodeAddr:     00007ff8e6200040
        00:00.754: Version History:
        00:00.755:   ILCodeVersion:      0000000000000000
        00:00.755:   ReJIT ID:           0
        00:00.755:   IL Addr:            0000000000000000
        00:00.755:      CodeAddr:           00007ff8e6200040  (MinOptJitted)
        00:00.755:      NativeCodeVersion:  0000000000000000
        00:00.757: <END_COMMAND_OUTPUT>
        00:00.757: 0:000> 
        STDIN: 00:00.757: !runcommand !clru 00007ff8e61e7b38
        00:00.758: Normal JIT generated code
        00:00.758: ILStubClass.IL_STUB_PInvoke(System.String, Int32, Double, ...)
        00:00.758: Begin 00007FF8E6200040, size 279
        00:00.759: 00007ff8`e6200040 48894c2408      mov     qword ptr [rsp+8],rcx
        00:00.761: 00007ff8`e6200045 4889542410      mov     qword ptr [rsp+10h],rdx
        00:00.762: 00007ff8`e620004a 4c89442418      mov     qword ptr [rsp+18h],r8
        00:00.763: 00007ff8`e620004f 4c894c2420      mov     qword ptr [rsp+20h],r9
        00:00.764: 00007ff8`e6200054 55              push    rbp
        00:00.766: 00007ff8`e6200055 4157            push    r15
        00:00.767: 00007ff8`e6200057 4156            push    r14
        00:00.768: 00007ff8`e6200059 4155            push    r13
        00:00.769: 00007ff8`e620005b 4154            push    r12
        00:00.770: 00007ff8`e620005d 57              push    rdi
        00:00.771: 00007ff8`e620005e 56              push    rsi
        00:00.773: 00007ff8`e620005f 53              push    rbx
        00:00.774: 00007ff8`e6200060 4881ecd8000000  sub     rsp,0D8h
        00:00.775: 00007ff8`e6200067 488d6c2420      lea     rbp,[rsp+20h]
        STDERROR: 00:00.787: Process terminated. Assertion failed.
        STDERROR: 00:00.788: cDAC: 80131c49, DAC: 80070057
        STDERROR: 00:00.788:    at System.Diagnostics.DebugProvider.Fail(String, String)
        STDERROR: 00:00.788:    at System.Diagnostics.Debug.Fail(String, String)
        STDERROR: 00:00.788:    at System.Diagnostics.Debug.Assert(Boolean, String, String)
        STDERROR: 00:00.788:    at System.Diagnostics.Debug.Assert(Boolean, String)
        STDERROR: 00:00.788:    at System.Diagnostics.Debug.Assert(Boolean, Debug.AssertInterpolatedStringHandler&)
        STDERROR: 00:00.788:    at Microsoft.Diagnostics.DataContractReader.Legacy.SOSDacImpl.Microsoft.Diagnostics.DataContractReader.Legacy.ISOSDacInterface.GetMethodTableData(ClrDataAddress, DacpMethodTableData*)
        STDERROR: 00:00.788:    at <Microsoft_Diagnostics_DataContractReader_Legacy_ISOSDacInterface>F7D08DFA63EEFD39A651C932BEE9B168F60916DB84778D32AACF3004D988BD863__InterfaceImplementation.ABI_GetMethodTableData(ComWrappers.ComInterfaceDispatch*, UInt64, DacpMethodTableData*)

@dotnet-policy-service
Contributor

Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses a cDAC/legacy DAC HRESULT mismatch when SOS queries GetMethodTableData for a MethodTable whose EEClass pointer relationship superficially validates but whose EEClass memory is not actually readable (observed as a persistent CI failure). The fix makes EEClass readability part of MethodTable validation, and adds a regression test to ensure E_INVALIDARG is returned (matching legacy DAC behavior) instead of CORDBG_E_READVIRTUAL_FAILURE.

Changes:

  • Update MethodTable validation to eagerly construct/read Data.EEClass during validation so unreadable EEClass memory fails validation early.
  • Add a unit test that reproduces the “partially readable EEClass” scenario and asserts GetMethodTableData returns E_INVALIDARG.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/native/managed/cdac/Microsoft.Diagnostics.DataContractReader.Contracts/RuntimeTypeSystemHelpers/TypeValidation.cs | Make EEClass validation eagerly read all EEClass fields so unreadable EEClass memory causes validation failure (and thus E_INVALIDARG). |
| src/native/managed/cdac/tests/MethodTableTests.cs | Add regression test covering the unreadable/partial EEClass scenario for GetMethodTableData. |

Copilot AI review requested due to automatic review settings February 24, 2026 03:47
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

@max-charlamb max-charlamb marked this pull request as draft February 24, 2026 03:56
@jkotas
Member

jkotas commented Feb 24, 2026

> SOS is calling GetMethodTableData on a random address that happens to pass validation

This is like the 123rd time we are trying to patch some hole in this validation to fix intermittent failures. The current scheme is going to produce false positives by design.

I am wondering whether we can do better and implement 100% reliable validation: get module, token and instantiation from type, and lookup the type using those. If we get back the type we started with, it is a valid type. If not, it is a random pointer that looks like valid type.
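The round-trip idea can be sketched as follows. This is a hedged Python model under stated assumptions, not runtime code: `TypeKey`, `is_valid_type`, the `read_key` and `lookup` callables, and all addresses are hypothetical. The point it illustrates is that a lookalike structure decodes to the same (module, token, instantiation) key as a real type, but the lookup returns the real type's address, so the lookalike fails the comparison.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class TypeKey(NamedTuple):
    """Identity of a type as proposed: module, token and instantiation."""
    module: int
    token: int
    instantiation: Tuple[int, ...]


def is_valid_type(
    candidate: int,
    read_key: Callable[[int], Optional[TypeKey]],
    lookup: Callable[[TypeKey], Optional[int]],
) -> bool:
    """Accept the candidate only if looking up its key returns the candidate."""
    key = read_key(candidate)
    if key is None:
        # The candidate's memory could not be decoded into a key at all.
        return False
    return lookup(key) == candidate


# Hypothetical data: one loaded type at 0x1000, plus a stray copy of its
# bytes at 0x5000 that decodes to the same key.
SOME_TYPE = TypeKey(module=0x7000, token=0x02000010, instantiation=(0x1100,))
loaded_types = {SOME_TYPE: 0x1000}
decoded = {0x1000: SOME_TYPE, 0x5000: SOME_TYPE}
```

Here `is_valid_type(0x1000, decoded.get, loaded_types.get)` succeeds, while the copy at `0x5000` is rejected because the lookup does not return the address we started from.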

@noahfalk
Member

noahfalk commented Feb 24, 2026

> get module, token and instantiation from type, and lookup the type using those

This sounds like it would be reliable at detecting if the pointer was originally allocated in the debuggee as a MethodTable. It wouldn't catch memory corruption to any portion of the data structure that wasn't directly used in the lookup. To me it sounds complementary, but it wouldn't necessarily catch the kinds of issues Max's validation would detect.

As for feasibility, triage dumps today don't contain the EETypeHashTables and there may be other gaps. I'd guess we need to add at least 50 bytes per MethodTable to capture all the data structures the validation algo would need to touch. I wouldn't expect a ton of types in a triage dump (1 per stack frame) so maybe 10s of KB on a 2MB dump? Put a big margin of error on that until someone explores in more detail.
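The size estimate is simple arithmetic and can be checked with a short calculation. In this sketch the 50 bytes per MethodTable and the 2 MB dump come from the comment above; the type count of 400 and the reading of "2MB" as 2 MiB are assumptions chosen only to land in the "10s of KB" range mentioned.

```python
def validation_overhead(type_count, bytes_per_type=50, dump_bytes=2 * 1024 * 1024):
    """Return (extra bytes, fraction of the dump) for the added lookup data."""
    extra = type_count * bytes_per_type
    return extra, extra / dump_bytes


# Hypothetical triage dump with 400 distinct types across its stack frames.
extra, fraction = validation_overhead(400)
print(f"{extra} extra bytes, {fraction:.2%} of the dump")
```

Under these assumptions the added data is 20,000 bytes, a bit under 1% of the dump; as the comment says, the real figure needs a large margin of error until someone measures it.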

I think we'd get a good return on doing a little more validation of the immediate MethodTable/EEClass fields and stopping there. If you think it's important we go farther we can, I'm just not sure it will give us much return on the dev time and extra dump memory.

> This is like the 123rd time we are trying to patch some hole in this validation to fix intermittent failures

Maybe I'm missing some history. My understanding is that DAC's approach to MethodTable validation has been reasonably stable over a long period of time. We check the MethodTable -> EEClass -> MethodTable loop and assume any data structure satisfying that constraint is valid. I wasn't aware of the history of validation changes you mentioned. Any breadcrumb I should be following?

@jkotas
Member

jkotas commented Feb 24, 2026

> My understanding is that DAC's approach to MethodTable validation has been reasonably stable over a long period of time.

I have personally fought with it a number of times, mostly in the .NET Framework days, when we ran the SOS tests in the inner loop and the non-deterministic failures were a problem. We are not running the SOS tests in the inner loop these days. If we started running them again with high frequency, I expect we would start seeing the instability again.

> It wouldn't catch memory corruption to any portion of the data structure that wasn't directly used in the lookup.

For investigation of crash dumps with corrupted data structures, this sort of validation is about as harmful as it is useful. For example, I investigated a crash a few months ago where the EEClass pointer was corrupted: #119761 (comment). This validation was not helping with the investigation.

> triage dumps

Do we really need this sort of validation for triage dumps? Can the workflows for investigating triage dumps avoid throwing random pointers against DAC APIs and hoping they return a semi-accurate answer? Most SOS commands do not work well in triage dumps. I do not think we would lose much if we stopped doing this validation in triage dumps.

> we'd get a good return on doing a little more validation of the immediate MethodTable/EEClass fields and stopping there.

I expect we will want to investigate creating EEClass/MethodDesc/FieldDesc lazily at some point to further improve startup performance by making CoreCLR w/ R2R characteristics more similar to NativeAOT. Doubling down on using EEClass/MethodDesc/FieldDesc for validation of random pointers would go against that.

I do not expect that this will be solved in this PR. I wanted to mention this since I do not think the current "design" of these validations is good. Maybe create an issue about this?

@max-charlamb
Member Author

> > SOS is calling GetMethodTableData on a random address that happens to pass validation
>
> This is like the 123rd time we are trying to patch some hole in this validation to fix intermittent failures. The current scheme is going to produce false positives by design.
>
> I am wondering whether we can do better and implement 100% reliable validation: get module, token and instantiation from type, and lookup the type using those. If we get back the type we started with, it is a valid type. If not, it is a random pointer that looks like valid type.

I'm not trying to modify the DAC MethodTable validation; I'm attempting to make the cDAC follow the same scheme to prevent failures in the runtime-diagnostic pipeline.

This error occurs because the cDAC validation logic does not check that the entire method table is readable until after validation occurs. This results in a virtual read exception rather than an argument exception.
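The ordering problem can be illustrated with a small model of how the two failure paths map to different HRESULTs. This is a Python sketch under assumptions, not the SOSDacImpl code: the function name, the `readable` set, and the field offsets are hypothetical. The HRESULT values are the ones in the CI log above (`cDAC: 80131c49, DAC: 80070057`), paired with the names used in the pull request overview.

```python
S_OK = 0x00000000
E_INVALIDARG = 0x80070057                   # what the legacy DAC returned
CORDBG_E_READVIRTUAL_FAILURE = 0x80131C49   # what the cDAC returned


class ReadError(Exception):
    """Stands in for a failed read of target memory."""


def get_method_table_data(readable, address, field_offsets, validate_all_fields):
    """Model of a DAC-style entry point that validates, then reads.

    readable: set of readable addresses (hypothetical stand-in for the target).
    validate_all_fields: False models the old behavior (validation touches
    only the first field); True models the fix (validation touches them all).
    """
    try:
        # Validation phase: any failure here is reported as a bad argument.
        checked = field_offsets if validate_all_fields else field_offsets[:1]
        for offset in checked:
            if address + offset not in readable:
                raise ValueError("not a valid MethodTable")
        # Read phase: a failure here surfaces as a virtual read error.
        for offset in field_offsets:
            if address + offset not in readable:
                raise ReadError(hex(address + offset))
        return S_OK
    except ValueError:
        return E_INVALIDARG
    except ReadError:
        return CORDBG_E_READVIRTUAL_FAILURE
```

With only the first field readable, the old ordering yields `CORDBG_E_READVIRTUAL_FAILURE` and the new ordering yields `E_INVALIDARG`, matching the mismatch that tripped the assertion.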

@jkotas
Member

jkotas commented Feb 24, 2026

Right, I understand you are trying to reimplement the quirks of the legacy DAC in this PR. My point was that I do not think it is the best forward-looking approach.

@max-charlamb max-charlamb force-pushed the cdac-fix-eeclass-validation branch from 646f1e8 to 7bf5e94 Compare February 24, 2026 16:44
@max-charlamb max-charlamb marked this pull request as ready for review February 24, 2026 17:30
Copilot AI review requested due to automatic review settings February 24, 2026 17:30
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@noahfalk
Member

@jkotas - thanks for all the extra info. I read your concerns as being at least as much about having more control over where validation occurs in the workflow and what the user experience of the validation is. Thus far I'd say SOS's approach is ad hoc and leans towards eager validation + errors rather than lazy validation + non-blocking warnings. I can see advantages for both in different circumstances but I'm certainly open to changing defaults or giving more control that could be used by sophisticated devs to get the behavior they want. I opened: #124829

In terms of triage dumps, we could certainly skip doing the validation you proposed if the various type hashtables are missing. I don't believe we have any direct info about whether a dump is or isn't a triage dump but we can make decisions based on what memory blocks we find. Depending on the scenario SOS may or may not be in control of what pointers are being analyzed as MethodTables.
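Choosing a strategy from the memory that is actually present could look like the sketch below. It is a hypothetical Python model: `validate_method_table`, the `type_tables` set, and the strategy labels are invented for illustration, and `None` stands in for "the dump does not contain the type hash tables".

```python
def validate_method_table(candidate, type_tables, pointer_loop_check):
    """Pick a validation strategy from what the dump actually contains.

    type_tables: set of known type addresses, or None when the dump does not
    include the type hash tables (hypothetical representation).
    pointer_loop_check: callable implementing the existing loop check.
    Returns (is_valid, strategy_used).
    """
    if type_tables is None:
        # No lookup data available: fall back to the existing check.
        return pointer_loop_check(candidate), "pointer-loop"
    return candidate in type_tables, "lookup"
```

Returning the strategy alongside the result is a design choice for the sketch: a caller could use it to decide whether to present the answer as an error or as a non-blocking warning.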

@max-charlamb max-charlamb force-pushed the cdac-fix-eeclass-validation branch from 7bf5e94 to f43b229 Compare February 26, 2026 16:52
Member

@noahfalk noahfalk left a comment

👍

@max-charlamb max-charlamb merged commit c69c476 into dotnet:main Feb 27, 2026
48 of 52 checks passed
@max-charlamb max-charlamb deleted the cdac-fix-eeclass-validation branch February 27, 2026 15:05
max-charlamb added a commit that referenced this pull request Mar 6, 2026
## Summary

Add IsContinuation to the cDAC RuntimeTypeSystem contract, enabling the
cDAC to identify and validate continuation MethodTables created by the
async continuation feature.

Continuations are dynamically-created MethodTables (similar to arrays)
whose parent is the base `Continuation` class stored in
`g_pContinuationClassIfSubTypeCreated`. Without this change, the cDAC's
MT→EEClass→MT validation roundtrip would reject valid continuation MTs.

Related discussion:
#124780 (comment)

## Changes

- **`datadescriptor.inc`** — Expose
`g_pContinuationClassIfSubTypeCreated` as `ContinuationMethodTable`
global pointer
- **`IRuntimeTypeSystem.cs`** — Add `IsContinuation(TypeHandle)` to the
contract interface
- **`RuntimeTypeSystem_1.cs`** — Implement `IsContinuation` by checking
`ParentMethodTable == continuationMethodTablePointer`
- **`RuntimeTypeSystemFactory.cs`** — Read the continuation MT global
(gracefully handles missing global via `TryReadGlobalPointer`)
- **`TypeValidation.cs`** — Fix MT→EEClass→MT validation to allow
continuations (like arrays/generics)
- **`Constants.cs`** — Add `ContinuationMethodTable` constant name
- **Tests** — 4 test methods (8 cases across architectures): true
positive, true negative, null global, and CanonMT validation

---------

Co-authored-by: Max Charlamb <maxcharlamb@microsoft.com>
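
The `IsContinuation` check described in the commit message compares a type's parent MethodTable with the continuation MethodTable global. A minimal Python sketch of that comparison follows; it is not the `RuntimeTypeSystem_1.cs` implementation, and treating a missing or null global as "no type is a continuation" is an assumption made here based on the "null global" test case listed above.

```python
from typing import Optional


def is_continuation(parent_method_table: int, continuation_mt: Optional[int]) -> bool:
    """Return True when the parent MethodTable is the continuation base class.

    continuation_mt is None when the global could not be read from the
    target, and 0 when it was read but is null (assumed to mean that no
    continuation subtype has been created).
    """
    if not continuation_mt:
        return False
    return parent_method_table == continuation_mt
```

For example, `is_continuation(0x4000, 0x4000)` is true, while a different parent, a null global, or a missing global all give false.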
5 participants