[cDAC] Fix EEClass validation corner case#124780
Conversation
|
Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag |
There was a problem hiding this comment.
Pull request overview
This PR addresses a cDAC/legacy DAC HRESULT mismatch when SOS queries GetMethodTableData for a MethodTable whose EEClass pointer relationship superficially validates but whose EEClass memory is not actually readable (observed as a persistent CI failure). The fix makes EEClass readability part of MethodTable validation, and adds a regression test to ensure E_INVALIDARG is returned (matching legacy DAC behavior) instead of CORDBG_E_READVIRTUAL_FAILURE.
Changes:
- Update MethodTable validation to eagerly construct/read
Data.EEClassduring validation so unreadable EEClass memory fails validation early. - Add a unit test that reproduces the “partially readable EEClass” scenario and asserts
GetMethodTableDatareturnsE_INVALIDARG.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/native/managed/cdac/Microsoft.Diagnostics.DataContractReader.Contracts/RuntimeTypeSystemHelpers/TypeValidation.cs | Make EEClass validation eagerly read all EEClass fields so unreadable EEClass memory causes validation failure (and thus E_INVALIDARG). |
| src/native/managed/cdac/tests/MethodTableTests.cs | Add regression test covering the unreadable/partial EEClass scenario for GetMethodTableData. |
...icrosoft.Diagnostics.DataContractReader.Contracts/RuntimeTypeSystemHelpers/TypeValidation.cs
Show resolved
Hide resolved
This is like 123th time we are trying to patch some hole in this validation to fix intermittent failures. The current scheme is going to produce false positives by design. I am wondering whether we can do better and implement 100% reliable validation: get module, token and instantiation from type, and lookup the type using those. If we get back the type we started with, it is a valid type. If not, it is a random pointer that looks like valid type. |
This sounds like it would be reliable at detecting if the pointer was originally allocated in the debuggee as a MethodTable. It wouldn't catch memory corruption to any portion of the data structure that wasn't directly used in the lookup. To me it sounds complimentary, but it wouldn't necessarily catch the kinds of issues Max's validation would detect. As for feasibility, triage dumps today don't contain the EETypeHashTables and there may be other gaps. I'd guess we need to add at least 50 bytes per MethodTable to capture all the data structures the validation algo would need to touch. I wouldn't expect a ton of types in a triage dump (1 per stack frame) so maybe 10s of KB on a 2MB dump? Put a big margin of error on that until someone explores in more detail. I think we'd get a good return on doing a little more validation of the immediate MethodTable/EEClass fields and stopping there. If you think its important we go farther we can, I'm just not sure it will give us much return on the dev time and extra dump memory.
Maybe I'm missing some history. My understanding is that DAC's approach to MethodTable validation has been reasonably stable over a long period of time. We check the MethodTable -> EEClass -> MethodTable loop and assume any datastructure satisfying that constraint is valid. I wasn't aware of the history of validation changes you mentioned. Any breadcrumb I should be following? |
...icrosoft.Diagnostics.DataContractReader.Contracts/RuntimeTypeSystemHelpers/TypeValidation.cs
Outdated
Show resolved
Hide resolved
I have been personally fighting with it number of times. Mostly in .NET framework days where we run the SOS tests in the inner loop and the non-deterministic failures were a problem. We are not running the SOS tests in the inner loop these days. If we started running them again with high frequency, I expect we would start seeing the instability again.
For investigation of crash dumps with corrupted data structures, this sort of validation is about as harmful as it is useful. For example, I have investigated a crash a few months ago where the EEClass pointer was corrupted: #119761 (comment) . This validation was not helping with the investigation.
Do we really need this sort of validation for triage dumps? Can the workflows for investigating triage dumps avoid throwing random pointers against DAC APIs and hoping it to return semi-accurate answer? Most SOS commands do not work well in triage dumps. I do not think we would lose much if we stopped doing this validation in triage dumps.
I expect we will want to investigate creating EEClass/MethodDesc/FieldDesc lazily at some point to further improve startup performance by making CoreCLR w/ R2R characteristics more similar to NativeAOT. Doubling down on using EEClass/MethodDesc/FieldDesc for validation of random pointers would go against that. I do not expect that this will be solved in this PR. I wanted to mention this since I do not think the current "design" of these validations is good. Maybe create an issue about this? |
I'm not trying to modify the DAC MethodTable validation, I'm attempting to make the cDAC follow the same scheme to prevent failures in the runtime-diagnostic pipeline. This error occurs because the cDAC validation logic does not check that the entire method table is readable until after validation occurs. This results in a virtual read exception rather than an argument exception. |
|
Right, I understand you are trying to reimplement the quirks of the legacy DAC in this PR. My point was that I do not think it is the best forward-looking approach. |
646f1e8 to
7bf5e94
Compare
|
@jkotas - thanks for all the extra info. I read your concerns as being at least as much about having more control over where validation occurs in the workflow and what the UX experience of the validation is. Thus far SOS's approach I'd say is ad-hoc and leans towards eager validation + errors rather than lazy validation + non-blocking warnings. I can see advantages for both in different circumstances but I'm certainly open to changing defaults or giving more control that could be used by sophisticated devs to get the behavior they want. I opened: #124829 In terms of triage dumps, we could certainly skip doing the validation you proposed if the various type hashtables are missing. I don't believe we have any direct info about whether a dump is or isn't a triage dump but we can make decisions based on what memory blocks we find. Depending on the scenario SOS may or may not be in control of what pointers are being analyzed as MethodTables. |
7bf5e94 to
f43b229
Compare
...icrosoft.Diagnostics.DataContractReader.Contracts/RuntimeTypeSystemHelpers/TypeValidation.cs
Show resolved
Hide resolved
## Summary Add IsContinuation to the cDAC RuntimeTypeSystem contract, enabling the cDAC to identify and validate continuation MethodTables created by the async continuation feature. Continuations are dynamically-created MethodTables (similar to arrays) whose parent is the base `Continuation` class stored in `g_pContinuationClassIfSubTypeCreated`. Without this change, the cDAC's MT→EEClass→MT validation roundtrip would reject valid continuation MTs. Related discussion: #124780 (comment) ## Changes - **`datadescriptor.inc`** — Expose `g_pContinuationClassIfSubTypeCreated` as `ContinuationMethodTable` global pointer - **`IRuntimeTypeSystem.cs`** — Add `IsContinuation(TypeHandle)` to the contract interface - **`RuntimeTypeSystem_1.cs`** — Implement `IsContinuation` by checking `ParentMethodTable == continuationMethodTablePointer` - **`RuntimeTypeSystemFactory.cs`** — Read the continuation MT global (gracefully handles missing global via `TryReadGlobalPointer`) - **`TypeValidation.cs`** — Fix MT→EEClass→MT validation to allow continuations (like arrays/generics) - **`Constants.cs`** — Add `ContinuationMethodTable` constant name - **Tests** — 4 test methods (8 cases across architectures): true positive, true negative, null global, and CanonMT validation --------- Co-authored-by: Max Charlamb <maxcharlamb@microsoft.com>
Looked into the persistent CI failure and think I found the issue. It looks like SOS is calling GetMethodTableData on a random address that happens to pass validation because it has a pointer going back to the MethodTable. However, when we try to read the full EEClass it isn't available and we throw a different error.
This change should make sure the EEClass is validated and readable. Added unit test to verify.
CI Failure