[Android] In-Proc Crash Reporter#126916
Conversation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag |
There was a problem hiding this comment.
Pull request overview
Adds an opt-in, in-process crash reporter for Android CoreCLR that emits a createdump-shaped JSON crash report to logcat/stderr and optionally to a *.crashreport.json file, with VM-provided managed thread/stack/exception callbacks.
Changes:
- Introduces PAL-side crash report generation (JSON writer, module/process name helpers, crash report generator).
- Wires VM callbacks for managed thread enumeration/stack walking/exception extraction into the PAL crash reporter on Android startup.
- Integrates crash report triggering into the Android crash/abort signal path and adds Android-only build plumbing.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/coreclr/vm/crashreportstackwalker.h | Declares Android-only VM registration hook for crash report stack walking. |
| src/coreclr/vm/crashreportstackwalker.cpp | Implements VM callbacks for managed stack walking, thread enumeration, and exception extraction. |
| src/coreclr/vm/ceemain.cpp | Registers VM callbacks during Android EE startup. |
| src/coreclr/vm/CMakeLists.txt | Adds the new VM crashreport stack walker source to the build. |
| src/coreclr/pal/src/thread/process.cpp | Adds Android crash report initialization, enablement state, and crash-time generation call. |
| src/coreclr/pal/src/include/pal/process.h | Exposes PROCIsCrashReportEnabled() for signal-path gating. |
| src/coreclr/pal/src/exception/signal.cpp | Avoids duplicate managed stack logging when crash reporting is enabled. |
| src/coreclr/pal/src/crashreport/moduleenumerator.h | Declares helpers to resolve process/module names via /proc. |
| src/coreclr/pal/src/crashreport/moduleenumerator.cpp | Implements /proc/self/cmdline and /proc/self/maps parsing for crash-time module/process lookup. |
| src/coreclr/pal/src/crashreport/inproccrashreporter.h | Declares crash report generator API and VM callback hooks. |
| src/coreclr/pal/src/crashreport/inproccrashreporter.cpp | Implements JSON crash report generation, logcat/stderr output, optional file output, and callback integration. |
| src/coreclr/pal/src/crashreport/crashjsonwriter.h | Declares fixed-buffer JSON writer intended for signal-safe usage. |
| src/coreclr/pal/src/crashreport/crashjsonwriter.cpp | Implements the fixed-buffer JSON writer. |
| src/coreclr/pal/src/CMakeLists.txt | Adds Android-only PAL crashreport sources to the build. |
| } | ||
| else if (exceptionType != NULL && exceptionType[0] != '\0') | ||
| { | ||
| char hresultBuffer[32]; |
There was a problem hiding this comment.
Do we want to condition this if a native code fault occurs in the middle of managed exception processing? It seems like the managed throwable would still be on the thread. Would we want to keep the managed throwable for context?
There was a problem hiding this comment.
Looks like this comment have ended up in wrong place ?
There was a problem hiding this comment.
It might no longer be in that spot, but I think the scenario is a native fault while a managed throwable is pinned on the thread, leading to the crash report attributing the crash to that throwable.
| const char* last = path; | ||
| for (const char* p = path; *p != '\0'; p++) | ||
| { | ||
| if (*p == '/') |
There was a problem hiding this comment.
Maybe not make it platform specific, at least put it in a define that can be more easily identified.
| // |bufferSize| > 0. This helper is async-signal-safe: it performs no | ||
| // allocation, locking, or TLS access. | ||
| void | ||
| FormatHexValue( |
There was a problem hiding this comment.
Maybe these methods fit better in the json writer, so instead of writing strings, add methods to WriteAsHex(uint64_t), WriteDecimal(uint64_t), WriteDecimal(int64_t) etc.
If we emit them as the real data type in JSON, then it should be written as is, but if we emit them as strings surrounded by "", then we should probably also have WriteDecimalAsString methods.
| @@ -35,6 +35,10 @@ SET_DEFAULT_DEBUG_CHANNEL(PROCESS); // some headers have code with asserts, so d | |||
| #include <generatedumpflags.h> | |||
| #include <clrconfignocache.h> | |||
|
|
|||
| #ifdef FEATURE_INPROC_CRASHREPORT | |||
| #include "debug/crashreport/inproccrashreporter.h" | |||
There was a problem hiding this comment.
@jkotas, ok to add a dependency like this between corelcr pal and the in-proc crash reporter library? Alternative is to do what we did with g_logManagedCallstackForSignalCallback and use registered callbacks and remove the hard dependecy.
There was a problem hiding this comment.
PAL is linked into multiple binaries. Are all those binaries going to need to statically link the crash reporter as well?
You may be getting away with this on Android, but I doubt that the build will work if the crashreporter is enabled on other platforms.
Alternative is to do what we did with g_logManagedCallstackForSignalCallback
I think the direct linking without pointer indirections would be more ok in this case than it is for PAL.
| @@ -37,6 +37,9 @@ endif(CORECLR_SET_RPATH) | |||
| # Include directories | |||
|
|
|||
| include_directories(include) | |||
| if(CLR_CMAKE_TARGET_ANDROID) | |||
| include_directories(${CLR_DIR}) | |||
There was a problem hiding this comment.
Same question as above, ok to setup a dependency between in-crash reporter and coreclr pal? If we are ok with that, maybe we should at least reduce the scope of the include dir to only point to the in-proc crash reporter.
There was a problem hiding this comment.
If its OK we should probably use the in-proc crash reporter feature flag.
| @@ -4,6 +4,9 @@ set(CMAKE_INCLUDE_CURRENT_DIR ON) | |||
| include_directories(BEFORE ${CMAKE_CURRENT_SOURCE_DIR}) | |||
| include_directories(${ARCH_SOURCES_DIR}) | |||
| include_directories(${CMAKE_CURRENT_SOURCE_DIR}/../interop/inc) | |||
| if(CLR_CMAKE_TARGET_ANDROID) | |||
| include_directories(${CLR_DIR}) | |||
There was a problem hiding this comment.
maybe scope it down to the crash reporter and use feature flag to trigger adding the include directory.
| @@ -109,6 +109,13 @@ if(CLR_CMAKE_HOST_UNIX) | |||
| add_linker_flag(-Wl,-z,notext) | |||
| endif() | |||
|
|
|||
| # PAL consumes FEATURE_INPROC_CRASHREPORT, so the feature variable must be | |||
There was a problem hiding this comment.
You should set this like other features add their compile definitions, clrdefinitions.cmake.
| FrameCallbackAdapter( | ||
| CrawlFrame* pCF, | ||
| VOID* pData) | ||
| { |
There was a problem hiding this comment.
Should we have some contract in this method making sure we are not calling methods that could cause issues when called from signal handler? Like:
CONTRACTL
{
NOTHROW;
GC_NOTRIGGER;
CANNOT_TAKE_LOCK;
MODE_ANY;
}
CONTRACTL_END;
You need to run on debug/checked build to make sure the contract holds.
There was a problem hiding this comment.
The runtime contract checks are enabled on Windows only. (It is still a good idea to add contracts even in non-Windows to document the expectations.)
| if (g_pDebugInterface != nullptr && pMD != nullptr) | ||
| { | ||
| DWORD resolvedILOffset = 0; | ||
| if (g_pDebugInterface->GetILOffsetFromNative( |
There was a problem hiding this comment.
This method will probably violate a stricter contract since it can throw, we should at least handle the scenario when it throws.
| // If DbgMiniDumpName is just a filename (no directory component), write | ||
| // the crash report under TMPDIR / /tmp so it lands somewhere writable. | ||
| char dumpPathBuf[CRASHREPORT_STRING_BUFFER_SIZE]; | ||
| if (strchr(dumpName, '/') == nullptr) |
There was a problem hiding this comment.
cross platform deliminiator?
| char dumpPathBuf[CRASHREPORT_STRING_BUFFER_SIZE]; | ||
| if (strchr(dumpName, '/') == nullptr) | ||
| { | ||
| const char* tmpDir = getenv("TMPDIR"); |
There was a problem hiding this comment.
Looks like we have this logic spread out a little through the codebase, g_get_tmp_dir, ep_rt_temp_path_get etc. Maybe it should end up in shared code at some point (cross platform async signal safe).
Bundle of changes addressing reviewer feedback on the in-proc crash reporter (PR dotnet#126916). Highlights: - Group file-scope helpers in inproccrashreporter.cpp under a single CrashReportHelpers struct; drop static forward-declaration block. - Fold FormatHexValue / FormatUnsignedDecimal / FormatSignedDecimal helpers onto SignalSafeJsonWriter and add WriteHex / WriteDecimal / WriteSignedDecimal members; eliminate per-call scratch buffers across inproccrashreporter.cpp. - crashreportstackwalker.cpp: rename CrashReportRegisterStackWalker to CrashReportConfigure; have CrashReportSuspendThreads return the suspended state and pass it to CrashReportResumeThreads (drop the static s_runtimeSuspendedForCrashReport flag); wrap GetILOffset call in EX_TRY / EX_CATCH; add CONTRACTL annotations on FrameCallbackAdapter and CrashReportGetExceptionForThread. - PAL callback inversion: register the in-proc crash reporter as a PAL signal callback instead of having PAL know about the reporter type directly; gate the wiring with FEATURE_INPROC_CRASH_REPORT cmake define. - Various smaller review-feedback cleanups consistent with reviewer preferences (e.g. (void)param; for unused parameters). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three places in the runtime independently resolve the system temporary
directory: g_get_tmp_dir (eglib), ep_rt_temp_path_get (EventPipe), and
the new VM crash reporter added in this PR. lateralusX asked for these
to converge on a shared signal-safe helper so cross-platform divergence
(Unix TMPDIR/tmp vs Windows TMP/TEMP) and any future hardening live in
one place.
Add minipal_get_tempdir(buffer, size) under src/native/minipal:
- Unix: probes \, falls back to "/tmp/".
- Windows: probes %TMP%, then %TEMP%, falls back to "C:\\Temp\\".
- Always writes a NUL-terminated path that ends with the platform
separator, so callers can append a filename directly.
- Uses only getenv / strlen / memcpy, matching the existing in-proc
crash reporter contract (no allocation, no syscalls beyond getenv).
Migrate the VM crash reporter caller in CrashReportConfigure to use
the new helper. The other two pre-existing callers (g_get_tmp_dir,
ep_rt_temp_path_get) are intentionally left for follow-up so this
change stays scoped and low-risk for the in-proc crash reporter PR.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Temp directory location is security sensitive. You do not want to open-code it like this, in particular on Windows. Implementations like this led to security incidents in the past.
If you would like to add a general-purpose temp directory method to minipal, it has to do exactly what the reviewed managed implementation does. I am not sure whether it is worth the pain.
There was a problem hiding this comment.
The API to get tmpdir is a Java one and is not available to you within the crash handler. Should we just use the API and cache its value at startup if the right env variables are set?
[Note] The first commit adds the skeleton with stubbed fields, the following commits resolves fields, followed by another commit that emits the json file.
Android CoreCLR does not currently have the same out-of-proc
createdumpexperience available on other platforms. That makes native crashes, aborts, and mixed managed/native failures significantly harder to diagnose in real applications.This PR adds an opt-in in-proc crash reporter for Android CoreCLR that emits a JSON-formatted crash report modeled after
createdump'sCrashReportWriter.When enabled, the runtime writes the crash report directly from the crash path to:
1. logcat / console output2. an optional
*.crashreport.jsonfile when DbgMiniDumpName is set.The long-term intent is for this reporting path to be async-signal-safe. This PR starts that work by making the lower-risk / low-hanging pieces fit that model where practical, while leaving the more complex runtime-state publication and hardening work to follow-up PRs.
Enabling the crash reporter
The crash reporter is opt-in.
Enable JSON crash reporting
Example: write the report to a file as well
When configured this way, the runtime will also attempt to write:
Current design note
This PR is intentionally the first step, not the complete hardening story.
The payload shape and reporting path are designed with async-signal-safety as the target, but today the implementation is a mix of:
Follow-up PRs will continue hardening the remaining pieces so the implementation better matches the intended strict crash-path constraints.
Example crash report shapes
https://gist.github.com/mdh1418/a893c7dd953bf725569b22abc10e59c1
Follow-up work
Cross-platform support (iOS/osx)
Emitting the crash report to logcat/console