Use SIGCHLD to trigger Process waitpid check#26291
Use SIGCHLD to trigger Process waitpid check#26291stephentoub merged 36 commits intodotnet:masterfrom
Conversation
| { | ||
| internal static partial class Sys | ||
| { | ||
| private static bool s_signalHandlingInitialized = false; |
There was a problem hiding this comment.
This lock won't work once this file is used in multiple .dlls (System.Console + System.Diagnostics.Process). The lock should in the native implementation, I think.
There was a problem hiding this comment.
Right!
I am wondering: is it supposed to work that a single Unix process may host .NET Core several times?
I guess those instances would share the global variables in the native implementation?
There was a problem hiding this comment.
is it supposed to work that a single Unix process may host .NET Core several times?
We do not support or test config like these. It should be possible in theory, but I doubt that it would "just work".
I guess those instances would share the global variables in the native implementation?
Right.
| do | ||
| { | ||
| int status; | ||
| while (CheckInterrupted(pid = waitpid(WAIT_ANY, &status, WNOHANG))); |
There was a problem hiding this comment.
Why is it necessary to do waitpid if it's not to clear out zombies?
There was a problem hiding this comment.
The comment is confusing. I'll reword it.
If the original disposition is SIG_IGN, then the kernel won't generate zombies. (comment)
But since we overwrote the disposition, we do get zombies, and we need to waitpid them. (code)
|
I have added the managed implementation. It should be possible to eliminate the foreach loop in |
| static void UninitializeConsole() | ||
| void UninitializeConsole() | ||
| { | ||
| // pal_signal.cpp calls this on SIGKILL/SIGTERM. |
| static struct sigaction g_origSigIntHandler, g_origSigQuitHandler; // saved signal handlers for ctrl handling | ||
| static struct sigaction g_origSigContHandler, g_origSigChldHandler; // saved signal handlers for reinitialization | ||
| static volatile CtrlCallback g_ctrlCallback = nullptr; // Callback invoked for SIGINT/SIGQUIT | ||
| static volatile SigChldCallback g_sigChldCallback = nullptr; // Callback invoked for SIGINT/SIGQUIT |
| // In general, we now want to remove our handler and reissue the signal to | ||
| // be picked up by the previously registered handler. In the most common case, | ||
| // this will be the default handler, causing the process to be torn down. | ||
| // It could also be a custom handle registered by other code before us. |
| } | ||
|
|
||
| // Finally, register our signal handlers | ||
| InstallSignalHandler(SIGINT , /* overwriteIgnored */ false); |
There was a problem hiding this comment.
Could you clarify why the overwrite ignored / not ignored behavior for these signals?
| } | ||
| else if (waitResult == -1) | ||
| { | ||
| Debug.Fail("Unexpected errno value from waitpid"); |
There was a problem hiding this comment.
If someone else has waitpid our child (which they shouldn't), we'd get ECHILD. The code then does what is expected: SetExited.
For non-child processes (which were previously detected as ECHILD), the detection is now in CheckForNonChildExit.
danmoseley
left a comment
There was a problem hiding this comment.
Someone else should review this as well as I'm not experienced in Linux signals.
| extern "C" int32_t SystemNative_InitializeSignalHandling() | ||
| { | ||
| static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; | ||
| static bool initialized = false; |
There was a problem hiding this comment.
Mono team is in the process of rewriting the System.Native PAL from .cpp to .c so that it can be shared with Mono (#25032 (comment)). It would be nice for more significant rewrites and additions like this file to be in .c so that they do not need to be rewritten.
| <data name="IO_AlreadyExists_Name" xml:space="preserve"> | ||
| <value>Cannot create '{0}' because a file or directory with the same name already exists.</value> | ||
| </data> | ||
| <data name="IO_BindHandleFailed" xml:space="preserve"> |
There was a problem hiding this comment.
Is GetExceptionForIoErrno ever going to produce a good diagnosable error for InitializeSignalHandling? It seems like that these errors are specifically designed for file I/O.
There was a problem hiding this comment.
Indeed. I'll change this to throw Win32Exception.
|
@danmosemsft I have some additional changes I'd like to make:
To not make this PR larger, perhaps I should make those changes in separate PRs? |
|
@tmds if this stands alone then separate PR's seems fine. |
| { | ||
| internal partial class Sys | ||
| { | ||
| internal delegate void SigChldCallback(); |
| CloseIfOpen(stdinFds[WRITE_END_OF_PIPE]); | ||
| CloseIfOpen(stdoutFds[READ_END_OF_PIPE]); | ||
| CloseIfOpen(stderrFds[READ_END_OF_PIPE]); | ||
| // Reap child |
| } | ||
| else if (origHandler->sa_sigaction != NULL) | ||
| { | ||
| // TODO?: We are passing a NULL siginfo and context, do we need to try and do better? |
There was a problem hiding this comment.
Oh, I see. This seems problematic, no? That we're not passing the original information down, on the original thread, etc.? Why can't we do this delegation to the original as part of the actual signal handler rather than in the asynchronous handler, e.g. have the async handler delegate to the original and then queue the work to the separate thread?
There was a problem hiding this comment.
This seems problematic, no? That we're not passing the original information down, on the original thread, etc.?
Since SIGCHLD can be merged, you can not count on getting this info for each child. So it's less problematic than it seems at first.
I think we must properly handle SIG_DLF case, and should handle SIG_IGN.
If there is a custom handler in place, I think we always end up making some assumptions on its behavior. If we call it from the signal thread, we are assuming it won't reap our children. I think we can document that assumption and move this to the signal thread.
| } | ||
| else | ||
| { | ||
| SystemNative_ResumeSigChld(); |
There was a problem hiding this comment.
In what situation does this occur? Is it only when no process has ever been started with Process.Start yet in this process, or is it possible for g_sigChldCallback to go from non-null to null? I'm wondering about the waitpids this is doing and whether it'll interfere with a Process later getting the results of a process when code Waits on it and accesses its exit information.
There was a problem hiding this comment.
This occurs when signal handling was setup for the Console and no Process.Starts have occurred.
I'll add a comment.
|
Is this included in 2.1 preview 2? |
|
@myrup yes it will be. Please try it out when that is available |
|
@danmosemsft Thanks! I'm on the edge of my seat as this issue is the only thing holding back a general switch from mono to .Net Core for us :) |
|
Happy to report this leak has been fixed in 2.1 preview 2! I know it's in the enhancement department, but since I felt a significant difference testing mono and .Net Core I decided to time it. The following snip takes 15s vs 10s in .Net Core 2.1 preview 2 vs. mono 5.8.1 : 33% slower launch times for processes seems significant to me. (I'm on Darwin. You know best if this concerns all *nix ) |
|
@myrup great to hear is fixed. Definitely interested to know when we are slower than mono. Could you please open a new issue? If you happen to have Linux timings that would be interesting also. |
This is the worst-case scenario: 10000 processes are started and exiting concurrently. I'm having a look in the mono implementation, some notable differences:
|
|
From those differences I think the locking is causing the performance difference. This is how corefx locks: corefx/src/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs Lines 313 to 342 in 1b643bd Notice: the lock includes the The equivalent lock in mono does not include the fork call: https://github.com/mono/mono/blob/17a2fba78de10678cf1ad903d410b057340a2795/mono/metadata/w32process-unix.c#L2056-L2060 This means when sigchld comes, there is a very tiny chance the process is not known and doesn't get reaped. It can get reaped when the next child exits, since all executing processes are checked each time: https://github.com/mono/mono/blob/17a2fba78de10678cf1ad903d410b057340a2795/mono/metadata/w32process-unix.c#L725-L746 corefx implementation avoids this O(N) lookup and performs a dictionary lookup instead. |
|
@danmosemsft I haven't had an opportunity to try this on linux. @tmds I've created a new issue #29074 |
|
@tmds this may interest you: #29123 |
dotnet#26291 changed process reaping from using waitpid to waitid. This caused a regression on mac, since for processes that are killed, (on mac) waitpid does not return the signal number that caused the process to terminated. We change back to waitpid for reaping children and determining the exit code. waitid is used to terminated children. Fixes https://github.com/dotnet/corefx/issues/29370
dotnet#26291 changed process reaping from using waitpid to waitid. This caused a regression on mac, since for processes that are killed, (on mac) waitid does not return the signal number that caused the process to terminated. We change back to waitpid for reaping children and determining the exit code. waitid is used to terminated children. Fixes https://github.com/dotnet/corefx/issues/29370
dotnet#26291 changed process reaping from using waitpid to waitid. This caused a regression on mac, since for processes that are killed, (on mac) waitid does not return the signal number that caused the process to terminated. We change back to waitpid for reaping children and determining the exit code. waitid is used to terminated children. Fixes https://github.com/dotnet/corefx/issues/29370
dotnet#26291 changed process reaping from using waitpid to waitid. This caused a regression on mac, since for processes that are killed, (on mac) waitid does not return the signal number that caused the process to terminated. We change back to waitpid for reaping children and determining the exit code. waitid is used to find terminated children. Fixes https://github.com/dotnet/corefx/issues/29370
* Fix Process.ExitCode on mac for killed processes #26291 changed process reaping from using waitpid to waitid. This caused a regression on mac, since for processes that are killed, (on mac) waitid does not return the signal number that caused the process to terminated. We change back to waitpid for reaping children and determining the exit code. waitid is used to find terminated children. Fixes https://github.com/dotnet/corefx/issues/29370 * TestExitCodeKilledChild: remove runtime check * TestExitCodeKilledChild: remove greater than assert
* Fix Process.ExitCode on mac for killed processes dotnet#26291 changed process reaping from using waitpid to waitid. This caused a regression on mac, since for processes that are killed, (on mac) waitid does not return the signal number that caused the process to terminated. We change back to waitpid for reaping children and determining the exit code. waitid is used to find terminated children. Fixes https://github.com/dotnet/corefx/issues/29370 * TestExitCodeKilledChild: remove runtime check * TestExitCodeKilledChild: remove greater than assert
* Fix Process.ExitCode on mac for killed processes #26291 changed process reaping from using waitpid to waitid. This caused a regression on mac, since for processes that are killed, (on mac) waitid does not return the signal number that caused the process to terminated. We change back to waitpid for reaping children and determining the exit code. waitid is used to find terminated children. Fixes https://github.com/dotnet/corefx/issues/29370 * TestExitCodeKilledChild: remove runtime check * TestExitCodeKilledChild: remove greater than assert
Child reapping was changed to be triggered by the SIGCHLD signal (dotnet#26291). As part of that change, code was added to handle the original handler being SIG_IGN. In that case, there was a missing mutex unlock. Fixes https://github.com/dotnet/corefx/issues/29841.
Child reapping was changed to be triggered by the SIGCHLD signal (#26291). As part of that change, code was added to handle the original handler being SIG_IGN. In that case, there was a missing mutex unlock. Fixes https://github.com/dotnet/corefx/issues/29841.
) Child reapping was changed to be triggered by the SIGCHLD signal (dotnet#26291). As part of that change, code was added to handle the original handler being SIG_IGN. In that case, there was a missing mutex unlock. Fixes https://github.com/dotnet/corefx/issues/29841.
* Separate signal handling from console implementation
* Pass SIGCHLD to SignalHandlerLoop
* Move lock to native code
* Add sigchld callback
* Reap children on SIGCHLD
* Reap child on unsuccesful ForkAndExecProcess
* ResumeSigChld: improve comment and fix build
* Handle iterator becoming invalid while reaping children
* Fix comments
* pal_signal.cpp -> pal_signal.c
* Throw Win32Exception when InitializeSignalHandling fails
* Fix alpine build: missing C void arguments
* Remove ResumeSigChld
* Call InitializeSignalHandling from InitializeConsoleInitializeConsole and RegisterForSigChldRegisterForSigChld
* Use ReaderWriterLock to allow multiple Processes to start concurrently
* throw Win32Exception when InitializeConsole fails
* PR feedback
* Implement SystemNative_WaitPid using waitid to check OS support
* Fix WaitPid waitid implementation
* Replace WaitPid with WaitIdExitedNoHang
* Optimize child reaping by asking OS what children terminated instead of iterating
* Implement WaitForExit for children using ManualResetEvent
* Remove SystemNative_W{ExitStatus,IfExited,IfSignaled,TermSig}
* Don't spin up wait loop for children
* Don't create ManualResetEvent when the child has already exited
* Add TestChildProcessCleanup test
* TestChildProcessCleanup: 'uname' is not at '/usr/bin' on Debian systems
* Fix multiple assignments of _waitStateHolder
* Add TestProcessWaitStateReferenceCount
* FailFast when waitid gives an unexpected return
* ProcessWaitHandle: let ProcessWaitState dispose the handle
* TestProcessWaitStateReferenceCount: fix test, need to WaitForExit otherwise Dispose cancels Exited event
* Add TestChildProcessCleanupAfterDispose
* TestProcessWaitStateReferenceCount: add retry+sleep
Commit migrated from dotnet/corefx@07fbff4
* Fix Process.ExitCode on mac for killed processes dotnet/corefx#26291 changed process reaping from using waitpid to waitid. This caused a regression on mac, since for processes that are killed, (on mac) waitid does not return the signal number that caused the process to terminated. We change back to waitpid for reaping children and determining the exit code. waitid is used to find terminated children. Fixes https://github.com/dotnet/corefx/issues/29370 * TestExitCodeKilledChild: remove runtime check * TestExitCodeKilledChild: remove greater than assert Commit migrated from dotnet/corefx@a03f785
…refx#29843) Child reapping was changed to be triggered by the SIGCHLD signal (dotnet/corefx#26291). As part of that change, code was added to handle the original handler being SIG_IGN. In that case, there was a missing mutex unlock. Fixes https://github.com/dotnet/corefx/issues/29841. Commit migrated from dotnet/corefx@50d6137
Fixes https://github.com/dotnet/corefx/issues/25962
@stephentoub @danmosemsft I have started on this by implementing the changes to the native code.
I've moved the signal handling code that is shared between console and process into its own file signal.cpp.
There is a separate
SystemNative_InitializeSignalHandlingthat spins up the signal handling thread and registers signal handlers.The
Processclass will also call this.signal.cpp calls back into console.cpp via
UninitializeConsoleandReinitializeConsole.HandleSignalForReinitializeandTransferSignalToHandlerLoopare merged into a singleSignalHandlersince both need to handle SIGCHLD.The TODO in
SignalHandlerLoopdescribes the to-be-implemented behavior in managed code.