Use SIGCHLD to trigger Process waitpid check by tmds · Pull Request #26291 · dotnet/corefx

tmds · 2018-01-12T14:13:51Z

Fixes https://github.com/dotnet/corefx/issues/25962

@stephentoub @danmosemsft I have started on this by implementing the changes to the native code.

I've moved the signal handling code that is shared between console and process into its own file signal.cpp.

There is a separate SystemNative_InitializeSignalHandling that spins up the signal handling thread and registers signal handlers.
The Process class will also call this.

signal.cpp calls back into console.cpp via UninitializeConsole and ReinitializeConsole.
HandleSignalForReinitialize and TransferSignalToHandlerLoop are merged into a single SignalHandler since both need to handle SIGCHLD.

The TODO in SignalHandlerLoop describes the to-be-implemented behavior in managed code.

jkotas · 2018-01-12T16:54:40Z

+{
+    internal static partial class Sys
+    {
+        private static bool s_signalHandlingInitialized = false;


This lock won't work once this file is used in multiple .dlls (System.Console + System.Diagnostics.Process). The lock should be in the native implementation, I think.

Right!

I am wondering: is it supposed to work that a single Unix process may host .NET Core several times?
I guess those instances would share the global variables in the native implementation?

is it supposed to work that a single Unix process may host .NET Core several times?

We do not support or test configurations like this. It should be possible in theory, but I doubt that it would "just work".

I guess those instances would share the global variables in the native implementation?

Right.
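A rough sketch of what moving the lock into native code could look like: because the statics live in the native library, every managed assembly that calls in shares the same flag. The names `initialize_signal_handling`, `setup_signal_handling`, and `setup_call_count` are hypothetical, and the setup body is a placeholder.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static int g_setup_calls = 0; /* counts how often the real setup ran */

/* Placeholder for creating the pipe, starting the thread, and
   registering the handlers. */
static bool setup_signal_handling(void)
{
    g_setup_calls++;
    return true;
}

/* Safe to call from several callers; the setup runs at most once
   after it has succeeded. Returns 1 on success, 0 on failure. */
int initialize_signal_handling(void)
{
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static bool initialized = false;

    int rc = pthread_mutex_lock(&lock);
    assert(rc == 0);
    (void)rc;

    if (!initialized)
        initialized = setup_signal_handling();
    bool result = initialized;

    /* Single exit path, so the mutex is released on every branch. */
    pthread_mutex_unlock(&lock);
    return result ? 1 : 0;
}

int setup_call_count(void)
{
    return g_setup_calls;
}
```

Keeping one unlock at the end of the function is a deliberate choice in this sketch: an early return on a special case is an easy way to leave the mutex held.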

danmoseley · 2018-01-12T19:08:04Z

+        do
+        {
+            int status;
+            while (CheckInterrupted(pid = waitpid(WAIT_ANY, &status, WNOHANG)));


Why is it necessary to do waitpid if it's not to clear out zombies?

The comment is confusing. I'll reword it.

If the original disposition is SIG_IGN, then the kernel won't generate zombies. (comment)
But since we overwrote the disposition, we do get zombies, and we need to waitpid them. (code)

tmds · 2018-01-15T17:08:45Z

I have added the managed implementation.
ProcessWaitState differentiates between child processes which are checked in TryReapChild (on sigchld) and non-child processes which are checked in CheckForNonChildExit.

It should be possible to eliminate the foreach loop in CheckChildren, if we assume no-one is reaping our child processes behind our backs.

danmoseley · 2018-01-15T18:41:58Z

-static void UninitializeConsole()
+void UninitializeConsole()
 {
+    // pal_signal.cpp calls this on SIGKILL/SIGTERM.


Seems like SIGQUIT/SIGINT?

danmoseley · 2018-01-15T18:44:17Z

+static struct sigaction g_origSigIntHandler, g_origSigQuitHandler; // saved signal handlers for ctrl handling
+static struct sigaction g_origSigContHandler, g_origSigChldHandler; // saved signal handlers for reinitialization
+static volatile CtrlCallback g_ctrlCallback = nullptr; // Callback invoked for SIGINT/SIGQUIT
+static volatile SigChldCallback g_sigChldCallback = nullptr; // Callback invoked for SIGINT/SIGQUIT


comment should be SIGCHLD?

danmoseley · 2018-01-15T18:50:38Z

+                // In general, we now want to remove our handler and reissue the signal to
+                // be picked up by the previously registered handler.  In the most common case,
+                // this will be the default handler, causing the process to be torn down.
+                // It could also be a custom handle registered by other code before us.


Typo: "handle" should be "handler".

danmoseley · 2018-01-15T18:55:09Z

+    }
+
+    // Finally, register our signal handlers
+    InstallSignalHandler(SIGINT , /* overwriteIgnored */ false);


Could you clarify why the overwrite ignored / not ignored behavior for these signals?

danmoseley · 2018-01-15T19:10:33Z

+                }
+                else if (waitResult == -1)
+                {
+                    Debug.Fail("Unexpected errno value from waitpid");


ECHILD is impossible?

If someone else has called waitpid on our child (which they shouldn't), we'd get ECHILD. The code then does what is expected: SetExited.
For non-child processes (which were previously detected as ECHILD), the detection is now in CheckForNonChildExit.

danmoseley

Someone else should review this as well as I'm not experienced in Linux signals.

jkotas · 2018-01-16T03:56:38Z

+extern "C" int32_t SystemNative_InitializeSignalHandling()
+{
+    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+    static bool initialized = false;


Mono team is in the process of rewriting the System.Native PAL from .cpp to .c so that it can be shared with Mono (#25032 (comment)). It would be nice for more significant rewrites and additions like this file to be in .c so that they do not need to be rewritten.

jkotas · 2018-01-16T04:03:16Z

+  <data name="IO_AlreadyExists_Name" xml:space="preserve">
+    <value>Cannot create '{0}' because a file or directory with the same name already exists.</value>
+  </data>
+  <data name="IO_BindHandleFailed" xml:space="preserve">


Is GetExceptionForIoErrno ever going to produce a good diagnosable error for InitializeSignalHandling? It seems that these errors are specifically designed for file I/O.

Indeed. I'll change this to throw Win32Exception.

tmds · 2018-01-16T07:57:31Z

@danmosemsft I have some additional changes I'd like to make:

* for child processes, implement WaitForExit using _exitedEvent instead of polling
* try to eliminate the foreach loop in CheckChildren

To not make this PR larger, perhaps I should make those changes in separate PRs?

danmoseley · 2018-01-16T18:32:16Z

@tmds if this stands alone then separate PRs seem fine.

stephentoub · 2018-01-16T18:33:10Z

+{
+    internal partial class Sys
+    {
+        internal delegate void SigChldCallback();


Why can't we use Action?

stephentoub · 2018-01-16T18:35:24Z

        CloseIfOpen(stdinFds[WRITE_END_OF_PIPE]);
        CloseIfOpen(stdoutFds[READ_END_OF_PIPE]);
        CloseIfOpen(stderrFds[READ_END_OF_PIPE]);
+        // Reap child


Nit: newline before this

stephentoub · 2018-01-16T18:43:24Z

+    }
+    else if (origHandler->sa_sigaction != NULL)
+    {
+        // TODO?: We are passing a NULL siginfo and context, do we need to try and do better?


Oh, I see. This seems problematic, no? That we're not passing the original information down, on the original thread, etc.? Why can't we do this delegation to the original as part of the actual signal handler rather than in the asynchronous handler, e.g. have the async handler delegate to the original and then queue the work to the separate thread?

This seems problematic, no? That we're not passing the original information down, on the original thread, etc.?

Since SIGCHLD can be merged, you can not count on getting this info for each child. So it's less problematic than it seems at first.

I think we must properly handle the SIG_DFL case, and should handle SIG_IGN.
If there is a custom handler in place, I think we always end up making some assumptions on its behavior. If we call it from the signal thread, we are assuming it won't reap our children. I think we can document that assumption and move this to the signal thread.

stephentoub · 2018-01-16T18:47:19Z

+            }
+            else
+            {
+                SystemNative_ResumeSigChld();


In what situation does this occur? Is it only when no process has ever been started with Process.Start yet in this process, or is it possible for g_sigChldCallback to go from non-null to null? I'm wondering about the waitpids this is doing and whether it'll interfere with a Process later getting the results of a process when code Waits on it and accesses its exit information.

This occurs when signal handling was set up for the Console and no Process.Starts have occurred.
I'll add a comment.

myrup · 2018-03-30T13:17:57Z

Is this included in 2.1 preview 2?

danmoseley · 2018-03-30T13:41:22Z

@myrup yes it will be. Please try it out when that is available

myrup · 2018-03-30T13:50:10Z

@danmosemsft Thanks! I'm on the edge of my seat as this issue is the only thing holding back a general switch from mono to .Net Core for us :)

myrup · 2018-04-12T15:09:38Z

Happy to report this leak has been fixed in 2.1 preview 2!

I know it's in the enhancement department, but since I felt a significant difference testing mono and .Net Core I decided to time it.

The following snip takes 15s vs 10s in .Net Core 2.1 preview 2 vs. mono 5.8.1 :

var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 10000; i++)
    Process.Start("echo", i.ToString());
Console.WriteLine("Took " + stopwatch.Elapsed);

33% slower launch times for processes seems significant to me.

(I'm on Darwin. You know best if this concerns all *nix )

danmoseley · 2018-04-12T15:49:35Z

@myrup great to hear it is fixed. Definitely interested to know when we are slower than mono. Could you please open a new issue? If you happen to have Linux timings that would be interesting also.

tmds · 2018-04-12T15:55:16Z

33% slower launch times for processes seems significant to me.

This is the worst-case scenario: 10000 processes are started and exiting concurrently.

I'm having a look in the mono implementation, some notable differences:

* the list of processes is tracked in native code
* a regular mutex is used; here we are using a readerwriterlock, which allows concurrent startup, but when there is a sigchld the writer (reaping pids) will take precedence
* I think the mono code has a race condition. When the process exits quickly, it may not yet be in the process list when waitpid is called. It is unlikely since the waitpid check is delayed by thread scheduling. We keep the lock longer to ensure this can never happen.
* mono does a waitpid for each process, which can cause load when there are a lot of long-running processes

tmds · 2018-04-12T17:57:05Z

From those differences I think the locking is causing the performance difference.

This is how corefx locks:

corefx/src/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs

Lines 313 to 342 in 1b643bd

    
    // Lock to avoid races with OnSigChild
    // By using a ReaderWriterLock we allow multiple processes to start concurrently.
    s_processStartLock.EnterReadLock();
    try
    {
        // Invoke the shim fork/execve routine.  It will create pipes for all requested
        // redirects, fork a child process, map the pipe ends onto the appropriate stdin/stdout/stderr
        // descriptors, and execve to execute the requested process.  The shim implementation
        // is used to fork/execve as executing managed code in a forked process is not safe (only
        // the calling thread will transfer, thread IDs aren't stable across the fork, etc.)
        Interop.Sys.ForkAndExecProcess(
            filename, argv, envp, cwd,
            startInfo.RedirectStandardInput, startInfo.RedirectStandardOutput, startInfo.RedirectStandardError,
            setCredentials, userId, groupId,
            out childPid,
            out stdinFd, out stdoutFd, out stderrFd);

        // Ensure we'll reap this process.
        // note: SetProcessId will set this if we don't set it first.
        _waitStateHolder = new ProcessWaitState.Holder(childPid, isNewChild: true);

        // Store the child's information into this Process object.
        Debug.Assert(childPid >= 0);
        SetProcessId(childPid);
        SetProcessHandle(new SafeProcessHandle(childPid));
    }
    finally
    {
        s_processStartLock.ExitReadLock();
    }

Notice: the lock includes the ForkAndExecProcess. This ensures the new pid is definitely known when we start checking the children that have exited.

The equivalent lock in mono does not include the fork call: https://github.com/mono/mono/blob/17a2fba78de10678cf1ad903d410b057340a2795/mono/metadata/w32process-unix.c#L2056-L2060

This means when sigchld comes, there is a very tiny chance the process is not known and doesn't get reaped.

It can get reaped when the next child exits, since all executing processes are checked each time: https://github.com/mono/mono/blob/17a2fba78de10678cf1ad903d410b057340a2795/mono/metadata/w32process-unix.c#L725-L746

corefx implementation avoids this O(N) lookup and performs a dictionary lookup instead.

myrup · 2018-04-12T18:59:48Z

@danmosemsft I haven't had an opportunity to try this on linux.

@tmds I've created a new issue #29074

myrup · 2018-04-15T14:17:42Z

@tmds this may interest you: #29123

* Fix Process.ExitCode on mac for killed processes: dotnet#26291 changed process reaping from using waitpid to waitid. This caused a regression on mac, since for processes that are killed, (on mac) waitid does not return the signal number that caused the process to terminate. We change back to waitpid for reaping children and determining the exit code. waitid is used to find terminated children. Fixes https://github.com/dotnet/corefx/issues/29370
* TestExitCodeKilledChild: remove runtime check
* TestExitCodeKilledChild: remove greater than assert

Child reaping was changed to be triggered by the SIGCHLD signal (dotnet#26291). As part of that change, code was added to handle the original handler being SIG_IGN. In that case, there was a missing mutex unlock. Fixes https://github.com/dotnet/corefx/issues/29841.

* Separate signal handling from console implementation
* Pass SIGCHLD to SignalHandlerLoop
* Move lock to native code
* Add sigchld callback
* Reap children on SIGCHLD
* Reap child on unsuccesful ForkAndExecProcess
* ResumeSigChld: improve comment and fix build
* Handle iterator becoming invalid while reaping children
* Fix comments
* pal_signal.cpp -> pal_signal.c
* Throw Win32Exception when InitializeSignalHandling fails
* Fix alpine build: missing C void arguments
* Remove ResumeSigChld
* Call InitializeSignalHandling from InitializeConsole and RegisterForSigChld
* Use ReaderWriterLock to allow multiple Processes to start concurrently
* throw Win32Exception when InitializeConsole fails
* PR feedback
* Implement SystemNative_WaitPid using waitid to check OS support
* Fix WaitPid waitid implementation
* Replace WaitPid with WaitIdExitedNoHang
* Optimize child reaping by asking OS what children terminated instead of iterating
* Implement WaitForExit for children using ManualResetEvent
* Remove SystemNative_W{ExitStatus,IfExited,IfSignaled,TermSig}
* Don't spin up wait loop for children
* Don't create ManualResetEvent when the child has already exited
* Add TestChildProcessCleanup test
* TestChildProcessCleanup: 'uname' is not at '/usr/bin' on Debian systems
* Fix multiple assignments of _waitStateHolder
* Add TestProcessWaitStateReferenceCount
* FailFast when waitid gives an unexpected return
* ProcessWaitHandle: let ProcessWaitState dispose the handle
* TestProcessWaitStateReferenceCount: fix test, need to WaitForExit otherwise Dispose cancels Exited event
* Add TestChildProcessCleanupAfterDispose
* TestProcessWaitStateReferenceCount: add retry+sleep

Commit migrated from dotnet/corefx@07fbff4

* Fix Process.ExitCode on mac for killed processes: dotnet/corefx#26291 changed process reaping from using waitpid to waitid. This caused a regression on mac, since for processes that are killed, (on mac) waitid does not return the signal number that caused the process to terminate. We change back to waitpid for reaping children and determining the exit code. waitid is used to find terminated children. Fixes https://github.com/dotnet/corefx/issues/29370
* TestExitCodeKilledChild: remove runtime check
* TestExitCodeKilledChild: remove greater than assert

Commit migrated from dotnet/corefx@a03f785

…refx#29843) Child reaping was changed to be triggered by the SIGCHLD signal (dotnet/corefx#26291). As part of that change, code was added to handle the original handler being SIG_IGN. In that case, there was a missing mutex unlock. Fixes https://github.com/dotnet/corefx/issues/29841.

Commit migrated from dotnet/corefx@50d6137

tmds added 2 commits January 12, 2018 11:42

Separate signal handling from console implementation

096b089

Pass SIGCHLD to SignalHandlerLoop

a72a37b

jkotas reviewed Jan 12, 2018

danmoseley reviewed Jan 12, 2018

tmds added 6 commits January 15, 2018 10:30

Move lock to native code

246feef

Add sigchld callback

e735e52

Reap children on SIGCHLD

69ba861

Reap child on unsuccesful ForkAndExecProcess

f529e41

ResumeSigChld: improve comment and fix build

c338572

Handle iterator becoming invalid while reaping children

45d4d01

danmoseley reviewed Jan 15, 2018

danmoseley approved these changes Jan 15, 2018

jkotas reviewed Jan 16, 2018

tmds added 4 commits January 16, 2018 06:01

Fix comments

7812480

pal_signal.cpp -> pal_signal.c

1c3ab72

Throw Win32Exception when InitializeSignalHandling fails

79c27d1

Fix alpine build: missing C void arguments

5ea7f7f

tmds changed the title ~~[WIP] Use SIGCHLD to trigger Process waitpid check~~ Use SIGCHLD to trigger Process waitpid check Jan 16, 2018

stephentoub reviewed Jan 16, 2018

karelz added this to the 2.1.0 milestone Mar 10, 2018

tmds mentioned this pull request Apr 11, 2018

Process: avoid performing operations on a different process with recycled pid #28404

Merged

tmds mentioned this pull request Apr 30, 2018

Fix Process.ExitCode on mac for killed processes #29407

Merged

joperezr mentioned this pull request May 1, 2018

[release/2.1] Fix Process.ExitCode on mac for killed processes (#29407) #29445

Merged

tmds mentioned this pull request May 22, 2018

pal_signal: add missing mutex unlock when SIGCHLD==SIG_IGN #29843

Merged

myrup mentioned this pull request Jan 31, 2020

Mass spawning of processes slower than with mono dotnet/runtime#25879

Closed

wfurt mentioned this pull request Jan 31, 2020

CoreFX native PAL build break on FreeBSD due to "error: mutex 'lock' is not held on every path" dotnet/runtime#26241

Closed

tmds mentioned this pull request Mar 4, 2026

Implement SafeProcessHandle APIs for Linux and other Unixes dotnet/runtime#124979

Closed

19 tasks
