This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Use SIGCHLD to trigger Process waitpid check #26291

Merged
stephentoub merged 36 commits into dotnet:master from tmds:sigchld
Mar 2, 2018

@tmds
Member

@tmds tmds commented Jan 12, 2018

Fixes https://github.com/dotnet/corefx/issues/25962

@stephentoub @danmosemsft I have started on this by implementing the changes to the native code.

I've moved the signal handling code that is shared between console and process into its own file signal.cpp.

There is a separate SystemNative_InitializeSignalHandling that spins up the signal handling thread and registers signal handlers.
The Process class will also call this.

signal.cpp calls back into console.cpp via UninitializeConsole and ReinitializeConsole.
HandleSignalForReinitialize and TransferSignalToHandlerLoop are merged into a single SignalHandler since both need to handle SIGCHLD.

The TODO in SignalHandlerLoop describes the to-be-implemented behavior in managed code.
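
The handler/loop split described above can be sketched as follows. This is a minimal sketch, not the PR's code: `DemoSignalHandler`, `DemoReadOneSignal`, and `DemoRoundTrip` are hypothetical names, and forwarding the signal number over a pipe is an assumption about how the handler hands work to the loop. The loop body is called directly here to keep the sketch short; in the PR the loop runs on the signal handling thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static int g_signalPipe[2] = { -1, -1 }; // [0] = read end, [1] = write end

// Async-signal-safe handler: it only forwards the signal number.
static void DemoSignalHandler(int sig)
{
    int savedErrno = errno;
    uint8_t code = (uint8_t)sig;
    while (write(g_signalPipe[1], &code, 1) < 0 && errno == EINTR) { }
    errno = savedErrno;
}

// Body of the handler loop: blocks until one signal number arrives.
// Outside signal context it is safe to take locks or invoke callbacks.
static int DemoReadOneSignal(void)
{
    uint8_t code = 0;
    ssize_t n;
    do { n = read(g_signalPipe[0], &code, 1); } while (n < 0 && errno == EINTR);
    return n == 1 ? (int)code : -1;
}

// Installs the handler, raises `sig`, and returns the signal number that
// came out of the pipe (-1 on failure).
int DemoRoundTrip(int sig)
{
    struct sigaction sa, old;
    int observed;

    assert(sig > 0 && sig < 256); // the number must fit in one byte
    if (pipe(g_signalPipe) != 0) return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = DemoSignalHandler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(sig, &sa, &old) != 0) return -1;

    raise(sig);                     // the handler runs and writes one byte
    observed = DemoReadOneSignal(); // consumed outside signal context

    sigaction(sig, &old, NULL);     // restore the previous disposition
    close(g_signalPipe[0]);
    close(g_signalPipe[1]);
    return observed;
}
```

Calling `DemoRoundTrip(SIGCHLD)` returns `SIGCHLD` once the byte has travelled through the pipe.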

{
internal static partial class Sys
{
private static bool s_signalHandlingInitialized = false;

Member

This lock won't work once this file is used in multiple .dlls (System.Console + System.Diagnostics.Process). The lock should be in the native implementation, I think.

Member Author

Right!

I am wondering: is it supposed to work that a single Unix process may host .NET Core several times?
I guess those instances would share the global variables in the native implementation?

Member

> is it supposed to work that a single Unix process may host .NET Core several times?

We do not support or test configurations like this. It should be possible in theory, but I doubt that it would "just work".

> I guess those instances would share the global variables in the native implementation?

Right.
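
A minimal sketch of what moving the lock into the native implementation could look like. The names `Demo_InitializeSignalHandling`, `DemoInitializeOnce`, and `DemoInitCount` are hypothetical, and the body of `DemoInitializeOnce` is a placeholder for the real setup work. Because the flag and the mutex live in one native library, every managed assembly that calls in shares them.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

static int g_demoInitCount = 0; // how many times the setup work ran

// Hypothetical stand-in for starting the signal thread and installing
// handlers. Returns true on success.
static bool DemoInitializeOnce(void)
{
    assert(g_demoInitCount == 0); // the lock below must prevent a second run
    g_demoInitCount++;
    return true;
}

// Safe to call from several assemblies and threads: one process-wide lock
// and one flag guard the setup.
int32_t Demo_InitializeSignalHandling(void)
{
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static bool initialized = false;
    bool result;

    pthread_mutex_lock(&lock);
    if (!initialized)
    {
        initialized = DemoInitializeOnce();
    }
    result = initialized;
    pthread_mutex_unlock(&lock);

    return result ? 1 : 0;
}

int DemoInitCount(void)
{
    return g_demoInitCount;
}
```

Calling `Demo_InitializeSignalHandling()` twice runs the setup once; the second call only observes the flag.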

do
{
int status;
while (CheckInterrupted(pid = waitpid(WAIT_ANY, &status, WNOHANG)));

Member

Why is it necessary to do waitpid if it's not to clear out zombies?

Member Author

The comment is confusing. I'll reword it.

If the original disposition is SIG_IGN, then the kernel won't generate zombies. (comment)
But since we overwrote the disposition, we do get zombies, and we need to waitpid them. (code)
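
The effect described above can be sketched in a few lines. This is an illustration under assumptions, not the PR's code: `DemoOverwriteSigChld` and `DemoReapTerminatedChildren` are hypothetical names. Once a handler replaces the original disposition, exited children linger as zombies, so the code that took over the signal also reaps them with a non-blocking `waitpid` loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void DemoNoopHandler(int sig) { (void)sig; }

// Replaces the SIGCHLD disposition with a handler and returns the previous
// one in `orig`. From now on exited children stay around as zombies until
// someone waits for them, even if the previous disposition was SIG_IGN.
int DemoOverwriteSigChld(struct sigaction* orig)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = DemoNoopHandler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGCHLD, &sa, orig);
}

// Reaps every child that has already terminated and returns how many were
// reaped. WNOHANG keeps the call from blocking on children still running.
int DemoReapTerminatedChildren(void)
{
    int reaped = 0;
    for (;;)
    {
        int status;
        pid_t pid;
        do { pid = waitpid(-1, &status, WNOHANG); } while (pid < 0 && errno == EINTR);
        if (pid <= 0) break; // 0: nothing has exited yet; -1: e.g. ECHILD
        reaped++;
    }
    return reaped;
}
```

A caller that finds `orig.sa_handler == SIG_IGN` after `DemoOverwriteSigChld` would run `DemoReapTerminatedChildren` whenever SIGCHLD arrives, to preserve the "no zombies" behavior the original disposition provided.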

@tmds
Member Author

tmds commented Jan 15, 2018

I have added the managed implementation.
ProcessWaitState differentiates between child processes which are checked in TryReapChild (on sigchld) and non-child processes which are checked in CheckForNonChildExit.

It should be possible to eliminate the foreach loop in CheckChildren, if we assume no one is reaping our child processes behind our backs.

static void UninitializeConsole()
void UninitializeConsole()
{
// pal_signal.cpp calls this on SIGKILL/SIGTERM.

Member

Seems like SIGQUIT/SIGINT?

static struct sigaction g_origSigIntHandler, g_origSigQuitHandler; // saved signal handlers for ctrl handling
static struct sigaction g_origSigContHandler, g_origSigChldHandler; // saved signal handlers for reinitialization
static volatile CtrlCallback g_ctrlCallback = nullptr; // Callback invoked for SIGINT/SIGQUIT
static volatile SigChldCallback g_sigChldCallback = nullptr; // Callback invoked for SIGINT/SIGQUIT

Member

comment should be SIGCHLD?

// In general, we now want to remove our handler and reissue the signal to
// be picked up by the previously registered handler. In the most common case,
// this will be the default handler, causing the process to be torn down.
// It could also be a custom handle registered by other code before us.

Member

Typo: "handle" should be "handler".

}

// Finally, register our signal handlers
InstallSignalHandler(SIGINT , /* overwriteIgnored */ false);

Member

@danmoseley danmoseley Jan 15, 2018

Could you clarify why the overwrite ignored / not ignored behavior for these signals?

}
else if (waitResult == -1)
{
Debug.Fail("Unexpected errno value from waitpid");

Member

ECHILD is impossible?

Member Author

If someone else has called waitpid on our child (which they shouldn't), we'd get ECHILD. The code then does what is expected: SetExited.
For non-child processes (which were previously detected as ECHILD), the detection is now in CheckForNonChildExit.

Member

@danmoseley danmoseley left a comment

Someone else should review this as well as I'm not experienced in Linux signals.

extern "C" int32_t SystemNative_InitializeSignalHandling()
{
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static bool initialized = false;

Member

Mono team is in the process of rewriting the System.Native PAL from .cpp to .c so that it can be shared with Mono (#25032 (comment)). It would be nice for more significant rewrites and additions like this file to be in .c so that they do not need to be rewritten.

<data name="IO_AlreadyExists_Name" xml:space="preserve">
<value>Cannot create '{0}' because a file or directory with the same name already exists.</value>
</data>
<data name="IO_BindHandleFailed" xml:space="preserve">

Member

Is GetExceptionForIoErrno ever going to produce a good diagnosable error for InitializeSignalHandling? It seems that these errors are specifically designed for file I/O.

Member Author

Indeed. I'll change this to throw Win32Exception.

@tmds
Member Author

tmds commented Jan 16, 2018

@danmosemsft I have some additional changes I'd like to make:

  • for child processes, implement WaitForExit using _exitedEvent instead of polling
  • try to eliminate the foreach loop in CheckChildren

To not make this PR larger, perhaps I should make those changes in separate PRs?

@tmds tmds changed the title from "[WIP] Use SIGCHLD to trigger Process waitpid check" to "Use SIGCHLD to trigger Process waitpid check" Jan 16, 2018
@danmoseley
Member

@tmds if this stands alone then separate PRs seem fine.

{
internal partial class Sys
{
internal delegate void SigChldCallback();

Member

Why can't we use Action?

CloseIfOpen(stdinFds[WRITE_END_OF_PIPE]);
CloseIfOpen(stdoutFds[READ_END_OF_PIPE]);
CloseIfOpen(stderrFds[READ_END_OF_PIPE]);
// Reap child

Member

Nit: newline before this

}
else if (origHandler->sa_sigaction != NULL)
{
// TODO?: We are passing a NULL siginfo and context, do we need to try and do better?

Member

TODO?

Member

Oh, I see. This seems problematic, no? That we're not passing the original information down, on the original thread, etc.? Why can't we do this delegation to the original as part of the actual signal handler rather than in the asynchronous handler, e.g. have the async handler delegate to the original and then queue the work to the separate thread?

Member Author

> This seems problematic, no? That we're not passing the original information down, on the original thread, etc.?

Since SIGCHLD signals can be merged, you cannot count on getting this info for each child. So it's less problematic than it seems at first.

I think we must properly handle the SIG_DFL case, and should handle SIG_IGN.
If there is a custom handler in place, I think we always end up making some assumptions on its behavior. If we call it from the signal thread, we are assuming it won't reap our children. I think we can document that assumption and move this to the signal thread.
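
The three cases named above (SIG_DFL, SIG_IGN, custom handler) can be sketched as a dispatch on the saved `struct sigaction`. This is an illustration only: `DemoForwardToOriginal`, `DemoOrigKind`, and `DemoCountingHandler` are hypothetical names, and what the caller does for each kind is left to it. As in the snippet under review, a three-argument handler is invoked with a NULL siginfo and context.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

typedef enum
{
    DEMO_ORIG_DEFAULT, // SIG_DFL: nothing to forward
    DEMO_ORIG_IGNORE,  // SIG_IGN: the caller should reap on the owner's behalf
    DEMO_ORIG_CUSTOM   // custom handler: it was invoked
} DemoOrigKind;

static int g_forwardedCount = 0;

// Example custom handler that only counts invocations.
void DemoCountingHandler(int sig)
{
    (void)sig;
    g_forwardedCount++;
}

int DemoForwardedCount(void)
{
    return g_forwardedCount;
}

// Classifies the original disposition and calls a custom handler if present.
// Assumption stated in the discussion: a custom handler will not reap our
// children when it is called from the signal thread.
DemoOrigKind DemoForwardToOriginal(int sig, const struct sigaction* orig)
{
    assert(orig != NULL);
    if ((orig->sa_flags & SA_SIGINFO) != 0 && orig->sa_sigaction != NULL)
    {
        orig->sa_sigaction(sig, NULL, NULL); // no siginfo/context available
        return DEMO_ORIG_CUSTOM;
    }
    if (orig->sa_handler == SIG_DFL) return DEMO_ORIG_DEFAULT;
    if (orig->sa_handler == SIG_IGN) return DEMO_ORIG_IGNORE;
    orig->sa_handler(sig);
    return DEMO_ORIG_CUSTOM;
}
```

The returned kind tells the caller whether it still has work to do itself, for example reaping children when the original disposition was SIG_IGN.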

}
else
{
SystemNative_ResumeSigChld();

Member

In what situation does this occur? Is it only when no process has ever been started with Process.Start yet in this process, or is it possible for g_sigChldCallback to go from non-null to null? I'm wondering about the waitpids this is doing and whether it'll interfere with a Process later getting the results of a process when code Waits on it and accesses its exit information.

Member Author

This occurs when signal handling was set up for the Console and no Process.Starts have occurred.
I'll add a comment.

@karelz karelz added this to the 2.1.0 milestone Mar 10, 2018
@myrup

myrup commented Mar 30, 2018

Is this included in 2.1 preview 2?

@danmoseley
Member

@myrup yes it will be. Please try it out when that is available

@myrup

myrup commented Mar 30, 2018

@danmosemsft Thanks! I'm on the edge of my seat as this issue is the only thing holding back a general switch from mono to .Net Core for us :)

@myrup

myrup commented Apr 12, 2018

Happy to report this leak has been fixed in 2.1 preview 2!

I know it's in the enhancement department, but since I felt a significant difference testing mono and .Net Core I decided to time it.

The following snippet takes 15s in .Net Core 2.1 preview 2 vs. 10s in mono 5.8.1:

var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 10000; i++)
        Process.Start("echo", i.ToString());
Console.WriteLine("Took " + stopwatch.Elapsed);

33% slower launch times for processes seems significant to me.

(I'm on Darwin. You know best if this concerns all *nix )

@danmoseley
Member

@myrup great to hear it is fixed. Definitely interested to know when we are slower than mono. Could you please open a new issue? If you happen to have Linux timings, that would be interesting also.

@tmds
Member Author

tmds commented Apr 12, 2018

> 33% slower launch times for processes seems significant to me.

This is the worst-case scenario: 10000 processes are started and exiting concurrently.

I'm having a look at the mono implementation. Some notable differences:

  • the list of processes is tracked in native code
  • a regular mutex is used; here we use a reader-writer lock, which allows concurrent startup, but when there is a SIGCHLD the writer (reaping pids) takes precedence
  • I think the mono code has a race condition. When the process exits quickly, it may not yet be in the process list when waitpid is called. It is unlikely since the waitpid check is delayed by thread scheduling. We keep the lock longer to ensure this can never happen.
  • mono does a waitpid for each process which can cause load when there are a lot of long-running processes

@tmds
Member Author

tmds commented Apr 12, 2018

From those differences I think the locking is causing the performance difference.

This is how corefx locks:

// Lock to avoid races with OnSigChild
// By using a ReaderWriterLock we allow multiple processes to start concurrently.
s_processStartLock.EnterReadLock();
try
{
    // Invoke the shim fork/execve routine. It will create pipes for all requested
    // redirects, fork a child process, map the pipe ends onto the appropriate stdin/stdout/stderr
    // descriptors, and execve to execute the requested process. The shim implementation
    // is used to fork/execve as executing managed code in a forked process is not safe (only
    // the calling thread will transfer, thread IDs aren't stable across the fork, etc.)
    Interop.Sys.ForkAndExecProcess(
        filename, argv, envp, cwd,
        startInfo.RedirectStandardInput, startInfo.RedirectStandardOutput, startInfo.RedirectStandardError,
        setCredentials, userId, groupId,
        out childPid,
        out stdinFd, out stdoutFd, out stderrFd);

    // Ensure we'll reap this process.
    // note: SetProcessId will set this if we don't set it first.
    _waitStateHolder = new ProcessWaitState.Holder(childPid, isNewChild: true);

    // Store the child's information into this Process object.
    Debug.Assert(childPid >= 0);
    SetProcessId(childPid);
    SetProcessHandle(new SafeProcessHandle(childPid));
}
finally
{
    s_processStartLock.ExitReadLock();
}

Notice: the lock includes the ForkAndExecProcess call. This ensures the new pid is definitely known when we start checking the children that have exited.

The equivalent lock in mono does not include the fork call: https://github.com/mono/mono/blob/17a2fba78de10678cf1ad903d410b057340a2795/mono/metadata/w32process-unix.c#L2056-L2060

This means when sigchld comes, there is a very tiny chance the process is not known and doesn't get reaped.

It can get reaped when the next child exits, since all executing processes are checked each time: https://github.com/mono/mono/blob/17a2fba78de10678cf1ad903d410b057340a2795/mono/metadata/w32process-unix.c#L725-L746

corefx implementation avoids this O(N) lookup and performs a dictionary lookup instead.
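
The locking scheme discussed above can be sketched in native terms with a POSIX reader-writer lock. This is a simplified illustration, not the corefx code: all `Demo*` names are hypothetical, `_exit` stands in for `execve`, and the reaper walks a small linked list for brevity, whereas the corefx implementation is described above as using a dictionary lookup. The point shown is that the read lock spans both the fork and the registration of the pid, so a reaper holding the write lock never meets an exited child that is not yet registered.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct DemoChild
{
    pid_t pid;
    struct DemoChild* next;
} DemoChild;

static pthread_rwlock_t g_startLock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t g_listLock = PTHREAD_MUTEX_INITIALIZER;
static DemoChild* g_children = NULL;

// Starts a child that exits at once with `exitCode`. Several starters may
// hold the read lock at the same time.
pid_t DemoStartChild(int exitCode)
{
    pid_t pid;
    DemoChild* node = malloc(sizeof(*node));
    if (node == NULL) return -1;
    assert(exitCode >= 0 && exitCode <= 255);

    pthread_rwlock_rdlock(&g_startLock);
    pid = fork();
    if (pid == 0) _exit(exitCode); // child: stands in for execve
    if (pid > 0)
    {
        node->pid = pid;
        pthread_mutex_lock(&g_listLock); // starters run concurrently
        node->next = g_children;
        g_children = node;
        pthread_mutex_unlock(&g_listLock);
    }
    else
    {
        free(node); // fork failed
    }
    pthread_rwlock_unlock(&g_startLock);
    return pid;
}

// Called after SIGCHLD was observed. The write lock waits for in-flight
// starts to finish, so every forked pid is already in the list.
int DemoReapRegisteredChildren(void)
{
    int reaped = 0;
    DemoChild** link;

    pthread_rwlock_wrlock(&g_startLock);
    link = &g_children;
    while (*link != NULL)
    {
        DemoChild* node = *link;
        int status;
        pid_t r;
        do { r = waitpid(node->pid, &status, WNOHANG); } while (r < 0 && errno == EINTR);
        if (r == 0) { link = &node->next; continue; } // still running
        if (r > 0) reaped++;
        *link = node->next; // exited, or no longer waitable
        free(node);
    }
    pthread_rwlock_unlock(&g_startLock);
    return reaped;
}
```

Holding the read lock across the fork trades some start-up throughput for the guarantee that no exited child is missed, which is the trade-off weighed in this comment.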

@myrup

myrup commented Apr 12, 2018

@danmosemsft I haven't had an opportunity to try this on linux.

@tmds I've created a new issue #29074

@myrup

myrup commented Apr 15, 2018

@tmds this may interest you: #29123

tmds added a commit to tmds/corefx that referenced this pull request Apr 30, 2018
dotnet#26291 changed process reaping
from using waitpid to waitid. This caused a regression on mac, since
for processes that are killed, (on mac) waitid does not return the
signal number that caused the process to terminate.

We change back to waitpid for reaping children and determining the
exit code. waitid is used to find terminated children.

Fixes https://github.com/dotnet/corefx/issues/29370
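
The split described in this commit message can be sketched as two small functions. This is a hedged illustration rather than the code from the commit: `DemoFindTerminatedChild` and `DemoReapAndGetExit` are hypothetical names. `waitid` with `WNOWAIT` only discovers which child has terminated and leaves it waitable; `waitpid` then reaps that pid and yields the status from which the exit code or terminating signal is decoded.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns the pid of some terminated child without reaping it, 0 when no
// child has terminated yet, or -1 on error (for example ECHILD).
pid_t DemoFindTerminatedChild(void)
{
    siginfo_t info;
    int rc;
    memset(&info, 0, sizeof(info)); // si_pid stays 0 when nothing has exited
    do { rc = waitid(P_ALL, 0, &info, WEXITED | WNOHANG | WNOWAIT); } while (rc != 0 && errno == EINTR);
    return rc == 0 ? info.si_pid : -1;
}

// Reaps `pid` with waitpid and decodes the status. On success returns 0 and
// sets *exitCode (-1 if the child did not exit normally) and *termSignal
// (0 if the child was not killed by a signal).
int DemoReapAndGetExit(pid_t pid, int* exitCode, int* termSignal)
{
    int status;
    pid_t r;
    assert(exitCode != NULL && termSignal != NULL);
    do { r = waitpid(pid, &status, 0); } while (r < 0 && errno == EINTR);
    if (r != pid) return -1;
    *exitCode = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    *termSignal = WIFSIGNALED(status) ? WTERMSIG(status) : 0;
    return 0;
}
```

For a child killed with SIGKILL, `DemoReapAndGetExit` reports the signal through `WTERMSIG`, which is the information the commit message says `waitid` does not return on mac.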
stephentoub pushed a commit that referenced this pull request May 1, 2018
* Fix Process.ExitCode on mac for killed processes

#26291 changed process reaping
from using waitpid to waitid. This caused a regression on mac, since
for processes that are killed, (on mac) waitid does not return the
signal number that caused the process to terminate.

We change back to waitpid for reaping children and determining the
exit code. waitid is used to find terminated children.

Fixes https://github.com/dotnet/corefx/issues/29370

* TestExitCodeKilledChild: remove runtime check

* TestExitCodeKilledChild: remove greater than assert
tmds added a commit to tmds/corefx that referenced this pull request May 22, 2018
Child reaping was changed to be triggered by the SIGCHLD signal (dotnet#26291).
As part of that change, code was added to handle the original handler being SIG_IGN.
In that case, there was a missing mutex unlock.

Fixes https://github.com/dotnet/corefx/issues/29841.
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Separate signal handling from console implementation

* Pass SIGCHLD to SignalHandlerLoop

* Move lock to native code

* Add sigchld callback

* Reap children on SIGCHLD

* Reap child on unsuccessful ForkAndExecProcess

* ResumeSigChld: improve comment and fix build

* Handle iterator becoming invalid while reaping children

* Fix comments

* pal_signal.cpp -> pal_signal.c

* Throw Win32Exception when InitializeSignalHandling fails

* Fix alpine build: missing C void arguments

* Remove ResumeSigChld

* Call InitializeSignalHandling from InitializeConsole and RegisterForSigChld

* Use ReaderWriterLock to allow multiple Processes to start concurrently

* throw Win32Exception when InitializeConsole fails

* PR feedback

* Implement SystemNative_WaitPid using waitid to check OS support

* Fix WaitPid waitid implementation

* Replace WaitPid with WaitIdExitedNoHang

* Optimize child reaping by asking OS what children terminated instead of iterating

* Implement WaitForExit for children using ManualResetEvent

* Remove SystemNative_W{ExitStatus,IfExited,IfSignaled,TermSig}

* Don't spin up wait loop for children

* Don't create ManualResetEvent when the child has already exited

* Add TestChildProcessCleanup test

* TestChildProcessCleanup: 'uname' is not at '/usr/bin' on Debian systems

* Fix multiple assignments of _waitStateHolder

* Add TestProcessWaitStateReferenceCount

* FailFast when waitid gives an unexpected return

* ProcessWaitHandle: let ProcessWaitState dispose the handle

* TestProcessWaitStateReferenceCount: fix test, need to WaitForExit otherwise Dispose cancels Exited event

* Add TestChildProcessCleanupAfterDispose

* TestProcessWaitStateReferenceCount: add retry+sleep


Commit migrated from dotnet/corefx@07fbff4
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Fix Process.ExitCode on mac for killed processes

dotnet/corefx#26291 changed process reaping
from using waitpid to waitid. This caused a regression on mac, since
for processes that are killed, (on mac) waitid does not return the
signal number that caused the process to terminate.

We change back to waitpid for reaping children and determining the
exit code. waitid is used to find terminated children.

Fixes https://github.com/dotnet/corefx/issues/29370

* TestExitCodeKilledChild: remove runtime check

* TestExitCodeKilledChild: remove greater than assert


Commit migrated from dotnet/corefx@a03f785
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…refx#29843)

Child reaping was changed to be triggered by the SIGCHLD signal (dotnet/corefx#26291).
As part of that change, code was added to handle the original handler being SIG_IGN.
In that case, there was a missing mutex unlock.

Fixes https://github.com/dotnet/corefx/issues/29841.

Commit migrated from dotnet/corefx@50d6137