Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Handle recycled child PIDs#27763

Merged
stephentoub merged 6 commits intodotnet:masterfrom
tmds:recycle_childpid
Apr 5, 2018
Merged

Handle recycled child PIDs#27763
stephentoub merged 6 commits intodotnet:masterfrom
tmds:recycle_childpid

Conversation

@tmds
Copy link
Copy Markdown
Member

@tmds tmds commented Mar 6, 2018

When the user keeping references to old processes, a new child would re-use some state of an old process with the same PID.
This causes the following code to throw when PID is recycled (https://github.com/dotnet/corefx/issues/27249#issuecomment-367288949).

            List<Process> processes = new List<Process>();
            for (int i = 0; i < 1_000_000; i++)
            {
                if (i % 100 == 0)
                {
                    System.Console.Write(".");
                }
                try
                {
                    var process = new Process();
                    processes.Add(process);
                    process.StartInfo.FileName = "/usr/bin/sleep";
                    process.StartInfo.Arguments = "0";
                    process.EnableRaisingEvents = true;
                    process.Start();

                    process.WaitForExit();
                }
                catch
                {
                    System.Console.WriteLine($"Exception when creating process {i}");
                    throw;
                }
            }

We detect when the PID is recycled and use a new ProcessWaitState for the new child.
Also, we ensure the ProcessWaitHandle is using the same ProcessWaitState as the Process it was created for.

The above test code no longer throws now when the PID is recycled. I have not added a test because it takes a long time to run.

CC @stephentoub @danmosemsft

@tmds
Copy link
Copy Markdown
Member Author

tmds commented Mar 6, 2018

Interesting failure in CI:

2018-03-06 10:27:54,455: INFO: proc(54): run_and_log_output: Output: Unhandled Exception: System.ObjectDisposedException: The semaphore has been disposed.
2018-03-06 10:27:54,455: INFO: proc(54): run_and_log_output: Output:    at System.Threading.SemaphoreSlim.CheckDispose() in /root/coreclr/src/mscorlib/src/System/Threading/SemaphoreSlim.cs:line 956
2018-03-06 10:27:54,455: INFO: proc(54): run_and_log_output: Output:    at System.Threading.SemaphoreSlim.Release(Int32 releaseCount) in /root/coreclr/src/mscorlib/src/System/Threading/SemaphoreSlim.cs:line 805
2018-03-06 10:27:54,456: INFO: proc(54): run_and_log_output: Output:    at System.Diagnostics.Tests.ProcessTests.<>c__DisplayClass162_0.<TestProcessWaitStateReferenceCount>b__0(Object o, EventArgs e) in /mnt/j/workspace/dotnet_corefx/master/linux-TGroup_netcoreapp+CGroup_Release+AGroup_x64+TestOuter_false_prtest/src/System.Diagnostics.Process/tests/ProcessTests.Unix.cs:line 482
2018-03-06 10:27:54,456: INFO: proc(54): run_and_log_output: Output:    at System.Diagnostics.Process.RaiseOnExited() in /mnt/j/workspace/dotnet_corefx/master/linux-TGroup_netcoreapp+CGroup_Release+AGroup_x64+TestOuter_false_prtest/src/System.Diagnostics.Process/src/System/Diagnostics/Process.cs:line 1095
2018-03-06 10:27:54,457: INFO: proc(54): run_and_log_output: Output:    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /root/coreclr/src/mscorlib/shared/System/Threading/ExecutionContext.cs:line 166
2018-03-06 10:27:54,457: INFO: proc(54): run_and_log_output: Output: --- End of stack trace from previous location where exception was thrown ---
2018-03-06 10:27:54,457: INFO: proc(54): run_and_log_output: Output:    at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() in /root/coreclr/src/mscorlib/src/System/Runtime/ExceptionServices/ExceptionDispatchInfo.cs:line 132

This is caused by the Exited event being called twice.

I think this is what happens:
a. WaitForExit() is called, this emits the Exited event and schedules a completion on the ThreadPool
b. Dispose() clears the _raisedOnExited flag
c. The completion runs on the ThreadPool causing the Exited event to be emitted again

I had a look at existing tests and none would fail if Exited was emitted multiple times. The test that fails here was recently added. Probably this can behave similar on Windows.

@wfurt
Copy link
Copy Markdown
Member

wfurt commented Mar 6, 2018

waitpid() needs to be called before OS can recycle the pid, right? So process.Kill() should never kill wrong guy...

@tmds
Copy link
Copy Markdown
Member Author

tmds commented Mar 6, 2018

waitpid() needs to be called before OS can recycle the pid, right? So process.Kill() should never kill wrong guy...

Process.Kill kills whatever it has the pid for, so this may be the wrong guy.
This PR doesn't deal with that. It handles the ProcessWaitState which is looked up based on the pid. Since a new child needs a new ProcessWaitState, we evict the old one.

@tmds
Copy link
Copy Markdown
Member Author

tmds commented Mar 6, 2018

CI failed on unrelated test on OpenSuse:

2018-03-06 19:22:56,892: INFO: proc(54): run_and_log_output: Output: Discovering: System.Xml.XmlSchema.XmlSchemaValidatorApi.Tests
2018-03-06 19:22:57,116: INFO: proc(54): run_and_log_output: Output: Discovered:  System.Xml.XmlSchema.XmlSchemaValidatorApi.Tests
2018-03-06 19:22:57,234: INFO: proc(54): run_and_log_output: Output: Starting:    System.Xml.XmlSchema.XmlSchemaValidatorApi.Tests
2018-03-06 19:22:58,697: INFO: proc(54): run_and_log_output: Output: /home/helixbot/dotnetbuild/work/2401b50a-490e-479d-9066-7360dec50307/Work/de50b4c9-d6d3-46fd-9645-11900f0ff693/Unzip/RunTests.sh: line 88: 20928 Aborted                 (core dumped) $RUNTIME_PATH/dotnet xunit.console.netcore.exe System.Xml.XmlSchema.XmlSchemaValidatorApi.Tests.dll -xml testResults.xml -notrait Benchmark=true -notrait category=nonnetcoreapptests -notrait category=nonlinuxtests -notrait category=OuterLoop -notrait category=failing
2018-03-06 19:22:58,758: INFO: proc(54): run_and_log_output: Output: Dumpling has not been installed yet. Please run "DumplingHelper.py install_dumpling" before collect_dumps.
2018-03-06 19:22:58,764: INFO: proc(54): run_and_log_output: Output: ~/dotnetbuild/work/2401b50a-490e-479d-9066-7360dec50307/Work/de50b4c9-d6d3-46fd-9645-11900f0ff693/Unzip
2018-03-06 19:22:58,766: INFO: proc(54): run_and_log_output: Output: ----- end 19:22:58 ----- exit code 134 ----------------------------------------------------------
2018-03-06 19:22:58,769: INFO: proc(58): run_and_log_output: Exit Code: 134

@wfurt
Copy link
Copy Markdown
Member

wfurt commented Mar 6, 2018

Thanks for clarification @tmds.

@wfurt wfurt requested review from michellemcdaniel and removed request for michellemcdaniel March 6, 2018 20:19
@danmoseley danmoseley requested a review from stephentoub March 7, 2018 04:36
@tmds
Copy link
Copy Markdown
Member Author

tmds commented Mar 7, 2018

@wfurt once this PR is merged, I'll look into making Process.Kill throw instead of hitting a recycled pid.

@danmoseley
Copy link
Copy Markdown
Member

@danmoseley
Copy link
Copy Markdown
Member

if (_waitStateHolder == null)
{
_waitStateHolder = new ProcessWaitState.Holder(_processId);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're adding non-trivial expense here to cases that never previously had it. Is it really worthwhile?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it simpler to initialize this once than to figure out the lazy initialization doesn't cause issues. I'll revert this.

ProcessWaitState waitState = _waitStateHolder?._state;
if (waitState == null)
{
// This will throw
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what situation?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It always throws. This will be reverted too.

{
// Check the exited event that we get from the threadpool
// matches the event we are waiting for.
if (waitHandleContext != _waitHandle)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what situation will they not match?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the test that is failing.
The test does a WaitForExit. This emits the exited event and schedules the CompletionCallback on the ThreadPool.
Then the user calls Close causing _raisedOnExited to be set to false. When the CompletionCallback runs, it emits the exited event a second time.

In case of the test, the _waitHandle is null (set in Close). In general, the Process instance may be re-used at this point causing _waitHandle to be a different non-null value.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is shared between Windows and Unix. Will this ever not match on Windows?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Windows, the same thing can happen. So the fix is also needed.

@tmds
Copy link
Copy Markdown
Member Author

tmds commented Mar 12, 2018

Unrelated CI failures on OSX.1013 and Win.10 (hang in System.Reflection.Metadata.Tests?):

2018-03-12 15:30:43,454: INFO: proc(54): run_and_log_output: Output: Copyright (C) 2014 Outercurve Foundation.
2018-03-12 15:30:43,454: INFO: proc(54): run_and_log_output: Output: 
2018-03-12 15:30:43,532: INFO: proc(54): run_and_log_output: Output: Discovering: System.Reflection.Metadata.Tests
2018-03-12 15:30:43,752: INFO: proc(54): run_and_log_output: Output: Discovered:  System.Reflection.Metadata.Tests
2018-03-12 15:30:43,907: INFO: proc(54): run_and_log_output: Output: Starting:    System.Reflection.Metadata.Tests

@tmds
Copy link
Copy Markdown
Member Author

tmds commented Mar 15, 2018

@stephentoub I've addressed review comments, can you take another look please?

@tmds
Copy link
Copy Markdown
Member Author

tmds commented Mar 20, 2018

@stephentoub @danmosemsft @wfurt can you review this PR please? I'd like to do a follow-up PR that ensures a Process instance doesn't start to refer to another process that has the same (recycled) pid.

@stephentoub
Copy link
Copy Markdown
Member

@dotnet-bot test Outerloop Linux x64 Debug Build please
@dotnet-bot test Outerloop OSX x64 Debug Build please
@dotnet-bot test OSX x64 Debug Build please

{
--pws._outstandingRefCount;
if (pws._outstandingRefCount == 0)
--_outstandingRefCount;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the large documentation comment at the beginning of the file up-to-date?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated this doc as part of this PR: #28404

@danmoseley
Copy link
Copy Markdown
Member

I have not added a test because it takes a long time to run.

How long? A minute or two is fine if you mark it [OuterLoop]

@stephentoub
Copy link
Copy Markdown
Member

stephentoub commented Mar 23, 2018

How long? A minute or two is fine if you mark it [OuterLoop]

FWIW, I disagree, or we need to come up with some new leveling scheme. I view "inner loop" as the tests that should be run across all libraries all the time, e.g. in CI, whereas when iterating on one library, you really need to run all of the outerloop tests for that library every time you make a change to it. And if a bunch of tests get added that each take minutes, that significantly slows that down. Maybe we need a new loop level :) Or I'd be happy to see it be added as [ConditionalFact(typeof(TestEnvironment), nameof(TestEnvironment.IsStressModeEnabled))], which is effectively the these-tests-are-going-to-take-a-super-long-time-...-go-get-lunch level.

@danmoseley
Copy link
Copy Markdown
Member

whereas when iterating on one library, you really need to run all of the tests for that library every time you make a change, including outerloop.

Interesting, I wonder if others work that way. I know we have used it as headless run only (not just for slow tests also eg those tests that potentially mess with your documents folder). We probably need a trait.

Meantime in this case a test would be great even if you just check it passes locally (and fails before the change) then submit it with [Fact] commented out for now. Surely if we don't add it now, it may never be added.

@tmds
Copy link
Copy Markdown
Member Author

tmds commented Mar 26, 2018

@danmosemsft @stephentoub on my laptop, the code runs in 3m20. This is an expensive pid counter. If you lower pid_max on the machines, it will reduce the time. The test code is in the first comment of this PR, perhaps that is (almost) as good as having it in commented code? Or how should I add this?

@danmoseley
Copy link
Copy Markdown
Member

My preference would be to add it exactly like any other test, referencing the PR, with [Fact] commented out. That way it's easy to uncomment it temporarily when anyone works on this code. That is probably less likely if they have to copy paste out of github. @stephentoub seem reasonable?

@stephentoub
Copy link
Copy Markdown
Member

with [Fact] commented out

It's effectively a stress test, so I'd prefer to see it added as:

[ConditionalFact(typeof(TestEnvironment), nameof(TestEnvironment.IsStressModeEnabled))]

@tmds
Copy link
Copy Markdown
Member Author

tmds commented Mar 27, 2018

I have added the test. PTAL. Is there a way to make CI do the stress tests?

@tmds
Copy link
Copy Markdown
Member Author

tmds commented Mar 29, 2018

@dotnet-bot test Linux x64 Release Build please
@dotnet-bot test OSX x64 Debug Build please

@stephentoub
Copy link
Copy Markdown
Member

@dotnet-bot test Linux x64 Release Build please

@stephentoub
Copy link
Copy Markdown
Member

The System.Diagnostics.Process tests crashed on CentOS:

2018-03-31 02:04:50,158: INFO: proc(54): run_and_log_output: Output: System.ObjectDisposedException: The semaphore has been disposed.
2018-03-31 02:04:50,159: INFO: proc(54): run_and_log_output: Output:    at System.Threading.SemaphoreSlim.CheckDispose() in /root/coreclr/src/mscorlib/src/System/Threading/SemaphoreSlim.cs:line 956
2018-03-31 02:04:50,159: INFO: proc(54): run_and_log_output: Output:    at System.Threading.SemaphoreSlim.Release(Int32 releaseCount) in /root/coreclr/src/mscorlib/src/System/Threading/SemaphoreSlim.cs:line 805
2018-03-31 02:04:50,160: INFO: proc(54): run_and_log_output: Output:    at System.Diagnostics.Tests.ProcessTests.<>c__DisplayClass168_0.<TestProcessWaitStateReferenceCount>b__0(Object o, EventArgs e) in /mnt/j/workspace/dotnet_corefx/master/linux-TGroup_netcoreapp+CGroup_Release+AGroup_x64+TestOuter_false_prtest/src/System.Diagnostics.Process/tests/ProcessTests.Unix.cs:line 555
2018-03-31 02:04:50,160: INFO: proc(54): run_and_log_output: Output:    at System.Diagnostics.Process.RaiseOnExited() in /mnt/j/workspace/dotnet_corefx/master/linux-TGroup_netcoreapp+CGroup_Release+AGroup_x64+TestOuter_false_prtest/src/System.Diagnostics.Process/src/System/Diagnostics/Process.cs:line 1105
2018-03-31 02:04:50,161: INFO: proc(54): run_and_log_output: Output:    at System.Diagnostics.Process.CompletionCallback(Object waitHandleContext, Boolean wasSignaled) in /mnt/j/workspace/dotnet_corefx/master/linux-TGroup_netcoreapp+CGroup_Release+AGroup_x64+TestOuter_false_prtest/src/System.Diagnostics.Process/src/System/Diagnostics/Process.cs:line 797
2018-03-31 02:04:50,161: INFO: proc(54): run_and_log_output: Output:    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /root/coreclr/src/mscorlib/shared/System/Threading/ExecutionContext.cs:line 167
2018-03-31 02:04:50,162: INFO: proc(54): run_and_log_output: Output: --- End of stack trace from previous location where exception was thrown ---
2018-03-31 02:04:50,162: INFO: proc(54): run_and_log_output: Output:    at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() in /root/coreclr/src/mscorlib/src/System/Runtime/ExceptionServices/ExceptionDispatchInfo.cs:line 132

Looks related to this PR.

@danmoseley danmoseley added this to the 2.1.0 milestone Apr 2, 2018
@tmds
Copy link
Copy Markdown
Member Author

tmds commented Apr 3, 2018

The System.Diagnostics.Process tests crashed on CentOS:

There was still a race between Close and CompletionCallback: see 0de56c5.

@danmoseley
Copy link
Copy Markdown
Member

@stephentoub can you OK this last change and maybe merge?

@stephentoub
Copy link
Copy Markdown
Member

@dotnet-bot test Outerloop Windows x64 Debug Build please
@dotnet-bot test Outerloop Linux x64 Debug Build please
@dotnet-bot test Outerloop OSX x64 Debug Build please

@stephentoub stephentoub merged commit 8b66ba0 into dotnet:master Apr 5, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* ProcessWaitState: support recycled child pids

* Ensure ProcessWaitHandle uses same ProcessWaitState as Process

* Fix race between Close and CompletionCallback

* Revert GetWaitState

* Add test

* Handle race between Close and CompletionCallback


Commit migrated from dotnet/corefx@8b66ba0
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants