Handle recycled child PIDs#27763
Conversation
|
Interesting failure in CI: This is caused by the Exited event being called twice. I think this is what happens: I had a look at existing tests and none would fail if Exited was emitted multiple times. The test that fails here was recently added. Probably this can behave similar on Windows. |
|
waitpid() needs to be called before OS can recycle the pid, right? So process.Kill() should never kill wrong guy... |
Process.Kill kills whatever it has the pid for, so this may be the wrong guy. |
|
CI failed on unrelated test on OpenSuse: |
|
Thanks for clarification @tmds. |
|
@wfurt once this PR is merged, I'll look into making Process.Kill throw instead of hitting a recycled pid. |
| if (_waitStateHolder == null) | ||
| { | ||
| _waitStateHolder = new ProcessWaitState.Holder(_processId); | ||
| } |
There was a problem hiding this comment.
We're adding non-trivial expense here to cases that never previously had it. Is it really worthwhile?
There was a problem hiding this comment.
I found it simpler to initialize this once than to figure out the lazy initialization doesn't cause issues. I'll revert this.
| ProcessWaitState waitState = _waitStateHolder?._state; | ||
| if (waitState == null) | ||
| { | ||
| // This will throw |
There was a problem hiding this comment.
It always throws. This will be reverted too.
| { | ||
| // Check the exited event that we get from the threadpool | ||
| // matches the event we are waiting for. | ||
| if (waitHandleContext != _waitHandle) |
There was a problem hiding this comment.
In what situation will they not match?
There was a problem hiding this comment.
This is the test that is failing.
The test does a WaitForExit. This emits the exited event and schedules the CompletionCallback on the ThreadPool.
Then the user calls Close causing _raisedOnExited to be set to false. When the CompletionCallback runs, it emits the exited event a second time.
In case of the test, the _waitHandle is null (set in Close). In general, the Process instance may be re-used at this point causing _waitHandle to be a different non-null value.
There was a problem hiding this comment.
This code is shared between Windows and Unix. Will this ever not match on Windows?
There was a problem hiding this comment.
On Windows, the same thing can happen. So the fix is also needed.
|
Unrelated CI failures on OSX.1013 and Win.10 (hang in System.Reflection.Metadata.Tests?): |
|
@stephentoub I've addressed review comments, can you take another look please? |
|
@stephentoub @danmosemsft @wfurt can you review this PR please? I'd like to do a follow-up PR that ensures a Process instance doesn't start to refer to another process that has the same (recycled) pid. |
|
@dotnet-bot test Outerloop Linux x64 Debug Build please |
| { | ||
| --pws._outstandingRefCount; | ||
| if (pws._outstandingRefCount == 0) | ||
| --_outstandingRefCount; |
There was a problem hiding this comment.
Is the large documentation comment at the beginning of the file up-to-date?
There was a problem hiding this comment.
I have updated this doc as part of this PR: #28404
How long? A minute or two is fine if you mark it [OuterLoop] |
FWIW, I disagree, or we need to come up with some new leveling scheme. I view "inner loop" as the tests that should be run across all libraries all the time, e.g. in CI, whereas when iterating on one library, you really need to run all of the outerloop tests for that library every time you make a change to it. And if a bunch of tests get added that each take minutes, that significantly slows that down. Maybe we need a new loop level :) Or I'd be happy to see it be added as |
Interesting, I wonder if others work that way. I know we have used it as headless run only (not just for slow tests also eg those tests that potentially mess with your documents folder). We probably need a trait. Meantime in this case a test would be great even if you just check it passes locally (and fails before the change) then submit it with |
|
@danmosemsft @stephentoub on my laptop, the code runs in 3m20. This is an expensive pid counter. If you lower pid_max on the machines, it will reduce the time. The test code is in the first comment of this PR, perhaps that is (almost) as good as having it in commented code? Or how should I add this? |
|
My preference would be to add it exactly like any other test, referencing the PR, with [Fact] commented out. That way it's easy to uncomment it temporarily when anyone works on this code. That is probably less likely if they have to copy paste out of github. @stephentoub seem reasonable? |
It's effectively a stress test, so I'd prefer to see it added as: |
|
I have added the test. PTAL. Is there a way to make CI do the stress tests? |
|
@dotnet-bot test Linux x64 Release Build please |
|
@dotnet-bot test Linux x64 Release Build please |
|
The System.Diagnostics.Process tests crashed on CentOS: Looks related to this PR. |
There was still a race between Close and CompletionCallback: see 0de56c5. |
|
@stephentoub can you OK this last change and maybe merge? |
|
@dotnet-bot test Outerloop Windows x64 Debug Build please |
* ProcessWaitState: support recycled child pids * Ensure ProcessWaitHandle uses same ProcessWaitState as Process * Fix race between Close and CompletionCallback * Revert GetWaitState * Add test * Handle race between Close and CompletionCallback Commit migrated from dotnet/corefx@8b66ba0
When the user keeping references to old processes, a new child would re-use some state of an old process with the same PID.
This causes the following code to throw when PID is recycled (https://github.com/dotnet/corefx/issues/27249#issuecomment-367288949).
We detect when the PID is recycled and use a new ProcessWaitState for the new child.
Also, we ensure the ProcessWaitHandle is using the same ProcessWaitState as the Process it was created for.
The above test code no longer throws now when the PID is recycled. I have not added a test because it takes a long time to run.
CC @stephentoub @danmosemsft