Avoid deadlocks in tests that use popen
#6483
The subprocess writes a bunch of output when it terminates. Using `Popen.wait()` here will deadlock, as the Python docs loudly warn you in numerous places.
Not a huge fan of this; it's a weird argument to pass in. Maybe should just inline the function.
Our `popen` helper would always capture stdout/stderr. Redirecting output via pipes carries the risk of deadlock (see the admonition under https://docs.python.org/3/library/subprocess.html#subprocess.Popen.stderr), so we would run `Popen.communicate` in a background thread to keep draining the pipe. If the test wasn't actually using stdout/stderr (most don't), it's simpler to not redirect it at all and let it print out as normal. As usual, pytest will hide the output if the test passes, and print it if it fails. This change isn't strictly necessary; it's just a simplification. It also makes it a little easier to implement the terminate-communicate logic for the `capture_output=True` case, since you don't have to worry about a background thread already running `communicate`.
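To make the shape of that simplification concrete, here is a minimal sketch. It is not the actual `popen` helper from this PR: the name `popen_sketch`, its `timeout` parameter, and the use of `terminate()` as the shutdown signal are all assumptions for illustration. It only redirects output when `capture_output=True`, and on exit it terminates the child and then calls `communicate()` rather than `wait()`.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def popen_sketch(args, capture_output=False, timeout=30):
    """Hypothetical stand-in for a test helper; not the real `popen` in distributed."""
    kwargs = {}
    if capture_output:
        # Only redirect when the test really needs to read the output.
        # Otherwise the child inherits stdout/stderr and pytest captures it.
        kwargs["stdout"] = subprocess.PIPE
        kwargs["stderr"] = subprocess.STDOUT
    proc = subprocess.Popen(args, **kwargs)
    try:
        yield proc
    finally:
        # Assumption: terminate() stands in for whatever signal the real helper sends.
        proc.terminate()
        try:
            # communicate() drains any piped output while it waits, so a chatty
            # shutdown cannot fill the pipe and block the child from exiting.
            proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.communicate()
```

A test that needs the output would pass `capture_output=True` and read from `proc.stdout` inside the `with` block; everything else would use the default and never touch a pipe.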
Unit Test Results: 15 files ±0, 15 suites ±0, 6h 37m 25s ⏱️ +28m 3s. For more details on these failures and errors, see this check. Results for commit 90fe1b5. ± Comparison against base commit c2b28cf. ♻️ This comment has been updated with latest results.
Co-authored-by: Thomas Grainger <tagrain@gmail.com>
crusaderky
left a comment
Cosmetic notes only
Unit Test Results: See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 15 files ±0, 15 suites ±0, 6h 28m 24s ⏱️ +19m 2s. For more details on these failures, see this check. Results for commit 89669f6. ± Comparison against base commit c2b28cf.
Failing tests:
All 7 runs failed: test_dashboard_non_standard_ports (distributed.cli.tests.test_dask_scheduler)
All 7 runs failed: test_scheduler_port_zero (distributed.cli.tests.test_dask_scheduler)

https://github.com/dask/distributed/runs/6798102761?check_suite_focus=true#step:11:1744

looks like 781af78 introduced a new use of `flush_output=`

please merge from main

thanks @crusaderky

Unit Test Results: See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 15 files +8, 15 suites +8, 6h 14m 22s ⏱️ +3h 55m 25s. For more details on these failures, see this check. Results for commit 79f3bcb. ± Comparison against base commit 81e237b.

If this stops the popen tests then Gabe, you have my undying gratitude.
> On Thu, Jun 9, 2022 at 2:43 PM crusaderky wrote:
> Merged #6483 into main.

*popen test failures
> On Thu, Jun 9, 2022 at 3:20 PM Matthew Rocklin wrote:
> If this stops the popen tests then Gabe, you have my undying gratitude.

Let's see after a few days. I expect some of these tests will still fail, but for typical reasons (port already in use, OSError timed out connecting to scheduler, etc.).

It doesn't seem to be effective: https://github.com/dask/distributed/runs/6825736211?check_suite_focus=true

Yup. We're slowly getting closer though. Now we can get more information: #6567

I believe tests using `popen` may be occasionally failing with `subprocess.TimeoutExpired` errors because they're deadlocking in the way the `subprocess` docs warn you to avoid.

Instead of using `Popen.wait()` to wait for subprocesses to shut down, we should use `Popen.communicate()`. If the subprocess writes a bunch of stuff to stdout/stderr as it's shutting down, the stdout pipe may get filled up, blocking further writes and preventing the subprocess from shutting down.

I can't confirm this is actually what's happening. I just see these tracebacks pointing to a `wait()` call, a warning in the docs about `wait` deadlocking, and my new test confirming that if this did happen, the current implementation would fail with `TimeoutExpired`. So this seems like the right thing to do regardless. But it's entirely possible this isn't the problem (and it's actually something where the scheduler/worker isn't responding to SIGINT well and isn't shutting down).

c4737b6 is the important change.

In 6a8ad6e, I refactored our `popen` helper to not even capture stdout/stderr if we weren't going to use it (very few tests do). This may not be strictly necessary, but it just seems much simpler and more reliable.

Previously, we were launching `Popen.communicate` in a background thread to flush the pipe. This is complicated, and may not have actually worked reliably. `Popen.communicate`, like all interactions with `Popen` or file objects, is not thread-safe. Tests like `distributed.cli.tests.test_dask_scheduler.test_hostport` were timing out despite using `flush_output=True`, which should in principle have made them immune to this problem. So I'm wondering if `Popen.communicate` in one thread and `Popen.wait` in another could intermittently cause some internal `Popen` state to break.

I'm hoping this will fix the flakiness in:

- `distributed.cli.tests.test_dask_scheduler.test_hostport`
- `distributed.cli.tests.test_dask_scheduler.test_preload_command`
- `distributed.cli.tests.test_dask_scheduler.test_preload_command_default`
- `distributed.cli.tests.test_dask_scheduler.test_preload_module`
- `distributed/cli/tests/test_dask_scheduler.py::test_dashboard_port_zero` #6395
- `distributed.cli.tests.test_dask_worker.test_error_during_startup[--nanny]`
- `distributed.cli.tests.test_client.test_quiet_close_process[True]`

`pre-commit run --all-files`

cc @crusaderky @fjetter @graingert
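
The failure mode described above can be reproduced in isolation. The following is a rough sketch, not the test added in this PR: the child script, `OUTPUT_SIZE`, and both function names are made up for illustration, and it assumes the OS pipe buffer is smaller than 1 MiB (commonly 64 KiB on Linux). The child writes more than the pipe can hold, so `wait()` with a timeout raises `TimeoutExpired` while `communicate()` drains the pipe and lets the child exit.

```python
import subprocess
import sys

# Assumption: 1 MiB is larger than the OS pipe buffer on the test machine.
OUTPUT_SIZE = 1 << 20
CHILD = f"import sys; sys.stdout.write('x' * {OUTPUT_SIZE}); sys.stdout.flush()"


def run_with_wait(timeout=2.0):
    """Nobody reads the pipe, so the child blocks on write and never exits."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    try:
        proc.wait(timeout=timeout)
        return "exited"
    except subprocess.TimeoutExpired:
        return "timed out"
    finally:
        # Clean up the stuck child and close the pipe.
        proc.kill()
        proc.communicate()


def run_with_communicate(timeout=30.0):
    """communicate() reads the pipe while waiting, so the child can finish."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    try:
        out, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        out, _ = proc.communicate()
    return proc.returncode, len(out)
```

Under the stated assumption, `run_with_wait()` reports a timeout even though the child has nothing left to do but write, which matches the `TimeoutExpired` tracebacks pointing at `wait()`.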