Add task blocked watchdog. #596
Fuyukai wants to merge 3 commits into python-trio:master from Fuyukai:watchdog
Codecov Report
@@ Coverage Diff @@
## master #596 +/- ##
==========================================
- Coverage 99.31% 99.18% -0.14%
==========================================
Files 91 93 +2
Lines 10873 11258 +385
Branches 758 841 +83
==========================================
+ Hits 10799 11166 +367
- Misses 56 66 +10
- Partials 18 26 +8
Interesting approach! I was imagining something using a wake-up fd, but I think you're right and

This version does have a problem though: if the main loop goes to sleep for 5 seconds because there's nothing to do, then that's fine... but the watchdog still fires. We only want the watchdog to fire if we spend a long time in the task-running part of the loop; it should ignore the IO-handling part. Of course, doing that between two threads without race conditions is not trivial...

Here's my idea. Keep two counters: one that we increment whenever we start running user tasks, and one that we increment whenever we stop. So when we're in the task-running part of the code, the counter values are unequal, and when we're in the IO-handling part they're equal. Only the main thread mutates these counters; the watchdog thread observes them. The watchdog should fire if it ever makes two observations that are at least X seconds apart, where the values haven't changed, and are unequal to each other. That combination means that we're stuck in a user task.

Since Python's memory model is pretty strongly sequential and consistent, and assignment is atomic, I think we can skip any locks on these counters, which is nice.

With this setup, a correct-but-slightly-inefficient implementation would be:

    while True:
        orig_starts = self._starts
        orig_stops = self._stops
        time.sleep(X)
        if orig_starts == self._starts and orig_stops == self._stops and orig_starts != orig_stops:
            # watchdog fires

The downside to this implementation is that it wakes up every X seconds even if the main thread is sleeping, which isn't a huge flaw but it does waste some power and besides, it's just not elegant. Can we do better?

    while True:
        self._event.clear()
        orig_starts = self._starts
        orig_stops = self._stops
        if orig_starts == orig_stops:
            # main thread asleep; nothing to do until it wakes up
            self._event.wait()
        else:
            self._event.wait(timeout=5)
            if orig_starts == self._starts and orig_stops == self._stops:
                # watchdog fires

And I think whenever we start running tasks we do

Maybe this would be simpler if we used a
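For anyone following along, here is a minimal, self-contained sketch of the two-counter idea using only the standard library. It is not the code in this PR. The class name, the `on_fire` callback, the `task_batch_started` / `task_batch_finished` method names and the configurable `timeout` are all assumptions made for illustration. The comment above is cut off where it describes what the main thread does when it starts running tasks, so setting the event right after each counter bump is a guess as well. The `_stopped` check right after `clear()` is also an addition, so that a `stop()` that lands just before `clear()` is not lost.

```python
import threading


class Watchdog:
    """Two-counter watchdog sketch; every name here is illustrative."""

    def __init__(self, timeout, on_fire):
        self._timeout = timeout
        self._on_fire = on_fire  # hypothetical callback, runs in the watchdog thread
        self._starts = 0  # bumped when the main thread starts running tasks
        self._stops = 0  # bumped when the main thread stops running tasks
        self._event = threading.Event()
        self._stopped = False
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stopped = True
        self._event.set()  # wake the thread so it notices _stopped promptly

    # The next two methods are meant to be called only from the main thread.
    def task_batch_started(self):
        self._starts += 1
        self._event.set()  # assumption: wake the watchdog so it starts timing

    def task_batch_finished(self):
        self._stops += 1
        self._event.set()  # assumption: let the watchdog go back to idle waiting

    def _loop(self):
        while True:
            self._event.clear()
            if self._stopped:
                return
            orig_starts = self._starts
            orig_stops = self._stops
            if orig_starts == orig_stops:
                # Main thread is in the IO-handling part; wait until it wakes us.
                self._event.wait()
            else:
                self._event.wait(timeout=self._timeout)
                if (
                    not self._stopped
                    and orig_starts == self._starts
                    and orig_stops == self._stops
                ):
                    # Counters unchanged and unequal: stuck in a user task.
                    self._on_fire()
```

With this shape, an idle main thread never triggers `on_fire`, because the counters stay equal and the thread waits without a timeout. While a task batch is stuck, `on_fire` is called once per `timeout` interval rather than only once.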
Made the changes so now it should work even when there's nothing to do (basically copy-pasted your code...)
        file=sys.stderr
    )
    # scary internal function!
    traceback.print_stack(sys._current_frames()[thread.ident])
What's the advantage of doing this by hand instead of using faulthandler?
(This seems like a pretty delicate and race-prone operation, so I'm nervous about trying to get it right ourselves. E.g., what happens if a thread exits while print_stack is walking its stack?)
It prints the code being run rather than an unhelpful list of lines (backwards, too).
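To make the trade-off in this thread concrete, here is a small sketch that produces both kinds of dump. The two helper names are hypothetical. `sys._current_frames()` is a private CPython-style API, and the `None` check covers the case where the thread has already exited by the time we look it up. `faulthandler.dump_traceback` writes straight to a file descriptor, so it needs a real file rather than a `StringIO`. Its output lists file, line number and function name with the most recent call first, while `traceback.print_stack` also prints source lines when they are available.

```python
import faulthandler
import io
import sys
import tempfile
import traceback


def dump_with_traceback(thread_ident):
    """Format one thread's stack by hand via the private sys._current_frames()."""
    frame = sys._current_frames().get(thread_ident)
    if frame is None:
        # The thread is not (or no longer) running; nothing to print.
        return ""
    out = io.StringIO()
    traceback.print_stack(frame, file=out)
    return out.getvalue()


def dump_with_faulthandler():
    """Dump every thread's stack with faulthandler into a temporary file."""
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()
```

The hand-rolled version can target a single thread and shows the offending source line. The faulthandler version avoids walking another thread's frames from Python code.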
        self._thread.start()

    def stop(self):
        self._stopped = True
This should also set the event, so that the thread wakes up promptly and notices that self._stopped has been set.
trio/_core/tests/test_watchdog.py
Outdated
        "Trio Watchdog has not received any "
        "notifications in 5 seconds, main "
        "thread is blocked!"
    )
Maybe assert that time.sleep and test_watchdog show up in the output too, to make sure that the watchdog is printing tracebacks?
trio/_core/tests/test_watchdog.py
Outdated
    await _run.checkpoint()
    time.sleep(6)  # ensure if the watchdog is waiting for 5s, it wakes
    await _run.checkpoint()
    assert not watchdog._thread.is_alive()
This looks very delicate... calling self._event.set() from Watchdog.stop() will help, but even so...
Here's another idea: make the watchdog thread not a daemon, and then join it in Watchdog.stop. That way every time we call trio.run, we're implicitly checking the watchdog shutdown logic (and also making sure we don't leave stray background threads running, even temporarily). What do you think?
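A rough sketch of that shutdown shape, with the watchdog logic stripped out so only the thread lifecycle remains. `StoppableThread` is a made-up name, not something in the PR. The re-check of `_stopped` after `clear()` is an addition here, meant to keep a `stop()` that lands just before `clear()` from being lost.

```python
import threading


class StoppableThread:
    """Hypothetical helper: a non-daemon background thread that stop() joins."""

    def __init__(self):
        self._event = threading.Event()
        self._stopped = False
        self._thread = threading.Thread(target=self._loop, daemon=False)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stopped = True
        self._event.set()  # wake the thread promptly
        self._thread.join()  # block until the thread has really exited

    def _loop(self):
        while True:
            self._event.clear()
            if self._stopped:  # re-check after clear() so a stop() is never lost
                return
            self._event.wait()
```

Because `stop()` joins, every caller exercises the shutdown path, and no background thread outlives the call.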
trio/_core/tests/test_watchdog.py
Outdated
    target = StringIO()
    with contextlib.redirect_stderr(target):
        time.sleep(7)
Maybe it would also be good to have a test where we await trio.sleep(7) and make sure the watchdog doesn't fire? (I.e., a test to catch the bug that was in the first draft of this code.)
Also, these sleeps are kind of long, even for a @slow test... maybe it would be nicer to make the argument watchdog_timeout=5, so then in tests we can adjust it? (And we get rid of the unconfigurable magic constant.)
    else:
        watchdog = None

    GLOBAL_RUN_CONTEXT.watchdog = watchdog
It doesn't really matter, but... any reason in particular this isn't an attribute on runner?
No particular reason - just where I put it at first.

For some reason, only on PyPy nightly, joining the thread blocks forever randomly.

So the traceback shows the watchdog thread is blocked in
Line 42 in 753753b
And the main thread is stuck in
Lines 81 to 84 in 753753b
That's weird, it looks like our
Lines 34 to 42 in 753753b
Ah crud, the watchdog thread calls
Man I hate threads. I suspect that instead of messing around with

To elaborate a bit more on how a

    # initialize
    self._cond = threading.Condition()

    # going to sleep
    with self._cond:
        if self._stopped:
            return
        self._cond.wait()

    # waking up
    with self._cond:
        self._stopped = True
        self._cond.notify()

This is the same pattern as our current code, except that before we had a problem because the wake-up thread could run in between when the sleeping thread checks

My intuition says that we can extend this pretty straightforwardly to the slightly more complex full watchdog code, but right now I'm too sleepy to figure it out, so maybe give it a try and see how far you get?
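Here is the same sleep/wake pattern as a runnable sketch, in case it helps whoever picks this up. `ConditionSleeper` and its method names are invented for illustration. The one deliberate difference from the snippet above is a `while` loop around `wait()` instead of a single `if`, which is the usual guard against spurious wakeups. Because the flag is checked and the wait is entered while holding the same lock that `stop()` takes, the wake-up cannot slip in between the check and the sleep.

```python
import threading


class ConditionSleeper:
    """Hypothetical helper: a thread that sleeps on a Condition until stopped."""

    def __init__(self):
        self._cond = threading.Condition()
        self._stopped = False
        self._thread = threading.Thread(target=self._sleep_until_stopped)

    def start(self):
        self._thread.start()

    def _sleep_until_stopped(self):
        # going to sleep
        with self._cond:
            while not self._stopped:
                # wait() releases the lock while sleeping and re-acquires it
                # before returning, so the check and the sleep are atomic
                # with respect to stop().
                self._cond.wait()

    def stop(self):
        # waking up
        with self._cond:
            self._stopped = True
            self._cond.notify()
        self._thread.join()
```

Extending this to the full watchdog would presumably mean reading the counters under the same lock and passing a timeout to `wait()`, but that part is untested speculation.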

What's left to do here? Is this something I can pick up?

I haven't touched this in a week or so (been more focused on the subprocess support). Afaik the only thing left here is to prevent that deadlock (using a Condition).

I've added a TODO. I'll update here if I can pick this up, maybe next weekend.

Closing stale PR... we can always re-open if someone wants to pick it up again, but if anyone is looking to pick this up again I'd probably start with the new idea described in this comment: #591 (comment)

Closes #591.
TODO:
watchdog_timeout=None instead of additional use_watchdog option?