
Add task blocked watchdog. #596

Closed

Fuyukai wants to merge 3 commits into python-trio:master from Fuyukai:watchdog

Conversation

@Fuyukai (Member) commented Aug 12, 2018

Closes #591.

TODO:

  • rebase to master
  • fix deadlock (see Add task blocked watchdog. #596 (comment) and Add task blocked watchdog. #596 (comment))
  • improve tests: code coverage, reduce time scale
  • API review
    • watchdog_timeout=None instead of additional use_watchdog option?
    • best default? Perhaps "max block time of a healthy Trio program x 10".
  • documentation
  • newsfragment
  • future work: include e.g. 99.9 percentile observed block time in Trio stats, to help users decide appropriate watchdog_timeout for their app

codecov bot commented Aug 12, 2018

Codecov Report

Merging #596 into master will decrease coverage by 0.13%.
The diff coverage is 94.8%.


@@            Coverage Diff             @@
##           master     #596      +/-   ##
==========================================
- Coverage   99.31%   99.18%   -0.14%     
==========================================
  Files          91       93       +2     
  Lines       10873    11258     +385     
  Branches      758      841      +83     
==========================================
+ Hits        10799    11166     +367     
- Misses         56       66      +10     
- Partials       18       26       +8
Impacted Files Coverage Δ
trio/_core/_watchdog.py 100% <100%> (ø)
trio/_core/tests/test_watchdog.py 100% <100%> (ø)
trio/_core/_run.py 99.27% <77.77%> (-0.73%) ⬇️
trio/_core/_io_windows.py 74.05% <0%> (-4.11%) ⬇️
trio/_core/tests/test_io.py 99.23% <0%> (-0.77%) ⬇️
trio/_core/_traps.py 100% <0%> (ø) ⬆️
trio/_core/__init__.py 100% <0%> (ø) ⬆️
trio/tests/test_socket.py 100% <0%> (ø) ⬆️
trio/_core/_parking_lot.py 100% <0%> (ø) ⬆️
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7f0dd7d...753753b.

@njsmith (Member) commented Aug 12, 2018

Interesting approach! I was imagining something using a wake-up fd, but I think you're right and threading.Event could work too, and more simply.

This version does have a problem though: if the main loop goes to sleep for 5 seconds because there's nothing to do, then that's fine... but the watchdog still fires. We only want the watchdog to fire if we spend a long time in the task-running part of the loop; it should ignore the IO-handling part.

Of course, doing that between two threads without race conditions is not trivial... Here's my idea. Keep two counters: one that we increment whenever we start running user tasks, and one that we increment whenever we stop. So when we're in the task running part of the code, the counter values are unequal, and when we're in the io-handling part they're equal. Only the main thread mutates these counters; the watchdog thread observes them. The watchdog should fire if it ever makes two observations that are at least X seconds apart, where the values haven't changed, and are unequal to each other. That combination means that we're stuck in a user task.

Since Python's memory model is pretty strongly sequentially consistent, and assignment is atomic, I think we can skip any locks on these counters, which is nice.

With this setup, a correct-but-slightly-inefficient implementation would be:

while True:
    orig_starts = self._starts
    orig_stops = self._stops
    time.sleep(X)
    if orig_starts == self._starts and orig_stops == self._stops and orig_starts != orig_stops:
        # watchdog fires

The downside to this implementation is that it wakes up every X seconds even if the main thread is sleeping, which isn't a huge flaw but it does waste some power and besides, it's just not elegant. Can we do better?

while True:
    self._event.clear()
    orig_starts = self._starts
    orig_stops = self._stops
    if orig_starts == orig_stops:
        # main thread asleep; nothing to do until it wakes up
        self._event.wait()
    else:
        self._event.wait(timeout=X)
        if orig_starts == self._starts and orig_stops == self._stops:
            # watchdog fires

And I think whenever we start running tasks we do self._starts += 1; self._event.set() (in that order), and when we stop running tasks we can skip the event and just do self._stops += 1.
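That main-thread bookkeeping might look something like this (a sketch only; the class and method names here are illustrative, not Trio's actual internals):

```python
import threading

class Watchdog:
    # Illustrative sketch of the main-thread side of the counter scheme.
    def __init__(self):
        self._starts = 0
        self._stops = 0
        self._event = threading.Event()

    def task_batch_started(self):
        # Increment first, then wake the watchdog, so it never observes
        # the event set while the counters still look idle.
        self._starts += 1
        self._event.set()

    def task_batch_finished(self):
        # No wake-up needed: the watchdog's timed wait will observe
        # the changed counter on its own.
        self._stops += 1
```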

Maybe this would be simpler if we used a Condition directly? Event is a thin shim around it, and not adding much value here...

@Fuyukai
Copy link
Member Author

Fuyukai commented Aug 13, 2018

Made the changes so now it should work even when there's nothing to do (basically copy-pasted your code...)

file=sys.stderr
)
# scary internal function!
traceback.print_stack(sys._current_frames()[thread.ident])
Review comment (Member):

What's the advantage of doing this by hand instead of using faulthandler?

(This seems like a pretty delicate and race-prone operation, so I'm nervous about trying to get it right ourselves. E.g., what happens if a thread exits while print_stack is walking its stack?)
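For comparison, a faulthandler-based dump is essentially a one-liner; faulthandler walks the stacks in C rather than from Python code, which is why it sidesteps the race above (sketch):

```python
import faulthandler
import sys

# Dump every thread's stack to stderr in one call, instead of
# iterating sys._current_frames() and calling traceback.print_stack
# by hand from Python.
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```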

Reply (Member, Author):

It prints the code being run rather than an unhelpful list of lines (backwards, too).

self._thread.start()

def stop(self):
self._stopped = True
Review comment (Member):

This should also set the event, so that the thread wakes up promptly and notices that self._stopped has been set.

"Trio Watchdog has not received any "
"notifications in 5 seconds, main "
"thread is blocked!"
)
Review comment (Member):

Maybe assert that time.sleep and test_watchdog show up in the output too, to make sure that the watchdog is printing tracebacks?

await _run.checkpoint()
time.sleep(6) # ensure if the watchdog is waiting for 5s, it wakes
await _run.checkpoint()
assert not watchdog._thread.is_alive()
Review comment (Member):

This looks very delicate... calling self._event.set() from Watchdog.stop() will help, but even so...

Here's another idea: make the watchdog thread not a daemon, and then join it in Watchdog.stop. That way every time we call trio.run, we're implicitly checking the watchdog shutdown logic (and also making sure we don't leave stray background threads running, even temporarily). What do you think?
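A minimal sketch of that shutdown scheme (all names illustrative): the thread is created with daemon=False, and stop() both wakes it and joins it, so every trio.run implicitly exercises the shutdown path.

```python
import threading

class Watchdog:
    # Illustrative sketch of the non-daemon + join idea only;
    # the watchdog logic itself is stubbed out.
    def __init__(self):
        self._stopped = False
        self._event = threading.Event()
        # Not a daemon: if the shutdown logic is ever broken,
        # trio.run hangs visibly instead of leaking a thread.
        self._thread = threading.Thread(target=self._run, daemon=False)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stopped:
            self._event.wait(timeout=5)
            self._event.clear()

    def stop(self):
        self._stopped = True
        self._event.set()    # wake the thread promptly
        self._thread.join()  # prove it actually exits
```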

target = StringIO()
with contextlib.redirect_stderr(target):
time.sleep(7)

Review comment (Member):

Maybe it would also be good to have a test where we await trio.sleep(7) and make sure the watchdog doesn't fire? (I.e., a test to catch the bug that was in the first draft of this code.)

Also, these sleeps are kind of long, even for a @slow test... maybe it would be nicer to make the argument watchdog_timeout=5, so then in tests we can adjust it? (And we get rid of the unconfigurable magic constant.)

else:
watchdog = None

GLOBAL_RUN_CONTEXT.watchdog = watchdog
Review comment (Member):

It doesn't really matter, but... any reason in particular this isn't an attribute on runner?

Reply (Member, Author):

No particular reason - just where I put it at first.

@Fuyukai (Member, Author) commented Aug 22, 2018

For some reason, only on PyPy nightly, joining the thread blocks forever randomly.

@njsmith (Member) commented Aug 23, 2018

For some reason, only on PyPy nightly, joining the thread blocks forever randomly.

So the traceback shows the watchdog thread is blocked in event.wait, and it's blocked for a minute, so probably a call without a timeout. Probably the one on line 42 of _watchdog.py:

self._notify_event.wait()

And the main thread is stuck in join, called from inside our stop method.

def stop(self):
self._stopped = True
self._notify_event.set()
self._thread.join()

That's weird, it looks like our stop method sets the event, so how could the watchdog be blocked waiting for it? Assuming Event actually works, the only way this could happen is if someone calls clear...

if self._stopped:
return
self._notify_event.clear()
orig_starts = self._before_counter
orig_stops = self._after_counter
if orig_starts == orig_stops:
# main thread asleep; nothing to do until it wakes up
self._notify_event.wait()

Ah crud, the watchdog thread calls clear just before it goes to sleep. So if we happen to call stop just in that interval between the if self._stopped and self._notify_event.clear, then we'll deadlock. I'm not 100% sure that this is the problem we're seeing on pypy, but it's certainly a problem in the algorithm I suggested, and could explain it.

Man I hate threads.

I suspect that instead of messing around with Event like this, we want to switch to using a Condition (like how threading.Event is implemented, if you look), so that we can atomically check the stopped flag and counter values before sleeping.

@njsmith (Member) commented Aug 23, 2018

To elaborate a bit more on how a Condition could help here: a Condition always has an attached Lock. And the basic pattern of using a Condition is: one thread acquires the lock, checks some stuff, and then blocks in Condition.wait – which temporarily releases the lock. Another thread acquires the lock, mutates some stuff, and then calls Condition.notify – which can only be done with the lock held. So for example, we might write:

# initialize
self._cond = threading.Condition()

# going to sleep
with self._cond:
    if self._stopped:
        return
    self._cond.wait()

# waking up
with self._cond:
    self._stopped = True
    self._cond.notify()

This is the same pattern as our current code, except that before we had a problem because the wake-up thread could run in between when the sleeping thread checks self._stopped, and when it calls wait to go to sleep. Now, the sleeping thread holds the lock for this period, so the wake-up thread can't run during this time. So that race condition is solved. Does that make sense?

My intuition says that we can extend this pretty straightforwardly to the slightly more complex full watchdog code, but right now I'm too sleepy to figure it out, so maybe give it a try and see how far you get?
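One way that extension might look (a sketch under the same assumptions; all names are illustrative, and a real version would dump tracebacks rather than just setting a flag). The key property is that the watchdog checks self._stopped and the counters, and goes to sleep, all while holding the lock, so stop() can never slip into that gap:

```python
import threading

class Watchdog:
    # Illustrative sketch of the full Condition-based watchdog.
    def __init__(self, timeout=5):
        self._timeout = timeout
        self._cond = threading.Condition()
        self._starts = 0
        self._stops = 0
        self._stopped = False
        self.fired = False
        self._thread = threading.Thread(target=self._watch, daemon=False)

    def start(self):
        self._thread.start()

    # --- main (Trio) thread side ---

    def task_batch_started(self):
        with self._cond:
            self._starts += 1
            self._cond.notify()

    def task_batch_finished(self):
        with self._cond:
            self._stops += 1

    def stop(self):
        with self._cond:
            self._stopped = True
            self._cond.notify()
        self._thread.join()  # non-daemon: every run checks shutdown

    # --- watchdog thread side ---

    def _watch(self):
        with self._cond:
            while not self._stopped:
                orig_starts, orig_stops = self._starts, self._stops
                if orig_starts == orig_stops:
                    # Main thread is idle; sleep until it notifies us.
                    self._cond.wait()
                else:
                    # Main thread is running tasks; give it some time.
                    self._cond.wait(timeout=self._timeout)
                    if (not self._stopped
                            and self._starts == orig_starts
                            and self._stops == orig_stops):
                        # Counters unchanged and unequal: stuck in a task.
                        self.fired = True
```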

@belm0 (Member) commented Sep 3, 2018

What's left to do here? Is this something I can pick up?

@Fuyukai (Member, Author) commented Sep 3, 2018

I haven't touched this in a week or so (been more focused on the subprocess support). Afaik the only thing left here is to prevent that deadlock (using a Condition).

@belm0 (Member) commented Sep 3, 2018

I've added this to my TODO. I'll update here if I can pick this up, maybe next weekend.

@njsmith (Member) commented Apr 26, 2019

Closing stale PR... we can always re-open if someone wants to pick it up again, but if anyone is looking to pick this up again I'd probably start with the new idea described in this comment: #591 (comment)

@njsmith njsmith closed this Apr 26, 2019

Development

Successfully merging this pull request may close these issues.

Add a blocked task watchdog

3 participants