
Add task blocked watchdog. #596

Closed

Fuyukai wants to merge 3 commits into python-trio:master from Fuyukai:watchdog

Conversation

@Fuyukai (Member) commented Aug 12, 2018

Closes #591.

TODO:

  • rebase to master
  • fix deadlock (see Add task blocked watchdog. #596 (comment) and Add task blocked watchdog. #596 (comment))
  • improve tests: code coverage, reduce time scale
  • API review
    • watchdog_timeout=None instead of additional use_watchdog option?
    • best default? Perhaps "max block time of a healthy Trio program x 10".
  • documentation
  • newsfragment
  • future work: include e.g. 99.9 percentile observed block time in Trio stats, to help users decide appropriate watchdog_timeout for their app

codecov bot commented Aug 12, 2018

Codecov Report

Merging #596 into master will decrease coverage by 0.13%.
The diff coverage is 94.8%.


@@            Coverage Diff             @@
##           master     #596      +/-   ##
==========================================
- Coverage   99.31%   99.18%   -0.14%     
==========================================
  Files          91       93       +2     
  Lines       10873    11258     +385     
  Branches      758      841      +83     
==========================================
+ Hits        10799    11166     +367     
- Misses         56       66      +10     
- Partials       18       26       +8
Impacted Files Coverage Δ
trio/_core/_watchdog.py 100% <100%> (ø)
trio/_core/tests/test_watchdog.py 100% <100%> (ø)
trio/_core/_run.py 99.27% <77.77%> (-0.73%) ⬇️
trio/_core/_io_windows.py 74.05% <0%> (-4.11%) ⬇️
trio/_core/tests/test_io.py 99.23% <0%> (-0.77%) ⬇️
trio/_core/_traps.py 100% <0%> (ø) ⬆️
trio/_core/__init__.py 100% <0%> (ø) ⬆️
trio/tests/test_socket.py 100% <0%> (ø) ⬆️
trio/_core/_parking_lot.py 100% <0%> (ø) ⬆️
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7f0dd7d...753753b.

@njsmith (Member) commented Aug 12, 2018

Interesting approach! I was imagining something using a wake-up fd, but I think you're right and threading.Event could work too, and more simply.

This version does have a problem though: if the main loop goes to sleep for 5 seconds because there's nothing to do, then that's fine... but the watchdog still fires. We only want the watchdog to fire if we spend a long time in the task-running part of the loop; it should ignore the IO-handling part.

Of course, doing that between two threads without race conditions is not trivial... Here's my idea. Keep two counters: one that we increment whenever we start running user tasks, and one that we increment whenever we stop. So when we're in the task running part of the code, the counter values are unequal, and when we're in the io-handling part they're equal. Only the main thread mutates these counters; the watchdog thread observes them. The watchdog should fire if it ever makes two observations that are at least X seconds apart, where the values haven't changed, and are unequal to each other. That combination means that we're stuck in a user task.

Since Python's memory model is pretty strongly sequentially consistent, and assignment is atomic, I think we can skip any locks on these counters, which is nice.

With this setup, a correct-but-slightly-inefficient implementation would be:

while True:
    orig_starts = self._starts
    orig_stops = self._stops
    time.sleep(X)
    if orig_starts == self._starts and orig_stops == self._stops and orig_starts != orig_stops:
        # watchdog fires

The downside to this implementation is that it wakes up every X seconds even if the main thread is sleeping, which isn't a huge flaw but it does waste some power and besides, it's just not elegant. Can we do better?

while True:
    self._event.clear()
    orig_starts = self._starts
    orig_stops = self._stops
    if orig_starts == orig_stops:
        # main thread asleep; nothing to do until it wakes up
        self._event.wait()
    else:
        self._event.wait(timeout=X)
        if orig_starts == self._starts and orig_stops == self._stops:
            # watchdog fires

And I think whenever we start running tasks we do self._starts += 1; self._event.set() (in that order), and when we stop running tasks we can skip the event and just do self._stops += 1.
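That main-thread bookkeeping might look something like this (a sketch only; the class and method names here are illustrative, not Trio's actual internals):

```python
import threading

class Watchdog:
    # Illustrative sketch of the main-thread side of the counter scheme.
    def __init__(self):
        self._starts = 0
        self._stops = 0
        self._event = threading.Event()

    def task_batch_started(self):
        # Increment first, then wake the watchdog, so it never observes
        # the event set while the counters still look idle.
        self._starts += 1
        self._event.set()

    def task_batch_finished(self):
        # No wake-up needed: the watchdog's timed wait will observe
        # the changed counter on its own.
        self._stops += 1
```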

Maybe this would be simpler if we used a Condition directly? Event is a thin shim around it, and not adding much value here...

@Fuyukai
Copy link
Member Author

Fuyukai commented Aug 13, 2018

Made the changes so now it should work even when there's nothing to do (basically copy-pasted your code...)

file=sys.stderr
)
# scary internal function!
traceback.print_stack(sys._current_frames()[thread.ident])
Review comment (Member):

What's the advantage of doing this by hand instead of using faulthandler?

(This seems like a pretty delicate and race-prone operation, so I'm nervous about trying to get it right ourselves. E.g., what happens if a thread exits while print_stack is walking its stack?)
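For comparison, a faulthandler-based dump is essentially a one-liner; faulthandler walks the stacks in C rather than from Python code, which is why it sidesteps the race above (sketch):

```python
import faulthandler
import sys

# Dump every thread's stack to stderr in one call, instead of
# iterating sys._current_frames() and calling traceback.print_stack
# by hand from Python.
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```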

Reply (Member, Author):

It prints the code being run rather than an unhelpful list of lines (backwards, too).

self._thread.start()

def stop(self):
self._stopped = True
Review comment (Member):

This should also set the event, so that the thread wakes up promptly and notices that self._stopped has been set.

"Trio Watchdog has not received any "
"notifications in 5 seconds, main "
"thread is blocked!"
)
Review comment (Member):

Maybe assert that time.sleep and test_watchdog show up in the output too, to make sure that the watchdog is printing tracebacks?

await _run.checkpoint()
time.sleep(6) # ensure if the watchdog is waiting for 5s, it wakes
await _run.checkpoint()
assert not watchdog._thread.is_alive()
Review comment (Member):

This looks very delicate... calling self._event.set() from Watchdog.stop() will help, but even so...

Here's another idea: make the watchdog thread not a daemon, and then join it in Watchdog.stop. That way every time we call trio.run, we're implicitly checking the watchdog shutdown logic (and also making sure we don't leave stray background threads running, even temporarily). What do you think?
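A minimal sketch of that shutdown scheme (all names illustrative): the thread is created with daemon=False, and stop() both wakes it and joins it, so every trio.run implicitly exercises the shutdown path.

```python
import threading

class Watchdog:
    # Illustrative sketch of the non-daemon + join idea only;
    # the watchdog logic itself is stubbed out.
    def __init__(self):
        self._stopped = False
        self._event = threading.Event()
        # Not a daemon: if the shutdown logic is ever broken,
        # trio.run hangs visibly instead of leaking a thread.
        self._thread = threading.Thread(target=self._run, daemon=False)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stopped:
            self._event.wait(timeout=5)
            self._event.clear()

    def stop(self):
        self._stopped = True
        self._event.set()    # wake the thread promptly
        self._thread.join()  # prove it actually exits
```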

target = StringIO()
with contextlib.redirect_stderr(target):
time.sleep(7)

Review comment (Member):

Maybe it would also be good to have a test where we await trio.sleep(7) and make sure the watchdog doesn't fire? (I.e., a test to catch the bug that was in the first draft of this code.)

Also, these sleeps are kind of long, even for a @slow test... maybe it would be nicer to make the argument watchdog_timeout=5, so then in tests we can adjust it? (And we get rid of the unconfigurable magic constant.)

else:
watchdog = None

GLOBAL_RUN_CONTEXT.watchdog = watchdog
Review comment (Member):

It doesn't really matter, but... any reason in particular this isn't an attribute on runner?

Reply (Member, Author):

No particular reason - just where I put it at first.

@Fuyukai (Member, Author) commented Aug 22, 2018

For some reason, only on PyPy nightly, joining the thread blocks forever randomly.

@njsmith (Member) commented Aug 23, 2018

For some reason, only on PyPy nightly, joining the thread blocks forever randomly.

So the traceback shows the watchdog thread is blocked in event.wait, and it's blocked for a minute, so probably a call without a timeout. Probably the one on line 42 of _watchdog.py:

self._notify_event.wait()

And the main thread is stuck in join, called from inside our stop method.

def stop(self):
self._stopped = True
self._notify_event.set()
self._thread.join()

That's weird, it looks like our stop method sets the event, so how could the watchdog be blocked waiting for it? Assuming Event actually works, the only way this could happen is if someone calls clear...

if self._stopped:
return
self._notify_event.clear()
orig_starts = self._before_counter
orig_stops = self._after_counter
if orig_starts == orig_stops:
# main thread asleep; nothing to do until it wakes up
self._notify_event.wait()

Ah crud, the watchdog thread calls clear just before it goes to sleep. So if we happen to call stop just in that interval between the if self._stopped and self._notify_event.clear, then we'll deadlock. I'm not 100% sure that this is the problem we're seeing on pypy, but it's certainly a problem in the algorithm I suggested, and could explain it.

Man I hate threads.

I suspect that instead of messing around with Event like this, we want to switch to using a Condition (like how threading.Event is implemented, if you look), so that we can atomically check the stopped flag and counter values before sleeping.

@njsmith (Member) commented Aug 23, 2018

To elaborate a bit more on how a Condition could help here: a Condition always has an attached Lock. And the basic pattern of using a Condition is: one thread acquires the lock, checks some stuff, and then blocks in Condition.wait – which temporarily releases the lock. Another thread acquires the lock, mutates some stuff, and then calls Condition.notify – which can only be done with the lock held. So for example, we might write:

# initialize
self._cond = threading.Condition()

# going to sleep
with self._cond:
    if self._stopped:
        return
    self._cond.wait()

# waking up
with self._cond:
    self._stopped = True
    self._cond.notify()

This is the same pattern as our current code, except that before we had a problem because the wake-up thread could run in between when the sleeping thread checks self._stopped, and when it calls wait to go to sleep. Now, the sleeping thread holds the lock for this period, so the wake-up thread can't run during this time. So that race condition is solved. Does that make sense?

My intuition says that we can extend this pretty straightforwardly to the slightly more complex full watchdog code, but right now I'm too sleepy to figure it out, so maybe give it a try and see how far you get?
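One way that extension might look (a sketch under the same assumptions; all names are illustrative, and a real version would dump tracebacks rather than just setting a flag). The key property is that the watchdog checks self._stopped and the counters, and goes to sleep, all while holding the lock, so stop() can never slip into that gap:

```python
import threading

class Watchdog:
    # Illustrative sketch of the full Condition-based watchdog.
    def __init__(self, timeout=5):
        self._timeout = timeout
        self._cond = threading.Condition()
        self._starts = 0
        self._stops = 0
        self._stopped = False
        self.fired = False
        self._thread = threading.Thread(target=self._watch, daemon=False)

    def start(self):
        self._thread.start()

    # --- main (Trio) thread side ---

    def task_batch_started(self):
        with self._cond:
            self._starts += 1
            self._cond.notify()

    def task_batch_finished(self):
        with self._cond:
            self._stops += 1

    def stop(self):
        with self._cond:
            self._stopped = True
            self._cond.notify()
        self._thread.join()  # non-daemon: every run checks shutdown

    # --- watchdog thread side ---

    def _watch(self):
        with self._cond:
            while not self._stopped:
                orig_starts, orig_stops = self._starts, self._stops
                if orig_starts == orig_stops:
                    # Main thread is idle; sleep until it notifies us.
                    self._cond.wait()
                else:
                    # Main thread is running tasks; give it some time.
                    self._cond.wait(timeout=self._timeout)
                    if (not self._stopped
                            and self._starts == orig_starts
                            and self._stops == orig_stops):
                        # Counters unchanged and unequal: stuck in a task.
                        self.fired = True
```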

@belm0 (Member) commented Sep 3, 2018

What's left to do here? Is this something I can pick up?

@Fuyukai (Member, Author) commented Sep 3, 2018

I haven't touched this in a week or so (been more focused on the subprocess support). Afaik the only thing left here is to prevent that deadlock (using a Condition).

@belm0 (Member) commented Sep 3, 2018

I've added this to my TODO. I'll update here if I can pick this up, maybe next weekend.

@njsmith (Member) commented Apr 26, 2019

Closing stale PR... we can always re-open if someone wants to pick it up again, but if anyone is looking to pick this up again I'd probably start with the new idea described in this comment: #591 (comment)

@njsmith njsmith closed this Apr 26, 2019

Development

Successfully merging this pull request may close these issues.

Add a blocked task watchdog

3 participants