Skip to content

Conversation

@crusaderky
Copy link
Collaborator

Functions decorated by @fail_hard are frequently called by something else. fail_hard should not block exceptions.

Also, the decorator is a last resort when error handling for everything that we foresaw could possibly go wrong failed.
Instead of trying to force a crash in methods that should not crash to begin with, which is only possible if bypassing the usual callers, write tests which use custom methods which do crash.

Also write methods for the sync use case, which was insofar untested.

Blocks #6388

@crusaderky crusaderky self-assigned this May 20, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request May 20, 2022
@crusaderky crusaderky mentioned this pull request May 20, 2022
@mrocklin
Copy link
Member

cc @gjoseph92

@crusaderky crusaderky linked an issue May 20, 2022 that may be closed by this pull request
8 tasks
Copy link
Collaborator

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a lot of discussion about this, see #6210 (comment).

The basic idea was that if @fail_hard was being triggered at all, some state is clearly broken, and we want to shut down the worker immediately. We want shutdown to happen cleanly if possible. If we don't reraise the exception, it's basically a goto self.cose(). This is bad, but it means we don't have to reason about "what is the error handler for the reraised exception going to do? Could it interfere with our close in some way?". We're tossing reasonable control flow out the window in favor of doing everything we can to minimize other concurrent things that might interfere with Worker.close.

This isn't based on evidence, just intuition (less concurrency -> more predictability). I agree it's problematic. If your change here seems to make things better, I'm fine trying it. I'm curious why it's a blocker—I assume you want to test somewhere that an exception was raised?

@crusaderky
Copy link
Collaborator Author

crusaderky commented May 20, 2022

It's a blocker because before the test was calling gather_dep and expecting its except Exception: block to raise an exception itself, which will no longer happen.

@crusaderky crusaderky merged commit cc383a6 into dask:main May 20, 2022
@crusaderky crusaderky deleted the fail_hard branch May 20, 2022 16:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrate ensure_communicating transitions to new WorkerState event mechanism

3 participants