error_message(): pickle exception and traceback immediately #5338
If I understand correctly, most of the changes in On an unrelated note, since you're already modifying
That was also my first thought, but interestingly enough using
Agreed, we could impose stricter semantics here, but at the moment users are allowed to raise any kind of exception. For instance, in testing we raise a
This PR is ready for review, from rapidsai/dask-cuda#725 (comment) (@VibhuJawa):
I'm surprised that deepcopying doesn't work, but happy that pickling solves the problem! That's a little odd to see other errors coming out of pickling, but clearly out of scope here so this looks good to me. I don't have permissions on this repo, so consider this 👍 my approval.
I think this is ok. If you have time, I left a small question. If @crusaderky has time to review, that'd be great. If not, I'll plan to merge on Monday.
crusaderky left a comment
LGTM
Thanks @crusaderky
Sometimes, when a task fails by running out of memory, the worker does not clean up the failed memory allocations immediately.
In this PR we make sure to clear any references to the failed task immediately by pickling both the exception and traceback before marking them with to_serialize().
black distributed / flake8 distributed / isort distributed
Reproducer
In the following code we try to allocate a small cupy array after the allocation of a big array has failed. Even though we catch the OOM exception and continue, the subsequent small allocation also fails with an OOM exception. This code should work with this PR.