Conversation

@madsbk (Contributor) commented Sep 22, 2021

Sometimes when a task fails by running out of memory, the worker does not clean up the failed memory allocations immediately.
In this PR we make sure to clear any references to the failed task right away by pickling both the exception and the traceback before marking them with to_serialize().
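For context on why those stale references matter: an exception object's __traceback__ points at the failing frames, and those frames keep their local variables (including any large allocations) alive for as long as the exception itself is referenced. A minimal, CPU-only sketch of that effect (the helper name and the 100 MiB size are just for illustration; this is not distributed's error_message() code):

def failing_task():
    big = bytearray(100 * 1024 * 1024)  # stand-in for a large (GPU) allocation
    raise MemoryError("out of memory")

try:
    failing_task()
except MemoryError as exc:
    kept = exc  # hold on to the error, like a worker storing a task's failure

# The traceback chain leads back to failing_task's frame, whose locals still
# include `big`, so the 100 MiB buffer cannot be freed while `kept` is alive.
frame = kept.__traceback__.tb_next.tb_frame
print("big" in frame.f_locals)  # True

Pickling the exception and traceback, and keeping only the reconstructed copies, drops those frame references, which is what allows the allocations to be freed.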

Reproducer

In the following code we try to allocate a small CuPy array after a large allocation has failed. Even though we catch the OOM exception and continue, the subsequent small allocation also fails with an OOM exception. With this PR, the code below runs to completion.

import cupy as cp
from dask_cuda import LocalCUDACluster
from distributed import Client

def create_cupy_array(n_rows):
    # Allocate five int64 arrays of n_rows elements each (~8 * n_rows bytes apiece).
    print(f"create_cupy_array({n_rows})")
    s_1 = cp.asarray([1], dtype="int64").repeat(n_rows)
    s_2 = cp.asarray([1], dtype="int64").repeat(n_rows)
    s_3 = cp.asarray([1], dtype="int64").repeat(n_rows)
    s_4 = cp.asarray([1], dtype="int64").repeat(n_rows)
    s_5 = cp.asarray([1], dtype="int64").repeat(n_rows)
    return len(s_1) + len(s_2) + len(s_3) + len(s_4) + len(s_5)

if __name__ == "__main__":
    with LocalCUDACluster(n_workers=1) as cluster:
        with Client(cluster) as client:
            print("*" * 100)
            print("Create one small array")
            print(client.submit(create_cupy_array, 200_000_000).result())

            try:
                print("*" * 100)
                print("Create large array")
                print(client.submit(create_cupy_array, 900_000_000).result())
            except Exception as e:
                print("Catching exception: ", repr(e))

            print("*" * 100)
            print("Create one small array")
            print(client.submit(create_cupy_array, 200_000_000).result())
            print("FINISHED")

@vyasr commented Sep 22, 2021

If I understand correctly, most of the changes in clean_exception are just variable renames, and the relevant parts are the callers of clean_exception, which ensure that the stored exceptions are new objects decoupled from the originals. If I'm interpreting this correctly, a simpler and more explicit solution might be to deepcopy the relevant exceptions, which avoids the extra overhead introduced by all the unpickling/exception handling in clean_exception.

On an unrelated note, since you're already modifying clean_exception you may want to replace the except Exception lines with something more specific like except PickleError since those blocks seem solely designed to try unpickling if the exception is pickled. You never know when catching a bare Exception will come back to bite you.
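To make the suggestion concrete, a hypothetical standalone helper (maybe_unpickle is a made-up name, not clean_exception's real signature) with the narrower except clause would look roughly like this:

import pickle

def maybe_unpickle(value):
    # Narrow except clause, per the suggestion above: only a genuine
    # unpickling failure means "this value was not a pickled payload".
    if isinstance(value, bytes):
        try:
            return pickle.loads(value)
        except pickle.PickleError:
            pass  # fall through and return the value unchanged
    return value

One caveat, which the reply below touches on: if the stored value is not bytes at all, pickle.loads raises a TypeError rather than a PickleError, so the narrower clause only works together with a guard like the isinstance check above (or with additional exception types).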

@madsbk (Contributor, Author) commented Sep 23, 2021

If I understand correctly, most of the changes in clean_exception are just variable renames, and the relevant parts are the callers of clean_exception, which ensure that the stored exceptions are new objects decoupled from the originals. If I'm interpreting this correctly, a simpler and more explicit solution might be to deepcopy the relevant exceptions, which avoids the extra overhead introduced by all the unpickling/exception handling in clean_exception.

That was also my first thought, but interestingly enough, using deepcopy instead of pickle doesn't fix the issue.
Anyway, I think it is appropriate to use pickle explicitly here. The exception and traceback have to be marked with protocol.to_serialize for communication anyway (see the sketch after this comment).

On an unrelated note, since you're already modifying clean_exception you may want to replace the except Exception lines with something more specific like except PickleError since those blocks seem solely designed to try unpickling if the exception is pickled. You never know when catching a bare Exception will come back to bite you.

Agreed, we could impose stricter semantics here, but at the moment users are allowed to raise any kind of exception. For instance, in testing we raise a TypeError exception.
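Putting the first half of that reply into code: a hedged sketch of the pattern being described, not the verbatim diff (package_error is a made-up name, and the real change also handles the traceback itself). Round-tripping through pickle decouples the stored error from the failed task, because a pickled-and-restored exception does not carry the original __traceback__ and therefore no longer pins the failing frames:

import pickle
from distributed.protocol import to_serialize

def package_error(exc):
    # Round-trip through pickle: the restored copy has no __traceback__,
    # so it no longer references the frames (and large locals) of the
    # task that failed.
    clean = pickle.loads(pickle.dumps(exc))
    # Mark it for serialization on the wire, as noted above.
    return {"exception": to_serialize(clean)}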

@madsbk changed the title from "error_message(): pickle exception and traceback immediately" to "[REVIEW] error_message(): pickle exception and traceback immediately" on Sep 23, 2021
@madsbk (Contributor, Author) commented Sep 24, 2021

This PR is ready for review. From rapidsai/dask-cuda#725 (comment) (@VibhuJawa):

Spent close to 2 hours testing various scenarios, including cupy with and without a pool, cudf with and without a pool, and various cluster sizes, and was never able to get the worker into a bad state.
Can confirm that #5338 seems to fix all the scenarios I could think of and test. Thanks a ton for working on the issue @madsbk. This is great work!

@vyasr commented Sep 24, 2021

I'm surprised that deepcopying doesn't work, but happy that pickling solves the problem! It's a little odd to see other errors coming out of pickling, but that's clearly out of scope here, so this looks good to me. I don't have permissions on this repo, so consider this 👍 my approval.

@quasiben (Member) commented
I think this is OK -- if you have time, I left a small question. If @crusaderky has time to review, that'd be great. If not, I plan to merge on Monday.

@crusaderky changed the title from "[REVIEW] error_message(): pickle exception and traceback immediately" to "error_message(): pickle exception and traceback immediately" on Sep 28, 2021
@crusaderky (Collaborator) left a comment

LGTM

@quasiben merged commit ef28137 into dask:main on Sep 28, 2021
@quasiben (Member) commented
Thanks @crusaderky
