Conversation

@madsbk (Contributor) commented Sep 22, 2021

Sometimes when a task fails by running out of memory, the worker does not clean up the failed memory allocations immediately.
In this PR we make sure to clear any references to the failed task right away by pickling both the exception and the traceback before marking them with to_serialize().
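For context on why those stale references matter: an exception object's __traceback__ points at the failing frames, and those frames keep their local variables (including any large allocations) alive for as long as the exception itself is referenced. A minimal, CPU-only sketch of that effect (the helper name and the 100 MiB size are just for illustration; this is not distributed's error_message() code):

def failing_task():
    big = bytearray(100 * 1024 * 1024)  # stand-in for a large (GPU) allocation
    raise MemoryError("out of memory")

try:
    failing_task()
except MemoryError as exc:
    kept = exc  # hold on to the error, like a worker storing a task's failure

# The traceback chain leads back to failing_task's frame, whose locals still
# include `big`, so the 100 MiB buffer cannot be freed while `kept` is alive.
frame = kept.__traceback__.tb_next.tb_frame
print("big" in frame.f_locals)  # True

Pickling the exception and traceback, and keeping only the reconstructed copies, drops those frame references, which is what allows the allocations to be freed.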

Reproducer

In the following code we try to allocate a small CuPy array after a large allocation has failed. Even though we catch the OOM exception and continue, the subsequent small allocation also fails with an OOM exception. With this PR, the code below runs to completion.

import cupy as cp
from dask_cuda import LocalCUDACluster
from distributed import Client

def create_cupy_array(n_rows):
    # Allocate five int64 arrays of n_rows elements each (~8 * n_rows bytes apiece).
    print(f"create_cupy_array({n_rows})")
    s_1 = cp.asarray([1], dtype="int64").repeat(n_rows)
    s_2 = cp.asarray([1], dtype="int64").repeat(n_rows)
    s_3 = cp.asarray([1], dtype="int64").repeat(n_rows)
    s_4 = cp.asarray([1], dtype="int64").repeat(n_rows)
    s_5 = cp.asarray([1], dtype="int64").repeat(n_rows)
    return len(s_1) + len(s_2) + len(s_3) + len(s_4) + len(s_5)

if __name__ == "__main__":
    with LocalCUDACluster(n_workers=1) as cluster:
        with Client(cluster) as client:
            print("*" * 100)
            print("Create one small array")
            print(client.submit(create_cupy_array, 200_000_000).result())

            try:
                print("*" * 100)
                print("Create large array")
                print(client.submit(create_cupy_array, 900_000_000).result())
            except Exception as e:
                print("Catching exception: ", repr(e))

            print("*" * 100)
            print("Create one small array")
            print(client.submit(create_cupy_array, 200_000_000).result())
            print("FINISHED")

@vyasr commented Sep 22, 2021

If I understand correctly, most of the changes in clean_exception are just variable renames, and the relevant parts are the callers of clean_exception, which ensure that the stored exceptions are new objects decoupled from the originals. If I'm interpreting this correctly, a simpler and more explicit solution might be to deepcopy the relevant exceptions, which avoids the extra overhead introduced by all the unpickling/exception handling in clean_exception.

On an unrelated note, since you're already modifying clean_exception you may want to replace the except Exception lines with something more specific like except PickleError since those blocks seem solely designed to try unpickling if the exception is pickled. You never know when catching a bare Exception will come back to bite you.
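To make the suggestion concrete, a hypothetical standalone helper (maybe_unpickle is a made-up name, not clean_exception's real signature) with the narrower except clause would look roughly like this:

import pickle

def maybe_unpickle(value):
    # Narrow except clause, per the suggestion above: only a genuine
    # unpickling failure means "this value was not a pickled payload".
    if isinstance(value, bytes):
        try:
            return pickle.loads(value)
        except pickle.PickleError:
            pass  # fall through and return the value unchanged
    return value

One caveat, which the reply below touches on: if the stored value is not bytes at all, pickle.loads raises a TypeError rather than a PickleError, so the narrower clause only works together with a guard like the isinstance check above (or with additional exception types).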

@madsbk (Contributor, Author) commented Sep 23, 2021

If I understand correctly, most of the changes in clean_exception are just variable renames, and the relevant parts are the callers of clean_exception, which ensure that the stored exceptions are new objects decoupled from the originals. If I'm interpreting this correctly, a simpler and more explicit solution might be to deepcopy the relevant exceptions, which avoids the extra overhead introduced by all the unpickling/exception handling in clean_exception.

That was also my first thought, but interestingly enough, using deepcopy instead of pickle doesn't fix the issue.
Anyway, I think it is appropriate to use pickle explicitly here. The exception and traceback have to be marked with protocol.to_serialize for communication anyway (see the sketch after this comment).

On an unrelated note, since you're already modifying clean_exception you may want to replace the except Exception lines with something more specific like except PickleError since those blocks seem solely designed to try unpickling if the exception is pickled. You never know when catching a bare Exception will come back to bite you.

Agreed, we could impose stricter semantics here, but at the moment users are allowed to raise any kind of exception. For instance, in testing we raise a TypeError exception.
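Putting the first half of that reply into code: a hedged sketch of the pattern being described, not the verbatim diff (package_error is a made-up name, and the real change also handles the traceback itself). Round-tripping through pickle decouples the stored error from the failed task, because a pickled-and-restored exception does not carry the original __traceback__ and therefore no longer pins the failing frames:

import pickle
from distributed.protocol import to_serialize

def package_error(exc):
    # Round-trip through pickle: the restored copy has no __traceback__,
    # so it no longer references the frames (and large locals) of the
    # task that failed.
    clean = pickle.loads(pickle.dumps(exc))
    # Mark it for serialization on the wire, as noted above.
    return {"exception": to_serialize(clean)}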

@madsbk changed the title from "error_message(): pickle exception and traceback immediately" to "[REVIEW] error_message(): pickle exception and traceback immediately" on Sep 23, 2021
@madsbk (Contributor, Author) commented Sep 24, 2021

This PR is ready for review. From rapidsai/dask-cuda#725 (comment) (@VibhuJawa):

Spent close to 2 hours testing various scenarios, including cupy with and without a pool, cudf with and without a pool, and various cluster sizes, and was never able to get the worker into a bad state.
Can confirm that #5338 seems to fix all the scenarios I could think of and test. Thanks a ton for working on the issue @madsbk. This is great work!

@vyasr commented Sep 24, 2021

I'm surprised that deepcopying doesn't work, but happy that pickling solves the problem! It's a little odd to see other errors coming out of pickling, but that's clearly out of scope here, so this looks good to me. I don't have permissions on this repo, so consider this 👍 my approval.

@quasiben (Member) commented
I think this is OK -- if you have time, I left a small question. If @crusaderky has time to review, that'd be great. If not, I plan to merge on Monday.

@crusaderky changed the title from "[REVIEW] error_message(): pickle exception and traceback immediately" to "error_message(): pickle exception and traceback immediately" on Sep 28, 2021
@crusaderky (Collaborator) left a comment

LGTM

@quasiben merged commit ef28137 into dask:main on Sep 28, 2021
@quasiben (Member) commented
Thanks @crusaderky
