Description
When an exception occurs on the cluster, the traceback that shows up is long, and includes both a local component and a remote component. I suspect this is a stumbling block for beginners because:
- When tracebacks are too long, people glaze over and don't read them
- For those who do read them, separating the local client boilerplate from the remote part takes practice
Specifically, I always skim for something like `raise exception.with_traceback(traceback)` in `distributed/client.py`, ignore everything above it, and just look at the remote part of the traceback. Could we format our error messages differently so users don't have to learn this unintuitive skill?
Goals:
- Users can easily tell whether an exception happened locally or on the cluster
- Minimal internal distributed code is shown in tracebacks when we know the error wasn't an internal distributed error.
We already do a good job with the second goal on the worker side thanks to `get_traceback`, which removes irrelevant frames. So this might be as simple as `raise ... from None` on the client when re-raising a remote exception, plus somehow adding a prefix like:
------------------------------------------------------------------------------------------------
ValueError (remote) Traceback from cluster (most recent call last)
While calling Client.compute, this error occurred on worker 'worker-abcde'
while executing task ('map-blocks-12345', 0, 0):
so that remote exceptions are easily distinguishable from local ones.
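
To make the idea concrete, here is a rough sketch of what the client-side re-raise could look like. It is not how distributed does it today. `run_on_worker`, `remote_prefix`, `reraise_remote` and the `RemoteTaskError` wrapper are all hypothetical names invented for this illustration, and the "worker" is simulated by a plain local function call. Wrapping in a new exception type is only one possible choice; keeping the original exception type would need a different way to attach the prefix.

```python
import traceback

BANNER = "-" * 96


class RemoteTaskError(Exception):
    """Hypothetical wrapper that carries the original remote exception."""

    def __init__(self, message, original):
        super().__init__(message)
        self.original = original


def run_on_worker(func, *args):
    """Stand-in for remote execution: return (exception, traceback text)."""
    try:
        func(*args)
    except Exception as exc:
        # Capture the traceback as text, as if it had been sent over the wire.
        tb_text = "".join(traceback.format_tb(exc.__traceback__))
        return exc, tb_text
    return None, None


def remote_prefix(exc, api_call, worker, key):
    """Build the banner that marks an exception as coming from the cluster."""
    name = type(exc).__name__
    return (
        f"{BANNER}\n"
        f"{name} (remote) Traceback from cluster (most recent call last)\n"
        f"While calling {api_call}, this error occurred on worker {worker!r}\n"
        f"while executing task {key!r}:"
    )


def reraise_remote(exc, tb_text, api_call, worker, key):
    """Re-raise a remote exception with the prefix and without local chaining."""
    prefix = remote_prefix(exc, api_call, worker, key)
    message = f"{prefix}\n{tb_text}{type(exc).__name__}: {exc}"
    # `from None` sets __suppress_context__, so Python does not print the
    # "During handling of the above exception..." chain of client frames.
    raise RemoteTaskError(message, exc) from None
```

Calling `reraise_remote(exc, tb_text, "Client.compute", "worker-abcde", ("map-blocks-12345", 0, 0))` with the result of `run_on_worker` raises an error whose message starts with the banner above, followed by the remote frames and the original error line.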