Make remote tracebacks easier to read and distinguish #4880

@gjoseph92

When an exception occurs on the cluster, the traceback that shows up is long, and includes both a local component and a remote component. I suspect this is a stumbling block for beginners because:

  1. When tracebacks are too long, people glaze over and don't read them
  2. For those that do read them, separating the local client boilerplate from the remote part requires practice

Specifically, I always skim for something like `raise exception.with_traceback(traceback)` in `distributed/client.py`, ignore everything above it, and just look at the remote part of the traceback. Could we format our error messages differently so users don't have to learn this unintuitive skill?
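To illustrate why the local and remote parts end up in one traceback, here is a minimal single-process sketch. It is not distributed's actual code: `remote_task`, `run_on_worker`, and `client_result` are hypothetical stand-ins, and a real cluster would serialize the traceback between processes rather than pass the object around directly.

```python
import traceback


def remote_task():
    # Stand-in for user code running on a worker (hypothetical).
    raise ValueError("bad input")


def run_on_worker():
    # Stand-in for the worker: capture the exception and its traceback
    # instead of letting it propagate.
    try:
        remote_task()
    except Exception as exc:
        return exc, exc.__traceback__


def client_result():
    # Stand-in for the client: re-raise the captured exception with the
    # captured traceback attached, the pattern mentioned above.
    exc, tb = run_on_worker()
    raise exc.with_traceback(tb)


def frame_names():
    # Collect the function names in the final traceback, outermost first.
    try:
        client_result()
    except ValueError as err:
        return [frame.name for frame in traceback.extract_tb(err.__traceback__)]
```

In this sketch the frames of the re-raising code come first and the captured frames come last, which is why skimming for the `raise` line and reading only what follows it works.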

Goals:

  1. Users can easily tell whether an exception happened locally or on the cluster
  2. Minimal internal distributed code is shown in tracebacks when we know the error wasn't an internal distributed error.

We already do a good job with goal 2 on the worker side thanks to `get_traceback`, which removes irrelevant frames. So this might be as simple as using `raise ... from None` on the client when re-raising a remote exception, plus somehow adding a prefix like:

------------------------------------------------------------------------------------------------
ValueError (remote)                               Traceback from cluster (most recent call last)

While calling Client.compute, this error occurred on worker 'worker-abcde'
while executing task ('map-blocks-12345', 0, 0):

so that remote exceptions are easily distinguishable from local ones.
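As a rough sketch of both ideas, assuming the client has the remote exception, its traceback, the worker name, and the task key in hand: `format_remote_error` and `reraise_remote` below are hypothetical helpers, not existing distributed APIs, and the `RuntimeError` is only a stand-in for whatever local exception might be in flight when the client re-raises.

```python
import traceback


def format_remote_error(exc, worker, key, api="Client.compute"):
    """Hypothetical helper: render a remote exception under a distinguishing header."""
    header = (
        f"{type(exc).__name__} (remote)    "
        "Traceback from cluster (most recent call last)\n\n"
        f"While calling {api}, this error occurred on worker {worker!r}\n"
        f"while executing task {key!r}:\n"
    )
    # Only the frames attached to the exception are shown, followed by
    # the usual "ExceptionType: message" line.
    frames = "".join(traceback.format_tb(exc.__traceback__))
    last_line = "".join(traceback.format_exception_only(type(exc), exc))
    return "-" * 96 + "\n" + header + "\n" + frames + last_line


def reraise_remote(exc, tb):
    """Hypothetical helper: re-raise a remote exception without local chaining."""
    try:
        # Stand-in for a local exception being handled on the client.
        raise RuntimeError("local bookkeeping error")
    except RuntimeError:
        # `from None` sets __suppress_context__, so the local exception is
        # not printed as "During handling of the above exception...".
        raise exc.with_traceback(tb) from None
```

Note that `from None` only suppresses exception chaining. It does not remove the client frames that Python adds as the exception propagates, so the prefix (or some frame filtering on the client) would still be needed to meet goal 1.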
