Conversation

@ian-r-rose
Collaborator

@ian-r-rose ian-r-rose commented Jun 17, 2022

Not ready yet, but putting this out for early visibility. I also want to add similar data to the individual scheduler and worker info pages, as well as add various tests. Closes #6539

image

  • Tests added / passed
  • Passes pre-commit run --all-files

@mrocklin
Member

Thoughts on using the Jinja template pages instead? The Bokeh tables aren't the greatest. My sense is that tracebacks will be pretty critical here.

@ian-r-rose
Collaborator Author

> Thoughts on using the Jinja template pages instead? The Bokeh tables aren't the greatest. My sense is that tracebacks will be pretty critical here.

Yeah, I can definitely do a Jinja template. I'm a little skittish about putting tracebacks in a <details> tag, as it's hard to get nested tables working well, but we can iterate.

Note that I put the traceback in a tooltip for this table, so it is available, though not very well formatted.

@github-actions
Contributor

github-actions bot commented Jun 17, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0    15 suites ±0    6h 42m 30s ⏱️ -3h 38m 55s
2 962 tests +65    2 872 ✔️ +60    88 💤 +5    2 ±0
21 970 runs +512    20 933 ✔️ +433    1 035 💤 +79    2 ±0

For more details on these failures, see this check.

Results for commit 83ca016. ± Comparison against base commit a8eb3b2.

♻️ This comment has been updated with latest results.

@mrocklin
Member

We could also grab code from the oldest computation associated with the exception.

I think that Bokeh is good when we're doing live-updating. This seems like a case where live updates aren't needed. The info pages might be simpler to work with and give us more flexibility generally.

@mrocklin
Member

Also, in case the info pages aren't well known to you (I think I'm the only person who has touched them (it shows)) there are some pointers in the text of #6539 (in case you skimmed over it)

@ian-r-rose
Collaborator Author

> Also, in case the info pages aren't well known to you (I think I'm the only person who has touched them (it shows)) there are some pointers in the text of #6539 (in case you skimmed over it)

They're known, though I don't tend to go that way very often as they don't really play nicely with the labextension

@mrocklin
Member

mrocklin commented Jun 18, 2022 via email

@ian-r-rose
Collaborator Author

Updated with a dedicated HTML template for exceptions:
image

@ian-r-rose ian-r-rose force-pushed the failed-task-dashboard branch from 991454d to 63cef84 Compare June 27, 2022 17:37
@ian-r-rose ian-r-rose marked this pull request as ready for review June 27, 2022 17:37
@ian-r-rose
Collaborator Author

This is ready for review. The Bokeh stuff is logically separate from the HTML template, and I could easily remove it, though I personally found it helpful when developing this.

@ian-r-rose
Collaborator Author

Also added some info on the individual tasks page if there is an error:

image

self.group_exceptions[(ts.group_key, ex)]["workers"].add(w or worker)
self.group_exceptions[(ts.group_key, ex)]["count"] += 1
self.group_exceptions[(ts.group_key, ex)]["last_seen"] = time()

Member

I'm hesitant to add more state to the scheduler for this. Instead, in keeping with the single-point-of-truth approach, I wonder if we can grab this information from the existing state.

Perhaps we could wander through all task groups, find any with erred tasks, and then use those?

Or is your goal to capture a history of exceptions? Typically the Dask dashboard doesn't try to capture much history. It's more about getting a view of current state. Thoughts?
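To sketch the task-group idea concretely (untested sketch; assumes the existing Scheduler.task_groups / TaskGroup.states structures):

# Untested sketch: derive the erred view from state the scheduler already has,
# rather than tracking it separately.
def erred_task_groups(scheduler):
    # TaskGroup.states counts tasks per state, so groups containing failures
    # can be found without any new bookkeeping.
    return [
        tg for tg in scheduler.task_groups.values() if tg.states["erred"] > 0
    ]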

Collaborator Author

My goal here was actually to do a level of deduplication, effectively trying to hash based on (group, exception_type), so that many similar-but-not-identical exceptions are consolidated into a single row.

> Perhaps we could wander through all task groups, find any with erred tasks, and then use those?

Yes, I can refactor to do something like that. One thing that I did find helpful was keeping errored tasks in a deque to force them to stay in scope a bit longer. Otherwise it was too easy to lose task state for what is presumably a forensic exercise. So I'd argue for keeping that data structure with some history, more so than the deduplicated dictionary.
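For concreteness, the deduplication I had in mind was shaped roughly like this (simplified, illustrative sketch; not the actual diff):

from collections import defaultdict, deque
from time import time

# Simplified sketch: similar failures collapse onto one row keyed by
# (task group, exception type), while a bounded deque keeps the raw erred
# tasks around a little longer for forensics.
group_exceptions = defaultdict(lambda: {"workers": set(), "count": 0, "last_seen": None})
erred_tasks = deque(maxlen=100)

def record_failure(ts, exc, worker):
    key = (ts.group_key, type(exc).__name__)
    entry = group_exceptions[key]
    entry["workers"].add(worker)
    entry["count"] += 1
    entry["last_seen"] = time()
    erred_tasks.append(ts)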

Collaborator Author

Okay, I've removed (most of) the state, and it now has a lighter footprint on the scheduler

Contributor

I think what we've found useful to have is the exception, and the traceback on that exception, for investigating what went wrong; but I agree that I wouldn't want to duplicate state for this.

For the future (if this helps guide your design decisions) I can imagine:

  1. wanting to retry these tasks via the dashboard
  2. Figuring out what input mapped to each failed task (in particular for large client.map calls).

In our experience, tasks generally fail in one of three ways:

  • All with exactly the same exception/traceback (example: something in the submitted function is broken)
  • All with the same exception/mostly the same traceback (example: some service is just down)
  • Singleish task failures (some specific input is bad)

Collaborator Author

Thanks for the insights @gerrymanoim! A few thoughts inline:

> For the future (if this helps guide your design decisions) I can imagine:
>
> 1. wanting to retry these tasks via the dashboard

This would be really cool, though we'd probably want to proceed carefully, since adding things which actually affect scheduler state to the dashboard brings in some serious security considerations (cf #6407). I don't think anything in this design would prevent us from doing this, however, should we figure out good ways of handling the security concerns.

> 2. Figuring out what input mapped to each failed task (in particular for large client.map calls).

I also added the exception information to the individual task page, so a user can go from Exceptions -> individual task -> dependencies, all via links. Does that work for the type of investigations you are thinking about?

> * All with exactly the same exception/traceback (example: something in the submitted function is broken)
> * All with the same exception/mostly the same traceback (example: some service is just down)
> * Singleish task failures (some specific input is bad)

An earlier version of this made some light attempts at deduplicating similar tasks (e.g., they are all the same except for the chunk/partition index, so the text is strictly different). I have some thoughts about how that might be done, but for the time being I've removed that logic in favor of simplicity. If huge numbers of similar tasks wind up being a problem with this approach, I'd certainly be interested in hearing about it!

Contributor

> since adding things which actually affect scheduler state to the dashboard brings in some serious security considerations

Definitely a fair point. Maybe something that's better as a (future) jupyter widget instead of on the dashboard? I can imagine a whole UI wrapper for as_completed that gives you progress/tracebacks/restarts (just thinking out loud here).

> Exceptions -> individual task -> dependencies

Does dependencies mean upstream nodes in the graph? Or the actual input (if any) a task was passed? I.e., if I do

def fail_if_two(x):
    if x == 2:
        raise RuntimeError("It was two!")

numbers_to_run = list(range(1000))
futures = client.map(fail_if_two, numbers_to_run)

will I be able to recover 2 via the dashboard?

> If huge numbers of similar tasks wind up being a problem with this approach, I'd certainly be interested in hearing about it!

👍 will report back - thanks!

Member

We can consider dashboard pages that affect state. Arguably the Dask Dashboard should be treated more securely than it currently is by most systems. We'd want to iterate with a broader set of stakeholders here first, but I'm open to it.

> Does dependencies mean upstream nodes in the graph? Or the actual input (if any) a task was passed? I.e., if I do

By dependencies I think that @ian-r-rose mostly means upstream nodes, and not inputs. Inputs might be tricky because they're currently stored in serialized form in the scheduler (and the scheduler isn't supposed to unpack them). This will change if we merge #6028 (on hold until the stability issues settle down, which it seems like maybe they have?).

Collaborator Author

> By dependencies I think that @ian-r-rose mostly means upstream nodes, and not inputs

That's correct: the upstream task nodes are visible, but their deserialized inputs are not. For your example, if I edit it to have some upstream tasks:

def fail_if_two(x):
    if x == 2:
        raise RuntimeError("It was two!")

def inc(x):
    return x + 1

numbers_to_run = list(range(1000))

futures_a = client.map(inc, numbers_to_run)
futures_b = client.map(fail_if_two, futures_a)

I will be able to see that the exception has an inc task as a dependency, but not what the actual input value was.

Contributor

Understood - thanks for the clarifications.

@ian-r-rose ian-r-rose requested a review from fjetter June 28, 2022 16:11
@ian-r-rose ian-r-rose requested a review from crusaderky July 5, 2022 15:20
@mrocklin
Member

mrocklin commented Jul 5, 2022

I like the content on the exceptions page. I also like that it's on the info page for the task.

Question though, how does one navigate to this page?

@ian-r-rose
Collaborator Author

> Question though, how does one navigate to this page?

There is a new "Exceptions" button at the top of the landing info page

@mrocklin
Member

mrocklin commented Jul 5, 2022

Ah ha! So there is.

self.erred_tasks: deque[TaskState] = deque(
    maxlen=dask.config.get("distributed.diagnostics.erred-tasks.max-history")
)
Member

@crusaderky will keeping around a deque of references to old TaskStates that have failed cause any issues that you can think of?

Collaborator

I would wipe out the run_spec, as it can potentially be several MB per task. It's what we already do with the to_loggable method in worker_state_machine.py.

Member

@ian-r-rose you should be able to do this by grepping for `del self.tasks` in scheduler.py. Just before the `del` line we'd also `del ts.run_spec`.
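i.e., something like this wherever the task is forgotten (sketch only):

# Sketch: clear the (potentially large) run_spec just before the scheduler
# forgets the task, so anything still holding the TaskState stays small.
ts = self.tasks[key]
ts.run_spec = None  # or `del ts.run_spec`
del self.tasks[key]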

Member

I would also be fine just not tracking exceptions in this way and losing them when they go out of scope. I'm not sure that the benefit added by this page outweighs the extra state in the scheduler. My inclination would be to remove the deque entirely if that's possible.

Collaborator Author

> I would also be fine just not tracking exceptions in this way and losing them when they go out of scope. I'm not sure that the benefit added by this page outweighs the extra state in the scheduler. My inclination would be to remove the deque entirely if that's possible.

It's definitely possible (and not hard); I'm just not convinced that it results in a good user experience. To get really concrete, if I run the following:

import time

import dask
import distributed

cluster = distributed.LocalCluster()
client = distributed.Client(cluster)

@dask.delayed
def func(x):
    time.sleep(1)
    raise ValueError(f"Oops got {x}")
    
func(10).compute()

The exceptions are not reachable on the dashboard, because there are no futures hanging on to the task reference, so the tasks are very quickly cleaned up by the scheduler. This is true of basically any .compute() call that errors. In practice, I think that means that a lot of exceptions would not be visible, and it would not be possible to do any sort of forensics after the fact. Having a relatively short data structure that keeps them around for a bit longer results in a real UX improvement.

Member

My understanding is that @gjoseph92 is going to take a look at this and provide a tiebreaking vote

erred_tasks dict also hangs on to some references to task states, but we don't care about their (possibly large) run_specs
@crusaderky crusaderky requested review from gjoseph92, jacobtomlinson and mrocklin and removed request for crusaderky and fjetter July 8, 2022 15:15
Collaborator

@gjoseph92 gjoseph92 left a comment

> I'm hesitant to add more state to the scheduler for this

I agree. I find adding a dedicated SchedulerState field just for a dashboard to use to be a bit much.

I also don't like holding onto TaskState objects (even with their run_specs cleared), since they have references to other tasks, and we want to get better about not keeping long-lived reference cycles around: #6250.

In principle, I also don't like keeping exceptions around for tasks that have been released. On a long-lived cluster, or just when you're iterating quickly, this could be confusing and messy. However, losing the exceptions from every compute call is probably worse UX. Ideally, maybe we'd keep them around but at least have a "clear" button? That's not a pattern we support anywhere else on the dashboard, though, so I think the way you have it now is the best pattern to start with.

Additionally, implementing this without some extra state is not feasible. Single-source-of-truth-ing it would require iterating through all of scheduler.tasks. I don't think there's any state that could restrict this search; task groups don't link back to the tasks they contain, so that doesn't help.


Rather than the erred_tasks queue and holding TaskStates, what if you use a scheduler plugin, like the TaskStream does?

from __future__ import annotations

from collections import deque
from dataclasses import dataclass

import dask
from distributed.diagnostics.plugin import SchedulerPlugin
from distributed.scheduler import TaskState


@dataclass
class ErredTaskInfo:
    key: str
    erred_on: set[str]
    exception_text: str
    traceback_text: str


class ErredTasksPlugin(SchedulerPlugin):
    name = "erred-tasks"

    def __init__(self, scheduler):
        self.errors: deque[ErredTaskInfo] = deque(
            maxlen=dask.config.get(
                "distributed.scheduler.dashboard.erred-tasks.max-history"
            )
        )
        self.scheduler = scheduler

    def transition(self, key, start, finish, *args, **kwargs):
        if finish == "erred":
            ts: TaskState = self.scheduler.tasks[key]
            self.errors.append(
                ErredTaskInfo(key, ts.erred_on, ts.exception_text, ts.traceback_text)
            )

The downside is that errors won't be recorded until you've already opened the errors page on the dashboard, which is a bit annoying.
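One way around that (not done here) would be to register the plugin eagerly when the scheduler starts, rather than on first page load:

# Possible mitigation (hypothetical): add the plugin at scheduler startup so
# errors are captured even before the Exceptions page is ever opened.
scheduler.add_plugin(ErredTasksPlugin(scheduler))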

@ian-r-rose
Collaborator Author

> Rather than the erred_tasks queue and holding TaskStates, what if you use a scheduler plugin, like the TaskStream does?

I'd be okay with this. Agree that the "register on first load" behavior is potentially annoying and confusing to users, but perhaps not too bad.

@gjoseph92
Collaborator

I'd also be okay with SchedulerState.erred_tasks—since I bet that "open the dashboard first" UX will be really annoying—but storing something like ErredTaskInfo (instead of full TaskStates) in it.
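Concretely, something like this (sketch reusing the ErredTaskInfo dataclass from above; names illustrative):

# Sketch: in SchedulerState.__init__, keep a bounded deque of lightweight
# records instead of full TaskState objects.
self.erred_tasks: deque[ErredTaskInfo] = deque(
    maxlen=dask.config.get("distributed.diagnostics.erred-tasks.max-history")
)

# ...and when a task transitions to "erred", record only what the page needs:
self.erred_tasks.append(
    ErredTaskInfo(ts.key, set(ts.erred_on), ts.exception_text, ts.traceback_text)
)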

@ian-r-rose
Collaborator Author

> I'd also be okay with SchedulerState.erred_tasks—since I bet that "open the dashboard first" UX will be really annoying—but storing something like ErredTaskInfo (instead of full TaskStates) in it.

Also fine with me. If the main concern is rogue references, I think this would provide the best UX.

@ian-r-rose
Collaborator Author

I gave the latter suggestion a shot -- thoughts?

Collaborator

@gjoseph92 gjoseph92 left a comment

This looks good. I'm happy with the ErredTask compromise; I think it'll be worth it. Excited to have this!

@crusaderky crusaderky merged commit 25d140e into dask:main Jul 12, 2022
@ian-r-rose
Collaborator Author

Thanks for the reviews @mrocklin and @gjoseph92 and the merge, @crusaderky!


Development

Successfully merging this pull request may close these issues.

Dashboard page for failed tasks

5 participants