
Conversation

@fjetter
Member

@fjetter fjetter commented Sep 10, 2021

This uses the event forwarding system introduced in #5217 to forward logging information. Since tornado typically logs uncaught exceptions, this gives us a cheap way to forward exceptions to users and/or keep them on the scheduler for dashboard visualizations.

Closes #5184

Builds on top of #5217

@fjetter fjetter force-pushed the error_logs_event_system branch from 382d468 to c86e0af on September 13, 2021 09:54
Comment on lines +90 to +92
loggers_to_register = dask.config.get("distributed.admin.log-events")

for logger_name, level in loggers_to_register.items():
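For context, here is a plausible way to populate this config key, given that the loop above expects a mapping of logger name to level. The specific logger names and levels are illustrative assumptions, not taken from this PR:

import dask

# Hypothetical values: any logger-name -> level mapping fits the loop above.
dask.config.set(
    {
        "distributed.admin.log-events": {
            "distributed.core": "error",
            "tornado.application": "error",
        }
    }
)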
Member


Since each Node subclass (e.g. Worker) calls self._setup_logging(logger) with its own logger as input:

self._setup_logging(logger)

I think we want to set up an EventHandler only for the specific logger that's passed to _setup_logging, instead of looping through them all each time _setup_logging is called (see the sketch below).
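A minimal sketch of this suggestion, mirroring the deque-handler pattern used later in this method. The EventHandler stand-in and its constructor argument are assumptions, not the PR's actual implementation:

import logging
import weakref

class EventHandler(logging.Handler):
    # Stand-in for the handler this PR introduces; the real one would
    # forward records via the node's log_event mechanism.
    def __init__(self, node):
        super().__init__()
        self.node = node

    def emit(self, record: logging.LogRecord) -> None:
        pass  # forward `record` as an event

# Method sketch: attach a handler only to the logger that was passed in,
# instead of looping over every configured logger on each call.
def _setup_logging(self, logger):
    handler = EventHandler(self)
    logger.addHandler(handler)
    weakref.finalize(self, logger.removeHandler, handler)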

Member Author


This is currently very awkward, but the loggers passed to this method are only distributed.scheduler and distributed.worker. Most interesting exception logs actually pop up in distributed.core and tornado.application.

The proper way to do this would be to attach the EventHandler to the distributed logger and ensure that all loggers we're using propagate their messages (https://docs.python.org/3.8/library/logging.html#logging.Logger.propagate). Propagation is the stdlib default, but it is user configuration, and we are overriding it for some reason in the "old style" logging config:

logger.propagate = False

Without that guarantee, the only sane way I see is to list the registered loggers explicitly. The same goes for the deque handler, by the way, but I haven't touched it yet.
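A minimal sketch of that "proper way", with a stand-in EventHandler (the real handler's interface belongs to this PR and is not shown here):

import logging

class EventHandler(logging.Handler):
    # Stand-in: the real handler would forward records as events.
    def emit(self, record: logging.LogRecord) -> None:
        print("would forward:", record.name, record.getMessage())

# Attach the handler once, to the parent logger...
logging.getLogger("distributed").addHandler(EventHandler())

# ...and records from any child logger reach it, as long as propagation
# stays enabled (the stdlib default that the "old style" config disables):
core = logging.getLogger("distributed.core")
core.propagate = True   # undo any config that set this to False
core.error("boom")      # handled by the parent's EventHandler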

@graingert graingert force-pushed the error_logs_event_system branch from e8b7047 to 7f6714f on May 10, 2023 14:26
@github-actions
Contributor

github-actions bot commented May 10, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

   20 files ±0      20 suites ±0      11h 9m 51s ⏱️ +2m 5s
3 706 tests +2    3 596 ✔️ −2      106 💤 ±0    4 ❌ +4
35 846 runs +20   34 094 ✔️ +16   1 748 💤 ±0   4 ❌ +4

For more details on these failures, see this check.

Results for commit 3af381b. ± Comparison against base commit 566fd1f.

♻️ This comment has been updated with latest results.

graingert added 3 commits May 11, 2023 13:44
Checking asyncio.get_running_loop() is worker.loop.asyncio_loop is a more
reliable way of determining whether we are on the worker's event loop thread.

This also binds self.loop early so that log_event can be called
immediately after _setup_logging().
@graingert graingert force-pushed the error_logs_event_system branch from 37e4584 to 03b0a5d on June 28, 2023 14:25
@graingert
Copy link
Member

@fjetter this is working now. Can you review the changes I've added, especially the use of asyncio.get_running_loop() is worker.loop.asyncio_loop instead of worker.thread_id == threading.get_ident()?

try:
    asyncio_loop = asyncio.get_running_loop()
except RuntimeError:
    return False
return self._worker.loop.asyncio_loop is asyncio_loop
Member


This wasn't covered either; that's odd.

Member


I don't think this is possible to cover: if the Actor binds self._worker, then it also has self._client bound.

Member


This might be worthwhile simplifying then, but that's out of scope for this PR.
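For illustration, the two checks side by side (worker stands in for the node under discussion; worker.loop.asyncio_loop and worker.thread_id are the attributes referenced above):

import asyncio
import threading

def on_worker_event_loop(worker) -> bool:
    # Ask "is the loop currently running in this thread the worker's loop?"
    try:
        running = asyncio.get_running_loop()
    except RuntimeError:  # no event loop running in this thread
        return False
    return worker.loop.asyncio_loop is running

def on_worker_thread(worker) -> bool:
    # The older check: compares thread idents, which can mislead when the
    # recorded thread id is stale or multiple loops run in one thread.
    return worker.thread_id == threading.get_ident()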

@graingert graingert requested a review from jrbourbeau July 26, 2023 10:44
@hendrikmakait hendrikmakait self-requested a review July 28, 2023 08:37
Member

@hendrikmakait hendrikmakait left a comment


This code generally looks good to me. I have some small nits, and I would like us to test that the configuration works as expected.

Deadline,
DequeHandler,
EventHandler,
TimeoutError,
Member


Why is the import of TimeoutError needed?


class SupportsLogEvent(Protocol):
    def log_event(self, topic: list[str], msg: str) -> object:
        ...
Member


Suggested change
...
... # pragma: nocover


logger.addHandler(self._deque_handler)
weakref.finalize(self, logger.removeHandler, self._deque_handler)

loggers_to_register = dask.config.get("distributed.admin.log-events")
Member


I think it would be good to have another test that checks that the configuration works as expected (e.g., changing the configuration will result in some log being propagated or not).
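For illustration, such a test might look roughly like this. gen_cluster and its config= parameter are real distributed test utilities, but whether and where a forwarded record shows up in Scheduler.events depends on this PR's implementation, so the assertion is an assumption:

import logging

from distributed.utils_test import gen_cluster

@gen_cluster(
    client=True,
    config={"distributed.admin.log-events": {"distributed.core": "error"}},
)
async def test_log_events_config(c, s, a, b):
    # Emit an error on a configured logger...
    logging.getLogger("distributed.core").error("induced error")
    # ...and expect it to surface in the scheduler's event topics. A real
    # test would likely need to poll, since the forwarding is asynchronous.
    assert any(
        "induced error" in str(event)
        for events in s.events.values()
        for event in events
    )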

@fjetter fjetter closed this Jul 10, 2025


Development

Successfully merging this pull request may close these issues.

[Idea/Draft/Proposal] Exception handling for server exceptions

4 participants