Error logs in event system #5307
Conversation
Force-pushed 382d468 to c86e0af
```python
loggers_to_register = dask.config.get("distributed.admin.log-events")

for logger_name, level in loggers_to_register.items():
```
Since each Node subclass (e.g. Worker) calls self._setup_logging(logger) with its own logger as input (distributed/worker.py, line 523 in 05677bb):

```python
self._setup_logging(logger)
```

I think we want to set up an EventHandler only for the specific logger that's passed to _setup_logging, instead of looping through them all each time _setup_logging is called.
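A minimal sketch of the scoping suggested above, using a stub capturing handler rather than the PR's real EventHandler class (all names here are illustrative):

```python
import logging

def _setup_logging(logger, level=logging.ERROR):
    """Sketch: attach a single capturing handler to exactly the
    logger passed in, instead of looping over all configured names."""
    captured = []

    class _CaptureHandler(logging.Handler):
        def emit(self, record):
            captured.append(record.getMessage())

    logger.addHandler(_CaptureHandler(level))
    return captured

worker_logger = logging.getLogger("distributed.worker")
events = _setup_logging(worker_logger)

worker_logger.error("task failed")
# A sibling logger is untouched, since the handler is scoped to the input logger:
logging.getLogger("distributed.scheduler").error("not captured here")
print(events)  # ['task failed']
```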
This is currently very awkward, but the loggers passed to this method are only distributed.scheduler and distributed.worker. Most interesting exception logs actually pop up in distributed.core and tornado.application.

The proper way to do this would be to attach the EventHandler to the logger distributed and ensure that all loggers we're using propagate their messages (https://docs.python.org/3.8/library/logging.html#logging.Logger.propagate). Propagation is the stdlib default, but it is user configuration, and we are overwriting it for some reason for "old style" logging config (distributed/config.py, line 101 in 5c3eacd):

```python
logger.propagate = False
```

Without this knowledge, the only sane way I see is to list the registered loggers explicitly. The same goes for the deque handler, btw, but I didn't touch it yet.
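The propagation behaviour described above can be demonstrated with the stdlib alone: a handler on the parent logger sees records from child loggers unless a child sets propagate = False (logger names below are illustrative):

```python
import logging

# Records captured on the parent "distributed" logger via propagation.
captured = []

class ListHandler(logging.Handler):
    def emit(self, record):
        captured.append((record.name, record.getMessage()))

parent = logging.getLogger("distributed")
parent.addHandler(ListHandler())

core = logging.getLogger("distributed.core")
core.error("from core")          # propagates up to "distributed" by default

worker = logging.getLogger("distributed.worker")
worker.propagate = False         # what the "old style" config does
worker.error("from worker")      # never reaches the parent's handler

print(captured)  # [('distributed.core', 'from core')]
```

This is why attaching a single EventHandler to the parent logger only works if no child logger has had propagation switched off.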
Force-pushed e8b7047 to 7f6714f
Unit Test Results — see test report for an extended history of previous test failures; this is useful for diagnosing flaky tests. 20 files ±0, 20 suites ±0, 11h 9m 51s ⏱️ +2m 5s. For more details on these failures, see this check. Results for commit 3af381b; comparison against base commit 566fd1f. ♻️ This comment has been updated with latest results.
Checking asyncio.get_running_loop() is worker.loop.asyncio_loop is a more reliable way of determining whether we are on the worker's event loop thread. This also binds self.loop early, so that log_event can be called immediately after _setup_logging().
Force-pushed 37e4584 to 03b0a5d
@fjetter this is working now - can you review the changes I've added, especially around the use of
```python
try:
    asyncio_loop = asyncio.get_running_loop()
except RuntimeError:
    return False
return self._worker.loop.asyncio_loop is asyncio_loop
```
This wasn't covered either; that's odd.
I don't think this is possible to cover: if the Actor binds self._worker, then it also has self._client bound.
This might be worthwhile simplifying then, but that's out of scope for this PR.
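The loop-identity check in the snippet above can be exercised in isolation; `on_event_loop` below is a hypothetical helper mirroring its logic:

```python
import asyncio

def on_event_loop(target_loop):
    """Return True iff the caller is running on target_loop (hypothetical helper)."""
    try:
        running = asyncio.get_running_loop()
    except RuntimeError:  # no event loop running in this thread
        return False
    return running is target_loop

async def main():
    # Inside the loop, get_running_loop() returns the very same loop object.
    return on_event_loop(asyncio.get_running_loop())

print(asyncio.run(main()))                        # True: called from inside the loop
print(on_event_loop(asyncio.new_event_loop()))    # False: no loop running here
```

The identity comparison (`is`) rather than equality is what makes this robust: two distinct event loops are never the same object.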
hendrikmakait left a comment:
This code generally looks good to me. I have some small nits, and I would like us to test that the configuration works as expected.
```python
Deadline,
DequeHandler,
EventHandler,
TimeoutError,
```
Why is the import of TimeoutError needed?
```python
class SupportsLogEvent(Protocol):
    def log_event(self, topic: list[str], msg: str) -> object:
        ...
```
Suggested change:

```diff
-        ...
+        ...  # pragma: nocover
```
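For context, a Protocol like SupportsLogEvent is satisfied structurally: any object with a matching log_event method conforms, without inheriting from the protocol. A small sketch (FakeWorker is an illustrative stand-in, not from the PR):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsLogEvent(Protocol):
    def log_event(self, topic: list[str], msg: str) -> object:
        ...

class FakeWorker:
    """Satisfies the protocol structurally; no inheritance needed."""
    def __init__(self):
        self.seen = []

    def log_event(self, topic: list[str], msg: str) -> object:
        self.seen.append((topic, msg))
        return None

w = FakeWorker()
print(isinstance(w, SupportsLogEvent))  # True: checked by method presence only
w.log_event(["error-logs"], "something failed")
```

Since the protocol body is never executed at runtime, the `...` line is exactly the kind of statement a `# pragma: nocover` marker excludes from coverage reports.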
```python
logger.addHandler(self._deque_handler)
weakref.finalize(self, logger.removeHandler, self._deque_handler)

loggers_to_register = dask.config.get("distributed.admin.log-events")
```
I think it would be good to have another test that checks that the configuration works as expected (e.g., changing the configuration will result in some log being propagated or not).
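The behaviour such a test would assert can be sketched with the stdlib alone, standing in for the dask config lookup with a plain dict (setup_event_logging and the "app.*" logger names are hypothetical):

```python
import logging

def setup_event_logging(config, emit):
    """Sketch: register one handler per entry of a config mapping
    {logger_name: level}, mimicking distributed.admin.log-events."""
    for name, level in config.items():
        handler = logging.Handler()
        # Forward each record's logger name and message to the emit callback.
        handler.emit = lambda record, _emit=emit: _emit(record.name, record.getMessage())
        handler.setLevel(level)  # accepts level names like "ERROR"
        logging.getLogger(name).addHandler(handler)

events = []
setup_event_logging({"app.core": "ERROR"}, lambda name, msg: events.append((name, msg)))

logging.getLogger("app.core").error("captured")    # listed in the config -> forwarded
logging.getLogger("app.other").error("ignored")    # not listed -> dropped
print(events)  # [('app.core', 'captured')]
```

A real test in the PR would instead use dask.config to change the setting and then assert on the events recorded by the scheduler.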
This uses the event forwarding system introduced in #5217 to allow forwarding of logging information. Since tornado typically logs uncaught exceptions, this gives us a cheap way to forward exceptions to users and/or keep them on the scheduler for dashboard visualizations.
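The core idea can be sketched in a few lines: a logging.Handler that pushes error-level records into an event system through a log_event(topic, msg) callable (ForwardingHandler and the callback wiring below are illustrative, not the PR's actual code):

```python
import logging

class ForwardingHandler(logging.Handler):
    """Sketch: forward error-level log records into an event system
    via a log_event(topic, msg) callable."""
    def __init__(self, log_event, topic="error-logs"):
        super().__init__(level=logging.ERROR)
        self._log_event = log_event
        self._topic = topic

    def emit(self, record):
        # Topic carries both the configured name and the originating logger,
        # so consumers can filter events per logger.
        self._log_event([self._topic, record.name], self.format(record))

events = []
handler = ForwardingHandler(lambda topic, msg: events.append((tuple(topic), msg)))
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

# Attach to the logger where tornado reports uncaught callback exceptions.
log = logging.getLogger("tornado.application")
log.addHandler(handler)
log.error("uncaught exception in callback")
print(events)
```

In the PR, the callback would be the node's log_event, so the records end up on the scheduler like any other forwarded event.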
Closes #5184
Builds on top of #5217