-
-
Notifications
You must be signed in to change notification settings - Fork 748
Ensure exceptions are raised if workers are incorrectly started #4733
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jrbourbeau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fjetter -- this would close #4640 (cc @gjoseph92)
distributed/core.py
Outdated
| if kwargs: | ||
| logger.warning("Provided unused arguments %s", list(kwargs.keys())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I think the call to super().__init__ was intentional -- see https://github.com/dask/distributed/pull/4365/files/2212edc11b424c17f2f656eb73b2ca206742705a#r557630488 for an earlier discussion on this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multiple inheritance is a wonderful monstrosity :D
In [1]: class A:
...: def __init__(self, a, **kwargs):
...: print(f"A: {kwargs}")
...:
In [2]: class B(A):
...: def __init__(self, b, **kwargs):
...: print(f"B: {kwargs}")
...: super().__init__(**kwargs)
...:
In [3]: class C(A):
...: def __init__(self, c, **kwargs):
...: print(f"C: {kwargs}")
...: super().__init__(**kwargs)
...:
In [4]: class D(B, C):
...: def __init__(self, d, **kwargs):
...: print(f"D: {kwargs}")
...: super().__init__(**kwargs)
...:
In [5]: D.mro()
Out[5]: [__main__.D, __main__.B, __main__.C, __main__.A, object]
In [6]: D(a="a", b="b", c="c", d="d")
D: {'a': 'a', 'b': 'b', 'c': 'c'}
B: {'a': 'a', 'c': 'c'}
C: {'a': 'a'}
A: {}kwargs will trickle down the mro. therefore we need to allow kwargs everwhere. This is true for all classes but the root. the root cannot call super with kwargs since object doesn't allow arguments. afaik, the init on object is a no-op
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that makes sense. IIUC the call to super was added here to make us more resilient to the inheritance order. For example, today Scheduler has the following class inheritance structure
distributed/distributed/scheduler.py
Line 3169 in 42e3f22
| class Scheduler(SchedulerState, ServerNode): |
So Server is at the bottom of the mro. Though I don't think we're strictly enforcing this anywhere. Perhaps we should though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could consider adding a test like this (untested):
def test_server_inheritance():
subclasses = Server.__subclasses__()
for subcls in subclasses:
assert subcls.__mro__[-2:] == [Server, object], f"Server must always be the last class in the MRO. {subcls} has MRO: {subcls.__mro__}"Obviously this would depend on all Server subclasses being imported at the time the test runs.
Alternatively, if we wanted to over-engineer it a bit, we could use __init_subclass__ (added in 3.6) to enforce this. I've done something like this before here. Not sure how it would play with Cython though?
gjoseph92
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @fjetter. I agree with you that removing super().__init__ is fine, since Server is always at the root of the MRO—so long as we don't significantly change the inheritance structure in the future.
|
If we're removing the |
Sure. I was assuming we had this to allow for easier forward compatible code but I would prefer to be strict as well. |
|
I removed the kwargs entirely. I was about to delete the tests but discovered this interesting edge case we didn't properly cover. If the local cluster is provided with foo, it is passed to the worker. If one uses a nanny this is entirely swallowed and the nanny simply doesn't come up. I'm not even sure if an exception was logged... |
694ee79 to
8d3e028
Compare
jrbourbeau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @jakirkham for visibility
|
One thing which came up in two comments is to add test about whether However, if people feel this is useful, I don't mind adding a test, that should be easy enough |
|
There are some related CI failures on OSX I'll need to check out :( (They are not Cython related) |
|
The failing test was an annoying one. There is a race condition between the process.on_exit / mark_stopped and the init_q exception forwarding/reraising. If the process is cleaned up too fast, the status of the nanny is set to closed and queue is cleaned up faster than it is able to read the message and raise the exception. That was also the cause for the other flaky test. I don't like my solution but it is a pragmatic one. Everything else I can come up with would either require more sync primitives or some bigger refactoring. If somebody smarter has a good idea, I'm all ears. |
5351389 to
50f3bd1
Compare
| ): # pragma: no cover | ||
| os.environ.update(env) | ||
| dask.config.set(config) | ||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This big diff is mostly wrapping the function in a try/except
| await worker.close( | ||
| report=True, | ||
| nanny=False, | ||
| safe=True, # TODO: Graceful or not? | ||
| executor_wait=executor_wait, | ||
| timeout=timeout, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is new
| # If we hit an exception here we need to wait for a least | ||
| # one interval for the outside to pick up this message. | ||
| # Otherwise we arrive in a race condition where the process | ||
| # cleanup wipes the queue before the exception can be | ||
| # properly handled. See also | ||
| # WorkerProcess._wait_until_connected (the 2 is for good | ||
| # measure) | ||
| sync_sleep(cls._init_msg_interval * 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This resolves the race condition in test_worker_start_exception
| init_result_q.close() | ||
| # If we hit an exception here we need to wait for a least one | ||
| # interval for the outside to pick up this message. Otherwise we | ||
| # arrive in a race condition where the process cleanup wipes the | ||
| # queue before the exception can be properly handled. See also | ||
| # WorkerProcess._wait_until_connected (the 2 is for good measure) | ||
| sync_sleep(cls._init_msg_interval * 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This resolves the race condition in test_failure_during_worker_initialization
|
This PR escalated pretty drastically, considering I simply wanted to cleanup the signature of a class 😅 I therefore changed the title to the more significant change |
50f3bd1 to
31fdfbd
Compare
|
If there are no further objections, I'd like to merge this. Friendly ping @jakirkham |
|
I haven't looked at the Nanny or Worker code, but the constructor change seems ok to me. Updated that thread and marked resolved |
jrbourbeau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fjetter! This is in
Serverinherits fromobjectwhich has no sensible initialiser. In particular, when provided with kwargs, it raises an obscureTypeErrorTypeError: object.__init__() takes exactly one argument (the instance to initialize)Instead of raising, this would issue a warning indicating that some arguments were not used