Ensure default client mechanism is threadsafe #5901
Conversation
Unit Test Results: 15 files (+2), 15 suites (+2), 8h 0m 18s ⏱️ (+1h 34m 45s). For more details on these failures, see this check. Results for commit 7fd04cf. ± Comparison against base commit a8a9a3f. ♻️ This comment has been updated with latest results.
Force-pushed from 23e5546 to b76dfb8.
Possibly related failure in …
Turns out the problem is that we're still hitting #2058 because #2066 was only a partial fix. We're not resolving addresses everywhere, and some internal checks are comparing …
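A minimal stdlib-only illustration of the kind of mismatch described above (the addresses are made up; presumably one side holds a user-supplied hostname while the other holds the resolved IP):

```python
# Illustration only, not code from this PR: a naive string comparison between a
# user-supplied address and a resolved address fails even though both refer to
# the same endpoint.
import socket

user_supplied = "tcp://localhost:8786"   # what the test/user passed to Client
resolved = "tcp://127.0.0.1:8786"        # what internal bookkeeping stores

print(user_supplied == resolved)         # False: naive comparison

# Resolving the host part first makes the comparison meaningful (on most
# systems "localhost" resolves to 127.0.0.1; it may also resolve to ::1).
host = user_supplied.split("://", 1)[1].rsplit(":", 1)[0]
normalised = f"tcp://{socket.gethostbyname(host)}:8786"
print(normalised == resolved)            # typically True
```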
distributed/client.py (Outdated)

```python
ReplayTaskClient(self)

def _set_default_configs(self):
    self._set_config_stack.push(
```
`dask.config.set.__enter__` is a no-op, so `push` and `enter_context` do the same thing here, but I think it's clearer to use `enter_context` anyway.
Suggested change:

```diff
- self._set_config_stack.push(
+ self._set_config_stack.enter_context(
```
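For reference, a small stdlib example of why the two are interchangeable when `__enter__` has no side effects: `enter_context` calls `__enter__` and registers `__exit__`, while `push` only registers `__exit__`.

```python
import contextlib


class NoopEnter:
    """Context manager whose __enter__ does nothing (as the comment above says
    is the case for dask.config.set)."""

    def __enter__(self):
        return self  # no side effect on enter

    def __exit__(self, *exc_info):
        print("cleanup ran")


with contextlib.ExitStack() as stack:
    stack.push(NoopEnter())            # __enter__ is never called
    stack.enter_context(NoopEnter())   # __enter__ is called, but it is a no-op
# both registered __exit__ callbacks run here -> "cleanup ran" twice
```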
distributed/client.py (Outdated)

```python
_default_event_handlers = {"print": _handle_print, "warn": _handle_warn}

_set_config_stack = contextlib.ExitStack()
```
this is a ClassVar, so there's a single global ExitStack at `distributed.client.Client._set_config_stack`, and `ExitStack` isn't threadsafe
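A short sketch of the hazard (illustrative only, not PR code): because the stack is shared class-level state, whichever client unwinds it first also tears down the config contexts entered on behalf of every other client, and `ExitStack` itself does not synchronise concurrent pushes.

```python
import contextlib

# Stands in for the class-level Client._set_config_stack
shared_stack = contextlib.ExitStack()


@contextlib.contextmanager
def config_for(client_name):
    print(f"set config for {client_name}")
    yield
    print(f"restore config for {client_name}")


shared_stack.enter_context(config_for("client A"))
shared_stack.enter_context(config_for("client B"))

# Client A closing and unwinding the shared stack also restores B's config:
shared_stack.close()
```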
maybe something like this?

```python
import threading

import dask


class _ConfigureDask:
    # Reference-counted wrapper around the default-client dask config
    def __init__(self):
        self._rlock = threading.RLock()
        self._count = 0

    def __enter__(self):
        with self._rlock:
            self._count += 1
            if self._count == 1:
                # First client in: apply the config override
                self._set = dask.config.set(
                    scheduler="dask.distributed", shuffle="tasks"
                )

    def __exit__(self, *exc_info):
        with self._rlock:
            self._count -= 1
            if self._count == 0:
                # Last client out: restore the previous config
                self._set.__exit__(None, None, None)
                del self._set


_configure_dask = _ConfigureDask()
```
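Hypothetical usage of the sketch above: each client would enter the shared `_configure_dask` context when it becomes a default client and exit it when it closes, so the config is applied once and only restored when the last client leaves.

```python
# Nested use from two clients (illustrative):
with _configure_dask:          # first client: count 0 -> 1, config applied
    with _configure_dask:      # second client: count 1 -> 2, config untouched
        pass
    # inner exit: count 2 -> 1, config stays in place
# outer exit: count 1 -> 0, previous config restored
```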
or maybe something like this:
```python
from __future__ import annotations

import threading
import weakref

import dask

# (assumes this lives in distributed/client.py, where Client is defined)


class _GlobalClientManager:
    def __init__(self):
        self._lock = threading.RLock()
        self._global_clients: weakref.WeakValueDictionary[
            int, Client
        ] = weakref.WeakValueDictionary()
        self._global_client_index = 0
        self._set = None

    def _get_global_client(self) -> Client | None:
        with self._lock:
            L = sorted(self._global_clients, reverse=True)
            for k in L:
                c = self._global_clients[k]
                if c.status != "closed":
                    return c
                else:
                    del self._global_clients[k]
            return None

    def _set_global_client(self, c: Client) -> None:
        with self._lock:
            if not self._set:
                # First default client: apply the config override
                self._set = dask.config.set(
                    scheduler="dask.distributed", shuffle="tasks"
                )
            self._global_clients[self._global_client_index] = c
            self._global_client_index += 1

    def _del_global_client(self, c: Client) -> None:
        with self._lock:
            for k in list(self._global_clients):
                try:
                    if self._global_clients[k] is c:
                        del self._global_clients[k]
                except KeyError:  # pragma: no cover
                    pass
            if not self._global_clients:
                # Last default client gone: restore the previous config
                self._set.__exit__(None, None, None)
                self._set = None


_global_client_manager = _GlobalClientManager()
_set_global_client = _global_client_manager._set_global_client
_get_global_client = _global_client_manager._get_global_client
_del_global_client = _global_client_manager._del_global_client
```

This would also fix #5772.
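Hypothetical wiring for context (illustrative only, not the actual `Client` class): a client registers itself with the manager when it becomes a default client and deregisters on close, so the manager owns the config lifetime.

```python
class _SketchClient:
    """Stand-in for distributed.Client, just to show the call pattern."""

    def __init__(self, set_as_default=True):
        self.status = "running"
        if set_as_default:
            _set_global_client(self)

    def close(self):
        self.status = "closed"
        _del_global_client(self)


c = _SketchClient()                 # config override applied here
assert _get_global_client() is c
c.close()                           # last default client: config restored
assert _get_global_client() is None
```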
I'm not sure what our appetite for metaclasses in the code base is, but I've personally liked using multitons for heavyweight resource objects that should be unique per-process for the arguments used to create them.
```python
import weakref
from threading import Lock


class ClientMultiton(type):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.__cache = weakref.WeakValueDictionary()
        self.__lock = Lock()

    def __call__(cls, *args, **kw):
        key = args + (frozenset(kw.items()),) if kw else args
        with cls.__lock:
            try:
                # Obtain existing instance for this key
                instance = cls.__cache[key]
            except KeyError:
                pass
            else:
                # Return instance if it's not closed
                if not instance.closed:
                    return instance
                # Delete the closed cache entry
                # and proceed with recreating the instance
                # for this key
                del cls.__cache[key]
            cls.__cache[key] = instance = type.__call__(cls, *args, **kw)
            return instance


class Client(metaclass=ClientMultiton):
    def __init__(self, address=None, timeout=100):
        self.address = address
        self.timeout = timeout
        self._closed = False
        self._lock = Lock()

    @property
    def closed(self):
        # Dangerous if self._closed can transition from False to True
        with self._lock:
            return self._closed

    def close(self):
        with self._lock:
            if self._closed:
                return
            print(f"Closing {self.address}")
            self._closed = True

    def __del__(self):
        self.close()


if __name__ == "__main__":
    def inner():
        c1 = Client(address="tcp://172.0.0.1")
        assert Client(address="tcp://172.0.0.1") is c1
        assert Client(address="tcp://172.0.0.1") is Client(address="tcp://172.0.0.1")
        addresses = list(sorted(c.address for c in Client._ClientMultiton__cache.values()))
        assert addresses == ["tcp://172.0.0.1"]

        c2 = Client(address="tcp://172.0.0.2")
        assert c2 is not c1
        addresses = list(sorted(c.address for c in Client._ClientMultiton__cache.values()))
        assert addresses == ["tcp://172.0.0.1", "tcp://172.0.0.2"]

        c1.close()
        addresses = list(sorted(c.address for c in Client._ClientMultiton__cache.values()))
        assert addresses == ["tcp://172.0.0.1", "tcp://172.0.0.2"]

        c3 = Client(address="tcp://172.0.0.1")
        assert c3 is not c1
        assert not c3.closed
        addresses = list(sorted(c.address for c in Client._ClientMultiton__cache.values()))
        assert addresses == ["tcp://172.0.0.1", "tcp://172.0.0.2"]

    inner()
    print("Done")
    assert len(Client._ClientMultiton__cache) == 0
```

One of the nice things about the above pattern is that one can defer to the garbage collector for cleanup by simply removing any references to the object (in a Future, for instance?).
> I'm not sure what our appetite for metaclasses in the code base is

Generally speaking, I'm not a huge fan of metaclasses, mostly because they are used very rarely and most people have a hard time understanding them.

However, your suggestion looks pretty appealing and I'm intrigued. I'll look into this and would consider using the metaclass if it helps us reduce code complexity in other areas.
This could help us clean up a bit of code around `get_client`, and particularly in `Worker._get_client` it would come in handy to rely on this multiton.

The multiton pattern itself does not help us deal with the default client mechanism. Under the assumption that multiple clients are allowed, the complexity around default clients would merely shift from this manager class to the metaclass.

We can't remove all of the default-client logic if we allow connections to multiple schedulers, or even allow multiple distinct clients to connect to the same scheduler. The latter is explicitly tested in a few tests.

Just having tests around is not necessarily an indication that this is a useful feature to have. For now, I don't see how this pattern would help us a lot without changing behaviour; I was hoping to stick with existing behaviour for now.

We should keep this in mind. I think having this would help in some cases and would allow us to write a cleaner handling of clients overall. I expect this to be a bigger effort than what I feel comfortable doing in this PR.
> We should keep this in mind. I think having this would help in some cases and would allow us to write a cleaner handling of clients overall. I expect this to be a bigger effort than what I feel comfortable doing in this PR.
There are also examples of a lazily initialised Multiton (with support for a finaliser) under a BSD3 license in the following locations
https://github.com/ratt-ru/dask-ms/blob/master/daskms/patterns.py
https://github.com/ratt-ru/dask-ms/blob/master/daskms/tests/test_patterns.py
```python
async def _start(self, timeout=no_default, **kwargs):
    self.status = "connecting"
    if self._set_as_default:
        _set_global_client(self)
```
I think moving `_set_global_client` into `async def _start` is worse, as this is likely to be called via `Client(asynchronous=False)` and so run in an off-main-thread event-loop thread.
Force-pushed from 52f2519 to 30aad83.
```python
    ],
)
@gen_cluster()
async def test_submit_different_names(s, a, b):
```
Despite being marked flaky recently, this test was solid before the changes proposed in this PR. This was achieved rather coincidentally, since Clients were already added to the global default_client dict after initialization, before they were actually started.

Running this test on main yields an exception pointing this out:

```
>       assert c_inner.status == "running"
E       AssertionError: assert 'newly-created' == 'running'
E         - running
E         + newly-created
```

What's happening on main is that:

1. As soon as the futures are deserialized on the worker, a `get_client("tcp://localhost:XXX")` is called in `Future.__setstate__`.
2. localhost is resolved as part of `get_client`.
3. Since the input address now matches the worker address, we call `Worker._get_client`.
4. `Worker._get_client`, however, does not match the default client (the one we created manually in the test); see distributed/worker.py, lines 4296 to 4306 at 30f0b60:

   ```python
   client.scheduler and client.scheduler.address == self.scheduler.address
   # The below conditions should only happen in case a second
   # cluster is alive, e.g. if a submitted task spawned its own
   # LocalCluster, see gh4565
   or (
       isinstance(client._start_arg, str)
       and client._start_arg == self.scheduler.address
       or isinstance(client._start_arg, Cluster)
       and client._start_arg.scheduler_address == self.scheduler.address
   )
   ```

5. It will create a new client using the worker RPC, which always uses the resolved address; see distributed/worker.py, lines 4314 to 4324 at 30f0b60:

   ```python
   self._client = Client(
       self.scheduler,
       loop=self.loop,
       security=self.security,
       set_as_default=True,
       asynchronous=asynchronous,
       direct_to_workers=True,
       name="worker",
       timeout=timeout,
   )
   Worker._initialized_clients.add(self._client)
   ```

6. This new client is set to `Worker._client` and is the new `default_client`. main only filters global clients for `!= closed`, not for `running`.
7. Since the new one is the default client, the compute call will get the same one, and it will always match addresses since it was created using the worker's scheduler RPC.
8. This Client is initialized using `asynchronous=True` but is not awaited. Therefore it is not actually up yet, which is why we receive the above `newly-created` status.

Diff to this branch:

- This branch requires a default client to be actually running. This leads us to initialize potentially many clients, causing this confusion.

An ideal world would (see the sketch below):

- Detect the default client as the manually created one using localhost
- Recognize that the localhost client is talking to the same scheduler
- Not initialize any further Clients
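A rough sketch of what the first two points could look like, assuming a resolver such as `distributed.comm.addressing.resolve_address` (which normalises the host part of an address) is applied to both sides before comparing:

```python
# Illustrative only: decide whether an existing client already talks to the
# scheduler by comparing normalised addresses rather than raw strings.
from distributed.comm.addressing import resolve_address


def talks_to_same_scheduler(client_address: str, scheduler_address: str) -> bool:
    return resolve_address(client_address) == resolve_address(scheduler_address)


# A client created against "tcp://localhost:8786" would then be recognised as
# the default client for a scheduler listening on "tcp://127.0.0.1:8786",
# and no further worker-side client would need to be initialized.
```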
I only barely skimmed this, but just wondering: how does this relate to #5485 / #5467 (comment)? Basically, does this also handle asyncio correctly? Or is this addressing an orthogonal problem? I guess I would have expected to see some contextvars in here, but I'm probably misunderstanding the problem this is solving.
It relates, but I believe it is a different problem. The original problem I intended to solve is not related to threading at all, actually. Tom just pointed out that there are also race conditions in a threading context, which is why I introduced the global client manager with a lock. I think the asyncio detection works reliably since we introduced the SyncMethodMixin (distributed/utils.py, lines 316 to 325 at 2d68dfc).
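For readers without the code open, the kind of detection the SyncMethodMixin relies on boils down to the standard asyncio question of whether the current thread is already running an event loop; a generic illustration, not the actual distributed code:

```python
import asyncio


def _inside_running_loop() -> bool:
    """True if called from a thread that is currently running an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


# A sync-or-async API can then either block (no loop running) or return a
# coroutine/awaitable (loop running) instead of deadlocking the caller's loop.
```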
Force-pushed from d16d28b to 7e19f2a.

Force-pushed from 7e19f2a to 1d8a357.
I believe the test failures are unrelated. @graingert, any further comments before I merge?
graingert left a comment
still some left-over `_set_config_stack`
Force-pushed from a2f5d9d to 662b5c6.

Force-pushed from 7fd04cf to 103db32.
Unit Test Results: See the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests. 18 files (±0), 18 suites (±0), 8h 14m 45s ⏱️ (-10m 9s). For more details on these failures, see this check. Results for commit caf75f5. ± Comparison against base commit 19deee3.
I see many failures in …
Indeed, `test_package_install_restarts_on_nanny` is related. It is starting a worker client, and this client blocks something during shutdown.
I encountered a problem where the default client and the respective config settings were not properly reset when multiple clients were used and they were not closed in the correct order.

For example, `test_cancel_multi_client` closes the clients in the same order they were started, but that would leave a global config set, e.g. `scheduler="dask.distributed"`. This led to leakage of state between tests, such that some tests would behave differently depending on which test was executed before them.
I noticed this happening in #5791 but have no idea why we're not seeing this on main. I added a pytest autouse fixture to ensure no tests are leaking global clients.
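A minimal sketch of what such a fixture could look like (the fixture name and the `distributed.client._global_clients` registry lookup are assumptions, not necessarily what this PR adds):

```python
import pytest

import distributed.client


@pytest.fixture(autouse=True)
def ensure_no_leaked_global_clients():
    yield
    # After each test, fail if any registered default client is still open.
    leaked = [
        c for c in distributed.client._global_clients.values()
        if c.status != "closed"
    ]
    assert not leaked, f"test leaked global clients: {leaked}"
```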
Closes #5772
Note: there are more issues about thread-safe `get_client` that are not addressed by this: #3827, #5467
Shortcoming / out of scope
If a user defines multiple default clients and modifies any of the keys `scheduler` or `shuffle` themselves, closing the last global/default client will revert the manual user setting to what it was before any Client was initialized. I'm fine with this since this should not happen.
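An illustration of the sequence described above (behaviour as described in this PR, not a recommended usage pattern):

```python
import dask
from distributed import Client

client = Client()                         # default client: sets scheduler="dask.distributed"
dask.config.set(scheduler="synchronous")  # manual user override while the client is alive
client.close()                            # closing the last default client restores the
                                          # pre-client value, discarding the manual override
print(dask.config.get("scheduler", None))
```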