
Conversation


@fjetter fjetter commented Mar 25, 2022

This is another problem that I encountered in #5910

Depending on whether a connection attempt happens to be in flight while the server is shutting down, this can cause failures, e.g. the attempt can get stuck because the CancelledError is not properly propagated.


The behaviour this PR suggests is to cancel all connection attempts once the CommPool is closed. Ideally, all tasks would already be stopped by the time this happens, but that is by no means guaranteed.
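A minimal sketch of the suggested behaviour (illustrative only; `MiniPool` and `_do_connect` are stand-in names, not distributed's actual API): the pool tracks in-flight connection tasks in a set and cancels them on close, so the cancellation reaches the caller instead of the attempt hanging forever.

```python
import asyncio


class MiniPool:
    """Toy stand-in for a connection pool; names are illustrative."""

    def __init__(self):
        self._connecting = set()
        self.closed = False

    async def connect(self):
        task = asyncio.create_task(self._do_connect())
        self._connecting.add(task)
        task.add_done_callback(self._connecting.discard)
        return await task

    async def _do_connect(self):
        await asyncio.sleep(3600)  # pretend the handshake hangs
        return "comm"

    async def close(self):
        self.closed = True
        # Cancel every in-flight connection attempt instead of letting it hang
        for task in list(self._connecting):
            task.cancel()
        await asyncio.gather(*self._connecting, return_exceptions=True)


async def main():
    pool = MiniPool()
    attempt = asyncio.create_task(pool.connect())
    await asyncio.sleep(0)  # let the attempt start
    await pool.close()
    try:
        await attempt
        return "connected"
    except asyncio.CancelledError:
        # The pool's cancellation propagates out to the caller
        return "cancelled"


print(asyncio.run(main()))
```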


github-actions bot commented Mar 25, 2022

Unit Test Results

- 18 files (+7), 18 suites (+7), 9h 29m 0s ⏱️ (+3h 48m 4s)
- 2 702 tests (+27): 2 618 ✔️ (+27), 81 💤 (−2), 3 failed (+2)
- 24 154 runs (+9 663): 22 863 ✔️ (+9 147), 1 288 💤 (+514), 3 failed (+2)

For more details on these failures, see this check.

Results for commit 88ca221. ± Comparison against base commit ed48736.

♻️ This comment has been updated with latest results.

@mrocklin
Member

It looks like there is one genuine test failure in `distributed/tests/test_client.py::test_reconnect_timeout`

) as logger:
    await s.close()
    while c.status != "closed":
        await c._update_scheduler_info()
@fjetter (Member Author)

I assume these lines should ensure that this particular call stays silent while the scheduler is closing. I don't think this is how we should test it.
These tests failed because `_update_scheduler_info` is basically an RPC which is cancelled since the client is closing. The cancellation then propagates to the test itself and makes it fail with what appears to be a TimeoutError.
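The failure mode described here boils down to a small example (illustrative only; `rpc_call` is a stand-in for the real `_update_scheduler_info` RPC): cancelling a task that the test is awaiting propagates the CancelledError to the awaiting caller, which a test harness with a timeout then reports as an apparent TimeoutError.

```python
import asyncio


async def rpc_call():
    # Stand-in for an RPC waiting on the scheduler
    await asyncio.sleep(3600)


async def test_like_caller():
    task = asyncio.create_task(rpc_call())
    await asyncio.sleep(0)  # let the RPC start
    task.cancel()  # the closing client cancels its in-flight RPCs
    try:
        await task  # the cancellation surfaces at the caller's await
    except asyncio.CancelledError:
        return "cancellation propagated to caller"


print(asyncio.run(test_like_caller()))
```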


fjetter commented Mar 29, 2022

cc @graingert, I would appreciate a review

@fjetter fjetter requested a review from graingert March 29, 2022 12:34
)

-fut = asyncio.create_task(
+task = asyncio.create_task(
@graingert (Member)
why are you tracking the pending and connecting tasks separately?

can it be simplified to this?

async def _connect():
    async with self.semaphore:
        if self.status != Status.running:
            raise CommClosedError(
                f"ConnectionPool not running. Status: {self.status}"
            )
        comm = await connect(
            addr,
            timeout=timeout or self.timeout,
            deserialize=self.deserialize,
            **self.connection_args,
        )

        comm.name = "ConnectionPool"
        comm._pool = weakref.ref(self)
        comm.allow_offload = self.allow_offload
        self._created.add(comm)
        occupied.add(comm)

        return comm

task = asyncio.create_task(_connect())
self._connecting.add(task)
task.add_done_callback(self._connecting.discard)
return await task

@fjetter (Member Author)

This would only limit the concurrency of simultaneous connection attempts. The implementation tries to ensure that the number of connection attempts plus the number of open connections is bounded.
See #3005
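That invariant can be sketched like this (hypothetical `BoundedPool`; not the real implementation): the semaphore slot is released only when a connection is retired, not when `connect()` returns, so attempts in flight plus open connections never exceed the limit.

```python
import asyncio


class BoundedPool:
    """Illustrative sketch: the semaphore bounds attempts-in-flight
    *plus* open connections, so it is released when a comm is retired,
    not when connect() returns."""

    def __init__(self, limit=2):
        self.semaphore = asyncio.Semaphore(limit)
        self.open = set()

    async def connect(self):
        await self.semaphore.acquire()
        try:
            comm = object()  # stand-in for the real handshake
            self.open.add(comm)
            return comm
        except BaseException:
            self.semaphore.release()  # free the slot only on failure
            raise

    def retire(self, comm):
        self.open.discard(comm)
        self.semaphore.release()  # capacity freed once the comm is gone


async def demo():
    pool = BoundedPool(limit=1)
    comm = await pool.connect()
    # A second attempt must now wait until `comm` is retired
    second = asyncio.create_task(pool.connect())
    await asyncio.sleep(0)
    blocked = not second.done()
    pool.retire(comm)
    await second
    return blocked


print(asyncio.run(demo()))
```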

@fjetter (Member Author)

I think we can merge the two sets, though

@graingert (Member)

oh I see so more like this?

async def _connect():
    await self.semaphore.acquire()
    try:
        if self.status != Status.running:
            raise CommClosedError(
                f"ConnectionPool not running. Status: {self.status}"
            )
        comm = await connect(
            addr,
            timeout=timeout or self.timeout,
            deserialize=self.deserialize,
            **self.connection_args,
        )

        comm.name = "ConnectionPool"
        comm._pool = weakref.ref(self)
        comm.allow_offload = self.allow_offload
        self._created.add(comm)
        occupied.add(comm)
        return comm
    except BaseException:
        # Release only on failure; on success the slot is held
        # until the comm is retired
        self.semaphore.release()
        raise

task = asyncio.create_task(_connect())
self._connecting.add(task)
task.add_done_callback(self._connecting.discard)
return await task

@fjetter (Member Author)

Actually, I was wrong and we need to split these two. There was a bug in there before

@fjetter (Member Author)

The question is what should happen if there are pending tasks (i.e. tasks still waiting to acquire the semaphore) while the pool is closing. We should ensure that the behaviour for an outside caller is the same regardless of whether their attempt is pending or already connecting.
Previously, the behaviour was to raise a CommClosedError. Now we are cancelling the connect. What would you expect to happen in this situation?

If we want to go for a CommClosedError or similar, like before, that's probably possible but will require a bit more work.

@fjetter fjetter force-pushed the do_not_catch_cancellederrro_commpool branch from 0a5cf4e to 41b3f67 Compare March 29, 2022 15:23
fjetter and others added 2 commits March 29, 2022 17:41
@fjetter fjetter self-assigned this Mar 29, 2022
@fjetter
Copy link
Member Author

fjetter commented Mar 30, 2022

I had a chat with @graingert and we believe the previous behaviour of raising a CommClosedError in case of a pool shutdown is the preferred behaviour. We'll need to make sure that it does not intercept a genuine CancelledError, though.
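A sketch of that agreed behaviour, under assumed names (`Pool` and `demo` are illustrative, not the real ConnectionPool): a cancellation caused by pool shutdown is translated into CommClosedError, while a cancellation from any other source is re-raised untouched.

```python
import asyncio


class CommClosedError(Exception):
    pass


class Pool:
    """Sketch: shutdown turns pending attempts into CommClosedError;
    any other cancellation still surfaces as CancelledError."""

    def __init__(self):
        self.running = True
        self._connecting = set()

    async def connect(self):
        task = asyncio.create_task(asyncio.sleep(3600))  # stand-in attempt
        self._connecting.add(task)
        task.add_done_callback(self._connecting.discard)
        try:
            return await task
        except asyncio.CancelledError:
            if not self.running:
                # Shutdown cancelled us: translate, don't swallow
                raise CommClosedError("ConnectionPool closed") from None
            raise  # genuine cancellation must propagate

    def close(self):
        self.running = False
        for t in list(self._connecting):
            t.cancel()


async def demo():
    pool = Pool()
    attempt = asyncio.create_task(pool.connect())
    await asyncio.sleep(0)  # let the attempt start
    pool.close()
    try:
        await attempt
        return "connected"
    except CommClosedError:
        return "comm closed"


print(asyncio.run(demo()))
```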

     self._ongoing_coroutines.add(result)
     result = await result
-except (CommClosedError, asyncio.CancelledError):
+except CommClosedError:
@fjetter (Member Author)

I'm not entirely sure what's supposed to happen here tbh

@fjetter fjetter requested a review from graingert March 30, 2022 10:27
@fjetter fjetter merged commit a8a9a3f into dask:main Apr 1, 2022
@fjetter fjetter deleted the do_not_catch_cancellederrro_commpool branch April 1, 2022 09:39
fjetter added a commit to fjetter/distributed that referenced this pull request Apr 10, 2022