@fjetter
Member

@fjetter fjetter commented Nov 9, 2020

Disclaimer: I don't have data to back this up, only a gut feeling.

We're still seeing connection timeouts (#4080), even after the removal of the fixed handshake timeouts in #4176. In that PR we already discussed increasing the default timeout, since the connect timeout now has to cover not only the actual TCP (or other transport) connect but also the handshake.

What makes this worse is that, because of DNS instabilities, it was requested that the connect be split into multiple small attempts instead of one large attempt (#3104).

Putting everything together, I believe the initial default of 10s should be increased to counteract the added complexity of this operation. I don't have a good sense of an appropriate value here. In #4176 there was a suggestion to increase it to 30s, but I could also see higher values making sense.
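To illustrate why a single budget gets tight, here is a minimal sketch of the pattern described above: one overall timeout shared by several short attempts, where each attempt has to cover both the transport connect and the handshake. It is not the code from distributed. `connect_with_budget`, `connect_once`, and all parameter names and default values are hypothetical and chosen only for illustration.

```python
import asyncio
import time


async def connect_with_budget(connect_once, total_timeout,
                              first_attempt=0.5, backoff=1.5, retry_delay=0.01):
    """Call connect_once() repeatedly until it succeeds or the budget runs out.

    connect_once is a hypothetical coroutine function that performs the
    transport connect *and* the handshake, so both share the same budget.
    """
    deadline = time.monotonic() + total_timeout
    attempt_timeout = first_attempt
    attempts = 0
    last_error = None
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        attempts += 1
        try:
            # Each attempt gets a small slice, never more than what is left.
            return await asyncio.wait_for(
                connect_once(), timeout=min(attempt_timeout, remaining)
            )
        except (asyncio.TimeoutError, OSError) as exc:
            last_error = exc
            # Give later attempts a larger slice of the remaining budget.
            attempt_timeout *= backoff
            # Short pause so immediate failures do not turn into a busy loop.
            pause = min(retry_delay, max(0.0, deadline - time.monotonic()))
            await asyncio.sleep(pause)
    # Chain the last underlying error so the real cause stays visible.
    raise OSError(
        f"Timed out after {attempts} attempts in {total_timeout} s"
    ) from last_error
```

With a scheme like this, a slow DNS lookup or a slow handshake in any one attempt eats into the time left for all later attempts, which is one reason a total of 10s can be too small.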

cc @jcrist

@jcrist
Member

jcrist commented Nov 9, 2020

Increasing the default to 30s makes sense to me. Would like to get a 👍 from another maintainer before merging though.

@fjetter
Member Author

fjetter commented Nov 9, 2020

I just checked, we've been running with a 60s default in our infrastructure at BY for about a year now. I think we set it up this way back then since we were debugging some weird dead comm failures and we never changed it back.

Anyhow, I just wanted to add a real-world data point here. Timeouts are sometimes a sensitive, infrastructure-dependent issue, and 60s may be good for us but too much for others. Either way, I believe 10s is too small :)

@quasiben
Member

quasiben commented Nov 9, 2020

On large benchmark runs we have set the timeout somewhat arbitrarily to 100s. Increasing to 30 or even 60 would be fine with me.

@iyawnis

iyawnis commented Nov 12, 2020

I am not sure if my issue is related to this, but I am seeing the following error about 2-3 seconds after attempting to connect:
OSError: Timed out during handshake while connecting to tcp://scheduler:7777 after 10 s
Passing a timeout kwarg to Client changes the error string, but the error still arrives 2-3 seconds after client instantiation. distributed 2.30.1, dask 2.30

@iyawnis

iyawnis commented Nov 12, 2020

Just confirmed that I can connect if I change the distributed version to 2.30.0 for the Client only (without changing the scheduler).

@fjetter
Member Author

fjetter commented Nov 18, 2020

@latusaki I think your report is unrelated to the actual value of the timeout, which is what is being discussed here. Can you open another ticket for this? At the very least, the exception message might be misleading, since the failure is probably caused by something other than the timeout. In particular, the entire traceback would be interesting, since the exception cause should usually also be logged, which should reveal the actual exception.
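For readers unfamiliar with exception causes, the following is a small, generic Python sketch of how to walk from a top-level error to the exception underneath it. It relies only on standard exception chaining (`__cause__` and `__context__`) and makes no claim about what distributed itself logs. `describe_failure` is a hypothetical helper name.

```python
def describe_failure(exc):
    """Return the chain of exceptions behind exc, outermost first."""
    chain = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        chain.append(f"{type(exc).__name__}: {exc}")
        # __cause__ is set by "raise ... from ...";
        # __context__ is set when an exception is raised while handling another.
        exc = exc.__cause__ or exc.__context__
    return chain
```

Calling `describe_failure(e)` in an `except OSError as e:` block around the connect call and including the result in a report shows whether a timeout message is hiding a different underlying error.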

Base automatically changed from master to main March 8, 2021 19:04
@fjetter fjetter mentioned this pull request Jul 13, 2021
@crusaderky
Collaborator

Superseded by #5022

crusaderky added a commit to crusaderky/distributed that referenced this pull request Jul 13, 2021
@fjetter
Member Author

fjetter commented Jul 14, 2021

We bumped to 30s in #5022. Closing this
