@pentschev
Member

Some time ago test_ucx_config_w_env_var started failing intermittently, and the cause remained unknown. After some investigation, it seems that in certain cases exchanging UCX-Py peer information causes some of the underlying communication calls to never complete, producing a hang that Distributed can't recover from. With rapidsai/ucx-py#994, UCX-Py now has a timeout on those calls that allows Distributed to catch the failure and retry establishing the connection, which seems to resolve the problem.
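For illustration only, the timeout-then-retry idea can be sketched with plain asyncio. This is not the UCX-Py or Distributed code: `connect_with_retry`, `make_flaky_connect`, and the `timeout`/`retries` parameters are hypothetical names made up for this sketch, and the simulated hang stands in for a peer-information exchange that never completes.

```python
import asyncio


async def connect_with_retry(connect, *, timeout=0.5, retries=3):
    """Await the hypothetical ``connect`` coroutine function, retrying on timeout."""
    last_exc = None
    for _ in range(retries):
        try:
            # Bound each attempt so a stalled exchange raises instead of hanging.
            return await asyncio.wait_for(connect(), timeout=timeout)
        except asyncio.TimeoutError as exc:
            last_exc = exc  # remember the failure and try again
    raise ConnectionError(f"gave up after {retries} attempts") from last_exc


def make_flaky_connect(hang_first_n):
    """Build a stand-in connect() that hangs on its first ``hang_first_n`` calls."""
    calls = {"count": 0}

    async def connect():
        calls["count"] += 1
        if calls["count"] <= hang_first_n:
            # Simulate a communication call that never completes.
            await asyncio.sleep(3600)
        return f"connected on attempt {calls['count']}"

    return connect


def demo():
    # The first two attempts hang and time out; the third succeeds.
    return asyncio.run(
        connect_with_retry(make_flaky_connect(2), timeout=0.05, retries=3)
    )


if __name__ == "__main__":
    print(demo())
```

Without the per-attempt timeout, the first hanging call would block forever and the caller would never get a chance to retry, which matches the failure mode described above.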

Closes #5229

  • Tests added / passed
  • Passes pre-commit run --all-files

@pentschev pentschev requested a review from fjetter as a code owner October 13, 2023 12:13
@pentschev
Member Author

rerun tests

3 similar comments

@pentschev
Member Author

I think this is now resolved for good. As you can see, this PR neither xfails the test nor marks it as flaky to trigger reruns, and after gpuCI ran 5 times in total, no failures occurred. If there are no objections, this is probably good to merge from the gpuCI side.

cc @jrbourbeau @crusaderky @quasiben @charlesbluca

@github-actions
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

25 files ±0    25 suites ±0    14h 39m 11s ⏱️ -21m 58s
3 860 tests ±0    3 738 ✔️ +2    117 💤 ±0    5 ❌ -2
44 949 runs ±0    42 801 ✔️ +1    2 121 💤 ±0    27 ❌ -1

For more details on these failures, see this check.

Results for commit 5ce2c91. ± Comparison against base commit 5cedc47.

@quasiben
Member

quasiben commented Nov 9, 2023

planning to merge this afternoon if there is no further feedback

@quasiben
Member

quasiben commented Nov 9, 2023

Merging in

@quasiben quasiben merged commit e98dcb1 into dask:main Nov 9, 2023
@pentschev pentschev deleted the reenable-test_ucx_config_w_env_var branch November 13, 2023 16:03

Development

Successfully merging this pull request may close these issues.

distributed.comm.tests.test_ucx_config.test_ucx_config_w_env_var flaky

3 participants