Conversation

@crusaderky (Collaborator)

Attempt to fix some of the recent CI flakiness

    updates = {state for _, state in gl.state_updates}
    if updates == {"waiting", "processing", "memory", "released"}:
        break
    await asyncio.sleep(0.01)
@crusaderky (Collaborator, Author) commented on Aug 4, 2021:

The test was randomly failing because gl.state_updates was inspected before the "released" state appeared.
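
For context, a minimal standalone sketch of the same wait-until-states-appear pattern; the helper name and the deadline guard are illustrative additions, not part of the PR, and the superset check is slightly looser than the exact equality used in the test:

    import asyncio
    import time


    async def wait_for_states(gl, expected, timeout=5.0):
        # Poll gl.state_updates until every expected task state has been
        # observed (possibly among others), or give up after `timeout` seconds.
        # e.g. await wait_for_states(gl, {"waiting", "processing", "memory", "released"})
        deadline = time.monotonic() + timeout
        updates = set()
        while time.monotonic() < deadline:
            updates = {state for _, state in gl.state_updates}
            if updates >= expected:
                return updates
            await asyncio.sleep(0.01)
        raise TimeoutError(f"only observed {updates!r}, expected {expected!r}")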

 def test_cancellation_wait(client):
     with client.get_executor(pure=False) as e:
-        fs = [e.submit(slowinc, i, delay=0.1) for i in range(10)]
+        fs = [e.submit(slowinc, i, delay=0.2) for i in range(10)]
@crusaderky (Collaborator, Author):

The test was randomly failing because the cancellation occasionally took more than 0.1 s to be enacted, so one of the tasks had enough time to finish.

    t1 = time.time()
    e.shutdown(wait=False)
    dt = time.time() - t1
    assert dt < 0.5
@crusaderky (Collaborator, Author):

The test was randomly failing here because the shutdown took ~0.51 s.


@pytest.mark.slow
@pytest.mark.xfail(condition=MACOS, reason="extremely flaky")
@pytest.mark.flaky(condition=not MACOS, reruns=10, reruns_delay=5)
@crusaderky (Collaborator, Author):

The test was recently found to have become highly unstable on Linux too.
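
For readers unfamiliar with this marker combination, a minimal sketch of the intended behaviour, assuming the flaky marker comes from the pytest-rerunfailures plugin (whose newer releases accept a condition= argument, as used above) and that MACOS is a platform boolean:

    import sys

    import pytest

    MACOS = sys.platform == "darwin"


    # On macOS a failure is recorded as an expected failure (xfail); on every
    # other platform a failing run is retried up to 10 times, with a 5 second
    # pause between attempts, and only counts as a failure if all reruns fail.
    @pytest.mark.xfail(condition=MACOS, reason="extremely flaky")
    @pytest.mark.flaky(condition=not MACOS, reruns=10, reruns_delay=5)
    def test_timing_sensitive():
        ...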

A Member commented:

Locally these tests take about 5 and 8 seconds for me. On CI this is likely not as fast, and rerunning them 10 times might increase the runtime of our test suite by multiple minutes, for tests it seems we no longer have strong confidence in. Do we still want them to be part of our test suite if they are this unreliable, they eat up significant runtime, and nobody knows what's wrong?

I never managed to reproduce this, but inspecting the test setup, they spawn 8 and 16 processes, while GH Actions runners have 2 CPUs each (3 on OSX). I don't know how proc.start() actually behaves: does it block until the process is up and running, or does it return while the process is still in its startup phase? Is this behaviour different on different platforms? If process startup time is the problem here, we could try a barrier to see if that helps; see the patch below. The failing tests often show n_purged to be zero, and this is the only thing I can think of right now that would explain it.

diff --git a/distributed/tests/test_diskutils.py b/distributed/tests/test_diskutils.py
index 077a37d4..cb9d26b5 100644
--- a/distributed/tests/test_diskutils.py
+++ b/distributed/tests/test_diskutils.py
@@ -190,7 +190,8 @@ def test_locking_disabled(tmpdir):
         lock_file.assert_not_called()


-def _workspace_concurrency(base_dir, purged_q, err_q, stop_evt):
+def _workspace_concurrency(base_dir, purged_q, err_q, stop_evt, barrier):
+    barrier.wait()
     ws = WorkSpace(base_dir)
     n_purged = 0
     with captured_logger("distributed.diskutils", "ERROR") as sio:
@@ -229,15 +230,17 @@ def _test_workspace_concurrency(tmpdir, timeout, max_procs):

     # Run a bunch of child processes that will try to purge concurrently
     NPROCS = 2 if sys.platform == "win32" else max_procs
+    barrier = mp_context.Barrier(parties=NPROCS + 1)
     processes = [
         mp_context.Process(
-            target=_workspace_concurrency, args=(base_dir, purged_q, err_q, stop_evt)
+            target=_workspace_concurrency,
+            args=(base_dir, purged_q, err_q, stop_evt, barrier),
         )
         for i in range(NPROCS)
     ]
     for p in processes:
         p.start()
-
+    barrier.wait()
     n_created = 0
     n_purged = 0
     try:
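
For reference, a minimal runnable sketch (not from the PR) of the same Barrier hand-shake: the parent blocks until every child process has actually started before it begins creating work for them.

    import multiprocessing as mp


    def child(barrier):
        # Released only once all parties (every child plus the parent) have arrived.
        barrier.wait()
        # ... the real worker code would start here ...


    if __name__ == "__main__":
        nprocs = 4
        # parties = number of children plus the parent process itself
        barrier = mp.Barrier(parties=nprocs + 1)
        procs = [mp.Process(target=child, args=(barrier,)) for _ in range(nprocs)]
        for p in procs:
            p.start()
        barrier.wait()  # blocks here until every child is up and running
        for p in procs:
            p.join()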

@crusaderky (Collaborator, Author):

thanks, I'm giving it a try

@crusaderky (Collaborator, Author):

It hasn't failed so far! Fingers crossed...

@crusaderky crusaderky closed this Aug 4, 2021
@crusaderky crusaderky reopened this Aug 4, 2021
@fjetter (Member) commented on Aug 4, 2021:

@crusaderky ping once you're done. There are still failing tests, but this already looks good.

@crusaderky (Collaborator, Author) commented:

I can't see any of the changed tests failing anymore. This is ready for merge.

@fjetter (Member) left a comment:

Looks great! Thanks for taking the time to look into this

@fjetter fjetter merged commit 3d73623 into dask:main Aug 5, 2021
@crusaderky crusaderky deleted the ci branch August 5, 2021 09:30