Conversation

@crusaderky (Collaborator)

Attempt to fix some of the recent CI flakiness

    updates = {state for _, state in gl.state_updates}
    if updates == {"waiting", "processing", "memory", "released"}:
        break
    await asyncio.sleep(0.01)
@crusaderky (Collaborator, Author) commented on Aug 4, 2021:

The test was randomly failing because gl.state_updates was inspected before the "released" state appeared.
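
For context, a minimal standalone sketch of the same wait-until-states-appear pattern; the helper name and the deadline guard are illustrative additions, not part of the PR, and the superset check is slightly looser than the exact equality used in the test:

    import asyncio
    import time


    async def wait_for_states(gl, expected, timeout=5.0):
        # Poll gl.state_updates until every expected task state has been
        # observed (possibly among others), or give up after `timeout` seconds.
        # e.g. await wait_for_states(gl, {"waiting", "processing", "memory", "released"})
        deadline = time.monotonic() + timeout
        updates = set()
        while time.monotonic() < deadline:
            updates = {state for _, state in gl.state_updates}
            if updates >= expected:
                return updates
            await asyncio.sleep(0.01)
        raise TimeoutError(f"only observed {updates!r}, expected {expected!r}")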

 def test_cancellation_wait(client):
     with client.get_executor(pure=False) as e:
-        fs = [e.submit(slowinc, i, delay=0.1) for i in range(10)]
+        fs = [e.submit(slowinc, i, delay=0.2) for i in range(10)]
@crusaderky (Collaborator, Author):

The test was randomly failing because the cancellation occasionally took more than 0.1 s to be enacted, so one of the tasks had enough time to finish.

    t1 = time.time()
    e.shutdown(wait=False)
    dt = time.time() - t1
    assert dt < 0.5
@crusaderky (Collaborator, Author):

The test was randomly failing here because the shutdown took ~0.51 s.


@pytest.mark.slow
@pytest.mark.xfail(condition=MACOS, reason="extremely flaky")
@pytest.mark.flaky(condition=not MACOS, reruns=10, reruns_delay=5)
@crusaderky (Collaborator, Author):

The test was recently found to have become highly unstable on Linux too.
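
For readers unfamiliar with this marker combination, a minimal sketch of the intended behaviour, assuming the flaky marker comes from the pytest-rerunfailures plugin (whose newer releases accept a condition= argument, as used above) and that MACOS is a platform boolean:

    import sys

    import pytest

    MACOS = sys.platform == "darwin"


    # On macOS a failure is recorded as an expected failure (xfail); on every
    # other platform a failing run is retried up to 10 times, with a 5 second
    # pause between attempts, and only counts as a failure if all reruns fail.
    @pytest.mark.xfail(condition=MACOS, reason="extremely flaky")
    @pytest.mark.flaky(condition=not MACOS, reruns=10, reruns_delay=5)
    def test_timing_sensitive():
        ...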

A Member commented:

Locally these tests take about 5 and 8 seconds for me. On CI this is likely not as fast, and rerunning them 10 times might increase the runtime of our test suite by multiple minutes, for tests it seems we no longer have strong confidence in. Do we still want them to be part of our test suite if they are this unreliable, they eat up significant runtime, and nobody knows what's wrong?

I never managed to reproduce this, but inspecting the test setup, they spawn 8 and 16 processes, while GH Actions runners have 2 CPUs each (3 on OSX). I don't know how proc.start() actually behaves: does it block until the process is up and running, or does it return while the process is still in its startup phase? Is this behaviour different on different platforms? If process startup time is the problem here, we could try a barrier to see if that helps; see the patch below. The failing tests often show n_purged to be zero, and this is the only thing I can think of right now that would explain it.

diff --git a/distributed/tests/test_diskutils.py b/distributed/tests/test_diskutils.py
index 077a37d4..cb9d26b5 100644
--- a/distributed/tests/test_diskutils.py
+++ b/distributed/tests/test_diskutils.py
@@ -190,7 +190,8 @@ def test_locking_disabled(tmpdir):
         lock_file.assert_not_called()


-def _workspace_concurrency(base_dir, purged_q, err_q, stop_evt):
+def _workspace_concurrency(base_dir, purged_q, err_q, stop_evt, barrier):
+    barrier.wait()
     ws = WorkSpace(base_dir)
     n_purged = 0
     with captured_logger("distributed.diskutils", "ERROR") as sio:
@@ -229,15 +230,17 @@ def _test_workspace_concurrency(tmpdir, timeout, max_procs):

     # Run a bunch of child processes that will try to purge concurrently
     NPROCS = 2 if sys.platform == "win32" else max_procs
+    barrier = mp_context.Barrier(parties=NPROCS + 1)
     processes = [
         mp_context.Process(
-            target=_workspace_concurrency, args=(base_dir, purged_q, err_q, stop_evt)
+            target=_workspace_concurrency,
+            args=(base_dir, purged_q, err_q, stop_evt, barrier),
         )
         for i in range(NPROCS)
     ]
     for p in processes:
         p.start()
-
+    barrier.wait()
     n_created = 0
     n_purged = 0
     try:
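
For reference, a minimal runnable sketch (not from the PR) of the same Barrier hand-shake: the parent blocks until every child process has actually started before it begins creating work for them.

    import multiprocessing as mp


    def child(barrier):
        # Released only once all parties (every child plus the parent) have arrived.
        barrier.wait()
        # ... the real worker code would start here ...


    if __name__ == "__main__":
        nprocs = 4
        # parties = number of children plus the parent process itself
        barrier = mp.Barrier(parties=nprocs + 1)
        procs = [mp.Process(target=child, args=(barrier,)) for _ in range(nprocs)]
        for p in procs:
            p.start()
        barrier.wait()  # blocks here until every child is up and running
        for p in procs:
            p.join()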

@crusaderky (Collaborator, Author):

thanks, I'm giving it a try

@crusaderky (Collaborator, Author):

It hasn't failed so far! Fingers crossed...

@crusaderky crusaderky closed this Aug 4, 2021
@crusaderky crusaderky reopened this Aug 4, 2021
@fjetter (Member) commented on Aug 4, 2021:

@crusaderky ping once you're done. There are still failing tests, but this already looks good.

@crusaderky (Collaborator, Author) commented:

I can't see any of the changed tests failing anymore. This is ready for merge.

@fjetter (Member) left a comment:

Looks great! Thanks for taking the time to look into this

@fjetter fjetter merged commit 3d73623 into dask:main Aug 5, 2021
@crusaderky crusaderky deleted the ci branch August 5, 2021 09:30