Fix flaky tests in CI #5168
Conversation
updates = {state for _, state in gl.state_updates}
if updates == {"waiting", "processing", "memory", "released"}:
    break
await asyncio.sleep(0.01)
The test was randomly failing because gl.state_updates was checked before the "released" state had appeared.
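For context, a hedged sketch of the polling pattern the snippet above sits in (gl and its state_updates attribute come from the snippet; the helper name, timeout, and surrounding loop are illustrative assumptions, not the actual test):

# Illustrative sketch: poll gl.state_updates until every expected state,
# including "released", has been observed, instead of asserting on a single
# snapshot that may not yet contain "released".
import asyncio

async def wait_for_all_states(gl, timeout=5.0):
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        updates = {state for _, state in gl.state_updates}
        if updates == {"waiting", "processing", "memory", "released"}:
            break
        # Fail loudly with the states seen so far instead of hanging forever
        assert asyncio.get_running_loop().time() < deadline, updates
        await asyncio.sleep(0.01)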
 def test_cancellation_wait(client):
     with client.get_executor(pure=False) as e:
-        fs = [e.submit(slowinc, i, delay=0.1) for i in range(10)]
+        fs = [e.submit(slowinc, i, delay=0.2) for i in range(10)]
The test was randomly failing because the cancellation occasionally took more than 0.1 s to be enacted, which gave one of the tasks enough time to finish.
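To illustrate the race in isolation (a self-contained sketch using plain concurrent.futures rather than the distributed test itself; all names below are illustrative): cancellation only wins if a task is still pending when it is enacted, so a delay that is too short lets tasks finish first and the cancellation assertions become flaky.

import time
from concurrent.futures import ThreadPoolExecutor

def slow(i, delay):
    time.sleep(delay)
    return i + 1

with ThreadPoolExecutor(max_workers=2) as e:
    fs = [e.submit(slow, i, delay=0.2) for i in range(10)]
    time.sleep(0.05)                      # stand-in for the latency before cancel() lands
    cancelled = [f.cancel() for f in fs]  # only not-yet-started futures can be cancelled
    # With delay=0.2 plenty of futures are still pending; with a much shorter
    # delay (and a slow CI machine) they might all have run already.
    assert any(cancelled)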
t1 = time.time()
e.shutdown(wait=False)
dt = time.time() - t1
assert dt < 0.5
The test was randomly failing here because the shutdown occasionally took ~0.51 s.
distributed/tests/test_diskutils.py
Outdated
@pytest.mark.slow
@pytest.mark.xfail(condition=MACOS, reason="extremely flaky")
@pytest.mark.flaky(condition=not MACOS, reruns=10, reruns_delay=5)
recently found to have become highly unstable on Linux too
Locally these tests take about 5 and 8 seconds for me. On CI they are likely slower, and rerunning them 10 times might increase the runtime of our test suite by multiple minutes for tests we seemingly no longer have strong confidence in. Do we still want them in our test suite if they are this unreliable, they eat up significant runtime, and nobody knows what's wrong?
I never managed to reproduce this, but inspecting the test setup, they spawn 8 and 16 processes while GH Actions runners have 2 CPUs each (3 on OSX). I don't know how proc.start() actually behaves. Does it block until the process is up and running, or does it return while the process is still in a startup phase? Is this behaviour different on different platforms? If process startup time is the problem here, we could try a barrier to see if that helps. The failing tests often show n_purged to be zero, and this is the only thing I can think of right now which would explain it.
diff --git a/distributed/tests/test_diskutils.py b/distributed/tests/test_diskutils.py
index 077a37d4..cb9d26b5 100644
--- a/distributed/tests/test_diskutils.py
+++ b/distributed/tests/test_diskutils.py
@@ -190,7 +190,8 @@ def test_locking_disabled(tmpdir):
         lock_file.assert_not_called()
 
 
-def _workspace_concurrency(base_dir, purged_q, err_q, stop_evt):
+def _workspace_concurrency(base_dir, purged_q, err_q, stop_evt, barrier):
+    barrier.wait()
     ws = WorkSpace(base_dir)
     n_purged = 0
     with captured_logger("distributed.diskutils", "ERROR") as sio:
@@ -229,15 +230,17 @@ def _test_workspace_concurrency(tmpdir, timeout, max_procs):
 
     # Run a bunch of child processes that will try to purge concurrently
     NPROCS = 2 if sys.platform == "win32" else max_procs
+    barrier = mp_context.Barrier(parties=NPROCS + 1)
     processes = [
         mp_context.Process(
-            target=_workspace_concurrency, args=(base_dir, purged_q, err_q, stop_evt)
+            target=_workspace_concurrency,
+            args=(base_dir, purged_q, err_q, stop_evt, barrier),
         )
         for i in range(NPROCS)
     ]
     for p in processes:
         p.start()
-
+    barrier.wait()
     n_created = 0
     n_purged = 0
     try:
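For reference, the same pattern in isolation (a self-contained sketch with illustrative names, not the test itself): the barrier is created with parties=NPROCS + 1 so the parent counts as one extra party, which means no child starts its real work until every child process has actually been spawned and the parent has reached its own barrier.wait().

import multiprocessing as mp

def worker(barrier, q):
    barrier.wait()          # block until all workers and the parent are ready
    q.put("go")

if __name__ == "__main__":
    nprocs = 4
    barrier = mp.Barrier(parties=nprocs + 1)   # +1 so the parent participates too
    q = mp.Queue()
    procs = [mp.Process(target=worker, args=(barrier, q)) for _ in range(nprocs)]
    for p in procs:
        p.start()
    barrier.wait()          # released only once every child is up and waiting
    results = [q.get(timeout=5) for _ in range(nprocs)]
    for p in procs:
        p.join()
    assert results == ["go"] * nprocs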
thanks, I'm giving it a try
It hasn't failed so far! Fingers crossed...
@crusaderky ping once you're done. There are still failing tests, but this already looks good.
I can't see any of the changed tests failing anymore. This is ready for merge.
fjetter
left a comment
Looks great! Thanks for taking the time to look into this.
Attempt to fix some of the recent CI flakiness