
Conversation

@madsbk (Contributor) commented Aug 16, 2021

After #5103, test_stress_creation_and_deletion fails occasionally (approx. 1 out of 5 runs on my machine).

pytest distributed/tests/test_stress.py::test_stress_creation_and_deletion --runslow

I have tracked the issue down to the scheduler's stimulus_task_finished() method, which fails when ts._state is "waiting".
Rolling back that change from #5103 seems to fix the issue.
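
To make the failure mode concrete, here is a heavily simplified sketch (with made-up names, not the actual distributed scheduler code) of a "task finished" signal arriving while the scheduler still considers the task to be waiting:

    from dataclasses import dataclass

    # Hypothetical illustration only -- not the real Scheduler.stimulus_task_finished.
    @dataclass
    class TaskStateSketch:
        key: str
        state: str  # e.g. "waiting", "processing", "memory"

    def stimulus_task_finished_sketch(ts: TaskStateSketch, worker: str) -> dict:
        if ts.state == "processing":
            ts.state = "memory"
            return {"op": "key-in-memory", "key": ts.key, "worker": worker}
        if ts.state == "memory":
            return {}  # duplicate completion report, safe to ignore
        # A task that is still "waiting" ends up here, which is the case the
        # stress test hits after #5103.
        raise AssertionError(f"unexpected state {ts.state!r} for task {ts.key!r}")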

cc. @fjetter

  • Tests added / passed
  • Passes black distributed / flake8 distributed / isort distributed

@madsbk (Contributor, Author) commented Aug 16, 2021

Hmm, it works on my machine but the CI still fails:

async def _result(self, raiseit=True):
   await self._state.wait()
   if self.status == "error":
       exc = clean_exception(self._state.exception, self._state.traceback)
       if raiseit:
           typ, exc, tb = exc
>          raise exc.with_traceback(tb)
E          ValueError: Could not find dependent ('random_sample-da9e9015def3a74e21e17025dd69fb7c', 4, 5).  Check worker logs

@fjetter (Member) commented Aug 18, 2021

I can reproduce it locally. I'll have a look

@fjetter (Member) commented Aug 18, 2021

I found a few more minor places where missing workers are not properly handled. Thanks for revealing this; it's actually an interesting stress test.

FWIW, I'm not at all surprised to see this test fail with that exception. After all, what this test does is repeatedly kill workers. The exception triggers as soon as a worker has repeatedly (five times) been unable to fetch a given dependency, so if one of the many dependencies is repeatedly scheduled on a to-be-killed worker, the exception is raised. If anything, I'm happy to see this test fail, since in the past it was almost impossible to trigger this bad_dep handler due to various worker state instabilities.
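
As a rough sketch of that counting behaviour (simplified, with assumed names, not the real worker implementation): each failed fetch of a dependency bumps a per-key counter, and once the counter reaches the limit the dependency is declared bad and the error seen in the traceback above is raised.

    # Simplified sketch with assumed names -- not the actual distributed worker code.
    MAX_FETCH_ATTEMPTS = 5  # "repeatedly (five times)" as described above

    def record_failed_fetch(suspicious_count: dict, dep_key: str) -> None:
        """Count a failed dependency fetch and give up once the limit is hit."""
        suspicious_count[dep_key] = suspicious_count.get(dep_key, 0) + 1
        if suspicious_count[dep_key] >= MAX_FETCH_ATTEMPTS:
            raise ValueError(
                f"Could not find dependent {dep_key}.  Check worker logs"
            )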

Anyhow, I'll need to do some digging to find out why this test ever worked the way it is written.

@fjetter (Member) commented Aug 18, 2021

FYI: if you are ever faced with one of these "fails on CI but works fine locally" problems, it is likely a stress-induced timing issue.

I often end up decorating the test with @pytest.mark.repeat(100) and keep running it until I see something. If single-threaded execution doesn't provoke the failure, adding pytest-xdist to parallelize the tests will usually provoke it eventually. If not, keep increasing the number of parallel processes until your local machine converges to CI utilization :)
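
For reference, a minimal sketch of that workflow (the test name below is just a placeholder; @pytest.mark.repeat comes from the pytest-repeat plugin):

    # Requires the pytest-repeat plugin: pip install pytest-repeat
    import pytest

    @pytest.mark.repeat(100)  # collect 100 instances of this test to shake out races
    def test_flaky_thing():
        ...

Then add pytest-xdist (pip install pytest-xdist) to spread the repeated instances across worker processes and get closer to CI-level load, e.g. pytest test_module.py -n 8.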

@madsbk (Contributor, Author) commented Aug 19, 2021

Yeah, I knew it was a race condition but couldn't trigger the failure locally even with many tries.
I didn't know about @pytest.mark.repeat(100) and pytest-xdist, though, which make it a lot easier to hammer the tests. Thanks :)

@madsbk closed this Aug 19, 2021