
Conversation

@madsbk (Contributor) commented Aug 16, 2021

After #5103, test_stress_creation_and_deletion fails occasionally (approx. 1 out of 5 runs on my machine).

pytest distributed/tests/test_stress.py::test_stress_creation_and_deletion --runslow

I have tracked the issue down to the scheduler's stimulus_task_finished() method, which fails when ts._state is "waiting".
Rolling back that change from #5103 seems to fix the issue.
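
To make the failure mode concrete, here is a heavily simplified sketch (with made-up names, not the actual distributed scheduler code) of a "task finished" signal arriving while the scheduler still considers the task to be waiting:

    from dataclasses import dataclass

    # Hypothetical illustration only -- not the real Scheduler.stimulus_task_finished.
    @dataclass
    class TaskStateSketch:
        key: str
        state: str  # e.g. "waiting", "processing", "memory"

    def stimulus_task_finished_sketch(ts: TaskStateSketch, worker: str) -> dict:
        if ts.state == "processing":
            ts.state = "memory"
            return {"op": "key-in-memory", "key": ts.key, "worker": worker}
        if ts.state == "memory":
            return {}  # duplicate completion report, safe to ignore
        # A task that is still "waiting" ends up here, which is the case the
        # stress test hits after #5103.
        raise AssertionError(f"unexpected state {ts.state!r} for task {ts.key!r}")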

cc. @fjetter

  • Tests added / passed
  • Passes black distributed / flake8 distributed / isort distributed

@madsbk (Contributor, Author) commented Aug 16, 2021

Hmm, it works on my machine but the CI still fails:

async def _result(self, raiseit=True):
   await self._state.wait()
   if self.status == "error":
       exc = clean_exception(self._state.exception, self._state.traceback)
       if raiseit:
           typ, exc, tb = exc
>          raise exc.with_traceback(tb)
E          ValueError: Could not find dependent ('random_sample-da9e9015def3a74e21e17025dd69fb7c', 4, 5).  Check worker logs

@fjetter (Member) commented Aug 18, 2021

I can reproduce it locally. I'll have a look

@fjetter (Member) commented Aug 18, 2021

I found a few more minor places where missing workers are not properly handled. Thanks for revealing this; it's actually an interesting stress test.

FWIW, I'm not at all surprised to see this test fail with that exception. After all, what this test does is repeatedly kill workers. The exception triggers as soon as a worker has repeatedly (five times) been unable to fetch a given dependency, so if one of the many dependencies is repeatedly scheduled on a to-be-killed worker, the exception is raised. If anything, I'm happy to see this test fail, since in the past it was almost impossible to trigger this bad_dep handler due to various worker state instabilities.
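
As a rough sketch of that counting behaviour (simplified, with assumed names, not the real worker implementation): each failed fetch of a dependency bumps a per-key counter, and once the counter reaches the limit the dependency is declared bad and the error seen in the traceback above is raised.

    # Simplified sketch with assumed names -- not the actual distributed worker code.
    MAX_FETCH_ATTEMPTS = 5  # "repeatedly (five times)" as described above

    def record_failed_fetch(suspicious_count: dict, dep_key: str) -> None:
        """Count a failed dependency fetch and give up once the limit is hit."""
        suspicious_count[dep_key] = suspicious_count.get(dep_key, 0) + 1
        if suspicious_count[dep_key] >= MAX_FETCH_ATTEMPTS:
            raise ValueError(
                f"Could not find dependent {dep_key}.  Check worker logs"
            )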

Anyhow, I'll need to do some digging to find out why this test ever worked the way it is written.

@fjetter (Member) commented Aug 18, 2021

FYI: if you are ever faced with one of these "fails on CI but works fine locally" problems, it is likely a stress-induced timing issue.

I often end up decorating the test with @pytest.mark.repeat(100) and keep running it until I see something. If single-threaded execution doesn't provoke the failure, adding pytest-xdist to parallelize the tests will usually provoke it eventually. If not, keep increasing the number of parallel processes until your local machine converges to CI utilization :)
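
For reference, a minimal sketch of that workflow (the test name below is just a placeholder; @pytest.mark.repeat comes from the pytest-repeat plugin):

    # Requires the pytest-repeat plugin: pip install pytest-repeat
    import pytest

    @pytest.mark.repeat(100)  # collect 100 instances of this test to shake out races
    def test_flaky_thing():
        ...

Then add pytest-xdist (pip install pytest-xdist) to spread the repeated instances across worker processes and get closer to CI-level load, e.g. pytest test_module.py -n 8.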

@madsbk (Contributor, Author) commented Aug 19, 2021

Yeah, I knew it was a race condition but couldn't trigger the failure locally even with many tries.
I didn't know about @pytest.mark.repeat(100) and pytest-xdist, though, which make it a lot easier to hammer the tests. Thanks :)

@madsbk closed this Aug 19, 2021