Conversation

@gjoseph92 (Collaborator)

#5878 undid #3424; this brings it back. This resolves #6110, but needs more discussion around whether this is actually the right way to fix it.

@github-actions (Contributor)

Unit Test Results

    16 files ±0    16 suites ±0    7h 36m 33s ⏱️ +2m 37s
 2 727 tests ±0:   2 646 ✔️ +1      81 💤 +1    0 ❌ −2
21 701 runs  ±0:  20 663 ✔️ −1   1 038 💤 +3    0 ❌ −2

Results for commit 63622b3. ± Comparison against base commit 6947873.

# only pass on control if we spent at least 0.5s evicting
if monotonic() - start > 0.5:
    await asyncio.sleep(0)
    start = monotonic()
Member

I'm confused by both the old and the new solution here. It seems like we're staying stuck in a while loop, burning the CPU at 100%. Is this correct? If so, I'm against this design generally (although I'm not objecting to this change, since it doesn't necessarily make things worse from my perspective).

Is my understanding correct here that we're blocking the event loop?

Member

My understanding was flawed; @crusaderky and @fjetter set me straight. I didn't realize that data.evict was blocking.

Collaborator Author

Yes, it's blocking the event loop. No, it's not burning CPU at 100%.

I think the idea here is that it's a crude form of asyncio scheduling. You're basically saying "stop everything! there's nothing that could possibly be more important to do (launch tasks, fetch data, send data, talk to scheduler, etc.) than dumping data to disk right now." That's a bit of a dramatic approach, but not too wrong for this situation. Spilling to disk should probably be the highest priority thing we do at this moment. And ideally starting new tasks, fetching new data, or anything else that could produce memory probably should pause until eviction is done.

This only works because we don't have async disk #4424. With async writes, we'd have to figure out how to actually pause other subsystems without blocking the whole event loop in this situation.
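
For anyone following along, here is a minimal sketch of the pattern under discussion: a coroutine that deliberately blocks the event loop while evicting synchronously, yielding only after each ~0.5s of work. The evict_one, memory_usage, and target names are illustrative stand-ins, not the actual worker internals.

import asyncio
from time import monotonic

async def evict_until_below_target(evict_one, memory_usage, target):
    # Deliberately hold the event loop: while this runs, nothing else
    # (launching tasks, fetching data, comms) can allocate more memory.
    start = monotonic()
    while memory_usage() > target:
        evict_one()  # synchronous, blocking disk write
        # Only pass on control if we spent at least 0.5s evicting,
        # so heartbeats and other housekeeping don't starve entirely.
        if monotonic() - start > 0.5:
            await asyncio.sleep(0)
            start = monotonic()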

Member

Yeah, I don't disagree with any of that. My previous understanding was flawed. Please ignore my previous comment. I'm good here.

I'm totally fine passing control periodically.

Collaborator

> No, it's not burning CPU at 100%.

This is hardly surprising: when a write() is stuck waiting on a slow network file system, it shouldn't burn CPU.

@mrocklin (Member)

Are we OK merging this before the next release? I asked this in a recent meeting and @crusaderky and @fjetter both said "we'd like some time to look at this, but if by next Friday nothing has changed then yes, merging makes sense" (paraphrasing).

@gjoseph92 (Collaborator, Author)

I find it a bit odd that all the tests still pass with this change. Clearly this code path is not really being tested?

Since this was the previous behavior for a while, it's probably safe to merge if nothing has changed. But I'd also hope we can look at it more closely first.

@mrocklin (Member)

Yeah, I'm happy leaving it in this state for a few days while folks poke around. Mostly I'm just giving advance warning that I plan to push this through if nothing happens over the next week.

@crusaderky (Collaborator)

> I find it a bit odd that all the tests still pass with this change.

Why wouldn't they? Only a test with a single sleep(0) in it would be impacted.

> Clearly this code path is not really being tested?

It was before this PR. Somewhat worryingly, codecov is saying that there's at least one unit test somewhere that is spilling for more than 0.5s: https://app.codecov.io/gh/dask/distributed/compare/6174/changes

[screenshot: codecov coverage changes]

Maybe it's worth writing one that doesn't rely on sluggish host performance?
You should be able to write an artificial one by spilling an object whose serialization is deliberately slow:

import time

class C:
    def __getstate__(self):
        # pickling takes >0.5s, so spilling this object crosses the yield threshold
        time.sleep(0.6)
        return {}
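
A quick, self-contained sanity check that the artificial delay is long enough to cross the 0.5s yield threshold (illustrative only, not part of the suggested test):

import pickle
import time
from time import monotonic

class C:
    def __getstate__(self):
        time.sleep(0.6)
        return {}

start = monotonic()
pickle.dumps(C())  # spilling pickles the object, so this stands in for a slow spill
print(f"pickling took {monotonic() - start:.2f}s")  # ~0.6s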

from collections.abc import Callable, MutableMapping
from contextlib import suppress
from functools import partial
from time import monotonic
Collaborator

On Windows, this has a granularity of 15ms. Could you use #6181 instead?

Member

Maybe distributed.metrics.time would suffice?
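
For context on the granularity concern, a small sketch that measures a clock's effective resolution empirically; it uses only the standard library, nothing dask-specific:

import time

def clock_resolution(clock, samples=5):
    # Smallest nonzero increment the clock can observe.
    best = float("inf")
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 == t0:  # spin until the clock ticks
            t1 = clock()
        best = min(best, t1 - t0)
    return best

print(clock_resolution(time.monotonic))     # ~15ms where monotonic is coarse (e.g. some Windows builds)
print(clock_resolution(time.perf_counter))  # typically well below a microsecond

The ~15ms tick is the Windows-specific behavior referenced above; on Linux both clocks are already fine-grained.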

@crusaderky (Collaborator)

The only thing that I can think of here is that the worker goes from 70% to well beyond 100% in less than 200ms (distributed.worker.memory.monitor-interval) and starts thrashing the swap file so much that the Nanny never has a chance to terminate the worker.

Do the AWS hosts mount a swap file? If yes, is it on EBS? Would it be possible to unmount the swap file and see if we start seeing MemoryError instead?

@crusaderky (Collaborator)

I would also like to see if the issue disappears or is heavily mitigated by reducing distributed.worker.memory.monitor-interval from 200ms to 5ms.
Also I strongly suspect that #5702 has a part in this.
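
For anyone wanting to try that experiment, monitor-interval is an ordinary dask config value, so something like this sketch should do it (the 5ms figure comes from the suggestion above):

import dask

# Poll worker memory every 5ms instead of the 200ms default, so a fast
# spike has less time to blow past the limit between two polls.
dask.config.set({"distributed.worker.memory.monitor-interval": "5ms"})

The same key can also be set in distributed.yaml before starting the workers.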

@gjoseph92 (Collaborator, Author)

> The only thing that I can think of here is that the worker goes from 70% to well beyond 100% in less than 200ms (distributed.worker.memory.monitor-interval) and starts thrashing the swap file so much that the Nanny never has a chance to terminate the worker.

@crusaderky agreed. What do you think about #6177?

> Also I strongly suspect that #5702 has a part in this.

Definitely. Once we have that and async disk fixed, we could probably just pause here.


crusaderky added a commit to crusaderky/distributed that referenced this pull request Apr 24, 2022
@crusaderky (Collaborator)

Superseded by #6189

@gjoseph92 gjoseph92 closed this Apr 25, 2022
@gjoseph92 gjoseph92 deleted the worker-memory-hold-event-loop branch April 25, 2022 19:48
crusaderky added a commit to crusaderky/distributed that referenced this pull request Apr 25, 2022
Squashed commit of the following:

commit e036bb0
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 25 22:09:01 2022 +0100

    High-res monotonic timer on windows

commit 137a4f5
Merge: a27d869 b7fc7be
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 25 22:07:46 2022 +0100

    Merge branch 'main' into pause_while_spilling

commit a27d869
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 25 21:59:22 2022 +0100

    Code review

commit 7d99d1a
Merge: b9b945c 198522b
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 25 21:48:33 2022 +0100

    Merge branch 'main' into pause_while_spilling

commit b9b945c
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 25 11:37:42 2022 +0100

    harden test

commit 3ffa721
Merge: 8156313 b934ae6
Author: crusaderky <crusaderky@gmail.com>
Date:   Sun Apr 24 21:38:15 2022 +0100

    Merge branch 'main' into pause_while_spilling

commit 8156313
Author: crusaderky <crusaderky@gmail.com>
Date:   Sun Apr 24 21:37:54 2022 +0100

    Fix test

commit c150bd7
Author: crusaderky <crusaderky@gmail.com>
Date:   Sun Apr 24 21:33:29 2022 +0100

    Merge in dask#6174

commit bf66a7c
Merge: 87b534e 63622b3
Author: crusaderky <crusaderky@gmail.com>
Date:   Sun Apr 24 20:59:20 2022 +0100

    Merge remote-tracking branch 'gjoseph92/worker-memory-hold-event-loop' into pause_while_spilling

commit 87b534e
Author: crusaderky <crusaderky@gmail.com>
Date:   Sun Apr 24 20:44:55 2022 +0100

    Pause in the middle of spilling

commit 63622b3
Author: Gabe Joseph <gjoseph92@gmail.com>
Date:   Thu Apr 21 14:47:02 2022 -0700

    Hold event loop while evicting

Development

Successfully merging this pull request may close these issues:

Computation deadlocks due to worker rapidly running out of memory instead of spilling