Conversation

@gjoseph92 (Collaborator)

#5878 undid #3424; this brings it back. This resolves #6110, but needs more discussion around whether this is actually the right way to fix it.

@github-actions (Contributor)

Unit Test Results

    16 files ±0    16 suites ±0    7h 36m 33s ⏱️ +2m 37s
 2 727 tests ±0:   2 646 ✔️ +1      81 💤 +1    0 ❌ −2
21 701 runs  ±0:  20 663 ✔️ −1   1 038 💤 +3    0 ❌ −2

Results for commit 63622b3. ± Comparison against base commit 6947873.

# only pass on control if we spent at least 0.5s evicting
if monotonic() - start > 0.5:
    await asyncio.sleep(0)
    start = monotonic()
Member

I'm confused by both the old and the new solution here. It seems like we're staying stuck in a while loop, burning the CPU at 100%. Is this correct? If so, I'm against this design generally (although I'm not objecting to this change, since it doesn't necessarily make things worse from my perspective).

Is my understanding correct here that we're blocking the event loop?

Member

My understanding was flawed; @crusaderky and @fjetter set me straight. I didn't realize that data.evict was blocking.

Collaborator Author

Yes, it's blocking the event loop. No, it's not burning CPU at 100%.

I think the idea here is that it's a crude form of asyncio scheduling. You're basically saying "stop everything! there's nothing that could possibly be more important to do (launch tasks, fetch data, send data, talk to scheduler, etc.) than dumping data to disk right now." That's a bit of a dramatic approach, but not too wrong for this situation. Spilling to disk should probably be the highest priority thing we do at this moment. And ideally starting new tasks, fetching new data, or anything else that could produce memory probably should pause until eviction is done.

This only works because we don't have async disk #4424. With async writes, we'd have to figure out how to actually pause other subsystems without blocking the whole event loop in this situation.
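
For anyone following along, here is a minimal sketch of the pattern under discussion: a coroutine that deliberately blocks the event loop while evicting synchronously, yielding only after each ~0.5s of work. The evict_one, memory_usage, and target names are illustrative stand-ins, not the actual worker internals.

import asyncio
from time import monotonic

async def evict_until_below_target(evict_one, memory_usage, target):
    # Deliberately hold the event loop: while this runs, nothing else
    # (launching tasks, fetching data, comms) can allocate more memory.
    start = monotonic()
    while memory_usage() > target:
        evict_one()  # synchronous, blocking disk write
        # Only pass on control if we spent at least 0.5s evicting,
        # so heartbeats and other housekeeping don't starve entirely.
        if monotonic() - start > 0.5:
            await asyncio.sleep(0)
            start = monotonic()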

Member

Yeah, I don't disagree with any of that. My previous understanding was flawed. Please ignore my previous comment. I'm good here.

I'm totally fine passing control periodically.

Collaborator

> No, it's not burning CPU at 100%.

This is hardly surprising: when a write() is stuck waiting on a slow network file system, it shouldn't burn CPU.

@mrocklin (Member)

Are we OK merging this before the next release? I asked this in a recent meeting and @crusaderky and @fjetter both said "we'd like some time to look at this, but if by next Friday nothing has changed then yes, merging makes sense" (paraphrasing).

@gjoseph92 (Collaborator, Author)

I find it a bit odd that all the tests still pass with this change. Clearly this code path is not really being tested?

Since this was the previous behavior for a while, it's probably safe to merge if nothing has changed. But I'd also hope we can look at it more closely first.

@mrocklin (Member)

Yeah, I'm happy leaving it in this state for a few days while folks poke around. Mostly I'm just giving advance warning that I plan to push this through if nothing happens over the next week.

@crusaderky (Collaborator)

> I find it a bit odd that all the tests still pass with this change.

Why wouldn't they? Only a test with a single sleep(0) in it would be impacted.

> Clearly this code path is not really being tested?

It was before this PR. Somewhat worryingly, codecov is saying that there's at least one unit test somewhere that is spilling for more than 0.5s: https://app.codecov.io/gh/dask/distributed/compare/6174/changes

[screenshot: codecov coverage changes]

Maybe it's worth writing one that doesn't rely on sluggish host performance?
You should be able to write an artificial one by spilling an object whose serialization is deliberately slow:

import time

class C:
    def __getstate__(self):
        # pickling takes >0.5s, so spilling this object crosses the yield threshold
        time.sleep(0.6)
        return {}
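
A quick, self-contained sanity check that the artificial delay is long enough to cross the 0.5s yield threshold (illustrative only, not part of the suggested test):

import pickle
import time
from time import monotonic

class C:
    def __getstate__(self):
        time.sleep(0.6)
        return {}

start = monotonic()
pickle.dumps(C())  # spilling pickles the object, so this stands in for a slow spill
print(f"pickling took {monotonic() - start:.2f}s")  # ~0.6s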

from collections.abc import Callable, MutableMapping
from contextlib import suppress
from functools import partial
from time import monotonic
Collaborator

On Windows, this has a granularity of 15ms. Could you use #6181 instead?

Member

Maybe distributed.metrics.time would suffice?
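
For context on the granularity concern, a small sketch that measures a clock's effective resolution empirically; it uses only the standard library, nothing dask-specific:

import time

def clock_resolution(clock, samples=5):
    # Smallest nonzero increment the clock can observe.
    best = float("inf")
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 == t0:  # spin until the clock ticks
            t1 = clock()
        best = min(best, t1 - t0)
    return best

print(clock_resolution(time.monotonic))     # ~15ms where monotonic is coarse (e.g. some Windows builds)
print(clock_resolution(time.perf_counter))  # typically well below a microsecond

The ~15ms tick is the Windows-specific behavior referenced above; on Linux both clocks are already fine-grained.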

@crusaderky (Collaborator)

The only thing that I can think of here is that the worker goes from 70% to well beyond 100% in less than 200ms (distributed.worker.memory.monitor-interval) and starts thrashing the swap file so much that the Nanny never has a chance to terminate the worker.

Do the AWS hosts mount a swap file? If yes, is it on EBS? Would it be possible to unmount the swap file and see if we start seeing MemoryError instead?

@crusaderky (Collaborator)

I would also like to see if the issue disappears or is heavily mitigated by reducing distributed.worker.memory.monitor-interval from 200ms to 5ms.
Also I strongly suspect that #5702 has a part in this.
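
For anyone wanting to try that experiment, monitor-interval is an ordinary dask config value, so something like this sketch should do it (the 5ms figure comes from the suggestion above):

import dask

# Poll worker memory every 5ms instead of the 200ms default, so a fast
# spike has less time to blow past the limit between two polls.
dask.config.set({"distributed.worker.memory.monitor-interval": "5ms"})

The same key can also be set in distributed.yaml before starting the workers.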

@gjoseph92 (Collaborator, Author)

> The only thing that I can think of here is that the worker goes from 70% to well beyond 100% in less than 200ms (distributed.worker.memory.monitor-interval) and starts thrashing the swap file so much that the Nanny never has a chance to terminate the worker.

@crusaderky agreed. What do you think about #6177?

> Also I strongly suspect that #5702 has a part in this.

Definitely. Once we have that and async disk fixed, we could probably just pause here.


crusaderky added a commit to crusaderky/distributed that referenced this pull request Apr 24, 2022
@crusaderky (Collaborator)

Superseded by #6189

@gjoseph92 gjoseph92 closed this Apr 25, 2022
@gjoseph92 gjoseph92 deleted the worker-memory-hold-event-loop branch April 25, 2022 19:48
crusaderky added a commit to crusaderky/distributed that referenced this pull request Apr 25, 2022
Squashed commit of the following:

commit e036bb0
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 25 22:09:01 2022 +0100

    High-res monotonic timer on windows

commit 137a4f5
Merge: a27d869 b7fc7be
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 25 22:07:46 2022 +0100

    Merge branch 'main' into pause_while_spilling

commit a27d869
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 25 21:59:22 2022 +0100

    Code review

commit 7d99d1a
Merge: b9b945c 198522b
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 25 21:48:33 2022 +0100

    Merge branch 'main' into pause_while_spilling

commit b9b945c
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 25 11:37:42 2022 +0100

    harden test

commit 3ffa721
Merge: 8156313 b934ae6
Author: crusaderky <crusaderky@gmail.com>
Date:   Sun Apr 24 21:38:15 2022 +0100

    Merge branch 'main' into pause_while_spilling

commit 8156313
Author: crusaderky <crusaderky@gmail.com>
Date:   Sun Apr 24 21:37:54 2022 +0100

    Fix test

commit c150bd7
Author: crusaderky <crusaderky@gmail.com>
Date:   Sun Apr 24 21:33:29 2022 +0100

    Merge in dask#6174

commit bf66a7c
Merge: 87b534e 63622b3
Author: crusaderky <crusaderky@gmail.com>
Date:   Sun Apr 24 20:59:20 2022 +0100

    Merge remote-tracking branch 'gjoseph92/worker-memory-hold-event-loop' into pause_while_spilling

commit 87b534e
Author: crusaderky <crusaderky@gmail.com>
Date:   Sun Apr 24 20:44:55 2022 +0100

    Pause in the middle of spilling

commit 63622b3
Author: Gabe Joseph <gjoseph92@gmail.com>
Date:   Thu Apr 21 14:47:02 2022 -0700

    Hold event loop while evicting

Development

Successfully merging this pull request may close these issues:

Computation deadlocks due to worker rapidly running out of memory instead of spilling