@crusaderky crusaderky commented Sep 23, 2022

Allow the Active Memory Manager to use a measure other than optimistic memory (managed + unmanaged that appeared more than 30s ago) in its heuristics.

This is particularly useful on macOS, where memory deallocation is not as responsive as on Windows or Linux, and on Linux when allocators other than malloc are in use.

This also allows the AMM unit tests to be written using Worker instead of Nanny, which makes them much more robust and faster.

This PR finally enables all AMM stress tests in CI.

Stress test evidence: https://github.com/crusaderky/distributed/actions/runs/3114670981/jobs/5050785452#step:18:1674
There was one failure, which I don't believe is attributable to AMM. Follow-up: #7063
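
To make the measures mentioned in the description concrete, here is a minimal sketch of how a heuristic might select one by name. The `MemorySnapshot` class, its field names, `ALLOWED_MEASURES`, and `pick_measure` are all hypothetical and are not the distributed API. The only part taken from the description is the definition of optimistic memory (managed plus unmanaged memory that appeared more than 30s ago).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySnapshot:
    """Hypothetical per-worker memory breakdown, in bytes (illustration only)."""

    managed: int           # memory the cluster tracks directly
    unmanaged_old: int     # unmanaged memory that appeared more than 30s ago
    unmanaged_recent: int  # unmanaged memory that appeared within the last 30s

    @property
    def optimistic(self) -> int:
        # Optimistic memory as described above: managed + old unmanaged.
        return self.managed + self.unmanaged_old

    @property
    def process(self) -> int:
        # Everything the process holds, including recent unmanaged memory.
        return self.managed + self.unmanaged_old + self.unmanaged_recent


# Hypothetical set of measure names a heuristic could be configured with.
ALLOWED_MEASURES = ("managed", "optimistic", "process")


def pick_measure(snapshot: MemorySnapshot, measure: str = "optimistic") -> int:
    """Return the number of bytes for the requested measure."""
    if measure not in ALLOWED_MEASURES:
        raise ValueError(f"unknown measure {measure!r}; expected one of {ALLOWED_MEASURES}")
    return getattr(snapshot, measure)


snap = MemorySnapshot(managed=100, unmanaged_old=20, unmanaged_recent=5)
print(pick_measure(snap))             # default: optimistic
print(pick_measure(snap, "managed"))  # ignores unmanaged memory entirely
```

Under these assumptions, a measure that ignores unmanaged memory does not depend on how quickly the operating system or allocator releases freed memory, which may be why a configurable measure helps on platforms where deallocation lags.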

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    b = (a @ a.T).sum().round(3)
assert await c.compute(b) == 245.394

Interestingly, (20, 20) would take forever or hang on my PC with Worker=Worker, but took as little as 5s with Worker=Nanny. On CI, Worker=Nanny hangs because of #5371.

@crusaderky crusaderky marked this pull request as draft September 23, 2022 22:23

github-actions bot commented Sep 23, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±  0         15 suites  ±0   5h 59m 36s ⏱️ - 7m 26s
  3 121 tests +  6    3 035 ✔️ +  7    85 💤 ±0  1  - 1 
23 098 runs  +42  22 190 ✔️ +44  907 💤 +1  1  - 3 

For more details on these failures, see this check.

Results for commit d8c9586. ± Comparison against base commit b40c03d.

♻️ This comment has been updated with latest results.

@gjoseph92 gjoseph92 left a comment

LGTM, looking forward to having all these tests running. I assume the only blocker is #7065?

x = c.submit(lambda: 123, key="x", workers=[w1.address])
await wait(x)
# Fill w2 with dummy data so that it's got the highest memory usage
clutter = await c.scatter(456, workers=[w2.address])

Suggested change
- clutter = await c.scatter(456, workers=[w2.address])
+ clutter = await c.scatter("c" * 10, workers=[w2.address])

Small note: I would expect 123 and 456 to have the same memory usage, so the comment above is slightly misleading. Something like this would make it definitively larger.
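
The size claim can be checked quickly with a sketch like the one below. It uses `sys.getsizeof` as a stand-in, which is an assumption: the memory the scheduler accounts for may be measured differently, and exact byte counts depend on the Python build.

```python
import sys

# Two small integers occupy the same number of bytes in CPython.
size_123 = sys.getsizeof(123)
size_456 = sys.getsizeof(456)

# A 10-character string is larger than either small integer.
size_str = sys.getsizeof("c" * 10)

print(size_123 == size_456)
print(size_str > size_123)
```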

w2 has got the highest memory usage among the workers that aren't being retired, meaning w2 and w3. I updated the comment to clarify.

@crusaderky crusaderky merged commit 162a7c0 into dask:main Sep 28, 2022
@crusaderky crusaderky deleted the AMM/measure branch September 28, 2022 11:56
gjoseph92 pushed a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022

Development

Successfully merging this pull request may close these issues.

Make AMM memory measure configurable
