
Conversation

@fjetter (Member) commented on Oct 7, 2022

The test test_tick_interval recently started flaking. It was accompanied by multiple GC-duration warnings claiming that GC took too long.
While this can certainly be true, I realized that the GC callback also inspects process memory. I believe the process memory check can actually be slow-ish, and we should not include it in the GC duration measurement.

In fact, I'm wondering whether this check should even be there in the first place. I also wonder whether we should apply similar throttling logic to get_process_memory as we did for gc.collect, so that it doesn't run too frequently (see the sketch below).

cc @crusaderky @gjoseph92
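
For illustration, here is a hedged sketch of what such throttling could look like. The class name ThrottledMemoryCheck and the min_interval default are made up for this example and are not distributed's API; it only assumes psutil is available.

```python
# A hedged sketch, not distributed's actual code: cache the RSS reading and
# only refresh it after a minimum interval, analogous to the throttling that
# already exists around gc.collect. Names here are hypothetical.
import time

import psutil


class ThrottledMemoryCheck:
    """Return a cached RSS value unless `min_interval` seconds have passed."""

    def __init__(self, min_interval: float = 0.5):
        self._proc = psutil.Process()
        self._min_interval = min_interval
        self._last_rss = self._proc.memory_info().rss
        self._last_check = time.monotonic()

    def rss(self) -> int:
        now = time.monotonic()
        if now - self._last_check >= self._min_interval:
            self._last_rss = self._proc.memory_info().rss
            self._last_check = now
        return self._last_rss
```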

@fjetter changed the title from "Do not include psutil memory_info in GC duration warning" to "Do not include psutil memory_info in GC duration measurement" on Oct 7, 2022
# measurement
self._fractional_timer.stop_timing()
if self._proc is not None:
    rss = self._proc.memory_info().rss
@fjetter (Member, author) commented on the snippet above:

At the very least, if GC runs frequently, this could be expensive. I'm wondering if we want to add yet another timer specifically for this (a rough sketch below).
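
A hedged, purely illustrative sketch of timing the memory check on its own, so its cost shows up separately from the GC duration measurement (the variable names are made up; this is not distributed's internals):

```python
# Time the memory_info() call on its own so its cost can be reported
# separately from the GC pause itself.
import time

import psutil

proc = psutil.Process()

start = time.perf_counter()
rss = proc.memory_info().rss
rss_check_seconds = time.perf_counter() - start
print(f"RSS check took {rss_check_seconds * 1e6:.1f} µs (rss={rss / 2**20:.1f} MiB)")
```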

github-actions bot (Contributor) commented on Oct 7, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0    15 suites ±0    6h 45m 55s ⏱️ +25m 25s
3 146 tests ±0: 3 060 passed +3, 85 skipped -1, 1 failed -2
23 272 runs ±0: 22 355 passed +1, 916 skipped +1, 1 failed -2

For more details on these failures, see this check.

Results for commit 22b27e2. ± Comparison against base commit 07e2259.

@crusaderky (Collaborator) commented:

psutil.Process.memory_info() takes 7.76 µs ± 47.4 ns on my Linux box. Where did you observe it being slow?
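
A hedged way to reproduce this kind of measurement (results vary by OS and machine; the figure above was reported on Linux):

```python
# Micro-benchmark psutil.Process.memory_info() with timeit.
import timeit

import psutil

proc = psutil.Process()  # current process
n = 100_000
total = timeit.timeit(lambda: proc.memory_info().rss, number=n)
print(f"memory_info(): {total / n * 1e6:.3f} µs per call")
```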

@fjetter (Member, author) commented on Oct 10, 2022

Where did you observe it being slow?

I can't tell whether this specific call is slow or not. I remember interactions with processes via psutil being slow occasionally, but I can't point you to an example.

Where I noticed the GC warning was in #7075
https://github.com/dask/distributed/actions/runs/3184216360/jobs/5192365157

which times out after 30s. I assume the test logic is not rock solid, but I also see the warnings below:

2022-10-04 17:56:04,506 - distributed.utils_perf - WARNING - full garbage collections took 49% CPU time recently (threshold: 10%)
/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py:647: in run_until_complete
2022-10-04 17:56:05,376 - distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
    return future.result()
2022-10-04 17:56:06,295 - distributed.utils_perf - WARNING - full garbage collections took 45% CPU time recently (threshold: 10%)

In this test, nothing is really happening. Specifically, we're not accumulating any objects that should clog the GC (of course, we may be dealing with leaks from another test), so I was wondering how we're measuring this in the first place...
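
For reference on the measurement itself, here is a generic, hedged sketch of how such a percentage can be computed from gc callbacks. This is not distributed.utils_perf's actual code; the class name, window size, and warning text are made up for illustration.

```python
# Time generation-2 collections via gc.callbacks and compare the summed pause
# time against the wall-clock span covered by a small window of recent pauses.
import gc
import time
from collections import deque


class GCFractionMonitor:
    def __init__(self, window: int = 10, threshold: float = 0.10):
        self._events = deque(maxlen=window)  # (end_timestamp, duration) pairs
        self._gc_start = None
        self._threshold = threshold
        gc.callbacks.append(self._callback)

    def _callback(self, phase, info):
        if info["generation"] != 2:
            return  # only track full collections
        if phase == "start":
            self._gc_start = time.perf_counter()
        elif phase == "stop" and self._gc_start is not None:
            now = time.perf_counter()
            self._events.append((now, now - self._gc_start))
            self._maybe_warn(now)

    def _maybe_warn(self, now):
        if len(self._events) < self._events.maxlen:
            return  # wait until the window is full
        # wall-clock span from the start of the oldest tracked collection
        oldest_end, oldest_duration = self._events[0]
        elapsed = max(now - (oldest_end - oldest_duration), 1e-9)
        fraction = sum(d for _, d in self._events) / elapsed
        if fraction > self._threshold:
            print(
                f"full garbage collections took {fraction:.0%} of wall time "
                f"recently (threshold: {self._threshold:.0%})"
            )
```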

I'm also surprised that we're running collection three times within two seconds even if this fraction is high (considering that during runtime we're only running ThrottledGC). I assume this is connected to a couple of explicit gc.collect calls we have in the test suite.

Following that logic, maybe the cause for these warnings is not the callback but rather one of the following two loops

@fjetter closed this on Jul 10, 2025