Description
What happened:
We occasionally get the following exceptions from our workers, which eventually cause the whole process to stall:
FileNotFoundError: [Errno 2] No such file or directory: '/proc/12/statm'
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/usr/local/lib/python3.8/site-packages/distributed/system_monitor.py", line 121, in update
read_bytes_disk = (disk_ioc.read_bytes - last_disk.read_bytes) / (
AttributeError: 'NoneType' object has no attribute 'read_bytes'
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/psutil/_common.py", line 441, in wrapper
ret = self._cache[fun]
AttributeError: 'Process' object has no attribute '_cache'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/usr/local/lib/python3.8/site-packages/distributed/worker_memory.py", line 322, in memory_monitor
memory = proc.memory_info().rss
File "/usr/local/lib/python3.8/site-packages/psutil/_common.py", line 444, in wrapper
return fun(self)
File "/usr/local/lib/python3.8/site-packages/psutil/__init__.py", line 1061, in memory_info
return self._proc.memory_info()
File "/usr/local/lib/python3.8/site-packages/psutil/_pslinux.py", line 1661, in wrapper
return fun(self, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/psutil/_pslinux.py", line 1895, in memory_info
with open_binary("%s/%s/statm" % (self._procfs_path, self.pid)) as f:
File "/usr/local/lib/python3.8/site-packages/psutil/_common.py", line 711, in open_binary
return open(fname, "rb", **kwargs)
If I understand the code correctly, the following is happening:
Initially, psutil.disk_io_counters() returns a named tuple:
https://github.com/dask/distributed/blob/2022.04.0/distributed/system_monitor.py#L39
so the internal self._collect_disk_io_counters flag is not set to False. Later, when called within the update method, psutil.disk_io_counters() returns None instead of the expected named tuple:
https://github.com/dask/distributed/blob/2022.04.0/distributed/system_monitor.py#L115
which raises the AttributeError shown above and eventually causes the Nanny to crash.
This seems to be a psutil issue in the first place, but I think the SystemMonitor could be more resilient when it happens.
What you expected to happen:
SystemMonitor should not raise an exception if psutil.disk_io_counters() returns None:
distributed/distributed/system_monitor.py, line 115 in 034b4d4:
disk_ioc = psutil.disk_io_counters()
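A minimal sketch of the kind of guard I have in mind. This is not the actual SystemMonitor code: `disk_rates`, `DiskIO`, and the `read_counters` parameter are hypothetical names used only for illustration, and `DiskIO` is a stand-in for the named tuple that psutil.disk_io_counters() returns (only two of its fields are modeled). The idea is to treat a None sample as "no data for this tick" and keep the previous sample rather than dereferencing it.

```python
from collections import namedtuple

# Hypothetical stand-in for the named tuple returned by psutil.disk_io_counters().
DiskIO = namedtuple("DiskIO", ["read_bytes", "write_bytes"])


def disk_rates(read_counters, last, duration):
    """Return (read_rate, write_rate, new_last) in bytes per second.

    read_counters is a zero-argument callable; in the monitor it would be
    psutil.disk_io_counters (assumption). If it returns None, or if there is
    no previous sample, report zero rates instead of raising AttributeError.
    """
    current = read_counters()
    if current is None:
        # No sample this tick: keep the previous one for the next update.
        return 0.0, 0.0, last
    if last is None:
        # First valid sample: nothing to diff against yet.
        return 0.0, 0.0, current
    duration = duration or 0.001  # guard against division by zero
    read_rate = (current.read_bytes - last.read_bytes) / duration
    write_rate = (current.write_bytes - last.write_bytes) / duration
    return read_rate, write_rate, current
```

With this shape, a call such as `disk_rates(lambda: None, last, 1.0)` yields zero rates and leaves `last` unchanged, so the periodic callback keeps running and picks up again once psutil returns counters.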
Minimal Complete Verifiable Example:
# Put your MCVE code here
Anything else we need to know?:
Should I prepare a PR with the suggested changes?
Environment:
- Dask version: 2022.4.0
- Python version: 3.8
- Operating System: Debian GNU/Linux 10 (buster)
- Install method (conda, pip, source): pip