psutil causes Nanny to crash #6089

@ungarj

What happened:

We occasionally get the following exceptions from our workers, which eventually cause the whole process to stall:

FileNotFoundError: [Errno 2] No such file or directory: '/proc/12/statm'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/usr/local/lib/python3.8/site-packages/distributed/system_monitor.py", line 121, in update
    read_bytes_disk = (disk_ioc.read_bytes - last_disk.read_bytes) / (
AttributeError: 'NoneType' object has no attribute 'read_bytes'

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/psutil/_common.py", line 441, in wrapper
    ret = self._cache[fun]
AttributeError: 'Process' object has no attribute '_cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/usr/local/lib/python3.8/site-packages/distributed/worker_memory.py", line 322, in memory_monitor
    memory = proc.memory_info().rss
  File "/usr/local/lib/python3.8/site-packages/psutil/_common.py", line 444, in wrapper
    return fun(self)
  File "/usr/local/lib/python3.8/site-packages/psutil/__init__.py", line 1061, in memory_info
    return self._proc.memory_info()
  File "/usr/local/lib/python3.8/site-packages/psutil/_pslinux.py", line 1661, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/psutil/_pslinux.py", line 1895, in memory_info
    with open_binary("%s/%s/statm" % (self._procfs_path, self.pid)) as f:
  File "/usr/local/lib/python3.8/site-packages/psutil/_common.py", line 711, in open_binary
    return open(fname, "rb", **kwargs)

If I understand the code correctly, the following is happening:

At first, psutil.disk_io_counters() returns a named tuple:

https://github.com/dask/distributed/blob/2022.04.0/distributed/system_monitor.py#L39

so the internal self._collect_disk_io_counters is not set to False. Later, when called within the update method, psutil.disk_io_counters() returns None instead of the expected named tuple:

https://github.com/dask/distributed/blob/2022.04.0/distributed/system_monitor.py#L115

The subsequent attribute access on None then raises the AttributeError shown above, which causes the Nanny to crash.

This appears to be a psutil issue in the first place, but I think the SystemMonitor could be more resilient when it happens.
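
The sequence described above can be simulated without psutil. This is only a sketch of the suspected failure mode, not the actual SystemMonitor code: `DiskIO`, `make_sampler`, and the fixed 0.5 second divisor are hypothetical stand-ins, and the sampler plays the role of psutil.disk_io_counters() returning a named tuple first and None later.

```python
from collections import namedtuple

# Hypothetical stand-in for the named tuple returned by
# psutil.disk_io_counters(); only two of its fields are modelled here.
DiskIO = namedtuple("DiskIO", ["read_bytes", "write_bytes"])


def make_sampler(results):
    """Return a callable that yields the given results, one per call."""
    iterator = iter(results)
    return lambda: next(iterator)


# First call succeeds, second call returns None (the suspected psutil hiccup).
sampler = make_sampler([DiskIO(1000, 2000), None])

# Start-up check: the first sample is a named tuple, so collection stays on.
last_disk = sampler()
collect_disk_io = last_disk is not None

error = None
if collect_disk_io:
    disk_ioc = sampler()  # a later call unexpectedly returns None
    try:
        # Same shape as the failing expression in the traceback;
        # the 0.5 divisor is an assumption made for this sketch.
        read_bytes_disk = (disk_ioc.read_bytes - last_disk.read_bytes) / 0.5
    except AttributeError as exc:
        error = str(exc)

print(error)
```

Because the flag is only decided once at start-up, a single None returned later is enough to trigger the AttributeError from the first traceback.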

What you expected to happen:

SystemMonitor should not raise an exception if psutil.disk_io_counters() returns None:

disk_ioc = psutil.disk_io_counters()
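
One possible shape for such a guard is sketched below. It is not the actual SystemMonitor implementation: `DiskRateTracker`, `DiskIO`, the metric names, and the fallback `duration or 0.5` are assumptions made for illustration. The sampler is injected so the sketch runs without psutil; in practice it would presumably be psutil.disk_io_counters.

```python
from collections import namedtuple
from typing import Callable, Dict, Optional

# Hypothetical stand-in for psutil's disk I/O named tuple (two fields only).
DiskIO = namedtuple("DiskIO", ["read_bytes", "write_bytes"])


class DiskRateTracker:
    """Hypothetical helper that tolerates None samples instead of raising."""

    def __init__(self, sample: Callable[[], Optional[DiskIO]]):
        self._sample = sample      # e.g. psutil.disk_io_counters
        self._last = sample()      # may already be None

    def update(self, duration: float) -> Dict[str, float]:
        current = self._sample()
        if current is None or self._last is None:
            # No usable pair of samples: report zero rates for this tick.
            rates = {"read_bytes_disk": 0.0, "write_bytes_disk": 0.0}
        else:
            elapsed = duration or 0.5  # assumed fallback to avoid dividing by 0
            rates = {
                "read_bytes_disk": (current.read_bytes - self._last.read_bytes) / elapsed,
                "write_bytes_disk": (current.write_bytes - self._last.write_bytes) / elapsed,
            }
        if current is not None:
            # Only advance the baseline when a real sample was obtained.
            self._last = current
        return rates


# Usage: the second sample is None, the third is valid again.
samples = iter([DiskIO(100, 200), None, DiskIO(300, 600)])
tracker = DiskRateTracker(lambda: next(samples))
first = tracker.update(1.0)
second = tracker.update(2.0)
```

Reporting zeros for a missing sample keeps the periodic callback alive. Whether zeros, the previous values, or a skipped data point is the right behaviour is a design choice for the maintainers.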

Minimal Complete Verifiable Example:

# Put your MCVE code here

Anything else we need to know?:

Should I prepare a PR with the suggested changes?

Environment:

  • Dask version: 2022.4.0
  • Python version: 3.8
  • Operating System: Debian GNU/Linux 10 (buster)
  • Install method (conda, pip, source): pip
Cluster Dump State:
