Potential race condition in Nanny #3955

@FabioLuporini

Hi everybody, for the past few days we've been seeing "random" failures in our CI, with distributed emitting:

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 1>>
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

This looks like a race condition, possibly during shutdown of the Nanny: the periodic callbacks are still running, but Nanny.process has already been set to None.
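
To make the suspected sequence concrete, here is a minimal sketch in plain Python. It is not the actual distributed code: `SketchNanny`, `FakeWorkerProcess`, and both `memory_monitor_*` methods are made-up names, and the guard is only one possible way to tolerate a callback firing after teardown, not necessarily what the library does or should do.

```python
class FakeWorkerProcess:
    """Hypothetical stand-in for whatever the Nanny stores in self.process."""

    def __init__(self):
        # Stand-in for the underlying OS process handle.
        self.process = object()


class SketchNanny:
    """Hypothetical, heavily simplified model; NOT distributed's Nanny."""

    def __init__(self):
        self.process = FakeWorkerProcess()

    def close(self):
        # Teardown clears the reference. In this sketch nothing prevents
        # an already scheduled periodic callback from running afterwards.
        self.process = None

    def memory_monitor_unguarded(self):
        # Mirrors the failing line from the traceback.
        return self.process.process

    def memory_monitor_guarded(self):
        # Read the attribute once, then bail out if teardown already ran.
        worker_process = self.process
        if worker_process is None:
            return None
        return worker_process.process


nanny = SketchNanny()
handle_before_close = nanny.memory_monitor_guarded()

# Simulate the callback firing after the Nanny has been closed.
nanny.close()

try:
    nanny.memory_monitor_unguarded()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

guarded_result = nanny.memory_monitor_guarded()
print(error_message)
print(guarded_result)
```

In the sketch, the unguarded call fails with the same kind of AttributeError as in the log, while the guarded variant simply skips the check once the process reference is gone.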

I think (still investigating) we only started seeing this after 2.20 was released.

I can't attach an MFE yet, simply because we don't have one 😬; we only see this on CI. Any feedback is welcome. I can give some context, though: the error is triggered in a Jupyter Notebook by a cell that calls scipy.optimize.minimize from a Dask worker (cell 14 here -- look for optimize.minimize). I doubt it will be useful, but here's an excerpt of the raw log (part of which I pasted above); note that it repeats over and over for hundreds or thousands of lines...

Thanks!
