Skip to content

Conversation

@fjetter
Copy link
Member

@fjetter fjetter commented Jun 29, 2022

I realized we don't actually need the TemporaryDirectory here since the diskutils are taking care of cleanup.

The diff is mostly reindent. I didn't change any logic of gen_cluster

This could be responsible for many of our CI failures lately since I expect there to be a problem if the workdir/diskutils is using the finalize

@classmethod
def _finalize(cls, workspace, lock_path, lock_file, dir_path):
try:
workspace._purge_directory(dir_path)
finally:
if lock_file is not None:
lock_file.release()
if lock_path is not None:
workspace._known_locks.remove(lock_path)
safe_unlink(lock_path)

This should be safe to be called if there is no directory anymore, but I don't know for sure.
If the worker is closing gracefully, we call this
self._workdir.release()

I don't think any of this should matter but I am not absolutely sure. If any of these or the tempdir implementation doesn't handle empty directories well, this could cause conflicts

@github-actions
Copy link
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   10h 20m 15s ⏱️ - 12m 26s
  2 899 tests ±0    2 814 ✔️ +2    82 💤 ±0  3  - 1 
21 475 runs  +2  20 506 ✔️  - 2  963 💤 +6  6  - 1 

For more details on these failures, see this check.

Results for commit 104d1ec. ± Comparison against base commit bd7f43a.

Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good to me. Regarding the concerns: Given that CI looks green, I'd say we risk it. Those problems, if existent, should be fairly easy to spot.

@crusaderky
Copy link
Collaborator

crusaderky commented Jun 30, 2022

There are a few reasons why I introduced the top-level temporary directory in #6335:

  1. The test suite splattered the project dir wth 'dask_worker_space' directories, frequently with leftover contents
  2. The test suite had many failures if the project directory was checked out on NFS (very old issue - can't find it now)
  3. On CI, the project directory is most likely a virtual network mount. I was hoping that */tmp or equivalent on MacOSX and Windows) would instead be a local SSD. Note that I have no evidence to back either of these statements, other than the fact that Windows CI frequently dumps out many warnings about the disk being very slow.

I understand that CI is failing because the tests are leaking open file handles, but on Windows you can't delete a file if there are open file handles to it. I feel that this PR just sweeps the problem under the carpet - I'd much rather have proper cleanup, or even just a "retry untl timeout" design in the diskutils teardown.

Thinking about it again now, n.2 above (local test failures due to open file handles to an NFS directory) are problably the same issue as the failures on Windows CI.


with dask.config.set(
{
"local_directory": tempfile.gettempdir(),
Copy link
Member Author

@fjetter fjetter Jun 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crusaderky I'm still setting the local_directory option to this value, e.g. on my machine this yields

In [2]: import tempfile

In [3]: tempfile.gettempdir()
Out[3]: '/var/folders/h0/kd1gdptx7gzb6cbywplfrx1r0000gn/T'

which is the same location as TMPDIR so on linux you'll likely get a /tmp directory.

▶ echo $TMPDIR
/var/folders/h0/kd1gdptx7gzb6cbywplfrx1r0000gn/T/

when using tempfile.TemporaryDirectory it will also create a directory in this folder. The only actual change I am introducing is that the gen_cluster util will not actually create the directory and own the directory but instead the workers will.
We should get all the same benefits from this as with the TemporaryDirectory but we will not collide during exit/cleanup

Copy link
Member Author

@fjetter fjetter Jun 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diskutils is

  • creating the directory
    def __init__(self, workspace, name=None, prefix=None):
    assert name is None or prefix is None
    if name is None:
    self.dir_path = tempfile.mkdtemp(prefix=prefix, dir=workspace.base_dir)
    else:
    self.dir_path = os.path.join(workspace.base_dir, name)
    os.mkdir(self.dir_path) # it shouldn't already exist
  • creating a bunch of lock files, e.g. global and purge lock and a directory specific lock
    self._lock_file = locket.lock_file(self._lock_path)
  • Most importantly, it is also cleaning up the workspace again
    def release(self):
    """
    Dispose of this directory.
    """
    self._finalizer()
    @classmethod
    def _finalize(cls, workspace, lock_path, lock_file, dir_path):
    try:
    workspace._purge_directory(dir_path)
    finally:
    if lock_file is not None:
    lock_file.release()
    if lock_path is not None:
    workspace._known_locks.remove(lock_path)
    safe_unlink(lock_path)
    which may conflict with the TemporaryDirectory context manager on exit, particularly when it is called as a finalizer

In a follow up step, I also think we can think of disabling purging on CI since we don't really care about his but this may significantly slow us down (e.g. introduce a config option)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The worker is initializing WorksSpace and directory here

if not local_directory:
local_directory = dask.config.get("temporary-directory") or os.getcwd()
os.makedirs(local_directory, exist_ok=True)
local_directory = os.path.join(local_directory, "dask-worker-space")
with warn_on_duration(
"1s",
"Creating scratch directories is taking a surprisingly long time. ({duration:.2f}s) "
"This is often due to running workers on a network file system. "
"Consider specifying a local-directory to point workers to write "
"scratch data to a local disk.",
):
self._workspace = WorkSpace(os.path.abspath(local_directory))
self._workdir = self._workspace.new_work_dir(prefix="worker-")
self.local_directory = self._workdir.dir_path

where we read the temporary-directory variable in. the os.getcwd() there should probably be replaced with a tempfile.gettempdir() as well

@crusaderky crusaderky merged commit 1bfd99c into dask:main Jul 1, 2022
@fjetter fjetter deleted the do_not_use_tempfile_in_utils_test branch July 4, 2022 09:02
@fjetter
Copy link
Member Author

fjetter commented Jul 4, 2022

Interestingly, this PR seemed to have fixed test_chaos_rechunk but introduced failures in #6393 (comment)

image
image

It also had an effect on distributed.diagnostics.tests.test_cluster_dump_plugin.test_cluster_dump_plugin where the failure case is still the same but the rate drastically reduced

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants