The moving of files in exp_manager may cause crashes in other processes #7460

@odelalleau

Describe the bug

exp_manager() calls check_resume(), which moves the existing files into a run_x folder (link to code)

However, this can cause crashes when other processes access these files at the same time. Here is an example stack trace:

  File "my_script.py", line 94, in main
    trainer.fit(model)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 565, in _fit_impl
    ckpt_path = self._checkpoint_connector._select_ckpt_path(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 108, in _select_ckpt_path
    ckpt_path = self._parse_ckpt_path(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 121, in _parse_ckpt_path
    if ckpt_path is None and SLURMEnvironment.detect() and self._hpc_resume_path is not None:
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 60, in _hpc_resume_path
    max_version = self.__max_ckpt_version_in_folder(dir_path_hpc, "hpc_ckpt_")
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 537, in __max_ckpt_version_in_folder
    files = [os.path.basename(f["name"]) for f in fs.listdir(uri)]
  File "/usr/local/lib/python3.10/dist-packages/fsspec/spec.py", line 1448, in listdir
    return self.ls(path, detail=detail, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fsspec/implementations/local.py", line 64, in ls
    return [self.info(f) for f in it]
  File "/usr/local/lib/python3.10/dist-packages/fsspec/implementations/local.py", line 64, in <listcomp>
    return [self.info(f) for f in it]
  File "/usr/local/lib/python3.10/dist-packages/fsspec/implementations/local.py", line 75, in info
    out = path.stat(follow_symlinks=False)
FileNotFoundError: [Errno 2] No such file or directory: '/path/to/log_dir/nemo_log_globalrank-81_localrank-1.txt'

Note that in this example trainer.fit(model) is called after exp_manager(), but since execution is asynchronous across processes, one process may run it while rank 0 is still moving files around.

Steps/Code to reproduce bug

This is hard to reproduce reliably: the failure is a race between processes, so it occurs somewhat randomly and depends on how fast your filesystem is.
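Although the real failure is timing-dependent, the underlying pattern can be shown deterministically in a single process. The sketch below is only an illustration, not NeMo or PTL code: list_then_stat and simulate_race are hypothetical helpers that mimic the "list the directory, then stat each entry" sequence visible in the stack trace, with a callback standing in for rank 0 moving the file between the two steps.

```python
import os
import shutil
import tempfile


def list_then_stat(log_dir, between=None):
    """List a directory, then stat each entry as a separate second step."""
    entries = list(os.scandir(log_dir))  # step 1: snapshot of the names
    if between is not None:
        between()  # stands in for another process acting at this moment
    # step 2: stat every name from the snapshot
    return [os.lstat(entry.path).st_size for entry in entries]


def simulate_race():
    """Return the name of the exception raised when a file moves mid-listing."""
    with tempfile.TemporaryDirectory() as log_dir:
        log_file = os.path.join(log_dir, "nemo_log_globalrank-0_localrank-0.txt")
        with open(log_file, "w") as f:
            f.write("log line\n")

        def move_to_run_folder():
            # Hypothetical stand-in for the file move done on rank 0.
            run_dir = os.path.join(log_dir, "run_0")
            os.makedirs(run_dir)
            shutil.move(log_file, run_dir)

        try:
            list_then_stat(log_dir, between=move_to_run_folder)
        except FileNotFoundError as err:
            return type(err).__name__
    return "no error"


if __name__ == "__main__":
    print(simulate_race())
```

In a real job the move happens in a different process, so whether it lands between the two steps depends on scheduling and filesystem latency.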

Environment overview

  • Environment location: Docker (recent image)

Additional context

To fix this, my first suggestion would be to add some kind of barrier around the code that moves files. It's a bit tricky though since at this point torch.distributed is not initialized, and I'm not even sure if we can tell how many processes are running. I haven't given it much thought yet but please let me know if you have a better idea!

Edit: actually, at least on SLURM we can tell how many processes there are from trainer.num_nodes * trainer.num_devices. We can't always assume this holds: when running locally, for instance, PTL spawns additional processes after exp_manager() is called (but in that case there should be no risk of such a conflict between processes). I am thus planning to submit a SLURM-only fix, with filesystem-based synchronization between processes. If anyone has a better idea please chime in!
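To make the idea concrete, here is a minimal sketch of what a filesystem-based barrier could look like. It is not the actual fix: fs_barrier and run_on_rank_zero_then_sync are hypothetical names, and the sketch assumes that all ranks see the same shared filesystem, that sync_dir lives outside the folder being reorganized, and that sync_dir is unique per run (stale marker files from an earlier run would let the barrier pass too early).

```python
import time
from pathlib import Path


def fs_barrier(sync_dir, name, rank, world_size, timeout=60.0, poll=0.05):
    """Block until `world_size` ranks have reached the barrier called `name`.

    Each rank drops a marker file, then polls until all markers exist.
    """
    barrier_dir = Path(sync_dir) / name
    barrier_dir.mkdir(parents=True, exist_ok=True)
    (barrier_dir / f"rank_{rank}").touch()
    deadline = time.monotonic() + timeout
    while len(list(barrier_dir.glob("rank_*"))) < world_size:
        if time.monotonic() > deadline:
            raise TimeoutError(f"barrier {name!r} timed out on rank {rank}")
        time.sleep(poll)


def run_on_rank_zero_then_sync(move_fn, sync_dir, rank, world_size, **barrier_kwargs):
    """Run `move_fn` on rank 0 only; no rank returns before it has finished."""
    if rank == 0:
        move_fn()  # e.g. the code that moves files into run_x
    # Rank 0 only reaches the barrier after the move, so the other ranks
    # cannot get past this point while files are still being moved.
    fs_barrier(sync_dir, "after_move", rank, world_size, **barrier_kwargs)
```

On SLURM, rank could be read from the SLURM_PROCID environment variable and world_size computed as trainer.num_nodes * trainer.num_devices, as described above. Polling a shared filesystem adds some latency and needs a timeout so that a crashed rank does not hang the whole job.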

Labels: bug
