Describe the bug
exp_manager() calls check_resume(), which moves existing files into a run_x folder (link to code).
However, this may cause issues when those files are being accessed by other processes. Here is an example stack trace:
File "my_script.py", line 94, in main
trainer.fit(model)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 565, in _fit_impl
ckpt_path = self._checkpoint_connector._select_ckpt_path(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 108, in _select_ckpt_path
ckpt_path = self._parse_ckpt_path(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 121, in _parse_ckpt_path
if ckpt_path is None and SLURMEnvironment.detect() and self._hpc_resume_path is not None:
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 60, in _hpc_resume_path
max_version = self.__max_ckpt_version_in_folder(dir_path_hpc, "hpc_ckpt_")
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 537, in __max_ckpt_version_in_folder
files = [os.path.basename(f["name"]) for f in fs.listdir(uri)]
File "/usr/local/lib/python3.10/dist-packages/fsspec/spec.py", line 1448, in listdir
return self.ls(path, detail=detail, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/fsspec/implementations/local.py", line 64, in ls
return [self.info(f) for f in it]
File "/usr/local/lib/python3.10/dist-packages/fsspec/implementations/local.py", line 64, in <listcomp>
return [self.info(f) for f in it]
File "/usr/local/lib/python3.10/dist-packages/fsspec/implementations/local.py", line 75, in info
out = path.stat(follow_symlinks=False)
FileNotFoundError: [Errno 2] No such file or directory: '/path/to/log_dir/nemo_log_globalrank-81_localrank-1.txt'
Note that in this example trainer.fit(model) is called after exp_manager(), but since execution is asynchronous across processes, one process may run it while rank 0 is still moving files around.
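For context, here is a hypothetical minimal version of the calling pattern above. It is only a sketch: the model, data, trainer arguments, and exp_manager config values are placeholders, not taken from the real my_script.py.

```python
# Hypothetical minimal script illustrating the calling pattern in which the
# race can occur. Model, data, and config values are placeholders.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from nemo.utils.exp_manager import exp_manager


class DummyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


def main():
    trainer = pl.Trainer(devices=2, num_nodes=1, strategy="ddp", max_steps=10)
    # check_resume() runs inside exp_manager(); on rank 0 it may still be
    # moving existing log files into a run_x folder ...
    exp_manager(trainer, {"exp_dir": "/tmp/nemo_exp", "resume_if_exists": True})
    # ... while another rank has already reached fit() and is listing the
    # same directory inside PTL's checkpoint connector.
    data = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)
    trainer.fit(DummyModel(), data)


if __name__ == "__main__":
    main()
```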
Steps/Code to reproduce bug
This is tricky to reproduce reliably, since the failure is somewhat random and depends on how fast your filesystem is.
Environment overview (please complete the following information)
- Environment location: Docker (recent image)
Additional context
To fix this, my first suggestion would be to add some kind of barrier around the code that moves files. It's a bit tricky though, since at this point torch.distributed is not initialized, and I'm not even sure we can tell how many processes are running. I haven't given it much thought yet, but please let me know if you have a better idea!
Edit: actually, at least on SLURM we can tell how many processes there are from trainer.num_nodes * trainer.num_devices. We can't assume this in general, because if we run locally, for instance, PTL only spawns the additional processes after exp_manager() is called (but in that case there should be no risk of this kind of conflict between processes). I am thus planning to submit a SLURM-only fix, with a filesystem-based synchronization between processes; a rough sketch is below. If anyone has a better idea, please chime in!
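For reference, here is roughly what I have in mind. This is not an actual implementation: the function name and marker-file scheme are made up, and the world size is read from SLURM_NTASKS / SLURM_PROCID as an alternative to trainer.num_nodes * trainer.num_devices.

```python
# Rough sketch of a filesystem-based barrier that could be called right
# after the file-moving logic in check_resume(). Not NeMo code: the function
# name, marker-file scheme, and env-var choices are illustrative assumptions.
import os
import time
from pathlib import Path


def filesystem_barrier(barrier_dir: str, timeout: float = 300.0) -> None:
    """Block until all SLURM tasks have created their marker file."""
    world_size = int(os.environ.get("SLURM_NTASKS", "1"))
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    if world_size <= 1:
        return  # single process, nothing to synchronize

    barrier_path = Path(barrier_dir)
    barrier_path.mkdir(parents=True, exist_ok=True)
    # Each rank drops a marker named after its global rank ...
    (barrier_path / f"rank_{rank}.done").touch()

    # ... then waits until markers from all ranks are present.
    deadline = time.monotonic() + timeout
    while len(list(barrier_path.glob("rank_*.done"))) < world_size:
        if time.monotonic() > deadline:
            raise TimeoutError(f"Filesystem barrier timed out after {timeout}s")
        time.sleep(0.5)
```

All ranks would call this at the same point in exp_manager(), so that rank 0 only releases the others once it has finished moving files; cleanup of the marker files between runs is left out of the sketch.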