Describe the bug
exp_manager() calls check_resume(), which moves existing files into a run_x folder (link to code).
However, this may cause issues when those files are being accessed by other processes. Here is an example stack trace:
File "my_script.py", line 94, in main
trainer.fit(model)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 565, in _fit_impl
ckpt_path = self._checkpoint_connector._select_ckpt_path(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 108, in _select_ckpt_path
ckpt_path = self._parse_ckpt_path(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 121, in _parse_ckpt_path
if ckpt_path is None and SLURMEnvironment.detect() and self._hpc_resume_path is not None:
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 60, in _hpc_resume_path
max_version = self.__max_ckpt_version_in_folder(dir_path_hpc, "hpc_ckpt_")
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 537, in __max_ckpt_version_in_folder
files = [os.path.basename(f["name"]) for f in fs.listdir(uri)]
File "/usr/local/lib/python3.10/dist-packages/fsspec/spec.py", line 1448, in listdir
return self.ls(path, detail=detail, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/fsspec/implementations/local.py", line 64, in ls
return [self.info(f) for f in it]
File "/usr/local/lib/python3.10/dist-packages/fsspec/implementations/local.py", line 64, in <listcomp>
return [self.info(f) for f in it]
File "/usr/local/lib/python3.10/dist-packages/fsspec/implementations/local.py", line 75, in info
out = path.stat(follow_symlinks=False)
FileNotFoundError: [Errno 2] No such file or directory: '/path/to/log_dir/nemo_log_globalrank-81_localrank-1.txt'
Note that in this example trainer.fit(model) is called after exp_manager(), but since execution is asynchronous across processes, one process may run it while rank 0 is still moving files around.
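For context, here is a hypothetical minimal version of the calling pattern above. It is only a sketch: the model, data, trainer arguments, and exp_manager config values are placeholders, not taken from the real my_script.py.

```python
# Hypothetical minimal script illustrating the calling pattern in which the
# race can occur. Model, data, and config values are placeholders.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from nemo.utils.exp_manager import exp_manager


class DummyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


def main():
    trainer = pl.Trainer(devices=2, num_nodes=1, strategy="ddp", max_steps=10)
    # check_resume() runs inside exp_manager(); on rank 0 it may still be
    # moving existing log files into a run_x folder ...
    exp_manager(trainer, {"exp_dir": "/tmp/nemo_exp", "resume_if_exists": True})
    # ... while another rank has already reached fit() and is listing the
    # same directory inside PTL's checkpoint connector.
    data = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)
    trainer.fit(DummyModel(), data)


if __name__ == "__main__":
    main()
```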
Steps/Code to reproduce bug
This is tricky to reproduce reliably, since the failure is somewhat random and depends on how fast your filesystem is.
Environment overview (please complete the following information)
- Environment location: Docker (recent image)
Additional context
To fix this, my first suggestion would be to add some kind of barrier around the code that moves files. It's a bit tricky though, since at this point torch.distributed is not initialized, and I'm not even sure we can tell how many processes are running. I haven't given it much thought yet, but please let me know if you have a better idea!
Edit: actually, at least on SLURM we can tell how many processes there are from trainer.num_nodes * trainer.num_devices. We can't assume this in general, because if we run locally, for instance, PTL only spawns the additional processes after exp_manager() is called (but in that case there should be no risk of this kind of conflict between processes). I am thus planning to submit a SLURM-only fix, with a filesystem-based synchronization between processes; a rough sketch is below. If anyone has a better idea, please chime in!
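For reference, here is roughly what I have in mind. This is not an actual implementation: the function name and marker-file scheme are made up, and the world size is read from SLURM_NTASKS / SLURM_PROCID as an alternative to trainer.num_nodes * trainer.num_devices.

```python
# Rough sketch of a filesystem-based barrier that could be called right
# after the file-moving logic in check_resume(). Not NeMo code: the function
# name, marker-file scheme, and env-var choices are illustrative assumptions.
import os
import time
from pathlib import Path


def filesystem_barrier(barrier_dir: str, timeout: float = 300.0) -> None:
    """Block until all SLURM tasks have created their marker file."""
    world_size = int(os.environ.get("SLURM_NTASKS", "1"))
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    if world_size <= 1:
        return  # single process, nothing to synchronize

    barrier_path = Path(barrier_dir)
    barrier_path.mkdir(parents=True, exist_ok=True)
    # Each rank drops a marker named after its global rank ...
    (barrier_path / f"rank_{rank}.done").touch()

    # ... then waits until markers from all ranks are present.
    deadline = time.monotonic() + timeout
    while len(list(barrier_path.glob("rank_*.done"))) < world_size:
        if time.monotonic() > deadline:
            raise TimeoutError(f"Filesystem barrier timed out after {timeout}s")
        time.sleep(0.5)
```

All ranks would call this at the same point in exp_manager(), so that rank 0 only releases the others once it has finished moving files; cleanup of the marker files between runs is left out of the sketch.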