Trying to resume training `enc_dec_nmt` fails #4224

**Describe the bug**

I used `enc_dec_nmt.py` to build an NLP/MT model based on `aayn_base.yml` (`nemo:1.8.2` based on `pytorch:22.04-py3`). Training was interrupted before reaching the final epoch. Trying to resume training from the last checkpoint by passing `+exp_manager.resume_if_exists=true` to the `enc_dec_nmt.py` call now fails with the following trace:

```python
Traceback (most recent call last):
  File "examples/nlp/machine_translation/enc_dec_nmt.py", line 147, in <module>
    main()
  File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "examples/nlp/machine_translation/enc_dec_nmt.py", line 140, in main
    trainer.fit(mt_model)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
    self.fit_loop.run()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 255, in on_advance_end
    self._run_validation()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 309, in _run_validation
    self.val_loop.run()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 211, in run
    output = self.on_run_end()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 187, in on_run_end
    self._evaluation_epoch_end(self._outputs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 309, in _evaluation_epoch_end
    self.trainer._call_lightning_module_hook("validation_epoch_end", output_or_outputs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py", line 493, in validation_epoch_end
    self.eval_epoch_end(outputs, 'val', self.global_rank)
  File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py", line 407, in eval_epoch_end
    if isinstance(outputs[0], dict):
IndexError: list index out of range
```
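The last frame indicates that `outputs` was an empty list when `eval_epoch_end` ran, so `outputs[0]` raised. Below is a minimal sketch of that failure mode and of one possible guard. It is independent of NeMo: both function names are hypothetical stand-ins, and only the `isinstance(outputs[0], dict)` check is taken from the traceback. Why the list is empty after resuming is not established here.

```python
# Illustration only: hypothetical stand-ins, not the NeMo implementation.
from typing import Any, List


def first_output_is_dict(outputs: List[Any]) -> bool:
    """Mirrors the check from the traceback: indexes without testing for emptiness."""
    return isinstance(outputs[0], dict)


def first_output_is_dict_guarded(outputs: List[Any]) -> bool:
    """Same check, but returns False instead of raising when the list is empty."""
    if not outputs:
        return False
    return isinstance(outputs[0], dict)


if __name__ == "__main__":
    try:
        first_output_is_dict([])
    except IndexError as err:
        # Same exception type and message as the last line of the trace.
        print(f"unguarded: IndexError: {err}")

    print("guarded:", first_output_is_dict_guarded([]))
```

A guard like this would only avoid the crash; whether skipping the evaluation step on empty outputs is the right behavior after a resume is a separate question.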

**Steps/Code to reproduce bug**

Train with `examples/nlp/machine_translation/enc_dec_nmt.py` long enough to produce at least one checkpoint, interrupt training, and rerun with `+exp_manager.resume_if_exists=true`.

**Expected behavior**

Training resumes from the last checkpoint. This used to work in `nemo:1.3.0`.

**Environment overview**

- `pytorch:22.04-py3`
- `nemo:1.8.2`

