Trying to resume training enc_dec_nmt fails #4224

@itzsimpl

Describe the bug

I used enc_dec_nmt.py to build an NLP/MT model based on aayn_base.yml (nemo:1.8.2 based on pytorch:22.04-py3). Training was interrupted before reaching the final epoch. Trying to resume from the last checkpoint by passing
+exp_manager.resume_if_exists=true to the enc_dec_nmt.py call fails with the following trace:

Traceback (most recent call last):
  File "examples/nlp/machine_translation/enc_dec_nmt.py", line 147, in <module>
    main()
  File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "examples/nlp/machine_translation/enc_dec_nmt.py", line 140, in main
    trainer.fit(mt_model)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
    self.fit_loop.run()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 255, in on_advance_end
    self._run_validation()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 309, in _run_validation
    self.val_loop.run()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 211, in run
    output = self.on_run_end()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 187, in on_run_end
    self._evaluation_epoch_end(self._outputs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 309, in _evaluation_epoch_end
    self.trainer._call_lightning_module_hook("validation_epoch_end", output_or_outputs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py", line 493, in validation_epoch_end
    self.eval_epoch_end(outputs, 'val', self.global_rank)
  File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py", line 407, in eval_epoch_end
    if isinstance(outputs[0], dict):
IndexError: list index out of range
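
The last frame shows `outputs[0]` being evaluated on an empty list inside `eval_epoch_end`. The trace does not establish why `outputs` is empty after resuming, so the following is only a minimal sketch of a defensive guard, not NeMo's actual code or fix. The helper name `normalize_eval_outputs` is hypothetical, and the wrapping branch is an assumption about what follows the `isinstance` check in the real method.

```python
from typing import Any, Dict, List, Union

# One dataloader yields a list of dicts; several yield a list of lists of dicts.
Output = Union[Dict[str, Any], List[Dict[str, Any]]]


def normalize_eval_outputs(outputs: List[Output]) -> List[Any]:
    """Return outputs grouped per dataloader, tolerating an empty list.

    Hypothetical helper: illustrates guarding the `outputs[0]` access
    that raises IndexError in the traceback above.
    """
    if not outputs:
        # No validation batches were collected; avoid indexing an empty list.
        return []
    if isinstance(outputs[0], dict):
        # Assumed behavior: a single dataloader is wrapped so that callers
        # can always iterate per dataloader.
        return [outputs]
    return outputs
```

With such a guard, an empty validation pass would produce an empty result instead of an exception; whether skipping metrics in that case is the right behavior is a separate question for the maintainers.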

Steps/Code to reproduce bug

Train with examples/nlp/machine_translation/enc_dec_nmt.py long enough to produce at least one checkpoint, interrupt training, and rerun with +exp_manager.resume_if_exists=true.
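
For reference, the rerun can be sketched as a small launcher that appends the resume override to the original call. The script path is taken from the traceback and the override from the description; `build_command` and its `extra_overrides` parameter are hypothetical, and any dataset or trainer overrides from the original run are assumed to be passed through unchanged.

```python
import shlex
import sys

# Script path as it appears in the traceback.
SCRIPT = "examples/nlp/machine_translation/enc_dec_nmt.py"
# Hydra override quoted in the bug description.
RESUME_OVERRIDE = "+exp_manager.resume_if_exists=true"


def build_command(extra_overrides=()):
    """Assemble the rerun command; extra_overrides stands in for the
    (unspecified) overrides used in the original training run."""
    return [sys.executable, SCRIPT, *extra_overrides, RESUME_OVERRIDE]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_command()))
```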

Expected behavior

Training resumes from last checkpoint. This used to work in nemo:1.3.0.

Environment overview (please complete the following information)

pytorch:22.04-py3
nemo:1.8.2

Metadata

Labels

bug: Something isn't working
