Enhancement: NaN stops evaluation in detection with a small dataset #6478

@mingxin-zheng

Describe the bug
When the bundle monai_lung_nodule_ct_detection_v0.5.5 is trained on a relatively small dataset and the training data is not good enough, the classification output in the DetectionEvaluator may become NaN. The evaluation then fails with an exception, and training does not continue.

The issue is mitigated by setting amp=False.
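
The report does not say why disabling AMP helps. One common cause, assumed here rather than confirmed for this bundle, is that mixed precision runs parts of the forward pass in float16, whose largest finite value is 65504. Larger activations overflow to Inf, and later arithmetic on Inf can produce NaN. The sketch below uses only the Python standard library to illustrate that range limit; `fits_in_float16` is a hypothetical helper, not part of MONAI or PyTorch.

```python
import math
import struct

# Largest finite value representable in IEEE 754 half precision (float16).
FLOAT16_MAX = 65504.0


def fits_in_float16(value: float) -> bool:
    """Return True if `value` is finite and can be stored as a finite float16."""
    if not math.isfinite(value):
        return False
    try:
        # The "e" format code packs a float as IEEE 754 half precision and
        # raises OverflowError when the value is too large for that format.
        struct.pack("<e", value)
    except OverflowError:
        return False
    return True


# Once a value has overflowed to Inf, simple arithmetic can turn it into NaN.
overflowed = float("inf")
becomes_nan = math.isnan(overflowed - overflowed)
```

Under this assumption, setting amp=False keeps the computation in float32, which has a far larger range, at the cost of more memory and slower training.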

Output:
During the evaluation after the first epoch (training used a subset of the data), the evaluation fails intermittently, as shown below:
2023-04-26 06:32:17,370 - ignite.engine.engine.DetectionTrainer - INFO - Epoch: 1/2, Iter: 35/39 -- train_loss: 1.4668
2023-04-26 06:32:17,792 - ignite.engine.engine.DetectionTrainer - INFO - Epoch: 1/2, Iter: 36/39 -- train_loss: 5.2734
2023-04-26 06:32:19,089 - ignite.engine.engine.DetectionTrainer - INFO - Epoch: 1/2, Iter: 37/39 -- train_loss: 1.5029
2023-04-26 06:32:20,773 - ignite.engine.engine.DetectionTrainer - INFO - Epoch: 1/2, Iter: 38/39 -- train_loss: 0.9600
2023-04-26 06:32:21,192 - ignite.engine.engine.DetectionTrainer - INFO - Epoch: 1/2, Iter: 39/39 -- train_loss: 1.0332
2023-04-26 06:32:21,193 - ignite.engine.engine.DetectionTrainer - INFO - Current learning rate: 0.01
2023-04-26 06:32:21,193 - ignite.engine.engine.DetectionEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2023-04-26 06:32:24,229 - ignite.engine.engine.DetectionEvaluator - ERROR - Current run is terminating due to exception: cls_logits is NaN or Inf.
2023-04-26 06:32:24,229 - ignite.engine.engine.DetectionEvaluator - ERROR - Exception: cls_logits is NaN or Inf.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ignite/engine/engine.py", line 1068, in _run_once_on_dataset_as_gen
    self.state.output = self._process_function(self, self.state.batch)
  File "/usr/local/lib/python3.8/dist-packages/monai/engines/evaluator.py", line 302, in _iteration
    engine.state.output[Keys.PRED] = engine.inferer(inputs, engine.network, *args, **kwargs)
  File "/workspace/bundles/monai_lung_nodule_ct_detection_v0.5.5/scripts/detection_inferer.py", line 59, in __call__
    return self.detector(inputs, use_inferer=use_inferer, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/monai/apps/detection/networks/retinanet_detector.py", line 524, in forward
    head_outputs = predict_with_inferer(
  File "/usr/local/lib/python3.8/dist-packages/monai/apps/detection/utils/predict_utils.py", line 136, in predict_with_inferer
    head_outputs_sequence = inferer(images, _network_sequence_output, network, keys=keys)
  File "/usr/local/lib/python3.8/dist-packages/monai/inferers/inferer.py", line 468, in __call__
    return sliding_window_inference(
  File "/usr/local/lib/python3.8/dist-packages/monai/inferers/utils.py", line 223, in sliding_window_inference
    seg_prob_out = predictor(win_data, *args, **kwargs) # batched patch
  File "/usr/local/lib/python3.8/dist-packages/monai/apps/detection/utils/predict_utils.py", line 75, in _network_sequence_output
    head_outputs = network(images)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/monai/apps/detection/networks/retinanet_network.py", line 339, in forward
    head_outputs = {self.cls_key: self.classification_head(feature_maps)}
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/monai/apps/detection/networks/retinanet_network.py", line 128, in forward
    raise ValueError("cls_logits is NaN or Inf.")
ValueError: cls_logits is NaN or Inf.
2023-04-26 06:32:26,276 - ignite.engine.engine.DetectionEvaluator - ERROR - Engine run is terminating due to exception: cls_logits is NaN or Inf.
2023-04-26 06:32:26,276 - ignite.engine.engine.DetectionEvaluator - ERROR - Exception: cls_logits is NaN or Inf.
Training then stopped.
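
Because the ValueError raised in the classification head ends the whole run, one possible direction for this enhancement is to treat a NaN/Inf failure during evaluation as a skipped evaluation rather than a fatal error. The following is only a minimal sketch of that idea in plain Python. `evaluate_or_skip` and its `fallback` parameter are hypothetical names and not part of MONAI or Ignite; the only detail taken from the log is the error message text.

```python
import logging

logger = logging.getLogger(__name__)


def evaluate_or_skip(run_evaluation, fallback: float = 0.0) -> float:
    """Call `run_evaluation()` and return its metric.

    If it raises the "NaN or Inf" ValueError seen in the log, log a warning
    and return `fallback` so the caller can carry on training. Any other
    exception is re-raised unchanged.
    """
    try:
        return run_evaluation()
    except ValueError as exc:
        if "NaN or Inf" not in str(exc):
            raise
        logger.warning("Skipping evaluation for this epoch: %s", exc)
        return fallback
```

Matching on the message text is fragile; a dedicated exception type would be cleaner, but that would require a change in the library itself.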
Training for more epochs before the first evaluation reduced the likelihood of this failure.
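
The workaround of training longer before the first evaluation can be expressed as a simple schedule. This is a sketch under stated assumptions: `should_validate`, `first_val_epoch`, and `val_interval` are hypothetical names used for illustration and are not options of the bundle.

```python
def should_validate(epoch: int, first_val_epoch: int, val_interval: int = 1) -> bool:
    """Return True if validation should run at the end of the 1-based `epoch`.

    Validation is skipped until `first_val_epoch`, then runs every
    `val_interval` epochs starting from that epoch.
    """
    if epoch < first_val_epoch:
        return False
    return (epoch - first_val_epoch) % val_interval == 0


# Epochs, out of the first ten, at which validation would run when it is
# delayed until epoch 5 and then repeated every 2 epochs.
validation_epochs = [e for e in range(1, 11) if should_validate(e, 5, 2)]
```

This only lowers the chance of hitting the error, as the report notes; it does not remove the underlying NaN.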
