Dear authors,
Congratulations on the very nice work! I ran your code in the SGDET setting and got an assertion error. Specifically, I ran this command:
CUDA_VISIBLE_DEVICES=6 \
python tools/relation_train_net.py \
--config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" \
MODEL.ROI_RELATION_HEAD.USE_GT_BOX False \
MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL False \
MODEL.ROI_RELATION_HEAD.PREDICTOR RUNetPredictor \
SOLVER.IMS_PER_BATCH 1 \
TEST.IMS_PER_BATCH 1 \
DTYPE "float16" \
SOLVER.PRE_VAL True \
SOLVER.BASE_LR 0.0025 \
MODEL.ROI_RELATION_HEAD.L21_LOSS 0.7 \
MODEL.PRETRAINED_DETECTOR_CKPT ~/checkpoints/pretrained_faster_rcnn/model_final.pth \
OUTPUT_DIR ~/checkpoints/runet-sgdet
and I got the exception:
maskrcnn_benchmark INFO: -------------------------------
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Traceback (most recent call last):
  File "tools/relation_train_net.py", line 379, in <module>
    main()
  File "tools/relation_train_net.py", line 372, in main
    model = train(cfg, args.local_rank, args.distributed, logger)
  File "tools/relation_train_net.py", line 147, in train
    loss_dict = model(images, targets)
  File "/anaconda3/envs/ru_net/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/anaconda3/envs/ru_net/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 447, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/anaconda3/envs/ru_net/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/ru_net/RU-Net/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 52, in forward
    x, result, detector_losses = self.roi_heads(features, proposals, targets, logger)
  File "/anaconda3/envs/ru_net/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/ru_net/RU-Net/maskrcnn_benchmark/modeling/roi_heads/roi_heads.py", line 69, in forward
    x, detections, loss_relation = self.relation(features, detections, targets, logger)
  File "/anaconda3/envs/ru_net/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/ru_net/RU-Net/maskrcnn_benchmark/modeling/roi_heads/relation_head/relation_head.py", line 94, in forward
    refine_logits, relation_logits, add_losses = self.predictor(proposals, rel_pair_idxs, full_pair_idxs, rel_labels, rel_binarys, roi_features, union_features, logger)
  File "/anaconda3/envs/ru_net/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/ru_net/RU-Net/maskrcnn_benchmark/modeling/roi_heads/relation_head/roi_relation_predictors.py", line 819, in forward
    assert bool((rel_pair_idx == pair_idx[vr_indices]).all())
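The bare assertion does not say which pairs disagree, so a small diagnostic along the following lines might help narrow the failure down. This is only a sketch: `describe_pair_mismatch` is a hypothetical helper that is not part of RU-Net, and it assumes that `rel_pair_idx` and `pair_idx[vr_indices]` are integer tensors of (subject, object) index pairs that can be converted to nested lists with `.tolist()`.

```python
def describe_pair_mismatch(rel_pairs, selected_pairs, max_rows=5):
    """Report where two lists of (subject, object) index pairs differ.

    Hypothetical helper, not part of RU-Net. Both arguments are plain
    nested lists, e.g. rel_pair_idx.tolist() and
    pair_idx[vr_indices].tolist() (assumed to be tensors of index pairs).
    """
    # Different lengths mean the two pair sets cannot line up row by row.
    if len(rel_pairs) != len(selected_pairs):
        return f"length mismatch: {len(rel_pairs)} vs {len(selected_pairs)}"

    # Collect the row indices whose pairs are not identical.
    bad = [
        i
        for i, (a, b) in enumerate(zip(rel_pairs, selected_pairs))
        if list(a) != list(b)
    ]
    if not bad:
        return "all pairs match"

    # Show only the first few differing rows to keep the log readable.
    lines = [f"{len(bad)} of {len(rel_pairs)} pairs differ"]
    for i in bad[:max_rows]:
        lines.append(f"row {i}: {list(rel_pairs[i])} != {list(selected_pairs[i])}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy data standing in for the real tensors.
    print(describe_pair_mismatch([[0, 1], [2, 3]], [[0, 1], [3, 2]]))
```

Logging the returned string just before line 819 of roi_relation_predictors.py, with the two tensors converted via `.tolist()`, would show whether the pairs differ in number or only in order or content.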
Note that I get the same assertion error with multiple GPUs, i.e., when running this command:
CUDA_VISIBLE_DEVICES=6,7 \
python -m torch.distributed.launch \
--master_port 15026 \
--nproc_per_node=2 \
tools/relation_train_net.py \
--config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" \
MODEL.ROI_RELATION_HEAD.USE_GT_BOX False \
MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL False \
MODEL.ROI_RELATION_HEAD.PREDICTOR RUNetPredictor \
SOLVER.IMS_PER_BATCH 2 \
TEST.IMS_PER_BATCH 2 \
DTYPE "float16" \
SOLVER.PRE_VAL True \
SOLVER.BASE_LR 0.0025 \
MODEL.ROI_RELATION_HEAD.L21_LOSS 0.7 \
MODEL.PRETRAINED_DETECTOR_CKPT ~/checkpoints/pretrained_faster_rcnn/model_final.pth \
OUTPUT_DIR ~/checkpoints/runet-sgdet-2gpus
Any suggestions for fixing the issue?
Many thanks!