This repository was archived by the owner on Nov 17, 2023. It is now read-only.
SSD Training fails with free pointer issue during end of training #19024
- The SSD training script fails at the end of training with either `free(): invalid pointer` or `corrupted size vs. prev_size`.
- I tried running the script with and without Horovod on a p3dn instance. Details are below:
1. Without Horovod:
Cmd:
python gluon-cv/scripts/detection/ssd/train_ssd.py --gpus 0,1,2,3,4,5,6,7 -j 32 --network resnet50_v1 --data-shape 512 --dataset coco --lr 0.001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 1 --batch-size 64 --log-interval 100 --val-interval 20 --save-interval 20
Failure:
free(): invalid pointer
Output log file: https://gist.github.com/karan6181/2ce3d8c68406aae5cd1e208aaf7dd5fd#file-mxnet_ssd-log
2. With Horovod:
Cmd:
horovodrun -np 8 python gluon-cv/scripts/detection/ssd/train_ssd.py -j 32 --network resnet50_v1 --data-shape 512 --dataset coco --lr 0.001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 1 --horovod --batch-size 64 --log-interval 100 --val-interval 20 --save-interval 20
Failure:
[1,1]<stderr>:corrupted size vs. prev_size
[1,1]<stderr>:[ip-100-64-13-241:09515] *** Process received signal ***
[1,1]<stderr>:[ip-100-64-13-241:09515] Signal: Aborted (6)
[1,1]<stderr>:[ip-100-64-13-241:09515] Signal code: (-6)
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7fb2d87948a0]
[1,1]<stderr>:[ip-100-64-13-241:09515] [1,1]<stderr>:[ 1] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fb2d83cff47]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 2] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fb2d83d18b1]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 3] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x89907)[0x7fb2d841a907]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 4] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x9097a)[0x7fb2d842197a]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 5] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x90b7c)[0x7fb2d8421b7c]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 6] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x94848)[0x7fb2d8425848]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 7] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x27d)[0x7fb2d842835d]
[1,1]<stderr>:[ip-100-64-13-241:09515] [1,1]<stderr>:[ 8] [1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/bin/../lib/libstdc++.so.6(_Znwm+0x15)[0x7fb269b344e5]
[1,1]<stderr>:[ip-100-64-13-241:09515] [ 9] [1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38b43cd)[0x7fb28d8dd3cd]
[1,1]<stderr>:[ip-100-64-13-241:09515] [10] [1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38ba8c6)[0x7fb28d8e38c6]
[1,1]<stderr>:[ip-100-64-13-241:09515] [11] [1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38bac16)[0x7fb28d8e3c16]
[1,1]<stderr>:[ip-100-64-13-241:09515] [12] [1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38bfe60)[0x7fb28d8e8e60]
Output log file: https://gist.github.com/karan6181/2ce3d8c68406aae5cd1e208aaf7dd5fd#file-mxnet_ssd_horovod_single_node-log
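The `libmxnet.so` frames in the trace above are unsymbolized, so they only show offsets such as `+0x38b43cd`. As a rough sketch (not part of the original report), the offsets can be pulled out of the log and turned into an `addr2line` invocation. The helper names `extract_offsets` and `addr2line_args` are hypothetical, and the approach assumes the installed `libmxnet.so` carries enough symbol information to resolve them; this build reports `DEBUG` as disabled, so `addr2line` may well print `??` for every frame.

```python
import re

# One backtrace frame looks like:
#   /path/to/libmxnet.so(+0x38b43cd)[0x7fb28d8dd3cd]
#   /path/to/libstdc++.so.6(_Znwm+0x15)[0x7fb269b344e5]
# Capture the library path, the optional symbol, the offset, and the address.
FRAME_RE = re.compile(
    r"(?P<lib>/[^\s()\[\]]+)"
    r"\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]"
)


def extract_offsets(log_text, library_name="libmxnet.so"):
    """Return (library_path, offsets) for the frames inside `library_name`.

    Hypothetical helper: library_path is None if no frame matches.
    """
    library_path = None
    offsets = []
    for match in FRAME_RE.finditer(log_text):
        lib = match.group("lib")
        # Compare only the file name so the install prefix does not matter.
        if lib.rsplit("/", 1)[-1] == library_name:
            library_path = lib
            offsets.append(match.group("offset"))
    return library_path, offsets


def addr2line_args(library_path, offsets):
    """Build an addr2line argument list (-C demangle, -f function names)."""
    return ["addr2line", "-C", "-f", "-e", library_path] + list(offsets)


if __name__ == "__main__":
    # Two frames copied from the Horovod failure log above.
    sample = (
        "[1,1]<stderr>:[ip-100-64-13-241:09515] [1,1]<stderr>:[ 8] "
        "[1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/bin/../lib/"
        "libstdc++.so.6(_Znwm+0x15)[0x7fb269b344e5]\n"
        "[1,1]<stderr>:[ip-100-64-13-241:09515] [ 9] "
        "[1,1]<stderr>:/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/"
        "site-packages/mxnet/libmxnet.so(+0x38b43cd)[0x7fb28d8dd3cd]\n"
    )
    path, offsets = extract_offsets(sample)
    if path is not None:
        # Print the command instead of running it; pass the list to
        # subprocess.run on the machine that has the same libmxnet.so.
        print(" ".join(addr2line_args(path, offsets)))
```

The command has to be run against the exact `libmxnet.so` binary that produced the trace, since the offsets are only meaningful for that build.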
GluonCV: 0.8.0 (build from source)
Horovod:
Horovod v0.19.5:
Available Frameworks:
[ ] TensorFlow
[ ] PyTorch
[X] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
MXNet Diagnosis:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
Stepping: 4
CPU MHz: 1200.041
BogoMIPS: 4999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
----------Python Info----------
Version : 3.6.10
Compiler : GCC 7.3.0
Build : ('default', 'Mar 25 2020 23:51:54')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 20.0.2
Directory : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version : 1.6.0
Directory : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit Hash : 6de57440b792dca716f1214a81edf557c345fddb
Library : ['/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so']
Build features:
✔ CUDA
✔ CUDNN
✔ NCCL
✔ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✔ OPENMP
✖ SSE
✔ F16C
✖ JEMALLOC
✔ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✔ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✔ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform : Linux-5.3.0-1032-aws-x86_64-with-debian-buster-sid
system : Linux
node : ip-100-64-13-241
release : 5.3.0-1032-aws
version : #34~18.04.2-Ubuntu SMP Fri Jul 24 10:06:28 UTC 2020
----------Hardware Info----------
machine : x86_64
processor : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0014 sec, LOAD: 0.3844 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0012 sec, LOAD: 0.0220 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0005 sec, LOAD: 0.0184 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0004 sec, LOAD: 0.1442 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0035 sec, LOAD: 0.0546 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.0004246234893798828 sec.
----------Environment----------
KMP_DUPLICATE_LIB_OK="True"
KMP_INIT_AT_FORK="FALSE"