Hybridized SoftReLU on the GPU is not numerically stable #17844
Description
This is a strange issue that causes the SoftReLU activation function to overflow even for fairly small input values. It only happens when the activation is executed as part of a hybridized graph on the GPU.
To Reproduce
import mxnet as mx
import numpy as np

class SoftReLU(mx.gluon.HybridBlock):
    def hybrid_forward(self, F, x):
        x = F.identity(x)
        x = F.Activation(x, act_type='softrelu')
        x = F.identity(x)
        return x

fun_normal = SoftReLU()
fun_hybrid = SoftReLU()
fun_hybrid.hybridize()

inp = 100
for f_name, fun in [('Normal', fun_normal), ('Hybrid', fun_hybrid)]:
    for c_name, ctx in [('CPU', mx.cpu()), ('GPU', mx.gpu())]:
        for dtype in [np.float16, np.float32, np.float64]:
            data = mx.nd.array([inp], ctx=ctx, dtype=dtype)
            result = fun(data)
            i_dtype = np.dtype(dtype).name
            o_dtype = np.dtype(result.dtype).name
            print(f"{f_name} on context {c_name} ({i_dtype} => {o_dtype}) SoftReLU({inp}) = {result.asnumpy().item()}")

which prints
Normal on context CPU (float16 => float16) SoftReLU(100) = 100.0
Normal on context CPU (float32 => float32) SoftReLU(100) = 100.0
Normal on context CPU (float64 => float64) SoftReLU(100) = 100.0
Normal on context GPU (float16 => float16) SoftReLU(100) = 100.0
Normal on context GPU (float32 => float32) SoftReLU(100) = 100.0
Normal on context GPU (float64 => float64) SoftReLU(100) = 100.0
Hybrid on context CPU (float16 => float16) SoftReLU(100) = 100.0
Hybrid on context CPU (float32 => float32) SoftReLU(100) = 100.0
Hybrid on context CPU (float64 => float64) SoftReLU(100) = 100.0
Hybrid on context GPU (float16 => float16) SoftReLU(100) = inf
Hybrid on context GPU (float32 => float32) SoftReLU(100) = inf
Hybrid on context GPU (float64 => float64) SoftReLU(100) = inf
I suspect that the hybridized GPU implementation doesn't use the log-sum-exp trick for numerical stability.
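To illustrate what I mean, here is a minimal NumPy sketch (plain NumPy, not MXNet code, and only my assumption of what the kernel might be doing) comparing a naive softplus with the stable log-sum-exp formulation:

import numpy as np

x = np.float32(100.0)

# Naive softplus: log(1 + exp(x)) overflows, because exp(100) exceeds the float32 maximum.
naive = np.log1p(np.exp(x))                                            # -> inf

# Stable softplus: max(x, 0) + log1p(exp(-|x|)) never exponentiates a large positive value.
stable = np.maximum(x, np.float32(0)) + np.log1p(np.exp(-np.abs(x)))   # -> 100.0

print(naive, stable)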
I am not 100% sure why, but if the SoftReLU sits right at the start or end of the graph, the issue doesn't happen, so I had to wrap it in F.identity no-ops. Any other operation works just as well, so it's not specific to identity; maybe the activation doesn't get hybridized properly when it sits at the edge of the graph.
Environment
I've reproduced this on two machines with different MXNet versions:
mxnet 1.5.1 installed from pip
curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python
----------Python Info----------
Version : 3.7.5
Compiler : GCC 8.3.0
Build : ('default', 'Nov 7 2019 10:50:52')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 20.0.2
Directory : /usr/local/lib/python3.7/dist-packages/pip
----------MXNet Info-----------
Version : 1.5.1
Directory : /usr/local/lib/python3.7/dist-packages/mxnet
Num GPUs : 6
Commit Hash : c9818480680f84daa6e281a974ab263691302ba8
----------System Info----------
Platform : Linux-4.20.0-042000-generic-x86_64-with-Ubuntu-18.04-bionic
system : Linux
node : redacted
release : 4.20.0-042000-generic
version : #201812232030 SMP Mon Dec 24 01:32:58 UTC 2018
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Stepping: 4
CPU MHz: 3210.788
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 6000.01
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0012 sec, LOAD: 0.7693 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0006 sec, LOAD: 0.6054 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.2659 sec, LOAD: 0.2331 sec.
Timing for D2L: http://d2l.ai, DNS: 0.1060 sec, LOAD: 0.0806 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.2956 sec, LOAD: 0.0992 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.2427 sec, LOAD: 0.9397 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0194 sec, LOAD: 0.5106 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.2071 sec, LOAD: 0.9205 sec.
mxnet built from source from commit 34010ea (recent master)
curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python
----------Python Info----------
Version : 3.8.1
Compiler : GCC 9.2.0
Build : ('default', 'Jan 22 2020 06:38:00')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 19.3.1
Directory : /usr/lib/python3.8/site-packages/pip
----------MXNet Info-----------
Version : 1.6.0
Directory : /usr/lib/python3.8/site-packages/mxnet
Num GPUs : 1
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-5.4.24-1-MANJARO-x86_64-with-glibc2.2.5
system : Linux
node : redacted
release : 5.4.24-1-MANJARO
version : #1 SMP PREEMPT Thu Mar 5 20:29:25 UTC 2020
----------Hardware Info----------
machine : x86_64
processor :
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping: 9
CPU MHz: 3294.305
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
BogoMIPS: 5602.18
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0233 sec, LOAD: 0.8981 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0160 sec, LOAD: 0.5953 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0159 sec, LOAD: 0.2458 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0401 sec, LOAD: 0.3481 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0384 sec, LOAD: 0.2148 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.1041 sec, LOAD: 0.2033 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0381 sec, LOAD: 1.0081 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0170 sec, LOAD: 0.3241 sec.
Research
There was a similar issue with mkldnn, but in this case the GPU implementation is wrong, not the CPU one.
I've looked into the source code: the mshadow implementation looks correct, and the cuDNN implementation looks like it's supposed to fall back on mshadow?
I am honestly not sure what is going wrong here. Maybe there is some graph optimization at play? That would somewhat explain why the regular GPU forward pass works fine while the hybridized one fails.
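As a possible workaround, the stable softplus can also be written by hand inside a HybridBlock instead of using act_type='softrelu'. This is only a sketch, assuming the elementwise F.relu / F.log1p / F.exp / F.abs operators stay numerically stable under hybridization:

import mxnet as mx

class StableSoftReLU(mx.gluon.HybridBlock):
    """Softplus written as max(x, 0) + log1p(exp(-|x|)) so nothing large is exponentiated."""
    def hybrid_forward(self, F, x):
        # relu(x) == max(x, 0); exp(-|x|) is always <= 1, so it cannot overflow.
        return F.relu(x) + F.log1p(F.exp(-F.abs(x)))

block = StableSoftReLU()
block.hybridize()
print(block(mx.nd.array([100.0], ctx=mx.gpu())))  # expected: [100.]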