This repository was archived by the owner on Nov 17, 2023. It is now read-only.

MKLDNN softmax outputs NaN in mkldnn 0.14 #13141

@azai91

Description

Extremely negative softmax inputs produce NaN outputs. This bug has already been reported in MKLDNN (uxlfoundation/oneDNN#106) and fixed (https://gist.github.com/emfomenk/0386c529c5df21ae308b00d16454c48e) in MKLDNN v0.15+ (we are on v0.14).

The fix is to either:

  1. patch MKLDNN v0.14 with the upstream fix, or
  2. upgrade the MKLDNN version in mxnet (Update MKL-DNN dependency #12953).

Environment info (Required)

ubuntu@ip-172-31-3-217:~$ python diagnose.py
----------Python Info----------
Version      : 3.6.4
Compiler     : GCC 7.2.0
Build        : ('default', 'Jan 16 2018 18:10:19')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 9.0.1
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
/home/ubuntu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Version      : 1.3.0
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet
Commit Hash   : b3be92f4a48bce62a5a8424271871c2f81c8f7f1
----------System Info----------
Platform     : Linux-4.4.0-1065-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-3-217
release      : 4.4.0-1065-aws
version      : #75-Ubuntu SMP Fri Aug 10 11:14:32 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:              3
CPU MHz:               3000.000
BogoMIPS:              6000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-17,36-53
NUMA node1 CPU(s):     18-35,54-71
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0012 sec, LOAD: 0.4806 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1717 sec, LOAD: 0.5293 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1596 sec, LOAD: 0.3734 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0262 sec, LOAD: 0.1173 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0013 sec, LOAD: 0.3264 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0118 sec, LOAD: 0.0690 sec.

Package used (Python/R/Scala/Julia):
Python

For Scala user, please provide:

  1. Java version: (java -version)
  2. Maven version: (mvn -version)
  3. Scala runtime if applicable: (scala -version)

For R user, please provide R sessionInfo():

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):

MXNet commit hash:
6b5d9f9

Build config:
MKLDNN (pip install mxnet-mkl)

Error Message:

ubuntu@ip-172-31-3-217:~/incubator-mxnet$ python tt.py
/home/ubuntu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
[
[[[[nan nan]]]]
<NDArray 1x1x1x2 @cpu(0)>]

Minimum reproducible example

import mxnet as mx

# Extremely negative inputs trigger the NaN output from the MKLDNN softmax.
input_data = mx.nd.array([[[[-1e30, -1e30]]]])
data = mx.sym.Variable('data')
out1 = data.softmax(axis=1)
# 'data' is the only argument of this symbol, so no other arrays need to be bound.
exec1 = out1.bind(mx.cpu(), args={'data': input_data})
exec1.forward()[0].wait_to_read()
print(exec1.outputs)

Steps to reproduce

Run the script above with an MKLDNN build of MXNet (pip install mxnet-mkl).

What have you tried to solve it?

Applying this one-line fix (https://gist.github.com/emfomenk/0386c529c5df21ae308b00d16454c48e) to MKLDNN resolves the issue.
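For reference, here is a minimal pure-Python sketch of the values a numerically stable softmax should return for the input in the reproducible example. It is not MKLDNN's or MXNet's implementation. The helper name `stable_softmax` is hypothetical and only illustrates the usual subtract-the-maximum formulation.

```python
import math


def stable_softmax(values):
    """Softmax over a flat list, subtracting the maximum before exponentiating.

    Hypothetical reference helper, not part of MXNet or MKLDNN.
    """
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)  # at least 1.0, because exp(peak - peak) == 1.0
    return [e / total for e in exps]


# The example applies softmax along axis=1 of a 1x1x1x2 array,
# so each softmax slice contains a single element.
single = stable_softmax([-1e30])

# Along the last axis, both values would be normalized together.
pair = stable_softmax([-1e30, -1e30])

print(single, pair)
```

With this formulation the result is finite in both cases, which is the behavior to expect once MKLDNN is patched or upgraded.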
