MKLDNN batch_norm doesn't work with Large Tensor Support

## Description
When running `batch_norm` with large inputs, e.g.:
```
import mxnet as mx
from mxnet import np, npx

# Large input: 2 x 1,000,000,000 elements.
A = np.ones((2, 1000000000))
gamma = np.ones((2))
beta = np.zeros((2))
mov_mean = np.ones((2))
mov_var = np.ones((2))

# Track gradients for A, then run batch_norm forward and backward.
A.attach_grad()
with mx.autograd.record():
    B = npx.batch_norm(A, gamma, beta, mov_mean, mov_var)
print("output={}".format(B))
B.backward()
print("gradient={}".format(A.grad))
```

the program aborts with the following error:
```
(pytest) ubuntu@ip-172-31-0-156 ~/workspace/incubator-mxnet (mx2lts) $ python test_batch_norm.py
curr_path=/home/ubuntu/workspace/incubator-mxnet
sys_path=['/home/ubuntu/workspace/incubator-mxnet', '/home/ubuntu/workspace/incubator-mxnet/python', '/home/ubuntu/anaconda3/envs/pytest/lib/python38.zip', '/home/ubuntu/anaconda3/envs/pytest/lib/python3.8', '/home/ubuntu/anaconda3/envs/pytest/lib/python3.8/lib-dynload', '/home/ubuntu/anaconda3/envs/pytest/lib/python3.8/site-packages', '/home/ubuntu/workspace/incubator-mxnet/tests/python/unittest/']
[15:27:26] ../src/storage/storage.cc:198: Using Pooled (Naive) StorageManager for CPU
malloc_consolidate(): invalid chunk size
Aborted (core dumped)
```
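
For context on the input size, here is a small sketch of the arithmetic for the shape used above. It assumes 4 bytes per element (float32), which is an assumption and not something the script states. It does not establish the cause of the crash; it only shows which quantities fit in a signed 32-bit integer.

```python
# Shape taken from the reproduction script.
shape = (2, 1000000000)

# Assumption: 4 bytes per element (float32). Adjust if the dtype differs.
BYTES_PER_ELEMENT = 4

# Largest value representable in a signed 32-bit integer.
INT32_MAX = 2**31 - 1

# Total element count is the product of all dimensions.
num_elements = 1
for dim in shape:
    num_elements *= dim

# Buffer size in bytes under the assumed element width.
num_bytes = num_elements * BYTES_PER_ELEMENT

print("elements:", num_elements, "fits in int32:", num_elements <= INT32_MAX)
print("bytes:", num_bytes, "fits in int32:", num_bytes <= INT32_MAX)
```

Under that assumption the element count still fits in 32 bits, while the byte size does not.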

## To Reproduce
Build MXNet from source with Large Tensor Support enabled by setting the `USE_INT64_TENSOR_SIZE` flag to `ON`, then run the sample Python script above.
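
Before running the script, it may help to confirm that the build in use really has Large Tensor Support and MKLDNN enabled. The following is a minimal sketch that assumes the `mxnet.runtime.Features` API is available in the installed build; the helper name `feature_enabled` is hypothetical and only for illustration.

```python
def feature_enabled(name):
    """Return True/False for a build feature, or None if mxnet is not importable.

    Hypothetical helper; assumes mxnet.runtime.Features exists in this build.
    """
    try:
        from mxnet import runtime
    except ImportError:
        return None
    return bool(runtime.Features().is_enabled(name))


# Feature names as they appear in the "Build features" list of diagnose.py.
for feature in ("INT64_TENSOR_SIZE", "MKLDNN"):
    print(feature, "->", feature_enabled(feature))
```

On a build matching the environment below, both features would be expected to report as enabled.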

## Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:
```
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python

# paste outputs here
(pytest) ubuntu@ip-172-31-0-156 ~/workspace/incubator-mxnet (mx2lts) $ curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python
----------Python Info----------
Version      : 3.8.5
Compiler     : GCC 7.3.0
Build        : ('default', 'Aug  5 2020 08:36:46')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 20.2.2
Directory    : /home/ubuntu/anaconda3/envs/pytest/lib/python3.8/site-packages/pip
----------MXNet Info-----------
Version      : 2.0.0
Directory    : /home/ubuntu/workspace/incubator-mxnet/python/mxnet
Commit hash file "/home/ubuntu/workspace/incubator-mxnet/python/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library      : ['/home/ubuntu/workspace/incubator-mxnet/python/mxnet/../../build/libmxnet.so']
Build features:
✔ CUDA
✔ CUDNN
✔ NCCL
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✔ OPENMP
✖ SSE
✔ F16C
✖ JEMALLOC
✔ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✔ MKLDNN
✔ OPENCV
✖ DIST_KVSTORE
✔ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✔ DEBUG
✖ TVM_OP
----------System Info----------
Platform     : Linux-5.3.0-1032-aws-x86_64-with-glibc2.10
system       : Linux
node         : ip-172-31-0-156
release      : 5.3.0-1032-aws
version      : #34~18.04.2-Ubuntu SMP Fri Jul 24 10:06:28 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:            1
CPU MHz:             2456.934
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4600.00
Hypervisor vendor:   Xen
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-15,32-47
NUMA node1 CPU(s):   16-31,48-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0028 sec, LOAD: 0.4208 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1488 sec, LOAD: 0.2241 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.2317 sec, LOAD: 0.4453 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0043 sec, LOAD: 0.1646 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0047 sec, LOAD: 0.1006 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.01151585578918457 sec.
----------Environment----------
```

