Slow CPU inference in Gluon GRU module

## Description
Gluon.GRU is slow on the CPU compared to ndarray.RNN in GRU mode for the same input.

## Environment info
Deep Learning AMI 19, Tesla V100
```
----------Python Info----------
Version      : 3.7.1
Compiler     : GCC 7.3.0
Build        : ('default', 'Oct 23 2018 19:19:42')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 18.1
Directory    : /home/ec2-user/anaconda3/envs/gmarek_mx13/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /home/ec2-user/anaconda3/envs/gmarek_mx13/lib/python3.7/site-packages/mxnet
Commit Hash   : b45e1273ece8eba1a011107ce12032af58efe661
----------System Info----------
Platform     : Linux-4.14.77-70.59.amzn1.x86_64-x86_64-with-glibc2.10
system       : Linux
node         : ip-172-31-44-214
release      : 4.14.77-70.59.amzn1.x86_64
version      : #1 SMP Mon Nov 12 22:02:45 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2701.073
BogoMIPS:              4600.18
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0018 sec, LOAD: 0.7860 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0006 sec, LOAD: 0.5938 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0006 sec, LOAD: 0.0175 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0004 sec, LOAD: 1.0119 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0114 sec, LOAD: 0.4352 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0004 sec, LOAD: 0.0866 sec.
```

## Minimum reproducible example
```
from time import time

import mxnet as mx
from mxnet import nd
from mxnet import gluon
from mxnet.gluon import rnn

inp_dim = 1024
hid_dim = 1024
n_layers = 1
n_parameters = (inp_dim * hid_dim + hid_dim + hid_dim * hid_dim + hid_dim) * 3
n_steps = 100

for ctx in [mx.cpu(), mx.gpu()]:
    gru_params = nd.random.uniform(low=-1, high=1, shape=(n_parameters,), ctx=ctx)
    gru_ndarray = lambda x, h_0: nd.RNN(x, gru_params, h_0, num_layers=n_layers,
                                        state_size=hid_dim, mode='gru', state_outputs=True)
    gru_gluon = rnn.GRU(hid_dim, n_layers, input_size=inp_dim)
    gru_gluon.collect_params().initialize(ctx=ctx)
    gru_gluon.hybridize()

    x = nd.random_normal(0, 1, (1, 1, inp_dim), ctx=ctx)
    h_0 = x

    # Warm-up call, just in case (only the Gluon block is warmed up here)
    _, _ = gru_gluon(x, h_0)
    nd.waitall()

    for method, gru in [('ndarray', gru_ndarray), ('gluon', gru_gluon)]:
        h = h_0
        start = time()
        for step in range(n_steps):
            _, h = gru(x, h)
            if method == 'gluon':
                h = h[0]
        nd.waitall()
        dt = time() - start
        print(ctx, method, dt)
```
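The `n_parameters` expression in the script can be checked by hand. The sketch below spells out the same arithmetic, assuming the four terms correspond to the input-to-hidden weights, hidden-to-hidden weights, and the two bias vectors of each of the three GRU gates in a single unidirectional layer. The helper name `gru_flat_param_count` and the term names are hypothetical, chosen only for illustration.

```python
def gru_flat_param_count(input_size, hidden_size, n_gates=3):
    """Size of the flat parameter vector for one unidirectional GRU layer.

    Assumption: each gate has an input-to-hidden weight matrix, a
    hidden-to-hidden weight matrix, and one bias vector for each of them.
    """
    i2h_weight = input_size * hidden_size
    h2h_weight = hidden_size * hidden_size
    i2h_bias = hidden_size
    h2h_bias = hidden_size
    return n_gates * (i2h_weight + h2h_weight + i2h_bias + h2h_bias)


# Same sizes as in the script: inp_dim = hid_dim = 1024
n_parameters = gru_flat_param_count(1024, 1024)
print(n_parameters)  # 6297600
```

For the sizes used here this gives 6,297,600 values, which matches the shape passed to `nd.random.uniform` in the script.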

## Steps to reproduce
Run the above script with `python`.

## Output
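Gluon.GRU is significantly slower than ndarray.RNN: roughly 66x on the CPU and about 3x on the GPU. Times are wall-clock seconds for 100 steps.

| device | method | time (s) |
|---|---|---|
| cpu(0) | ndarray | 0.07194805145263672 |
| cpu(0) | gluon | 4.735473394393921 |
| gpu(0) | ndarray | 0.013593673706054688 |
| gpu(0) | gluon | 0.04437994956970215 |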
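To help rule out measurement artifacts when re-running the comparison, each method could be timed with its own warm-up and a synchronization point on both sides of the timed loop. The following is a minimal sketch, not part of the original script: `bench` is a hypothetical helper, and the stand-in step function is there only so the example runs without MXNet.

```python
from time import perf_counter


def bench(step, state, n_steps=100, n_warmup=5, sync=lambda: None):
    """Time n_steps chained calls of step(state) -> state after a warm-up.

    sync() is called right before the clock starts and right before it
    stops, so work queued asynchronously is counted inside the timed
    region. With MXNet this would be nd.waitall.
    """
    for _ in range(n_warmup):
        state = step(state)
    sync()
    start = perf_counter()
    for _ in range(n_steps):
        state = step(state)
    sync()
    return perf_counter() - start, state


# Stand-in for a recurrent step: it just increments the state.
elapsed, final_state = bench(lambda h: h + 1, 0, n_steps=100, n_warmup=5)
print(f"{elapsed:.6f} s, final state {final_state}")
```

With the objects from the script, the step functions would presumably be `lambda h: gru_ndarray(x, h)[1]` and `lambda h: gru_gluon(x, h)[1][0]`, with `sync=nd.waitall`. That wiring is untested here.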

