Description
Gluon.GRU is much slower on the CPU than the ndarray.RNN GRU for the same input.
Environment info
Deep Learning AMI 19, Tesla V100
----------Python Info----------
Version : 3.7.1
Compiler : GCC 7.3.0
Build : ('default', 'Oct 23 2018 19:19:42')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 18.1
Directory : /home/ec2-user/anaconda3/envs/gmarek_mx13/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version : 1.5.0
Directory : /home/ec2-user/anaconda3/envs/gmarek_mx13/lib/python3.7/site-packages/mxnet
Commit Hash : b45e1273ece8eba1a011107ce12032af58efe661
----------System Info----------
Platform : Linux-4.14.77-70.59.amzn1.x86_64-x86_64-with-glibc2.10
system : Linux
node : ip-172-31-44-214
release : 4.14.77-70.59.amzn1.x86_64
version : #1 SMP Mon Nov 12 22:02:45 UTC 2018
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2701.073
BogoMIPS: 4600.18
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-7
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0018 sec, LOAD: 0.7860 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0006 sec, LOAD: 0.5938 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0006 sec, LOAD: 0.0175 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0004 sec, LOAD: 1.0119 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0114 sec, LOAD: 0.4352 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0004 sec, LOAD: 0.0866 sec.
Minimum reproducible example
from time import time

import mxnet as mx
from mxnet import nd
from mxnet import gluon
from mxnet.gluon import rnn

inp_dim = 1024
hid_dim = 1024
n_layers = 1
n_parameters = (inp_dim * hid_dim + hid_dim + hid_dim * hid_dim + hid_dim) * 3
n_steps = 100

for ctx in [mx.cpu(), mx.gpu()]:
    gru_params = nd.random.uniform(low=-1, high=1, shape=(n_parameters,), ctx=ctx)
    gru_ndarray = lambda x, h_0: nd.RNN(x, gru_params, h_0, num_layers=n_layers,
                                        state_size=hid_dim, mode='gru', state_outputs=True)
    gru_gluon = rnn.GRU(hid_dim, n_layers, input_size=inp_dim)
    gru_gluon.collect_params().initialize(ctx=ctx)
    gru_gluon.hybridize()
    x = nd.random_normal(0, 1, (1, 1, inp_dim), ctx=ctx)
    h_0 = x
    # JIC: warm-up
    _, _ = gru_gluon(x, h_0)
    nd.waitall()
    for method, gru in [('ndarray', gru_ndarray), ('gluon', gru_gluon)]:
        h = h_0
        start = time()
        for step in range(n_steps):
            _, h = gru(x, h)
            if method == 'gluon':
                h = h[0]
        nd.waitall()
        dt = time() - start
        print(ctx, method, dt)
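For reference, the `n_parameters` expression in the script can be checked on its own. The sketch below is plain Python and does not need MXNet. It assumes a single-layer, unidirectional GRU with three gates, where each gate has an input-to-hidden weight matrix, a hidden-to-hidden weight matrix, and two bias vectors. The helper name `gru_param_count` is hypothetical and not part of MXNet.

```python
def gru_param_count(inp_dim, hid_dim):
    """Flat parameter count for one unidirectional GRU layer.

    Assumption: 3 gates, each with an input-to-hidden weight matrix,
    a hidden-to-hidden weight matrix, and two bias vectors.
    """
    n_gates = 3
    i2h_weight = inp_dim * hid_dim
    h2h_weight = hid_dim * hid_dim
    biases = 2 * hid_dim  # one i2h bias and one h2h bias per gate
    return n_gates * (i2h_weight + h2h_weight + biases)


# Same expression as in the script above, for comparison.
inp_dim = 1024
hid_dim = 1024
script_value = (inp_dim * hid_dim + hid_dim + hid_dim * hid_dim + hid_dim) * 3

print(gru_param_count(inp_dim, hid_dim), script_value)
```

Both values come out to 6297600 for `inp_dim = hid_dim = 1024`, so the flat `gru_params` vector passed to `nd.RNN` has the size this layout implies.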
Steps to reproduce
Run the above script with python
Output
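Gluon.GRU is significantly slower than ndarray.RNN: about 66x slower on the CPU (4.74 s vs. 0.072 s for 100 steps) and about 3.3x slower on the GPU (0.044 s vs. 0.014 s).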
device,method,time:
cpu(0) ndarray 0.07194805145263672
cpu(0) gluon 4.735473394393921
gpu(0) ndarray 0.013593673706054688
gpu(0) gluon 0.04437994956970215
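To put these timings in per-step terms, the short sketch below derives the per-step latency and the gluon/ndarray slowdown from the numbers printed above. It is plain Python; the timings are copied from the output, and the helper names `slowdown` and `per_step_ms` are made up for illustration.

```python
# Wall-clock times in seconds, copied from the output above (n_steps = 100).
timings = {
    ("cpu", "ndarray"): 0.07194805145263672,
    ("cpu", "gluon"): 4.735473394393921,
    ("gpu", "ndarray"): 0.013593673706054688,
    ("gpu", "gluon"): 0.04437994956970215,
}
n_steps = 100


def slowdown(device):
    """How many times slower the gluon path is than the ndarray path."""
    return timings[(device, "gluon")] / timings[(device, "ndarray")]


def per_step_ms(device, method):
    """Average latency of a single GRU step, in milliseconds."""
    return timings[(device, method)] / n_steps * 1000.0


for device in ("cpu", "gpu"):
    print(
        device,
        "ndarray: %.3f ms/step" % per_step_ms(device, "ndarray"),
        "gluon: %.3f ms/step" % per_step_ms(device, "gluon"),
        "slowdown: %.1fx" % slowdown(device),
    )
```

The CPU gap (tens of milliseconds per step for gluon versus under a millisecond for ndarray) is far larger than the GPU gap, which is why this report focuses on the CPU path.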