This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Autograd fails when using take operator repeatedly #11599

@junrushao

Description

Here is a strange example: a backward pass on the accumulated sum of a 2-D NDArray sc may cause autograd to fail, depending on the shape of sc.

Minimum reproducible example

TL;DR. This code calculates x = sum(sc[i, j] for each i, j) and then invokes a backward pass with x.backward(). The error message says that the take operator cannot propagate a gradient into its index input, but I never requested gradients for the non-differentiable index variables i and j.

import mxnet as mx

def _array(shape):
    """Create an NDArray with random entries and the given shape
    """
    return mx.nd.random.uniform(-1.0, 1.0, shape=shape, dtype="float32")

def index(sc, i, j):
    """Equivalent to sc[i, j] in numpy
    """
    return sc.take(i).squeeze(axis=0)  \
             .take(j).squeeze(axis=0)

# depending on the shape of `sc`, the code sometimes fails and sometimes does not
row_len, col_len = 2, 8

# the scanned variable
sc = _array((row_len, col_len))
sc.attach_grad()  # we only require the gradient w.r.t. `sc`

# `i`, `j` are loop variables which don't require grad
i = mx.nd.array([0], dtype="int64")
j = mx.nd.array([0], dtype="int64")

with mx.autograd.record(train_mode=True):
    xs = []
    for _ in range(row_len):
        x_i = []
        for _ in range(col_len):
            x_ij = index(sc, i, j)
            x_i.append(x_ij)
            j = j + 1
        i = i + 1
        j = j - col_len  # reset j
        xs.append(mx.nd.stack(*x_i))
    x = mx.nd.stack(*xs)
    x = x.sum()

x.backward()
print(sc.grad.asnumpy())   # the expected result should be all `1`s
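
As a sanity check on the expected output, here is a minimal sketch in plain Python (no MXNet) that visits every entry once, as the loops above do, and estimates the gradient by central finite differences. The names total, numeric_grad and sc_values are illustrative and not part of the original script; the 2 x 8 shape just mirrors row_len, col_len = 2, 8.

```python
def total(sc):
    """Sum every entry of a 2-D list exactly once, in (i, j) order."""
    return sum(sc[i][j] for i in range(len(sc)) for j in range(len(sc[0])))


def numeric_grad(f, sc, eps=1e-3):
    """Estimate d f / d sc[i][j] for every entry by central differences."""
    grad = []
    for i in range(len(sc)):
        grad_row = []
        for j in range(len(sc[i])):
            plus = [row[:] for row in sc]    # copy, then nudge one entry up
            minus = [row[:] for row in sc]   # copy, then nudge one entry down
            plus[i][j] += eps
            minus[i][j] -= eps
            grad_row.append((f(plus) - f(minus)) / (2 * eps))
        grad.append(grad_row)
    return grad


# Arbitrary values; the gradient of a plain sum does not depend on them.
sc_values = [[0.1 * (r * 8 + c) for c in range(8)] for r in range(2)]
expected_grad = numeric_grad(total, sc_values)
```

Each entry contributes to the sum exactly once, so every estimated partial derivative should be close to 1 (up to floating-point error), which is what sc.grad is expected to contain.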

Error Message

>> python2 test.py
/home/ubuntu/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "test.py", line 33, in <module>
    print(sc.grad.asnumpy())
  File "/home/ubuntu/Projects/mxnet/python/mxnet/ndarray/ndarray.py", line 1910, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/ubuntu/Projects/mxnet/python/mxnet/base.py", line 210, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [06:43:28] src/operator/tensor/./indexing_op.h:829: Check failed: req[take_::kIdx] == kNullOp (1 vs. 0) take layer doesn't support gradient into index

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/Projects/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f740cb6231b]
[bt] (1) /home/ubuntu/Projects/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f740cb62b58]
[bt] (2) /home/ubuntu/Projects/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::TakeOpBackward<mshadow::cpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x220) [0x7f740e46c710]
[bt] (3) /home/ubuntu/Projects/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x291) [0x7f740f26a7b1]
[bt] (4) /home/ubuntu/Projects/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::engine::ThreadedEngine::BulkAppend(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x68) [0x7f740f7a7538]
[bt] (5) /home/ubuntu/Projects/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::engine::ThreadedEngine::BulkAppend(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x47) [0x7f740f7a7517]
[bt] (6) /home/ubuntu/Projects/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::engine::ThreadedEngine::BulkAppend(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x47) [0x7f740f7a7517]
[bt] (7) /home/ubuntu/Projects/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::engine::ThreadedEngine::BulkAppend(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x47) [0x7f740f7a7517]
[bt] (8) /home/ubuntu/Projects/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::engine::ThreadedEngine::BulkAppend(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x47) [0x7f740f7a7517]
[bt] (9) /home/ubuntu/Projects/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::engine::ThreadedEngine::BulkAppend(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x47) [0x7f740f7a7517]

Steps to reproduce

  1. Save the code as test.py, and run python2 test.py

What have you tried to solve it?

I have no idea where this comes from. It seems weird to me that the behavior changes when row_len and col_len are tweaked. So far I have only observed the following:

  1. When row_len = 1, it never fails
  2. When row_len = 2, it doesn't fail for col_len < 8, but fails for col_len >= 8
  3. When row_len = 3, it doesn't fail for col_len < 7, but fails for col_len >= 7
  4. When row_len = 4, it doesn't fail for col_len < 7, but fails for col_len >= 7
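
To map the failing shapes more systematically, a sweep along the following lines could be used. This is only a sketch: run_case is a hypothetical callable that returns True when the reproduction succeeds for a given shape (in practice it might run test.py in a fresh process), and reported_case is a stand-in that encodes nothing beyond the observations listed above for row_len from 1 to 4.

```python
def first_failing_col_len(run_case, row_len, max_col_len=16):
    """Return the smallest col_len for which run_case fails, or None if none does."""
    for col_len in range(1, max_col_len + 1):
        if not run_case(row_len, col_len):
            return col_len
    return None


# Thresholds copied from the observations above; row_len = 1 never failed.
REPORTED_THRESHOLDS = {2: 8, 3: 7, 4: 7}


def reported_case(row_len, col_len):
    """Stand-in for a real run: True means the backward pass succeeded."""
    threshold = REPORTED_THRESHOLDS.get(row_len)
    return threshold is None or col_len < threshold


summary = {r: first_failing_col_len(reported_case, r) for r in (1, 2, 3, 4)}
print(summary)
```

Swapping reported_case for a function that actually runs the script would extend the table to shapes that have not been tried yet.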

Environment info

Not relevant though.

----------Python Info----------
('Version      :', '2.7.15')
('Compiler     :', 'GCC 7.2.0')
('Build        :', ('default', 'May  1 2018 23:32:55'))
('Arch         :', ('64bit', ''))
------------Pip Info-----------
('Version      :', '10.0.1')
('Directory    :', '/home/ubuntu/anaconda2/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
/home/ubuntu/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
('Version      :', '1.3.0')
('Directory    :', '/home/ubuntu/Projects/junru-mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform     :', 'Linux-4.4.0-1062-aws-x86_64-with-debian-stretch-sid')
('system       :', 'Linux')
('node         :', 'ip-172-31-42-30')
('release      :', '4.4.0-1062-aws')
('version      :', '#71-Ubuntu SMP Fri Jun 15 10:07:39 UTC 2018')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'x86_64')
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:              3
CPU MHz:               3000.000
BogoMIPS:              6000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-17,36-53
NUMA node1 CPU(s):     18-35,54-71
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0019 sec, LOAD: 0.5934 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0070 sec, LOAD: 0.1024 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0099 sec, LOAD: 0.5567 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0035 sec, LOAD: 0.0305 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0994 sec, LOAD: 0.5750 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1169 sec, LOAD: 0.5358 sec.

Build info

Compiler: gcc

MXNet commit hash: dd954b4 (current HEAD)

Build config: default
