From 3bf48b8e35bf119ba3f12caa4f5931e6e6692ac8 Mon Sep 17 00:00:00 2001
From: Olivier
Date: Mon, 16 Oct 2017 11:52:25 -0700
Subject: [PATCH 01/23] CPU optimization for ActivationOp

Significant improvement on CPU (several orders of magnitude in some cases,
especially on the backward pass). Very slight improvement on GPU. The forward
and backward passes now launch element-wise kernels through
mxnet_op::Kernel<...>::Launch (selected with MXNET_ASSIGN_REQ_SWITCH) instead
of going through the mshadow Assign/F<Op> expression path.

OLD MSHADOW APPROACH
--------------------

CPU
===
Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU: Timing [Forward] 18.948 ms, avg: 0.037896 ms X 500 passes
Activation Operator CPU: Timing [Backward] 1.658 ms, avg: 0.003316 ms X 500 passes
Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU: Timing [Forward] 57.973 ms, avg: 0.115946 ms X 500 passes
Activation Operator CPU: Timing [Backward] 4.748 ms, avg: 0.009496 ms X 500 passes
Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU: Timing [Forward] 703.446 ms, avg: 1.40689 ms X 500 passes
Activation Operator CPU: Timing [Backward] 56.255 ms, avg: 0.11251 ms X 500 passes
Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU: Timing [Forward] 2107.77 ms, avg: 4.21554 ms X 500 passes
Activation Operator CPU: Timing [Backward] 168.483 ms, avg: 0.336966 ms X 500 passes
Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU: Timing [Forward] 24122.2 ms, avg: 48.2443 ms X 500 passes
Activation Operator CPU: Timing [Backward] 1908.7 ms, avg: 3.8174 ms X 500 passes

GPU
===
Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU: Timing [Forward] 1.637 ms, avg: 0.003274 ms X 500 passes
Activation Operator GPU: Timing [Backward] 1.665 ms, avg: 0.00333 ms X 500 passes
Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU: Timing [Forward] 1.562 ms, avg: 0.003124 ms X 500 passes
Activation Operator GPU: Timing [Backward] 1.661 ms, avg: 0.003322 ms X 500 passes
Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU: Timing [Forward] 1.635 ms, avg: 0.00327 ms X 500 passes
Activation Operator GPU: Timing [Backward] 1.702 ms, avg: 0.003404 ms X 500 passes
Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU: Timing [Forward] 1.83 ms, avg: 0.00366 ms X 500 passes
Activation Operator GPU: Timing [Backward] 2.041 ms, avg: 0.004082 ms X 500 passes
Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU: Timing [Forward] 2.08 ms, avg: 0.00416 ms X 500 passes
Activation Operator GPU: Timing [Backward] 2.688 ms, avg: 0.005376 ms X 500 passes

NEW MXNET_OP APPROACH
---------------------

CPU
===
Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU: Timing [Forward] 80.748 ms, avg: 0.161496 ms X 500 passes
Activation Operator CPU: Timing [Backward] 1.176 ms, avg: 0.002352 ms X 500 passes
Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU: Timing [Forward] 7.881 ms, avg: 0.015762 ms X 500 passes
Activation Operator CPU: Timing [Backward] 2.181 ms, avg: 0.004362 ms X 500 passes
Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU: Timing [Forward] 111.48 ms, avg: 0.22296 ms X 500 passes
Activation Operator CPU: Timing [Backward] 5.408 ms, avg: 0.010816 ms X 500 passes
Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU: Timing [Forward] 333.439 ms, avg: 0.666878 ms X 500 passes
Activation Operator CPU: Timing [Backward] 21.331 ms, avg: 0.042662 ms X 500 passes
Timing: 50 iterations of 10
calls, shape = [20,3,128,128] Activation Operator CPU: Timing [Forward] 3429.19 ms, avg: 6.85837 ms X 500 passes Activation Operator CPU: Timing [Backward] 286.324 ms, avg: 0.572648 ms X 500 passes GPU === Timing: 50 iterations of 10 calls, shape = [1,1,28,28] Activation Operator GPU: Timing [Forward] 1.618 ms, avg: 0.003236 ms X 500 passes Activation Operator GPU: Timing [Backward] 1.671 ms, avg: 0.003342 ms X 500 passes Timing: 50 iterations of 10 calls, shape = [1,3,28,28] Activation Operator GPU: Timing [Forward] 1.629 ms, avg: 0.003258 ms X 500 passes Activation Operator GPU: Timing [Backward] 1.728 ms, avg: 0.003456 ms X 500 passes Timing: 50 iterations of 10 calls, shape = [50,1,18,32] Activation Operator GPU: Timing [Forward] 1.753 ms, avg: 0.003506 ms X 500 passes Activation Operator GPU: Timing [Backward] 1.756 ms, avg: 0.003512 ms X 500 passes Timing: 50 iterations of 10 calls, shape = [50,3,18,32] Activation Operator GPU: Timing [Forward] 1.704 ms, avg: 0.003408 ms X 500 passes Activation Operator GPU: Timing [Backward] 1.791 ms, avg: 0.003582 ms X 500 passes Timing: 50 iterations of 10 calls, shape = [20,3,128,128] Activation Operator GPU: Timing [Forward] 2.032 ms, avg: 0.004064 ms X 500 passes Activation Operator GPU: Timing [Backward] 2.143 ms, avg: 0.004286 ms X 500 passes --- src/operator/activation-inl.h | 34 +++- src/operator/mxnet_op.h | 14 ++ tests/cpp/include/test_op.h | 48 ++++- tests/cpp/include/test_op_runner.h | 269 ++++++++++++++++++++++++++ tests/cpp/include/test_util.h | 3 +- tests/cpp/operator/activation_perf.cc | 99 ++++++++++ tests/cpp/operator/batchnorm_test.cc | 14 +- tests/cpp/operator/fully_conn_perf.cc | 84 ++++++++ tests/cpp/test_main.cc | 9 +- 9 files changed, 545 insertions(+), 29 deletions(-) create mode 100644 tests/cpp/include/test_op_runner.h create mode 100644 tests/cpp/operator/activation_perf.cc create mode 100644 tests/cpp/operator/fully_conn_perf.cc diff --git a/src/operator/activation-inl.h b/src/operator/activation-inl.h index 8b1a229250df..679105b8b0d1 100644 --- a/src/operator/activation-inl.h +++ b/src/operator/activation-inl.h @@ -34,6 +34,7 @@ #include #include #include "./operator_common.h" +#include "./mxnet_op.h" namespace mxnet { namespace op { @@ -75,9 +76,16 @@ class ActivationOp : public Operator { CHECK_EQ(in_data.size(), 1U); CHECK_EQ(out_data.size(), 1U); Stream *s = ctx.get_stream(); - Tensor data = in_data[activation::kData].FlatTo2D(s); - Tensor out = out_data[activation::kOut].FlatTo2D(s); - Assign(out, req[activation::kOut], F(data)); + const TBlob& input = in_data[activation::kData]; + const size_t sz = input.shape_.Size(); + if(sz) { + MXNET_ASSIGN_REQ_SWITCH(req[activation::kOut], Req, { + mxnet_op::Kernel, xpu>::Launch( + s, sz, + out_data[activation::kOut].dptr(), + input.dptr()); + }); + } } virtual void Backward(const OpContext &ctx, @@ -93,14 +101,24 @@ class ActivationOp : public Operator { CHECK(in_data.size() == 1 && in_grad.size() == 1); CHECK_EQ(req.size(), 1U); Stream *s = ctx.get_stream(); - Tensor m_out_grad = out_grad[activation::kOut].FlatTo2D(s); - Tensor m_out_data = out_data[activation::kOut].FlatTo2D(s); - Tensor m_in_grad = in_grad[activation::kData].FlatTo2D(s); - Assign(m_in_grad, req[activation::kData], F(m_out_data) * m_out_grad); + const TBlob& m_out_grad = out_grad[activation::kOut]; + const TBlob& m_out_data = out_data[activation::kOut]; + const TBlob& m_in_grad = in_grad[activation::kData]; + const size_t sz = m_out_data.shape_.Size(); + if(sz) { + 
MXNET_ASSIGN_REQ_SWITCH(req[activation::kData], Req, { + mxnet_op::Kernel, Req>, xpu>::Launch( + s, sz, + m_in_grad.dptr(), + m_out_grad.dptr(), + m_out_data.dptr()); + }); + } } }; // class ActivationOp -// Decalre Factory function, used for dispatch specialization +// Declare Factory function, used for dispatch specialization template Operator* CreateOp(ActivationParam type, int dtype, const TShape& dshape); diff --git a/src/operator/mxnet_op.h b/src/operator/mxnet_op.h index 329f71c66c08..bd3b0f27c51a 100644 --- a/src/operator/mxnet_op.h +++ b/src/operator/mxnet_op.h @@ -215,6 +215,20 @@ struct set_zero { } }; +/*! \brief Binary op backward gradient OP wrapper */ +template +struct backward_grad { + /* \brief Backward calc with grad + * \param a - output grad + * \param args... - data to grad calculation op (what this is -- input, output, etc. -- varies) + * \return input grad + */ + template + MSHADOW_XINLINE static DType Map(DType a, Args... args) { + return DType(a * GRAD_OP::Map(args...)); + } +}; + /*! \brief Select assignment operation based upon the req value * Also useful for mapping mshadow Compute (F) to Kernel::Launch */ diff --git a/tests/cpp/include/test_op.h b/tests/cpp/include/test_op.h index f30fbe8e6981..4b46b80b597d 100644 --- a/tests/cpp/include/test_op.h +++ b/tests/cpp/include/test_op.h @@ -100,7 +100,8 @@ class BasicOperatorData { #endif , initializeForward_(0) // unit testing may call inits in any order based , initializeBackward_(0) // upon its use-case (ie may not want to run forward pass first) - , initializeCallback_(0) { + , initializeCallback_(0) + , generator_(new std::mt19937()) { opContext_.is_train = true; opContext_.run_ctx.stream = nullptr; @@ -123,10 +124,14 @@ class BasicOperatorData { shape_input_vec_.resize(opProp.ListArguments().size()); op_.reset(opProp.CreateOperatorEx(getContext(), &shape_input_vec_, in_type)); if (op_) { + const size_t output_count = opProp.ListOutputs().size(); + const size_t aux_count = opProp.ListAuxiliaryStates().size(); // Figure out what sort of blobs we need to allocate std::vector out_shape, aux_shape; + out_shape.resize(output_count); + aux_shape.resize(aux_count); opProp.InferShape(&shape_input_vec_, &out_shape, &aux_shape); - std::vector out_type, aux_type; + std::vector out_type(output_count, -1), aux_type(aux_count, -1); opProp.InferType(in_type, &out_type, &aux_type); // Allocate top blobs (input) @@ -174,9 +179,9 @@ class BasicOperatorData { initForward(opProp, in_type); if (!initializeBackward_++) { for (size_t x = 0, n = static_cast(opProp.NumVisibleOutputs()); x < n; ++x) { - CHECK_LT(x, c_.blob_input_vec_.size()); - allocateBlob(&c_.blob_out_grad_, c_.blob_input_vec_[x].shape_, - false, c_.blob_input_vec_[x].type_flag_); + CHECK_LT(x, c_.blob_output_vec_.size()); + allocateBlob(&c_.blob_out_grad_, c_.blob_output_vec_[x].shape_, + false, c_.blob_output_vec_[x].type_flag_); } for (size_t x = 0, n = c_.blob_input_vec_.size(); x < n; ++x) { @@ -197,6 +202,7 @@ class BasicOperatorData { /*! \brief Run operator forward */ void forward(const size_t count = 1) { + const std::vector req(c_.blob_output_vec_.size(), kWriteTo); // Possibly move data to/from CPU and GPU (outside of timing scope) MXNET_CUDA_ONLY(std::unique_ptr gpuData(isGPU_ ? 
new GPUOpData(c_, &opContext_) : nullptr)); @@ -206,7 +212,7 @@ class BasicOperatorData { for (size_t x = 0; x < count; ++x) { op()->Forward(opContext_, c_.blob_input_vec_, - {kWriteTo, kWriteTo, kWriteTo}, + req, c_.blob_output_vec_, c_.blob_aux_states_); } @@ -214,7 +220,7 @@ class BasicOperatorData { for (size_t x = 0; x < count; ++x) { MXNET_CUDA_ONLY(op()->Forward(opContext_, gpuData->blob_input_vec_, - {kWriteTo, kWriteTo, kWriteTo}, + req, gpuData->blob_output_vec_, gpuData->blob_aux_states_)); } @@ -223,6 +229,7 @@ class BasicOperatorData { /*! \brief Run operator backwards */ void backward(const size_t count = 1) { + const std::vector req(c_.blob_output_vec_.size(), kWriteTo); // Possibly move data to/from CPU and GPU (outside of timing scope) MXNET_CUDA_ONLY(std::unique_ptr gpuData(isGPU_ ? new GPUOpData(c_, &opContext_) : nullptr)); @@ -234,7 +241,7 @@ class BasicOperatorData { c_.blob_out_grad_, c_.blob_input_vec_, c_.blob_output_vec_, - {kWriteTo, kWriteTo, kWriteTo}, + req, c_.blob_in_grad_, c_.blob_aux_states_); } @@ -244,7 +251,7 @@ class BasicOperatorData { gpuData->blob_out_grad_, gpuData->blob_input_vec_, gpuData->blob_output_vec_, - {kWriteTo, kWriteTo, kWriteTo}, + req, gpuData->blob_in_grad_, gpuData->blob_aux_states_)); } @@ -386,6 +393,21 @@ class BasicOperatorData { copy(blob, sourceData, 0, sourceDataSize); } + void FillRandom() { + std::uniform_real_distribution distribution(-1.0, 1.0); + for (size_t j = 0, jn = this->c_.all_blob_vects_.size(); j < jn; ++j) { + std::vector *data_vect = this->c_.all_blob_vects_[j]; + if (data_vect) { + for (size_t i = 0, n = data_vect->size(); i < n; ++i) { + TBlob &blob = (*data_vect)[i]; + test::patternFill(&blob, [this, &distribution]() -> DType { + return distribution(generator()); + }); + } + } + } + } + /*! \brief Input and output blobs */ OpContext opContext_; @@ -520,6 +542,9 @@ class BasicOperatorData { return allocateBlob(&standalone_blobs_, dest, shape, isGPU, dtype); } + /*! \brief mt19937 generator for random number generator */ + std::mt19937& generator() { return *generator_; } + /*! \brief Performance timing categories */ enum TimingId { Forward, @@ -539,6 +564,9 @@ class BasicOperatorData { /*! \brief scoped lifecycle management of allocated blobs */ std::list> standalone_blobs_; + /*! \brief Per-test generator */ + std::unique_ptr generator_; + public: /*! Timing instrumentation */ test::perf::TimingInstrument timing_; @@ -675,7 +703,7 @@ class Validator { } const TBlob& b1 = bv1[idx]; const TBlob& b2 = bv2[idx]; - if (print && test::debugOutput) { + if (print && test::debug_output) { test::print(RunContext(), &(std::cout << "Blob 1:"), b1, true, true); test::print(RunContext(), &(std::cout << "Blob 2:"), b2, true, true); } diff --git a/tests/cpp/include/test_op_runner.h b/tests/cpp/include/test_op_runner.h new file mode 100644 index 000000000000..6d0b766eb378 --- /dev/null +++ b/tests/cpp/include/test_op_runner.h @@ -0,0 +1,269 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +/*! + * \file test_op_runner.h + * \brief Run a generic operator + * \author Chris Olivier +*/ +#ifndef TEST_OP_RUNNER_H_ +#define TEST_OP_RUNNER_H_ + +#include +#include +#include +#include "./test_op.h" + +namespace mxnet { +namespace test { + +/*! + * \brief Generic operator random test data + * \tparam DType Main data type + * \tparam AccReal Secondary data type (if any) + */ +template +class GenericOperatorData : public test::op::BasicOperatorData { + public: + typedef DType DataType; + typedef AccReal AccRealType; + + /*! + * \brief Constructor + * \param isGPU Is this to be used on GPU? + * \param inputShape Input shape to the operator + */ + GenericOperatorData(const bool isGPU, const TShape& inputShape) + : test::op::BasicOperatorData(isGPU, inputShape) { + } + + /*! + * \brief Reset forward pass by filling everything with random values + */ + void resetForward() override { + test::op::BasicOperatorData::FillRandom(); + } + + /*! + * \brief Reset backward pass by filling everything with random values + */ + void resetBackward() override { + test::op::BasicOperatorData::FillRandom(); + } +}; + +/*! + * \brief Generic operator runner + * \tparam OperatorProp property class for a given operator (i.e. FullyConnectedProp, BatchNormProp) + * \tparam OperatorDataContainer Data container for forward and backward passes for some given + * data types + */ +template +class OperatorRunner { + public: + typedef typename OperatorDataContainer::DataType DType; + typedef typename OperatorDataContainer::AccRealType AccReal; + + /*! + * \brief Test operator forward pass + * \param isGPU Whether this test is for GPU + * \param inputShape Input data shape + * \param kwargs Operator parameters + * \param OutShapeFunction Output shape function override + * \param count Number of times to run in each direction + * \return OpInfo object for further opereator analysis + */ + test::op::OpInfo + RunGenericOperatorForward( + bool isGPU, + const TShape &inputShape, + const std::vector > &kwargs, + const size_t count = 1) { +#if MXNET_USE_CUDA + if (isGPU && !test::unitTestsWithCuda) { + LOG(INFO) << "GPU not found, running test as non-GPU"; + } +#else + isGPU = false; +#endif + test::op::OpInfo info = + test::op::createOpAndInfoF(isGPU, inputShape, kwargs); + info.data_->initForward(*info.prop_, &info.in_type_); + info.data_->forward(count); + return info; + } + + /*! + * \brief Test operator backward pass + * \param info OpInfo object from forward pass + * \param count + * \return OpInfo object for further opereator analysis + */ + test::op::OpInfo RunGenericOperatorBackward( + test::op::OpInfo *info, + const size_t count = 1) { + info->data_->initBackward(*info->prop_, &info->in_type_); + info->data_->backward(count); + return *info; + } + + /*! 
+ * \brief Run operator forward and backward + * \param isGPU Whether this test is for GPU + * \param inputShape Input data shape + * \param kwargs Operator parameters + * \param OutShapeFunction Output shape function override + * \param count Number of times to run in each direction + * \return + */ + test::op::OpInfo RunBidirectional( + bool isGPU, + const TShape &inputShape, + const std::vector > &kwargs, + const size_t count = 1) { + test::op::OpInfo info = + RunGenericOperatorForward(isGPU, inputShape, kwargs, count); + return RunGenericOperatorBackward(&info, count); + } + + /*! + * \brief Timing test a generic operator + * \tparam PropType + * \tparam DType Data type + * \tparam AccReal Accumulative data type (if any) + * \param label Label for performance output + * \param isGPU Whether this test is for GPU + * \param stochastic Whether shape should be random (batch size, channels, hm, w) + * \param kwargs Operator parameters + * \param dim Data dimensions + * \param count Number of times to run in each direction + */ + void TimingTest(const std::string& label, + const bool isGPU, + const bool stochastic, + const test::op::kwargs_t& kwargs, + int dim = 0, + size_t count = 1, + TShape timing_shape = TShape()) { + std::cout << std::endl << std::flush; + +#ifdef NDEBUG + size_t COUNT = 50; +#else + size_t COUNT = 5; +#endif + if (mxnet::test::quick_test) { + COUNT = 2; + count = 1; + } + + test::perf::TimingInstrument timing; + + std::stringstream ss; + ss << "Timing: " << COUNT << " iterations of " << count << " calls"; + if (timing_shape.ndim()) { + ss << ", shape = " << timing_shape << std::endl << std::flush; + } + std::cout << ss.str(); + + for (size_t i = 0; i < COUNT; ++i) { + index_t batchSize = 1; + index_t channels = 1; + index_t depth = 1; + index_t height = 1; + index_t width = 1; + + if (!timing_shape.ndim()) { + do { + batchSize = stochastic ? test::rangedRand(1U, TES_BATCH_SIZE * 2U) : TIMING_BATCH_SIZE; + channels = stochastic ? test::rangedRand(1U, TEST_CHANNELS * 2U) : TIMING_CHANNELS; + depth = stochastic ? test::rangedRand(1U, TEST_DEPTH * 2U) : TIMING_DEPTH; + height = stochastic ? test::rangedRand(1U, TEST_DH * 2U) : TIMING_DH; + width = stochastic ? test::rangedRand(1U, TEST_DW * 2U) : TIMING_DW; + } while (stochastic && (height * width) == 1U); + } else { + dim = timing_shape.ndim() - 1; + } + + const size_t D = dim ? dim - 1U : test::rangedRand(0U, 2U); + + test::op::OpInfo info; + switch (D) { + case 0: + info = RunGenericOperatorForward(isGPU, + timing_shape.ndim() ? timing_shape + : TShape({batchSize, + channels, + width}), + kwargs, + count); + break; + case 1: + info = RunGenericOperatorForward(isGPU, + timing_shape.ndim()? timing_shape + : TShape({batchSize, + channels, + height, + width}), + kwargs, + count); + break; + case 2: + info = RunGenericOperatorForward(isGPU, + timing_shape.ndim() ? 
timing_shape + : TShape({batchSize, + channels, + depth, + height, + width}), + kwargs, + count); + break; + default: + CHECK(false) << "Unsupported dimension count: " << (D + 1); + } + if (info.data_.get()) { + RunGenericOperatorBackward(&info, count); + timing += info.data_->timing_; + } + } while (false); + + timing.print(&std::cout, label); + std::cout << std::endl << std::flush; + } + + protected: + static constexpr int TES_BATCH_SIZE = 5; + static constexpr int TEST_CHANNELS = 3; + static constexpr int TEST_DEPTH = 2; + static constexpr int TEST_DH = 2; + static constexpr int TEST_DW = 3; + + static constexpr int TIMING_BATCH_SIZE = 128; + static constexpr int TIMING_CHANNELS = 3; + static constexpr int TIMING_DEPTH = 2; + static constexpr int TIMING_DH = 64; + static constexpr int TIMING_DW = 64; +}; + +} // namespace test +} // namespace mxnet + +#endif // TEST_OP_RUNNER_H_ diff --git a/tests/cpp/include/test_util.h b/tests/cpp/include/test_util.h index a788bf389be8..492a0783d227 100644 --- a/tests/cpp/include/test_util.h +++ b/tests/cpp/include/test_util.h @@ -40,8 +40,9 @@ namespace mxnet { namespace test { extern bool unitTestsWithCuda; -extern bool debugOutput; +extern bool debug_output; extern bool quick_test; +extern bool performance_run; /*! \brief Pause VTune analysis */ struct VTunePause { diff --git a/tests/cpp/operator/activation_perf.cc b/tests/cpp/operator/activation_perf.cc new file mode 100644 index 000000000000..c0a42173c003 --- /dev/null +++ b/tests/cpp/operator/activation_perf.cc @@ -0,0 +1,99 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +/*! + * \file activation_perf.cc + * \brief Perf/profile run of ActivationOp + * \author Chris Olivier + */ + +#include +#include +#include +#include "../../src/operator/activation-inl.h" +#include "../include/test_op_runner.h" + +using namespace mxnet; + +typedef std::vector > kwargs_t; +const kwargs_t basic_activation_args = { }; + +/*! + * \brief Generic bidirectional sanity test + */ +TEST(ACTIVATION_PERF, ExecuteBidirectional) { + TShape shape({5, 5}); + kwargs_t kwargs = basic_activation_args; + kwargs.push_back({"act_type", "tanh"}); + test::OperatorRunner> runner; + runner.RunBidirectional(false, shape, kwargs, 1); +} + +/*! 
+ * \brief ActivationOp timing test for CPU + */ +TEST(ACTIVATION_PERF, TimingCPU) { + kwargs_t kwargs = basic_activation_args; + // Which math function is arbitrary since it will have roughly constant timing among approaches + kwargs.push_back({"act_type", "tanh"}); + test::OperatorRunner> runner; + runner.RunBidirectional(false, {10, 10, 10, 10}, kwargs, 1); // prime code and cache + std::vector shapes; + if (test::performance_run) { + shapes = { + {1, 1, 28, 28}, + {1, 3, 28, 28}, + {50, 1, 18, 32}, + {50, 3, 18, 32}, + {20, 3, 128, 128} + }; + } else { + shapes = { + {1, 1, 28, 28}, + {50, 3, 18, 32}, + }; + } + for (const TShape &shape : shapes) { + runner.TimingTest("Activation Operator CPU", false, false, kwargs, 2, 10, shape); + } +} + +#if MXNET_USE_CUDA == 1 +/*! + * \brief ActivationOp timing test for GPU + */ +TEST(ACTIVATION_PERF, TimingGPU) { + kwargs_t kwargs = basic_activation_args; + // Which math function is arbitrary since it will have roughly constant timing among approaches + kwargs.push_back({"act_type", "tanh"}); + test::OperatorRunner> runner; + runner.RunBidirectional(true, {10, 10, 10, 10}, kwargs, 1); // prime code and cache + std::vector shapes = { + {1, 1, 28, 28}, + {1, 3, 28, 28}, + {50, 1, 18, 32}, + {50, 3, 18, 32}, + {20, 3, 128, 128} + }; + for (const TShape &shape : shapes) { + runner.TimingTest("Activation Operator GPU", true, false, kwargs, 2, 10, shape); + } +} +#endif // MXNET_USE_CUDA == 1 diff --git a/tests/cpp/operator/batchnorm_test.cc b/tests/cpp/operator/batchnorm_test.cc index f04593322858..0eca871c3e22 100644 --- a/tests/cpp/operator/batchnorm_test.cc +++ b/tests/cpp/operator/batchnorm_test.cc @@ -450,7 +450,7 @@ template static StreamType& dumpF(StreamType *os, const test::op::OpInfo& prop, const size_t x = 0) { - if (test::debugOutput) { + if (test::debug_output) { *os << std::endl; if (x) { *os << "=============================" << std::endl; @@ -476,7 +476,7 @@ template static StreamType& dumpB(StreamType *os, const test::op::OpInfo& prop, const size_t x = 0) { - if (test::debugOutput) { + if (test::debug_output) { *os << std::endl; if (x) { *os << "=============================" << std::endl; @@ -1019,7 +1019,7 @@ TEST(BATCH_NORM, Test2DBackward_Complex) { MSHADOW_REAL_TYPE_SWITCH_EX( mshadow::kFloat32, DType, AccReal, { - test::ScopeSet noDebugOutput(&test::debugOutput, false); + test::ScopeSet noDebugOutput(&test::debug_output, false); const TShape inputShape({9, 14, 16, 91}); test::op::OpInfoPair bi = testForwardAndBackward( @@ -1226,7 +1226,7 @@ class ChannelAxisTestData { std::vector> channel_data_; static void print(const std::string& label, const std::vector>& m) { - if (test::debugOutput) { + if (test::debug_output) { if (!label.empty()) { std::cout << label << ": "; } @@ -1248,7 +1248,7 @@ class ChannelAxisTestData { } static void print(const std::string& label, const TBlob& blob) { - if (test::debugOutput) { + if (test::debug_output) { if (!label.empty()) { std::cout << label << ": "; } @@ -1364,7 +1364,7 @@ TEST(BATCH_NORM, TestChannelAxisSaveAndLoad) { /*! 
\brief Insert the channel field `channelCount` into the shape at `channelAxis` position */ static TShape MakeShape(const std::vector& shape, - unsigned int channelAxis, + signed int channelAxis, const size_t channelCount) { if (channelAxis < 0) { channelAxis += shape.size() + 1; @@ -1533,7 +1533,7 @@ TEST(BATCH_NORM, TestChannelAxisSimple) { * backward result equivalence here implies correctness for other channel positions */ TEST(BATCH_NORM, TestChannelAxis) { - test::ScopeSet noDebugOutput(&test::debugOutput, false); + test::ScopeSet noDebugOutput(&test::debug_output, false); test::op::kwargs_t kwargs; const std::vector> shapes = diff --git a/tests/cpp/operator/fully_conn_perf.cc b/tests/cpp/operator/fully_conn_perf.cc new file mode 100644 index 000000000000..29a5d35fc52f --- /dev/null +++ b/tests/cpp/operator/fully_conn_perf.cc @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +/*! + * \file fully_conn_perf.cc + * \brief Sample for running C++ performance tests on a single operator. This method is also + * useful for profiling with vtune or gprof, avoiding the "noise" of python and executor + * \author Chris Olivier + */ + +#include +#include +#include "../../src/operator/fully_connected-inl.h" +#include "../include/test_op_runner.h" + +using namespace mxnet; + +typedef std::vector > kwargs_t; + +const kwargs_t basic_fullyconn_args = { {"num_hidden", "250"} }; + +/*! + * \brief Generic bidirectional sanity test + */ +TEST(FULLY_CONNECTED, ExecuteBidirectionalFullyConnected) { + TShape shape({5, 5}); + kwargs_t kwargs = basic_fullyconn_args; + test::OperatorRunner> runner; + runner.RunBidirectional(false, shape, kwargs, 1); +} + +/*! + * \brief Timing test for CPU + */ +TEST(FULLY_CONNECTED, FullyConnectedTimingCPU) { + kwargs_t kwargs = basic_fullyconn_args; + test::OperatorRunner> + runner; + runner.RunBidirectional(false, {10, 10, 10, 10}, kwargs, 1); // prime code and cache + const std::vector shapes = { + {1, 1, 28, 28}, {1, 3, 28, 28}, + {50, 1, 18, 32}, {50, 3, 18, 32} + }; + for (const TShape& shape : shapes) { + runner.TimingTest("Fully connected", false, false, kwargs, 2, 10, shape); + } +} + +#if MXNET_USE_CUDA == 1 +/*! 
+ * \brief Timing test for GPU + */ +TEST(FULLY_CONNECTED, FullyConnectedTimingGPU) { + kwargs_t kwargs = basic_fullyconn_args; + test::op::OpInfo info; + test::OperatorRunner> runner; + runner.RunBidirectional(false, {10, 10, 10, 10}, kwargs, 1); // prime code and cache + const std::vector shapes = { + {1, 1, 28, 28}, {1, 3, 28, 28}, + {50, 1, 18, 32}, {50, 3, 18, 32} + }; + for (const TShape& shape : shapes) { + runner.TimingTest("Fully connected", true, false, kwargs, 2, 10, shape); + } +} +#endif // MXNET_USE_CUDA == 1 diff --git a/tests/cpp/test_main.cc b/tests/cpp/test_main.cc index 5434a704c090..eaf9e3c21910 100644 --- a/tests/cpp/test_main.cc +++ b/tests/cpp/test_main.cc @@ -38,11 +38,12 @@ static bool dumpCallback(const google_breakpad::MinidumpDescriptor& descriptor, namespace mxnet { namespace test { bool unitTestsWithCuda = false; #ifdef NDEBUG -bool debugOutput = false; +bool debug_output = false; #else -bool debugOutput = false; +bool debug_output = false; #endif bool quick_test = false; +bool performance_run = false; }} #if MXNET_USE_CUDA @@ -85,7 +86,9 @@ int main(int argc, char ** argv) { // override (ie force attempt CUDA) mxnet::test::unitTestsWithCuda = true; } else if (!strcmp(argv[x], "--debug")) { - mxnet::test::debugOutput = true; + mxnet::test::debug_output = true; + } else if (!strcmp(argv[x], "--perf")) { + mxnet::test::performance_run = true; } else if (!strcmp(argv[x], "--quick") || !strcmp(argv[x], "-q")) { mxnet::test::quick_test = true; } From 40b13f1bfbfca076497088fd3cc3754aaaec8ce1 Mon Sep 17 00:00:00 2001 From: Olivier Date: Mon, 16 Oct 2017 12:00:13 -0700 Subject: [PATCH 02/23] lint --- src/operator/activation-inl.h | 4 ++-- tests/cpp/operator/fully_conn_perf.cc | 25 ++++++++++++++++++------- 2 files changed, 20 insertions(+), 9 deletions(-) diff --git a/src/operator/activation-inl.h b/src/operator/activation-inl.h index 679105b8b0d1..ed605a1fedae 100644 --- a/src/operator/activation-inl.h +++ b/src/operator/activation-inl.h @@ -78,7 +78,7 @@ class ActivationOp : public Operator { Stream *s = ctx.get_stream(); const TBlob& input = in_data[activation::kData]; const size_t sz = input.shape_.Size(); - if(sz) { + if (sz) { MXNET_ASSIGN_REQ_SWITCH(req[activation::kOut], Req, { mxnet_op::Kernel, xpu>::Launch( s, sz, @@ -105,7 +105,7 @@ class ActivationOp : public Operator { const TBlob& m_out_data = out_data[activation::kOut]; const TBlob& m_in_grad = in_grad[activation::kData]; const size_t sz = m_out_data.shape_.Size(); - if(sz) { + if (sz) { MXNET_ASSIGN_REQ_SWITCH(req[activation::kData], Req, { mxnet_op::Kernel, Req>, xpu>::Launch( diff --git a/tests/cpp/operator/fully_conn_perf.cc b/tests/cpp/operator/fully_conn_perf.cc index 29a5d35fc52f..4cb4b4522a96 100644 --- a/tests/cpp/operator/fully_conn_perf.cc +++ b/tests/cpp/operator/fully_conn_perf.cc @@ -54,12 +54,23 @@ TEST(FULLY_CONNECTED, FullyConnectedTimingCPU) { test::OperatorRunner> runner; runner.RunBidirectional(false, {10, 10, 10, 10}, kwargs, 1); // prime code and cache - const std::vector shapes = { - {1, 1, 28, 28}, {1, 3, 28, 28}, - {50, 1, 18, 32}, {50, 3, 18, 32} - }; + std::vector shapes; + if (test::performance_run) { + shapes = { + {1, 1, 28, 28}, + {1, 3, 28, 28}, + {50, 1, 18, 32}, + {50, 3, 18, 32}, + {20, 3, 128, 128} + }; + } else { + shapes = { + {1, 1, 28, 28}, + {50, 3, 18, 32}, + }; + } for (const TShape& shape : shapes) { - runner.TimingTest("Fully connected", false, false, kwargs, 2, 10, shape); + runner.TimingTest("Fully connected CPU", false, false, kwargs, 2, 10, shape); } } @@ 
-72,13 +83,13 @@ TEST(FULLY_CONNECTED, FullyConnectedTimingGPU) { test::op::OpInfo info; test::OperatorRunner> runner; - runner.RunBidirectional(false, {10, 10, 10, 10}, kwargs, 1); // prime code and cache + runner.RunBidirectional(true, {10, 10, 10, 10}, kwargs, 1); // prime code and cache const std::vector shapes = { {1, 1, 28, 28}, {1, 3, 28, 28}, {50, 1, 18, 32}, {50, 3, 18, 32} }; for (const TShape& shape : shapes) { - runner.TimingTest("Fully connected", true, false, kwargs, 2, 10, shape); + runner.TimingTest("Fully connected GPU", true, false, kwargs, 2, 10, shape); } } #endif // MXNET_USE_CUDA == 1 From 6d4a2bb721ea634f37c94f7331c7737750cf47b5 Mon Sep 17 00:00:00 2001 From: cjolivier01 Date: Mon, 16 Oct 2017 17:26:24 -0700 Subject: [PATCH 03/23] Trigger build From db2767d014e0f106b0f59b2363edb49480c69abd Mon Sep 17 00:00:00 2001 From: Chris Olivier Date: Wed, 18 Oct 2017 08:16:37 -0700 Subject: [PATCH 04/23] Trigger build --- src/operator/activation-inl.h | 1 + 1 file changed, 1 insertion(+) diff --git a/src/operator/activation-inl.h b/src/operator/activation-inl.h index ed605a1fedae..bb5a37fc8794 100644 --- a/src/operator/activation-inl.h +++ b/src/operator/activation-inl.h @@ -22,6 +22,7 @@ * \brief Activation operator * \author Bing Xu */ + #ifndef MXNET_OPERATOR_ACTIVATION_INL_H_ #define MXNET_OPERATOR_ACTIVATION_INL_H_ From bf58bee86b3da12391dfd3d7d2b603ca81850e6e Mon Sep 17 00:00:00 2001 From: Ziyue Huang Date: Tue, 17 Oct 2017 00:17:21 -0500 Subject: [PATCH 05/23] Negative begin and end support for csr slice (#8241) * negative index support for sparse slice * fix lint * getitem(int) for csr ndarray, support a[-1] * remove unneccessary argument * unittest and doc update --- python/mxnet/ndarray/sparse.py | 20 ++++++++++++++------ src/operator/tensor/matrix_op-inl.h | 3 +++ tests/python/unittest/test_sparse_ndarray.py | 3 +++ 3 files changed, 20 insertions(+), 6 deletions(-) diff --git a/python/mxnet/ndarray/sparse.py b/python/mxnet/ndarray/sparse.py index 72ce04c64503..a1a3ba83b4ba 100644 --- a/python/mxnet/ndarray/sparse.py +++ b/python/mxnet/ndarray/sparse.py @@ -302,7 +302,7 @@ def __getitem__(self, key): Parameters ---------- - key : slice + key : int or slice Indexing key. 
Examples @@ -312,14 +312,22 @@ def __getitem__(self, key): >>> data = np.array([1, 2, 3, 4, 5, 6]) >>> a = mx.nd.sparse.csr_matrix((data, indices, indptr), shape=(3, 3)) >>> a.asnumpy() - array([[1, 0, 2], - [0, 0, 3], - [4, 5, 6]]) + array([[ 1., 0., 2.], + [ 0., 0., 3.], + [ 4., 5., 6.]], dtype=float32) >>> a[1:2].asnumpy() - array([[0, 0, 3]], dtype=float32) + array([[ 0., 0., 3.]], dtype=float32) + >>> a[1].asnumpy() + array([[ 0., 0., 3.]], dtype=float32) + >>> a[-1].asnumpy() + array([[ 4., 5., 6.]], dtype=float32) """ if isinstance(key, int): - raise ValueError("__getitem__ with int key is not implemented for CSRNDArray") + if key == -1: + begin = self.shape[0] - 1 + else: + begin = key + return op.slice(self, begin=begin, end=begin+1) if isinstance(key, py_slice): if key.step is not None: raise ValueError('CSRNDArray only supports continuous slicing on axis 0') diff --git a/src/operator/tensor/matrix_op-inl.h b/src/operator/tensor/matrix_op-inl.h index b9c898e6bb80..455ddcbb6254 100644 --- a/src/operator/tensor/matrix_op-inl.h +++ b/src/operator/tensor/matrix_op-inl.h @@ -559,8 +559,11 @@ void SliceCsrImpl(const SliceParam ¶m, const OpContext& ctx, if (req == kNullOp) return; CHECK_NE(req, kAddTo) << "kAddTo for Slice on CSR input is not supported"; CHECK_NE(req, kWriteInplace) << "kWriteInplace for Slice on CSR input is not supported"; + const TShape ishape = in.shape(); int begin = *param.begin[0]; + if (begin < 0) begin += ishape[0]; int end = *param.end[0]; + if (end < 0) end += ishape[0]; int indptr_len = end - begin + 1; out.CheckAndAllocAuxData(kIndPtr, Shape1(indptr_len)); if (!in.storage_initialized()) { diff --git a/tests/python/unittest/test_sparse_ndarray.py b/tests/python/unittest/test_sparse_ndarray.py index cb3f97921ca4..8da24ef82106 100644 --- a/tests/python/unittest/test_sparse_ndarray.py +++ b/tests/python/unittest/test_sparse_ndarray.py @@ -111,8 +111,11 @@ def check_sparse_nd_csr_slice(shape): start = rnd.randint(0, shape[0] - 1) end = rnd.randint(start + 1, shape[0]) assert same(A[start:end].asnumpy(), A2[start:end]) + assert same(A[start - shape[0]:end].asnumpy(), A2[start:end]) assert same(A[start:].asnumpy(), A2[start:]) assert same(A[:end].asnumpy(), A2[:end]) + ind = rnd.randint(-shape[0], shape[0] - 1) + assert same(A[ind].asnumpy(), A2[ind][np.newaxis, :]) def check_slice_nd_csr_fallback(shape): stype = 'csr' From 4ecb76390d90c84a49a65cae22fe956497bf07ff Mon Sep 17 00:00:00 2001 From: mbaijal <30911248+mbaijal@users.noreply.github.com> Date: Tue, 17 Oct 2017 07:58:50 -0700 Subject: [PATCH 06/23] Preparing for 0.12.0.rc0: Final changes before RC (#8301) * Final changes before RC * Updates to NEWS.md * Updates --- NEWS.md | 52 ++++++++++++++++--------- README.md | 1 + setup-utils/install-mxnet-osx-python.sh | 2 +- 3 files changed, 35 insertions(+), 20 deletions(-) diff --git a/NEWS.md b/NEWS.md index bdd4f48d1091..0652624eddb1 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,34 +1,48 @@ MXNet Change Log ================ ## 0.12.0 -### New Features - Sparse Tensor Support - - Added limited cpu support for two sparse formats for `Symbol` and `NDArray` - `CSRNDArray` and `RowSparseNDArray` - - Added a sparse dot product operator and many element-wise sparse operators - - Added a data iterator for sparse data input - `LibSVMIter` - - Added three optimizers for sparse gradient updates: `Ftrl`, `SGD` and `Adam` - - Added `push` and `row_sparse_pull` with `RowSparseNDArray` in distributed kvstore -### New Features - Autograd and Gluon - - New loss functions added - 
`SigmoidBinaryCrossEntropyLoss`, `CTCLoss`, `HuberLoss`, `HingeLoss`, `SquaredHingeLoss`, `LogisticLoss`, `TripletLoss` +### Performance + - Added full support for NVIDIA Volta GPU Architecture and CUDA 9. Training is up to 3.5x faster than Pascal when using float16. + - Enabled JIT compilation. Autograd and Gluon hybridize now use less memory and has faster speed. Performance is almost the same with old symbolic style code. + - Improved ImageRecordIO image loading performance and added indexed RecordIO support. + - Added better openmp thread management to improve CPU performance. +### New Features - Gluon + - Added enhancements to the Gluon package, a high-level interface designed to be easy to use while keeping most of the flexibility of low level API. Gluon supports both imperative and symbolic programming, making it easy to train complex models imperatively with minimal impact on performance. Neural networks (and other machine learning models) can be defined and trained with `gluon.nn` and `gluon.rnn` packages. + - Added new loss functions - `SigmoidBinaryCrossEntropyLoss`, `CTCLoss`, `HuberLoss`, `HingeLoss`, `SquaredHingeLoss`, `LogisticLoss`, `TripletLoss`. - `gluon.Trainer` now allows reading and setting learning rate with `trainer.learning_rate` property. - - Added `mx.autograd.grad` and experimental second order gradient support (though most operators don't support second order gradient yet) - - Added `ConvLSTM` etc to `gluon.contrib` + - Added API `HybridBlock.export` for exporting gluon models to MXNet format. + - Added `gluon.contrib` package. + - Convolutional recurrent network cells for RNN, LSTM and GRU. + - `VariationalDropoutCell` +### New Features - Autograd + - Added enhancements to `autograd` package, which enables automatic differentiation of NDArray operations. + - `autograd.Function` allows defining both forward and backward computation for custom operators. + - Added `mx.autograd.grad` and experimental second order gradient support (most operators don't support second order gradient yet). - Autograd now supports cross-device graphs. Use `x.copyto(mx.gpu(i))` and `x.copyto(mx.cpu())` to do computation on multiple devices. +### New Features - Sparse Tensor Support + - Added support for sparse matrices. + - Added limited cpu support for two sparse formats in `Symbol` and `NDArray` - `CSRNDArray` and `RowSparseNDArray`. + - Added a sparse dot product operator and many element-wise sparse operators. + - Added a data iterator for sparse data input - `LibSVMIter`. + - Added three optimizers for sparse gradient updates: `Ftrl`, `SGD` and `Adam`. + - Added `push` and `row_sparse_pull` with `RowSparseNDArray` in distributed kvstore. ### Other New Features - - Limited support for fancy indexing. x[idx_arr0, idx_arr1, ..., idx_arrn] is now supported. Full support coming soon in next release. Checkout master to get a preview. - - Random number generators in `mx.nd.random.*` and `mx.sym.random.*` now supports both CPU and GPU - - `NDArray` and `Symbol` now supports "fluent" methods. You can now use `x.exp()` etc instead of `mx.nd.exp(x)` or `mx.sym.exp(x)` - - Added `mx.rtc.CudaModule` for writing and running CUDA kernels from python - - Added `multi_precision` option to optimizer for easier float16 training -### Performance - - Enabled JIT compilation. Autograd and Gluon hybridize now use less memory and has faster speed. Performance is almost the same with old symbolic style code. - - Full support for NVidia Volta GPU Architecture and Cuda 9. 
Training is up to 3.5x faster than Pascal when using float16. + - Added limited support for fancy indexing, which allows you to very quickly access and modify complicated subsets of an array's values. `x[idx_arr0, idx_arr1, ..., idx_arrn]` is now supported. Features such as combining and slicing are planned for the next release. Checkout master to get a preview. + - Random number generators in `mx.nd.random.*` and `mx.sym.random.*` now support both CPU and GPU. + - `NDArray` and `Symbol` now supports "fluent" methods. You can now use `x.exp()` etc instead of `mx.nd.exp(x)` or `mx.sym.exp(x)`. + - Added `mx.rtc.CudaModule` for writing and running CUDA kernels from python. + - Added `multi_precision` option to optimizer for easier float16 training. + - Better support for IDE auto-completion. IDEs like PyCharm can now correctly parse mxnet operators. ### API Changes - Operators like `mx.sym.linalg_*` and `mx.sym.random_*` are now moved to `mx.sym.linalg.*` and `mx.sym.random.*`. The old names are still available but deprecated. - `sample_*` and `random_*` are now merged as `random.*`, which supports both scalar and `NDArray` distribution parameters. ### Bug-fixes - Fixed a bug that causes `argsort` operator to fail on large tensors. - Fixed numerical stability issues when summing large tensors. -For more information see [full release notes](https://cwiki.apache.org/confluence/display/MXNET/MXNet+0.12.0+Release+Notes) + - Fixed a bug that causes arange operator to output wrong results for large ranges. + - Improved numerical precision for unary and binary operators on `float64` inputs. + +For more information and examples, see [full release notes](https://cwiki.apache.org/confluence/display/MXNET/MXNet+0.12.0+Release+Notes) ## 0.11.0 diff --git a/README.md b/README.md index 8a65b4060c71..fc252a7a72b6 100644 --- a/README.md +++ b/README.md @@ -22,6 +22,7 @@ deep learning systems, and interesting insights of DL systems for hackers. What's New ---------- +* [Version 0.12.0 Release](https://github.com/apache/incubator-mxnet/releases/tag/0.12.0) - MXNet 0.12.0 Release. * [Version 0.11.0 Release](https://github.com/apache/incubator-mxnet/releases/tag/0.11.0) - MXNet 0.11.0 Release. * [Apache Incubator](http://incubator.apache.org/projects/mxnet.html) - We are now an Apache Incubator project. * [Version 0.10.0 Release](https://github.com/dmlc/mxnet/releases/tag/v0.10.0) - MXNet 0.10.0 Release. 
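As a quick illustration of the fluent NDArray methods and the autograd recording described in the 0.12.0 notes above, here is a minimal sketch against that Python API (the array values are arbitrary and only for illustration):

```python
import mxnet as mx

x = mx.nd.array([[1.0, 2.0], [3.0, 4.0]])

# Fluent methods: x.exp() / x.sqrt() instead of mx.nd.exp(x) / mx.nd.sqrt(x)
y = x.exp()
z = x.sqrt()

# Autograd: record a computation, then back-propagate through it
x.attach_grad()
with mx.autograd.record():
    loss = (x * x).sum()
loss.backward()
print(x.grad)  # gradient of sum(x^2) w.r.t. x, i.e. 2*x
```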
diff --git a/setup-utils/install-mxnet-osx-python.sh b/setup-utils/install-mxnet-osx-python.sh index 8bfb7dade7b1..25a44796cb2f 100755 --- a/setup-utils/install-mxnet-osx-python.sh +++ b/setup-utils/install-mxnet-osx-python.sh @@ -33,7 +33,7 @@ then # TODO: Change this to latest tag # to avoid updating this value for every release # - export MXNET_TAG="v0.10.0" + export MXNET_TAG="0.12.0" fi export TARIKH=`/bin/date +%Y-%m-%d-%H:%M:%S` From 618c2cc5490aa22b85e86700dffa6fb3f1a4fe35 Mon Sep 17 00:00:00 2001 From: Kellen Sunderland Date: Tue, 17 Oct 2017 17:00:52 +0200 Subject: [PATCH 07/23] Enable smoothing in softmax operator (#8125) --- src/operator/softmax_output-inl.h | 21 ++++++++- tests/python/unittest/test_operator.py | 61 +++++++++++++++++++++++--- 2 files changed, 74 insertions(+), 8 deletions(-) diff --git a/src/operator/softmax_output-inl.h b/src/operator/softmax_output-inl.h index 5f8203e824a3..7216c76dc2bb 100644 --- a/src/operator/softmax_output-inl.h +++ b/src/operator/softmax_output-inl.h @@ -53,6 +53,7 @@ struct SoftmaxOutputParam : public dmlc::Parameter { bool preserve_shape; int normalization; bool out_grad; + float smooth_alpha; DMLC_DECLARE_PARAMETER(SoftmaxOutputParam) { DMLC_DECLARE_FIELD(grad_scale).set_default(1.0f) .describe("Scales the gradient by a float factor."); @@ -78,6 +79,13 @@ struct SoftmaxOutputParam : public dmlc::Parameter { DMLC_DECLARE_FIELD(out_grad) .set_default(false) .describe("Multiplies gradient with output gradient element-wise."); + DMLC_DECLARE_FIELD(smooth_alpha) + .set_default(0.0f) + .set_range(0.0f, 1.0f) + .describe("Constant for computing a label smoothed version of cross-entropy" + "for the backwards pass. This constant gets subtracted from the" + "one-hot encoding of the gold label and distributed uniformly to" + "all other labels."); }; }; @@ -215,9 +223,18 @@ class SoftmaxOutputOp : public Operator { in_grad[softmaxout_enum::kData].get_with_shape(data_shape, s); index_t valid_cnt = label.shape_.Size(); if (param_.use_ignore) { - SoftmaxGrad(grad, out, label, static_cast(param_.ignore_label)); + if (param_.smooth_alpha == 0.0f) { + SoftmaxGrad(grad, out, label, static_cast(param_.ignore_label)); + } else { + SmoothSoftmaxGrad(grad, out, label, static_cast(param_.ignore_label), + param_.smooth_alpha); + } } else { - SoftmaxGrad(grad, out, label); + if (param_.smooth_alpha == 0.0f) { + SoftmaxGrad(grad, out, label); + } else { + SmoothSoftmaxGrad(grad, out, label, param_.smooth_alpha); + } } if (param_.normalization == softmaxout_enum::kBatch) { valid_cnt = label.size(0); diff --git a/tests/python/unittest/test_operator.py b/tests/python/unittest/test_operator.py index 105d1ce2c113..024e08983235 100644 --- a/tests/python/unittest/test_operator.py +++ b/tests/python/unittest/test_operator.py @@ -236,6 +236,53 @@ def test_regression(): lambda x, y : x - y) +def check_softmax_grad(xpu): + x = mx.sym.Variable('x') + label = mx.sym.Variable('label') + x_nd = mx.nd.array([[1, 6, 4, 2]], ctx=xpu) + grad_x = mx.nd.zeros((1,4), ctx=xpu) + label_nd = mx.nd.array([1], ctx=xpu) + + sym = mx.sym.SoftmaxOutput(data=x, label=label, ignore_label=0, use_ignore=False) + ex = sym.bind(ctx=xpu, args={'x': x_nd, 'label': label_nd}, args_grad={'x': grad_x}) + + ex.forward(is_train=True) + softmax_out = ex.outputs[0].asnumpy() + expected_softmax_out = [[0.005806628, 0.861780069, 0.116629249, 0.015784052]] + assert np.isclose(softmax_out, expected_softmax_out).all() + + ex.backward(is_train=True) + grad_out = ex.grad_arrays[0].asnumpy() + k = 
int(label_nd[0].asscalar()) + expected_grad_out = np.zeros((1,4)) + expected_grad_out[0, k] = -1 + assert np.isclose(grad_out - softmax_out, expected_grad_out).all() + + +def check_smoothed_softmax_grad(xpu): + alpha = 0.2 + x = mx.sym.Variable('x') + label = mx.sym.Variable('label') + x_nd = mx.nd.array([[1, 6, 4, 2]], ctx=xpu) + grad_x = mx.nd.zeros((1,4), ctx=xpu) + label_nd = mx.nd.array([1], ctx=xpu) + + sym = mx.sym.SoftmaxOutput(data=x, label=label, ignore_label=0, use_ignore=False, smooth_alpha=alpha) + ex = sym.bind(ctx=xpu, args={'x': x_nd, 'label': label_nd}, args_grad={'x': grad_x}) + + ex.forward(is_train=True) + softmax_out = ex.outputs[0].asnumpy() + expected_softmax_out = [[0.005806628, 0.861780069, 0.116629249, 0.015784052]] + assert np.isclose(softmax_out, expected_softmax_out).all() + + ex.backward(is_train=True) + grad_out = ex.grad_arrays[0].asnumpy() + k = int(label_nd[0].asscalar()) + expected_grad_out = np.full((1,4), fill_value=-alpha/float(4-1)) + expected_grad_out[0, k] = - (1 - alpha) + assert np.isclose(grad_out - softmax_out, expected_grad_out).all() + + def check_softmax_with_ignore_label(xpu): X = mx.symbol.Variable('X') L = mx.symbol.Variable('L') @@ -286,12 +333,6 @@ def check_softmax_with_shape(shape, xpu, preserve_shape=False): assert_almost_equal(grad.asnumpy(), np_softmax(x.asnumpy()) - l.asnumpy(), rtol=1e-4) -def test_softmax(): - check_softmax_with_shape((3, 4), default_context(), preserve_shape=False) - check_softmax_with_shape((3, 4), default_context(), preserve_shape=True) - check_softmax_with_shape((3, 4, 2), default_context(), preserve_shape=True) - - def test_python_op(): X = mx.symbol.Variable('X') op = mx.operator.NumpyOp() @@ -4542,6 +4583,14 @@ def test_binary_math_operators(): num_eps) +def test_softmax(): + check_softmax_with_shape((3, 4), default_context(), preserve_shape=False) + check_softmax_with_shape((3, 4), default_context(), preserve_shape=True) + check_softmax_with_shape((3, 4, 2), default_context(), preserve_shape=True) + check_softmax_grad(default_context()) + check_smoothed_softmax_grad(default_context()) + + if __name__ == '__main__': import nose nose.runmodule() From cc93069063b83f5e1149bd07952c85f9232a4e17 Mon Sep 17 00:00:00 2001 From: Leonard Lausen Date: Wed, 18 Oct 2017 00:02:45 +0900 Subject: [PATCH 08/23] v0.12 regression: Fix registration of children for Block (#8277) * Fix Block not registering children If the attribute was already set to something different than Block (e.g. None), it was not being registered. 
* fix if / elif for block children registration * trigger test * Add fix from #8152 * Add tests from #8152 --- python/mxnet/gluon/block.py | 7 +++++-- tests/python/unittest/test_gluon.py | 14 ++++++++++++++ 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/python/mxnet/gluon/block.py b/python/mxnet/gluon/block.py index fb4ac8525299..73dbfc10fed7 100644 --- a/python/mxnet/gluon/block.py +++ b/python/mxnet/gluon/block.py @@ -191,9 +191,10 @@ def __setattr__(self, name, value): for i, c in enumerate(self._children): if c is existing: self._children[i] = value - else: - if isinstance(value, Block): + elif isinstance(value, Block): self.register_child(value) + elif isinstance(value, Block): + self.register_child(value) super(Block, self).__setattr__(name, value) @@ -332,6 +333,8 @@ def __init__(self, prefix=None, params=None): def __setattr__(self, name, value): """Registers parameters.""" super(HybridBlock, self).__setattr__(name, value) + if isinstance(value, HybridBlock): + self._clear_cached_op() if isinstance(value, Parameter): assert name not in self._reg_params or \ not isinstance(self._reg_params[name], Parameter), \ diff --git a/tests/python/unittest/test_gluon.py b/tests/python/unittest/test_gluon.py index 60a0630c1665..c9bde39375d6 100644 --- a/tests/python/unittest/test_gluon.py +++ b/tests/python/unittest/test_gluon.py @@ -516,6 +516,20 @@ def test_hybrid_stale_cache(): net.add(mx.gluon.nn.Flatten()) assert net(mx.nd.ones((2,3,5))).shape == (2, 30) + net = mx.gluon.nn.HybridSequential() + with net.name_scope(): + net.fc1 = mx.gluon.nn.Dense(10, weight_initializer='zeros', + bias_initializer='ones', flatten=False) + net.fc2 = mx.gluon.nn.Dense(10, weight_initializer='zeros', + bias_initializer='ones', flatten=False) + net.hybridize() + net.initialize() + net(mx.nd.ones((2,3,5))) + + net.fc2 = mx.gluon.nn.Dense(10, weight_initializer='zeros', + bias_initializer='ones', flatten=True) + net.initialize() + assert net(mx.nd.ones((2,3,5))).shape == (2, 10) if __name__ == '__main__': import nose From 8730f7a3aa5d4c11617d0f1eb506525962dc73a3 Mon Sep 17 00:00:00 2001 From: Chris Olivier Date: Tue, 17 Oct 2017 10:58:04 -0700 Subject: [PATCH 09/23] Revert "[CMAKE] Fix windows cmake build" (#8311) * Revert "Added my code signing key (#8293)" This reverts commit 22ab185bbfde0ac2d801ec700ac4705ef0ee8daa. * Revert "[CMAKE] Fix windows cmake build (#8227)" This reverts commit 1c1c788916d672ee3cafdc4c91d7002a94a59d13. 
--- CMakeLists.txt | 10 ++---- cpp-package/scripts/OpWrapperGenerator.py | 11 +++--- nnvm | 2 +- python/mxnet/visualization.py | 44 +++++++++++------------ 4 files changed, 29 insertions(+), 38 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 63bc8d740b74..76ef5afa57ac 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -275,14 +275,7 @@ FILE(GLOB_RECURSE SOURCE "src/*.cc" "src/*.h" "include/*.h") FILE(GLOB_RECURSE CUDA "src/*.cu" "src/*.cuh") # add nnvm to source -FILE(GLOB_RECURSE NNVMSOURCE - nnvm/src/c_api/*.cc - nnvm/src/core/*.cc - nnvm/src/pass/*.cc - nnvm/src/c_api/*.h - nnvm/src/core/*.h - nnvm/src/pass/*.h - nnvm/include/*.h) +FILE(GLOB_RECURSE NNVMSOURCE "nnvm/src/*.cc" "nnvm/src/*.h" "nnvm/include/*.h") list(APPEND SOURCE ${NNVMSOURCE}) # add mshadow file @@ -534,3 +527,4 @@ if(MSVC) endif() set(LINT_DIRS include src scripts python tests cpp-package) add_custom_target(mxnet_lint COMMAND ${CMAKE_COMMAND} -DMSVC=${MSVC} -DPYTHON_EXECUTABLE=${PYTHON_EXECUTABLE} -DLINT_DIRS=${LINT_DIRS} -DPROJECT_SOURCE_DIR=${CMAKE_CURRENT_SOURCE_DIR} -DPROJECT_NAME=mxnet -P ${CMAKE_CURRENT_SOURCE_DIR}/dmlc-core/cmake/lint.cmake) + diff --git a/cpp-package/scripts/OpWrapperGenerator.py b/cpp-package/scripts/OpWrapperGenerator.py index ac957730d689..83495febcc63 100644 --- a/cpp-package/scripts/OpWrapperGenerator.py +++ b/cpp-package/scripts/OpWrapperGenerator.py @@ -124,15 +124,12 @@ def __init__(self, opName = '', argName = '', typeString = '', descString = ''): self.defaultString = self.enum.GetDefaultValueString(self.defaultString) elif self.defaultString == 'None': self.defaultString = self.type + '()' - elif self.type == "bool": - if self.defaultString == "1" or self.defaultString == "True": - self.defaultString = "true" - else: - self.defaultString = "false" + elif self.defaultString == 'False': + self.defaultString = 'false' + elif self.defaultString == 'True': + self.defaultString = 'true' elif self.defaultString[0] == '(': self.defaultString = 'Shape' + self.defaultString - elif self.defaultString[0] == '[': - self.defaultString = 'Shape(' + self.defaultString[1:-1] + ")" elif self.type == 'dmlc::optional': self.defaultString = self.type + '(' + self.defaultString + ')' elif typeString.startswith('caffe-layer-parameter'): diff --git a/nnvm b/nnvm index ef0adc813e8e..c86afa8f17a4 160000 --- a/nnvm +++ b/nnvm @@ -1 +1 @@ -Subproject commit ef0adc813e8e11fad2380f46f4e0d63b451d7cbf +Subproject commit c86afa8f17a44bcd4e6eec41cd49ba87e4f7a635 diff --git a/python/mxnet/visualization.py b/python/mxnet/visualization.py index 8c4cc3b920d7..aa00488d96a7 100644 --- a/python/mxnet/visualization.py +++ b/python/mxnet/visualization.py @@ -134,20 +134,20 @@ def print_layer_summary(node, out_shape): pre_filter = pre_filter + int(shape[0]) cur_param = 0 if op == 'Convolution': - if ("no_bias" in node["attrs"]) and int(node["attrs"]["no_bias"]): - cur_param = pre_filter * int(node["attrs"]["num_filter"]) - for k in _str2tuple(node["attrs"]["kernel"]): + if ("no_bias" in node["attr"]) and (node["attr"]["no_bias"] == 'True'): + cur_param = pre_filter * int(node["attr"]["num_filter"]) + for k in _str2tuple(node["attr"]["kernel"]): cur_param *= int(k) else: - cur_param = pre_filter * int(node["attrs"]["num_filter"]) - for k in _str2tuple(node["attrs"]["kernel"]): + cur_param = pre_filter * int(node["attr"]["num_filter"]) + for k in _str2tuple(node["attr"]["kernel"]): cur_param *= int(k) - cur_param += int(node["attrs"]["num_filter"]) + cur_param += int(node["attr"]["num_filter"]) elif op == 
'FullyConnected': - if ("no_bias" in node["attrs"]) and int(node["attrs"]["no_bias"]): - cur_param = pre_filter * (int(node["attrs"]["num_hidden"])) + if ("no_bias" in node["attr"]) and (node["attr"]["no_bias"] == 'True'): + cur_param = pre_filter * (int(node["attr"]["num_hidden"])) else: - cur_param = (pre_filter+1) * (int(node["attrs"]["num_hidden"])) + cur_param = (pre_filter+1) * (int(node["attr"]["num_hidden"])) elif op == 'BatchNorm': key = node["name"] + "_output" if show_shape: @@ -291,24 +291,24 @@ def looks_like_weight(name): label = node["name"] attr["fillcolor"] = cm[0] elif op == "Convolution": - label = r"Convolution\n%s/%s, %s" % ("x".join(_str2tuple(node["attrs"]["kernel"])), - "x".join(_str2tuple(node["attrs"]["stride"])) - if "stride" in node["attrs"] else "1", - node["attrs"]["num_filter"]) + label = r"Convolution\n%s/%s, %s" % ("x".join(_str2tuple(node["attr"]["kernel"])), + "x".join(_str2tuple(node["attr"]["stride"])) + if "stride" in node["attr"] else "1", + node["attr"]["num_filter"]) attr["fillcolor"] = cm[1] elif op == "FullyConnected": - label = r"FullyConnected\n%s" % node["attrs"]["num_hidden"] + label = r"FullyConnected\n%s" % node["attr"]["num_hidden"] attr["fillcolor"] = cm[1] elif op == "BatchNorm": attr["fillcolor"] = cm[3] elif op == "Activation" or op == "LeakyReLU": - label = r"%s\n%s" % (op, node["attrs"]["act_type"]) + label = r"%s\n%s" % (op, node["attr"]["act_type"]) attr["fillcolor"] = cm[2] elif op == "Pooling": - label = r"Pooling\n%s, %s/%s" % (node["attrs"]["pool_type"], - "x".join(_str2tuple(node["attrs"]["kernel"])), - "x".join(_str2tuple(node["attrs"]["stride"])) - if "stride" in node["attrs"] else "1") + label = r"Pooling\n%s, %s/%s" % (node["attr"]["pool_type"], + "x".join(_str2tuple(node["attr"]["kernel"])), + "x".join(_str2tuple(node["attr"]["stride"])) + if "stride" in node["attr"] else "1") attr["fillcolor"] = cm[4] elif op == "Concat" or op == "Flatten" or op == "Reshape": attr["fillcolor"] = cm[5] @@ -317,7 +317,7 @@ def looks_like_weight(name): else: attr["fillcolor"] = cm[7] if op == "Custom": - label = node["attrs"]["op_type"] + label = node["attr"]["op_type"] dot.node(name=name, label=label, **attr) @@ -338,8 +338,8 @@ def looks_like_weight(name): if draw_shape: if input_node["op"] != "null": key = input_name + "_output" - if "attrs" in input_node: - params = input_node["attrs"] + if "attr" in input_node: + params = input_node["attr"] if "num_outputs" in params: key += str(int(params["num_outputs"]) - 1) shape = shape_dict[key][1:] From 252227ee19fc9d860d5eaf16f903ba531bc99bf8 Mon Sep 17 00:00:00 2001 From: thinksanky <31976455+thinksanky@users.noreply.github.com> Date: Tue, 17 Oct 2017 15:27:25 -0700 Subject: [PATCH 10/23] fixed broken links. https was pointing to http for mxnet.io (#8300) --- docs/tutorials/r/symbol.md | 2 +- docs/tutorials/sparse/row_sparse.md | 2 +- docs/tutorials/sparse/train.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/tutorials/r/symbol.md b/docs/tutorials/r/symbol.md index 63f3a53bcaaa..6ab4dc2d3d31 100644 --- a/docs/tutorials/r/symbol.md +++ b/docs/tutorials/r/symbol.md @@ -104,7 +104,7 @@ In the example, *net* is used as a function to apply to an existing symbol ## Training a Neural Net -The [model API](../../../R-package/R/model.R) is a thin wrapper around the symbolic executors to support neural net training. 
+The [model API](https://github.com/apache/incubator-mxnet/blob/master/R-package/R/model.R) is a thin wrapper around the symbolic executors to support neural net training. We encourage you to read [Symbolic Configuration and Execution in Pictures for python package](../../api/python/symbol_in_pictures/symbol_in_pictures.md)for a detailed explanation of concepts in pictures. diff --git a/docs/tutorials/sparse/row_sparse.md b/docs/tutorials/sparse/row_sparse.md index e2f0a12c0fda..6a69341da985 100644 --- a/docs/tutorials/sparse/row_sparse.md +++ b/docs/tutorials/sparse/row_sparse.md @@ -271,7 +271,7 @@ rsp_retained = mx.nd.sparse.retain(rsp, mx.nd.array([0, 1])) ## Sparse Operators and Storage Type Inference -Operators that have specialized implementation for sparse arrays can be accessed in ``mx.nd.sparse``. You can read the [mxnet.ndarray.sparse API documentation](https://mxnet.io/versions/master/api/python/ndarray/sparse.html) to find what sparse operators are available. +Operators that have specialized implementation for sparse arrays can be accessed in ``mx.nd.sparse``. You can read the [mxnet.ndarray.sparse API documentation](http://mxnet.io/versions/master/api/python/ndarray/sparse.html) to find what sparse operators are available. ```python diff --git a/docs/tutorials/sparse/train.md b/docs/tutorials/sparse/train.md index d6e3f4e82af2..22ce039ee7f5 100644 --- a/docs/tutorials/sparse/train.md +++ b/docs/tutorials/sparse/train.md @@ -99,7 +99,7 @@ f = mx.sym.sparse.elemwise_add(c, c) ### Storage Type Inference What will be the output storage types of sparse symbols? In MXNet, for any sparse symbol, the result storage types are inferred based on storage types of inputs. -You can read the [Sparse Symbol API](https://mxnet.io/versions/master/api/python/symbol/sparse.html) documentation to find what output storage types are. In the example below we will try out the storage types introduced in the Row Sparse and Compressed Sparse Row tutorials: `default` (dense), `csr`, and `row_sparse`. +You can read the [Sparse Symbol API](http://mxnet.io/versions/master/api/python/symbol/sparse.html) documentation to find what output storage types are. In the example below we will try out the storage types introduced in the Row Sparse and Compressed Sparse Row tutorials: `default` (dense), `csr`, and `row_sparse`. 
```python From 310bbeb7fa9633edaf25842587554a5febbccd83 Mon Sep 17 00:00:00 2001 From: Sheng Zha Date: Tue, 17 Oct 2017 20:52:02 -0700 Subject: [PATCH 11/23] Update rnn.md (#8320) --- docs/api/python/gluon/rnn.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/api/python/gluon/rnn.md b/docs/api/python/gluon/rnn.md index 073314dd956a..7a40c451bca5 100644 --- a/docs/api/python/gluon/rnn.md +++ b/docs/api/python/gluon/rnn.md @@ -21,7 +21,7 @@ with model.name_scope(): model.add(mx.gluon.rnn.LSTM(20)) model.add(mx.gluon.nn.Dense(5, flatten=False)) model.initialize() -model(mx.nd.ones((2,3,5))) +model(mx.nd.ones((2,3))) ``` ```eval_rst From 83e96a96adc2e4a1f52d4f086bee7606e973407f Mon Sep 17 00:00:00 2001 From: Sheng Zha Date: Tue, 17 Oct 2017 22:57:20 -0700 Subject: [PATCH 12/23] fluent methods for missed ops (#8329) --- docs/api/python/ndarray/ndarray.md | 37 +++++++++++++- docs/api/python/symbol/symbol.md | 38 ++++++++++++-- python/mxnet/ndarray/ndarray.py | 72 +++++++++++++++++++++++++++ python/mxnet/symbol/symbol.py | 72 +++++++++++++++++++++++++++ tests/python/unittest/test_ndarray.py | 10 ++-- tests/python/unittest/test_symbol.py | 9 ++-- 6 files changed, 228 insertions(+), 10 deletions(-) diff --git a/docs/api/python/ndarray/ndarray.md b/docs/api/python/ndarray/ndarray.md index 615b9dc5a748..09564c2f2035 100644 --- a/docs/api/python/ndarray/ndarray.md +++ b/docs/api/python/ndarray/ndarray.md @@ -125,6 +125,7 @@ The `ndarray` package provides several classes: NDArray.T NDArray.reshape + NDArray.reshape_like NDArray.flatten NDArray.expand_dims NDArray.split @@ -194,6 +195,7 @@ The `ndarray` package provides several classes: NDArray.topk NDArray.argmax NDArray.argmin + NDArray.argmax_channel ``` ### Arithmetic operations @@ -266,7 +268,22 @@ The `ndarray` package provides several classes: NDArray.sqrt NDArray.rsqrt + NDArray.cbrt + NDArray.rcbrt NDArray.square + NDArray.reciprocal +``` + +## Basic neural network functions + +```eval_rst +.. autosummary:: + :nosignatures: + + NDArray.relu + NDArray.sigmoid + NDArray.softmax + NDArray.log_softmax ``` ### In-place arithmetic operations @@ -358,6 +375,7 @@ The `ndarray` package provides several classes: cast reshape + reshape_like flatten expand_dims ``` @@ -394,6 +412,7 @@ The `ndarray` package provides several classes: concat split + stack ``` ### Indexing routines @@ -514,11 +533,13 @@ The `ndarray` package provides several classes: power sqrt rsqrt + cbrt + rcbrt square reciprocal ``` -### Logic functions +### Comparison ```eval_rst .. autosummary:: @@ -559,6 +580,18 @@ The `ndarray` package provides several classes: argsort argmax argmin + argmax_channel +``` + +### Sequence operation + +```eval_rst +.. autosummary:: + :nosignatures: + + SequenceLast + SequenceMask + SequenceReverse ``` ### Miscellaneous @@ -592,6 +625,8 @@ The `ndarray` package provides several classes: SoftmaxOutput softmax log_softmax + relu + sigmoid ``` ### More diff --git a/docs/api/python/symbol/symbol.md b/docs/api/python/symbol/symbol.md index 7570e18ba73a..e93976d6033a 100644 --- a/docs/api/python/symbol/symbol.md +++ b/docs/api/python/symbol/symbol.md @@ -143,9 +143,23 @@ Composite multiple symbols into a new one by an operator. Symbol.sqrt Symbol.rsqrt + Symbol.cbrt + Symbol.rcbrt Symbol.square ``` +## Basic neural network functions + +```eval_rst +.. 
autosummary:: + :nosignatures: + + Symbol.relu + Symbol.sigmoid + Symbol.softmax + Symbol.log_softmax +``` + #### Comparison operators ```eval_rst @@ -178,6 +192,7 @@ Composite multiple symbols into a new one by an operator. Symbol.astype Symbol.reshape + Symbol.reshape_like Symbol.flatten Symbol.expand_dims ``` @@ -246,6 +261,7 @@ Composite multiple symbols into a new one by an operator. Symbol.topk Symbol.argmax Symbol.argmin + Symbol.argmax_channel ``` ### Query information @@ -355,6 +371,7 @@ Composite multiple symbols into a new one by an operator. cast reshape + reshape_like flatten expand_dims ``` @@ -391,6 +408,7 @@ Composite multiple symbols into a new one by an operator. concat split + stack ``` ### Indexing routines @@ -424,7 +442,6 @@ Composite multiple symbols into a new one by an operator. broadcast_div broadcast_mod negative - reciprocal dot batch_dot add_n @@ -492,7 +509,6 @@ Composite multiple symbols into a new one by an operator. trunc ``` - ### Exponents and logarithms ```eval_rst @@ -519,9 +535,10 @@ Composite multiple symbols into a new one by an operator. cbrt rcbrt square + reciprocal ``` -### Logic functions +### Comparison ```eval_rst .. autosummary:: @@ -534,6 +551,7 @@ Composite multiple symbols into a new one by an operator. broadcast_lesser broadcast_lesser_equal ``` + ### Random sampling ```eval_rst @@ -561,6 +579,18 @@ Composite multiple symbols into a new one by an operator. argsort argmax argmin + argmax_channel +``` + +### Sequence operation + +```eval_rst +.. autosummary:: + :nosignatures: + + SequenceLast + SequenceMask + SequenceReverse ``` ### Miscellaneous @@ -596,6 +626,8 @@ Composite multiple symbols into a new one by an operator. SoftmaxOutput softmax log_softmax + relu + sigmoid ``` ### More diff --git a/python/mxnet/ndarray/ndarray.py b/python/mxnet/ndarray/ndarray.py index 2f9972b21bd6..1cd9f40e520d 100644 --- a/python/mxnet/ndarray/ndarray.py +++ b/python/mxnet/ndarray/ndarray.py @@ -736,6 +736,14 @@ def reshape(self, shape): ctypes.byref(handle))) return NDArray(handle=handle, writable=self.writable) + def reshape_like(self, *args, **kwargs): + """Convenience fluent method for :py:func:`reshape_like`. + + The arguments are the same as for :py:func:`reshape_like`, with + this array as data. + """ + return op.reshape_like(self, *args, **kwargs) + def zeros_like(self, *args, **kwargs): """Convenience fluent method for :py:func:`zeros_like`. @@ -864,6 +872,14 @@ def argmax(self, *args, **kwargs): """ return op.argmax(self, *args, **kwargs) + def argmax_channel(self, *args, **kwargs): + """Convenience fluent method for :py:func:`argmax_channel`. + + The arguments are the same as for :py:func:`argmax_channel`, with + this array as data. + """ + return op.argmax_channel(self, *args, **kwargs) + def argmin(self, *args, **kwargs): """Convenience fluent method for :py:func:`argmin`. @@ -1224,6 +1240,22 @@ def rsqrt(self, *args, **kwargs): """ return op.rsqrt(self, *args, **kwargs) + def cbrt(self, *args, **kwargs): + """Convenience fluent method for :py:func:`cbrt`. + + The arguments are the same as for :py:func:`cbrt`, with + this array as data. + """ + return op.cbrt(self, *args, **kwargs) + + def rcbrt(self, *args, **kwargs): + """Convenience fluent method for :py:func:`rcbrt`. + + The arguments are the same as for :py:func:`rcbrt`, with + this array as data. + """ + return op.rcbrt(self, *args, **kwargs) + def square(self, *args, **kwargs): """Convenience fluent method for :py:func:`square`. 
@@ -1232,6 +1264,46 @@ def square(self, *args, **kwargs): """ return op.square(self, *args, **kwargs) + def reciprocal(self, *args, **kwargs): + """Convenience fluent method for :py:func:`reciprocal`. + + The arguments are the same as for :py:func:`reciprocal`, with + this array as data. + """ + return op.reciprocal(self, *args, **kwargs) + + def relu(self, *args, **kwargs): + """Convenience fluent method for :py:func:`relu`. + + The arguments are the same as for :py:func:`relu`, with + this array as data. + """ + return op.relu(self, *args, **kwargs) + + def sigmoid(self, *args, **kwargs): + """Convenience fluent method for :py:func:`sigmoid`. + + The arguments are the same as for :py:func:`sigmoid`, with + this array as data. + """ + return op.sigmoid(self, *args, **kwargs) + + def softmax(self, *args, **kwargs): + """Convenience fluent method for :py:func:`softmax`. + + The arguments are the same as for :py:func:`softmax`, with + this array as data. + """ + return op.softmax(self, *args, **kwargs) + + def log_softmax(self, *args, **kwargs): + """Convenience fluent method for :py:func:`log_softmax`. + + The arguments are the same as for :py:func:`log_softmax`, with + this array as data. + """ + return op.log_softmax(self, *args, **kwargs) + # pylint: disable= undefined-variable def broadcast_to(self, shape): """Broadcasts the input array to a new shape. diff --git a/python/mxnet/symbol/symbol.py b/python/mxnet/symbol/symbol.py index 3c76826cdd29..6903db0b0d00 100644 --- a/python/mxnet/symbol/symbol.py +++ b/python/mxnet/symbol/symbol.py @@ -1745,6 +1745,14 @@ def reshape(self, *args, **kwargs): """ return op.reshape(self, *args, **kwargs) + def reshape_like(self, *args, **kwargs): + """Convenience fluent method for :py:func:`reshape_like`. + + The arguments are the same as for :py:func:`reshape_like`, with + this array as data. + """ + return op.reshape_like(self, *args, **kwargs) + def astype(self, *args, **kwargs): """Convenience fluent method for :py:func:`cast`. @@ -1881,6 +1889,14 @@ def argmax(self, *args, **kwargs): """ return op.argmax(self, *args, **kwargs) + def argmax_channel(self, *args, **kwargs): + """Convenience fluent method for :py:func:`argmax_channel`. + + The arguments are the same as for :py:func:`argmax_channel`, with + this array as data. + """ + return op.argmax_channel(self, *args, **kwargs) + def argmin(self, *args, **kwargs): """Convenience fluent method for :py:func:`argmin`. @@ -2249,6 +2265,22 @@ def rsqrt(self, *args, **kwargs): """ return op.rsqrt(self, *args, **kwargs) + def cbrt(self, *args, **kwargs): + """Convenience fluent method for :py:func:`cbrt`. + + The arguments are the same as for :py:func:`cbrt`, with + this array as data. + """ + return op.cbrt(self, *args, **kwargs) + + def rcbrt(self, *args, **kwargs): + """Convenience fluent method for :py:func:`rcbrt`. + + The arguments are the same as for :py:func:`rcbrt`, with + this array as data. + """ + return op.rcbrt(self, *args, **kwargs) + def square(self, *args, **kwargs): """Convenience fluent method for :py:func:`square`. @@ -2257,6 +2289,46 @@ def square(self, *args, **kwargs): """ return op.square(self, *args, **kwargs) + def reciprocal(self, *args, **kwargs): + """Convenience fluent method for :py:func:`reciprocal`. + + The arguments are the same as for :py:func:`reciprocal`, with + this array as data. + """ + return op.reciprocal(self, *args, **kwargs) + + def relu(self, *args, **kwargs): + """Convenience fluent method for :py:func:`relu`. 
+ + The arguments are the same as for :py:func:`relu`, with + this array as data. + """ + return op.relu(self, *args, **kwargs) + + def sigmoid(self, *args, **kwargs): + """Convenience fluent method for :py:func:`sigmoid`. + + The arguments are the same as for :py:func:`sigmoid`, with + this array as data. + """ + return op.sigmoid(self, *args, **kwargs) + + def softmax(self, *args, **kwargs): + """Convenience fluent method for :py:func:`softmax`. + + The arguments are the same as for :py:func:`softmax`, with + this array as data. + """ + return op.softmax(self, *args, **kwargs) + + def log_softmax(self, *args, **kwargs): + """Convenience fluent method for :py:func:`log_softmax`. + + The arguments are the same as for :py:func:`log_softmax`, with + this array as data. + """ + return op.log_softmax(self, *args, **kwargs) + def wait_to_read(self): raise NotImplementedForSymbol(self.wait_to_read, None) diff --git a/tests/python/unittest/test_ndarray.py b/tests/python/unittest/test_ndarray.py index e5dddeb95a74..576d9635406b 100644 --- a/tests/python/unittest/test_ndarray.py +++ b/tests/python/unittest/test_ndarray.py @@ -742,7 +742,9 @@ def test_ndarray_fluent(): 'one_hot', 'pick', 'sort', 'topk', 'argsort', 'argmax', 'argmin', 'clip', 'abs', 'sign', 'sin', 'cos', 'tan', 'arcsin', 'arccos', 'arctan', 'degrees', 'radians', 'sinh', 'cosh', 'tanh', 'arcsinh', 'arccosh', 'arctanh', - 'exp', 'expm1', 'log', 'log10', 'log2', 'log1p', 'sqrt', 'rsqrt', 'square']) + 'exp', 'expm1', 'log', 'log10', 'log2', 'log1p', 'sqrt', 'rsqrt', 'square', + 'reshape_like', 'cbrt', 'rcbrt', 'relu', 'sigmoid', 'softmax', 'log_softmax', + 'reciprocal']) def check_fluent_regular(func, kwargs, shape=(5, 17, 1), equal_nan=False): with mx.name.NameManager(): data = mx.nd.random_uniform(shape=shape, ctx=default_context()) @@ -756,11 +758,12 @@ def check_fluent_regular(func, kwargs, shape=(5, 17, 1), equal_nan=False): for func in ['flatten', 'norm', 'round', 'rint', 'fix', 'floor', 'ceil', 'trunc', 'zeros_like', 'ones_like', 'abs', 'sign', 'sin', 'cos', 'degrees', 'radians', - 'exp', 'expm1', 'square']: + 'exp', 'expm1', 'square', 'reciprocal', 'argmax_channel']: check_fluent_regular(func, {}) for func in ['arccosh', 'arcsin', 'arccos', 'arctan', 'tan', 'sinh', 'cosh', 'tanh', - 'arcsinh', 'arctanh', 'log', 'log10', 'log2', 'log1p', 'sqrt', 'rsqrt']: + 'arcsinh', 'arctanh', 'log', 'log10', 'log2', 'log1p', 'sqrt', 'rsqrt', + 'cbrt', 'rcbrt', 'relu', 'sigmoid', 'softmax', 'log_softmax']: check_fluent_regular(func, {}, equal_nan=True) for func in ['expand_dims', 'flip', 'sort', 'topk', 'argsort', 'argmax', 'argmin']: @@ -778,6 +781,7 @@ def check_fluent_regular(func, kwargs, shape=(5, 17, 1), equal_nan=False): check_fluent_regular('clip', {'a_min': 0.25, 'a_max': 0.75}) check_fluent_regular('broadcast_axes', {'axis': (2,), 'size': (5,)}) check_fluent_regular('pad', {'mode': 'constant', 'pad_width': (0,0,0,0,3,0,0,4)}, shape=(5, 17, 2, 3)) + check_fluent_regular('reshape_like', {'rhs': mx.nd.ones((30, 17))}, shape=(5, 17, 2, 3)) for func in ['sum', 'nansum', 'prod', 'nanprod', 'mean', 'max', 'min']: check_fluent_regular(func, {'axis': (1, 2)}) diff --git a/tests/python/unittest/test_symbol.py b/tests/python/unittest/test_symbol.py index 2d31c600a4ae..30e76a272e2a 100644 --- a/tests/python/unittest/test_symbol.py +++ b/tests/python/unittest/test_symbol.py @@ -168,7 +168,8 @@ def test_symbol_fluent(): 'clip', 'abs', 'sign', 'sin', 'cos', 'tan', 'arcsin', 'arccos', 'arctan', 'degrees', 'radians', 'sinh', 'cosh', 'tanh', 'arcsinh', 
'arccosh', 'arctanh', 'exp', 'expm1', 'log', 'log10', 'log2', 'log1p', 'sqrt', 'rsqrt', - 'square']) + 'square', 'reciprocal', 'reshape_like', 'cbrt', 'rcbrt', 'relu', 'sigmoid', + 'softmax', 'log_softmax']) def check_fluent_regular(func, kwargs, shape=(5, 17, 1), equal_nan=False): with mx.name.NameManager(): data = mx.symbol.Variable('data') @@ -181,11 +182,12 @@ def check_fluent_regular(func, kwargs, shape=(5, 17, 1), equal_nan=False): for func in ['flatten', 'norm', 'round', 'rint', 'fix', 'floor', 'ceil', 'trunc', 'zeros_like', 'ones_like', 'abs', 'sign', 'sin', 'cos', 'degrees', 'radians', - 'exp', 'expm1', 'square']: + 'exp', 'expm1', 'square', 'reciprocal', 'argmax_channel']: check_fluent_regular(func, {}) for func in ['arccosh', 'arcsin', 'arccos', 'arctan', 'tan', 'sinh', 'cosh', 'tanh', - 'arcsinh', 'arctanh', 'log', 'log10', 'log2', 'log1p', 'sqrt', 'rsqrt']: + 'arcsinh', 'arctanh', 'log', 'log10', 'log2', 'log1p', 'sqrt', 'rsqrt', + 'cbrt', 'rcbrt', 'relu', 'sigmoid', 'softmax', 'log_softmax']: check_fluent_regular(func, {}, equal_nan=True) for func in ['expand_dims', 'flip', 'sort', 'topk', 'argsort', 'argmax', 'argmin']: @@ -201,6 +203,7 @@ def check_fluent_regular(func, kwargs, shape=(5, 17, 1), equal_nan=False): check_fluent_regular('clip', {'a_min': 0.25, 'a_max': 0.75}) check_fluent_regular('broadcast_axes', {'axis': (2,), 'size': (5,)}) check_fluent_regular('pad', {'mode': 'constant', 'pad_width': (0,0,0,0,3,0,0,4)}, shape=(5, 17, 2, 3)) + check_fluent_regular('reshape_like', {'rhs': mx.sym.ones((30, 17))}, shape=(5, 17, 2, 3)) for func in ['sum', 'nansum', 'prod', 'nanprod', 'mean', 'max', 'min']: check_fluent_regular(func, {'axis': (1, 2)}) From dc4c3c833620c5f8814f56d7ebb917fa0861c668 Mon Sep 17 00:00:00 2001 From: Eric Junyuan Xie Date: Wed, 18 Oct 2017 01:49:04 -0700 Subject: [PATCH 13/23] update ps lite (#8327) --- ps-lite | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ps-lite b/ps-lite index acdb698fa3bb..bdd4c67e9e34 160000 --- a/ps-lite +++ b/ps-lite @@ -1 +1 @@ -Subproject commit acdb698fa3bb80929ef83bb37c705f025e119b82 +Subproject commit bdd4c67e9e34dc0b8350ce306b0caa737eb31c83 From 28b76e35828d955d96e4d9f6af680090f8be486a Mon Sep 17 00:00:00 2001 From: Chris Olivier Date: Wed, 18 Oct 2017 08:15:09 -0700 Subject: [PATCH 14/23] Fix unused type warning (#8316) --- src/operator/tensor/init_op.h | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/src/operator/tensor/init_op.h b/src/operator/tensor/init_op.h index e08a682d94c3..97bda906c66f 100644 --- a/src/operator/tensor/init_op.h +++ b/src/operator/tensor/init_op.h @@ -325,11 +325,9 @@ inline bool RangeShape(const nnvm::NodeAttrs& attrs, << "Invalid range (start, stop, step)= " << "(" << param.start << "," << param.stop.value() << "," << param.step << ")"; } - MSHADOW_TYPE_SWITCH(param.dtype, DType, { - double out_size = std::ceil((param.stop.value() - param.start) / param.step) - * param.repeat; - SHAPE_ASSIGN_CHECK(*out_attrs, 0, TShape({static_cast(out_size)})); - }); + const double out_size = std::ceil((param.stop.value() - param.start) / param.step) + * param.repeat; + SHAPE_ASSIGN_CHECK(*out_attrs, 0, TShape({static_cast(out_size)})); return true; } From 55068f702e644c7c28ae9ff688d3e5d1ceb57834 Mon Sep 17 00:00:00 2001 From: Olivier Date: Fri, 20 Oct 2017 13:39:51 -0700 Subject: [PATCH 15/23] Trigger build From 40656394f1d3cd4f5dd89b651868b0aff435cd31 Mon Sep 17 00:00:00 2001 From: cjolivier01 Date: Fri, 20 Oct 2017 21:23:23 -0700 Subject: [PATCH 16/23] Trigger build
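Note on the RangeShape change in PATCH 14 above: the arange output length is a pure function of start, stop, step, and repeat, so it can be computed once without any dtype switch. A minimal Python sketch of the same size rule, assuming the usual mx.nd.arange start/stop/step/repeat arguments (the helper name below is illustrative only, not part of MXNet):

    import math
    import mxnet as mx

    def range_output_size(start, stop, step=1.0, repeat=1):
        # illustrative helper: mirrors RangeShape's ceil((stop - start) / step) * repeat,
        # which does not depend on the requested dtype
        return int(math.ceil((stop - start) / step) * repeat)

    out = mx.nd.arange(0, 20, step=3, repeat=2)
    assert out.shape[0] == range_output_size(0, 20, step=3, repeat=2)  # ceil(20/3) * 2 = 14
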
From 2cf83cbc390a0fb2176766f37aa6e54064f16703 Mon Sep 17 00:00:00 2001 From: Haibin Lin Date: Sat, 21 Oct 2017 00:23:20 -0700 Subject: [PATCH 17/23] Misc fixes for sparse distributed training (#8345) * remove mshadow::range in init_op.h * add unit test * remove pass by ptr, add unit test for pull empty weights * fix range in key partition * remove wrong comment * remove change for partition * remove unused var * add int64 to arange. add checkpointing example --- example/sparse/linear_classification.py | 7 +++++ src/kvstore/kvstore_dist.h | 41 +++++++++++-------------- src/operator/tensor/init_op.h | 19 ++++++------ tests/nightly/dist_sync_kvstore.py | 31 ++++++++++--------- tests/python/unittest/test_ndarray.py | 2 ++ tests/python/unittest/test_optimizer.py | 4 +++ 6 files changed, 58 insertions(+), 46 deletions(-) diff --git a/example/sparse/linear_classification.py b/example/sparse/linear_classification.py index b173d04139aa..70f896386cbc 100644 --- a/example/sparse/linear_classification.py +++ b/example/sparse/linear_classification.py @@ -96,6 +96,7 @@ # get the sparse weight parameter weight_index = mod._exec_group.param_names.index('weight') weight_param = mod._exec_group.param_arrays[weight_index] + all_row_ids = mx.nd.arange(0, num_features, dtype='int64') speedometer = mx.callback.Speedometer(batch_size, 100) logging.info('Training started ...') @@ -118,9 +119,15 @@ speedometer_param = mx.model.BatchEndParam(epoch=epoch, nbatch=nbatch, eval_metric=metric, locals=locals()) speedometer(speedometer_param) + # pull all rows before making a checkpoint + if kv: + kv.row_sparse_pull('weight', weight_param, row_ids=[all_row_ids], + priority=-weight_index) # evaluate metric on validation dataset score = mod.score(eval_data, ['nll_loss']) logging.info('epoch %d, eval nll = %s ' % (epoch, score[0][1])) + save_optimizer_states = 'dist' not in kv.type + mod.save_checkpoint("checkpoint", epoch, save_optimizer_states=False) # reset the iterator for next pass of data data_iter.reset() logging.info('Training completed.') diff --git a/src/kvstore/kvstore_dist.h b/src/kvstore/kvstore_dist.h index 2d5e52fc3a67..5e62be8c4c40 100644 --- a/src/kvstore/kvstore_dist.h +++ b/src/kvstore/kvstore_dist.h @@ -42,10 +42,6 @@ namespace kvstore { /** * \brief distributed kvstore * - * for a worker node, it always guarantees that all push and pull issued from - * this worker on the same key are serialized. namely push(3) and then pull(3), - * then the data pulled is always containing the modification from the push(3). - * * it's the server node's job to control the data consistency among all * workers.
see details on \ref ServerHandle::Start */ @@ -248,7 +244,7 @@ class KVStoreDist : public KVStoreLocal { LOG(FATAL) << "RowSparsePull with multiple values is not implemented yet"; } else { auto& indices = target_val_rowids[0].second; - PullRowSparse_(key, &recv_buf, indices, priority); + PullRowSparse_(key, recv_buf, indices, priority); comm_->BroadcastRowSparse(key, recv_buf, grouped_val_rowid, num_vals == 1, priority); } } @@ -322,24 +318,24 @@ class KVStoreDist : public KVStoreLocal { } // pull row sparse weight into `recv_buf` based on indices given by `indices` - void PullRowSparse_(const int key, NDArray *recv_buf, const NDArray& indices, int priority) { + void PullRowSparse_(const int key, const NDArray& recv_buf, + const NDArray& indices, int priority) { using namespace rowsparse; auto pull_from_servers = [this, key, recv_buf, indices] (RunContext rctx, Engine::CallbackOnComplete cb) { // allocate memory for the buffer size_t num_rows = indices.shape().Size(); - recv_buf->CheckAndAlloc({mshadow::Shape1(num_rows)}); + recv_buf.CheckAndAlloc({mshadow::Shape1(num_rows)}); #if MKL_EXPERIMENTAL == 1 - mkl_set_tblob_eager_mode(recv_buf->data()); + mkl_set_tblob_eager_mode(recv_buf.data()); #endif - real_t* data = recv_buf->data().dptr(); - auto indices_data = indices.data(); - const auto offsets = indices_data.dptr(); - const auto unit_len = recv_buf->shape().ProdShape(1, recv_buf->shape().ndim()); + real_t* data = recv_buf.data().dptr(); + const auto offsets = indices.data().dptr(); + const auto unit_len = recv_buf.shape().ProdShape(1, recv_buf.shape().ndim()); const int64_t size = num_rows * unit_len; // convert to ps keys in row sparse format PSKV& pskv = EncodeRowSparseKey(key, size, num_rows, offsets, - unit_len, recv_buf->shape()[0]); + unit_len, recv_buf.shape()[0]); if (this->log_verbose_) { LOG(INFO) << "worker " << get_rank() << " pull lens: " << pskv.lens << " keys: " << pskv.keys << " size: " << size; @@ -348,8 +344,8 @@ class KVStoreDist : public KVStoreLocal { // copy indices to recv_buf. this needs to be done before ZPull // because after pull is done, the callback function returns and locks are released. // at this point, later functions may access the indices variable while copy happens - mshadow::Copy(recv_buf->aux_data(kIdx).FlatTo1D(), - indices_data.FlatTo1D()); + mshadow::Copy(recv_buf.aux_data(kIdx).FlatTo1D(), + indices.data().FlatTo1D()); CHECK_NOTNULL(ps_worker_)->ZPull(pskv.keys, vals, &pskv.lens, kRowSparsePushPull, [vals, cb]() { delete vals; cb(); }); }; @@ -357,7 +353,7 @@ class KVStoreDist : public KVStoreLocal { pull_from_servers, pinned_ctx_, {indices.var()}, - {recv_buf->var()}, + {recv_buf.var()}, FnProperty::kNormal, priority, PROFILER_MESSAGE("KVStoreDistRowSparsePull")); @@ -366,15 +362,14 @@ class KVStoreDist : public KVStoreLocal { // push row sparse gradient void PushRowSparse(int key, const NDArray &send_buf, int priority) { using namespace rowsparse; - auto push_to_servers = [this, key, &send_buf] + auto push_to_servers = [this, key, send_buf] (RunContext rctx, Engine::CallbackOnComplete cb) { #if MKL_EXPERIMENTAL == 1 mkl_set_tblob_eager_mode(send_buf.data()); #endif real_t* data = send_buf.data().dptr(); - bool init = send_buf.storage_initialized(); - const int64_t num_rows = init ? send_buf.aux_shape(kIdx)[0] : 0; - const auto offsets = init ? 
send_buf.aux_data(kIdx).dptr() : nullptr; + const int64_t num_rows = send_buf.aux_shape(kIdx)[0]; + const auto offsets = send_buf.aux_data(kIdx).dptr(); const auto unit_len = send_buf.shape().ProdShape(1, send_buf.shape().ndim()); const int64_t size = num_rows * unit_len; @@ -472,7 +467,7 @@ class KVStoreDist : public KVStoreLocal { return pskv; } - // TODO(haibin) this encoding method for row sparse keys doesn't allow cross-layer batching + // Note: this encoding method for row sparse keys doesn't allow cross-layer batching inline PSKV& EncodeRowSparseKey(const int key, const int64_t size, const int64_t num_rows, const int64_t *offsets, const size_t unit_len, const int64_t total_num_rows) { @@ -495,15 +490,15 @@ class KVStoreDist : public KVStoreLocal { ps::Key master_key = krs[i].begin() + key; pskv.keys.push_back(master_key); pskv.lens.push_back(0); - if (offsets) { + if (offsets && size > 0) { // calculate partition ranges int64_t part_num_rows = llround(static_cast(total_num_rows) / num_servers * (i + 1)) - llround(static_cast(total_num_rows) / num_servers * i); auto end_row = start_row + part_num_rows; + // search for offsets in [start_row, end_row) auto lb = std::lower_bound(offsets, offsets + num_rows, start_row); auto ub = std::upper_bound(offsets, offsets + num_rows, end_row - 1); - for (auto offset = lb; offset < ub; offset++) { ps::Key ps_key = krs[i].begin() + key + (*offset - start_row); CHECK_LT(ps_key, krs[i].end()); diff --git a/src/operator/tensor/init_op.h b/src/operator/tensor/init_op.h index 97bda906c66f..ea4243ee911b 100644 --- a/src/operator/tensor/init_op.h +++ b/src/operator/tensor/init_op.h @@ -93,6 +93,7 @@ struct RangeParam : public dmlc::Parameter { .add_enum("float16", mshadow::kFloat16) .add_enum("uint8", mshadow::kUint8) .add_enum("int32", mshadow::kInt32) + .add_enum("int64", mshadow::kInt64) .describe("Target data type."); } }; @@ -179,6 +180,13 @@ void FillCompute(const nnvm::NodeAttrs& attrs, }); } +struct PopulateFullIdxRspKernel { + template + MSHADOW_XINLINE static void Map(int i, IType* out) { + KERNEL_ASSIGN(out[i], kWriteTo, i); + } +}; + // Fill in the indices and values of a RowSparse NDArray to represent a zeros NDArray, // instead of the usual compact representation. template @@ -192,21 +200,14 @@ inline void FillDnsZerosRspImpl(mshadow::Stream *s, NDArray *dst) { MSHADOW_IDX_TYPE_SWITCH(dst->aux_type(kIdx), IType, { auto num_rows = dst->shape()[0]; dst->CheckAndAlloc({Shape1(num_rows)}); - auto idx = dst->aux_data(kIdx).FlatTo1D(s); + auto idx = dst->aux_data(kIdx); auto val = dst->data(); Kernel::Launch(s, val.Size(), val.dptr()); - ASSIGN_DISPATCH(idx, kWriteTo, range(0, num_rows, 1, 1)); + Kernel::Launch(s, num_rows, idx.dptr()); }); }); } -struct PopulateFullIdxRspKernel { - template - MSHADOW_XINLINE static void Map(int i, IType* out) { - KERNEL_ASSIGN(out[i], kWriteTo, i); - } -}; - // Fill full indices NDArray with zeros by updating the aux shape. 
template void PopulateFullIdxRspImpl(mshadow::Stream *s, NDArray *dst) { diff --git a/tests/nightly/dist_sync_kvstore.py b/tests/nightly/dist_sync_kvstore.py index 5f1b11f041a7..900d6bb6afb7 100644 --- a/tests/nightly/dist_sync_kvstore.py +++ b/tests/nightly/dist_sync_kvstore.py @@ -39,7 +39,7 @@ def check_diff_to_scalar(A, x, rank=None): rate = 2 shape = (2, 3) -big_shape = (1200, 1200) # bigger than BIGARRAY_BOUND +big_shape = (1200, 1200) # bigger than MXNET_KVSTORE_BIGARRAY_BOUND kv = mx.kv.create('dist_sync') @@ -104,24 +104,27 @@ def check_row_sparse_keys(kv, my_rank, nworker): def check_row_sparse_keys_with_zeros(kv, my_rank, nworker): nrepeat = 3 # prepare gradient - v = mx.nd.zeros(shape) - big_v = mx.nd.zeros(big_shape) + v = mx.nd.sparse.zeros('row_sparse', shape) + big_v = mx.nd.sparse.zeros('row_sparse', big_shape) # push for i in range(nrepeat): - kv.push('11', v.tostype('row_sparse')) - kv.push('100', big_v.tostype('row_sparse')) - + kv.push('11', v) + kv.push('100', big_v) # pull a subset of rows this worker is interested in all_row_ids = np.arange(shape[0]) - val = mx.nd.ones(shape).tostype('row_sparse') - big_val = mx.nd.ones(big_shape).tostype('row_sparse') - kv.row_sparse_pull('11', out=val, row_ids=mx.nd.array(all_row_ids, dtype='int64')) - big_num_rows = shape[0] + val = mx.nd.sparse.zeros('row_sparse', shape) + big_val = mx.nd.sparse.zeros('row_sparse', big_shape) + kv.row_sparse_pull('11', out=val, row_ids=mx.nd.array(all_row_ids)) big_all_row_ids = np.arange(big_shape[0]) - kv.row_sparse_pull('100', out=big_val, row_ids=mx.nd.array(big_all_row_ids, dtype='int64')) + kv.row_sparse_pull('100', out=big_val, row_ids=mx.nd.array(big_all_row_ids)) # verify results - check_diff_to_scalar(val, mx.nd.ones(shape)) - check_diff_to_scalar(big_val, mx.nd.ones(big_shape)) + check_diff_to_scalar(val, 1) + check_diff_to_scalar(big_val, 1) + # pull empty weights + kv.row_sparse_pull('11', out=val, row_ids=mx.nd.array([])) + kv.row_sparse_pull('100', out=big_val, row_ids=mx.nd.array([])) + check_diff_to_scalar(val, 0) + check_diff_to_scalar(big_val, 0) def check_big_row_sparse_keys(kv, my_rank, nworker): mx.random.seed(123) @@ -154,7 +157,7 @@ def check_big_row_sparse_keys(kv, my_rank, nworker): rnd.seed(my_rank) num_rows = big_shape[0] row_ids_np = np.random.randint(num_rows, size=num_rows) - row_ids = mx.nd.array(row_ids_np, dtype='int64') + row_ids = mx.nd.array(row_ids_np) # perform pull val = mx.nd.zeros(big_shape, stype='row_sparse') kv.row_sparse_pull('100', out=val, row_ids=row_ids) diff --git a/tests/python/unittest/test_ndarray.py b/tests/python/unittest/test_ndarray.py index 576d9635406b..fc8c350bbbe4 100644 --- a/tests/python/unittest/test_ndarray.py +++ b/tests/python/unittest/test_ndarray.py @@ -734,6 +734,8 @@ def test_output(): assert_almost_equal(out.asnumpy(), zeros.asnumpy()) mx.nd.full(shape, 2, out=out) assert_almost_equal(out.asnumpy(), ones.asnumpy() * 2) + arange_out = mx.nd.arange(0, 20, dtype='int64') + assert_almost_equal(arange_out.asnumpy(), np.arange(0, 20)) def test_ndarray_fluent(): has_grad = set(['flatten', 'expand_dims', 'flip', 'tile', 'transpose', 'sum', 'nansum', 'prod', diff --git a/tests/python/unittest/test_optimizer.py b/tests/python/unittest/test_optimizer.py index 8666b9e71430..1a26434015de 100644 --- a/tests/python/unittest/test_optimizer.py +++ b/tests/python/unittest/test_optimizer.py @@ -232,6 +232,10 @@ def test_sgd(): if dtype != np.float16: compare_optimizer(opt1(**kwarg), opt2(**kwarg), shape[:2], dtype, w_stype='csr', 
g_stype='csr') + # test optimizer with a big shape + big_shape = (54686454, 1) + kwarg = {'momentum': 0.9, 'wd': 0.05} + compare_optimizer(opt1(**kwarg), opt2(**kwarg), big_shape, np.float32) class PySparseSGD(mx.optimizer.Optimizer): """python reference implemenation of sgd""" From f4c57aa2417604f3691067b5692b06e961d4e5de Mon Sep 17 00:00:00 2001 From: mbaijal <30911248+mbaijal@users.noreply.github.com> Date: Sat, 21 Oct 2017 12:02:15 -0700 Subject: [PATCH 18/23] Fix the Readme (#8369) --- README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/README.md b/README.md index fc252a7a72b6..8a65b4060c71 100644 --- a/README.md +++ b/README.md @@ -22,7 +22,6 @@ deep learning systems, and interesting insights of DL systems for hackers. What's New ---------- -* [Version 0.12.0 Release](https://github.com/apache/incubator-mxnet/releases/tag/0.12.0) - MXNet 0.12.0 Release. * [Version 0.11.0 Release](https://github.com/apache/incubator-mxnet/releases/tag/0.11.0) - MXNet 0.11.0 Release. * [Apache Incubator](http://incubator.apache.org/projects/mxnet.html) - We are now an Apache Incubator project. * [Version 0.10.0 Release](https://github.com/dmlc/mxnet/releases/tag/v0.10.0) - MXNet 0.10.0 Release. From 68ea95f68ebd8b343971a9c50a4b556c09fb47ec Mon Sep 17 00:00:00 2001 From: Chris Olivier Date: Sat, 21 Oct 2017 12:06:21 -0700 Subject: [PATCH 19/23] Allow test to converge (#8351) * Allow test to converge * Trigger build * Trigger build * Trigger build --- tests/python/train/test_dtype.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/python/train/test_dtype.py b/tests/python/train/test_dtype.py index b0a524815c6c..96912c09dbe0 100644 --- a/tests/python/train/test_dtype.py +++ b/tests/python/train/test_dtype.py @@ -99,7 +99,7 @@ def run_cifar10(train, val, use_module): devs = [mx.cpu(0)] net = get_net() mod = mx.mod.Module(net, context=devs) - optim_args = {'learning_rate': 0.05, 'wd': 0.00001, 'momentum': 0.9} + optim_args = {'learning_rate': 0.001, 'wd': 0.00001, 'momentum': 0.9} eval_metrics = ['accuracy'] if use_module: executor = mx.mod.Module(net, context=devs) From 2bb9e94133b965cc74efed301667fb773f2b121c Mon Sep 17 00:00:00 2001 From: solin319 Date: Sun, 22 Oct 2017 03:13:02 +0800 Subject: [PATCH 20/23] Update cudnn_algoreg-inl.h (#7988) --- src/operator/cudnn_algoreg-inl.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/operator/cudnn_algoreg-inl.h b/src/operator/cudnn_algoreg-inl.h index ccc5140c9397..c10593fb0af4 100644 --- a/src/operator/cudnn_algoreg-inl.h +++ b/src/operator/cudnn_algoreg-inl.h @@ -102,7 +102,7 @@ class CuDNNAlgoReg { ParamKey key{param, in_shape[0], in_shape[1], out_shape[0], cudnn_data_type, cudnn_forward_compute_type, cudnn_backward_compute_type, sm_arch}; std::lock_guard guard(lock_); - if (reg_.size() % 50 == 0) { + if (param.cudnn_tune.value() && reg_.size() % 50 == 0) { LOG(INFO) << "Running performance tests to find the best convolution " "algorithm, " "this can take a while... 
(setting env variable " From 52adc56e937726bfddb4ba84202d71b6feff550e Mon Sep 17 00:00:00 2001 From: Robert Stone Date: Sat, 21 Oct 2017 12:15:08 -0700 Subject: [PATCH 21/23] [Perl] emulate Python zip() for Perl (#8192) * [Perl] emulate Python zip() for Perl * [Perl] retool zip() uses away from the callback form --- .../AI-MXNet/lib/AI/MXNet/AutoGrad.pm | 8 +- perl-package/AI-MXNet/lib/AI/MXNet/Base.pm | 23 ++-- .../AI-MXNet/lib/AI/MXNet/Executor/Group.pm | 74 ++++++------ .../AI-MXNet/lib/AI/MXNet/Gluon/Block.pm | 24 ++-- .../AI-MXNet/lib/AI/MXNet/Gluon/Parameter.pm | 8 +- .../AI-MXNet/lib/AI/MXNet/Gluon/RNN/Cell.pm | 14 +-- .../AI-MXNet/lib/AI/MXNet/Gluon/RNN/Layer.pm | 6 +- .../AI-MXNet/lib/AI/MXNet/Gluon/Trainer.pm | 8 +- .../AI-MXNet/lib/AI/MXNet/Gluon/Utils.pm | 8 +- perl-package/AI-MXNet/lib/AI/MXNet/KVStore.pm | 6 +- perl-package/AI-MXNet/lib/AI/MXNet/Metric.pm | 66 +++++------ perl-package/AI-MXNet/lib/AI/MXNet/Module.pm | 12 +- perl-package/AI-MXNet/lib/AI/MXNet/Monitor.pm | 12 +- perl-package/AI-MXNet/lib/AI/MXNet/NDArray.pm | 12 +- .../AI-MXNet/lib/AI/MXNet/NDArray/Slice.pm | 20 ++-- .../AI-MXNet/lib/AI/MXNet/RNN/Cell.pm | 18 +-- perl-package/AI-MXNet/lib/AI/MXNet/Symbol.pm | 6 +- perl-package/AI-MXNet/t/test_autograd.t | 6 +- perl-package/AI-MXNet/t/test_base.t | 107 ++++++++++++++++++ perl-package/AI-MXNet/t/test_model_parallel.t | 6 +- perl-package/AI-MXNet/t/test_module.t | 8 +- .../AI-MXNet/t/test_multi_device_exec.t | 6 +- perl-package/AI-MXNetCAPI/mxnet.i | 37 ++++++ 23 files changed, 318 insertions(+), 177 deletions(-) create mode 100644 perl-package/AI-MXNet/t/test_base.t diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/AutoGrad.pm b/perl-package/AI-MXNet/lib/AI/MXNet/AutoGrad.pm index b49c0b69c52f..221840e300aa 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/AutoGrad.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/AutoGrad.pm @@ -333,10 +333,10 @@ method grad( ); my @ret; - zip(sub { - my ($handle, $stype) = @_; + for(zip($grad_vars, $grad_stypes)) { + my ($handle, $stype) = @$_; push @ret, AI::MXNet::NDArray->new(handle => $handle, stype => $stype); - }, $grad_vars, $grad_stypes); + } if(blessed $variables) { return $ret[0]; @@ -474,4 +474,4 @@ func _parse_head($heads, $head_grads) return (\@head_handles, \@hgrad_handles); } -1; \ No newline at end of file +1; diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/Base.pm b/perl-package/AI-MXNet/lib/AI/MXNet/Base.pm index a8da8470f574..f748ecbe1f37 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/Base.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/Base.pm @@ -120,12 +120,17 @@ use constant GRAD_REQ_MAP => { sub zip { - my ($sub, @arrays) = @_; - my $len = @{ $arrays[0] }; - for (my $i = 0; $i < $len; $i++) + if('CODE' eq ref $_[0]) { - $sub->(map { $_->[$i] } @arrays); + # continue supporting the callback style + my $code = shift; + $code->(@$_) for AI::MXNetCAPI::py_zip(map { \@$_ } @_); + return; } + # the map() here may seem like a no-op, but triggers overloading or + # whatever else is needed to make array-ish things actually arrays + # before entering the low level list builder. 
+ return AI::MXNetCAPI::py_zip(map { \@$_ } @_); } =head2 enumerate @@ -270,16 +275,14 @@ sub build_param_doc $remove_dup //= 1; my %param_keys; my @param_str; - zip(sub { - my ($key, $type_info, $desc) = @_; - return if exists $param_keys{$key} and $remove_dup; + for(zip($arg_names, $arg_types, $arg_descs)) { + my ($key, $type_info, $desc) = @$_; + next if exists $param_keys{$key} and $remove_dup; $param_keys{$key} = 1; my $ret = sprintf("%s : %s", $key, $type_info); $ret .= "\n ".$desc if length($desc); push @param_str, $ret; - }, - $arg_names, $arg_types, $arg_descs - ); + } return sprintf("Parameters\n----------\n%s\n", join("\n", @param_str)); } diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/Executor/Group.pm b/perl-package/AI-MXNet/lib/AI/MXNet/Executor/Group.pm index 7ac054333c13..acacffde1ee2 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/Executor/Group.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/Executor/Group.pm @@ -57,18 +57,18 @@ func _split_input_slice($batch_size, $work_load_list) # Load a array ref of arrays into a array ref of arrays specified by slices func _load_general($data, $targets, $major_axis) { - zip(sub { - my ($d_src, $d_targets, $axis) = @_; + for(zip($data, $targets, $major_axis)) { + my ($d_src, $d_targets, $axis) = @$_; if(blessed($d_targets) and $d_targets->isa('AI::MXNet::NDarray')) { $d_src->copyto($d_targets); } elsif(ref $d_targets eq 'ARRAY' and blessed $d_targets->[0]) { - zip(sub { - my ($src, $dst) = @_; + for(zip($d_src, $d_targets)) { + my ($src, $dst) = @$_; $src->copyto($dst); - }, $d_src, $d_targets); + } } else { @@ -124,7 +124,7 @@ func _load_general($data, $targets, $major_axis) } } } - }, $data, $targets, $major_axis); + } } # Load data into sliced arrays @@ -144,8 +144,8 @@ func _load_label($batch, $targets, $major_axis) func _merge_multi_context($outputs, $major_axis) { my @rets; - zip(sub { - my ($tensors, $axis) = @_; + for(zip($outputs, $major_axis)) { + my ($tensors, $axis) = @$_; if($axis >= 0) { if(@$tensors == 1) @@ -165,7 +165,7 @@ func _merge_multi_context($outputs, $major_axis) # first one, without checking they are actually the same push @rets, $tensors->[0]; } - }, $outputs, $major_axis); + } return \@rets; } @@ -353,9 +353,9 @@ method decide_slices(ArrayRef[AI::MXNet::DataDesc] $data_shapes) { confess("empty data_shapes array") unless @{ $data_shapes } > 0; my $major_axis = [map { AI::MXNet::DataDesc->get_batch_axis($_->layout) } @{ $data_shapes }]; - zip(sub { - my ($desc, $axis) = @_; - return if($axis == -1); + for(zip($data_shapes, $major_axis)) { + my ($desc, $axis) = @$_; + next if($axis == -1); my $batch_size = $desc->shape->[$axis]; if(defined $self->_p->batch_size) { @@ -370,7 +370,7 @@ method decide_slices(ArrayRef[AI::MXNet::DataDesc] $data_shapes) $self->_p->batch_size($batch_size); $self->_p->slices(AI::MXNet::Executor::Group::_split_input_slice($self->_p->batch_size, $self->workload)); } - }, $data_shapes, $major_axis); + } return $major_axis; } @@ -590,16 +590,16 @@ method set_params(HashRef[AI::MXNet::NDArray] $arg_params, HashRef[AI::MXNet::ND method get_params(HashRef[AI::MXNet::NDArray] $arg_params, HashRef[AI::MXNet::NDArray] $aux_params) { my $weight = 0; - zip(sub { - my ($name, $block) = @_; + for(zip($self->param_names, $self->_p->param_arrays)) { + my ($name, $block) = @$_; my $weight = sum(map { $_->copyto(AI::MXNet::Context->cpu) } @{ $block }) / @{ $block }; $weight->astype($arg_params->{$name}->dtype)->copyto($arg_params->{$name}); - }, $self->param_names, $self->_p->param_arrays); - zip(sub { - 
my ($name, $block) = @_; + } + for(zip($self->_p->aux_names, $self->_p->aux_arrays)) { + my ($name, $block) = @$_; my $weight = sum(map { $_->copyto(AI::MXNet::Context->cpu) } @{ $block }) / @{ $block }; $weight->astype($aux_params->{$name}->dtype)->copyto($aux_params->{$name}); - }, $self->_p->aux_names, $self->_p->aux_arrays); + } } @@ -668,15 +668,15 @@ method get_output_shapes() { my @shapes = map { $_->shape } @{ $self->execs->[0]->outputs }; my @concat_shapes; - zip(sub { - my ($key, $shape, $axis) = @_; + for(zip($self->symbol->list_outputs, \@shapes, $self->_p->output_layouts)) { + my ($key, $shape, $axis) = @$_; my @the_shape = @{ $shape }; if($axis >= 0) { $the_shape[$axis] = $self->_p->batch_size; } push @concat_shapes, AI::MXNet::DataDesc->new(name => $key, shape => \@the_shape); - }, $self->symbol->list_outputs, \@shapes, $self->_p->output_layouts); + } return \@concat_shapes; } @@ -765,11 +765,11 @@ method backward(Maybe[AI::MXNet::NDArray|ArrayRef[AI::MXNet::NDArray]] $out_grad { confess('re-bind with for_training=1 to run backward') unless $self->for_training; $out_grads //= []; - zip(sub { - my ($i, $exec, $islice) = @_; + for(zip([0..@{ $self->_p->execs }-1], $self->_p->execs, $self->_p->slices)) { + my ($i, $exec, $islice) = @$_; my @out_grads_slice; - zip(sub{ - my ($grad, $axis) = @_; + for(zip($out_grads, $self->_p->output_layouts)) { + my ($grad, $axis) = @$_; if($axis >= 0) { my $og_my_slice = $grad->slice_axis({ @@ -783,9 +783,9 @@ method backward(Maybe[AI::MXNet::NDArray|ArrayRef[AI::MXNet::NDArray]] $out_grad { push @out_grads_slice, $grad->copyto($self->contexts->[$i]); } - }, $out_grads, $self->_p->output_layouts); + } $exec->backward(\@out_grads_slice); - }, [0..@{ $self->_p->execs }-1], $self->_p->execs, $self->_p->slices); + } } =head2 update_metric @@ -802,11 +802,11 @@ method backward(Maybe[AI::MXNet::NDArray|ArrayRef[AI::MXNet::NDArray]] $out_grad method update_metric(AI::MXNet::EvalMetric $eval_metric, ArrayRef[AI::MXNet::NDArray] $labels) { - zip(sub { - my ($texec, $islice) = @_; + for(zip($self->_p->execs, $self->_p->slices)) { + my ($texec, $islice) = @$_; my @labels_slice; - zip(sub { - my ($label, $axis) = @_; + for(zip($labels, $self->_p->label_layouts)) { + my ($label, $axis) = @$_; if($axis == 0) { # slicing NDArray along axis 0 can avoid copying @@ -825,9 +825,9 @@ method update_metric(AI::MXNet::EvalMetric $eval_metric, ArrayRef[AI::MXNet::NDA { push @labels_slice, $label; } - }, $labels, $self->_p->label_layouts); + } $eval_metric->update(\@labels_slice, $texec->outputs); - }, $self->_p->execs, $self->_p->slices); + } } method _bind_ith_exec( @@ -874,8 +874,8 @@ method _bind_ith_exec( method _sliced_shape(ArrayRef[AI::MXNet::DataDesc] $shapes, Int $i, ArrayRef[Int] $major_axis) { my @sliced_shapes; - zip(sub { - my ($desc, $axis) = @_; + for(zip($shapes, $major_axis)) { + my ($desc, $axis) = @$_; my @shape = @{ $desc->shape }; if($axis >= 0) { @@ -887,7 +887,7 @@ method _sliced_shape(ArrayRef[AI::MXNet::DataDesc] $shapes, Int $i, ArrayRef[Int dtype => $desc->dtype, layout => $desc->layout ); - }, $shapes, $major_axis); + } return \@sliced_shapes; } diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Block.pm b/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Block.pm index 982822be5dc8..148df0471f2a 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Block.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Block.pm @@ -565,21 +565,21 @@ method infer_shape(@args) my $args = \@args; ($args) = __PACKAGE__->_flatten($args); my %in; - zip(sub { - my 
($i, $j) = @_; + for(zip($inputs, $args)) { + my ($i, $j) = @$_; $in{ $i->name } = $j->shape; - }, $inputs, $args); + } my ($arg_shapes, undef, $aux_shapes) = $out->infer_shape(%in); my %sdict; - zip(sub { - my ($i, $j) = @_; + for(zip($out->list_arguments(), $arg_shapes)) { + my ($i, $j) = @$_; $sdict{ $i } = $j; - }, $out->list_arguments(), $arg_shapes); + } my %aux; - zip(sub { - my ($i, $j) = @_; + for(zip($out->list_auxiliary_states(), $aux_shapes)) { + my ($i, $j) = @$_; $aux{ $i } = $j; - }, $out->list_auxiliary_states(), $aux_shapes); + } %sdict = (%sdict, %aux); for my $i ($self->collect_params->values) { @@ -878,10 +878,10 @@ method forward($x, @args) assert((Data::Dumper::Dumper($in_fmt) eq Data::Dumper::Dumper($self->_in_format)), "Invalid input format"); my $ret = $self->_cached_graph->[1]->deepcopy; my %in; - zip(sub { - my ($k, $v) = @_; + for(zip($self->_cached_graph->[0], $args)) { + my ($k, $v) = @$_; $in{$k->name} = $v; - }, $self->_cached_graph->[0], $args); + } $ret->_compose(%in); $ret = (__PACKAGE__->_regroup($ret, $self->_out_format))[0]; if(ref($ret) eq 'ARRAY' and wantarray) diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Parameter.pm b/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Parameter.pm index 0341fd7e6636..d241aa196a96 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Parameter.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Parameter.pm @@ -194,8 +194,8 @@ method _load_init($data, $ctx) { if($self->shape) { - zip(sub { - my ($i, $j) = @_; + for(zip($self->shape, $data->shape)) { + my ($i, $j) = @$_; assert( ($i == 0 or $i == $j), sprintf( @@ -204,7 +204,7 @@ method _load_init($data, $ctx) $self->name, "@{$self->shape}", "@{$data->shape}" ) ); - }, $self->shape, $data->shape); + } } if($self->dtype) { @@ -923,4 +923,4 @@ method load( } } -1; \ No newline at end of file +1; diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/RNN/Cell.pm b/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/RNN/Cell.pm index d2e7db280aaa..a3fb3c51a147 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/RNN/Cell.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/RNN/Cell.pm @@ -1047,10 +1047,10 @@ method hybrid_forward(GluonClass $F, GluonInput $inputs, GluonInput $states) if($p_states != 0) { my @tmp; - zip(sub { - my ($new_s, $old_s) = @_; + for(zip($next_states, $states)) { + my ($new_s, $old_s) = @$_; push @tmp, $F->where($mask->($p_states, $new_s), $new_s, $old_s); - }, $next_states, $states); + } $states = \@tmp; } else @@ -1109,10 +1109,10 @@ method unroll(Int $length, GluonInput $inputs, Maybe[GluonInput] :$begin_state=, else { my @tmp; - zip(sub { - my ($i, $j) = @_; + for(zip($outputs, $inputs)) { + my ($i, $j) = @$_; push @tmp, $F->elemwise_add($i, $j); - }, $outputs, $inputs); + } $outputs = \@tmp; } return ($outputs, $states); @@ -1222,4 +1222,4 @@ method unroll(Int $length, GluonInput $inputs, Maybe[GluonInput] :$begin_state=, __PACKAGE__->register('AI::MXNet::Gluon::RNN'); -1; \ No newline at end of file +1; diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/RNN/Layer.pm b/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/RNN/Layer.pm index fa850e62a76a..2b6e8a5bdae4 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/RNN/Layer.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/RNN/Layer.pm @@ -230,14 +230,14 @@ method forward(GluonInput $inputs, Maybe[GluonInput] $states=) { $states = [$states]; } - zip(sub { - my ($state, $info) = @_; + for(zip($states, $self->state_info($batch_size))) { + my ($state, $info) = @$_; if(Dumper($state->shape) ne 
Dumper($info->{shape})) { my @state_shape = @{ $state->shape }; confess("Invalid recurrent state shape. Expecting @{$info->{shape}}, got @state_shape."); } - }, $states, $self->state_info($batch_size)); + } if($self->input_size == 0) { for my $i (0..$self->dir-1) diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Trainer.pm b/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Trainer.pm index 405c6d29aa38..63f521c5c699 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Trainer.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Trainer.pm @@ -231,14 +231,14 @@ method step(Int $batch_size, Bool $ignore_stale_grad=0) $self->_kv_store->pull($i, out => $param->list_grad, priority => -$i); } } - zip(sub { - my ($upd, $arr, $grad) = @_; + for(zip($self->_updaters, $param->list_data, $param->list_grad)) { + my ($upd, $arr, $grad) = @$_; if(not $ignore_stale_grad or $arr->_fresh_grad) { $upd->($i, $grad, $arr); $arr->_fresh_grad(0); } - }, $self->_updaters, $param->list_data, $param->list_grad); + } }, $self->_params); } @@ -331,4 +331,4 @@ method load_states(Str $fname) } } -1; \ No newline at end of file +1; diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Utils.pm b/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Utils.pm index eee3cb5a907b..6acb66237195 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Utils.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/Gluon/Utils.pm @@ -163,10 +163,10 @@ method split_and_load( } my $slices = __PACKAGE__->split_data($data, scalar(@$ctx_list), $batch_axis, $even_split); my @ret; - zip(sub { - my ($i, $ctx) = @_; + for(zip($slices, $ctx_list)) { + my ($i, $ctx) = @$_; push @ret, $i->as_in_context($ctx); - }, $slices, $ctx_list); + } return \@ret; } @@ -277,4 +277,4 @@ func download(Str $url, Maybe[Str] :$path=, Bool :$overwrite=0, Maybe[Str] :$sha return $fname } -1; \ No newline at end of file +1; diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/KVStore.pm b/perl-package/AI-MXNet/lib/AI/MXNet/KVStore.pm index 4410eb3d7a7a..84a890dcc908 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/KVStore.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/KVStore.pm @@ -481,12 +481,12 @@ sub _key_value assert(not blessed($vals) and @$keys == @$vals); my @c_keys; my @c_vals; - zip(sub { - my ($key, $val) = @_; + for(zip($keys, $vals)) { + my ($key, $val) = @$_; my ($c_key, $c_val) = _key_value($key, $val); push @c_keys, @$c_key; push @c_vals, @$c_val; - }, $keys, $vals); + } return (\@c_keys, \@c_vals); } } diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/Metric.pm b/perl-package/AI-MXNet/lib/AI/MXNet/Metric.pm index a6b440be6eb3..3b9345d8baf9 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/Metric.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/Metric.pm @@ -241,8 +241,8 @@ has '+name' => (default => 'accuracy'); method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] $preds) { AI::MXNet::Metric::check_label_shapes($labels, $preds); - zip(sub { - my ($label, $pred_label) = @_; + for(zip($labels, $preds)) { + my ($label, $pred_label) = @$_; if(join(',', @{$pred_label->shape}) ne join(',', @{$label->shape})) { $pred_label = AI::MXNet::NDArray->argmax_channel($pred_label); @@ -251,7 +251,7 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] my $sum = ($pred_label->aspdl->flat == $label->aspdl->flat)->sum; $self->sum_metric($self->sum_metric + $sum); $self->num_inst($self->num_inst + $pred_label->size); - }, $labels, $preds); + } } package AI::MXNet::TopKAccuracy; @@ -274,8 +274,8 @@ sub BUILD method update(ArrayRef[AI::MXNet::NDArray] 
$labels, ArrayRef[AI::MXNet::NDArray] $preds) { AI::MXNet::Metric::check_label_shapes($labels, $preds); - zip(sub { - my ($label, $pred_label) = @_; + for(zip($labels, $preds)) { + my ($label, $pred_label) = @$_; confess('Predictions should be no more than 2 dims') unless @{ $pred_label->shape } <= 2; $pred_label = $pred_label->aspdl->qsorti; @@ -299,7 +299,7 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] } } $self->num_inst($self->num_inst + $num_samples); - }, $labels, $preds); + } } # Calculate the F1 score of a binary classification problem. @@ -312,16 +312,16 @@ has '+name' => (default => 'f1'); method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] $preds) { AI::MXNet::Metric::check_label_shapes($labels, $preds); - zip(sub { - my ($label, $pred_label) = @_; + for(zip($labels, $preds)) { + my ($label, $pred_label) = @$_; AI::MXNet::Metric::check_label_shapes($label, $pred_label); $pred_label = $pred_label->aspdl->maximum_ind; $label = $label->astype('int32')->aspdl; confess("F1 currently only supports binary classification.") if $label->uniq->shape->at(0) > 2; my ($true_positives, $false_positives, $false_negatives) = (0,0,0); - zip(sub{ - my ($y_pred, $y_true) = @_; + for(zip($pred_label->unpdl, $label->unpdl)) { + my ($y_pred, $y_true) = @$_; if($y_pred == 1 and $y_true == 1) { $true_positives += 1; @@ -334,7 +334,7 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] { $false_negatives += 1; } - }, $pred_label->unpdl, $label->unpdl); + } my $precision; my $recall; if($true_positives + $false_positives > 0) @@ -364,7 +364,7 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] } $self->sum_metric($self->sum_metric + $f1_score); $self->num_inst($self->num_inst + 1); - }, $labels, $preds); + } } package AI::MXNet::Perplexity; @@ -408,8 +408,8 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] { AI::MXNet::Metric::check_label_shapes($labels, $preds); my ($loss, $num) = (0, 0); - zip(sub { - my ($label, $pred) = @_; + for(zip($labels, $preds)) { + my ($label, $pred) = @$_; my $label_shape = $label->shape; my $pred_shape = $pred->shape; assert( @@ -426,7 +426,7 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] } $loss -= $pred->maximum(1e-10)->log->sum->asscalar; $num += $pred->size; - }, $labels, $preds); + } $self->sum_metric($self->sum_metric + $loss); $self->num_inst($self->num_inst + $num); } @@ -450,8 +450,8 @@ has '+name' => (default => 'mae'); method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] $preds) { AI::MXNet::Metric::check_label_shapes($labels, $preds); - zip(sub { - my ($label, $pred) = @_; + for(zip($labels, $preds)) { + my ($label, $pred) = @$_; $label = $label->aspdl; $pred = $pred->aspdl; if($label->ndims == 1) @@ -460,7 +460,7 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] } $self->sum_metric($self->sum_metric + ($label - $pred)->abs->avg); $self->num_inst($self->num_inst + 1); - }, $labels, $preds); + } } # Calculate Mean Squared Error loss @@ -473,8 +473,8 @@ has '+name' => (default => 'mse'); method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] $preds) { AI::MXNet::Metric::check_label_shapes($labels, $preds); - zip(sub { - my ($label, $pred) = @_; + for(zip($labels, $preds)) { + my ($label, $pred) = @$_; $label = $label->aspdl; $pred = $pred->aspdl; if($label->ndims == 1) @@ -483,7 
+483,7 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] } $self->sum_metric($self->sum_metric + (($label - $pred)**2)->avg); $self->num_inst($self->num_inst + 1); - }, $labels, $preds); + } } # Calculate Root Mean Squred Error loss @@ -496,8 +496,8 @@ has '+name' => (default => 'rmse'); method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] $preds) { AI::MXNet::Metric::check_label_shapes($labels, $preds); - zip(sub { - my ($label, $pred) = @_; + for(zip($labels, $preds)) { + my ($label, $pred) = @$_; $label = $label->aspdl; $pred = $pred->aspdl; if($label->ndims == 1) @@ -506,7 +506,7 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] } $self->sum_metric($self->sum_metric + sqrt((($label - $pred)**2)->avg)); $self->num_inst($self->num_inst + 1); - }, $labels, $preds); + } } # Calculate Cross Entropy loss @@ -521,8 +521,8 @@ method python_constructor_arguments() { ['eps'] } method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] $preds) { AI::MXNet::Metric::check_label_shapes($labels, $preds); - zip(sub { - my ($label, $pred) = @_; + for(zip($labels, $preds)) { + my ($label, $pred) = @$_; $label = $label->aspdl->flat; $pred = $pred->aspdl; my $label_shape = $label->shape->at(0); @@ -534,7 +534,7 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] my $prob = $pred->index($label); $self->sum_metric($self->sum_metric + (-($prob + $self->eps)->log)->sum); $self->num_inst($self->num_inst + $label_shape); - }, $labels, $preds); + } } package AI::MXNet::PearsonCorrelation; @@ -570,8 +570,8 @@ has '+name' => (default => 'pearson-correlation'); method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] $preds) { AI::MXNet::Metric::check_label_shapes($labels, $preds); - zip(sub { - my ($label, $pred) = @_; + for(zip($labels, $preds)) { + my ($label, $pred) = @$_; AI::MXNet::Metric::check_label_shapes($label, $pred); $label = $label->aspdl->flat; $pred = $pred->aspdl->flat; @@ -583,7 +583,7 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] ((($label-$label_mean)*($pred-$pred_mean))->sum/$label->nelem)/(($label_stdv*$pred_stdv)->at(0)) ); $self->num_inst($self->num_inst + 1); - }, $labels, $preds); + } } package AI::MXNet::Loss; @@ -749,8 +749,8 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] { AI::MXNet::Metric::check_label_shapes($labels, $preds) unless $self->allow_extra_outputs; - zip(sub { - my ($label, $pred) = @_; + for(zip($labels, $preds)) { + my ($label, $pred) = @$_; $label = $label->aspdl; $pred = $pred->aspdl; my $value = $self->eval_function->($label, $pred); @@ -758,7 +758,7 @@ method update(ArrayRef[AI::MXNet::NDArray] $labels, ArrayRef[AI::MXNet::NDArray] my $num_inst = ref $value ? 
$value->[1] : 1; $self->sum_metric($self->sum_metric + $sum_metric); $self->num_inst($self->num_inst + $num_inst); - }, $labels, $preds); + } } package AI::MXNet::Metric; diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/Module.pm b/perl-package/AI-MXNet/lib/AI/MXNet/Module.pm index a1aa1b2f9769..3229d22597d0 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/Module.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/Module.pm @@ -809,12 +809,12 @@ method forward( else { $new_dshape = []; - zip(sub { - my ($i, $shape) = @_; + for(zip($self->data_shapes, \@new_data_shapes)) { + my ($i, $shape) = @$_; push @{ $new_dshape }, AI::MXNet::DataDesc->new( $i->name, $shape, $i->dtype, $i->layout ); - }, $self->data_shapes, \@new_data_shapes); + } } my $new_lshape; if($data_batch->can('provide_label') and $data_batch->provide_label) @@ -824,12 +824,12 @@ method forward( elsif($data_batch->can('label') and $data_batch->label) { $new_lshape = []; - zip(sub { - my ($i, $j) = @_; + for(zip($self->label_shapes, $data_batch->label)) { + my ($i, $j) = @$_; push @{ $new_lshape }, AI::MXNet::DataDesc->new( $i->name, $j->shape, $i->dtype, $i->layout ); - }, $self->label_shapes, $data_batch->label); + } } $self->reshape(data_shapes => $new_dshape, label_shapes => $new_lshape); } diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/Monitor.pm b/perl-package/AI-MXNet/lib/AI/MXNet/Monitor.pm index 993461713cb6..386164112e65 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/Monitor.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/Monitor.pm @@ -145,14 +145,14 @@ method toc() } for my $exe (@{ $self->exes }) { - zip(sub { - my ($name, $array) = @_; + for(zip($exe->_symbol->list_arguments, $exe->arg_arrays)) { + my ($name, $array) = @$_; push @{ $self->queue }, [$self->step, $name, $self->stat_func->($array)]; - }, $exe->_symbol->list_arguments, $exe->arg_arrays); - zip(sub { - my ($name, $array) = @_; + } + for(zip($exe->_symbol->list_auxiliary_states, $exe->aux_arrays)) { + my ($name, $array) = @$_; push @{ $self->queue }, [$self->step, $name, $self->stat_func->($array)]; - }, $exe->_symbol->list_auxiliary_states, $exe->aux_arrays); + } } $self->activated(0); my @res; diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/NDArray.pm b/perl-package/AI-MXNet/lib/AI/MXNet/NDArray.pm index 7193f526b892..ffee1295d0db 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/NDArray.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/NDArray.pm @@ -96,12 +96,12 @@ method at(Index @indices) or full crop") if $isize > 1 and $dsize != $isize; my $i = 0; - zip(sub { - my ($idx, $dim_size) = @_; + for(zip(\@indices, $shape)) { + my ($idx, $dim_size) = @$_; confess("Dimension $i mismatch Idx: $idx >= Dim Size: $dim_size") if $idx >= $dim_size or ($idx + $dim_size) < 0; ++$i; - }, \@indices, $shape); + } $i = 0; for my $v (@indices) { @@ -151,8 +151,8 @@ method slice(Slice|AdvancedSlice @slices) ++$i; ref $_ ? (@$_ == 1 ? [$_->[0], $_->[0]] : $_) : ($_ eq 'X' ? 
[0, $shape->[$i] - 1] : [$_, $_]); } @slices; - zip(sub { - my ($slice, $dim_size) = @_; + for(zip(\@slices, $shape)) { + my ($slice, $dim_size) = @$_; my ($begin, $end, $stride) = @$slice; confess("NDArray does not support slice strides != 1") if ($stride//0) > 1; @@ -160,7 +160,7 @@ method slice(Slice|AdvancedSlice @slices) if $begin >= $dim_size or ($begin + $dim_size) < 0; confess("Dimension $i mismatch slice end : $end >= Dim Size: $dim_size") if $end >= $dim_size or ($end + $dim_size) < 0; - }, \@slices, $shape); + } $i = 0; my ($begin, $end) = ([], []); for my $s (@slices) diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/NDArray/Slice.pm b/perl-package/AI-MXNet/lib/AI/MXNet/NDArray/Slice.pm index ea49ac5960a4..1a3ea7e0a460 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/NDArray/Slice.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/NDArray/Slice.pm @@ -55,12 +55,10 @@ method set(AcceptableInput $value, $reverse=) { confess("set value must be defined") unless defined $value; confess("${\ $self->parent } is not writable") unless $self->parent->writable; - my $shape = []; - zip( - sub { my ($begin, $end) = @_; push @$shape, ($end-$begin); }, - $self->begin, - $self->end - ); + my $shape = [ map { + my($begin, $end) = @$_; + ($end-$begin); + } zip($self->begin, $self->end) ]; if(ref $value) { if(blessed($value) and $value->isa('AI::MXNet::NDArray')) @@ -77,15 +75,11 @@ method set(AcceptableInput $value, $reverse=) } confess("value $value does not match slice dim sizes [@$shape]") if @{$value->shape} != @$shape; - zip( - sub { - my ($dsize, $vdsize) = @_; + for(zip($shape, $value->shape)) { + my ($dsize, $vdsize) = @$_; confess("Slice [@$shape] != $value given as value") if $dsize != $vdsize; - }, - $shape, - $value->shape - ); + } AI::MXNet::NDArray->_crop_assign( $self->parent, $value, diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/RNN/Cell.pm b/perl-package/AI-MXNet/lib/AI/MXNet/RNN/Cell.pm index 38db4090556e..f2d8b5369e99 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/RNN/Cell.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/RNN/Cell.pm @@ -1412,15 +1412,15 @@ method unroll( $r_outputs = [reverse(@{ $r_outputs })]; } my $outputs = []; - zip(sub { - my ($i, $l_o, $r_o) = @_; + for(zip([0..@{ $l_outputs }-1], [@{ $l_outputs }], [@{ $r_outputs }])) { + my ($i, $l_o, $r_o) = @$_; push @$outputs, AI::MXNet::Symbol->Concat( $l_o, $r_o, dim=>(1+($merge_outputs?1:0)), name => $merge_outputs ? sprintf('%sout', $self->_output_prefix) : sprintf('%st%d', $self->_output_prefix, $i) ); - }, [0..@{ $l_outputs }-1], [@{ $l_outputs }], [@{ $r_outputs }]); + } if($merge_outputs) { $outputs = @{ $outputs }[0]; @@ -1907,14 +1907,14 @@ method call(AI::MXNet::Symbol $inputs, SymbolOrArrayOfSymbols $states) my @states; if($p_states != 0) { - zip(sub { - my ($new_s, $old_s) = @_; + for(zip($next_states, $states)) { + my ($new_s, $old_s) = @$_; push @states, AI::MXNet::Symbol->where( $mask->($p_states, $new_s), $new_s, $old_s ); - }, $next_states, $states); + } } $self->prev_output($output); return ($output, @states ? 
\@states : $next_states); @@ -1968,11 +1968,11 @@ method unroll( else { my @temp; - zip(sub { - my ($output_sym, $input_sym) = @_; + for(zip([@{ $outputs }], [@{ $inputs }])) { + my ($output_sym, $input_sym) = @$_; push @temp, AI::MXNet::Symbol->elemwise_add($output_sym, $input_sym, name=>$output_sym->name."_plus_residual"); - }, [@{ $outputs }], [@{ $inputs }]); + } $outputs = \@temp; } return ($outputs, $states); diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/Symbol.pm b/perl-package/AI-MXNet/lib/AI/MXNet/Symbol.pm index d35bdaea62cd..8fd885a1d2c8 100644 --- a/perl-package/AI-MXNet/lib/AI/MXNet/Symbol.pm +++ b/perl-package/AI-MXNet/lib/AI/MXNet/Symbol.pm @@ -585,8 +585,8 @@ method infer_shape(Maybe[Str|Shape] @args) my ($arg_shapes) = $self->_infer_shape_impl(1, @args); my $arg_names = $self->list_arguments; my @unknowns; - zip(sub { - my ($name, $shape) = @_; + for(zip($arg_names, $arg_shapes)) { + my ($name, $shape) = @$_; if(not ref $shape or not @$shape or not product(@$shape)) { if(@unknowns >= 10) @@ -599,7 +599,7 @@ method infer_shape(Maybe[Str|Shape] @args) push @unknowns, "$name @shape"; } } - }, $arg_names, $arg_shapes); + } AI::MXNet::Logging->warning( "Cannot decide shape for the following arguments " ."(0s in shape means unknown dimensions). " diff --git a/perl-package/AI-MXNet/t/test_autograd.t b/perl-package/AI-MXNet/t/test_autograd.t index 32225bfd2728..b45d233d79a0 100644 --- a/perl-package/AI-MXNet/t/test_autograd.t +++ b/perl-package/AI-MXNet/t/test_autograd.t @@ -37,10 +37,10 @@ sub autograd_assert ok(same($output->aspdl, $res->aspdl)); my $grad_res = $grad_f->(@args); ok(@$grad_vals == @$grad_res); - zip(sub { - my ($a, $b) = @_; + for(zip($grad_vals, $grad_res)) { + my ($a, $b) = @$_; ok(same($a->aspdl, $b->aspdl)); - }, $grad_vals, $grad_res); + } } sub test_unary_func diff --git a/perl-package/AI-MXNet/t/test_base.t b/perl-package/AI-MXNet/t/test_base.t new file mode 100644 index 000000000000..ea0bd0ef98f3 --- /dev/null +++ b/perl-package/AI-MXNet/t/test_base.t @@ -0,0 +1,107 @@ +use strict; +use warnings; +use Test::More; +use AI::MXNet qw(mx); + +sub test_builtin_zip() +{ + is_deeply( + [ AI::MXNet::zip([ 0 .. 9 ], [ 10 .. 19 ]) ], + [ map { [ $_, 10 + $_ ] } 0 .. 9 ]); + is_deeply( + [ AI::MXNet::zip([ 0 .. 9 ], [ 10 .. 19 ], [ 20 .. 29 ]) ], + [ map { [ $_, 10 + $_, 20 + $_ ] } 0 .. 9 ]); + my $over = ListOverload->new(10 .. 19); + is_deeply( + [ AI::MXNet::zip([ 0 .. 9 ], \@$over) ], + [ map { [ $_, 10 + $_ ] } 0 .. 9 ]); + my $tied = ListTied->new(10 .. 19); + is_deeply( + [ AI::MXNet::zip([ 0 .. 9 ], \@$tied) ], + [ map { [ $_, 10 + $_ ] } 0 .. 
9 ]); +} + + +test_builtin_zip(); +done_testing(); + +package ListTied { + sub new { + my($class, @list) = @_; + my @tied; + tie @tied, $class, @list; + return \@tied; + } + sub TIEARRAY { + my($class, @list) = @_; + return bless { list => \@list }, $class; + } + sub FETCH { + my($self, $index) = @_; + return $self->{list}[$index]; + } + sub STORE { + my($self, $index, $value) = @_; + return $self->{list}[$index] = $value; + } + sub FETCHSIZE { + my($self) = @_; + return scalar @{$self->{list}}; + } + sub STORESIZE { + my($self, $count) = @_; + return $self->{list}[$count - 1] //= undef; + } + sub EXTEND { + my($self, $count) = @_; + return $self->STORESIZE($count); + } + sub EXISTS { + my($self, $key) = @_; + return exists $self->{list}[$key]; + } + sub DELETE { + my($self, $key) = @_; + return delete $self->{list}[$key]; + } + sub CLEAR { + my($self) = @_; + return @{$self->{list}} = (); + } + sub PUSH { + my($self, @list) = @_; + return push @{$self->{list}}, @list; + } + sub POP { + my($self) = @_; + return pop @{$self->{list}}; + } + sub SHIFT { + my($self) = @_; + return shift @{$self->{list}}; + } + sub UNSHIFT { + my($self, @list) = @_; + return unshift @{$self->{list}}, @list; + } + sub SPLICE { + my($self, $offset, $length, @list) = @_; + return splice @{$self->{list}}, $offset, $length, @list; + } + sub UNTIE { + my($self) = @_; + } + sub DESTROY { + my($self) = @_; + } +} + +package ListOverload { + use overload '@{}' => \&as_list; + sub new { + my($class, @list) = @_; + return bless { list => \@list }, $class; + } + sub as_list { return $_[0]{list} } +} + diff --git a/perl-package/AI-MXNet/t/test_model_parallel.t b/perl-package/AI-MXNet/t/test_model_parallel.t index 6a8aba7aab06..76fe25625be3 100644 --- a/perl-package/AI-MXNet/t/test_model_parallel.t +++ b/perl-package/AI-MXNet/t/test_model_parallel.t @@ -65,10 +65,10 @@ sub test_chain $out_grad .= 1; $exec1->backward([$out_grad]); $exec2->backward([$out_grad->copyto($ctx1)]); - zip(sub { - my ($a, $b) = @_; + for(zip($arr_grad, $arr_grad2)) { + my ($a, $b) = @$_; ok(reldiff($a->aspdl, $b->aspdl) < 1e-6); - }, $arr_grad, $arr_grad2); + } } test_chain(); diff --git a/perl-package/AI-MXNet/t/test_module.t b/perl-package/AI-MXNet/t/test_module.t index 7c5690a68b15..305b232a7222 100644 --- a/perl-package/AI-MXNet/t/test_module.t +++ b/perl-package/AI-MXNet/t/test_module.t @@ -148,10 +148,10 @@ sub test_module_states $mod->forward($batch); my $out2 = $mod->get_outputs(1); - zip(sub { - my ($x1, $x2) = @_; + for(zip($out1, $out2)) { + my ($x1, $x2) = @$_; ok(not almost_equal($x1->aspdl, $x2->aspdl, 1e-3)); - }, $out1, $out2); + } } sub test_module_switch_bucket @@ -619,4 +619,4 @@ test_module_reshape(); test_save_load(); test_executor_group(); test_module_set_params(); -test_forward_reshape(); \ No newline at end of file +test_forward_reshape(); diff --git a/perl-package/AI-MXNet/t/test_multi_device_exec.t b/perl-package/AI-MXNet/t/test_multi_device_exec.t index 87ca25778c92..15111a7a5d80 100644 --- a/perl-package/AI-MXNet/t/test_multi_device_exec.t +++ b/perl-package/AI-MXNet/t/test_multi_device_exec.t @@ -41,8 +41,8 @@ sub test_ctx_group shapes => { data => [1,200] } ); - zip(sub { - my ($arr, $name) = @_; + for(zip($texec->arg_arrays, $mlp->list_arguments())) { + my ($arr, $name) = @$_; if(exists $set_stage1{ $name }) { ok($arr->context == $group2ctx->{stage1}); @@ -51,7 +51,7 @@ sub test_ctx_group { ok($arr->context == $group2ctx->{stage2}); } - }, $texec->arg_arrays, $mlp->list_arguments()); + } } test_ctx_group(); diff --git 
a/perl-package/AI-MXNetCAPI/mxnet.i b/perl-package/AI-MXNetCAPI/mxnet.i index e466e98b7842..663a0c285f0b 100644 --- a/perl-package/AI-MXNetCAPI/mxnet.i +++ b/perl-package/AI-MXNetCAPI/mxnet.i @@ -106,7 +106,44 @@ static void ExecutorMonitor_callback(const char* name, NDArrayHandle handle, voi %} +%{ + +/* this is an adaptation of Python/bltinmodule.c's builtin_zip() */ +XS(py_zip) { + dXSARGS; + I32 i; + I32 len = -1; + AV *l[items]; + + for(i = 0; i < items; i++) { + AV *av = (AV *)SvRV(ST(i)); + I32 thislen; + + if(SvTYPE(av) != SVt_PVAV) + croak("zip argument#%d must be an array", i); + thislen = av_len(av) + 1; + if(len < 0 || thislen < len) + len = thislen; + l[i] = av; + } + EXTEND(SP, len); + for(i = 0; i < len; i++) { + I32 j; + SV *next[items]; + + for(j = 0; j < items; j++) { + SV **sv = av_fetch(l[j], i, 0); + next[j] = sv ? *sv : &PL_sv_undef; + } + ST(i) = sv_2mortal(newRV_noinc((SV *)av_make(items, next))); + } + XSRETURN(len); +} + +%} + %init %{ + newXS(SWIG_prefix "py_zip", py_zip, (char *)__FILE__); /* These SWIG_TypeClientData() calls might break in the future, but * %rename should work on these types before that happens. */ SWIG_TypeClientData(SWIGTYPE_p_MXNDArray, (void *)"NDArrayHandle"); From fa80a318620a5e9ceb80c2d9199b76079f2edd1d Mon Sep 17 00:00:00 2001 From: Sheng Zha Date: Sat, 21 Oct 2017 14:32:36 -0700 Subject: [PATCH 22/23] add profile option for frontend profiling to image script (#8171) * add profile option for frontend profiling to image script * Update image_classification.py * Update image_classification.py --- example/gluon/image_classification.py | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/example/gluon/image_classification.py b/example/gluon/image_classification.py index 8481afb50c1a..a67da3534135 100644 --- a/example/gluon/image_classification.py +++ b/example/gluon/image_classification.py @@ -64,6 +64,9 @@ parser.add_argument('--kvstore', type=str, default='device', help='kvstore to use for trainer/module.') parser.add_argument('--log-interval', type=int, default=50, help='Number of batches to wait before logging.') +parser.add_argument('--profile', action='store_true', + help='Option to turn on memory profiling for front-end, '\ + 'and prints out the memory usage by python function at the end.') opt = parser.parse_args() logging.info(opt) @@ -166,7 +169,7 @@ def train(epochs, ctx): net.save_params('image-classifier-%s-%d.params'%(opt.model, epochs)) -if __name__ == '__main__': +def main(): if opt.mode == 'symbolic': data = mx.sym.var('data') out = net(data) @@ -186,3 +189,16 @@ def train(epochs, ctx): if opt.mode == 'hybrid': net.hybridize() train(opt.epochs, context) + +if __name__ == '__main__': + if opt.profile: + import hotshot, hotshot.stats + prof = hotshot.Profile('image-classifier-%s-%s.prof'%(opt.model, opt.mode)) + prof.runcall(main) + prof.close() + stats = hotshot.stats.load('image-classifier-%s-%s.prof'%(opt.model, opt.mode)) + stats.strip_dirs() + stats.sort_stats('cumtime', 'calls') + stats.print_stats() + else: + main() From 97954619c8cfcd782f64ae1645a5fa403f30a795 Mon Sep 17 00:00:00 2001 From: jb Date: Sat, 21 Oct 2017 19:38:16 -0400 Subject: [PATCH 23/23] Fix Typo (classification) (#8376) Fix a typo in the example readme. 
--- example/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/example/README.md b/example/README.md index 12ada4d0ceef..507b144ad607 100644 --- a/example/README.md +++ b/example/README.md @@ -53,7 +53,7 @@ If you want to contribute to this list and the examples, please open a new pull * [Fast R-CNN](https://github.com/precedenceguo/mx-rcnn) by [Jian Guo](https://github.com/precedenceguo) * "End2End Captcha Recognition (OCR)" by [xlvector](https://github.com/xlvector) [github link](https://github.com/xlvector/learning-dl/tree/master/mxnet/ocr) [Blog in Chinese](http://blog.xlvector.net/2016-05/mxnet-ocr-cnn/) * "Prediction step of xlvector's lstm ocr" by [melody-rain](https://github.com/melody-rain) [github link](https://github.com/melody-rain/mxnet/commit/46002e31fc34c746c01bcaa7ade999187068ad3c) [Blog in Chinese](https://zhuanlan.zhihu.com/p/22698511) -* "Solving classificiation + regression with MXnet in Multi Input + Multi Obj" by [xlvector](https://github.com/xlvector) [github link](https://gist.github.com/xlvector/c304d74f9dd6a3b68a3387985482baac) [Blog in Chinese](http://blog.xlvector.net/2016-05/mxnet-regression-classification-for-concret-continuous-features/) +* "Solving classification + regression with MXnet in Multi Input + Multi Obj" by [xlvector](https://github.com/xlvector) [github link](https://gist.github.com/xlvector/c304d74f9dd6a3b68a3387985482baac) [Blog in Chinese](http://blog.xlvector.net/2016-05/mxnet-regression-classification-for-concret-continuous-features/) * "Learn to sort by LSTM" by [xlvector](https://github.com/xlvector) [github link](https://github.com/xlvector/learning-dl/tree/master/mxnet/lstm_sort) [Blog in Chinese](http://blog.xlvector.net/2016-05/mxnet-lstm-example/) * [Neural Art using extremely lightweight (<500K) neural network](https://github.com/pavelgonchar/neural-art-mini) Lightweight version of mxnet neural art implementation by [Pavel Gonchar](https://github.com/pavelgonchar) * [Neural Art with generative networks](https://github.com/zhaw/neural_style) by [zhaw](https://github.com/zhaw)
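
For reference, the recurring change across the Perl diffs above swaps the old callback-style zip() for a list-returning AI::MXNet::zip(), which the new t/test_base.t exercises and which the py_zip XS helper in mxnet.i appears to back. The short sketch below is illustrative only: the @names and @values data are made up, and it assumes AI::MXNet is installed so that AI::MXNet::zip() is callable, takes plain array references, and returns a list of array references truncated to the shortest argument, as the XS code and t/test_base.t above suggest.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use AI::MXNet qw(mx);

    # Hypothetical data, for illustration only.
    my @names  = qw(weight bias gamma);
    my @values = (0.1, 0.2, 0.3, 0.4);    # deliberately longer than @names

    # Old idiom (before this series): zip() took a callback plus the lists.
    #   zip(sub {
    #       my ($name, $value) = @_;
    #       print "$name => $value\n";
    #   }, \@names, \@values);

    # New idiom: AI::MXNet::zip() returns a list of array refs, so the body
    # becomes a plain for loop and each element is unpacked with @$_.
    for (AI::MXNet::zip(\@names, \@values))
    {
        my ($name, $value) = @$_;
        print "$name => $value\n";
    }

    # Only three pairs come out here, because the XS helper stops at the
    # shortest argument. Arguments must be plain array refs; blessed or tied
    # containers are passed as \@$obj, as t/test_base.t does.

The same mechanical rewrite, dropping the sub { ... } wrapper, looping over zip's return value, and unpacking each element with @$_, is what the Metric, Module, Monitor, NDArray, Slice, RNN::Cell, Symbol, and test changes above apply.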