From 9974ab05898a76d3bf6cb28a03a95908d2d9248c Mon Sep 17 00:00:00 2001 From: Bartosz Kuncer Date: Thu, 14 Oct 2021 16:40:09 +0200 Subject: [PATCH 1/7] Remove MXNET_SUBGRAPH_BACKEND environment variable --- .../performance/backend/dnnl/dnnl_readme.md | 12 +++------- .../api/cpp/docs/tutorials/subgraphAPI.md | 8 +------ docs/static_site/src/pages/api/faq/env_var.md | 6 ----- docs/static_site/src/pages/api/faq/perf.md | 1 - tests/python/dnnl/test_quantization_dnnl.py | 2 -- tests/python/unittest/test_subgraph_op.py | 22 +++++++------------ 6 files changed, 12 insertions(+), 39 deletions(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index e68dc53a780b..d9cee98d91cf 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -276,17 +276,11 @@ MKL_VERBOSE Intel(R) MKL 2019.0 Update 3 Product build 20190125 for Intel(R) 64 MKL_VERBOSE SGEMM(T,N,12,10,8,0x7f7f927b1378,0x1bc2140,8,0x1ba8040,8,0x7f7f927b1380,0x7f7f7400a280,12) 8.93ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40 WDiv:HOST:+0.000 ``` -

-<h2>Enable graph optimization</h2>
+<h2>Graph optimization</h2>

-Graph optimization with subgraph is available and enabled by default in master branch. For MXNet release v1.5, you can manually enable it by: +Limitations of this experimental feature are: -``` -export MXNET_SUBGRAPH_BACKEND=ONEDNN -``` - -This limitations of this experimental feature are: - -- Use this feature only for inference. When training, be sure to turn the feature off by unsetting the `MXNET_SUBGRAPH_BACKEND` environment variable. +- Use this feature only for inference. - This feature will only run on the CPU, even if you're using a GPU-enabled build of MXNet. diff --git a/docs/static_site/src/pages/api/cpp/docs/tutorials/subgraphAPI.md b/docs/static_site/src/pages/api/cpp/docs/tutorials/subgraphAPI.md index 2b85a9db77af..b8c81bdf5220 100644 --- a/docs/static_site/src/pages/api/cpp/docs/tutorials/subgraphAPI.md +++ b/docs/static_site/src/pages/api/cpp/docs/tutorials/subgraphAPI.md @@ -151,13 +151,7 @@ MXNET_REGISTER_SUBGRAPH_PROPERTY(SgTest, SgProperty2); // Execution order 2. MXNET_REGISTER_SUBGRAPH_PROPERTY(SgTest, SgProperty3); // Execution order 3. ``` -After compiling this subgraph mechanism into MXNet, we can use the environment variable `MXNET_SUBGRAPH_BACKEND` to activate it during symbol bind. - -```bash -export MXNET_SUBGRAPH_BACKEND=SgTest -``` - -Or you can use python symbol API `get_backend_symbol` to run all properties registered for this backend and get returned symbol. +After compiling this subgraph mechanism into MXNet you can use python symbol API `get_backend_symbol` to run all properties registered for this backend and get returned symbol. ```python sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch) diff --git a/docs/static_site/src/pages/api/faq/env_var.md b/docs/static_site/src/pages/api/faq/env_var.md index eed6cf3d9fc0..99a94b9ec79a 100644 --- a/docs/static_site/src/pages/api/faq/env_var.md +++ b/docs/static_site/src/pages/api/faq/env_var.md @@ -374,12 +374,6 @@ If ctypes is used, it must be `mxnet._ctypes.ndarray.NDArrayBase`. - Values: Int ```(default=4)``` - This variable controls how many CuDNN dropout state resources to create for each GPU context for use in operator. -* MXNET_SUBGRAPH_BACKEND - - Values: String ```(default="ONEDNN")``` if oneDNN is available, otherwise ```(default="")``` - - This variable controls the subgraph partitioning in MXNet. - - This variable is used to perform oneDNN FP32 operator fusion and quantization. Please refer to the [oneDNN operator list](https://github.com/apache/incubator-mxnet/blob/v1.5.x/docs/tutorials/mkldnn/operator_list.md) for how this variable is used and the list of fusion passes. - - Set ```MXNET_SUBGRAPH_BACKEND=NONE``` to disable subgraph backend. - * MXNET_SAFE_ACCUMULATION - Values: Values: 0(false) or 1(true) ```(default=1)``` - If this variable is set, the accumulation will enter the safe mode, meaning accumulation is done in a data type of higher precision than diff --git a/docs/static_site/src/pages/api/faq/perf.md b/docs/static_site/src/pages/api/faq/perf.md index 0759afcc0163..0cbee87d9009 100644 --- a/docs/static_site/src/pages/api/faq/perf.md +++ b/docs/static_site/src/pages/api/faq/perf.md @@ -58,7 +58,6 @@ We also find that setting the following environment variables can help: | :-------- | :---------- | | `OMP_NUM_THREADS` | Suggested value: `vCPUs / 2` in which `vCPUs` is the number of virtual CPUs. 
For more information, please see the guide for [setting the number of threads using an OpenMP environment variable](https://software.intel.com/en-us/mkl-windows-developer-guide-setting-the-number-of-threads-using-an-openmp-environment-variable) | | `KMP_AFFINITY` | Suggested value: `granularity=fine,compact,1,0`. For more information, please see the guide for [Thread Affinity Interface (Linux* and Windows*)](https://software.intel.com/en-us/node/522691). | -| `MXNET_SUBGRAPH_BACKEND` | Set to ONEDNN to enable the [subgraph feature](https://cwiki.apache.org/confluence/display/MXNET/MXNet+Graph+Optimization+and+Quantization+based+on+subgraph+and+MKL-DNN) for better performance. For more information please see [Build/Install MXNet with oneDNN](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/dnnl/dnnl_readme.html)| Note that _MXNet_ treats all CPUs on a single machine as a single device. So whether you specify `cpu(0)` or `cpu()`, _MXNet_ will use all CPU cores on the machine. diff --git a/tests/python/dnnl/test_quantization_dnnl.py b/tests/python/dnnl/test_quantization_dnnl.py index a578dbe0b56f..52a38971b2d3 100644 --- a/tests/python/dnnl/test_quantization_dnnl.py +++ b/tests/python/dnnl/test_quantization_dnnl.py @@ -19,7 +19,6 @@ import mxnet as mx os.environ['ENABLE_ONEDNN_QUANTIZATION_TEST'] = '1' -os.environ['MXNET_SUBGRAPH_BACKEND'] = 'NONE' curr_path = os.path.dirname(os.path.abspath(os.path.expanduser(__file__))) sys.path.insert(0, os.path.join(curr_path, '../quantization')) from test_quantization import * @@ -30,4 +29,3 @@ import pytest pytest.main() del os.environ['ENABLE_ONEDNN_QUANTIZATION_TEST'] - del os.environ['MXNET_SUBGRAPH_BACKEND'] diff --git a/tests/python/unittest/test_subgraph_op.py b/tests/python/unittest/test_subgraph_op.py index b4f7917c64d9..216be2d55b48 100644 --- a/tests/python/unittest/test_subgraph_op.py +++ b/tests/python/unittest/test_subgraph_op.py @@ -160,8 +160,6 @@ def test_subgraph_exe1(sym, subgraph_backend, op_names): @pytest.mark.parametrize('sym,op_names', get_graphs()) @pytest.mark.skipif(sys.platform == "win32", reason='https://github.com/apache/incubator-mxnet/issues/19915') def test_subgraph_exe2(sym, subgraph_backend, op_names): - """Use env var MXNET_SUBGRAPH_BACKEND=default to trigger graph partitioning in _simple_bind - and compare results of the partitioned sym and the original sym.""" def get_executor(sym, subgraph_backend=None, op_names=None, original_exec=None): exe = sym._simple_bind(ctx=mx.current_context(), grad_req='null') input_names = sym.list_inputs() @@ -177,11 +175,10 @@ def get_executor(sym, subgraph_backend=None, op_names=None, original_exec=None): return exe sym, _, _ = sym original_exec = get_executor(sym) - with environment('MXNET_SUBGRAPH_BACKEND', subgraph_backend): - check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str(subgraph_backend), mx_uint(len(op_names)), - c_str_array(op_names))) - partitioned_exec = get_executor(sym, subgraph_backend, op_names, original_exec) - check_call(_LIB.MXRemoveSubgraphPropertyOpNames(c_str(subgraph_backend))) + check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str(subgraph_backend), mx_uint(len(op_names)), + c_str_array(op_names))) + partitioned_exec = get_executor(sym, subgraph_backend, op_names, original_exec) + check_call(_LIB.MXRemoveSubgraphPropertyOpNames(c_str(subgraph_backend))) outputs1 = original_exec.outputs outputs2 = partitioned_exec.outputs assert len(outputs1) == len(outputs2) @@ -223,8 +220,6 @@ def test_subgraph_exe3(sym, subgraph_backend, op_names): 
@pytest.mark.parametrize('sym,op_names', get_graphs()) @pytest.mark.skipif(sys.platform == "win32", reason='https://github.com/apache/incubator-mxnet/issues/19915') def test_subgraph_exe4(sym, subgraph_backend, op_names): - """Use env var MXNET_SUBGRAPH_BACKEND=default to trigger graph partitioning in bind - and compare results of the partitioned sym and the original sym.""" def get_executor(sym, subgraph_backend=None, op_names=None, original_exec=None): arg_shapes, _, aux_shapes = sym.infer_shape() if subgraph_backend is None: @@ -242,11 +237,10 @@ def get_executor(sym, subgraph_backend=None, op_names=None, original_exec=None): sym, _, _ = sym original_exec = get_executor(sym) - with environment('MXNET_SUBGRAPH_BACKEND', subgraph_backend): - check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str(subgraph_backend), mx_uint(len(op_names)), - c_str_array(op_names))) - partitioned_exec = get_executor(sym, subgraph_backend, op_names, original_exec) - check_call(_LIB.MXRemoveSubgraphPropertyOpNames(c_str(subgraph_backend))) + check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str(subgraph_backend), mx_uint(len(op_names)), + c_str_array(op_names))) + partitioned_exec = get_executor(sym, subgraph_backend, op_names, original_exec) + check_call(_LIB.MXRemoveSubgraphPropertyOpNames(c_str(subgraph_backend))) outputs1 = original_exec.outputs outputs2 = partitioned_exec.outputs assert len(outputs1) == len(outputs2) From f70d878a5ad13f32b6b33479fb72625252c26523 Mon Sep 17 00:00:00 2001 From: Bartosz Kuncer Date: Thu, 14 Oct 2021 19:58:19 +0200 Subject: [PATCH 2/7] Improve descriptions in docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md file --- .../python/tutorials/performance/backend/dnnl/dnnl_readme.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index d9cee98d91cf..1d33f1ed7a78 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -278,9 +278,9 @@ MKL_VERBOSE SGEMM(T,N,12,10,8,0x7f7f927b1378,0x1bc2140,8,0x1ba8040,8,0x7f7f927b1

<h2>Graph optimization</h2>

-Limitations of this experimental feature are: +Graph optimization with subgraph is available and enabled by default on master branch. Limitations of this experimental feature are: -- Use this feature only for inference. +- It works only for inference. - This feature will only run on the CPU, even if you're using a GPU-enabled build of MXNet. From 7cd45bc9845c57c397a5502064022ab5266cc81e Mon Sep 17 00:00:00 2001 From: Bartlomiej Gawrych Date: Tue, 19 Oct 2021 16:07:50 +0800 Subject: [PATCH 3/7] examples fixes --- .../performance/backend/dnnl/dnnl_readme.md | 173 +++++++++++------- 1 file changed, 103 insertions(+), 70 deletions(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index 1d33f1ed7a78..6d9e5c86e143 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -15,14 +15,12 @@ -# Install MXNet with ONEDNN +# Install MXNet with oneDNN -A better training and inference performance is expected to be achieved on Intel-Architecture CPUs with MXNet built with [Intel ONEDNN](https://github.com/oneapi-src/oneDNN) on multiple operating system, including Linux, Windows and MacOS. -In the following sections, you will find build instructions for MXNet with Intel ONEDNN on Linux, MacOS and Windows. +A better training and inference performance is expected to be achieved on Intel-Architecture CPUs with MXNet built with [oneDNN](https://github.com/oneapi-src/oneDNN) on multiple operating system, including Linux, Windows and MacOS. +In the following sections, you will find build instructions for MXNet with oneDNN on Linux, MacOS and Windows. -Please find ONEDNN optimized operators and other features in the [ONEDNN operator list](https://github.com/apache/incubator-mxnet/blob/v1.5.x/docs/tutorials/mkldnn/operator_list.md). - -The detailed performance data collected on Intel Xeon CPU with MXNet built with Intel ONEDNN can be found [here](https://mxnet.apache.org/api/faq/perf#intel-cpu). +The detailed performance data collected on Intel Xeon CPU with MXNet built with oneDNN can be found [here](https://mxnet.apache.org/api/faq/perf#intel-cpu).
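Regardless of the platform, once the Python package is installed you can confirm that the binary was actually built with oneDNN by inspecting the runtime feature flags. Below is a minimal sketch; it assumes the flag is exposed as `ONEDNN` on current master and as `MKLDNN` in older releases, so it checks both spellings.

```
from mxnet.runtime import Features

# Feature flags compiled into the libmxnet binary.
features = Features()

# Assumption: the oneDNN flag is named ONEDNN on recent builds and MKLDNN on older ones.
for name in ('ONEDNN', 'MKLDNN'):
    if name in features:
        print(name, 'enabled:', features.is_enabled(name))
```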

<h2>Contents</h2>

@@ -55,12 +53,12 @@ git clone --recursive https://github.com/apache/incubator-mxnet.git cd incubator-mxnet ``` -### Build MXNet with ONEDNN +### Build MXNet with oneDNN -To achieve better performance, the Intel OpenMP and llvm OpenMP are recommended as below instruction. Otherwise, default GNU OpenMP will be used and you may get the sub-optimal performance. If you don't have the full [MKL](https://software.intel.com/en-us/intel-mkl) library installation, you might use OpenBLAS as the blas library, by setting USE_BLAS=openblas. +To achieve better performance, the Intel OpenMP and llvm OpenMP are recommended as below instruction. Otherwise, default GNU OpenMP will be used and you may get the sub-optimal performance. If you don't have the full [MKL](https://software.intel.com/en-us/intel-mkl) library installation, you might use OpenBLAS as the blas library, by setting USE_BLAS=Open. ``` -# build with llvm OpenMP and Intel MKL/openblas +# build with llvm OpenMP and Intel MKL/OpenBlas mkdir build && cd build cmake -DUSE_CUDA=OFF -DUSE_ONEDNN=ON -DUSE_OPENMP=ON -DUSE_OPENCV=ON .. make -j $(nproc) @@ -68,12 +66,16 @@ make -j $(nproc) ``` # build with Intel MKL and Intel OpenMP -make -j $(nproc) USE_OPENCV=1 USE_ONEDNN=1 USE_BLAS=mkl USE_INTEL_PATH=/opt/intel +mkdir build && cd build +cmake -DUSE_CUDA=OFF -DUSE_ONEDNN=ON -DUSE_BLAS=mkl .. +make -j $(nproc) ``` ``` -# build with openblas and GNU OpenMP(sub-optimal performance) -make -j $(nproc) USE_OPENCV=1 USE_ONEDNN=1 USE_BLAS=openblas +# build with openblas and GNU OpenMP (sub-optimal performance) +mkdir build && cd build +cmake -DUSE_CUDA=OFF -DUSE_ONEDNN=ON -DUSE_BLAS=Open .. +make -j $(nproc) ```

<h2>MacOS</h2>

@@ -107,7 +109,7 @@ git clone --recursive https://github.com/apache/incubator-mxnet.git cd incubator-mxnet ``` -### Build MXNet with ONEDNN +### Build MXNet with oneDNN ``` LIBRARY_PATH=$(brew --prefix llvm)/lib/ make -j $(sysctl -n hw.ncpu) CC=$(brew --prefix llvm)/bin/clang CXX=$(brew --prefix llvm)/bin/clang++ USE_OPENCV=1 USE_OPENMP=1 USE_ONEDNN=1 USE_BLAS=apple @@ -115,7 +117,7 @@ LIBRARY_PATH=$(brew --prefix llvm)/lib/ make -j $(sysctl -n hw.ncpu) CC=$(brew -

<h2>Windows</h2>

-On Windows, you can use [Micrsoft Visual Studio 2015](https://www.visualstudio.com/vs/older-downloads/) and [Microsoft Visual Studio 2017](https://www.visualstudio.com/downloads/) to compile MXNet with Intel ONEDNN. +On Windows, you can use [Micrsoft Visual Studio 2015](https://www.visualstudio.com/vs/older-downloads/) and [Microsoft Visual Studio 2017](https://www.visualstudio.com/downloads/) to compile MXNet with Intel oneDNN. [Micrsoft Visual Studio 2015](https://www.visualstudio.com/vs/older-downloads/) is recommended. **Visual Studio 2015** @@ -136,14 +138,14 @@ After you have installed all of the required dependencies, build the MXNet sourc git clone --recursive https://github.com/apache/incubator-mxnet.git cd C:\incubator-mxent ``` -2. Enable Intel ONEDNN by -DUSE_ONEDNN=1. Use [CMake 3](https://cmake.org/) to create a Visual Studio solution in ```./build```. Make sure to specify the architecture in the +2. Enable oneDNN by -DUSE_ONEDNN=1. Use [CMake 3](https://cmake.org/) to create a Visual Studio solution in ```./build```. Make sure to specify the architecture in the command: ``` >mkdir build >cd build ->cmake -G "Visual Studio 14 Win64" .. -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_NVRTC=0 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_PROFILER=1 -DUSE_BLAS=open -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_NAME=All -DUSE_ONEDNN=1 -DCMAKE_BUILD_TYPE=Release +>cmake -G "Visual Studio 14 Win64" .. -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_NVRTC=0 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_PROFILER=1 -DUSE_BLAS=Open -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_NAME=All -DUSE_ONEDNN=1 -DCMAKE_BUILD_TYPE=Release ``` -3. Enable Intel ONEDNN and Intel MKL as BLAS library by the command: +3. Enable oneDNN and Intel MKL as BLAS library by the command: ``` >"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\mkl\bin\mklvars.bat" intel64 >cmake -G "Visual Studio 14 Win64" .. -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_NVRTC=0 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_PROFILER=1 -DUSE_BLAS=mkl -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_NAME=All -DUSE_ONEDNN=1 -DCMAKE_BUILD_TYPE=Release @@ -158,7 +160,7 @@ msbuild mxnet.sln /p:Configuration=Release;Platform=x64 /maxcpucount **Visual Studio 2017** -User can follow the same steps of Visual Studio 2015 to build MXNET with ONEDNN, but change the version related command, for example,```C:\opencv\build\x64\vc15\bin``` and build command is as below: +User can follow the same steps of Visual Studio 2015 to build MXNET with oneDNN, but change the version related command, for example,```C:\opencv\build\x64\vc15\bin``` and build command is as below: ``` >cmake -G "Visual Studio 15 Win64" .. -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_NVRTC=0 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_PROFILER=1 -DUSE_BLAS=mkl -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_NAME=All -DUSE_ONEDNN=1 -DCMAKE_BUILD_TYPE=Release @@ -183,29 +185,24 @@ Expected Output: [[ 2. 2. 2.] [ 2. 2. 2.]] ``` -### Verify whether ONEDNN works +### Verify whether oneDNN works -After MXNet is installed, you can verify if ONEDNN backend works well with a single Convolution layer. +After MXNet is installed, you can verify if oneDNN backend works well with a single Convolution layer. 
``` -import mxnet as mx -import numpy as np +from mxnet import np +from mxnet.gluon import nn num_filter = 32 kernel = (3, 3) pad = (1, 1) shape = (32, 32, 256, 256) -x = mx.sym.Variable('x') -w = mx.sym.Variable('w') -y = mx.sym.Convolution(data=x, weight=w, num_filter=num_filter, kernel=kernel, no_bias=True, pad=pad) -exe = y.simple_bind(mx.cpu(), x=shape) - -exe.arg_arrays[0][:] = np.random.normal(size=exe.arg_arrays[0].shape) -exe.arg_arrays[1][:] = np.random.normal(size=exe.arg_arrays[1].shape) +conv_layer = nn.Conv2D(channels=num_filter, kernel_size=kernel, padding=pad) +conv_layer.initialize() -exe.forward(is_train=False) -o = exe.outputs[0] -t = o.asnumpy() +data = np.random.normal(size=shape) +o = conv_layer(data) +o.wait_to_read() ``` More detailed debugging and profiling information can be logged by setting the environment variable 'DNNL_VERBOSE': @@ -214,16 +211,17 @@ export DNNL_VERBOSE=1 ``` For example, by running above code snippet, the following debugging logs providing more insights on oneDNN primitives `convolution` and `reorder`. That includes: Memory layout, infer shape and the time cost of primitive execution. ``` -dnnl_verbose,info,DNNL v1.1.2 (commit cb2cc7ac17ff4e2ef50805c7048d33256d82be4d) -dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost -dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:aBcd16b:f0,,,32x32x256x256,7.43701 -dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:ABcd16b16a:f0,,,32x32x3x3,0.202148 -dnnl_verbose,exec,cpu,convolution,jit:avx512_common,forward_inference,src_f32::blocked:aBcd16b:f0 wei_f32::blocked:ABcd16b16a:f0 bia_undef::undef::f0 dst_f32::blocked:aBcd16b:f0,,alg:convolution_direct,mb32_ic32oc32_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,20.7539 -dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:ABcd16b16a:f0,,,32x32x3x3,1.86694 -dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,32x32x256x256,35.9771 +dnnl_verbose,info,oneDNN v2.3.2 (commit e2d45252ae9c3e91671339579e3c0f0061f81d49) +dnnl_verbose,info,cpu,runtime:OpenMP +dnnl_verbose,info,cpu,isa:Intel AVX-512 with Intel DL Boost +dnnl_verbose,info,gpu,runtime:none +dnnl_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:acdb:f0,,,32x32x256x256,8.34912 +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb32a:f0,,,32x32x3x3,0.0229492 +dnnl_verbose,exec,cpu,convolution,brgconv:avx512_core,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:Acdb32a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,,alg:convolution_direct,mb32_ic32oc32_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,10.5898 ``` -You can find step-by-step guidance to do profiling for ONEDNN primitives in [Profiling ONEDNN Operators](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/profiler.html#Profiling-MKLDNN-Operators). +You can find step-by-step guidance to do profiling for oneDNN primitives in [Profiling oneDNN Operators](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/profiler.html#Profiling-oneDNN-Operators).
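The verbose log can also be switched on from inside a Python script instead of the shell. A small sketch, assuming the variable is picked up as long as it is set before the first oneDNN primitive executes:

```
import os
# Assumption: oneDNN reads DNNL_VERBOSE lazily, so setting it at the top of the
# script (before any primitive runs) has the same effect as exporting it in the shell.
os.environ['DNNL_VERBOSE'] = '1'

from mxnet import np
from mxnet.gluon import nn

conv_layer = nn.Conv2D(channels=32, kernel_size=(3, 3), padding=(1, 1))
conv_layer.initialize()

out = conv_layer(np.random.normal(size=(32, 32, 256, 256)))
out.wait_to_read()  # verbose lines are printed when the convolution executes
```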

<h2>Enable MKL BLAS</h2>

@@ -233,61 +231,96 @@ Installing the full MKL installation enables MKL support for all operators under 1. Download and install the latest full MKL version following instructions on the [intel website.](https://software.intel.com/en-us/mkl) You can also install MKL through [YUM](https://software.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/installation/install-using-package-managers/yum-dnf-zypper.html) or [APT](https://software.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/installation/install-using-package-managers/apt.html) Repository. - 2. Run `make -j ${nproc} USE_BLAS=mkl` + 2. Create and navigate to build directory `mkdir build && cd build` - 3. Navigate into the python directory + 3. Run `cmake -DUSE_CUDA=OFF -DUSE_BLAS=mkl ..` - 4. Run `sudo python setup.py install` + 4. Run `make -j` -### Verify whether MKL works - -After MXNet is installed, you can verify if MKL BLAS works well with a single dot layer. + 5. Navigate into the python directory -``` -import mxnet as mx -import numpy as np + 6. Run `sudo python setup.py install` -shape_x = (1, 10, 8) -shape_w = (1, 12, 8) - -x_npy = np.random.normal(0, 1, shape_x) -w_npy = np.random.normal(0, 1, shape_w) +### Verify whether MKL works -x = mx.sym.Variable('x') -w = mx.sym.Variable('w') -y = mx.sym.batch_dot(x, w, transpose_b=True) -exe = y.simple_bind(mx.cpu(), x=x_npy.shape, w=w_npy.shape) +After MXNet is installed, you can verify if MKL BLAS works well with a linear matrix solver. -exe.forward(is_train=False) -o = exe.outputs[0] -t = o.asnumpy() +``` +from mxnet import np +coeff = np.array([[7, 0], [5, 2]]) +y = np.array([14, 18]) +x = np.linalg.solve(coeff, y) +x.wait_to_read() ``` You can open the `MKL_VERBOSE` flag by setting environment variable: ``` export MKL_VERBOSE=1 ``` -Then by running above code snippet, you probably will get the following output message which means `SGEMM` primitive from MKL are called. Layout information and primitive execution performance are also demonstrated in the log message. +Then by running above code snippet, you should get the similar output to message below (`SGESV` primitive from MKL was executed). Layout information and primitive execution performance are also demonstrated in the log message. ``` -Numpy + Intel(R) MKL: THREADING LAYER: (null) -Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime -Numpy + Intel(R) MKL: preloading libiomp5.so runtime -MKL_VERBOSE Intel(R) MKL 2019.0 Update 3 Product build 20190125 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.40GHz lp64 intel_thread NMICDev:0 -MKL_VERBOSE SGEMM(T,N,12,10,8,0x7f7f927b1378,0x1bc2140,8,0x1ba8040,8,0x7f7f927b1380,0x7f7f7400a280,12) 8.93ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40 WDiv:HOST:+0.000 +mkl-service + Intel(R) MKL: THREADING LAYER: (null) +mkl-service + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime +mkl-service + Intel(R) MKL: preloading libiomp5.so runtime +Intel(R) MKL 2020.0 Update 1 Product build 20200208 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.70GHz lp64 intel_thread +MKL_VERBOSE SGESV(2,1,0x7f74d4002780,2,0x7f74d4002798,0x7f74d4002790,2,0) 77.58us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56 ```

<h2>Graph optimization</h2>

-Graph optimization with subgraph is available and enabled by default on master branch. Limitations of this experimental feature are: +To better utilise oneDNN potential, using graph optimizations is recommended. There are few limitations of this feature: - It works only for inference. - +- Only subclasses of HybridBlock and Symbol can call optimize_for API. - This feature will only run on the CPU, even if you're using a GPU-enabled build of MXNet. +If your use case met above conditions, graph optimizations can be enabled by just simple call `optimize_for` API. Example below: +``` +from mxnet import np +from mxnet.gluon import nn + +data = np.random.normal(size=(32,3,224,224)) + +net = nn.HybridSequential() +net.add(nn.Conv2D(channels=64, kernel_size=(3,3))) +net.add(nn.Activation('relu')) +net.initialize() +print("=" * 5, " Not optimized ", "=" * 5) +o = net(data) +o.wait_to_read() + +net.optimize_for(data, backend='ONEDNN') +print("=" * 5, " Optimized ", "=" * 5) +o = net(data) +o.wait_to_read() + +``` + +Above code snippet should produce following output: +``` +===== Not optimized ===== +[15:05:43] ../src/storage/storage.cc:202: Using Pooled (Naive) StorageManager for CPU +dnnl_verbose,info,oneDNN v2.3.2 (commit e2d45252ae9c3e91671339579e3c0f0061f81d49) +dnnl_verbose,info,cpu,runtime:OpenMP +dnnl_verbose,info,cpu,isa:Intel AVX-512 with AVX512BW, AVX512VL, and AVX512DQ extensions +dnnl_verbose,info,gpu,runtime:none +dnnl_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:acdb:f0,,,32x3x224x224,8.87793 +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb64a:f0,,,64x3x3x3,0.00708008 +dnnl_verbose,exec,cpu,convolution,brgconv:avx512_core,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:Acdb64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,,alg:convolution_direct,mb32_ic3oc64_ih224oh222kh3sh1dh0ph0_iw224ow222kw3sw1dw0pw0,91.511 +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb64a:f0,,,64x3x3x3,0.00610352 +dnnl_verbose,exec,cpu,eltwise,jit:avx512_common,forward_inference,data_f32::blocked:acdb:f0 diff_undef::undef::f0,,alg:eltwise_relu alpha:0 beta:0,32x64x222x222,85.4392 +===== Optimized ===== +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:Acdb64a:f0 dst_f32::blocked:abcd:f0,,,64x3x3x3,0.00610352 +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb64a:f0,,,64x3x3x3,0.00585938 +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:acdb:f0,,,32x3x224x224,3.98999 +dnnl_verbose,exec,cpu,convolution,brgconv:avx512_core,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:Acdb64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,attr-post-ops:eltwise_relu:0:1 ,alg:convolution_direct,mb32_ic3oc64_ih224oh222kh3sh1dh0ph0_iw224ow222kw3sw1dw0pw0,20.46 +``` +After optimization of Convolution + ReLU oneDNN executes both operations within single convolution primitive.
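A block optimized this way can also be serialized, so the fused graph does not have to be rebuilt at deployment time. A minimal sketch, assuming the exported input is named `data` and using an arbitrary file prefix (check the generated `-symbol.json` if your model uses a different input name):

```
from mxnet import np
from mxnet.gluon import nn, SymbolBlock

data = np.random.normal(size=(32, 3, 224, 224))

net = nn.HybridSequential()
net.add(nn.Conv2D(channels=64, kernel_size=(3, 3)))
net.add(nn.Activation('relu'))
net.initialize()

net.optimize_for(data, backend='ONEDNN')
# Writes conv_relu_opt-symbol.json and conv_relu_opt-0000.params (prefix is arbitrary).
net.export('conv_relu_opt')

# Reload the already-fused graph for inference.
deployed = SymbolBlock.imports('conv_relu_opt-symbol.json', ['data'],
                               'conv_relu_opt-0000.params')
out = deployed(data)
print(out)
```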

<h2>Quantization and Inference with INT8</h2>

-Benefiting from oneDNN, MXNet built with oneDNN brings outstanding performance improvement on quantization and inference with INT8 Intel CPU Platform on Intel Xeon Scalable Platform. +MXNet built with oneDNN brings outstanding performance improvement on quantization and inference with INT8 Intel CPU Platform on Intel Xeon Scalable Platform. - [CNN Quantization Examples](https://github.com/apache/incubator-mxnet/tree/master/example/quantization). From 8cc9090d00d34cb4192c8598a609dbcdfb5af086 Mon Sep 17 00:00:00 2001 From: Bartosz Kuncer Date: Tue, 19 Oct 2021 11:12:01 +0200 Subject: [PATCH 4/7] Fix MKL_VERBOSE description --- .../python/tutorials/performance/backend/dnnl/dnnl_readme.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index 6d9e5c86e143..0902e9bd4578 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -253,7 +253,7 @@ x = np.linalg.solve(coeff, y) x.wait_to_read() ``` -You can open the `MKL_VERBOSE` flag by setting environment variable: +You can get the verbose log output from mkl library by setting environment variable: ``` export MKL_VERBOSE=1 ``` From f8ac34a3a28396d940f1de3f452fbbe27b411b66 Mon Sep 17 00:00:00 2001 From: bgawrych Date: Tue, 19 Oct 2021 13:44:07 +0200 Subject: [PATCH 5/7] replace wait_to_read with print --- .../tutorials/performance/backend/dnnl/dnnl_readme.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index 0902e9bd4578..5f11394e914e 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -202,7 +202,7 @@ conv_layer.initialize() data = np.random.normal(size=shape) o = conv_layer(data) -o.wait_to_read() +print(o) ``` More detailed debugging and profiling information can be logged by setting the environment variable 'DNNL_VERBOSE': @@ -250,7 +250,7 @@ from mxnet import np coeff = np.array([[7, 0], [5, 2]]) y = np.array([14, 18]) x = np.linalg.solve(coeff, y) -x.wait_to_read() +print(x) ``` You can get the verbose log output from mkl library by setting environment variable: @@ -287,16 +287,16 @@ net.add(nn.Activation('relu')) net.initialize() print("=" * 5, " Not optimized ", "=" * 5) o = net(data) -o.wait_to_read() +print(o) net.optimize_for(data, backend='ONEDNN') print("=" * 5, " Optimized ", "=" * 5) o = net(data) -o.wait_to_read() +print(o) ``` -Above code snippet should produce following output: +Above code snippet should produce similar output to the following one (printed tensors are omitted) : ``` ===== Not optimized ===== [15:05:43] ../src/storage/storage.cc:202: Using Pooled (Naive) StorageManager for CPU From d4d78d97c26a562f44b86da19abaf030bdf090ce Mon Sep 17 00:00:00 2001 From: Bartosz Kuncer Date: Tue, 26 Oct 2021 10:10:37 +0200 Subject: [PATCH 6/7] Fix dnnl_readme.md --- .../python/tutorials/performance/backend/dnnl/dnnl_readme.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index 
5f11394e914e..1cac07c49d1f 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -117,7 +117,7 @@ LIBRARY_PATH=$(brew --prefix llvm)/lib/ make -j $(sysctl -n hw.ncpu) CC=$(brew -

<h2>Windows</h2>

-On Windows, you can use [Micrsoft Visual Studio 2015](https://www.visualstudio.com/vs/older-downloads/) and [Microsoft Visual Studio 2017](https://www.visualstudio.com/downloads/) to compile MXNet with Intel oneDNN. +On Windows, you can use [Micrsoft Visual Studio 2015](https://www.visualstudio.com/vs/older-downloads/) and [Microsoft Visual Studio 2017](https://www.visualstudio.com/downloads/) to compile MXNet with oneDNN. [Micrsoft Visual Studio 2015](https://www.visualstudio.com/vs/older-downloads/) is recommended. **Visual Studio 2015** From 7cb10c2b63b9c271999f37debe7dd8314c07bdd2 Mon Sep 17 00:00:00 2001 From: Bartosz Kuncer Date: Tue, 26 Oct 2021 17:25:09 +0200 Subject: [PATCH 7/7] Fix link check --- .../python/tutorials/performance/backend/dnnl/dnnl_readme.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index 1cac07c49d1f..a75e09293bf1 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -221,7 +221,7 @@ dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::bl dnnl_verbose,exec,cpu,convolution,brgconv:avx512_core,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:Acdb32a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,,alg:convolution_direct,mb32_ic32oc32_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,10.5898 ``` -You can find step-by-step guidance to do profiling for oneDNN primitives in [Profiling oneDNN Operators](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/profiler.html#Profiling-oneDNN-Operators). +You can find step-by-step guidance to do profiling for oneDNN primitives in [Profiling oneDNN Operators](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/profiler.html#Profiling-MKLDNN-Operators).

<h2>Enable MKL BLAS</h2>