From 9974ab05898a76d3bf6cb28a03a95908d2d9248c Mon Sep 17 00:00:00 2001 From: Bartosz Kuncer Date: Thu, 14 Oct 2021 16:40:09 +0200 Subject: [PATCH 1/7] Remove MXNET_SUBGRAPH_BACKEND environment variable --- .../performance/backend/dnnl/dnnl_readme.md | 12 +++------- .../api/cpp/docs/tutorials/subgraphAPI.md | 8 +------ docs/static_site/src/pages/api/faq/env_var.md | 6 ----- docs/static_site/src/pages/api/faq/perf.md | 1 - tests/python/dnnl/test_quantization_dnnl.py | 2 -- tests/python/unittest/test_subgraph_op.py | 22 +++++++------------ 6 files changed, 12 insertions(+), 39 deletions(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index e68dc53a780b..d9cee98d91cf 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -276,17 +276,11 @@ MKL_VERBOSE Intel(R) MKL 2019.0 Update 3 Product build 20190125 for Intel(R) 64 MKL_VERBOSE SGEMM(T,N,12,10,8,0x7f7f927b1378,0x1bc2140,8,0x1ba8040,8,0x7f7f927b1380,0x7f7f7400a280,12) 8.93ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40 WDiv:HOST:+0.000 ``` -

-<h2>Enable graph optimization</h2>
+<h2>Graph optimization</h2>

-Graph optimization with subgraph is available and enabled by default in master branch. For MXNet release v1.5, you can manually enable it by: +Limitations of this experimental feature are: -``` -export MXNET_SUBGRAPH_BACKEND=ONEDNN -``` - -This limitations of this experimental feature are: - -- Use this feature only for inference. When training, be sure to turn the feature off by unsetting the `MXNET_SUBGRAPH_BACKEND` environment variable. +- Use this feature only for inference. - This feature will only run on the CPU, even if you're using a GPU-enabled build of MXNet. diff --git a/docs/static_site/src/pages/api/cpp/docs/tutorials/subgraphAPI.md b/docs/static_site/src/pages/api/cpp/docs/tutorials/subgraphAPI.md index 2b85a9db77af..b8c81bdf5220 100644 --- a/docs/static_site/src/pages/api/cpp/docs/tutorials/subgraphAPI.md +++ b/docs/static_site/src/pages/api/cpp/docs/tutorials/subgraphAPI.md @@ -151,13 +151,7 @@ MXNET_REGISTER_SUBGRAPH_PROPERTY(SgTest, SgProperty2); // Execution order 2. MXNET_REGISTER_SUBGRAPH_PROPERTY(SgTest, SgProperty3); // Execution order 3. ``` -After compiling this subgraph mechanism into MXNet, we can use the environment variable `MXNET_SUBGRAPH_BACKEND` to activate it during symbol bind. - -```bash -export MXNET_SUBGRAPH_BACKEND=SgTest -``` - -Or you can use python symbol API `get_backend_symbol` to run all properties registered for this backend and get returned symbol. +After compiling this subgraph mechanism into MXNet you can use python symbol API `get_backend_symbol` to run all properties registered for this backend and get returned symbol. ```python sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch) diff --git a/docs/static_site/src/pages/api/faq/env_var.md b/docs/static_site/src/pages/api/faq/env_var.md index eed6cf3d9fc0..99a94b9ec79a 100644 --- a/docs/static_site/src/pages/api/faq/env_var.md +++ b/docs/static_site/src/pages/api/faq/env_var.md @@ -374,12 +374,6 @@ If ctypes is used, it must be `mxnet._ctypes.ndarray.NDArrayBase`. - Values: Int ```(default=4)``` - This variable controls how many CuDNN dropout state resources to create for each GPU context for use in operator. -* MXNET_SUBGRAPH_BACKEND - - Values: String ```(default="ONEDNN")``` if oneDNN is available, otherwise ```(default="")``` - - This variable controls the subgraph partitioning in MXNet. - - This variable is used to perform oneDNN FP32 operator fusion and quantization. Please refer to the [oneDNN operator list](https://github.com/apache/incubator-mxnet/blob/v1.5.x/docs/tutorials/mkldnn/operator_list.md) for how this variable is used and the list of fusion passes. - - Set ```MXNET_SUBGRAPH_BACKEND=NONE``` to disable subgraph backend. - * MXNET_SAFE_ACCUMULATION - Values: Values: 0(false) or 1(true) ```(default=1)``` - If this variable is set, the accumulation will enter the safe mode, meaning accumulation is done in a data type of higher precision than diff --git a/docs/static_site/src/pages/api/faq/perf.md b/docs/static_site/src/pages/api/faq/perf.md index 0759afcc0163..0cbee87d9009 100644 --- a/docs/static_site/src/pages/api/faq/perf.md +++ b/docs/static_site/src/pages/api/faq/perf.md @@ -58,7 +58,6 @@ We also find that setting the following environment variables can help: | :-------- | :---------- | | `OMP_NUM_THREADS` | Suggested value: `vCPUs / 2` in which `vCPUs` is the number of virtual CPUs. 
For more information, please see the guide for [setting the number of threads using an OpenMP environment variable](https://software.intel.com/en-us/mkl-windows-developer-guide-setting-the-number-of-threads-using-an-openmp-environment-variable) | | `KMP_AFFINITY` | Suggested value: `granularity=fine,compact,1,0`. For more information, please see the guide for [Thread Affinity Interface (Linux* and Windows*)](https://software.intel.com/en-us/node/522691). | -| `MXNET_SUBGRAPH_BACKEND` | Set to ONEDNN to enable the [subgraph feature](https://cwiki.apache.org/confluence/display/MXNET/MXNet+Graph+Optimization+and+Quantization+based+on+subgraph+and+MKL-DNN) for better performance. For more information please see [Build/Install MXNet with oneDNN](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/dnnl/dnnl_readme.html)| Note that _MXNet_ treats all CPUs on a single machine as a single device. So whether you specify `cpu(0)` or `cpu()`, _MXNet_ will use all CPU cores on the machine. diff --git a/tests/python/dnnl/test_quantization_dnnl.py b/tests/python/dnnl/test_quantization_dnnl.py index a578dbe0b56f..52a38971b2d3 100644 --- a/tests/python/dnnl/test_quantization_dnnl.py +++ b/tests/python/dnnl/test_quantization_dnnl.py @@ -19,7 +19,6 @@ import mxnet as mx os.environ['ENABLE_ONEDNN_QUANTIZATION_TEST'] = '1' -os.environ['MXNET_SUBGRAPH_BACKEND'] = 'NONE' curr_path = os.path.dirname(os.path.abspath(os.path.expanduser(__file__))) sys.path.insert(0, os.path.join(curr_path, '../quantization')) from test_quantization import * @@ -30,4 +29,3 @@ import pytest pytest.main() del os.environ['ENABLE_ONEDNN_QUANTIZATION_TEST'] - del os.environ['MXNET_SUBGRAPH_BACKEND'] diff --git a/tests/python/unittest/test_subgraph_op.py b/tests/python/unittest/test_subgraph_op.py index b4f7917c64d9..216be2d55b48 100644 --- a/tests/python/unittest/test_subgraph_op.py +++ b/tests/python/unittest/test_subgraph_op.py @@ -160,8 +160,6 @@ def test_subgraph_exe1(sym, subgraph_backend, op_names): @pytest.mark.parametrize('sym,op_names', get_graphs()) @pytest.mark.skipif(sys.platform == "win32", reason='https://github.com/apache/incubator-mxnet/issues/19915') def test_subgraph_exe2(sym, subgraph_backend, op_names): - """Use env var MXNET_SUBGRAPH_BACKEND=default to trigger graph partitioning in _simple_bind - and compare results of the partitioned sym and the original sym.""" def get_executor(sym, subgraph_backend=None, op_names=None, original_exec=None): exe = sym._simple_bind(ctx=mx.current_context(), grad_req='null') input_names = sym.list_inputs() @@ -177,11 +175,10 @@ def get_executor(sym, subgraph_backend=None, op_names=None, original_exec=None): return exe sym, _, _ = sym original_exec = get_executor(sym) - with environment('MXNET_SUBGRAPH_BACKEND', subgraph_backend): - check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str(subgraph_backend), mx_uint(len(op_names)), - c_str_array(op_names))) - partitioned_exec = get_executor(sym, subgraph_backend, op_names, original_exec) - check_call(_LIB.MXRemoveSubgraphPropertyOpNames(c_str(subgraph_backend))) + check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str(subgraph_backend), mx_uint(len(op_names)), + c_str_array(op_names))) + partitioned_exec = get_executor(sym, subgraph_backend, op_names, original_exec) + check_call(_LIB.MXRemoveSubgraphPropertyOpNames(c_str(subgraph_backend))) outputs1 = original_exec.outputs outputs2 = partitioned_exec.outputs assert len(outputs1) == len(outputs2) @@ -223,8 +220,6 @@ def test_subgraph_exe3(sym, subgraph_backend, op_names): 
@pytest.mark.parametrize('sym,op_names', get_graphs()) @pytest.mark.skipif(sys.platform == "win32", reason='https://github.com/apache/incubator-mxnet/issues/19915') def test_subgraph_exe4(sym, subgraph_backend, op_names): - """Use env var MXNET_SUBGRAPH_BACKEND=default to trigger graph partitioning in bind - and compare results of the partitioned sym and the original sym.""" def get_executor(sym, subgraph_backend=None, op_names=None, original_exec=None): arg_shapes, _, aux_shapes = sym.infer_shape() if subgraph_backend is None: @@ -242,11 +237,10 @@ def get_executor(sym, subgraph_backend=None, op_names=None, original_exec=None): sym, _, _ = sym original_exec = get_executor(sym) - with environment('MXNET_SUBGRAPH_BACKEND', subgraph_backend): - check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str(subgraph_backend), mx_uint(len(op_names)), - c_str_array(op_names))) - partitioned_exec = get_executor(sym, subgraph_backend, op_names, original_exec) - check_call(_LIB.MXRemoveSubgraphPropertyOpNames(c_str(subgraph_backend))) + check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str(subgraph_backend), mx_uint(len(op_names)), + c_str_array(op_names))) + partitioned_exec = get_executor(sym, subgraph_backend, op_names, original_exec) + check_call(_LIB.MXRemoveSubgraphPropertyOpNames(c_str(subgraph_backend))) outputs1 = original_exec.outputs outputs2 = partitioned_exec.outputs assert len(outputs1) == len(outputs2) From f70d878a5ad13f32b6b33479fb72625252c26523 Mon Sep 17 00:00:00 2001 From: Bartosz Kuncer Date: Thu, 14 Oct 2021 19:58:19 +0200 Subject: [PATCH 2/7] Improve descriptions in docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md file --- .../python/tutorials/performance/backend/dnnl/dnnl_readme.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index d9cee98d91cf..1d33f1ed7a78 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -278,9 +278,9 @@ MKL_VERBOSE SGEMM(T,N,12,10,8,0x7f7f927b1378,0x1bc2140,8,0x1ba8040,8,0x7f7f927b1

<h2>Graph optimization</h2>

-Limitations of this experimental feature are: +Graph optimization with subgraph is available and enabled by default on master branch. Limitations of this experimental feature are: -- Use this feature only for inference. +- It works only for inference. - This feature will only run on the CPU, even if you're using a GPU-enabled build of MXNet. From 7cd45bc9845c57c397a5502064022ab5266cc81e Mon Sep 17 00:00:00 2001 From: Bartlomiej Gawrych Date: Tue, 19 Oct 2021 16:07:50 +0800 Subject: [PATCH 3/7] examples fixes --- .../performance/backend/dnnl/dnnl_readme.md | 173 +++++++++++------- 1 file changed, 103 insertions(+), 70 deletions(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index 1d33f1ed7a78..6d9e5c86e143 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -15,14 +15,12 @@ -# Install MXNet with ONEDNN +# Install MXNet with oneDNN -A better training and inference performance is expected to be achieved on Intel-Architecture CPUs with MXNet built with [Intel ONEDNN](https://github.com/oneapi-src/oneDNN) on multiple operating system, including Linux, Windows and MacOS. -In the following sections, you will find build instructions for MXNet with Intel ONEDNN on Linux, MacOS and Windows. +A better training and inference performance is expected to be achieved on Intel-Architecture CPUs with MXNet built with [oneDNN](https://github.com/oneapi-src/oneDNN) on multiple operating system, including Linux, Windows and MacOS. +In the following sections, you will find build instructions for MXNet with oneDNN on Linux, MacOS and Windows. -Please find ONEDNN optimized operators and other features in the [ONEDNN operator list](https://github.com/apache/incubator-mxnet/blob/v1.5.x/docs/tutorials/mkldnn/operator_list.md). - -The detailed performance data collected on Intel Xeon CPU with MXNet built with Intel ONEDNN can be found [here](https://mxnet.apache.org/api/faq/perf#intel-cpu). +The detailed performance data collected on Intel Xeon CPU with MXNet built with oneDNN can be found [here](https://mxnet.apache.org/api/faq/perf#intel-cpu).
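Regardless of the platform, once the Python package is installed you can confirm that the binary was actually built with oneDNN by inspecting the runtime feature flags. Below is a minimal sketch; it assumes the flag is exposed as `ONEDNN` on current master and as `MKLDNN` in older releases, so it checks both spellings.

```
from mxnet.runtime import Features

# Feature flags compiled into the libmxnet binary.
features = Features()

# Assumption: the oneDNN flag is named ONEDNN on recent builds and MKLDNN on older ones.
for name in ('ONEDNN', 'MKLDNN'):
    if name in features:
        print(name, 'enabled:', features.is_enabled(name))
```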

<h2>Contents</h2>

@@ -55,12 +53,12 @@ git clone --recursive https://github.com/apache/incubator-mxnet.git cd incubator-mxnet ``` -### Build MXNet with ONEDNN +### Build MXNet with oneDNN -To achieve better performance, the Intel OpenMP and llvm OpenMP are recommended as below instruction. Otherwise, default GNU OpenMP will be used and you may get the sub-optimal performance. If you don't have the full [MKL](https://software.intel.com/en-us/intel-mkl) library installation, you might use OpenBLAS as the blas library, by setting USE_BLAS=openblas. +To achieve better performance, the Intel OpenMP and llvm OpenMP are recommended as below instruction. Otherwise, default GNU OpenMP will be used and you may get the sub-optimal performance. If you don't have the full [MKL](https://software.intel.com/en-us/intel-mkl) library installation, you might use OpenBLAS as the blas library, by setting USE_BLAS=Open. ``` -# build with llvm OpenMP and Intel MKL/openblas +# build with llvm OpenMP and Intel MKL/OpenBlas mkdir build && cd build cmake -DUSE_CUDA=OFF -DUSE_ONEDNN=ON -DUSE_OPENMP=ON -DUSE_OPENCV=ON .. make -j $(nproc) @@ -68,12 +66,16 @@ make -j $(nproc) ``` # build with Intel MKL and Intel OpenMP -make -j $(nproc) USE_OPENCV=1 USE_ONEDNN=1 USE_BLAS=mkl USE_INTEL_PATH=/opt/intel +mkdir build && cd build +cmake -DUSE_CUDA=OFF -DUSE_ONEDNN=ON -DUSE_BLAS=mkl .. +make -j $(nproc) ``` ``` -# build with openblas and GNU OpenMP(sub-optimal performance) -make -j $(nproc) USE_OPENCV=1 USE_ONEDNN=1 USE_BLAS=openblas +# build with openblas and GNU OpenMP (sub-optimal performance) +mkdir build && cd build +cmake -DUSE_CUDA=OFF -DUSE_ONEDNN=ON -DUSE_BLAS=Open .. +make -j $(nproc) ```

<h2>MacOS</h2>

@@ -107,7 +109,7 @@ git clone --recursive https://github.com/apache/incubator-mxnet.git cd incubator-mxnet ``` -### Build MXNet with ONEDNN +### Build MXNet with oneDNN ``` LIBRARY_PATH=$(brew --prefix llvm)/lib/ make -j $(sysctl -n hw.ncpu) CC=$(brew --prefix llvm)/bin/clang CXX=$(brew --prefix llvm)/bin/clang++ USE_OPENCV=1 USE_OPENMP=1 USE_ONEDNN=1 USE_BLAS=apple @@ -115,7 +117,7 @@ LIBRARY_PATH=$(brew --prefix llvm)/lib/ make -j $(sysctl -n hw.ncpu) CC=$(brew -

<h2>Windows</h2>

-On Windows, you can use [Micrsoft Visual Studio 2015](https://www.visualstudio.com/vs/older-downloads/) and [Microsoft Visual Studio 2017](https://www.visualstudio.com/downloads/) to compile MXNet with Intel ONEDNN. +On Windows, you can use [Micrsoft Visual Studio 2015](https://www.visualstudio.com/vs/older-downloads/) and [Microsoft Visual Studio 2017](https://www.visualstudio.com/downloads/) to compile MXNet with Intel oneDNN. [Micrsoft Visual Studio 2015](https://www.visualstudio.com/vs/older-downloads/) is recommended. **Visual Studio 2015** @@ -136,14 +138,14 @@ After you have installed all of the required dependencies, build the MXNet sourc git clone --recursive https://github.com/apache/incubator-mxnet.git cd C:\incubator-mxent ``` -2. Enable Intel ONEDNN by -DUSE_ONEDNN=1. Use [CMake 3](https://cmake.org/) to create a Visual Studio solution in ```./build```. Make sure to specify the architecture in the +2. Enable oneDNN by -DUSE_ONEDNN=1. Use [CMake 3](https://cmake.org/) to create a Visual Studio solution in ```./build```. Make sure to specify the architecture in the command: ``` >mkdir build >cd build ->cmake -G "Visual Studio 14 Win64" .. -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_NVRTC=0 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_PROFILER=1 -DUSE_BLAS=open -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_NAME=All -DUSE_ONEDNN=1 -DCMAKE_BUILD_TYPE=Release +>cmake -G "Visual Studio 14 Win64" .. -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_NVRTC=0 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_PROFILER=1 -DUSE_BLAS=Open -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_NAME=All -DUSE_ONEDNN=1 -DCMAKE_BUILD_TYPE=Release ``` -3. Enable Intel ONEDNN and Intel MKL as BLAS library by the command: +3. Enable oneDNN and Intel MKL as BLAS library by the command: ``` >"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\mkl\bin\mklvars.bat" intel64 >cmake -G "Visual Studio 14 Win64" .. -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_NVRTC=0 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_PROFILER=1 -DUSE_BLAS=mkl -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_NAME=All -DUSE_ONEDNN=1 -DCMAKE_BUILD_TYPE=Release @@ -158,7 +160,7 @@ msbuild mxnet.sln /p:Configuration=Release;Platform=x64 /maxcpucount **Visual Studio 2017** -User can follow the same steps of Visual Studio 2015 to build MXNET with ONEDNN, but change the version related command, for example,```C:\opencv\build\x64\vc15\bin``` and build command is as below: +User can follow the same steps of Visual Studio 2015 to build MXNET with oneDNN, but change the version related command, for example,```C:\opencv\build\x64\vc15\bin``` and build command is as below: ``` >cmake -G "Visual Studio 15 Win64" .. -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_NVRTC=0 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_PROFILER=1 -DUSE_BLAS=mkl -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_NAME=All -DUSE_ONEDNN=1 -DCMAKE_BUILD_TYPE=Release @@ -183,29 +185,24 @@ Expected Output: [[ 2. 2. 2.] [ 2. 2. 2.]] ``` -### Verify whether ONEDNN works +### Verify whether oneDNN works -After MXNet is installed, you can verify if ONEDNN backend works well with a single Convolution layer. +After MXNet is installed, you can verify if oneDNN backend works well with a single Convolution layer. 
``` -import mxnet as mx -import numpy as np +from mxnet import np +from mxnet.gluon import nn num_filter = 32 kernel = (3, 3) pad = (1, 1) shape = (32, 32, 256, 256) -x = mx.sym.Variable('x') -w = mx.sym.Variable('w') -y = mx.sym.Convolution(data=x, weight=w, num_filter=num_filter, kernel=kernel, no_bias=True, pad=pad) -exe = y.simple_bind(mx.cpu(), x=shape) - -exe.arg_arrays[0][:] = np.random.normal(size=exe.arg_arrays[0].shape) -exe.arg_arrays[1][:] = np.random.normal(size=exe.arg_arrays[1].shape) +conv_layer = nn.Conv2D(channels=num_filter, kernel_size=kernel, padding=pad) +conv_layer.initialize() -exe.forward(is_train=False) -o = exe.outputs[0] -t = o.asnumpy() +data = np.random.normal(size=shape) +o = conv_layer(data) +o.wait_to_read() ``` More detailed debugging and profiling information can be logged by setting the environment variable 'DNNL_VERBOSE': @@ -214,16 +211,17 @@ export DNNL_VERBOSE=1 ``` For example, by running above code snippet, the following debugging logs providing more insights on oneDNN primitives `convolution` and `reorder`. That includes: Memory layout, infer shape and the time cost of primitive execution. ``` -dnnl_verbose,info,DNNL v1.1.2 (commit cb2cc7ac17ff4e2ef50805c7048d33256d82be4d) -dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost -dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:aBcd16b:f0,,,32x32x256x256,7.43701 -dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:ABcd16b16a:f0,,,32x32x3x3,0.202148 -dnnl_verbose,exec,cpu,convolution,jit:avx512_common,forward_inference,src_f32::blocked:aBcd16b:f0 wei_f32::blocked:ABcd16b16a:f0 bia_undef::undef::f0 dst_f32::blocked:aBcd16b:f0,,alg:convolution_direct,mb32_ic32oc32_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,20.7539 -dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:ABcd16b16a:f0,,,32x32x3x3,1.86694 -dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,32x32x256x256,35.9771 +dnnl_verbose,info,oneDNN v2.3.2 (commit e2d45252ae9c3e91671339579e3c0f0061f81d49) +dnnl_verbose,info,cpu,runtime:OpenMP +dnnl_verbose,info,cpu,isa:Intel AVX-512 with Intel DL Boost +dnnl_verbose,info,gpu,runtime:none +dnnl_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:acdb:f0,,,32x32x256x256,8.34912 +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb32a:f0,,,32x32x3x3,0.0229492 +dnnl_verbose,exec,cpu,convolution,brgconv:avx512_core,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:Acdb32a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,,alg:convolution_direct,mb32_ic32oc32_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,10.5898 ``` -You can find step-by-step guidance to do profiling for ONEDNN primitives in [Profiling ONEDNN Operators](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/profiler.html#Profiling-MKLDNN-Operators). +You can find step-by-step guidance to do profiling for oneDNN primitives in [Profiling oneDNN Operators](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/profiler.html#Profiling-oneDNN-Operators).
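The verbose log can also be switched on from inside a Python script instead of the shell. A small sketch, assuming the variable is picked up as long as it is set before the first oneDNN primitive executes:

```
import os
# Assumption: oneDNN reads DNNL_VERBOSE lazily, so setting it at the top of the
# script (before any primitive runs) has the same effect as exporting it in the shell.
os.environ['DNNL_VERBOSE'] = '1'

from mxnet import np
from mxnet.gluon import nn

conv_layer = nn.Conv2D(channels=32, kernel_size=(3, 3), padding=(1, 1))
conv_layer.initialize()

out = conv_layer(np.random.normal(size=(32, 32, 256, 256)))
out.wait_to_read()  # verbose lines are printed when the convolution executes
```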

<h2>Enable MKL BLAS</h2>

@@ -233,61 +231,96 @@ Installing the full MKL installation enables MKL support for all operators under 1. Download and install the latest full MKL version following instructions on the [intel website.](https://software.intel.com/en-us/mkl) You can also install MKL through [YUM](https://software.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/installation/install-using-package-managers/yum-dnf-zypper.html) or [APT](https://software.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/installation/install-using-package-managers/apt.html) Repository. - 2. Run `make -j ${nproc} USE_BLAS=mkl` + 2. Create and navigate to build directory `mkdir build && cd build` - 3. Navigate into the python directory + 3. Run `cmake -DUSE_CUDA=OFF -DUSE_BLAS=mkl ..` - 4. Run `sudo python setup.py install` + 4. Run `make -j` -### Verify whether MKL works - -After MXNet is installed, you can verify if MKL BLAS works well with a single dot layer. + 5. Navigate into the python directory -``` -import mxnet as mx -import numpy as np + 6. Run `sudo python setup.py install` -shape_x = (1, 10, 8) -shape_w = (1, 12, 8) - -x_npy = np.random.normal(0, 1, shape_x) -w_npy = np.random.normal(0, 1, shape_w) +### Verify whether MKL works -x = mx.sym.Variable('x') -w = mx.sym.Variable('w') -y = mx.sym.batch_dot(x, w, transpose_b=True) -exe = y.simple_bind(mx.cpu(), x=x_npy.shape, w=w_npy.shape) +After MXNet is installed, you can verify if MKL BLAS works well with a linear matrix solver. -exe.forward(is_train=False) -o = exe.outputs[0] -t = o.asnumpy() +``` +from mxnet import np +coeff = np.array([[7, 0], [5, 2]]) +y = np.array([14, 18]) +x = np.linalg.solve(coeff, y) +x.wait_to_read() ``` You can open the `MKL_VERBOSE` flag by setting environment variable: ``` export MKL_VERBOSE=1 ``` -Then by running above code snippet, you probably will get the following output message which means `SGEMM` primitive from MKL are called. Layout information and primitive execution performance are also demonstrated in the log message. +Then by running above code snippet, you should get the similar output to message below (`SGESV` primitive from MKL was executed). Layout information and primitive execution performance are also demonstrated in the log message. ``` -Numpy + Intel(R) MKL: THREADING LAYER: (null) -Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime -Numpy + Intel(R) MKL: preloading libiomp5.so runtime -MKL_VERBOSE Intel(R) MKL 2019.0 Update 3 Product build 20190125 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.40GHz lp64 intel_thread NMICDev:0 -MKL_VERBOSE SGEMM(T,N,12,10,8,0x7f7f927b1378,0x1bc2140,8,0x1ba8040,8,0x7f7f927b1380,0x7f7f7400a280,12) 8.93ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40 WDiv:HOST:+0.000 +mkl-service + Intel(R) MKL: THREADING LAYER: (null) +mkl-service + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime +mkl-service + Intel(R) MKL: preloading libiomp5.so runtime +Intel(R) MKL 2020.0 Update 1 Product build 20200208 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.70GHz lp64 intel_thread +MKL_VERBOSE SGESV(2,1,0x7f74d4002780,2,0x7f74d4002798,0x7f74d4002790,2,0) 77.58us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56 ```

<h2>Graph optimization</h2>

-Graph optimization with subgraph is available and enabled by default on master branch. Limitations of this experimental feature are: +To better utilise oneDNN potential, using graph optimizations is recommended. There are few limitations of this feature: - It works only for inference. - +- Only subclasses of HybridBlock and Symbol can call optimize_for API. - This feature will only run on the CPU, even if you're using a GPU-enabled build of MXNet. +If your use case met above conditions, graph optimizations can be enabled by just simple call `optimize_for` API. Example below: +``` +from mxnet import np +from mxnet.gluon import nn + +data = np.random.normal(size=(32,3,224,224)) + +net = nn.HybridSequential() +net.add(nn.Conv2D(channels=64, kernel_size=(3,3))) +net.add(nn.Activation('relu')) +net.initialize() +print("=" * 5, " Not optimized ", "=" * 5) +o = net(data) +o.wait_to_read() + +net.optimize_for(data, backend='ONEDNN') +print("=" * 5, " Optimized ", "=" * 5) +o = net(data) +o.wait_to_read() + +``` + +Above code snippet should produce following output: +``` +===== Not optimized ===== +[15:05:43] ../src/storage/storage.cc:202: Using Pooled (Naive) StorageManager for CPU +dnnl_verbose,info,oneDNN v2.3.2 (commit e2d45252ae9c3e91671339579e3c0f0061f81d49) +dnnl_verbose,info,cpu,runtime:OpenMP +dnnl_verbose,info,cpu,isa:Intel AVX-512 with AVX512BW, AVX512VL, and AVX512DQ extensions +dnnl_verbose,info,gpu,runtime:none +dnnl_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:acdb:f0,,,32x3x224x224,8.87793 +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb64a:f0,,,64x3x3x3,0.00708008 +dnnl_verbose,exec,cpu,convolution,brgconv:avx512_core,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:Acdb64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,,alg:convolution_direct,mb32_ic3oc64_ih224oh222kh3sh1dh0ph0_iw224ow222kw3sw1dw0pw0,91.511 +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb64a:f0,,,64x3x3x3,0.00610352 +dnnl_verbose,exec,cpu,eltwise,jit:avx512_common,forward_inference,data_f32::blocked:acdb:f0 diff_undef::undef::f0,,alg:eltwise_relu alpha:0 beta:0,32x64x222x222,85.4392 +===== Optimized ===== +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:Acdb64a:f0 dst_f32::blocked:abcd:f0,,,64x3x3x3,0.00610352 +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb64a:f0,,,64x3x3x3,0.00585938 +dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:acdb:f0,,,32x3x224x224,3.98999 +dnnl_verbose,exec,cpu,convolution,brgconv:avx512_core,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:Acdb64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,attr-post-ops:eltwise_relu:0:1 ,alg:convolution_direct,mb32_ic3oc64_ih224oh222kh3sh1dh0ph0_iw224ow222kw3sw1dw0pw0,20.46 +``` +After optimization of Convolution + ReLU oneDNN executes both operations within single convolution primitive.
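A block optimized this way can also be serialized, so the fused graph does not have to be rebuilt at deployment time. A minimal sketch, assuming the exported input is named `data` and using an arbitrary file prefix (check the generated `-symbol.json` if your model uses a different input name):

```
from mxnet import np
from mxnet.gluon import nn, SymbolBlock

data = np.random.normal(size=(32, 3, 224, 224))

net = nn.HybridSequential()
net.add(nn.Conv2D(channels=64, kernel_size=(3, 3)))
net.add(nn.Activation('relu'))
net.initialize()

net.optimize_for(data, backend='ONEDNN')
# Writes conv_relu_opt-symbol.json and conv_relu_opt-0000.params (prefix is arbitrary).
net.export('conv_relu_opt')

# Reload the already-fused graph for inference.
deployed = SymbolBlock.imports('conv_relu_opt-symbol.json', ['data'],
                               'conv_relu_opt-0000.params')
out = deployed(data)
print(out)
```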

<h2>Quantization and Inference with INT8</h2>

-Benefiting from oneDNN, MXNet built with oneDNN brings outstanding performance improvement on quantization and inference with INT8 Intel CPU Platform on Intel Xeon Scalable Platform. +MXNet built with oneDNN brings outstanding performance improvement on quantization and inference with INT8 Intel CPU Platform on Intel Xeon Scalable Platform. - [CNN Quantization Examples](https://github.com/apache/incubator-mxnet/tree/master/example/quantization). From 8cc9090d00d34cb4192c8598a609dbcdfb5af086 Mon Sep 17 00:00:00 2001 From: Bartosz Kuncer Date: Tue, 19 Oct 2021 11:12:01 +0200 Subject: [PATCH 4/7] Fix MKL_VERBOSE description --- .../python/tutorials/performance/backend/dnnl/dnnl_readme.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index 6d9e5c86e143..0902e9bd4578 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -253,7 +253,7 @@ x = np.linalg.solve(coeff, y) x.wait_to_read() ``` -You can open the `MKL_VERBOSE` flag by setting environment variable: +You can get the verbose log output from mkl library by setting environment variable: ``` export MKL_VERBOSE=1 ``` From f8ac34a3a28396d940f1de3f452fbbe27b411b66 Mon Sep 17 00:00:00 2001 From: bgawrych Date: Tue, 19 Oct 2021 13:44:07 +0200 Subject: [PATCH 5/7] replace wait_to_read with print --- .../tutorials/performance/backend/dnnl/dnnl_readme.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index 0902e9bd4578..5f11394e914e 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -202,7 +202,7 @@ conv_layer.initialize() data = np.random.normal(size=shape) o = conv_layer(data) -o.wait_to_read() +print(o) ``` More detailed debugging and profiling information can be logged by setting the environment variable 'DNNL_VERBOSE': @@ -250,7 +250,7 @@ from mxnet import np coeff = np.array([[7, 0], [5, 2]]) y = np.array([14, 18]) x = np.linalg.solve(coeff, y) -x.wait_to_read() +print(x) ``` You can get the verbose log output from mkl library by setting environment variable: @@ -287,16 +287,16 @@ net.add(nn.Activation('relu')) net.initialize() print("=" * 5, " Not optimized ", "=" * 5) o = net(data) -o.wait_to_read() +print(o) net.optimize_for(data, backend='ONEDNN') print("=" * 5, " Optimized ", "=" * 5) o = net(data) -o.wait_to_read() +print(o) ``` -Above code snippet should produce following output: +Above code snippet should produce similar output to the following one (printed tensors are omitted) : ``` ===== Not optimized ===== [15:05:43] ../src/storage/storage.cc:202: Using Pooled (Naive) StorageManager for CPU From d4d78d97c26a562f44b86da19abaf030bdf090ce Mon Sep 17 00:00:00 2001 From: Bartosz Kuncer Date: Tue, 26 Oct 2021 10:10:37 +0200 Subject: [PATCH 6/7] Fix dnnl_readme.md --- .../python/tutorials/performance/backend/dnnl/dnnl_readme.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index 
5f11394e914e..1cac07c49d1f 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -117,7 +117,7 @@ LIBRARY_PATH=$(brew --prefix llvm)/lib/ make -j $(sysctl -n hw.ncpu) CC=$(brew -

<h2>Windows</h2>

-On Windows, you can use [Micrsoft Visual Studio 2015](https://www.visualstudio.com/vs/older-downloads/) and [Microsoft Visual Studio 2017](https://www.visualstudio.com/downloads/) to compile MXNet with Intel oneDNN. +On Windows, you can use [Micrsoft Visual Studio 2015](https://www.visualstudio.com/vs/older-downloads/) and [Microsoft Visual Studio 2017](https://www.visualstudio.com/downloads/) to compile MXNet with oneDNN. [Micrsoft Visual Studio 2015](https://www.visualstudio.com/vs/older-downloads/) is recommended. **Visual Studio 2015** From 7cb10c2b63b9c271999f37debe7dd8314c07bdd2 Mon Sep 17 00:00:00 2001 From: Bartosz Kuncer Date: Tue, 26 Oct 2021 17:25:09 +0200 Subject: [PATCH 7/7] Fix link check --- .../python/tutorials/performance/backend/dnnl/dnnl_readme.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md index 1cac07c49d1f..a75e09293bf1 100644 --- a/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md +++ b/docs/python_docs/python/tutorials/performance/backend/dnnl/dnnl_readme.md @@ -221,7 +221,7 @@ dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::bl dnnl_verbose,exec,cpu,convolution,brgconv:avx512_core,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:Acdb32a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,,alg:convolution_direct,mb32_ic32oc32_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,10.5898 ``` -You can find step-by-step guidance to do profiling for oneDNN primitives in [Profiling oneDNN Operators](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/profiler.html#Profiling-oneDNN-Operators). +You can find step-by-step guidance to do profiling for oneDNN primitives in [Profiling oneDNN Operators](https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/profiler.html#Profiling-MKLDNN-Operators).

<h2>Enable MKL BLAS</h2>