diff --git a/docs/how_to/deploy/adreno.rst b/docs/how_to/deploy/adreno.rst index 7f4616fbf797..ed016a3ff744 100644 --- a/docs/how_to/deploy/adreno.rst +++ b/docs/how_to/deploy/adreno.rst @@ -15,41 +15,54 @@ specific language governing permissions and limitations under the License. -Deploy to Adreno GPU -======================================= +Deploy to Adreno™ GPU +===================== -**Authors**: Daniil Barinov, Egor Churaev, Andrey Malyshev +**Authors**: Daniil Barinov, Egor Churaev, Andrey Malyshev, Siva Rama Krishna Introduction ------------ -Adreno is a series of graphics processing unit (GPU) semiconductor +Adreno™ is a series of graphics processing unit (GPU) semiconductor intellectual property cores developed by Qualcomm and used in many of their SoCs. -The Adreno GPU accelerates the rendering of complex geometries to +The Adreno™ GPU accelerates the rendering of complex geometries to deliver high-performance graphics and a rich user experience with low power consumption. -This guide will demonstrate :ref:`the benefits of using textures with Adreno`, -how to :ref:`build TVM with OpenCL` (needed by Adreno devices) and TVM RPC -enabled. It will also provide :ref:`example code` to better understand the differences in compiling and deploying models -for Adreno devices. +TVM supports deep learning acceleration on the Adreno™ GPU through TVM's native OpenCL backend and +also through the OpenCLML backend. The native OpenCL backend of TVM is enhanced to make it +Adreno™ friendly by incorporating texture memory usage and Adreno™ friendly layouts. +OpenCLML is an SDK released by Qualcomm that provides a kernel acceleration library +for most of the deep learning operators. -.. _advantages_of_the_textures: +This guide is organized to demonstrate various design aspects of: -Advantages of the Textures --------------------------- +- :ref:`OpenCL Backend Enhancements` +- :ref:`About OpenCLML` +- :ref:`Build and Deploy` -One of the Adreno's advantages is the clever handling of textures. 
At + +.. _opencl_enhancements: + +OpenCL Backend Enhancements +--------------------------- + +The OpenCL backend of TVM is enhanced to take advantage of Adreno™ specific features like: + +- Texture memory usage. +- Adreno™ friendly activation layouts. +- Brand new schedules to accelerate with the above features. + +One of Adreno™'s advantages is the clever handling of textures. At the moment, TVM is able to benefit from this by having texture support -for Adreno. The graph below shows the Adreno A5x architecture. +for Adreno™. The graph below shows the Adreno™ A5x architecture. -|High-level overview of the Adreno A5x architecture for OpenCL| +|High-level overview of the Adreno™ A5x architecture for OpenCL| -*Fig. 1 High-level overview of the Adreno A5x architecture for OpenCL* +*Fig. 1 High-level overview of the Adreno™ A5x architecture for OpenCL* -*source:* `OpenCL Optimization and Best Practices for Qualcomm Adreno GPUs `_ +*source:* `OpenCL Optimization and Best Practices for Qualcomm Adreno™ GPUs `_ @@ -65,142 +78,479 @@ Reasons of using textures: Overall, with textures, it is possible to achieve a significant performance boost compared to OpenCL buffer based solutions. -.. _building_tvm_for_adreno: +In general, we specify the target as ``target="opencl"`` for a regular OpenCL based target, which generates the kernels as shown below. -Building TVM for Adreno ------------------------ +.. code:: c + + __kernel void tvmgen_default_fused_nn_conv2d_kernel0(__global float* restrict p0, __global double* restrict p1, __global float* restrict conv2d_nhwc) { + // body.. + +The above OpenCL kernel definition has ``__global float*`` pointers which are essentially OpenCL ``buffer`` objects. + +When texture based enhancements are enabled by modifying the target definition to ``target="opencl -device=adreno"`` we can see the generated +kernels using texture backed OpenCL image objects as shown below. + +.. 
code:: c + + __kernel void tvmgen_default_fused_nn_conv2d_kernel0(__write_only image2d_t pad_temp_global_texture, __read_only image2d_t p0) { + // body.. + +*image2d_t* is a built-in OpenCL type that represents a two-dimensional image object and provides several additional functions. +When we use *image2d_t* we read *4 elements at one time*, and it helps to utilize hardware in a more efficient way. + +Please refer to :ref:`Advanced Usage` for more details about generation and inspection of kernel sources. + + +.. _about_openclml: + +About OpenCLML +-------------- + +OpenCLML is an SDK released by Qualcomm that provides accelerated deep learning operators. +These operators are exposed as an extension ``cl_qcom_ml_ops`` to the standard OpenCL specification. +Please refer to `Accelerate your models with our OpenCL ML SDK `_ for more details. + +OpenCLML is integrated into TVM as a `BYOC `_ solution. +OpenCLML operators can use the same context and can be enqueued on the same command queue as used in native OpenCL. +We take advantage of this to avoid any context switching overheads while falling back to native OpenCL. + + +.. _build_deploy: + +TVM for Adreno™ +--------------- + +This section gives instructions about various ways of building and deploying models +to the Adreno™ target. Adreno™ is a remote target which is connected to the host via an ADB connection. +Deploying the compiled model here requires the use of some tools on the host as well as on the target. + +TVM has simplified, user friendly command line tools as well as a +developer centric python API interface for various steps like auto tuning, building and deploying. + + +|Adreno deployment pipeline| + +*Fig.2 Build and Deployment pipeline on Adreno devices* + +The figure above demonstrates a generalized pipeline for the various stages listed below. + +**Model import:** +At this stage we import a model from well known frameworks like Tensorflow, PyTorch, ONNX etc. +This stage converts the given model into TVM's relay module format. 
Alternatively one can build a relay module manually +by using TVM's operator inventory too. The TVM module generated here is a target independent representation of the graph. + +**Auto Tuning:** +At this stage we tune the TVM generated kernels for a specific target. The auto tuning process requires +target device availability, and in the case of a remote target like Adreno™ on an Android device we use the RPC Setup for communication. +Later sections in this guide detail the RPC Setup for Android devices. Auto tuning is not a necessary step for +compilation of a model, but it is necessary for achieving the best performance out of TVM generated kernels. + +**Compilation:** +At this stage we compile the model for a specific target. Given that we auto tuned the module in the previous stage, +TVM compilation makes use of the tuning log to generate the best performing kernels. The TVM compilation process produces artifacts +containing the kernel shared lib, graph definition in json format and parameters binary file in TVM specific format. + +**Deploy (or test run) on Target:** +At this stage we run the TVM compilation output on the target. Deployment is possible from a python +environment using the RPC Setup and also using TVM's native tool, a binary cross compiled for Android. +At this stage we can run the compiled model on the Android target and unit test output correctness and performance aspects. + +**Application Integration:** +This stage is all about integrating the TVM compiled model in applications. Here we discuss +interfacing the tvm runtime from Android (cpp native environment or from JNI) for setting input and getting output. + +**Advanced Usage:** +This section covers advanced user interests like viewing generated source code, altering precision of the module etc. + + +This tutorial covers all the above aspects as part of the below sections. 
+ +- :ref:`Development environment` +- :ref:`RPC Setup` +- :ref:`Commandline tools` +- :ref:`Python interface` +- :ref:`Application Integration` +- :ref:`Advanced Usage` + +.. _development_environment: + + +Development Environment Setup : Automatic +----------------------------------------- +TVM ships a predefined docker container environment with all prerequisites to get started quickly. +You may also refer to :ref:`Manual Environment Setup` for more control over the dependencies. + +For the docker setup the only prerequisite is docker tool availability on the host. -This section gives instructions on how to build the Android part of TVM -with OpenCL and TVM RPC Server in order to deploy models on Adreno. +The below commands build a docker image for adreno. -Since the process of building TVM for Adreno is exactly the same as the -process of building TVM for Android, please refer to these instructions: -`TVM RPC -Server `_. +:: + + ./docker/build.sh ci_adreno + docker tag tvm.ci_adreno ci_adreno + + +Now we can build both host and target utils with the below command. + +:: + + ./tests/scripts/ci.py adreno -i + +To build TVM with the OpenCLML SDK we need to export the OpenCLML SDK as shown below while building + +:: + + export ADRENO_OPENCL= + ./tests/scripts/ci.py adreno -i + +On successful compilation this leaves us in a docker shell. The build leaves two folders -Since there are many required packages for Android, you can use the official Docker Image to build TVM. -For more information refer to this guide: `Deploy the Pretrained Model on Android `_. +* build-adreno: The host side TVM compiler build. +* build-adreno-target : Contains the android target components + + * libtvm_runtime.so : TVM runtime library + * tvm_rpc : The rpc runtime environment tool + * rtvm : A native stand alone tool + +While using the docker environment the android device is shared with the host. Hence, it is required +to have adb version ``1.0.41`` on the host as the docker image uses the same version. 
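The adb version requirement mentioned here can be checked mechanically before starting the containers. The helper below is a hypothetical sketch (not part of TVM or the docker scripts) that parses the output of ``adb version`` and extracts the bridge version to compare against the ``1.0.41`` shipped in the docker image.

```python
import re

def parse_adb_version(output: str) -> str:
    """Extract the bridge version (e.g. '1.0.41') from `adb version` output."""
    match = re.search(r"Android Debug Bridge version (\d+\.\d+\.\d+)", output)
    if match is None:
        raise ValueError("unrecognized `adb version` output")
    return match.group(1)

# Typical output of `adb version` on a host install.
sample = "Android Debug Bridge version 1.0.41\nVersion 33.0.3-8952118"
print(parse_adb_version(sample))  # 1.0.41
```

In practice the sample string would come from running ``adb version`` via ``subprocess`` on the host.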
+ +We can check adb devices availability inside the docker environment too. + +:: + + user@ci-adreno-fpeqs:~$ adb devices + List of devices attached + aaaabbbb device + ccccdddd device + +.. _manual_setup: + +Development Environment Setup : Manual +-------------------------------------- + +The manual build process requires building both host and target components. + +The below commands will configure and build the host compiler + +:: -**Prerequisites**: Android NDK and Android Debug Bridge must -be installed, the desired device must have OpenCL support and Android part of TVM must be built: + mkdir -p build + cd build + cp ../cmake/config.cmake . + + # Enable RPC capability to communicate to remote device. + echo set\(USE_RPC ON\) >> config.cmake + # We use graph executor for any host(x86) side verification of the model. + echo set\(USE_GRAPH_EXECUTOR ON\) >> config.cmake + # Enable backtrace if possible for more debug information on any crash. + echo set\(USE_LIBBACKTRACE AUTO\) >> config.cmake + # The target_host will be llvm. + echo set\(USE_LLVM ON\) >> config.cmake + +Additionally we can push the below config entry to compile with OpenCLML support. + +:: + + export ADRENO_OPENCL= + echo set\(USE_CLML ${ADRENO_OPENCL}\) >> config.cmake + +Now we can build as shown below + +:: + + cmake .. + make + +Finally we can export the python path as + +:: + + export PYTHONPATH=$TVM_HOME/python:${PYTHONPATH} + python3 -c "import tvm" # Verify tvm python package + + +Now, we can configure and build the target components with the below configuration. +The target build requires the Android NDK to be installed. - Read documentation about *Android NDK installation* here: https://developer.android.com/ndk - To get access to adb tools you can see *Android Debug Bridge installation* here: https://developer.android.com/studio/command-line/adb -You can also build the android part of TVM locally. From the root -folder of TVM: :: - mkdir build_android - cd build_android - cmake .. 
-DUSE_OPENCL=ON -DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK_HOME}/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_NATIVE_API_LEVEL=android-28 -DCMAKE_FIND_ROOT_PATH_MODE_PACKAGE=ON -DANDROID_STL=c++_static -DUSE_CPP_RPC=ON - make -jN tvm_runtime tvm_rpc + mkdir -p build-adreno + cd build-adreno + cp ../cmake/config.cmake . + # Enable OpenCL backend. + echo set\(USE_OPENCL ON\) >> config.cmake + # Enable RPC functionality. + echo set\(USE_RPC ON\) >> config.cmake + # Build tvm_rpc tool that runs on target device. + echo set\(USE_CPP_RPC ON\) >> config.cmake + # Build native rtvm deploy tool. + echo set\(USE_CPP_RTVM ON\) >> config.cmake + # We use graph executor for deploying on devices like Android. + echo set\(USE_GRAPH_EXECUTOR ON\) >> config.cmake + # Backtrace enablement if possible. + echo set\(USE_LIBBACKTRACE AUTO\) >> config.cmake + # Adreno supports 32bit alignment for OpenCL allocations rather than 64bit. + echo set\(USE_KALLOC_ALIGNMENT 32\) >> config.cmake + + # Android build related defines. + echo set\(ANDROID_ABI arm64-v8a\) >> config.cmake + echo set\(ANDROID_PLATFORM android-28\) >> config.cmake + echo set\(MACHINE_NAME aarch64-linux-gnu\) >> config.cmake + +Additionally we can push the below config to compile with OpenCLML support. -where **N** is the number of cores available on your *CPU*. :: -At this stage you have built TVM for Adreno. + export ADRENO_OPENCL= + echo set\(USE_CLML "${ADRENO_OPENCL}"\) >> config.cmake + echo set\(USE_CLML_GRAPH_EXECUTOR "${ADRENO_OPENCL}"\) >> config.cmake -.. _build_and_deploy_model_for_adreno: +For the Android target build ``ANDROID_NDK_HOME`` is a dependency and we should have the same in the environment variable. 
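Both the host and target builds above depend on environment variables (``ANDROID_NDK_HOME``, and ``ADRENO_OPENCL`` for OpenCLML). A quick sanity check before invoking cmake can save a failed configure run; the helper below is an illustrative sketch, not a TVM API, and takes the environment as a plain mapping so it is easy to test.

```python
def missing_build_vars(env, required=("ANDROID_NDK_HOME",), with_clml=False):
    """Return names of build-critical variables that are absent or empty in `env`."""
    needed = list(required)
    if with_clml:
        # OpenCLML builds additionally need the SDK location.
        needed.append("ADRENO_OPENCL")
    return [name for name in needed if not env.get(name)]

print(missing_build_vars({"ANDROID_NDK_HOME": "/opt/android-ndk"}))  # []
print(missing_build_vars({}, with_clml=True))  # ['ANDROID_NDK_HOME', 'ADRENO_OPENCL']
```

Passing ``os.environ`` instead of a literal dict checks the real shell environment.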
+Below commands will build Adreno™ target components -Build and deploy model for Adreno ---------------------------------- +:: -In this section we will focus on target, needed to compile and deploy models for Adreno, demonstrate -the differences in generated kernels with and without textures and, in addition, the -possibility of choosing a different precision for model compilation will -be considered. + cmake -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_HOME}/build/cmake/android.toolchain.cmake" \ + -DANDROID_ABI=arm64-v8a \ + -DANDROID_PLATFORM=android-28 \ + -DCMAKE_SYSTEM_VERSION=1 \ + -DCMAKE_FIND_ROOT_PATH="${ADRENO_OPENCL}" \ + -DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \ + -DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \ + -DCMAKE_CXX_COMPILER="${ANDROID_NDK_HOME}/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android28-clang++" \ + -DCMAKE_C_COMPILER="${ANDROID_NDK_HOME}/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android28-clang" \ + -DMACHINE_NAME="aarch64-linux-gnu" .. -For the complete step-py-step process of compiling and deploying models on -Adreno, including selection of precision, running the inference of the -model, getting the predictions, and measuring the performance please refer to this tutorial: `How To Deploy model on Adreno `_ + make tvm_runtime tvm_rpc rtvm -|Android deployment pipeline| -*Fig.2 Deployment pipeline on Adreno devices* +.. _rpc_setup: -The figure above demonstrates a generalized pipeline for deploying and running neural network models on android devices. -As can be seen from the figure, the compiled model has a set_input() and a run() methods, -which *prepare the inputs* for inference and *execute the inference* on the remote device using the Graph Executor runtime module. +RPC Setup +--------- -Adreno target -~~~~~~~~~~~~~ +RPC Setup allows remote target access over TCP/IP networking interface. 
RPC Setup is essential for the auto tuning stage as tuning +involves running auto generated kernels on a real device and optimizing them using a machine learning approach. Please refer to +`Auto-Tune with Templates and AutoTVM `_ for more details about AutoTVM. -Normally, when compiling models for Android using OpenCL, the -corresponding target is used +RPC Setup is also useful to deploy the compiled model to a remote device from the python interface or the ``tvmc`` tool from the host device. -.. code:: python +RPC Setup has multiple components as listed below. - target="opencl" +**TVM Tracker:** +TVM tracker is a host side daemon that manages remote devices and serves them to host side applications. Applications +can connect to this tracker and acquire a remote device handle to communicate. -Using Adreno, we want to get all the benefits of textures, so we have to -use the following target to generate texture leveraging kernels +**TVM RPC:** +TVM RPC is a native application that runs on the remote device (Android in our case) and registers itself to the TVM Tracker +running on the host. -.. code:: python - target="opencl -device=adreno" +Hence, for an RPC based setup we will have the above components running on the host and target device. The below sections explain how to set up the same +manually and also inside docker using automated tools. -Let's write a simple model with one convolutional (conv2d) layer and take a look at generated kernels for these -two targets +**Automated RPC Setup:** +Here we will explain how to set up RPC in the docker environment. -.. code:: python +The below command launches the tracker in the docker environment, where the tracker listens on port 9190. 
- import tvm - from tvm import relay - import numpy as np +:: - input_shape=(1, 56, 56, 32) - filter_shape=(3, 3, 32, 64) - filter = np.random.rand(*filter_shape) + ./tests/scripts/ci.py adreno -i # Launch a new shell on the adreno docker + source tests/scripts/setup-adreno-env.sh -e tracker -p 9190 - dtype="float32" - input = tvm.relay.var("input", shape=input_shape, dtype=dtype) - weight = tvm.relay.var("weight", shape=filter_shape, dtype=dtype) - D = relay.nn.conv2d(input, weight, padding=(1, 1), data_layout="NHWC", kernel_layout="HWIO", out_dtype=dtype) +Now, the below command can run TVM RPC on a remote android device with id ``abcdefgh``. - mod = relay.Function([input, weight], D) - params = { - "weight": tvm.nd.array(filter) - } -Now compile our model with the classic OpenCL target and print its modules: +:: -.. code:: python + ./tests/scripts/ci.py adreno -i # Launch a new shell on adreno docker. + source tests/scripts/setup-adreno-env.sh -e device -p 9190 -d abcdefgh - target="opencl" +Further, the below command can be used to query the RPC setup details from any other docker terminal. - with tvm.transform.PassContext(opt_level=3): - graph, lib, params = relay.build_module.build(mod, target, params=params) - print(lib.imported_modules[0].get_source()) +:: -Notice that the generated convolution kernel has pointers in -the initialization of the function. The kernels generated with the above target are buffer-based. + ./tests/scripts/ci.py adreno -i # Launch a new shell on adreno docker. + source tests/scripts/setup-adreno-env.sh -e query -p 9190 -.. code:: c - __kernel void tvmgen_default_fused_nn_conv2d_kernel0(__global float* restrict p0, __global double* restrict p1, __global float* restrict conv2d_nhwc) { - // body.. +**Manual RPC Setup:** + +Please refer to the tutorial +`How To Deploy model on Adreno `_ +for manual RPC environment setup. 
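Host side scripts later in this guide assume the tracker address ``127.0.0.1:9190``. The manual setup tutorial linked above exports it via ``TVM_TRACKER_HOST`` and ``TVM_TRACKER_PORT``; the snippet below is a hypothetical convenience helper (not a TVM API) for reading those variables with the defaults used throughout this guide.

```python
import os

def tracker_address(env=None):
    """Return (host, port) of the RPC tracker, defaulting to 127.0.0.1:9190."""
    env = os.environ if env is None else env
    host = env.get("TVM_TRACKER_HOST", "127.0.0.1")
    port = int(env.get("TVM_TRACKER_PORT", "9190"))
    return host, port

print(tracker_address({}))                            # ('127.0.0.1', 9190)
print(tracker_address({"TVM_TRACKER_PORT": "9191"}))  # ('127.0.0.1', 9191)
```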
+ +This concludes the RPC Setup and we have the rpc-tracker available on host ``127.0.0.1`` (rpc-tracker) and port ``9190`` (rpc-port). + + +.. _commandline_interface: + +Commandline Tools +----------------- + +Here we describe the entire compilation process using command line tools. TVM has the command line utility +`tvmc `_ to perform +model import, auto tuning, compilation and deploy over rpc. +`tvmc `_ has many options to explore and try. + +**Model Import & Tuning:** +Use the below command to import a model from any framework and auto tune the same. +Here we use a model from Keras and it uses the RPC setup for tuning and finally generates the tuning log file +``keras-resnet50.log``. + +:: + + python3 -m tvm.driver.tvmc tune --target="opencl -device=adreno" \ + --target-host="llvm -mtriple=aarch64-linux-gnu" \ + resnet50.h5 -o \ + keras-resnet50.log \ + --early-stopping 0 --repeat 30 --rpc-key android \ + --rpc-tracker 127.0.0.1:9190 --trials 1024 \ + --tuning-records keras-resnet50-records.log --tuner xgb + +**Model Compilation:** + +Use the below command for compiling the model and producing TVM compiler outputs. + +:: + + python3 -m tvm.driver.tvmc compile \ + --cross-compiler ${ANDROID_NDK_HOME}/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android28-clang \ + --target="opencl, llvm" --target-llvm-mtriple aarch64-linux-gnu --target-opencl-device adreno \ + --tuning-records keras-resnet50.log -o keras-resnet50.tar resnet50.h5 + +When enabling OpenCLML offloading we need to add the target ``clml`` as shown below. The tuning log is valid for OpenCLML offloading also, +as the OpenCL path is the fallback option for any operator that didn't go through the OpenCLML path. The tuning log will be used for such operators. 
+ +:: + + python3 -m tvm.driver.tvmc compile \ + --cross-compiler ${ANDROID_NDK_HOME}/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android28-clang \ + --target="opencl, clml, llvm" --target-llvm-mtriple aarch64-linux-gnu --target-opencl-device adreno \ + --tuning-records keras-resnet50.log -o keras-resnet50.tar resnet50.h5 + +On successful compilation, the above command produces ``keras-resnet50.tar``. +It is a compressed archive with the kernel shared lib (mod.so), graph json (mod.json) and params binary (mod.params). + +**Deploy & Run on Target:** + +Running the compiled model on an Android target is possible via RPC as well as native deployment. + +We can use the below tvmc command to deploy on a remote target via the RPC based setup. + +:: + + python3 -m tvm.driver.tvmc run --device="cl" keras-resnet50.tar \ + --rpc-key android --rpc-tracker 127.0.0.1:9190 --print-time + +A `tvmc `_ based run has more options +to initialize the input in various modes like fill, random etc. + +``tvmc`` based deployment is generally a quick verification of the compiled model on the target from a remote host via the RPC setup. + +Production generally uses a native deployment environment like Android JNI or CPP native environments. +Here we need to use the cross compiled ``tvm_runtime`` interface to deploy the tvm compilation output, i.e. ``TVMPackage``. + +TVM has a standalone tool ``rtvm`` to deploy and run the model natively on an ADB shell. The build process produces this tool under build-adreno-target. +Please refer to `rtvm `_ for more details about this tool. + +While integrating inside an existing Android application TVM has multiple options. For JNI or CPP native we may use the `C Runtime API `_ +You may refer to ``rtvm``'s simplified interface `TVMRunner `_ also. + +.. 
_python_interface: + +Python Interface +---------------- + +This section explains importing, auto tuning, compiling and running a model using the python interface. +TVM has a high level interface through the ``tvmc`` abstraction as well as a low level relay api. We will discuss both of these in detail. + +**TVMC Interface:** + +While using the ``tvmc`` python interface we first load a model, which produces a ``TVMCModel``. The ``TVMCModel`` will be used for Auto Tuning to produce a tuning cache. +The compilation process uses the ``TVMCModel`` and the tuning cache (optional) to produce a ``TVMCPackage``. Now, the ``TVMCPackage`` can be saved to the file system or +used to deploy and run on the target device. + +Please refer to the tutorial for the same +`How To Deploy model on Adreno using TVMC `_ + +A saved ``TVMCPackage`` can be used for native deployment using the ``rtvm`` utility too. + +Also, please refer to the `tvmc `_ +documentation for more details about the api interface. + +**Relay Interface:** + +The relay api interface gives lower level access to the tvm compiler interface. +Similar to the ``tvmc`` interface, the relay api provides various frontend APIs to convert models to a relay ``Module``. +The relay ``Module`` will be used for all kinds of transforms like precision conversions, CLML offloading and other custom transforms if any. +The resulting Module will be used for Auto Tuning too. Finally, we use the ``relay.build`` API to generate the library module. +From this library module, we can export compilation artifacts like the module shared library (mod.so), params (mod.params) and json graph (mod.json). +This library module will be used to create the graph runtime to deploy and run on the target device. + +Please refer to the tutorial `How To Deploy model on Adreno `_ +for a step by step explanation of the same. + +Additionally, TVM also supports a Java interface through `TVM4J `_ +.. 
_application_integration: -Now take a look at “opencl -device=adreno” target: +Application Integration +----------------------- + +TVM compilation output is represented as module shared lib (mod.so), graph json(mod.json) and params (mod.params). +Archived representation of TVMPackage is also contains the same. + +In general a CPP/C based interface will be sufficient for any Android application integration. + +TVM natively expose ``c_runtime_api`` for loading a TVM compiled module and run the same. + +Alternatively one may refer to `cpp_rtvm `_ +``TVMRunner`` interface too for further simplified version of the same. + + + +.. _advanced_usage: + +Advanced Usage +-------------- + +This section details some of the advanced usage and additional information while using Adreno™ target on TVM. + +Generated Source Inspection +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Apart from standard tvm compilation artifacts kernel library (mod.so), graph (mod.json) and params (mod.params) +we can also generate opencl kernel source, clml offloaded graph ...etc from lib handle as shown below. +TVM compilation output is organized as a TVM module and many other TVM modules imported into it. + +Below snippet can dump CLML sub graphs in json format. .. code:: python - target="opencl -device=adreno" + # Look for "clml" typed module imported. + clml_modules = list(filter(lambda mod: mod.type_key == "clml", lib.get_lib().imported_modules)) + # Loop through all clml sub graphs and dump the json formatted CLML sub graphs. + for cmod in clml_modules: + print("CLML Src:", cmod.get_source()) - with tvm.transform.PassContext(opt_level=3): - graph, lib, params = relay.build_module.build(mod, target, params=params) - print(lib.imported_modules[0].get_source()) -The kernels generated this way is actually working with 2d arrays, leveraging textures +Similarly, below snippet can extract opencl kernel source from the compiled TVM module. -.. code:: c +.. 
code:: python - __kernel void tvmgen_default_fused_nn_conv2d_kernel0(__write_only image2d_t pad_temp_global_texture, __read_only image2d_t p0) { - // body.. + # Similarly we can dump open kernel source too as shown below + # Look for "opencl" typed module imported. + opencl_modules = list(filter(lambda mod: mod.type_key == "opencl", lib.get_lib().imported_modules)) + # Now dump kernel source for each OpenCL targetted sub graph. + for omod in opencl_modules: + print("OpenCL Src:", omod.get_source()) -*image2d_t* is a built-in OpenCL types that represents two-dimensional image object and provides several additional functions. -When we use *image2d_t* we read *4 elements at one time*, and it helps to utilize hardware in a more efficient way. Precisions ~~~~~~~~~~ @@ -214,30 +564,26 @@ We can choose from *float16*, *float16_acc32* (Mixed Precision), *float32* (stan To leverage the GPU hardware capabilities and utilize the benefits of half precision computation and memory management, we can convert an original model having floating points operation to a model operating with half precision. Choosing lower precision will positively affect the performance of the model, but it may also have a decrease in the accuracy of the model. -To do the conversion you need to write a simple conversion function and specify the *dtype* value of "float16" before calling the function: + +To do the conversion you need to call adreno specific transformation API as soon as relay module is generated through any frontend. .. 
code:: python - def convert_to_dtype(mod, dtype): - # downcast to float16 - if dtype == "float16": - global conv2d_acc = "float16" - from tvm.ir import IRModule - mod = IRModule.from_expr(mod) - seq = tvm.transform.Sequential( - [ - relay.transform.InferType(), - relay.transform.ToMixedPrecision() - ] - ) - with tvm.transform.PassContext(opt_level=3): - mod = seq(mod) - return mod - - dtype="float16" - mod = convert_to_dtype(mod["main"], dtype) - -We then can compile our model in any convinient way + from tvm.driver.tvmc.transform import apply_graph_transforms + mod = apply_graph_transforms( + mod, + { + "mixed_precision": True, + "mixed_precision_ops": ["nn.conv2d", "nn.dense"], + "mixed_precision_calculation_type": "float16", + "mixed_precision_acc_type": "float16", + }, + ) + + +``tvm.driver.tvmc.transform.apply_graph_transforms`` is a simplified API over the ``ToMixedPrecision`` pass to get the desired precision. + +We can then compile our model in any convenient way .. code:: python @@ -246,84 +592,59 @@ We then can compile our model in any convinient way mod, target_host=target_host, target=target, params=params ) +While using the ``tvmc`` python interface, the below arguments enable precision conversion to float16. + +.. code:: python + + mixed_precision = True, + mixed_precision_ops = ["nn.conv2d", "nn.dense"], + mixed_precision_calculation_type = "float16", + mixed_precision_acc_type = "float16" + +Similarly, the ``tvmc`` command line interface has the below listed options. + +.. code:: bash + + --mixed-precision + --mixed-precision-ops nn.conv2d nn.dense + --mixed-precision-calculation-type float16 + --mixed-precision-acc-type float16 + + **float16_acc32 (Mixed Precision)** -ToMixedPrecision pass traverse over the network and split network to clusters of ops dealing with float or float16 data types. +The ``ToMixedPrecision`` pass traverses the network and splits it into clusters of ops dealing with float or float16 data types. 
The clusters are defined by three types of operations: - Operations always be converted into float16 data type -- Operations which can be converted if they follow by converted cluster +- Operations which can be converted if they are followed by a converted cluster - Operations never be converted to the float16 data type This list is defined in the ToMixedPrecision implementation here `relay/transform/mixed_precision.py `_ -and can be overridden by user +and can be overridden by the user. -In some cases, we want higher precision in accumulation than the input data. -This is supported, for example, for conv2d and dense operations. To override accumulation type you need to register -function with ``@register_mixed_precision_conversion`` decorator to modify parameters of ``ToMixedPrecision`` conversion - -.. code:: python +The ``ToMixedPrecision`` method is a pass to convert an FP32 relay graph into an FP16 version (with +FP16 or FP32 accumulation dtypes). Doing this transformation is useful for reducing model size +as it halves the expected size of the weights (FP16_acc16 case). 
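The size trade-off described above can be seen with plain numpy, independently of TVM: casting FP32 weights to FP16 halves their footprint while the representation error stays small for typically scaled weights.

```python
import numpy as np

rng = np.random.default_rng(0)
w32 = rng.standard_normal((64, 64)).astype("float32")
w16 = w32.astype("float16")  # what a float16 weight conversion amounts to

print(w32.nbytes, w16.nbytes)  # 16384 8192 -> half the memory
max_err = float(np.max(np.abs(w32 - w16.astype("float32"))))
print(max_err < 1e-2)  # True: representation error is small for unit-scale weights
```

Accumulating in FP32 (the float16_acc32 mode) keeps this storage saving for weights while avoiding error growth in long reduction chains.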
- from tvm.relay.op import register_mixed_precision_conversion - - conv2d_acc = "float32" - - # Pick a priority > 10 to overwrite defaults, higher priorities take precedence - @register_mixed_precision_conversion("nn.conv2d", level=11) - def conv2d_mixed_precision_rule(call_node: "relay.Call", mixed_precision_type: str): - global conv2d_acc - return [ - # always do main calculation in mixed_precision_type - relay.transform.mixed_precision.MIXED_PRECISION_ALWAYS, - # the dtype for the accumulator - conv2d_acc, - # the output dtype for the operation (usually fp16) - mixed_precision_type, - ] - - # Same for dense - @register_mixed_precision_conversion("nn.dense", level=11) - def conv2d_mixed_precision_rule(call_node: "relay.Call", mixed_precision_type: str): - global conv2d_acc - return [ - relay.transform.mixed_precision.MIXED_PRECISION_ALWAYS, - conv2d_acc, - mixed_precision_type, - ] - -Now we need to modify the conversion function by adding some logical "forks" and ToMixedPrecision() call, -then create a Relay graph from desired model in any convinient way and obtain **mod** (which is IR representation of the model), -after which we can convert it to the required **dtype** and then assemble our model sequentialy +``ToMixedPrecision`` pass usage is simplified into a simple call as shown below for usage. .. 
code:: python - def convert_to_dtype(mod, dtype): - # downcast to float16 - if dtype == "float16" or dtype == "float16_acc32": - global conv2d_acc - conv2d_acc = "float16" if dtype == "float16" else "float32" - from tvm.ir import IRModule - mod = IRModule.from_expr(mod) - seq = tvm.transform.Sequential( - [ - relay.transform.InferType(), - relay.transform.ToMixedPrecision() - ] - ) - with tvm.transform.PassContext( - config={"relay.ToMixedPrecision.keep_orig_output_dtype": True}, - opt_level=3): - mod = seq(mod) - return mod - - dtype="float16_acc32" - mod = convert_to_dtype(mod["main"], dtype) - dtype = "float32" if dtype == "float32" else "float16" + from tvm.driver.tvmc.transform import apply_graph_transforms + mod = apply_graph_transforms( + mod, + { + "mixed_precision": True, + "mixed_precision_ops": ["nn.conv2d", "nn.dense"], + "mixed_precision_calculation_type": "float16", + "mixed_precision_acc_type": "float32", + }, + ) -The ``ToMixedPrecision`` method is a pass to convert an FP32 relay graph into an FP16 version (with -FP16 or FP32 accumulation dtypes). Doing this transformation is useful for reducing model size -as it halves the expected size of the weights (FP16_acc16 case). -From this point onwards, we can compile our model as normal +``tvm.driver.tvmc.transform.apply_graph_transforms`` is a simplified API over the ``ToMixedPrecision`` pass to get the desired precision. + +We can then compile our model in any convenient way .. code:: python @@ -332,5 +653,24 @@ From this point onwards, we can compile our model as normal mod, target_host=target_host, target=target, params=params ) -.. |High-level overview of the Adreno A5x architecture for OpenCL| image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/how-to/adreno_architecture.png -.. 
|Android deployment pipeline| image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/how-to/android_deployment_pipeline.jpg
+While using the ``tvmc`` python interface, the below arguments enable precision conversion to float16.
+
+.. code:: python
+
+    mixed_precision = True,
+    mixed_precision_ops = ["nn.conv2d", "nn.dense"],
+    mixed_precision_calculation_type = "float16",
+    mixed_precision_acc_type = "float32"
+
+Similarly, the ``tvmc`` command line interface has the below listed options.
+
+.. code:: bash
+
+    --mixed-precision
+    --mixed-precision-ops nn.conv2d nn.dense
+    --mixed-precision-calculation-type float16
+    --mixed-precision-acc-type float32
+
+
+.. |High-level overview of the Adreno™ A5x architecture for OpenCL| image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/how-to/adreno_architecture.png
+.. |Adreno deployment pipeline| image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/how-to/Adreno-Deployment-Pipeline.jpg
diff --git a/gallery/how_to/deploy_models/deploy_model_on_adreno.py b/gallery/how_to/deploy_models/deploy_model_on_adreno.py
index c120c5339b62..c2ba189a6715 100644
--- a/gallery/how_to/deploy_models/deploy_model_on_adreno.py
+++ b/gallery/how_to/deploy_models/deploy_model_on_adreno.py
@@ -18,9 +18,9 @@
 """
 .. _tutorial-deploy-model-on-adreno:

-Deploy the Pretrained Model on Adreno
-=======================================
-**Author**: Daniil Barinov
+Deploy the Pretrained Model on Adreno™
+======================================
+**Author**: Daniil Barinov, Siva Rama Krishna

 This article is a step-by-step tutorial to deploy pretrained Pytorch ResNet-18
 model on Adreno (on different precisions).
@@ -53,11 +53,17 @@
 #
 #   adb devices
 #
+# Set the android device to use, if you have several devices connected to your computer.
+#
+# .. code-block:: bash
+#
+#   export ANDROID_SERIAL=
+#
 # Then to upload these two files to the device you should use:
 #
 # ..
code-block:: bash # -# adb -s push {libtvm_runtime.so,tvm_rpc} /data/local/tmp +# adb push {libtvm_runtime.so,tvm_rpc} /data/local/tmp # # At this moment you will have «libtvm_runtime.so» and «tvm_rpc» on path /data/local/tmp on your device. # Sometimes cmake can’t find «libc++_shared.so». Use: @@ -70,7 +76,7 @@ # # .. code-block:: bash # -# adb -s push libc++_shared.so /data/local/tmp +# adb push libc++_shared.so /data/local/tmp # # We are now ready to run the TVM RPC Server. # Launch rpc_tracker with following line in 1st console: @@ -83,12 +89,12 @@ # # .. code-block:: bash # -# adb -s reverse tcp:9190 tcp:9190 -# adb -s forward tcp:9090 tcp:9090 -# adb -s forward tcp:9091 tcp:9091 -# adb -s forward tcp:9092 tcp:9092 -# adb -s forward tcp:9093 tcp:9093 -# adb -s shell LD_LIBRARY_PATH=/data/local/tmp /data/local/tmp/tvm_rpc server --host=0.0.0.0 --port=9090 --tracker=127.0.0.1:9190 --key=android --port-end=9190 +# adb reverse tcp:9190 tcp:9190 +# adb forward tcp:5000 tcp:5000 +# adb forward tcp:5002 tcp:5001 +# adb forward tcp:5003 tcp:5002 +# adb forward tcp:5004 tcp:5003 +# adb shell LD_LIBRARY_PATH=/data/local/tmp /data/local/tmp/tvm_rpc server --host=0.0.0.0 --port=5000 --tracker=127.0.0.1:9190 --key=android --port-end=5100 # # Before proceeding to compile and infer model, specify TVM_TRACKER_HOST and TVM_TRACKER_PORT # @@ -115,6 +121,73 @@ # android 1 1 0 # ---------------------------------- +################################################################# +# Configuration +# ------------- + +import os +import torch +import torchvision +import tvm +from tvm import te +from tvm import relay, rpc +from tvm.contrib import utils, ndk +from tvm.contrib import graph_executor +from tvm.relay.op.contrib import clml +from tvm import autotvm + +# Below are set of configuration that controls the behaviour of this script like +# local run or device run, target definitions, dtype setting and auto tuning enablement. +# Change these settings as needed if required. 
+
+# Adreno devices are efficient with float16 compared to float32.
+# Given that the expected output is not affected by lowering the precision,
+# it is advisable to use lower precision.
+# We have a helper API to make the precision conversion simple and
+# it supports dtype with "float16" and "float16_acc32" modes.
+# Let's choose "float16" for calculation and "float32" for accumulation.
+
+calculation_dtype = "float16"
+acc_dtype = "float32"
+
+# Specify the Adreno target before compiling to generate texture
+# leveraging kernels and get all the benefits of textures.
+# Note: This generated example runs on our x86 server for demonstration.
+# If running it on the Android device, we need to
+# specify its instruction set. Set :code:`local_demo` to False if you want
+# to run this tutorial with a real device over rpc.
+local_demo = True
+
+# By default the CPU target will execute.
+# Select 'cpu', 'opencl' or 'opencl -device=adreno'.
+test_target = "cpu"
+
+# Change target configuration.
+# Run `adb shell cat /proc/cpuinfo` to find the arch.
+arch = "arm64"
+target = tvm.target.Target("llvm -mtriple=%s-linux-android" % arch)
+
+# Auto tuning is a compute intensive and time consuming task,
+# hence it is disabled for the default run. Please enable it if required.
+is_tuning = False
+tune_log = "adreno-resnet18.log"
+
+# To enable the OpenCLML accelerated operator library.
+enable_clml = False + +################################################################# +# Get a PyTorch Model +# ------------------- +# Get resnet18 from torchvision models +model_name = "resnet18" +model = getattr(torchvision.models, model_name)(pretrained=True) +model = model.eval() + +# We grab the TorchScripted model via tracing +input_shape = [1, 3, 224, 224] +input_data = torch.randn(input_shape) +scripted_model = torch.jit.trace(model, input_data).eval() + ################################################################# # Load a test image # ----------------- @@ -146,86 +219,43 @@ img = np.expand_dims(img, 0) ################################################################# -# Load pretrained Pytorch model -# ----------------------------- -# Create a Relay graph from a Pytorch ResNet-18 model -import os -import torch -import torchvision -import tvm -from tvm import te -from tvm import relay, rpc -from tvm.contrib import utils, ndk -from tvm.contrib import graph_executor - -model_name = "resnet18" -model = getattr(torchvision.models, model_name)(pretrained=True) -model = model.eval() - -# We grab the TorchScripted model via tracing -input_shape = [1, 3, 224, 224] -input_data = torch.randn(input_shape) -scripted_model = torch.jit.trace(model, input_data).eval() - +# Convert PyTorch model to Relay module +# ------------------------------------- +# TVM has frontend api for various frameworks under relay.frontend and now +# for pytorch model import we have relay.frontend.from_pytorch api. 
# Input name can be arbitrary
 input_name = "input0"
 shape_list = [(input_name, img.shape)]
+
 mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)

 #################################################################
 # Precisions
 # ----------
-# Since TVM support Mixed Precision, we need to register mixed_precision_conversion:
-from tvm.relay.op import register_mixed_precision_conversion
-
-conv2d_acc = "float32"
-
-
-@register_mixed_precision_conversion("nn.conv2d", level=11)
-def conv2d_mixed_precision_rule(call_node: "relay.Call", mixed_precision_type: str):
-    global conv2d_acc
-    return [
-        relay.transform.mixed_precision.MIXED_PRECISION_ALWAYS,
-        conv2d_acc,
-        mixed_precision_type,
-    ]
-
-@register_mixed_precision_conversion("nn.dense", level=11)
-def conv2d_mixed_precision_rule(call_node: "relay.Call", mixed_precision_type: str):
-    global conv2d_acc
-    return [
-        relay.transform.mixed_precision.MIXED_PRECISION_ALWAYS,
-        conv2d_acc,
-        mixed_precision_type,
-    ]
+# Adreno devices are efficient with float16 compared to float32.
+# Given that the expected output is not affected by lowering the precision,
+# it is advisable to use lower precision.
+# TVM supports Mixed Precision through the ToMixedPrecision transformation pass.
+# We may need to register precision rules like the precision type, accumulation
+# datatype etc. for the required operators to override the default settings.
+# The below helper API simplifies the precision conversion across the module.
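The float16 / float16_acc32 trade-off described in the comments above can be illustrated without TVM. The sketch below (illustrative only, not part of this patch) sums 4096 ones: a float16 accumulator stalls once the running total reaches 2048, while a float32 accumulator stays exact — which is exactly why the "float16_acc32" mode keeps accumulation in float32.

```python
import numpy as np

# Illustrative only (not part of this patch): why a float32 accumulator
# matters when the calculation dtype is float16 ("float16_acc32" mode).
x = np.ones(4096, dtype=np.float16)

acc16 = np.float16(0.0)
for v in x:
    acc16 = np.float16(acc16 + v)  # accumulate in float16

acc32 = np.float32(0.0)
for v in x:
    acc32 = acc32 + np.float32(v)  # accumulate in float32

# float16 spacing above 2048 is 2 ulp, so 2048 + 1 rounds back to 2048
print(acc16)  # 2048.0
print(acc32)  # 4096.0
```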
-################################################################# -# and also define the conversion function itself -def convert_to_dtype(mod, dtype): - # downcast to float16 - if dtype == "float16" or dtype == "float16_acc32": - global conv2d_acc - conv2d_acc = "float16" if dtype == "float16" else "float32" - from tvm.ir import IRModule - - mod = IRModule.from_expr(mod) - seq = tvm.transform.Sequential( - [relay.transform.InferType(), relay.transform.ToMixedPrecision()] - ) - with tvm.transform.PassContext(opt_level=3): - mod = seq(mod) - return mod +# Calculation dtype is set to "float16" and accumulation dtype is set to "float32" +# in configuration section above. +from tvm.driver.tvmc.transform import apply_graph_transforms -################################################################# -# Let's choose "float16_acc32" for example. -dtype = "float16_acc32" -mod = convert_to_dtype(mod["main"], dtype) -dtype = "float32" if dtype == "float32" else "float16" - -print(mod) +mod = apply_graph_transforms( + mod, + { + "mixed_precision": True, + "mixed_precision_ops": ["nn.conv2d", "nn.dense"], + "mixed_precision_calculation_type": calculation_dtype, + "mixed_precision_acc_type": acc_dtype, + }, +) ################################################################# # As you can see in the IR, the architecture now contains cast operations, which are @@ -233,46 +263,109 @@ def convert_to_dtype(mod, dtype): # You can also use "float16" or "float32" precisions as other dtype options. ################################################################# -# Compile the model with relay -# ---------------------------- -# Specify Adreno target before compiling to generate texture -# leveraging kernels and get all the benefits of textures -# Note: This generated example running on our x86 server for demonstration. -# If running it on the Android device, we need to -# specify its instruction set. 
Set :code:`local_demo` to False if you want
-# to run this tutorial with a real device.
+# Prepare TVM Target
+# ------------------

-local_demo = True
+# This generated example runs on our x86 server for demonstration.

-# by default on CPU target will execute.
-# select 'cpu', 'opencl' and 'vulkan'
-test_target = "cpu"
-
-# Change target configuration.
-# Run `adb shell cat /proc/cpuinfo` to find the arch.
-arch = "arm64"
-target = tvm.target.Target("llvm -mtriple=%s-linux-android" % arch)
+# To deploy and run on a real target over RPC please set :code:`local_demo` to False in the above configuration section.
+# Also, :code:`test_target` is set to :code:`llvm` in this example to make it compatible for x86 demonstration.
+# Please change it to :code:`opencl` or :code:`opencl -device=adreno` for an RPC target in the configuration above.

 if local_demo:
     target = tvm.target.Target("llvm")
-elif test_target == "opencl":
-    target = tvm.target.Target("opencl", host=target)
-elif test_target == "vulkan":
-    target = tvm.target.Target("vulkan", host=target)
+elif "opencl" in test_target:
+    target = tvm.target.Target(test_target, host=target)

-with tvm.transform.PassContext(opt_level=3):
-    lib = relay.build(mod, target=target, params=params)
+##################################################################
+# AutoTuning
+# ----------
+# The below few instructions can auto tune the relay module with xgboost being the tuner algorithm.

-#################################################################
-# Deploy the Model Remotely by RPC
-# --------------------------------
-# Using RPC you can deploy the model from host
-# machine to the remote Adreno device
+# The Auto Tuning process involves the stages of extracting the tasks, defining a tuning configuration and
+# tuning each task for the best performing kernel configuration.
+
+# Get RPC related settings.
rpc_tracker_host = os.environ.get("TVM_TRACKER_HOST", "127.0.0.1")
 rpc_tracker_port = int(os.environ.get("TVM_TRACKER_PORT", 9190))
 key = "android"

+# Auto tuning is a compute intensive and time consuming task.
+# It is set to False in the above configuration as this script runs on x86 for demonstration.
+# Please set :code:`is_tuning` to True to enable auto tuning.
+
+if is_tuning:
+    # Auto Tuning Stage 1: Extract tunable tasks
+    tasks = autotvm.task.extract_from_program(
+        mod, target=test_target, target_host=target, params=params
+    )
+
+    # Auto Tuning Stage 2: Define tuning configuration
+    tmp_log_file = tune_log + ".tmp"
+    measure_option = autotvm.measure_option(
+        builder=autotvm.LocalBuilder(
+            build_func=ndk.create_shared, timeout=15
+        ),  # Build the test kernel locally
+        runner=autotvm.RPCRunner(  # The runner would be on a remote device.
+            key,  # RPC Key
+            host=rpc_tracker_host,  # Tracker host
+            port=int(rpc_tracker_port),  # Tracker port
+            number=3,  # Number of runs before averaging
+            timeout=600,  # RPC Timeout
+        ),
+    )
+    n_trial = 1024  # Number of training iterations before choosing the best kernel config
+    early_stopping = False  # Can be enabled to stop tuning while the loss is not minimizing.
+
+    # Auto Tuning Stage 3: Iterate through the tasks and tune.
+    from tvm.autotvm.tuner import XGBTuner
+
+    for i, tsk in enumerate(reversed(tasks[:3])):
+        print("Task:", tsk)
+        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
+        tuner_obj = XGBTuner(tsk, loss_type="rank")
+
+        tsk_trial = min(n_trial, len(tsk.config_space))
+        tuner_obj.tune(
+            n_trial=tsk_trial,
+            early_stopping=early_stopping,
+            measure_option=measure_option,
+            callbacks=[
+                autotvm.callback.progress_bar(tsk_trial, prefix=prefix),
+                autotvm.callback.log_to_file(tmp_log_file),
+            ],
+        )
+    # Auto Tuning Stage 4: Pick the best performing configurations from the overall log.
+    autotvm.record.pick_best(tmp_log_file, tune_log)
+
+#################################################################
+# Enable OpenCLML Offloading
+# --------------------------
+# OpenCLML offloading will try to accelerate supported operators
+# by using the OpenCLML proprietary operator library.
+
+# By default :code:`enable_clml` is set to False in the above configuration section.
+
+if not local_demo and enable_clml:
+    mod = clml.partition_for_clml(mod, params)
+
+#################################################################
+# Compilation
+# -----------
+# Use the tuning cache if it exists.
+if os.path.exists(tune_log):
+    with autotvm.apply_history_best(tune_log):
+        with tvm.transform.PassContext(opt_level=3):
+            lib = relay.build(mod, target=target, params=params)
+else:
+    with tvm.transform.PassContext(opt_level=3):
+        lib = relay.build(mod, target=target, params=params)
+
+#################################################################
+# Deploy the Model Remotely by RPC
+# --------------------------------
+# Using RPC you can deploy the model from the host
+# machine to the remote Adreno device
 if local_demo:
     remote = rpc.LocalSession()
 else:
@@ -282,10 +375,8 @@ def convert_to_dtype(mod, dtype):

 if local_demo:
     dev = remote.cpu(0)
-elif test_target == "opencl":
+elif "opencl" in test_target:
     dev = remote.cl(0)
-elif test_target == "vulkan":
-    dev = remote.vulkan(0)
 else:
     dev = remote.cpu(0)

diff --git a/gallery/how_to/deploy_models/deploy_model_on_adreno_tvmc.py b/gallery/how_to/deploy_models/deploy_model_on_adreno_tvmc.py
new file mode 100644
index 000000000000..b54ac1b2c6e7
--- /dev/null
+++ b/gallery/how_to/deploy_models/deploy_model_on_adreno_tvmc.py
@@ -0,0 +1,198 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.
The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+.. _tutorial-deploy-model-on-adreno-tvmc:
+
+Deploy the Pretrained Model on Adreno™ with tvmc Interface
+==========================================================
+**Author**: Siva Rama Krishna
+
+This article is a step-by-step tutorial to deploy a pretrained Keras resnet50 model on Adreno™.
+
+Besides that, you should have TVM built for Android.
+See the following instructions on how to build it and set up the RPC environment.
+
+`Deploy to Adreno GPU `_
+
+"""
+
+import os
+import tvm
+import numpy as np
+from tvm import relay
+from tvm.driver import tvmc
+from tvm.driver.tvmc.model import TVMCPackage
+from tvm.contrib import utils

+#################################################################
+# Configuration
+# -------------
+# Specify the Adreno target before compiling to generate texture
+# leveraging kernels and get all the benefits of textures.
+# Note: This generated example runs on our x86 server for demonstration.
+# If running it on the Android device, we need to
+# specify its instruction set. Set :code:`local_demo` to False if you want
+# to run this tutorial with a real device over rpc.
+local_demo = True
+
+# By default the CPU target will execute.
+# Select 'llvm', 'opencl' or 'opencl -device=adreno'.
+target = "llvm"
+
+# Change target configuration.
+# Run `adb shell cat /proc/cpuinfo` to find the arch.
+arch = "arm64"
+target_host = "llvm -mtriple=%s-linux-android" % arch
+
+# Auto tuning is a compute intensive and time consuming task, hence it is disabled for the default run. Please enable it if required.
+is_tuning = False
+tune_log = "adreno-resnet50.log"
+
+# To enable the OpenCLML accelerated operator library.
+enable_clml = False
+cross_compiler = (
+    os.getenv("ANDROID_NDK_HOME", "")
+    + "/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android28-clang"
+)
+
+#######################################################################
+# Make a Keras Resnet50 Model
+# ---------------------------
+
+from tensorflow.keras.applications.resnet50 import ResNet50
+
+tmp_path = utils.tempdir()
+model_file_name = tmp_path.relpath("resnet50.h5")
+
+model = ResNet50(include_top=True, weights="imagenet", input_shape=(224, 224, 3), classes=1000)
+model.save(model_file_name)
+
+
+#######################################################################
+# Load Model
+# ----------
+# Convert a model from any framework to a tvm relay module.
+# tvmc.load supports models from any framework (like tensorflow saved_model, onnx, tflite etc.) and auto detects the filetype.

+tvmc_model = tvmc.load(model_file_name)
+
+print(tvmc_model.mod)
+
+# tvmc_model consists of tvmc_model.mod which is the relay module and tvmc_model.params which are the params of the module.
+
+#######################################################################
+# AutoTuning
+# ----------
+# Now, the below api can be used for autotuning the model for any target.
+# Tuning requires an RPC setup, please refer to
+# `Deploy to Adreno GPU `_
+
+rpc_tracker_host = os.environ.get("TVM_TRACKER_HOST", "127.0.0.1")
+rpc_tracker_port = int(os.environ.get("TVM_TRACKER_PORT", 9190))
+rpc_key = "android"
+rpc_tracker = rpc_tracker_host + ":" + str(rpc_tracker_port)
+
+# Auto tuning is a compute intensive and time consuming task.
+# It is set to False in the above configuration as this script runs on x86 for demonstration.
+
+# Please set :code:`is_tuning` to True to enable auto tuning.
+
+# Also, :code:`target` is set to :code:`llvm` in this example to make it compatible for x86 demonstration.
+# Please change it to :code:`opencl` or :code:`opencl -device=adreno` for an RPC target in the configuration above.
+
+if is_tuning:
+    tvmc.tune(
+        tvmc_model,
+        target=target,
+        tuning_records=tune_log,
+        target_host=target_host,
+        hostname=rpc_tracker_host,
+        port=rpc_tracker_port,
+        rpc_key=rpc_key,
+        tuner="xgb",
+        repeat=30,
+        trials=3,
+        early_stopping=0,
+    )
+
+#######################################################################
+# Compilation
+# -----------
+# Compilation to produce tvm artifacts
+
+# This generated example runs on our x86 server for demonstration.
+# To deploy and run on a real target over RPC please set :code:`local_demo` to False in the above configuration section.
+
+# OpenCLML offloading will try to accelerate supported operators by using the OpenCLML proprietary operator library.
+# By default :code:`enable_clml` is set to False in the above configuration section.
+
+if not enable_clml:
+    if local_demo:
+        tvmc_package = tvmc.compile(
+            tvmc_model,
+            target=target,
+        )
+    else:
+        tvmc_package = tvmc.compile(
+            tvmc_model,
+            target=target,
+            target_host=target_host,
+            cross=cross_compiler,
+            tuning_records=tune_log,
+        )
+else:
+    # Alternatively, we can save the compilation output as a TVMCPackage.
+    # This way the compiled module can be loaded later without compiling again.
+    target = target + ", clml"
+    pkg_path = tmp_path.relpath("keras-resnet50.tar")
+    tvmc.compile(
+        tvmc_model,
+        target=target,
+        target_host=target_host,
+        cross=cross_compiler,
+        tuning_records=tune_log,
+        package_path=pkg_path,
+    )
+
+    # Load the compiled package
+    tvmc_package = TVMCPackage(package_path=pkg_path)
+
+# tvmc_package consists of tvmc_package.lib_path, tvmc_package.graph, tvmc_package.params
+# A saved TVMCPackage is nothing but a tar archive with mod.so, mod.json and mod.params.
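The comment above notes that a saved TVMCPackage is just a tar archive holding mod.so, mod.json and mod.params. A small stdlib-only sketch (illustrative only, with dummy payloads, not part of this patch) of that layout:

```python
import io
import json
import tarfile

# Illustrative only: build an in-memory tar archive with the three member
# names a saved TVMCPackage contains, then list them back.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [
        ("mod.so", b"\x7fELF"),                             # placeholder shared library bytes
        ("mod.json", json.dumps({"nodes": []}).encode()),   # placeholder graph json
        ("mod.params", b"\x00"),                            # placeholder params blob
    ]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    members = sorted(m.name for m in tar.getmembers())
print(members)  # ['mod.json', 'mod.params', 'mod.so']
```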
+ + +####################################################################### +# Deploy & Run +# ------------ +# Deploy and run the compiled model on RPC +# Let tvmc fill inputs using random + +# Run on RPC setup +if local_demo: + result = tvmc.run(tvmc_package, device="cpu", fill_mode="random") +else: + result = tvmc.run( + tvmc_package, + device="cl", + rpc_key=rpc_key, + hostname=rpc_tracker_host, + port=rpc_tracker_port, + fill_mode="random", + ) + +# result is a dictionary of outputs. +print("Result:", result) diff --git a/tests/python/relay/opencl_texture/test_network.py b/tests/python/relay/opencl_texture/test_network.py index 46ee79697ea6..1d0e996f9f97 100644 --- a/tests/python/relay/opencl_texture/test_network.py +++ b/tests/python/relay/opencl_texture/test_network.py @@ -27,26 +27,25 @@ from utils.adreno_utils import build_run_compare, get_model, gpu_preprocess -def convert_to_fp16(mod, dtype): - from tvm.ir import IRModule - - mod = IRModule.from_expr(mod) - seq = tvm.transform.Sequential( - [relay.transform.InferType(), relay.transform.ToMixedPrecision()] - ) - with tvm.transform.PassContext(opt_level=3): - mod = seq(mod) - return mod - - -def _test_mobilenet_v1(remote, target, dtype): +def _test_mobilenet_v1(remote, target, calc_dtype, acc_dtype): mod, params, inputs, dtypes = get_model( "https://github.com/mlcommons/mobile_models/raw/main/v0_7/tflite/mobilenet_edgetpu_224_1.0_float.tflite", "mobilenet_edgetpu_224_1.0_float.tflite", "tflite", ) - if dtype == "float16": - mod = convert_to_fp16(mod["main"], dtype) + if calc_dtype == "float16": + from tvm.driver.tvmc.transform import apply_graph_transforms + + mod = apply_graph_transforms( + mod, + { + "mixed_precision": True, + "mixed_precision_ops": ["nn.conv2d", "nn.dense"], + "mixed_precision_calculation_type": calc_dtype, + "mixed_precision_acc_type": acc_dtype, + }, + ) + build_run_compare(remote, mod, params, inputs, dtypes, target, []) @@ -55,14 +54,21 @@ def _test_mobilenet_v1(remote, target, 
dtype): @tvm.testing.parametrize_targets("opencl -device=adreno") @pytest.mark.skipif(tvm.testing.utils.IS_IN_CI, reason="CI doesn't support fp16(half datatypes)") def test_mobilenet_v1_fp16(remote, target): - _test_mobilenet_v1(remote, target, "float16") + _test_mobilenet_v1(remote, target, "float16", "float16") @pytest.mark.skip(reason="See https://github.com/apache/tvm/issues/13443") @tvm.testing.requires_opencl @tvm.testing.parametrize_targets("opencl -device=adreno") def test_mobilenet_v1_fp32(remote, target): - _test_mobilenet_v1(remote, target, "float32") + _test_mobilenet_v1(remote, target, "float32", "float32") + + +@pytest.mark.skip(reason="See https://github.com/apache/tvm/issues/13443") +@tvm.testing.requires_opencl +@tvm.testing.parametrize_targets("opencl -device=adreno") +def test_mobilenet_v1_fp16_acc32(remote, target): + _test_mobilenet_v1(remote, target, "float16", "float32") if __name__ == "__main__": diff --git a/tests/scripts/setup-adreno-env.sh b/tests/scripts/setup-adreno-env.sh new file mode 100755 index 000000000000..44eb7edce8bc --- /dev/null +++ b/tests/scripts/setup-adreno-env.sh @@ -0,0 +1,113 @@ +#!/usr/bin/env bash +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ + +ENVIRONMENT="" +RPC_PORT="" +ADB_SERIAL="" + +function usage() { + echo "Helper script to setup the environment for Tracker, RPC Device and for application" + echo "Usage (Help) : source setup-adreno-env.sh -h" + echo "Usage (Tracker): source setup-adreno-env.sh -e tracker -p " + echo "Usage (Device): source setup-adreno-env.sh -e device -p -d " + echo "Usage (Query): source setup-adreno-env.sh -e query -p " +} + +while [[ $# -gt 0 ]]; do + case $1 in + -e|--environment) + ENVIRONMENT="$2" + shift # past argument + shift # past value + ;; + -p|--rpc-port) + RPC_PORT="$2" + shift # past argument + shift # past value + ;; + -d|--android-device) + ADB_SERIAL="$2" + shift # past argument + shift # past value + ;; + -h|--help) + usage + return 0 + ;; + -*|--*) + usage + return 0 + ;; + *) + ;; + esac +done + +echo "ENVIRONMENT = ${ENVIRONMENT}" +echo "RPC_PORT = ${RPC_PORT}" +echo "ADB_SERIAL = ${ADB_SERIAL}" + + +function def_environment() { + source tests/scripts/setup-pytest-env.sh + export PYTHONPATH=${PYTHONPATH}:${TVM_PATH}/apps/extension/python + export LD_LIBRARY_PATH="${TVM_PATH}/build:${LD_LIBRARY_PATH}" + export TVM_TRACKER_HOST=0.0.0.0 + export TVM_TRACKER_PORT=$RPC_PORT + export RPC_DEVICE_KEY="android" + export RPC_TARGET="adreno" + export TVM_NDK_CC="${ANDROID_NDK_HOME}/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android28-clang" +} + +def_environment + +case ${ENVIRONMENT} in + + "tracker") + echo "Starting Tracker on port :${TVM_TRACKER_PORT}" + def_environment + python3 -m tvm.exec.rpc_tracker --host "${TVM_TRACKER_HOST}" --port "${TVM_TRACKER_PORT}" + ;; + + "device") + echo "Running RPC on device : ${ADB_SERIAL} with key $RPC_DEVICE_KEY" + def_environment + export ANDROID_SERIAL=${ADB_SERIAL} + + adb shell "mkdir -p /data/local/tmp/tvm_ci" + adb push build-adreno-target/tvm_rpc /data/local/tmp/tvm_ci/tvm_rpc_ci + adb push build-adreno-target/libtvm_runtime.so /data/local/tmp/tvm_ci + + adb reverse tcp:${TVM_TRACKER_PORT} 
tcp:${TVM_TRACKER_PORT}
+    adb forward tcp:5000 tcp:5000
+    adb forward tcp:5001 tcp:5001
+    adb forward tcp:5002 tcp:5002
+    adb shell "cd /data/local/tmp/tvm_ci; killall -9 tvm_rpc_ci; sleep 2; LD_LIBRARY_PATH=/data/local/tmp/tvm_ci/ ./tvm_rpc_ci server --host=0.0.0.0 --port=5000 --port-end=5010 --tracker=127.0.0.1:${TVM_TRACKER_PORT} --key=${RPC_DEVICE_KEY}"
+    ;;
+
+  "query")
+    def_environment
+    echo "Setting dev environment with Tracker Port : ${TVM_TRACKER_PORT} and the available devices are"
+    python3 -m tvm.exec.query_rpc_tracker --port ${TVM_TRACKER_PORT}
+    ;;
+
+  *)
+    usage
+    ;;
+esac
diff --git a/tests/scripts/task_build_adreno_bins.sh b/tests/scripts/task_build_adreno_bins.sh
index f65794106ee3..87f50367440c 100755
--- a/tests/scripts/task_build_adreno_bins.sh
+++ b/tests/scripts/task_build_adreno_bins.sh
@@ -28,7 +28,6 @@ cd ${output_directory}

 cp ../cmake/config.cmake .

-echo set\(USE_MICRO OFF\) >> config.cmake
 if [ -f "${ADRENO_OPENCL}/CL/cl_qcom_ml_ops.h" ] ; then
 echo set\(USE_CLML "${ADRENO_OPENCL}"\) >> config.cmake
 echo set\(USE_CLML_GRAPH_EXECUTOR "${ADRENO_OPENCL}"\) >> config.cmake
diff --git a/tests/scripts/task_config_build_adreno.sh b/tests/scripts/task_config_build_adreno.sh
index d378b5f842b5..62e6ffecbced 100755
--- a/tests/scripts/task_config_build_adreno.sh
+++ b/tests/scripts/task_config_build_adreno.sh
@@ -23,7 +23,6 @@ mkdir -p "$BUILD_DIR"
 cd "$BUILD_DIR"

 cp ../cmake/config.cmake .

-echo set\(USE_OPENCL ON\) >> config.cmake
 if [ -f "${ADRENO_OPENCL}/CL/cl_qcom_ml_ops.h" ] ; then
 echo set\(USE_CLML ${ADRENO_OPENCL}\) >> config.cmake
 fi
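In the setup script above, the device RPC server is started with `--port=5000 --port-end=5010` while adb forwards host ports 5000-5002; the setup works only because every forwarded port falls inside the server's allowed range. A tiny illustrative check of that invariant (the `ports_covered` helper is hypothetical, not part of the patch):

```python
# Illustrative only: the forwarded ports must lie within the RPC server's
# allowed range [port, port_end) for the tracker to reach the device.
def ports_covered(forwarded, port, port_end):
    return all(port <= p < port_end for p in forwarded)

print(ports_covered([5000, 5001, 5002], 5000, 5010))  # True
print(ports_covered([5000, 5011], 5000, 5010))        # False
```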