Add blog about auto tuning for all hardware platforms by merrymercy · Pull Request #21 · tvmai/tvmai.github.io

merrymercy · 2018-09-29T15:39:42Z

preview http://lmzheng.net/posts/2018/10/auto-tune-all

merrymercy · 2018-10-02T17:46:00Z

@eqy Can you do a round of review if you have time?

eqy

just a bunch of nits, overall great work on this blog post
really nice to see the big picture milestone done

eqy · 2018-10-02T18:06:51Z

+  - tvm
+---
+
+How to optimize the performance of deep neural network on a diverse range of hardware platforms is still a hard


Optimizing deep neural network performance on a diverse range of hardware platforms...

eqy · 2018-10-02T19:28:16Z

+How to optimize the performance of deep neural network on a diverse range of hardware platforms is still a hard
+problem for AI developers. In terms of system support, we are facing a many-to-many problem here:
+deploying trained models from multiple frontends (e.g. Tensorflow, ONNX, MXNet) to multiple
+hardware platforms (e.g. CPU, GPU, Accelerators). On the most performance critical part of


The most performance critical part of this problem is obtaining high performance kernel implementations...

eqy · 2018-10-02T19:29:42Z

+this problem is how to get high performance kernel implementation for growing model
+architectures and hardware platforms.
+
+To address this challenge, TVM takes a full stack compiler approach. Combining code generator and auto-tuner in TVM,


TVM combines code generation and auto-tuning to generate kernels... , obtaining state-of-the-art inference performance including...

eqy · 2018-10-02T19:29:59Z

+and obtain the state-of-the-art inference performance on hardware platforms including
+ARM CPUs, Intel CPUs, Mali GPUs, NVIIDA GPUs and AMD GPUs.
+
+In this blog post, I will show the workflow of automatic kernel optimization in TVM compiler stack and 


In this blog post, I show...

eqy · 2018-10-02T19:30:29Z

+
+Kernel optimization in TVM is done in an iterative loop fashion.
+As shown in Figure 1, the automatic kernel optimization takes a neural network (typically in computational graph representation)
+from frontend frameworks as input, and generates kernels for all the operators in this network.


... for all operators in the network.

eqy · 2018-10-02T19:44:57Z

+
+Finally let we take a look at AMD GPU. TVM supports OpenCL and [ROCm](https://rocm.github.io/) backend. We found ROCm is better since
+it is more specialized for AMD GPUs. In terms of baseline, [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) is a vendor provided
+kernel library. We integrate its kernel implementation in TVM graph runtime.


TVM's graph runtime integrates its kernel implementations (maybe clarify that this is optional and not relied upon for generating optimized code)

eqy · 2018-10-02T19:45:40Z

+it is more specialized for AMD GPUs. In terms of baseline, [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) is a vendor provided
+kernel library. We integrate its kernel implementation in TVM graph runtime.
+
+We didn't do any specific optimization for AMD GPU. Instead, all computation definition/schedule code for NVIDIA GPU is directly reused. As for the results, TVM is a little bit slower then MIOpen in most cases.


In this case, we reuse ... from NVIDIA GPUs

As a result, TVM ... , (but maybe mention that there is room for improvement)

eqy · 2018-10-02T19:46:27Z

+| |
+
+* Note 1: Out of memory on this board.
+* Note 2: We didn't tune some small networks on GPU due to time limit. TVM can use its fallback mechanism to compile them but the performance is not guaranteed.


... due to time constraints...
When profiling data is not available... TVM can use fallback code generation (but competitive performance is not guaranteed in this scenario).

eqy · 2018-10-02T19:46:36Z

+
+* Note 1: Out of memory on this board.
+* Note 2: We didn't tune some small networks on GPU due to time limit. TVM can use its fallback mechanism to compile them but the performance is not guaranteed.
+So their results are omitted here.


(can delete this)

eqy · 2018-10-02T19:46:57Z

+[NVIDIA/AMD GPU](https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_cuda.html)
+are all available. Try tuning for your custom network and hardware devices.
+
+For Intel CPU, right now it is under refactor, but you can take a look at the


Intel CPU is currently being refactored...

merrymercy · 2018-10-02T22:48:04Z

Thanks! Review comments are addressed. Learned a lot about writing.

tqchen · 2018-10-02T22:57:16Z

Merge figures 1 and 2, and say a bit more in the text about AutoTVM versus traditional auto-tuning
- point 1: scales to more devices, point 2: uses ML to speed up optimization
- Always try to make a blog post relatively self-contained and only put references at the end.
Use different colors for the ML-based method and the black-box method (to highlight the ML-based method).

tqchen · 2018-10-02T22:58:48Z

Also maybe the comparison figures are better in landscape mode vs the current vertical mode

merrymercy · 2018-10-03T06:09:05Z

Figure 1 is already complicated enough. Another figure is required to highlight the difference.
Color is changed and text is added.

I think it is ready to publish. I changed the date to Oct. 3.

tqchen · 2018-10-03T06:31:00Z

OK, some final comments:

Put the link to the AutoTVM paper at the end, to show all resources
Use bullet points and keywords to highlight the full stack approach
- Scalable to heterogeneous cluster of devices
- Learning to optimize tensor programs

tqchen · 2018-10-03T06:37:46Z

+With an expressive code generator and an efficient search algorithm, we are able to
+generate kernels that are comparable to heavily hand-optimized ones.
+Since programmer time is expensive and machine time is getting cheaper,
+we believe the auto-tuning with real hardware and data in the loop will be the standard workflow


auto-tuning -> automatic program optimization

tqchen · 2018-10-03T06:38:52Z

+### NVIDIA GPU
+
+On NVIDIA GPU, [CuDNN](https://developer.nvidia.com/cudnn) and [TensorRT](https://developer.nvidia.com/tensorrt) are two vendor-provided libraries for training and inference respectively. Since we focus on inference,
+we run our benchmark in the unbatched setting. Another tensor compiler [PlaidML](https://github.com/plaidml/plaidml) is also reported as baseline.


we also include PlaidML as a baseline as there is a previous benchmark of it compared against a pre-AutoTVM version of TVM.

merrymercy · 2018-10-03T06:52:20Z

Fixed with up-to-date preview http://lmzheng.net/posts/2018/10/auto-tune-all

merrymercy added 4 commits September 29, 2018 08:38

Add blog about auto tuning for all hardware platforms

bb5e14d

fix typo

b05c23b

update

297a607

update

32d1a52

merrymercy added 2 commits October 2, 2018 10:48

fix typo

2cbe510

update benchmark numbers

f264e61

eqy suggested changes Oct 2, 2018

merrymercy added 2 commits October 2, 2018 15:41

address comments

970bfe9

address comments

6196533

Fix typo

0940ccb

eqy approved these changes Oct 3, 2018

rename

0a17edf

tqchen requested changes Oct 3, 2018

tweak some words

170d572

tqchen merged commit c3ada80 into tvmai:master Oct 3, 2018

merrymercy added 6 commits October 3, 2018 22:54

update figures

3089e1a

fix

5ca2e30

rearrange links

4af4bf2

update statement about plaidml

9348bfb

address comments

2d7d162

fix typo

c83e471

3 participants
