
Add blog about auto tuning for all hardware platforms #21

Merged
tqchen merged 17 commits into tvmai:master from merrymercy:master on Oct 3, 2018

Conversation

merrymercy (Contributor) commented Sep 29, 2018

merrymercy (Contributor, Author) commented:
@eqy Can you do a round of review if you have time?

eqy (Contributor) left a comment:

Just a bunch of nits; overall, great work on this blog post. Really nice to see the big-picture milestone done.

Comment thread _posts/2018-10-02-auto-tune-all.md Outdated
- tvm
---

How to optimize the performance of deep neural network on a diverse range of hardware platforms is still a hard

eqy (Contributor):
Optimizing deep neural network performance on a diverse range of hardware platforms...

Comment thread _posts/2018-10-02-auto-tune-all.md Outdated
How to optimize the performance of deep neural network on a diverse range of hardware platforms is still a hard
problem for AI developers. In terms of system support, we are facing a many-to-many problem here:
deploying trained models from multiple frontends (e.g. Tensorflow, ONNX, MXNet) to multiple
hardware platforms (e.g. CPU, GPU, Accelerators). On the most performance critical part of

eqy (Contributor):
The most performance critical part of this problem is obtaining high performance kernel implementations...

Comment thread _posts/2018-10-02-auto-tune-all.md Outdated
this problem is how to get high performance kernel implementation for growing model
architectures and hardware platforms.

To address this challenge, TVM takes a full stack compiler approach. Combining code generator and auto-tuner in TVM,

eqy (Contributor):
TVM combines code generation and auto-tuning to generate kernels... , obtaining state-of-the-art inference performance including...

Comment thread _posts/2018-10-02-auto-tune-all.md Outdated
and obtain the state-of-the-art inference performance on hardware platforms including
ARM CPUs, Intel CPUs, Mali GPUs, NVIDIA GPUs and AMD GPUs.

In this blog post, I will show the workflow of automatic kernel optimization in TVM compiler stack and

eqy (Contributor):
In this blog post, I show...

Comment thread _posts/2018-10-02-auto-tune-all.md Outdated

Kernel optimization in TVM is done in an iterative loop fashion.
As shown in Figure 1, the automatic kernel optimization takes a neural network (typically in computational graph representation)
from frontend frameworks as input, and generates kernels for all the operators in this network.

eqy (Contributor):
... for all operators in the network.
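
(For concreteness, the iterative loop described in the quoted paragraph roughly corresponds to the NNVM-era AutoTVM flow sketched below. This is a minimal, illustrative sketch based on the tutorials linked later in the post, not code from the blog itself; the network, input shape, trial count, and log file name are placeholders.)

```python
import tvm
from tvm import autotvm
import nnvm
import nnvm.compiler
import nnvm.testing

# Placeholder workload: a ResNet-18 from the NNVM model zoo.
net, params = nnvm.testing.resnet.get_workload(num_layers=18, batch_size=1)
input_shape = {"data": (1, 3, 224, 224)}
target = tvm.target.cuda()

# 1. Extract tunable tasks (here only conv2d) from the computational graph.
tasks = autotvm.task.extract_from_graph(
    net, target=target, shape=input_shape, dtype="float32",
    symbols=(nnvm.sym.conv2d,))

# 2. Tune each task with an ML-based tuner, measuring candidate kernels on
#    real hardware and logging the results.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=10),
    runner=autotvm.LocalRunner(number=20, repeat=3, timeout=4))

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=1000, measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file("tuning.log")])

# 3. Compile the whole network, picking the best config found for each operator.
with autotvm.apply_history_best("tuning.log"):
    with nnvm.compiler.build_config(opt_level=3):
        graph, lib, params = nnvm.compiler.build(
            net, target=target, shape=input_shape, params=params)
```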

Comment thread _posts/2018-10-02-auto-tune-all.md Outdated

Finally let us take a look at AMD GPU. TVM supports OpenCL and [ROCm](https://rocm.github.io/) backend. We found ROCm is better since
it is more specialized for AMD GPUs. In terms of baseline, [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) is a vendor provided
kernel library. We integrate its kernel implementation in TVM graph runtime.

eqy (Contributor):
TVM's graph runtime integrates its kernel implementations (maybe clarify that this is optional and not relied upon for generating optimized code)

Comment thread _posts/2018-10-02-auto-tune-all.md Outdated
it is more specialized for AMD GPUs. In terms of baseline, [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) is a vendor provided
kernel library. We integrate its kernel implementation in TVM graph runtime.

We didn't do any specific optimization for AMD GPU. Instead, all computation definition/schedule code for NVIDIA GPU is directly reused. As for the results, TVM is a little bit slower then MIOpen in most cases.

eqy (Contributor):
In this case, we reuse ... from NVIDIA GPUs

As a result, TVM ... , (but maybe mention that there is room for improvement)
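
(To illustrate the schedule reuse discussed in this thread: on the TVM side, targeting an AMD GPU is essentially just a target change. A minimal sketch under the same assumptions as the earlier one; `rocm_tuning.log` is a hypothetical log produced by running the same tuning loop on the AMD device.)

```python
import tvm
from tvm import autotvm
import nnvm.compiler
import nnvm.testing

# Same placeholder workload as before.
net, params = nnvm.testing.resnet.get_workload(num_layers=18, batch_size=1)

# The compute definitions and schedule templates written for NVIDIA GPUs are
# reused as-is; only the target string changes, and TVM emits AMD GPU code
# through the ROCm backend.
target = tvm.target.create("rocm")

with autotvm.apply_history_best("rocm_tuning.log"):
    graph, lib, params = nnvm.compiler.build(
        net, target=target, shape={"data": (1, 3, 224, 224)}, params=params)
```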

Comment thread _posts/2018-10-02-auto-tune-all.md Outdated

* Note 1: Out of memory on this board.
* Note 2: We didn't tune some small networks on GPU due to time limit. TVM can use its fallback mechanism to compile them but the performance is not guaranteed.

eqy (Contributor):
... due to time constraints...
When profiling data is not available... TVM can use fallback code generation (but competitive performance is not guaranteed in this scenario).
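
(For context on the fallback mentioned here: a rough sketch, under the same assumptions as above, of compiling without any tuning log. AutoTVM's dispatcher then falls back to default schedule configurations and prints a warning, so the model still compiles but tuned performance is not expected.)

```python
import tvm
import nnvm.compiler
import nnvm.testing

# One of the small placeholder networks.
net, params = nnvm.testing.squeezenet.get_workload(version="1.1", batch_size=1)

# No autotvm.apply_history_best(...) context here: without profiling data,
# AutoTVM falls back to default schedule configurations (with a warning),
# so the network still compiles and runs, just without tuned kernels.
graph, lib, params = nnvm.compiler.build(
    net, target=tvm.target.cuda(),
    shape={"data": (1, 3, 224, 224)}, params=params)
```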

Comment thread _posts/2018-10-02-auto-tune-all.md Outdated

* Note 1: Out of memory on this board.
* Note 2: We didn't tune some small networks on GPU due to time limit. TVM can use its fallback mechanism to compile them but the performance is not guaranteed.
So their results are omitted here.

eqy (Contributor):
(can delete this)

Comment thread _posts/2018-10-02-auto-tune-all.md Outdated
[NVIDIA/AMD GPU](https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_cuda.html)
are all available. Try tuning for your custom network and hardware devices.

For Intel CPU, right now it is under refactor, but you can take a look at the

eqy (Contributor):
(Intel CPU is currently being refactored...

merrymercy (Contributor, Author) commented:
Thanks! Review comments are addressed. Learned a lot about writing.

tqchen (Collaborator) commented Oct 2, 2018

  • Merge Figures 1 and 2, and say a bit more in the text about AutoTVM vs. traditional auto-tuning
    • point 1: it scales to more devices; point 2: it uses ML to speed up optimization
    • Always try to make a blog post relatively self-contained and only put references at the end.
  • Use a different color for the ML-based method and the black-box method (to highlight the ML-based method).

tqchen (Collaborator) commented Oct 2, 2018

Also, the comparison figures might look better in landscape mode than in the current vertical mode.

merrymercy (Contributor, Author) commented Oct 3, 2018

  • Figure 1 is already complicated enough; a separate figure is needed to highlight the difference.
  • The colors are changed and text is added.

[image: autotvm]

I think it is ready to publish. I changed the date to Oct. 3.

tqchen (Collaborator) commented Oct 3, 2018

OK, some final comments:

  • Put a link to the AutoTVM paper at the end, to show all resources
  • Use bullet points and keywords to highlight the full-stack approach
    • Scalable to a heterogeneous cluster of devices
    • Learning to optimize tensor programs

Comment thread _posts/2018-10-03-auto-tune-all.md Outdated
With an expressive code generator and an efficient search algorithm, we are able to
generate kernels that are comparable to heavily hand-optimized ones.
Since programmer time is expensive and machine time is getting cheaper,
we believe the auto-tuning with real hardware and data in the loop will be the standard workflow

tqchen (Collaborator):
auto-tuning -> automatic program optimization

Comment thread _posts/2018-10-03-auto-tune-all.md Outdated
### NVIDIA GPU

On NVIDIA GPU, [CuDNN](https://developer.nvidia.com/cudnn) and [TensorRT](https://developer.nvidia.com/tensorrt) are two vendor-provided libraries for training and inference respectively. Since we focus on inference,
we run our benchmark in the unbatched setting. Another tensor compiler [PlaidML](https://github.com/plaidml/plaidml) is also reported as baseline.

tqchen (Collaborator):
We also include PlaidML as a baseline, since there is a previous benchmark comparing it against a pre-AutoTVM version of TVM.

merrymercy (Contributor, Author) commented Oct 3, 2018

Fixed, with an up-to-date preview at http://lmzheng.net/posts/2018/10/auto-tune-all

tqchen merged commit c3ada80 into tvmai:master on Oct 3, 2018