
Conversation

@zhiics (Member) commented Sep 5, 2018

Background

The current TVM runtime takes the entire computation graph and executes it iteratively under the assumption that all operators perform equally well on the same backend. In reality, however, different backends may provide their own libraries of highly optimized operators. These operators are usually tuned for a limited set of target devices, which can make it difficult for TVM to execute a whole computation graph efficiently on a single device. This PR proposes a mechanism to support heterogeneous execution in TVM when multiple processors are present on the same hardware.

Major steps

1. Context/device annotation

Annotation is implemented as a separate pass that can be invoked multiple times, before and after other passes (e.g. InferShape/InferType) when necessary.
Nodes are annotated with target information in this pass. Annotation also lets us 1) extract the target right before certain optimizations (e.g. altering layout) and/or compute, which removes the "with target" scope from the build stage, and 2) prevent operator fusion across targets.
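
To make the placement idea concrete, here is a minimal, self-contained sketch. It models the graph as plain Python data and uses DLPack device-type codes; the `annotate` helper and the node list are illustrative only and are not the NNVM pass added by this PR.

```python
# Illustrative only: a toy stand-in for the annotation pass, not the pass in this PR.
# Device-type codes follow DLPack (kDLCPU = 1, kDLOpenCL = 4).
CPU, OPENCL = 1, 4

def annotate(nodes, cpu_ops, default_device=OPENCL):
    """Return {node_name: device_type}: ops listed in cpu_ops go to the CPU,
    everything else stays on the default device."""
    return {name: (CPU if op in cpu_ops else default_device) for name, op in nodes}

# Example: schedule nms on the CPU, keep conv2d and multibox on the GPU (OpenCL).
graph_nodes = [("data", "null"), ("conv0", "conv2d"),
               ("boxes", "multibox_prior"), ("out", "nms")]
print(annotate(graph_nodes, cpu_ops={"nms"}))
```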

2. Copy node insertion

A copy node is a special node inserted into the annotated graph wherever data crosses a device boundary. These nodes are very lightweight: they do not need to be lowered, and no fcompute/fschedule is required. The runtime detects this op and performs the data copy directly.
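
As a hedged illustration of what the runtime does when it reaches such a node, the copy is just an NDArray-to-NDArray transfer using the public tvm.nd API. The op name "__copy" below is an assumption for illustration; the actual convention is defined by this PR.

```python
# Sketch only: how a runtime loop could special-case a copy node.
# The op name "__copy" is assumed for illustration.
import numpy as np
import tvm

def run_node(op_name, inputs, outputs, compiled_funcs=None):
    if op_name == "__copy":
        # No lowered kernel, no fcompute/fschedule needed: just move the data.
        inputs[0].copyto(outputs[0])
    else:
        # Normal tvm_op path: call the compiled (fused) function for this node.
        compiled_funcs[op_name](*inputs, *outputs)

# Example (requires an OpenCL device): copy a tensor from CPU to OpenCL memory.
x_cpu = tvm.nd.array(np.ones((2, 2), dtype="float32"), tvm.cpu(0))
x_dev = tvm.nd.empty((2, 2), "float32", tvm.opencl(0))
run_node("__copy", [x_cpu], [x_dev])
```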

3. Compilation

TVM compiles the annotated graph into multiple binaries, one per target device needed. Each binary contains all the compiled (fused) operators for a specific target. We might be able to use just one binary, but I haven't investigated that yet. In any case, only one JSON file and one param file are generated.
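
The exact compiler API is still under discussion in this thread (see the comments below asking for a single build function). Purely as an assumption of how that could look, a single build call might take one target per device type; the dict-valued `target` below is not the API this PR currently implements.

```python
# Assumed API sketch, not the PR's current build_heterogenous entry point:
# a single nnvm.compiler.build call taking one target string per device type.
import nnvm.symbol as sym
import nnvm.compiler

data = sym.Variable("data")
net = sym.relu(sym.conv2d(data, channels=8, kernel_size=(3, 3), padding=(1, 1)))

targets = {"cpu": "llvm", "opencl": "opencl"}   # one target per device involved
graph, lib, params = nnvm.compiler.build(
    net, target=targets, shape={"data": (1, 3, 32, 32)})
```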

4. Runtime

The runtime allocates memory for each node on its assigned device. Instead of handling only tvm_op, it also recognizes the copy op and performs the data copy.
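
As a sketch of how this could be driven from Python: passing a list of contexts to graph_runtime.create is an assumption here, not necessarily the final interface of this PR, and `graph`, `lib`, and `params` are taken from the build sketch above.

```python
# Assumed usage sketch: one context per device that appears in the annotated graph.
import numpy as np
import tvm
from tvm.contrib import graph_runtime

ctxs = [tvm.cpu(0), tvm.opencl(0)]
module = graph_runtime.create(graph, lib, ctxs)   # assumption: create accepts a list
module.set_input("data", np.random.uniform(size=(1, 3, 32, 32)).astype("float32"))
module.set_input(**params)
module.run()
out = module.get_output(0)
```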

5. TODO

  1. So far I have tested this on a MacBook with an Intel Graphics processor and a CPU, running SSD (resnet50 + multibox + nms) with nms scheduled to the CPU and everything else on the GPU. The data-transfer overhead appears to be extremely high and needs more investigation.
  2. The current default device is kDLOPENCL, and users can assign operators to the CPU from the Python API. We could instead have a config file that lists all supported operators and specifies which operator should be scheduled to which device (see the sketch after this list). In the future, we may want a more intelligent algorithm to decide an effective placement, making annotation transparent to users.
  3. For the build, we currently have a separate build_heterogenous API because the "with tophub" context in build cannot take multiple targets. We need to figure out how to move it so that only one API is needed.
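
For TODO item 2, here is a hedged sketch of what such a config file could look like. The file name and schema are hypothetical; the point is only that operator placement is read from data rather than hard-coded in Python.

```python
# Hypothetical operator-placement config; none of these names are part of the PR.
import json

placement_config = {
    "default_device": "opencl",
    "op_device": {
        "nms": "cpu",                 # ops without a good GPU implementation
        "multibox_detection": "cpu"
    }
}

with open("op_placement.json", "w") as f:
    json.dump(placement_config, f, indent=2)

# The annotation pass would read this file and tag each graph node accordingly.
```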

remove upstream

add missing files check_computation

turn dlpack submodule local

make annotation applicable in multiple passes

intel graphics conv2d alter layout bug fixed

Handle the case where a node has two edges going into a node

Handle multiple devices (>2) and fix lint

move definition of GraphRuntime class to graphruntime.h

fix some comments
@tqchen (Member) commented Sep 6, 2018

Thanks for the effort. Let us separate this into two PRs.

  • The first PR contains the necessary runtime change, with a manually constructed JSON graph for doing CPU/GPU execution.
    • Provide a bit of documentation on what that additional part is.
  • The second PR contains the compiler components.
    • We might need a bit more discussion on the API. In particular, we really want everything to work under a single build function, possibly with options.

Some general comments on the current code:

  • Avoid touching the C FFI; always rely on the TVM PackedFunc mechanism for registering new functions.
  • DLPack should be kept the same.

@tqchen self-assigned this Sep 6, 2018
@tqchen (Member) commented Sep 6, 2018

cross ref #1242

@zhiics (Member, Author) commented Sep 6, 2018

@tqchen Thanks for the comments. For the runtime, I am a little concerned about testing if we put it in a separate PR, because it seems to me that it requires a compiled binary for heterogeneous execution. One option is to land the runtime first and then add more unit tests together with the compiler PR. Another option is to keep two separate build APIs for the compiler for now and combine them in a follow-up PR. I would probably prefer the latter so that we can have a full test. What do you think?

Don't worry, I removed the changes in dlpack. This PR shows some changes in dlpack because I pulled dlpack's master, which is one commit ahead of the one used in the TVM project. I will fix that next time.

@tqchen (Member) commented Sep 6, 2018

It is not hard to construct test cases with just the runtime change. Here are the steps:

  • Build a graph with an explicit copy, like a -> conv2d -> copy -> relu
  • Use the TVM build to build the graph
  • Amend the resulting JSON a bit by adding the device placement plan (see the sketch below)

See also https://github.com/dmlc/tvm/blob/master/tests/python/unittest/test_runtime_graph.py
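
A hedged sketch of the last step above: patch the graph JSON produced by a normal build with a per-node device plan before handing it to the runtime. The attribute name "device_index" and the variable `graph_json` are assumptions for illustration, not a settled convention.

```python
# Sketch only: amend a built graph's JSON with a device placement plan.
# `graph_json` is assumed to be the JSON string from a normal TVM/NNVM build.
import json

graph = json.loads(graph_json)

# One device-type code per node, aligned with graph["nodes"] (e.g. 1 = CPU, 4 = OpenCL).
device_plan = [1, 1, 4, 4]
graph["attrs"]["device_index"] = ["list_int", device_plan]

patched_json = json.dumps(graph)   # feed this to the heterogeneous runtime
```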

@tqchen (Member) commented Sep 6, 2018

The main reason we want to have the runtime change first is that it is the minimum thing we can do, and we should do it cleanly and agree on a convention. We always favor minimal and elegant improvements, especially in the core runtime.

@zhiics (Member, Author) commented Sep 6, 2018

Thanks. I totally understand why we want separate PRs. I wasn't aware of this way of testing; let me take a look at it.

@zhiics mentioned this pull request Sep 7, 2018
@zhiics (Member, Author) commented Sep 7, 2018

@tqchen Created a separate PR (#1695) for the runtime part. PTAL when you have time. Thanks.
