
Conversation

@zhiics (Member) commented Sep 5, 2018

Background

The current TVM runtime takes the entire computation graph and executes it iteratively under the assumption that all operators perform equally well on the same backend. In reality, however, different backends may provide their own libraries of highly optimized operators. These operators are usually tuned for a limited set of target devices, which can make it difficult for TVM to execute a whole computation graph efficiently on a single device. This PR proposes a mechanism to support heterogeneous execution in TVM when multiple processors are present on the same hardware.

Major steps

1. Context/device annotation

Annotation is implemented as a separate pass that can be invoked multiple times, before and after other passes (e.g. InferShape/InferType) when necessary.
Nodes are annotated with target information in this pass. Annotation also lets us 1) extract the target right before certain optimizations (e.g. altering layout) and/or compute, which removes the "with target" scope from the build stage, and 2) prevent operator fusion across targets.
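
To make the placement idea concrete, here is a minimal, self-contained sketch. It models the graph as plain Python data and uses DLPack device-type codes; the `annotate` helper and the node list are illustrative only and are not the NNVM pass added by this PR.

```python
# Illustrative only: a toy stand-in for the annotation pass, not the pass in this PR.
# Device-type codes follow DLPack (kDLCPU = 1, kDLOpenCL = 4).
CPU, OPENCL = 1, 4

def annotate(nodes, cpu_ops, default_device=OPENCL):
    """Return {node_name: device_type}: ops listed in cpu_ops go to the CPU,
    everything else stays on the default device."""
    return {name: (CPU if op in cpu_ops else default_device) for name, op in nodes}

# Example: schedule nms on the CPU, keep conv2d and multibox on the GPU (OpenCL).
graph_nodes = [("data", "null"), ("conv0", "conv2d"),
               ("boxes", "multibox_prior"), ("out", "nms")]
print(annotate(graph_nodes, cpu_ops={"nms"}))
```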

2. Copy node insertion

A copy node is a special node inserted into the annotated graph wherever data crosses a device boundary. These nodes are very lightweight: they do not need to be lowered, and no fcompute/fschedule is required. The runtime detects this op and performs the data copy directly.
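
As a hedged illustration of what the runtime does when it reaches such a node, the copy is just an NDArray-to-NDArray transfer using the public tvm.nd API. The op name "__copy" below is an assumption for illustration; the actual convention is defined by this PR.

```python
# Sketch only: how a runtime loop could special-case a copy node.
# The op name "__copy" is assumed for illustration.
import numpy as np
import tvm

def run_node(op_name, inputs, outputs, compiled_funcs=None):
    if op_name == "__copy":
        # No lowered kernel, no fcompute/fschedule needed: just move the data.
        inputs[0].copyto(outputs[0])
    else:
        # Normal tvm_op path: call the compiled (fused) function for this node.
        compiled_funcs[op_name](*inputs, *outputs)

# Example (requires an OpenCL device): copy a tensor from CPU to OpenCL memory.
x_cpu = tvm.nd.array(np.ones((2, 2), dtype="float32"), tvm.cpu(0))
x_dev = tvm.nd.empty((2, 2), "float32", tvm.opencl(0))
run_node("__copy", [x_cpu], [x_dev])
```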

3. Compilation

TVM compiles the annotated graph into multiple binaries, one per target device needed. Each binary contains all the compiled (fused) operators for a specific target. We might be able to use just one binary, but I haven't investigated that yet. In any case, only one JSON file and one param file are generated.
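
The exact compiler API is still under discussion in this thread (see the comments below asking for a single build function). Purely as an assumption of how that could look, a single build call might take one target per device type; the dict-valued `target` below is not the API this PR currently implements.

```python
# Assumed API sketch, not the PR's current build_heterogenous entry point:
# a single nnvm.compiler.build call taking one target string per device type.
import nnvm.symbol as sym
import nnvm.compiler

data = sym.Variable("data")
net = sym.relu(sym.conv2d(data, channels=8, kernel_size=(3, 3), padding=(1, 1)))

targets = {"cpu": "llvm", "opencl": "opencl"}   # one target per device involved
graph, lib, params = nnvm.compiler.build(
    net, target=targets, shape={"data": (1, 3, 32, 32)})
```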

4. Runtime

The runtime allocates memory for each node on its assigned device. Instead of handling only tvm_op, it also recognizes the copy op and performs the data copy.
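
As a sketch of how this could be driven from Python: passing a list of contexts to graph_runtime.create is an assumption here, not necessarily the final interface of this PR, and `graph`, `lib`, and `params` are taken from the build sketch above.

```python
# Assumed usage sketch: one context per device that appears in the annotated graph.
import numpy as np
import tvm
from tvm.contrib import graph_runtime

ctxs = [tvm.cpu(0), tvm.opencl(0)]
module = graph_runtime.create(graph, lib, ctxs)   # assumption: create accepts a list
module.set_input("data", np.random.uniform(size=(1, 3, 32, 32)).astype("float32"))
module.set_input(**params)
module.run()
out = module.get_output(0)
```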

5. TODO

  1. So far I have tested this on a MacBook with an Intel Graphics processor and a CPU, running SSD (resnet50 + multibox + nms) with nms scheduled to the CPU and everything else on the GPU. The data-transfer overhead appears to be extremely high and needs more investigation.
  2. The current default device is kDLOPENCL, and users can assign operators to the CPU from the Python API. We could instead have a config file that lists all supported operators and specifies which operator should be scheduled to which device (see the sketch after this list). In the future, we may want a more intelligent algorithm to decide an effective placement, making annotation transparent to users.
  3. For the build, we currently have a separate build_heterogenous API because the "with tophub" context in build cannot take multiple targets. We need to figure out how to move it so that only one API is needed.
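
For TODO item 2, here is a hedged sketch of what such a config file could look like. The file name and schema are hypothetical; the point is only that operator placement is read from data rather than hard-coded in Python.

```python
# Hypothetical operator-placement config; none of these names are part of the PR.
import json

placement_config = {
    "default_device": "opencl",
    "op_device": {
        "nms": "cpu",                 # ops without a good GPU implementation
        "multibox_detection": "cpu"
    }
}

with open("op_placement.json", "w") as f:
    json.dump(placement_config, f, indent=2)

# The annotation pass would read this file and tag each graph node accordingly.
```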

remove upstream

add missing files check_computation

turn dlpack submodule local

make annotation applicable in multiple passes

intel graphics conv2d alter layout bug fixed

Handle the case where a node has two edges going into a node

Handle multiple devices (>2) and fix lint

move definition of GraphRuntime class to graphruntime.h

fix some comments
@tqchen (Member) commented Sep 6, 2018

Thanks for the effort. Let us separate this into two PRs.

  • The first PR contains the necessary runtime change, with a manually constructed JSON graph for doing CPU/GPU execution.
    • Provide a bit of documentation on what that additional part is.
  • The second PR contains the compiler components.
    • We might need a bit more discussion on the API. In particular, we really want everything to work under a single build function, possibly with options.

Some general comments on the current code:

  • Avoid touching the C FFI; always rely on the TVM PackedFunc mechanism for registering new functions.
  • DLPack should be kept the same.

@tqchen self-assigned this Sep 6, 2018
@tqchen (Member) commented Sep 6, 2018

cross ref #1242

@zhiics (Member, Author) commented Sep 6, 2018

@tqchen Thanks for the comments. For the runtime, I am a little concerned about testing if we put it in a separate PR, because it seems to me that it requires a compiled binary for heterogeneous execution. One option is to land the runtime first and then add more unit tests together with the compiler PR. Another option is to keep two separate build APIs for the compiler for now and combine them in a follow-up PR. I would probably prefer the latter so that we can have a full test. What do you think?

Don't worry, I removed the changes in dlpack. This PR shows some changes in dlpack because I pulled dlpack's master, which is one commit ahead of the one used in the TVM project. I will fix that next time.

@tqchen (Member) commented Sep 6, 2018

It is not hard to construct test cases with just the runtime change. Here are the steps:

  • Build a graph with an explicit copy, like a -> conv2d -> copy -> relu
  • Use the TVM build to build the graph
  • Amend the resulting JSON a bit by adding the device placement plan (see the sketch below)

See also https://github.com/dmlc/tvm/blob/master/tests/python/unittest/test_runtime_graph.py
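
A hedged sketch of the last step above: patch the graph JSON produced by a normal build with a per-node device plan before handing it to the runtime. The attribute name "device_index" and the variable `graph_json` are assumptions for illustration, not a settled convention.

```python
# Sketch only: amend a built graph's JSON with a device placement plan.
# `graph_json` is assumed to be the JSON string from a normal TVM/NNVM build.
import json

graph = json.loads(graph_json)

# One device-type code per node, aligned with graph["nodes"] (e.g. 1 = CPU, 4 = OpenCL).
device_plan = [1, 1, 4, 4]
graph["attrs"]["device_index"] = ["list_int", device_plan]

patched_json = json.dumps(graph)   # feed this to the heterogeneous runtime
```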

@tqchen (Member) commented Sep 6, 2018

The main reason we want to have the runtime change first is that it is the minimum thing we can do, and we should do it cleanly and agree on a convention. We always favor minimal and elegant improvements, especially in the core runtime.

@zhiics (Member, Author) commented Sep 6, 2018

Thanks. I totally understand why we want separate PRs. I wasn't aware of this way of testing; let me take a look at it.

@zhiics mentioned this pull request Sep 7, 2018
@zhiics (Member, Author) commented Sep 7, 2018

@tqchen Created a separate PR (#1695) for the runtime part. PTAL when you have time. Thanks.
