[WIP] Heterogeneous execution of TVM #1688
Conversation
Commits:
- remove upstream
- add missing files
- check_computation
- turn dlpack submodule local
- make annotation applicable in multiple passes
- intel graphics conv2d alter layout bug fixed
- Handle the case where a node has two edges going into a node
- Handle multiple devices (>2) and fix lint
- move definition of GraphRuntime class to graphruntime.h
- fix some comments
Thanks for the effort. Let us separate this into two PRs.
Some general comments on the current code:
cross ref #1242
@tqchen Thanks for the comment. For the runtime, I am a little concerned about testing if we put it in a separate PR, because it seems to require the compiled binary for heterogeneous execution. One option is to land the runtime first and add more unit tests together with the compiler PR. Another option is to keep two separate compiler (build) APIs for now and combine them in a follow-up PR. I would prefer the latter so we can have a full test. What do you think? Don't worry, I removed the changes in dlpack. This PR shows some changes in dlpack only because I pulled its master branch, which is one commit ahead of the one used in the TVM project. I will fix that next time.
It is not hard to construct test cases with just the runtime change: instead of generating the graph with the compiler, write the graph JSON by hand and feed it to the runtime directly (see the sketch below).
see also https://github.com/dmlc/tvm/blob/master/tests/python/unittest/test_runtime_graph.py
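For illustration, here is a condensed sketch of that approach, modeled on the linked test. The operator, its "myadd" function name, and the shapes are placeholders, and it uses the 2018-era TVM Python API:

```python
import json
import numpy as np
import tvm
from tvm.contrib import graph_runtime

# Build a trivial one-operator module ("myadd" is a placeholder name).
n = 4
A = tvm.placeholder((n,), name="A")
B = tvm.compute(A.shape, lambda i: A[i] + 1.0, name="B")
s = tvm.create_schedule(B.op)
mlib = tvm.build(s, [A, B], "llvm", name="myadd")

# Hand-written graph JSON: one input node feeding one tvm_op node,
# so no compiler pass is involved at all.
graph = json.dumps({
    "nodes": [
        {"op": "null", "name": "x", "inputs": []},
        {"op": "tvm_op", "name": "add", "inputs": [[0, 0, 0]],
         "attrs": {"func_name": "myadd", "flatten_data": "1",
                   "num_inputs": "1", "num_outputs": "1"}},
    ],
    "arg_nodes": [0],
    "node_row_ptr": [0, 1, 2],
    "heads": [[1, 0, 0]],
    "attrs": {"shape": ["list_shape", [[n], [n]]],
              "dltype": ["list_str", ["float32", "float32"]],
              "storage_id": ["list_int", [0, 1]]},
})

mod = graph_runtime.create(graph, mlib, tvm.cpu(0))
x = np.random.uniform(size=(n,)).astype("float32")
mod.run(x=tvm.nd.array(x))
out = mod.get_output(0, tvm.nd.empty((n,))).asnumpy()
np.testing.assert_allclose(out, x + 1)
```

A heterogeneous variant of this test would add a second context and a copy node to the hand-written JSON.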
The main reason we want to land the runtime change first is that it is the minimum change we can make, and we should do it cleanly and agree on a convention. We always favor minimal and elegant improvements, especially in the core runtime.
Thanks. I totally understand why we want separate PRs. I wasn't aware of this way of testing; let me take a look at it.
Background
The current TVM runtime takes the entire computation graph and executes it iteratively, under the assumption that all operators are equally performant on the same backend. In reality, however, different backends provide their own libraries with highly optimized implementations of various operators. These implementations usually target a limited set of devices, which can make it difficult to execute a whole computation graph efficiently on a single backend in TVM. This PR proposes a mechanism to support heterogeneous execution in TVM when multiple processors are present in the same system.
Major steps
1. Context/device annotation
Annotation is implemented as a separate pass that can be invoked multiple times, before and after other passes (such as infer shape/type) when necessary.
Nodes are annotated with target information in this pass. This lets us 1) extract the target right before certain optimizations (e.g., altering layout) and/or compute declarations, which removes the "with target" scope in the build stage, and 2) prevent operator fusion across targets. A sketch of the idea is given below.
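As a purely illustrative sketch (the `device_type` attribute key, the helper names, and the fallback policy are assumptions for illustration, not this PR's actual names), the pass walks the graph's nodes and tags each one with a device type; a fusion pass can then refuse to fuse nodes whose tags differ:

```python
# DLPack device type codes for CPU and GPU.
KDL_CPU, KDL_GPU = 1, 2

def annotate_device(nodes, op_device, fallback=KDL_CPU):
    """Tag every graph node with the device its operator should run on."""
    for node in nodes:
        attrs = node.setdefault("attrs", {})
        attrs["device_type"] = str(op_device.get(node["name"], fallback))
    return nodes

def fusable(a, b):
    """Cross-target fusion is rejected: only same-device nodes may fuse."""
    return a["attrs"]["device_type"] == b["attrs"]["device_type"]

# E.g., run the conv2d on the GPU while everything else stays on the CPU.
nodes = [
    {"op": "null", "name": "data", "inputs": []},
    {"op": "tvm_op", "name": "conv2d0", "inputs": [[0, 0, 0]]},
    {"op": "tvm_op", "name": "relu0", "inputs": [[1, 0, 0]]},
]
nodes = annotate_device(nodes, {"conv2d0": KDL_GPU})
assert not fusable(nodes[1], nodes[2])  # conv2d0 (GPU) vs. relu0 (CPU)
```

Because a pass like this only reads and writes node attributes, it can be re-run after passes such as layout alteration without invalidating earlier annotations.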
2. Copy node insertion
The copy node is a special node inserted into the annotated graph wherever data crosses a device boundary. These nodes are very lightweight: they do not need to be lowered, and no fcompute/fschedule is required. The runtime detects this op and performs the data copy directly, for example:
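In terms of TVM's public NDArray API, the effect of a copy node at runtime is simply the following; the runtime performs the equivalent copy internally, and the exact mechanics in this PR may differ.

```python
import numpy as np
import tvm

# The copy op moves a tensor between contexts; nothing is lowered,
# computed, or scheduled for it. (Requires a GPU-enabled build.)
x_cpu = tvm.nd.array(np.random.rand(3, 4).astype("float32"), tvm.cpu(0))
x_gpu = x_cpu.copyto(tvm.gpu(0))   # host -> device copy
y_cpu = x_gpu.copyto(tvm.cpu(0))   # device -> host copy
np.testing.assert_allclose(y_cpu.asnumpy(), x_cpu.asnumpy())
```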
3. Compilation
TVM compiles the annotated graph into multiple binaries, one per target device. Each binary contains all the compiled (fused) operators for a specific target. We might be able to use a single binary, but I have not investigated that yet. In any case, only one json file and one param file are generated (see the interface sketch below).
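For a sense of the interface only: the dict-valued target and the returned per-target modules below are guesses at the `build_heterogeneous` entry point mentioned in the TODO, not a confirmed signature.

```python
# Hypothetical sketch: compile the annotated graph once per target,
# yielding one module per device but a single graph json and a single
# params dict, as described above.
graph_json, libs, params = build_heterogeneous(
    annotated_graph,
    target={"cpu": "llvm", "gpu": "cuda"},
    shape=shape_dict,  # assumed: input name -> shape mapping
)
# libs would hold e.g. {"cpu": <module with all fused CPU ops>,
#                       "gpu": <module with all fused GPU ops>}
```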
4. Runtime
The runtime allocates memory for each node on the device it was annotated with. Instead of handling only tvm_op, it also identifies the copy op and performs the data copy. A possible entry point is sketched below.
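One possible shape for the runtime entry point (hypothetical; the actual API in this PR may differ) is to accept one context per device type that appears in the annotated graph, continuing the sketch above:

```python
import tvm
from tvm.contrib import graph_runtime

# Hypothetical: one context per device type used in the graph. The
# runtime allocates each node's storage on the context matching its
# device annotation and executes copy ops between contexts.
ctxs = [tvm.cpu(0), tvm.gpu(0)]
mod = graph_runtime.create(graph_json, libs, ctxs)
mod.set_input(**params)
mod.run()
```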
5. TODO
Remove the separate `build_heterogeneous` API. It exists because there is a "with tophub" context in `build` that cannot take multiple targets. Need to figure out how to move it so that only one API is needed.