[RUNTIME][TRACE] Support trace primitive. #1973
Conversation
(force-pushed from eef6670 to 5bedd3e)
|
Thanks for the contribution. We can likely simplify this further by using TVM's runtime system for callbacks. TVM supports generated code calling back into any PackedFunc (including ones defined from Python). Here is a rough example; it is only meant to demonstrate how the lowered callback works. Some future wrapping around this mechanism might make the trace API easier to use.

```python
import tvm
import numpy as np

@tvm.register_func
def my_debug(x, i):
    print("array=", x.asnumpy())
    return 0

x = tvm.placeholder((4,), name="x", dtype="int32")
xbuffer = tvm.decl_buffer(x.shape, dtype=x.dtype)
y = tvm.compute(x.shape, lambda i: tvm.call_packed("my_debug", xbuffer, i))
s = tvm.create_schedule(y.op)
print(tvm.lower(s, [x, y], binds={x: xbuffer}, simple_mode=True))
f = tvm.build(s, [xbuffer, y], binds={x: xbuffer})

xnd = tvm.nd.array(np.ones((4,), dtype=x.dtype))
ynd = tvm.nd.array(np.zeros((4,), dtype=y.dtype))
f(xnd, ynd)
```

In the above case, call_packed invokes my_debug in every iteration, passing the array handle (xbuffer). Some further wrapping could make it look nicer. The good part is that we can use any piece of Python code to monitor the content of x. |
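The callback mechanism above can also be modeled without TVM at all. The sketch below is a pure-Python analogue; the names `REGISTRY`, `register_func`, and `call_packed` here are hypothetical stand-ins for TVM's global function registry and lowered packed call, not the actual TVM implementation.

```python
import numpy as np

# Hypothetical stand-in for TVM's global PackedFunc registry.
REGISTRY = {}

def register_func(func):
    # Analogue of @tvm.register_func: store the callback under its name.
    REGISTRY[func.__name__] = func
    return func

def call_packed(name, *args):
    # Analogue of tvm.call_packed: look the callback up by name and invoke it.
    return REGISTRY[name](*args)

@register_func
def my_debug(x, i):
    # Any Python code can inspect the buffer content here.
    print("array=", x, "i=", i)
    return 0

x = np.ones((4,), dtype="int32")
# Generated code would emit one such callback per loop iteration.
for i in range(x.shape[0]):
    call_packed("my_debug", x, i)
```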
|
Hi @tqchen thanks for the answer. |
|
This is a good question :) The current principle is: if some API is stabilized and needed across languages, we bring it to C++. If something is in flux and not yet critical to development, we keep it in the language bindings (e.g. Python), let it play out, and wait until it stabilizes. Admittedly there are tradeoffs: having everything in C++ makes the API available to all languages, but it is also not necessarily flexible. For example, in the case of monitoring, printing things out to a log may not be as flexible as defining a customized callback and letting the user do what they want. Having some of these things in Python (for example, most of our test cases are) speeds up the development process and allows us to write more test cases easily. In both cases, we should document it clearly, so it would be really nice to have a tutorial teaching users all the possible ways to do tracing. |
|
As a side note, in this particular case, the my_debug function can be registered in C++ as well, because the runtime system allows interop between languages. |
|
@tqchen thanks for the answer. For example, the "sanitizers" from llvm/compiler-rt have two parts:
The library can be built as a separate DSO and would require manually linking the model against it. |
|
The current features could likely be implemented using packed function calls: define a set of tracing functions as you do, but declare them as packed functions. Then we could optionally register these runtime functions as packed functions, or register different variants depending on whether tracing is on/off. So to summarize, I agree that tracing needs runtime support, but these runtime features could likely be built around the TVM packed function system. |
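That registration-time switch can be sketched in plain Python. All names below (`set_tracing`, `trace_on`, `trace_off`, and the registry dict) are hypothetical illustrations, not TVM APIs: registering either a real tracing variant or a no-op variant under the same well-known name lets the generated code stay unchanged whether tracing is on or off.

```python
# Hypothetical stand-in for TVM's global function registry.
REGISTRY = {}

def trace_on(name, index, value):
    # Tracing variant: log and pass the value through.
    print(f"{name}{list(index)} = {value}")
    return value

def trace_off(name, index, value):
    # No-op variant: tracing disabled, just pass the value through.
    return value

def set_tracing(enabled):
    # Register one variant or the other under the same well-known name,
    # so generated code always calls "tvm.trace_callback" unchanged.
    REGISTRY["tvm.trace_callback"] = trace_on if enabled else trace_off

set_tracing(True)
REGISTRY["tvm.trace_callback"]("A", (1, 2), 7)
```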
|
@tqchen got it now, thanks. |
|
@denis0x0D I'm considering moving the IR dump to the C++ side. |
(force-pushed from 5bedd3e to 6abdbcb)
|
@tqchen the strange thing I've faced with packed functions: when I increase the shape of the buffer, I get a segmentation fault on 3 different platforms, and I spent some time debugging it. The segfault happens in the hash table that registers the packed functions. Maybe I'm missing something, or maybe it is a bug; it would be good to get confirmation from someone else. I also have a backtrace of the error and test cases in C++. |
(force-pushed from 804592d to 89e6a89)
|
@tqchen the patch is updated, tests are added, and the Python API is added alongside the C++ API. |
(force-pushed from 89e6a89 to f965a1d)
|
Thanks for the updates; the new set of changes makes use of the packed_func runtime and reduces the runtime code greatly. I am thinking about how to simplify this even further, considering most users' use cases. Most of my reasoning comes from the fact that this is still a new feature, and we may not have settled on what API users might prefer, so maybe it is a good idea to start with quick mocks on the Python side before we commit to heavy engineering that could become technical debt. Here are some specific comments:
Here is a mock-up version of the code:

```python
from tvm.contrib import trace
tvm.register_func("tvm.trace_callback", trace.log_to_file("mytrace.log"))
# run the trace code
```
|
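One way trace.log_to_file could be implemented is as a closure factory: it returns a callback with the signature the trace hook expects and appends each traced value to the given file. This is only a sketch of the mocked-up API; the record format and the `(name, index, value)` signature are assumptions, not the actual tvm.contrib.trace code.

```python
def log_to_file(path):
    # Return a trace callback bound to `path` (sketch of the mocked-up API).
    def callback(name, index, value):
        # Append one human-readable record per traced element,
        # then pass the value through unchanged.
        with open(path, "a") as f:
            f.write(f"{name}{list(index)} = {value}\n")
        return value
    return callback

cb = log_to_file("mytrace.log")
cb("A", (1, 4, 5), 42)
```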
|
@tqchen thanks for review, I'll update it. |
(force-pushed from a96246b to f23b0e3)
|
@tqchen updated according to the latest comments. Thanks. |
|
Sorry for the delayed review, I am just recovering from NeurIPS and the TVM conference.
@grwlf please also comment on this. |
|
Let us double check the implementation of call_packed_lowered to make sure that the return value of the trace function is lowered correctly |
|
@tqchen should I add more tests? |
|
@denis0x0D never mind, I checked the current code and it should support return value correctly. |
|
Thanks @denis0x0D for the comment; the current logic appears to be correct. However, it is nevertheless not fast, because every time we have to look up the callback function by name. Here is a possibly better strategy, which is not that different from the current one:
The advantage of this approach is that we reuse the packed function caching mechanism, so there is no repetitive lookup of packed functions; it does add special logic to the code generator block to support tracing (mainly for returning the correct value)
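The caching idea can be illustrated with a small Python mock. The names `get_packed_func`, `REGISTRY`, and the lookup counter are hypothetical; in TVM the cache would be a per-module function handle filled in on first use. The point is that the expensive by-name registry lookup happens once, and every later trace call in the loop reuses the cached handle.

```python
# Hypothetical global registry holding the trace callback.
REGISTRY = {"tvm.trace_callback": lambda name, index, value: value}

_cache = {}             # per-module handle cache
lookups = {"count": 0}  # counts expensive registry queries

def get_packed_func(name):
    # Resolve a callback by name, paying the registry lookup only once.
    if name not in _cache:
        lookups["count"] += 1
        _cache[name] = REGISTRY[name]
    return _cache[name]

# A traced loop resolves the callback on the first iteration only.
for i in range(100):
    get_packed_func("tvm.trace_callback")("A", (i,), i)

print(lookups["count"])  # the registry was queried exactly once
```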
tqchen
left a comment
see my latest comment
src/runtime/trace.cc
Outdated
```cpp
// Check the global registry for the user defined callback.
auto pf = runtime::Registry::Get(str);
if (pf) {
  if (size == 2) {
```
no switch is necessary, you can use pf->CallPacked
this comment may no longer apply given my latest comment
|
@tqchen thanks for the review. I agree; your strategy is faster, avoiding the additional lookup in the global registry. |
(force-pushed from 00800b2 to a3d6a34)
|
@yzhliu @grwlf Please take another round of look |
|
Looks good to me. My use-cases all work. |
|
@denis0x0D thanks for the change. The logic can be improved further; here are some parts that can be improved |
Here is one possible way to improve the current logic.
|
(force-pushed from a3d6a34 to f1d69fc)
|
@tqchen thanks for the review; sorry for the late response, I was on holiday.
Tests have also been added for that functionality. |
tqchen
left a comment
Good improvements, some final comments
src/codegen/llvm/codegen_cpu.cc
Outdated
```cpp
llvm::Value *CodeGenCPU::CreateCallTracePacked(const Call *op) {
  using llvm::BasicBlock;
  CHECK_EQ(op->args.size(), 6U);
  std::string func_name = op->args[0].as<StringImm>()->value;
```
Please merge the shared part of this logic with the existing CallPacked codegen.
In particular, we can likely do things like MakeCallPacked(args, tcode, begin, end)
Thanks, done. I also added some additional arguments (rvalue, r_type) and made it return the basic block, so it's easy to create a phi node in this case.
src/codegen/llvm/codegen_cpu.cc
Outdated
```cpp
llvm::Value *cmp;
llvm::Type *rvalue_type = rvalue->getType();
// The cmp instruction depends on processing type.
if (rvalue_type->isFloatTy() || rvalue_type->isDoubleTy() ||
```
Do not check the return value type; instead check whether the returned tcode == kNull (Python code not returning anything).
Thanks for mentioning tcode. I spent a couple of hours trying to understand how to check the return value, and didn't find a better solution than returning 0 from the callback when we want to keep the original value, and returning the user-defined value otherwise; but the right way to do it is by checking the tcode.
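The tcode check can be modeled in plain Python, with None playing the role of kNull. The function `run_traced` and the two callbacks below are hypothetical illustrations, not the generated code: if the callback returns nothing, the traced expression keeps its original value; otherwise the callback's return value replaces it.

```python
def run_traced(callback, name, index, original_value):
    # Model of the lowered trace call's return handling: in the generated
    # code the branch is on the returned TVM type code (tcode == kNull
    # means "callback returned nothing"); here Python's None plays that role.
    ret = callback(name, index, original_value)
    return original_value if ret is None else ret

def printing_cb(name, index, value):
    # Pure observer: prints and implicitly returns None,
    # so the original value is kept.
    print(f"{name}{list(index)} = {value}")

def doubling_cb(name, index, value):
    # Returns a replacement value for the traced expression.
    return value * 2

print(run_traced(printing_cb, "A", (1,), 10))  # keeps 10
print(run_traced(doubling_cb, "A", (1,), 10))  # replaced by 20
```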
(force-pushed from f1d69fc to 01ac54e)
Add trace call expr to allow tracing Tensor data at runtime. By default the default handler is enabled, which prints the tracing data to stdout; otherwise the user should specify a callback as a global function (aka @tvm.register_func). The issue is related to: https://discuss.tvm.ai/t/idea-trace-expression/945
(force-pushed from 01ac54e to f813889)
|
@tqchen thanks for the review; I've updated the patch according to the latest comments. |
|
Thanks @denis0x0D @grwlf @yzhliu! This is now merged! |
The patch is related to issue:
https://discuss.tvm.ai/t/idea-trace-expression/945
This patch allows tracing Tensor data at runtime.
Use case example:
1. Write the model:

```cpp
...
tvm::Array<tvm::Expr> shape = {n, n, n};
tvm::Tensor C = tvm::compute(shape, [&](tvm::Var i, tvm::Var j, tvm::Var k) {
  return tvm::trace("A", {i, j, k}, A[i][j][k]) + B[i][j][k];
});
...
```

2. Run the model with specific data. The output is:

```
...
A[1][4][5] = $actual_tensor_value
...
```

How it works:
tvm::trace adds an ir::Call expr with type External, which is then lowered by the LLVM codegen to an external call, so runtime support is added. The runtime processes the added function, expanding its arguments via va_arg, and writes the specified data to a file; by default this is stdout, but a log file can be specified via the environment variable TVM_TRACE, and the tracing process can be enabled/disabled.
Example:
export TVM_TRACE=trace_log_file=path/to/file:process_tracing=1
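The TVM_TRACE value above is a colon-separated list of key=value options. A parser for that format might look like the following sketch (`parse_tvm_trace` is a hypothetical helper; the actual runtime parser may differ):

```python
import os

def parse_tvm_trace(value):
    # Split "key=value:key=value" into a dict of option strings.
    opts = {}
    for part in value.split(":"):
        key, _, val = part.partition("=")
        opts[key] = val
    return opts

os.environ["TVM_TRACE"] = "trace_log_file=path/to/file:process_tracing=1"
opts = parse_tvm_trace(os.environ["TVM_TRACE"])
print(opts)  # {'trace_log_file': 'path/to/file', 'process_tracing': '1'}
```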