
Implementation of TVM codegen library #888

Merged
ke1337 merged 2 commits into master from gt/Nuphar_non_EP on Jul 3, 2019
Conversation

ke1337 (Contributor) commented Apr 23, 2019:

This is a subset of changes in #881. Splitting it to make review easier per reviewers' request.

@ke1337 ke1337 requested a review from a team as a code owner April 23, 2019 21:40
@ke1337 ke1337 force-pushed the gt/Nuphar_non_EP branch from 451355c to 306d137 Compare June 13, 2019 19:29
if (it->second == ScheduleType::ScheduleClosure)
return false;
else if (it->second == ScheduleType::ScheduleRoot) {
return false;
weixingzhang (Contributor) commented May 29, 2019:

Can the if and else if clauses be merged? #Resolved

ke1337 (author) replied:

Will do



snnn (Contributor) commented Jun 17, 2019:

Usually, if you have

if (...) {
  ...
  return xxxx;
} else {
  ...
  return yyyy;
}

the 'else' keyword can be removed.

See: https://clang.llvm.org/extra/clang-tidy/checks/readability-else-after-return.html #Resolved


bool TryGetConstantInput(int input_index, const Tensor** constant_input_value) const;

bool TryGetConstantInput(const std::string& name, const Tensor** constant_input_value) const;
souptc (Member) commented Jun 17, 2019:

const std::string& name,

Why do you want to look up by name? We should already have the name-to-index mapping in session_state; all the APIs in OpKernel are index based, and comparing strings will hurt perf. #ByDesign

ke1337 (author) replied Jun 17, 2019:

This is only called in the kernel constructor, so the perf impact is small. The reason to use the name is that there are initializers in a subgraph that are not listed as the subgraph's inputs, and looking up by name makes sure they can still be found.

// including 1) IR builder passes
// 2) Weight layout transformer passes
// 3) Scheduler passes, etc.

souptc (Member) commented Jun 17, 2019:

@KeZhang, many components under this "codegen/common" folder, like registry/profile/dispatcher, seem like they should be task-transparent components shared with the runtime. It feels a little weird to have a "registry" type in the runtime codebase that is not used anywhere else. #WontFix

ke1337 (author) replied:

We had some discussion about this before. To minimize changes to the framework, these utils are kept in codegen/common.

public:
RegistryBase() = default;

bool Contains(const std::string& name) const {
souptc (Member) commented Jun 17, 2019:

const std::string& name

If we hardcode the key type to string, how do you handle the other cases, like the op's domain/version? I understand that for the initial implementation we don't need to consider version/domain, but after you go to production you will hit this issue sooner or later. #ByDesign

ke1337 (author) replied:

Good suggestion. For this generic class, the string key could encode anything; we can expand the key to include all necessary info.

ke1337 (author) replied:

Currently opset and domain are handled inside the op; see codegen/common/target/op_ir_creator/slice.cc as an example. If it's not critical, let's revisit the design in later PRs.

int64_t TotalSize(const std::vector<int64_t>& shape);

template <typename T1, typename T2>
void DumpArrayRecursive(const T1* data, int64_t& data_offset, const std::vector<T2>& shape, int idx) {
souptc (Member) commented Jun 17, 2019:

DumpArrayRecursive

It looks more suitable to put this in a debug folder. #Resolved

ke1337 (author) replied Jun 26, 2019:

Putting it in dump_array.h.

if (tvm::runtime::TypeMatch(dtype, kDLFloat, 32)) {
float* data = reinterpret_cast<float*>(static_cast<char*>(X->data) + X->byte_offset);
DumpArray("float tensor:", data, shape);
} else if (tvm::runtime::TypeMatch(dtype, kDLInt, 8)) {
souptc (Member) commented Jun 17, 2019:

This print function only works on CPU, right? I would prefer to put it under nuphar, to avoid the case where people think it is target independent and abuse it. #ByDesign

liwchang (Contributor) commented Jun 17, 2019:

It is for stdio printf support only. FPGA probably won't need it; CUDA and ARM CPU might still have it. #Resolved

ke1337 (author) replied:

This is for any CPU device, like ARM/X64/PPC. It is target independent among CPUs, though not applicable to all targets.

Member replied:

Please document that this debug tool is for CPU only.

// Gemm Convolution
const int64_t* batch_size = tvm::as_const_int(input->shape[0]);
if (batch_size != nullptr && *batch_size == 1)
return Conv2D_gemm(input, filter, output_shape, stride, padding);
souptc (Member) commented Jun 17, 2019:

I don't think we should put the conv2d_gemm here. It is efficient for some platforms, but for others, like BrainSlice, it can't be scheduled this way. You can keep this customized lowering in nuphar and register it during nuphar's graph iteration, but don't put it in as a generic conversion. #Resolved

ke1337 (author) replied:

I think there's value in sharing the GEMM-based conv2d implementation between ARM/X86?

Member replied:

But if I want to use this mti lib to convert Conv2D to tvm for any non-CPU target, I don't expect to get this GEMM-based conv expression.
Yes, it could be shared between several CPU architectures, but I feel we need some clarity on this 'target-independent' mti lib: does 'target-independent' mean it works with different devices, or just different CPUs?

ke1337 (author) replied:

Fixed by removing Conv2D_gemm from Conv2D



Status Evaluate(
const tvm::Tensor& tensor,
const Node* node,
CodeGenContext& ctx,
souptc (Member) commented Jun 17, 2019:

CodeGenContext& ctx

As we talked about before, if we use the generic IR building in mti today, this CodeGenContext in the signature is useless; I had to write another iteration to build the context. I think it would be better to take a customized BuildContextFunc here. #ByDesign

liwchang (Contributor) commented Jun 17, 2019:

What would hold the BuildContextFunc, the Builder or the OpIRCreator? We tried to make the builder and OpIRCreator stateless; CodeGenContext is passed in as the state. #Resolved

ke1337 (author) replied:

It is useful to handle domain/opset. See codegen/target/generic/op_ir_creator/tensor/slice.cc.



Member replied:

Yes, CodeGenContext is the state, but in the current implementation different users have different states and CodeGenContext is just a handle. If I use the generic IR builder in mti, it doesn't give me any chance to deal with that handle.

@@ -0,0 +1,186 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.
souptc (Member) commented Jun 17, 2019:

What's the meaning of "target" in this folder structure? It seems to contain a lot of TVM IR building stuff; I don't see the relationship with "target". #Closed

ke1337 (author) replied:

Target means target hardware ISAs, like ARM/x86/CUDA/etc. It is in line with the concept of a target in TVM.

Member replied:

Right, but I don't see any subfolder here for x86/arm/cuda/etc, and the contents of this folder look like IR builders / schedules for particular targets; it is very confusing.

liwchang (Contributor) commented Jun 27, 2019:

It is just the first PR. :) We will add more later. #Closed

}

outputs.push_back(Y);
return Status::OK();
souptc (Member) commented Jun 17, 2019:

So these are the implementations that iterate the ONNX graph and convert it to TVM IR, right? Do we plan to make this the standard interface for ONNX graph to TVM, or is it just nuphar's current implementation? If it is only for nuphar, it is better to move it to the nuphar folder. #ByDesign

liwchang (Contributor) commented Jun 17, 2019:

Didn't get the question. Is this related to cast, or the int8-to-bool thing? #ByDesign

ke1337 (author) replied:

tvm bool and ONNX bool are different types (1-bit vs. 8-bit), so this is general for ONNX to tvm.

Member replied:

Hmm... it looks like codeflow has a bug; it put my question in the wrong place..... :(
I just want a standard way to iterate the ONNX graph with a visitor pattern. Anyway, we can discuss it later.

Member replied:

Forget about it, a codeflow bug, I guess.

Contributor replied:

We have a way to iterate the ONNX graph, but we just don't know whether it is representative enough to be a "standard". That is why we moved it to nuphar. If it turns out to be common enough, we can move it back.


virtual Scheduler* Find(const tvm::Tensor&,
const Node*,
tvm_codegen::CodeGenContext&) = 0;
souptc (Member) commented Jun 17, 2019:

So the key to find a schedule pass is "Tensor + Node + CodeGenContext"? It looks like that is only suitable for per-op scheduling.

liwchang (Contributor) commented Jun 17, 2019:

It is not really per-op scheduling. The context can hold a graph and the corresponding analysis, and the tensor can be used as a key to fetch a previously analyzed result and apply it to the schedule. #Resolved

ke1337 (author) replied:

As Li-wen mentioned, the context can drive more complicated schedules if needed. Please check the Nuphar implementations as an example.

Member replied:

Like I mentioned in another comment, if that's the case, why list tvm::Tensor here? The context could cover everything. It makes the interface more confusing.

snnn (Contributor) commented Jun 17, 2019:

Could you please also remove this line:
https://github.com/microsoft/onnxruntime/blob/master/cmake/CMakeLists.txt#L426

#Resolved

// If a tvm::Tensor satisfies more than one TVM scheduler passes,
// the first dispatched pass will be applied.

class TVMScheduleBuilder {
souptc (Member) commented Jun 17, 2019:

class TVMScheduleBuilder

So what's the usage of this ScheduleBuilder? Iterate each op in the TVM graph and apply a schedule pass? I feel that in many cases the schedule passes we write are not this kind of per-op schedule. #Pending

liwchang (Contributor) commented Jun 17, 2019:

It is not really per-op scheduling. The context can hold a graph and the corresponding analysis, and the tensor can be used as a key to fetch a previously analyzed result and apply it to the schedule.

ke1337 (author) replied:

I guess your implementation won't use the default scheduler at all. This is more for sharing the logic between ARM/X86/etc.

Member replied:

Then should we move the IRBuilder and schedules that only work for ARM/x86/etc to a separate package?

Member replied:

If that's the case, why put tvm::Tensor as the key in the signature? The way you describe feels like a hacky workaround to use this interface for cross-op scheduling, and it makes the interface more confusing.

Contributor replied:

The design separates analysis from transformation: you apply the analysis, record where and what you want to transform, and then apply the transformation. tvm::Tensor is the finest unit in the low-level IR, and Ort::Node is the finest unit in the high-level IR; that is why we use these two as the default keys. You can use any other key in a customized CodeGenContext. After the tvm IR is built, it is just a graph of tvm::Tensor, and op boundaries are not that important.

Contributor replied:

> then should we move the IRBuilder and schedules that only work for ARM/x86/etc to a separate package?

I had the same concern and discussed it with Ke a few weeks ago.


tvm::Expr halideir_log(const tvm::Expr& x_full);

tvm::Expr fast_log(const tvm::Expr& x);
souptc (Member) commented Jun 17, 2019:

I believe those are also CPU-only, right? #Resolved

liwchang (Contributor) commented Jun 17, 2019:

They are for non-hw-accelerated targets only; it is a polynomial special-function approximation. If there are hw-accelerated instructions (an FPGA, for example, usually builds its own lookup table for special functions), then it is not applicable. Movable. #Resolved

ke1337 (author) replied:

Moved out



ke1337 (author) commented Jun 26, 2019:

Done.



@ke1337 ke1337 force-pushed the gt/Nuphar_non_EP branch from 306d137 to c50ab1f Compare June 26, 2019 02:37
ke1337 (author) commented Jun 26, 2019:

Please bear with the force push after the rebase; it resolves conflicts between this branch and master. #Resolved

@ke1337 ke1337 force-pushed the gt/Nuphar_non_EP branch from c50ab1f to 026ac75 Compare June 26, 2019 03:58
souptc (Member) left a review comment:

:shipit:

souptc (Member) commented Jul 3, 2019:

Thanks Ke, as we synced offline, for other issues we could improve it later.

ke1337 (author) commented Jul 3, 2019:

> Thanks Ke, as we synced offline, for other issues we could improve it later.

Thanks for the review!

@ke1337 ke1337 merged commit 0d204f3 into master Jul 3, 2019