[Virtual Machine] Implementation of 'set_output_zero_copy' #11358
Conversation
mbs-octoml
left a comment
Thanks for working on this Valery.
src/runtime/vm/vm.cc
Outdated
      WriteRegister(instr.dst, from_obj);
      pc_++;
      goto main_loop;
      break;
nit: This weird control flow may be the result of someone trying to improve I$ behavior so we should check. I recall there's a known issue with the VM being slightly slower than the GraphExecutor due to cache issues, but I suspect that's more likely to be due to D$ effects with the extra indirections around registers or something.
Thanks! I have always thought that a simple loop is better than the jump approach; it can be checked further. Right now I am getting acquainted with the feature design.
src/runtime/vm/vm.cc
Outdated
      // TODO(vvchernov): can it be endless loop?
      do {
        instr = code_[op_ind++];
      } while (instr.op == Opcode::Ret);
!= Opcode::Ret
Thanks very much, that was a nasty misprint. About the endless loop: we do not know the size of the code_ array, and on the one hand it might not contain a return op at all; on the other hand, that would also lead to an endless loop in the RunLoop method. By default I will assume that somebody has taken care of it.
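The corrected scan can be modeled with a small Python toy (opcode names and the list-of-strings encoding are illustrative only, not TVM's actual Instruction structs):

```python
def find_ret_index(code, start=0):
    """Return the index of the first 'Ret' op at or after `start`.

    Mirrors `do { instr = code_[op_ind++]; } while (instr.op != Opcode::Ret);`
    after the sign fix: the loop keeps advancing while the op is NOT Ret.
    Raises IndexError if no Ret exists, which is the "endless loop" concern
    discussed above (here it surfaces as running off the end of the list).
    """
    op_ind = start
    while True:
        instr = code[op_ind]
        op_ind += 1
        if instr == "Ret":
            return op_ind - 1  # index of the Ret instruction itself

print(find_ret_index(["AllocTensor", "InvokePacked", "Ret"]))  # -> 2
```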
src/runtime/vm/vm.cc
Outdated
  Index frame_start = frames_.size();
  while (true) {
  main_loop:
    Index res_reg_index = GetResultRegisterIndex();
For tuple results we'd need to redirect the allocs for the tuple indexes, which may be scattered quite widely within the code. So that suggests the compiler should record metadata for all that.
But I'm wondering if it would be better to bite-the-bullet and switch the VM to DPS. After that only Invoke-without-outputs would be the special case, requiring inspection of metadata to alloc the output tensors, make the call, and optionally construct the tuple result. I know that's a much bigger change, and I guess we could make that change after this PR given the invoke APIs will be the same.
Is there a high-pressure customer use case which justifies that two-step approach?
As we can see from the RunLoop design (and my tests with multiple outputs showed the same), only one ObjectRef (e.g. an ADT) is returned as the result, which means we need only one index. Nevertheless, tests are in progress to check for possible issues. I'm not sure the current implementation is ready to merge.
@tmoreau89, what do you think? Are there any deadlines for this feature?
Hello @mbs-octoml! I've updated my code for tuple results. I get the register indices from the AllocADT instruction, and then the same action is performed as for AllocTensor.
Hello @tkonolige! Could you review this PR instead of Mark?
tkonolige
left a comment
Can you add tests, including ones that exercise this functionality over RPC?
python/tvm/runtime/vm.py
Outdated
        self.set_input(func_name, *args, **kwargs)
        self._invoke_stateful(func_name)

    def invoke_with_outputs(self, func_name, *args):
Can you make this function take input arguments too instead of requiring set_input?
I can do it in the following way: invoke_with_outputs(self, func_name, *args, **kwargs), where args are the output tensors and kwargs are the input tensors. It is a complicated task to separate input and output tensors within args alone. Is this scenario good?
The other option would be invoke_with_outputs(self, func_name, input_args, output_args). Choose whichever you think is best.
Thanks for the tip, I've updated it.
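The agreed calling convention could look roughly like the sketch below. The stub class and its internals are hypothetical; the real method lives on TVM's VirtualMachine wrapper and delegates to the stateful set-input/set-outputs machinery:

```python
class VirtualMachineSketch:
    """Stub illustrating the invoke_with_outputs(func_name, input_args,
    output_args) signature discussed above; NOT the real TVM class."""

    def __init__(self):
        self._inputs = {}
        self._outputs = {}

    def invoke_with_outputs(self, func_name, input_args, output_args):
        # Separate dicts/lists avoid the ambiguity of packing inputs and
        # outputs into one *args tuple, which was the earlier proposal.
        self._inputs[func_name] = dict(input_args)
        self._outputs[func_name] = list(output_args)
        # A real implementation would now run the function and write the
        # results into the pre-allocated output buffers.
        return func_name, len(output_args)

vm = VirtualMachineSketch()
res = vm.invoke_with_outputs("main", {"data": [1, 2, 3]}, [bytearray(12)])
print(res)  # -> ('main', 1)
```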
include/tvm/runtime/vm/vm.h
Outdated
   * tensors pre-allocated outside. Scenario is when `set_output` is used
   * \param func_name The function's name.
   */
  void CollectOutputTensorRegIndices(const std::string& func_name);
I'd prefer this to return the tensor indices instead of setting some internal state. There's now a lot of stateful functions and it is unclear how they all interact.
I've refactored it, but I think the main cause of the problem remains: the role of the func name. I don't know of anybody who uses anything except "main", and I did not. I do not understand the purpose of the func name here; the main point is that code_ does not depend on the func name, while RunLoop depends only on code_. Thus I assume that using two func names on the front end leads to an error or unexpected behaviour. It looks like the VM design is still raw in this respect.
src/runtime/vm/vm.cc
Outdated
  while (true) {
  main_loop:
    bool iterate = true;
    while (iterate) {
Remove the change from goto to loop. It doesn't seem necessary for this PR.
OK, I've reverted this change. But who will do it? I do not think goto is a good solution here (or anywhere).
If you want to change it you can submit a separate PR to do so. I don't have enough knowledge to say whether there was a good reason behind it.
src/runtime/vm/vm.cc
Outdated
      auto reshaped_tensor = ex_arr.CreateView(ref_shape, ex_dtype);
      WriteRegister(instr.dst, reshaped_tensor);
    } else {
      LOG_ERROR << "Internal and external output tensor shapes are mismatched";
Suggested change:
-      LOG_ERROR << "Internal and external output tensor shapes are mismatched";
+      LOG(FATAL) << "Internal and external output tensor shapes are mismatched";
Thanks! Fixed.
src/runtime/vm/vm.cc
Outdated
    } else if (op_code == Opcode::ReshapeTensor) {
      reg_indices.push_back(preres_instr.reshape_tensor.tensor);
    } else {
      LOG(WARNING) << "Operation " << size_t(op_code) << " is not supported for set_outputs method";
This should be a fatal.
Done. I had considered a scenario where the results cannot fit into the external tensor and the default path is used, but the client should know that something is wrong.
   * \param name The function name
   * \param args outputs to the function.
   */
  void SetOutputs(std::string name, TVMArgs args);
Clarify that this only applies to the next single Invoke call.
I've extended the description.
  ICHECK(outputs_.count(func_name))
      << "Outputs have not been set for function " << func_name;
  *rv = Invoke(func, input_args, outputs_[func_name]);
  set_outputs_enabled_[func_name] = false;
I think you need to clear outputs_ here so you don't hold an unnecessary reference.
It depends on the usage scenario for output tensors allocated outside. I see two options: 1. In-place scenario: memory for the output tensors is allocated once, and each new inference (invoke) writes its result into that memory. 2. New inputs/outputs are inserted for each new inference. Currently the second scenario is implemented, and in that case outputs_ can be cleared; but in the first scenario we would need to keep outputs_. What do you think: should I support the first scenario instead of the second, both of them, or keep the current one?
The register file holds a reference to the outputs too. So if we clear outputs_, they will be released only when the register file is reset.
I have done it, since the second scenario is the valid one in this case; note that the first one is not. On each invoke the register file is refilled, so we would need to keep outputs_ if we wanted to save the result into the same memory. Perhaps the API for the first scenario should be thought out more deeply and implemented separately. For now only the second scenario works.
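The two lifetimes discussed above can be contrasted with a toy sketch (class and attribute names are hypothetical; the real VM stores NDArrays keyed by function name):

```python
class OutputsStore:
    """Toy contrast of the two output-lifetime scenarios discussed above."""

    def __init__(self, persistent):
        self.persistent = persistent  # True = scenario 1, False = scenario 2
        self.outputs = {}

    def set_outputs(self, func, bufs):
        self.outputs[func] = bufs

    def invoke(self, func):
        bufs = self.outputs[func]   # a real VM writes results into bufs here
        if not self.persistent:
            del self.outputs[func]  # scenario 2: drop the reference per call
        return bufs

# Scenario 2 (what the PR currently implements): set_outputs must be
# called again before every invoke, and no stale reference is kept.
vm = OutputsStore(persistent=False)
vm.set_outputs("main", [bytearray(4)])
vm.invoke("main")
print("main" in vm.outputs)  # -> False
```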
tkonolige
left a comment
Thanks for the changes @vvchernov. Can you add two more things:
- tests
- document how long the output tensors will live and how often set_outputs needs to be called (every invocation, I think).
Hello @tkonolige! Sorry, I was on vacation and development was paused for a while. An additional question: where should the documentation go? In the descriptions of the python and native methods, or in some tutorial?
I'd put the documentation in the description for the methods. I think that's where people will look if using them.
tkonolige
left a comment
Thanks for all the hard work @vvchernov!
There is a python API function 'set_output' which saves external outputs in the VM's outputs_ field (a map) for the specified func name; it works like the 'set_input' method. During 'invoke', the outputs_ are placed in the register file. For this, the register indices of the output tensors are found from the code_ field. In tests with different models I observed that the AllocTensor and AllocADT ops are used for result tensors, so let's consider these two cases: the result index is the destination of an AllocTensor op or of an AllocADT op. In the first case, the outside output tensor is used instead of constructing a new NDArray. In the second, the fields of AllocADT are analyzed and the register indices are extracted. During tests I also observed that the ReshapeTensor operation is rarely used as the final one (SqueezeNet-v1.0 and DUC); a mechanism for replacement by external output tensors was implemented for this op as well.
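The index-collection logic described above can be sketched as a Python toy. The instruction records here are simplified dicts with hypothetical keys, not TVM's real Instruction structs, and the "instruction just before Ret produces the result" assumption is a simplification of what the real code does:

```python
def collect_output_reg_indices(code):
    """Toy analogue of CollectOutputTensorRegIndices: find the instruction
    that produces the result register and extract the registers that must
    be redirected to externally allocated output tensors."""
    # Simplifying assumption: the op right before Ret produces the result.
    ret_pos = next(i for i, ins in enumerate(code) if ins["op"] == "Ret")
    preres = code[ret_pos - 1]
    if preres["op"] == "AllocTensor":
        return [preres["dst"]]          # single tensor result
    if preres["op"] == "AllocADT":
        return list(preres["fields"])   # tuple result: one reg per field
    if preres["op"] == "ReshapeTensor":
        return [preres["tensor"]]       # reshaped view of a tensor
    raise RuntimeError(
        "Operation %s is not supported for set_outputs" % preres["op"])

code = [
    {"op": "AllocADT", "dst": 5, "fields": [2, 3]},
    {"op": "Ret", "result": 5},
]
print(collect_output_reg_indices(code))  # -> [2, 3]
```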
This is a draft implementation of the 'set_output_zero_copy' method on the VirtualMachine side.
Brief description of the approach.
Notes: 1. I'm not sure that it works with many frames. In practice it looks like we need code_, not a frame, and the number of frames does not change the ops stack (code_). Another thing I observed is that code_ does not depend on func_name; maybe it should. It is not thread-safe just now.
2. It was implemented for CPU; I plan to check GPU specifics.
3. It seems that the tensor(s) are allocated over storage with prepared memory. This means that the skipped AllocTensor and AllocADT can keep memory in RAM, which is not good, but it also raises questions about usage scenarios and VM flexibility.
4. CI tests are possibly needed to cover some base scenarios.
Hello @altanh and @mbs-octoml! Could you see the draft?