[Virtual Machine] Implementation of 'set_output_zero_copy' #11358
Conversation
mbs-octoml
left a comment
Thanks for working on this Valery.
src/runtime/vm/vm.cc
Outdated
      WriteRegister(instr.dst, from_obj);
      pc_++;
      goto main_loop;
      break;
nit: This weird control flow may be the result of someone trying to improve I$ behavior so we should check. I recall there's a known issue with the VM being slightly slower than the GraphExecutor due to cache issues, but I suspect that's more likely to be due to D$ effects with the extra indirections around registers or something.
Thanks! I have always thought that a simple loop is better than the jump approach; it can be checked further. Right now I am getting acquainted with the feature design.
src/runtime/vm/vm.cc
Outdated
      // TODO(vvchernov): can it be endless loop?
      do {
        instr = code_[op_ind++];
      } while (instr.op == Opcode::Ret);
!= Opcode::Ret
Thanks very much, that was a nasty misprint. About the endless loop: we do not know the size of the code_ array, and on the one hand it might not contain a return op at all; on the other hand, that would also lead to an endless loop in the RunLoop method. By default I will assume that somebody has taken care of it.
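The corrected scan can be modeled with a small Python toy (opcode names and the list-of-strings encoding are illustrative only, not TVM's actual Instruction structs):

```python
def find_ret_index(code, start=0):
    """Return the index of the first 'Ret' op at or after `start`.

    Mirrors `do { instr = code_[op_ind++]; } while (instr.op != Opcode::Ret);`
    after the sign fix: the loop keeps advancing while the op is NOT Ret.
    Raises IndexError if no Ret exists, which is the "endless loop" concern
    discussed above (here it surfaces as running off the end of the list).
    """
    op_ind = start
    while True:
        instr = code[op_ind]
        op_ind += 1
        if instr == "Ret":
            return op_ind - 1  # index of the Ret instruction itself

print(find_ret_index(["AllocTensor", "InvokePacked", "Ret"]))  # -> 2
```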
src/runtime/vm/vm.cc
Outdated
  Index frame_start = frames_.size();
  while (true) {
  main_loop:
    Index res_reg_index = GetResultRegisterIndex();
For tuple results we'd need to redirect the allocs for the tuple indexes, which may be scattered quite widely within the code. So that suggests the compiler should record metadata for all that.
But I'm wondering if it would be better to bite-the-bullet and switch the VM to DPS. After that only Invoke-without-outputs would be the special case, requiring inspection of metadata to alloc the output tensors, make the call, and optionally construct the tuple result. I know that's a much bigger change, and I guess we could make that change after this PR given the invoke APIs will be the same.
Is there a high-pressure customer use case which justifies that two-step approach?
As we can see from the RunLoop design (and my tests with multiple outputs showed the same), only one ObjectRef (e.g. an ADT) is returned as the result, which means we need only one index. Nevertheless, tests are in progress to check for possible issues. I'm not sure the current implementation is ready to merge.
@tmoreau89, what do you think? Are there any deadlines for this feature?
Hello @mbs-octoml! I've updated my code for tuple results. I get the register indices from the AllocADT instruction, and then the same action is performed as for AllocTensor.
Hello @tkonolige! Could you review this PR instead of Mark?
tkonolige
left a comment
Can you add tests, including ones that exercise this functionality over RPC?
python/tvm/runtime/vm.py
Outdated
        self.set_input(func_name, *args, **kwargs)
        self._invoke_stateful(func_name)

    def invoke_with_outputs(self, func_name, *args):
Can you make this function take input arguments too instead of requiring set_input?
I can do it in the following way: invoke_with_outputs(self, func_name, *args, **kwargs), where args are the output tensors and kwargs are the input tensors. It is a complicated task to separate input and output tensors within args alone. Is this scenario good?
The other option would be invoke_with_outputs(self, func_name, input_args, output_args). Choose whichever you think is best.
Thanks for the tip, I've updated it.
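The agreed calling convention could look roughly like the sketch below. The stub class and its internals are hypothetical; the real method lives on TVM's VirtualMachine wrapper and delegates to the stateful set-input/set-outputs machinery:

```python
class VirtualMachineSketch:
    """Stub illustrating the invoke_with_outputs(func_name, input_args,
    output_args) signature discussed above; NOT the real TVM class."""

    def __init__(self):
        self._inputs = {}
        self._outputs = {}

    def invoke_with_outputs(self, func_name, input_args, output_args):
        # Separate dicts/lists avoid the ambiguity of packing inputs and
        # outputs into one *args tuple, which was the earlier proposal.
        self._inputs[func_name] = dict(input_args)
        self._outputs[func_name] = list(output_args)
        # A real implementation would now run the function and write the
        # results into the pre-allocated output buffers.
        return func_name, len(output_args)

vm = VirtualMachineSketch()
res = vm.invoke_with_outputs("main", {"data": [1, 2, 3]}, [bytearray(12)])
print(res)  # -> ('main', 1)
```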
include/tvm/runtime/vm/vm.h
Outdated
   * tensors pre-allocated outside. Scenario is when `set_output` is used
   * \param func_name The function's name.
   */
  void CollectOutputTensorRegIndices(const std::string& func_name);
I'd prefer this to return the tensor indices instead of setting some internal state. There's now a lot of stateful functions and it is unclear how they all interact.
I've refactored it, but I think the main cause of the problem remains: the role of the func name. I don't know of anybody who uses anything except "main", and I did not. I do not understand the purpose of the func name here; the main point is that code_ does not depend on the func name, while RunLoop depends only on code_. Thus I assume that using two func names on the front end leads to an error or unexpected behaviour. It looks like the VM design is still raw in this respect.
src/runtime/vm/vm.cc
Outdated
  while (true) {
  main_loop:
    bool iterate = true;
    while (iterate) {
Remove the change from goto to loop. It doesn't seem necessary for this PR.
OK, I've reverted this change. But who will do it? I do not think goto is a good solution here (or anywhere).
If you want to change it you can submit a separate PR to do so. I don't have enough knowledge to say whether there was a good reason behind it.
src/runtime/vm/vm.cc
Outdated
      auto reshaped_tensor = ex_arr.CreateView(ref_shape, ex_dtype);
      WriteRegister(instr.dst, reshaped_tensor);
    } else {
      LOG_ERROR << "Internal and external output tensor shapes are mismatched";
Suggested change:
-      LOG_ERROR << "Internal and external output tensor shapes are mismatched";
+      LOG(FATAL) << "Internal and external output tensor shapes are mismatched";
Thanks! Fixed.
src/runtime/vm/vm.cc
Outdated
    } else if (op_code == Opcode::ReshapeTensor) {
      reg_indices.push_back(preres_instr.reshape_tensor.tensor);
    } else {
      LOG(WARNING) << "Operation " << size_t(op_code) << " is not supported for set_outputs method";
This should be a fatal.
Done. I had considered a scenario where the results cannot fit into the external tensor and the default path is used, but the client should know that something is wrong.
   * \param name The function name
   * \param args outputs to the function.
   */
  void SetOutputs(std::string name, TVMArgs args);
Clarify that this only applies to the next single Invoke call.
I've extended the description.
  ICHECK(outputs_.count(func_name))
      << "Outputs have not been set for function " << func_name;
  *rv = Invoke(func, input_args, outputs_[func_name]);
  set_outputs_enabled_[func_name] = false;
I think you need to clear outputs_ here so you don't hold an unnecessary reference.
It depends on the usage scenario for output tensors allocated outside. I see two options: 1. In-place scenario: memory for the output tensors is allocated once, and each new inference (invoke) writes its result into that memory. 2. New inputs/outputs are inserted for each new inference. Currently the second scenario is implemented, and in that case outputs_ can be cleared; but in the first scenario we would need to keep outputs_. What do you think: should I support the first scenario instead of the second, both of them, or keep the current one?
The register file holds a reference to the outputs too. So if we clear outputs_, they will be released only when the register file is reset.
I have done it, since the second scenario is the valid one in this case; note that the first one is not. On each invoke the register file is refilled, so we would need to keep outputs_ if we wanted to save the result into the same memory. Perhaps the API for the first scenario should be thought out more deeply and implemented separately. For now only the second scenario works.
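The two lifetimes discussed above can be contrasted with a toy sketch (class and attribute names are hypothetical; the real VM stores NDArrays keyed by function name):

```python
class OutputsStore:
    """Toy contrast of the two output-lifetime scenarios discussed above."""

    def __init__(self, persistent):
        self.persistent = persistent  # True = scenario 1, False = scenario 2
        self.outputs = {}

    def set_outputs(self, func, bufs):
        self.outputs[func] = bufs

    def invoke(self, func):
        bufs = self.outputs[func]   # a real VM writes results into bufs here
        if not self.persistent:
            del self.outputs[func]  # scenario 2: drop the reference per call
        return bufs

# Scenario 2 (what the PR currently implements): set_outputs must be
# called again before every invoke, and no stale reference is kept.
vm = OutputsStore(persistent=False)
vm.set_outputs("main", [bytearray(4)])
vm.invoke("main")
print("main" in vm.outputs)  # -> False
```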
tkonolige
left a comment
Thanks for the changes @vvchernov. Can you add two more things:
- tests
- document how long the output tensors will live and how often set_outputs needs to be called (every invocation, I think).
Hello @tkonolige! Sorry, I was on vacation and development was paused for a while. An additional question: where should the documentation go? In the descriptions of the python and native methods, or in some tutorial?
I'd put the documentation in the description for the methods. I think that's where people will look if using them.
tkonolige
left a comment
Thanks for all the hard work @vvchernov!
There is a python API function 'set_output' which saves external outputs in the VM's outputs_ field (a map) for the specified func name; it works like the 'set_input' method. During 'invoke', the outputs_ are placed in the register file. For this, the register indices of the output tensors are found from the code_ field. In tests with different models I observed that the AllocTensor and AllocADT ops are used for result tensors, so let's consider these two cases: the result index is the destination of an AllocTensor op or of an AllocADT op. In the first case, the outside output tensor is used instead of constructing a new NDArray. In the second, the fields of AllocADT are analyzed and the register indices are extracted. During tests I also observed that the ReshapeTensor operation is rarely used as the final one (SqueezeNet-v1.0 and DUC); a mechanism for replacement by external output tensors was implemented for this op as well.
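The index-collection logic described above can be sketched as a Python toy. The instruction records here are simplified dicts with hypothetical keys, not TVM's real Instruction structs, and the "instruction just before Ret produces the result" assumption is a simplification of what the real code does:

```python
def collect_output_reg_indices(code):
    """Toy analogue of CollectOutputTensorRegIndices: find the instruction
    that produces the result register and extract the registers that must
    be redirected to externally allocated output tensors."""
    # Simplifying assumption: the op right before Ret produces the result.
    ret_pos = next(i for i, ins in enumerate(code) if ins["op"] == "Ret")
    preres = code[ret_pos - 1]
    if preres["op"] == "AllocTensor":
        return [preres["dst"]]          # single tensor result
    if preres["op"] == "AllocADT":
        return list(preres["fields"])   # tuple result: one reg per field
    if preres["op"] == "ReshapeTensor":
        return [preres["tensor"]]       # reshaped view of a tensor
    raise RuntimeError(
        "Operation %s is not supported for set_outputs" % preres["op"])

code = [
    {"op": "AllocADT", "dst": 5, "fields": [2, 3]},
    {"op": "Ret", "result": 5},
]
print(collect_output_reg_indices(code))  # -> [2, 3]
```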
This is a draft implementation of the 'set_output_zero_copy' method on the VirtualMachine side.
Brief description of the approach.
Notes: 1. I'm not sure that it works with many frames. In practice it looks like we need code_, not a frame, and the number of frames does not change the ops stack (code_). Another thing I observed is that code_ does not depend on func_name; maybe it should. It is not thread-safe just now.
2. It was implemented for CPU; I plan to check GPU specifics.
3. It seems that the tensor(s) are allocated over storage with prepared memory. This means that the skipped AllocTensor and AllocADT can keep memory in RAM, which is not good, but it also raises questions about usage scenarios and VM flexibility.
4. CI tests are possibly needed to cover some base scenarios.
Hello @altanh and @mbs-octoml! Could you see the draft?