
Conversation

@vvchernov
Contributor

@vvchernov vvchernov commented May 18, 2022

This is a draft implementation of the 'set_output_zero_copy' method on the VirtualMachine side.

A brief description of the approach:

  1. There is a python API function 'set_output' which saves external outputs in the VM's outputs_ field (a map) for the specified function name. It works like the 'set_input' method.
  2. During 'invoke' the outputs_ are written into the register_file. For this, the register indices of the output tensors are found from the code_ field. In tests with different models I observed that AllocTensor and AllocADT ops are used for result tensors, so let's consider these two cases: the result index is the destination of an AllocTensor op or of an AllocADT op. In the first case, instead of constructing a new NDArray, the external output tensor is used. In the second case, the fields of AllocADT are analyzed and the register indices are extracted. During tests I also observed that the ReshapeTensor operation is rarely used as the final one (SqueezeNet-v1.0 and DUC); the replacement by external output tensors was implemented for this op as well. A rough usage sketch follows this list.
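For concreteness, here is a rough Python sketch of the intended frontend flow. It follows the standard Relay VM build path; the output-registration method name (set_outputs) and its pairing with the stateful invoke path are assumptions based on the description above, not the final API.

import numpy as np
import tvm
from tvm import relay
from tvm.runtime import vm as vm_rt

# Standard flow: build a tiny Relay module and compile it for the VM.
x = relay.var("x", shape=(1, 8), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))
exe = relay.vm.compile(mod, target="llvm")
dev = tvm.cpu()
vm = vm_rt.VirtualMachine(exe, dev)

# Pre-allocate the output buffer once, outside the VM.
out = tvm.nd.empty((1, 8), "float32", dev)

# Assumed API: register the external output for "main" (mirroring set_input)
# so that invoke writes the result directly into `out` instead of allocating
# a fresh NDArray inside the VM.
vm.set_input("main", tvm.nd.array(np.random.rand(1, 8).astype("float32"), dev))
vm.set_outputs("main", out)     # assumed method name, see the API discussion below
vm.invoke_stateful("main")      # the result lands in `out` with no extra copy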

Notes:

  1. I'm not sure that it works with many frames. In practice it looks like we need code_, not a frame, and the number of frames does not change the op stack (code_). Another thing I observed is that code_ does not depend on func_name; maybe it should, and it is not thread-safe just now.
  2. It was implemented for CPU; I plan to check GPU specifics.
  3. It seems that the tensor(s) are allocated over storage with prepared memory. This means the skipped AllocTensor and AllocADT can still keep memory in RAM, which is not good, but it raises questions about usage scenarios and VM flexibility.
  4. CI tests are probably needed to cover some base scenarios.

Hello @altanh and @mbs-octoml! Could you take a look at the draft?

@vvchernov vvchernov force-pushed the vc/vm_set_output branch 3 times, most recently from 47eff9c to e6abfc0 Compare May 18, 2022 19:42
Contributor

@mbs-octoml mbs-octoml left a comment


Thanks for working on this, Valery.

WriteRegister(instr.dst, from_obj);
pc_++;
goto main_loop;
break;
Contributor

nit: This weird control flow may be the result of someone trying to improve I$ behavior so we should check. I recall there's a known issue with the VM being slightly slower than the GraphExecutor due to cache issues, but I suspect that's more likely to be due to D$ effects with the extra indirections around registers or something.

Contributor Author

Thanks! I generally think a simple loop is better than the jump approach; it can be checked later. Right now I'm focused on the feature design.

// TODO(vvchernov): can it be endless loop?
do {
instr = code_[op_ind++];
} while (instr.op == Opcode::Ret);
Contributor

!= Opcode::Ret

Contributor Author

Thanks very much, that was a nasty misprint. About the endless loop: we do not know the size of the code_ array, and on the one hand it potentially has no return op, while on the other hand that would also lead to an endless loop in the RunLoop method. By default I will assume that somebody has taken care of it. A toy sketch of a bounded scan follows.
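To make the endless-loop concern concrete, here is a small Python toy model (not the actual C++ RunLoop code; the Instr type is invented for illustration) of a bounded scan for the Ret instruction: the walk is limited by the length of the instruction stream, so a missing Ret fails loudly instead of spinning forever.

from dataclasses import dataclass

@dataclass
class Instr:
    op: str            # simplified stand-in for the opcode
    dst: int = -1      # destination register, if any
    result: int = -1   # register returned by "Ret"

def find_ret(code):
    """Return the index of the Ret instruction, bounded by len(code)."""
    for idx, instr in enumerate(code):
        if instr.op == "Ret":
            return idx
    raise RuntimeError("bytecode has no Ret instruction")  # guards the endless-loop case

code = [Instr("AllocTensor", dst=3), Instr("Ret", result=3)]
print(code[find_ret(code)].result)  # -> 3, the register holding the function result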

Index frame_start = frames_.size();
while (true) {
main_loop:
Index res_reg_index = GetResultRegisterIndex();
Contributor

For tuple results we'd need to redirect the allocs for the tuple indexes, which may be scattered quite widely within the code. So that suggests the compiler should record metadata for all that.

But I'm wondering if it would be better to bite the bullet and switch the VM to DPS (destination-passing style). After that, only Invoke-without-outputs would be the special case, requiring inspection of metadata to alloc the output tensors, make the call, and optionally construct the tuple result. I know that's a much bigger change, and I guess we could make that change after this PR, given the invoke APIs will be the same.

Is there a high pressure customer use case which justifies that two step approach?

Contributor Author

As we can see from the RunLoop design, and as my tests with multiple outputs showed, only one ObjectRef (e.g. an ADT) is returned as the result. It means we need only one index. Nevertheless, tests are in progress to check possible issues. I'm not sure the current implementation is ready to merge.
@tmoreau89 what do you think, are there any deadlines for this feature?

Contributor Author

Hello @mbs-octoml! I've updated my code for tuple results. I get the register indices from the AllocADT instruction, and then the same action is performed as for AllocTensor; a toy sketch of the collection step follows.
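A toy Python model of that collection step (not the TVM data structures; the Instr type and its field names are invented for illustration): find the instruction that produces the register returned by Ret, then take that register directly for AllocTensor, or fan out to the per-field registers for AllocADT.

from dataclasses import dataclass, field

@dataclass
class Instr:
    op: str
    dst: int = -1                                 # destination register
    result: int = -1                              # register returned by "Ret"
    fields: list = field(default_factory=list)    # field registers of "AllocADT"

def collect_output_registers(code):
    ret = next(i for i in code if i.op == "Ret")
    producer = next(i for i in reversed(code) if i.dst == ret.result)
    if producer.op == "AllocTensor":
        return [producer.dst]           # single tensor result
    if producer.op == "AllocADT":
        return list(producer.fields)    # one register per tuple element
    raise RuntimeError(f"unsupported result producer: {producer.op}")

code = [Instr("AllocTensor", dst=1), Instr("AllocTensor", dst=2),
        Instr("AllocADT", dst=3, fields=[1, 2]), Instr("Ret", result=3)]
print(collect_output_registers(code))  # -> [1, 2]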

@vvchernov vvchernov force-pushed the vc/vm_set_output branch 5 times, most recently from ab17265 to cde68a5 Compare June 6, 2022 18:58
@vvchernov vvchernov force-pushed the vc/vm_set_output branch 5 times, most recently from 18a98af to 2138efb Compare August 18, 2022 13:28
@vvchernov vvchernov force-pushed the vc/vm_set_output branch 5 times, most recently from d530f12 to 977e028 Compare August 29, 2022 09:10
@vvchernov vvchernov changed the title WIP: [Virtual Machine] Implementation of 'set_output_zero_copy' [Virtual Machine] Implementation of 'set_output_zero_copy' Aug 29, 2022
@vvchernov
Contributor Author

Hello @tkonolige! Could you review this PR instead of Mark?

Contributor

@tkonolige tkonolige left a comment


Can you add tests, including ones that exercise this functionality over RPC?

self.set_input(func_name, *args, **kwargs)
self._invoke_stateful(func_name)

def invoke_with_outputs(self, func_name, *args):
Contributor

Can you make this function take input arguments too, instead of requiring set_input?

Contributor Author

I can do it in the following way: invoke_with_outputs(self, func_name, *args, **kwargs), where args are the output tensors and kwargs are the input tensors. It would be a complicated task to split input and output tensors apart if both came through args. Is this scenario good?

Contributor

The other option would be invoke_with_outputs(self, func_name, input_args, output_args). Choose whichever you think is best.

Contributor Author

Thanks for the tip, I've updated it.
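A short usage sketch of the agreed shape, continuing the earlier example (vm, dev, np, and tvm as set up there); passing the inputs as a name-to-tensor dict and the exact keyword names are assumptions taken from this thread.

inp = {"x": tvm.nd.array(np.zeros((1, 8), dtype="float32"), dev)}
out = [tvm.nd.empty((1, 8), "float32", dev)]
vm.invoke_with_outputs("main", input_args=inp, output_args=out)
# out[0] now holds the result; no copy out of VM-internal storage is needed.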

* tensors pre-allocated outside. Scenario is when `set_output` is used
* \param func_name The function's name.
*/
void CollectOutputTensorRegIndices(const std::string& func_name);
Contributor

I'd prefer this to return the tensor indices instead of setting some internal state. There are now a lot of stateful functions and it is unclear how they all interact.

Contributor Author

I've refactored it, but I think the main source of the problem remains: the role of the function name. I don't know of anybody who uses anything except "main", and I didn't either. I don't know the idea behind the func name, and the main thing is that code_ does not depend on the func name, while RunLoop depends only on code_. Thus I can assume that using two func names on the front end leads to an error or unexpected behaviour. It looks like the VM design is still raw here.

Comment on lines 526 to 634
while (true) {
main_loop:
bool iterate = true;
while (iterate) {
Contributor

Remove the change from goto to loop. It doesn't seem necessary for this PR.

Contributor Author

Ok, I've reverted this change. But who will do it? I don't think that goto is a good solution here (or anywhere).

Contributor

If you want to change it you can submit a separate PR to do so. I don't have enough knowledge to say whether there was a good reason behind it.

auto reshaped_tensor = ex_arr.CreateView(ref_shape, ex_dtype);
WriteRegister(instr.dst, reshaped_tensor);
} else {
LOG_ERROR << "Internal and external output tensor shapes are mismatched";
Contributor

Suggested change
LOG_ERROR << "Internal and external output tensor shapes are mismatched";
LOG(FATAL) << "Internal and external output tensor shapes are mismatched";

Contributor Author

Thanks! Fixed.

} else if (op_code == Opcode::ReshapeTensor) {
reg_indices.push_back(preres_instr.reshape_tensor.tensor);
} else {
LOG(WARNING) << "Operation " << size_t(op_code) << " is not supported for set_outputs method";
Contributor

This should be a fatal.

Contributor Author

Done. I had considered a scenario where the results cannot fit into the external tensor and the default path is used, but the client should know that something is wrong.

* \param name The function name
* \param args outputs to the function.
*/
void SetOutputs(std::string name, TVMArgs args);
Contributor

Clarify that this only applies to the next single Invoke call.

Contributor Author

I've extended the description.

ICHECK(outputs_.count(func_name))
<< "Outputs have not been set for function " << func_name;
*rv = Invoke(func, input_args, outputs_[func_name]);
set_outputs_enabled_[func_name] = false;
Contributor

I think you need to clear outputs_ here so you don't hold an unnecessary reference.

Contributor Author

It depends on the usage scenario for output tensors allocated outside. I see two options: 1. In-place scenario: memory for the output tensors is allocated once and each new inference (invoke) writes its result into that memory. 2. New inputs/outputs are set for each new inference. Just now the second scenario is implemented, and in that case outputs_ can be cleared; but in the first scenario we would need to keep outputs_. What do you think: should I support the first scenario instead of the second, both, or keep the current one?

Contributor

The register file holds a reference to the outputs too. So if we clear outputs_, they will be released only when the register file is reset.

Contributor Author

I have done it, since the second scenario is valid in this case, but it should be noted that the first one is not: the register file is refilled on each invoke, so we would need to keep outputs_ if we want to save the result into the same memory. Perhaps the API for the first scenario should be thought out more deeply and implemented separately. Just now only the second scenario works; a short usage sketch follows.
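From the user side, the implemented (second) scenario could look like the sketch below, continuing the earlier example; the per-call re-attachment requirement is drawn from this thread, and process is a hypothetical downstream consumer. The key point is that the buffer itself can still be allocated only once.

out = [tvm.nd.empty((1, 8), "float32", dev)]    # allocated once, user-owned
for step in range(10):
    inp = {"x": tvm.nd.array(np.full((1, 8), step, dtype="float32"), dev)}
    # Outputs must be (re)attached on every invocation under the current scheme.
    vm.invoke_with_outputs("main", input_args=inp, output_args=out)
    process(out[0].numpy())    # hypothetical consumer of the in-place result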

@vvchernov vvchernov force-pushed the vc/vm_set_output branch 2 times, most recently from 2fe6f94 to 024dc21 Compare September 9, 2022 08:58
Contributor

@tkonolige tkonolige left a comment


Thanks for the changes @vvchernov. Can you add two more things:

  1. tests
  2. documentation of how long the output tensors will live and how often set_outputs needs to be called (every invocation, I think).

@vvchernov
Contributor Author

Hello @tkonolige! Sorry, I was on vacation and development paused for a moment. An additional question: where should the documentation go? In the descriptions of the python and native methods, or in some tutorial?

@tkonolige
Contributor

I'd put the documentation in the description for the methods. I think that's where people will look if using them.

Contributor

@tkonolige tkonolige left a comment


Thanks for all the hard work @vvchernov!

@tkonolige tkonolige merged commit 3e3d900 into apache:main Sep 29, 2022
@vvchernov vvchernov deleted the vc/vm_set_output branch October 4, 2022 07:58
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022