diff --git a/rfcs/0046-module-based-model-runtime-for-aot.md b/rfcs/0046-module-based-model-runtime-for-aot.md
new file mode 100644
index 00000000..ac717248
--- /dev/null
+++ b/rfcs/0046-module-based-model-runtime-for-aot.md
@@ -0,0 +1,363 @@
+# Module-based Model Runtime Interface for AOT
+
+- Feature Name: module_based_model_runtime_for_aot
+- Start Date: 2021-09-17
+- RFC PR: [apache/tvm-rfcs#0046](https://github.com/apache/tvm-rfcs/pull/0046)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# **Summary**
+
+This RFC describes a [Module-based Model Runtime
+interface](https://discuss.tvm.apache.org/t/discuss-module-based-model-runtime-interface/5025) for
+the [Ahead-of-Time Executor](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206), thereby
+enabling its use from the TVM C++ Runtime.
+
+# **Motivation**
+
+The microTVM project has made significant progress towards an Ahead-of-Time Executor for compiled
+Relay models. At the time of writing, it is possible to codegen a TIR function which executes
+Relay models that have known shapes, don't have graph-level control flow, and execute only on the
+CPU device. Right now, the C runtime is the only runtime environment which can interact with
+this generated code. However, significant interest exists in enabling the C++ runtime to use the
+Ahead-of-Time executor.
+
+# **Guide-level explanation**
+
+Users select the AOT executor at compile time through the traditional GraphExecutor compilation flow
+(e.g. `tvm.relay.build`) by including `--executor=aot` in the Target
+[1]. The return value of `tvm.relay.build` in this case is an `AotExecutorFactory` Module
+object.
+Users instantiate the AOT executor via `AotExecutorFactory` as they do with `GraphExecutor`:
+
+```python
+import tvm
+import tvm.relay
+
+ir_mod = tvm.parser.fromtext("""\
+  #[version = "0.0.5"]
+  def @main(%a : Tensor[(1, 2), uint8], %b : Tensor[(1, 2), uint8]) {
+      %0 = %a + %b;
+      %0
+  }"""
+  )
+
+with tvm.transform.PassContext(opt_level=3):
+  # Returns an AotExecutorFactory when the AOT executor is selected.
+  factory = tvm.relay.build(
+    ir_mod, "llvm --executor=aot", module_name="my_mod")
+
+aot_executor = factory["my_mod"](tvm.cpu(0))  # an AotExecutor
+```
+
+`AotExecutor` supports the traditional Module-Based Model Runtime Interface and can be used just as
+a user would normally use `GraphExecutor`:
+
+```python
+import numpy as np
+
+# Inputs must match the (1, 2) uint8 shape declared in @main.
+aot_executor.set_input("a", tvm.nd.array(np.array([[1, 2]], dtype="uint8")))
+aot_executor.set_input("b", tvm.nd.array(np.array([[3, 5]], dtype="uint8")))
+aot_executor.run()
+output = aot_executor.get_output(0)
+assert (output.asnumpy() == np.array([[4, 7]], dtype="uint8")).all()
+```
+
+[1] NOTE: The target string is not the final place this customization should be made. However, it
+is where runtime-related options have been placed so far. A separate RFC will split the Target
+string into Target options (which affect tuning) and runtime options.
+
+# **Reference-level explanation**
+
+Already committed to TVM is the AotExecutorCodegen. This module produces a TIR top-level function
+which invokes the Relay operators (implemented in TIR) in a correct order. An example is given
+below:
+
+```
+PrimFunc([input1, input2, output]) attrs={"global_symbol": "tvmgen_my_mod_run_model", "runner_function": (bool)1} {
+  // attr [(nullptr)] device_id = 0
+  // attr [(nullptr)] device_type = 1
+  tir.tvm_call_packed("tvmgen_my_mod_fused_add", input1, input2, output)
+}
+```
+
+The AotExecutor is a runtime wrapper component around this function that needs to accomplish the
+following to meet the Module-based Model Runtime Interface:
+
+1. Allocate input and output tensors as defined in the `run_model` function using the correct Device
+   API.
+2. Provide a mapping from Relay parameter name to positional argument.
+3. Invoke the generated TIR function and provide profiling.
+
+In the future, AOT will support heterogeneous execution e.g. allocating tensors and driving inference
+on `DLDevice` other than `kDLCPU`. Note that to align this code generator with the sensitive
+environment present on a bare-metal microcontroller, the TIR top-level function intentionally
+presumes that the input and output tensors already live on the `DLDevice`. This allows the user to
+decide whether the AotExecutor generic runtime component will be used to fill input tensors or
+whether they prefer to handle this in their application (or e.g. through background DMA).
+
+### Compiler ↔ Runtime Metadata
+
+In order to implement (1) and (2) above, additional metadata about the `run_model` function needs to
+be communicated from Compiler to Runtime:
+
+- The mapping between Relay parameter name and TIR argument position
+- The number of inputs and outputs
+- The type of each parameter
+- Information sufficient to choose a Device API to allocate memory for that data.
+
+At present, Metadata is passed from Compiler to Runtime in several different ways:
+
+1. Constant DLTensor can be bundled with code and supplied to `runtime::Module` via
+   `runtime::MetadataModule`
+2. Many non-DSO-exportable backends (`cuda`, `hexagon`, `metal`, `opencl`, `sdaccel`, `rocm`,
+   `vulkan`) have adopted the convention of including a
+   [`runtime::FunctionInfo`](https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L106)
+   (NOTE: distinct from `tvm::relay::transform::FunctionInfo`) in their serialization:
+
+   ```c++
+   /*! \brief function information needed by device */
+   struct FunctionInfo {
+     std::string name;
+     std::vector<DLDataType> arg_types;
+     std::vector<std::string> launch_param_tags;
+   };
+   ```
+
+3. AotExecutorCodegen and GraphExecutorCodegen have adopted the practice of producing the
+   graph-level
+   [`tvm::relay::backend::ExecutorCodegenMetadata`](https://github.com/apache/tvm/blob/c3ace209253507dcb109c12ab8b82575fc668862/src/relay/backend/utils.h#L89):
+
+   ```c++
+   /*!
+    * \brief Structure that can be optionally used by the executor codegen
+    */
+   class MetadataNode : public Object {
+    public:
+     /*! \brief input information for the main function */
+     Array<tir::Var> inputs;
+     /*! \brief number of outputs of the main function */
+     int num_outputs = 1;
+     /*! \brief the executor to be used to run the model */
+     String executor = kTvmExecutorGraph;
+
+     String mod_name = "";
+   };
+   ```
+
+4. The recent AOTExecutor implementation has created `tvm::relay::transform::FunctionInfo` which
+   communicates statistics about memory usage and I/O operation for each TIR operator and aggregate
+   statistics for the top-level AOT function:
+
+   ```c++
+   struct FunctionInfoNode : public Object {
+     Map<Target, Integer> workspace_sizes;
+     Map<Target, Integer> io_sizes;
+     Map<Target, Integer> constant_sizes;
+     Map<Target, tir::PrimFunc> tir_primfuncs;
+     Map<Target, Function> relay_primfuncs;
+   };
+   ```
+
+
+Some duplication of information is already present. Likely this is due in part to the existing
+middle-end compiler design, in which a separate `IRModule` is produced for each backend. This means
+that any metadata which requires whole-program analysis must be computed by an upstream TIR pass and
+stored on the function whose code-generator needs it, rather than centrally.
+
+Another factor may be that, since each `runtime::Module` is responsible for its own serialization
+and passing `tvm::Node` across `PackedFunc` requires a cast, the lack of a centralized facility for
+`runtime::Module`s to obtain module-level Metadata has led backend authors to roll their own. This
+pattern means that it's very difficult to assess the full scope of metadata handed to the runtime,
+particularly across all backends.
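To make the four metadata requirements listed at the start of this subsection concrete, here is a minimal plain-Python sketch of one per-parameter record from which a runtime could derive both the name-to-position mapping and the input/output counts. `ParamRecord`, `name_to_position`, and `count_io` are hypothetical names invented for illustration and are not TVM APIs; the device type code 1 is assumed to mean `kDLCPU`, matching `device_type = 1` in the TIR listing above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ParamRecord:
    """Hypothetical per-parameter metadata record; not a TVM type."""

    relay_name: str          # name used in the Relay program
    position: int            # index in the TIR run_model argument list
    shape: Tuple[int, ...]   # static shape of the tensor
    dtype: str               # element type, e.g. "uint8"
    device_type: int         # DLPack device type code (1 assumed to be kDLCPU)
    is_output: bool = False  # True for tensors written by run_model


def name_to_position(params: List[ParamRecord]) -> Dict[str, int]:
    """Build the Relay-name -> TIR-argument-position lookup for inputs."""
    return {p.relay_name: p.position for p in params if not p.is_output}


def count_io(params: List[ParamRecord]) -> Tuple[int, int]:
    """Return (num_inputs, num_outputs) derived from the same records."""
    num_outputs = sum(1 for p in params if p.is_output)
    return len(params) - num_outputs, num_outputs


# The two-input add model from the guide-level explanation.
params = [
    ParamRecord("a", 0, (1, 2), "uint8", 1),
    ParamRecord("b", 1, (1, 2), "uint8", 1),
    ParamRecord("output", 2, (1, 2), "uint8", 1, is_output=True),
]
lookup = name_to_position(params)
num_inputs, num_outputs = count_io(params)
```

Keeping all four pieces of information in one record means the mapping and the counts cannot drift apart, which is the kind of duplication the surrounding text is concerned with.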
+
+This RFC argues for creating a centralized `tvm::runtime::metadata::Metadata` struct which contains
+all Metadata consumed at runtime. Unifying runtime Metadata allows us to reduce the amount of
+serialization logic and eliminate duplication of metadata. The current compiler design stores
+centrally-produced Metadata in a side channel, but this could be improved in future RFCs e.g. should
+we move away from splitting IRModules per backend.
+
+Concretely, this RFC proposes restructuring the way we export Metadata through the following steps:
+
+1. Rename `runtime::MetadataModule` to `runtime::ConstLoaderModule` to disambiguate the two and make
+   its purpose clearer.
+2. Expand the function metadata in the existing `relay::backend::ExecutorCodegenMetadata` to parity with
+   `runtime::FunctionInfo`, plus include the `*_sizes` fields from `tvm::relay::transform::FunctionInfoNode` and
+   the required `shape` and `dtype` information from the beginning of this section.
+3. Introduce `ModelMetadataModule` to contain this information for use with the C++ runtime.
+
+   ```c++
+   class ModelMetadataModule {
+     virtual PackedFunc GetFunction(const std::string& name, const ObjectPtr<Object>& sptr_to_self) {
+       if (name == "get_model_metadata") {
+         // Capture sptr_to_self to keep the module alive while the PackedFunc exists.
+         return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) {
+           *rv = ModelMetadata(metadata_);
+         });
+       } else {
+         return PackedFunc();
+       }
+     }
+
+     const struct ModelMetadata* metadata_;
+   };
+   ```
+
+4. Introduce an optional implementation for the C runtime.
+5. Export `runtime::Metadata` to Model Library Format.
+
+The new proposed definition of `runtime::Metadata` is as follows. NOTE that this is a C definition
+because it will be made available to both the C and C++ runtimes. A C++ wrapper will be written.
+
+```c
+struct ParameterInfo {
+  const char* relay_name_hint;
+  const char* tir_name_hint;
+  int64_t* shape;
+  int64_t ndim;
+  DLDataType dtype;
+  TargetDevice target_device;  // NOTE: future addition; not covered in this RFC.
+};
+
+struct FunctionInfo {
+  const char* function_name;
+  struct ParameterInfo* params;
+  int num_inputs;
+  int num_outputs;
+  int64_t workspace_size_bytes;
+  int64_t io_size_bytes;
+  int64_t constant_size_bytes;
+};
+
+typedef struct Metadata {
+  int version;
+  struct FunctionInfo* functions;
+  const char* module_name;
+} Metadata;
+```
+
+### Internal workings of AotExecutor (`--runtime=c++ --interface-api=packed`)
+
+Given the above, we can now sketch out the way AotExecutor should behave (for C++ runtime).
+
+Module initialization will:
+
+1. Load the `ModelMetadata` using the `get_model_metadata` PackedFunc.
+2. Allocate space for the parameters to `tvmgen_<module_name>_run_model`.
+3. Look up and load any linked parameters using the `--link-params` mechanism.
+
+- `set_input`, `get_input`, `get_output` all work as they do in `GraphExecutor`.
+- `run` assembles `TVMArgs` containing inputs + outputs and invokes `tvmgen_<module_name>_run_model`.
+- `time_evaluator` is implemented in the same way as it is in `GraphExecutor`. Timing `run_model` is
+  done using the CPU timer.
+
+### Internal workings of AotExecutor (`--runtime=c --interface-api=packed`)
+
+The C runtime version works in a very similar way with C accessor functions for the `ModelMetadata`.
+
+### No AotExecutor implementation planned (`--runtime=c --interface-api=c`)
+
+When `--interface-api=c` is present in the Target string, the `run_model` function no longer accepts
+the PackedFunc interface and instead accepts `arg_values` directly as positional args:
+
+```c
+TVM_DLL int32_t tvmgen_default_run_model(void* arg0, void* arg1, void* arg2) {
+  void* input = arg0;
+  void* input1 = arg1;
+  void* output = arg2;
+  (void)tvmgen_default_fused_multiply(input, input1, output);
+  return 0;
+}
+```
+
+Additional work is underway to wrap this in a firmware-friendly interface. A core design goal of
+this interface is to offload all memory management tasks to the calling code to facilitate
+integration with bare-metal embedded devices.
+
+Therefore, it would go against the goals of the C interface to introduce a generic runtime wrapper
+compatible with the PackedFunc calling convention. It may be possible to do so in the future, but
+such an implementation should first be motivated by rationale more closely tied to the embedded
+runtime setting.
+
+### Operator Calling Convention
+
+TVM uses 3 internal calling conventions:
+
+1. `call_packed` - the traditional calling convention used in the C++ runtime
+2. `call_cpacked` - similar to `call_packed`, but TVM presumes a symbol is linked into the binary
+   containing that function name (i.e. `TVMBackendGetFuncFromEnv` is not used to look up the
+   PackedFunc)
+3. `unpacked` - used with microTVM to avoid overhead of PackedFunc calls in statically-linked
+   binaries. See [AOT optimisations for Embedded Targets
+   RFC](https://discuss.tvm.apache.org/t/rfc-utvm-aot-optimisations-for-embedded-targets/9849).
+
+The AOT `run_func` can use a different calling convention externally (e.g. `--interface-api`) than
+that used internally with Implemented Operators (`--unpacked-args`). However, there are some
+circumstances under which not all choices can be used:
+
+- When targeting the C++ runtime: `call_packed` must be used when non-DSO-exportable modules exist;
+  otherwise `call_cpacked` may be used. `unpacked` may not be used with AOT Executor as the
+  interface has not settled.
+- When targeting the C runtime: any calling convention may be selected for either the interface API
+  or the operator calling convention. However, when using `--interface-api=c` (i.e. the `unpacked`
+  `run_func` calling convention), you must also use the `unpacked` calling convention with
+  Implemented Operators.
+
+# **Drawbacks**
+
+Why should we *not* do this?
+
+- This requires quite a bit of rework of the Metadata-passing mechanism, with potential for breakage.
+- It also introduces yet another Executor to the runtime to maintain.
+- It may introduce additional constraints on the `` implementation, which
+  may make it more difficult to make progress on microTVM.
+
+# **Rationale and alternatives**
+
+- Why is this design the best in the space of possible designs?
+- What other designs have been considered and what is the rationale for not choosing them?
+- What is the impact of not doing this?
+
+This RFC doesn't address the question of "why add an AOT executor?" The RFC which added it in the
+first place is a better location to look for rationale to motivate that. In general, not following
+through with this RFC would relegate the AOT executor to a C-runtime-only component. There is
+significant interest in AOT from C++ runtime users, and maintaining compatibility with both
+increases the chances that AOT executor will support all TVM runtime features.
+
+The controversial pieces of this RFC are addressed as follows:
+
+### Should we maintain a unified approach to code-generating the AOT executor?
+
+An alternative approach could introduce an additional e.g. `aot_cpp_executor_codegen.cc` and create
+a third pathway (in the Graph/AOT build flow). Doing this allows us to implement runtime-specific
+compiler primitives, which may simplify both pipelines. However, soon those pipelines will grow more
+complicated as features are added to leverage AOT, such as Unified Static Memory Planning. The
+burden of double-maintenance of those features outweighs the advantage of a simplified
+implementation. A unified approach also makes it easier for newcomers to understand the compiler.
+
+### Should we attempt to unify the Metadata?
+
+Metadata could be left in the scattered form it is now. It may be that the implementation of this
+RFC prioritizes expansion of `ModelMetadata` over propagating it to the various non-DSO-exportable
+`runtime::Module`. Ultimately though, maintaining separate function-level metadata adds confusion
+and code bloat. It also makes it harder to reason about the compiler as a whole.
+For these reasons, this RFC advocates for centralizing the Metadata.
+
+# **Prior art**
+
+There is no known prior art of a C++-runtime-compatible AOT implementation.
+
+# **Unresolved questions**
+
+- Who will we break if we unify Model metadata?
+- Will this play nicely with the VM compilation flow when it is unified?
+- How will TargetDevice come into play here?
+
+# **Future possibilities**
+
+Not covered in this RFC, but particularly useful with the C++ runtime, is heterogeneous execution. In
+the present PoC, AotExecutor will CHECK-fail if a non-CPU device is given. A future implementation
+will annotate the parameters with one of:
+
+- A `device_type`: in which case mapping from `device_type` to `tvm::Device` will be done in the
+  same way as the `GraphExecutor`
+- A `target_device`: in which case a new mapping will be defined
+
+Aside from that, the larger unresolved bit which makes it difficult to add heterogeneous execution is:
+
+- How should AOT codegen invoke the Device API?
+
+Before this question can be answered, some progress needs to be made on the [C device
+API](https://discuss.tvm.apache.org/t/pre-rfc-c-device-api/10874) and we need to define TIR
+bindings.
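
As a rough illustration of the wrapper behavior this RFC describes (name lookup in `set_input`, argument assembly in `run`, and rejecting non-CPU devices as the PoC does), the following plain-Python toy models the control flow. It is a sketch under stated assumptions, not TVM's `AotExecutor`: `ToyAotExecutor`, `fused_add_run_model`, and `K_DL_CPU` are hypothetical names, Python lists stand in for tensors, and a `ValueError` stands in for the CHECK failure.

```python
from typing import Callable, List, Sequence

K_DL_CPU = 1  # assumed to mirror DLPack's kDLCPU device type code


class ToyAotExecutor:
    """Plain-Python stand-in for the executor wrapper; not TVM's AotExecutor."""

    def __init__(self, run_model: Callable[..., None], input_names: Sequence[str],
                 num_outputs: int, num_elements: int, device_type: int) -> None:
        # Mirror the PoC behavior: only the CPU device is accepted.
        if device_type != K_DL_CPU:
            raise ValueError("only the CPU device is supported")
        self._run_model = run_model
        # Relay parameter name -> positional argument index.
        self._positions = {name: i for i, name in enumerate(input_names)}
        self._num_inputs = len(input_names)
        # Allocate every argument buffer (inputs first, then outputs) up front.
        total = self._num_inputs + num_outputs
        self._args: List[List[int]] = [[0] * num_elements for _ in range(total)]

    def set_input(self, name: str, values: Sequence[int]) -> None:
        buf = self._args[self._positions[name]]
        if len(values) != len(buf):
            raise ValueError("input size does not match the allocated buffer")
        buf[:] = values  # copy into the pre-allocated buffer

    def run(self) -> None:
        # Pass inputs followed by outputs, as positional arguments.
        self._run_model(*self._args)

    def get_output(self, index: int) -> List[int]:
        return list(self._args[self._num_inputs + index])


def fused_add_run_model(a: List[int], b: List[int], out: List[int]) -> None:
    """Stand-in for the generated run_model: elementwise uint8 add."""
    for i in range(len(out)):
        out[i] = (a[i] + b[i]) % 256  # emulate uint8 wraparound


executor = ToyAotExecutor(fused_add_run_model, ["a", "b"], 1, 2, K_DL_CPU)
executor.set_input("a", [1, 2])
executor.set_input("b", [3, 5])
executor.run()
result = executor.get_output(0)
```

The toy allocates all buffers at construction time and never inside `run`, which reflects the RFC's statement that the generated function presumes its tensors already live on the device.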