From d9abc0db12f01bf2a445916244b14003518f4747 Mon Sep 17 00:00:00 2001
From: Manupa Karunaratne
Date: Tue, 6 Jul 2021 12:45:26 +0100
Subject: [PATCH 1/8] [RFC] TVM Unified Static Memory Planning

This commit adds the RFC (.md) for USMP
---
 rfcs/000y_Unified_Static_Memory_Planning.md | 467 ++++++++++++++++++++
 1 file changed, 467 insertions(+)
 create mode 100644 rfcs/000y_Unified_Static_Memory_Planning.md

diff --git a/rfcs/000y_Unified_Static_Memory_Planning.md b/rfcs/000y_Unified_Static_Memory_Planning.md
new file mode 100644
index 00000000..3b8e8305
--- /dev/null
+++ b/rfcs/000y_Unified_Static_Memory_Planning.md
@@ -0,0 +1,467 @@
+ Feature Name: Unified Static Memory Planner
+ Start Date: 2021 June 1
+ RFC PR: #000y
+ GitHub Issue: https://github.com/apache/tvm/issues/8404

# Background

Currently, given an ML model, TVM will primarily generate two main artifacts :

* A1 : Description of the sequential execution of operators :
    1. If the "executor" is "graph", this would be a JSON
    2. If the "executor" is "aot", this would be a main function describing the call graph of operators
* A2 : library of operators (in the form of runtime.Module)

A1 is generally created by lowering the "main" relay function, and A2 is created by lowering fused relay primitive functions → TIR PrimFuncs → C or LLVM artifacts of the operator library.

### Is there some sort of memory planning already being performed ?

Yes, there is.

For A1, the inter-(fused) operator tensors are visible in the "main" relay function. Thus, there currently exists a Relay-level pass known as "GraphPlanMemory" that works on the Relay IR to share the space used by tensors which are not live simultaneously and are visible between (fused) operators. Currently, the said pass uses the Shared Memory Buffer Object memory planning scheme (see https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html) to perform the planning.

For A2, the operators are lowered to TIR PrimFuncs. There exists a pass called StorageRewrite that more or less does the same thing as "GraphPlanMemory", but on TIR, for the tensors that are visible within (fused) operators and are not live simultaneously.

# Motivation

For embedded use-cases, it's widely accepted that aggressive memory optimizations are vital. Initially we are looking at enable memory planning for embedded use-cases using the AoT executor.

Therefore, there exist two main shortcomings of the current approach :

* The memory used by intermediary tensors within operators is not shared with the memory used by inter-operator tensors.

Example TIR :
```
primfn(placeholder_3: handle, placeholder_4: handle, placeholder_5: handle, T_cast_1: handle) -> ()
  attr = {"global_symbol": "fused_nn_conv2d_add_fixed_point_multiply_clip_cast_cast_21", "tir.noalias": True}
  buffers = {T_cast: Buffer(T_cast_2: Pointer(int16), int16, [1, 56, 56, 128], []),
             placeholder_2: Buffer(placeholder_6: Pointer(int32), int32, [1, 1, 1, 128], []),
             placeholder: Buffer(placeholder_7: Pointer(int16), int16, [1, 56, 56, 128], []),
             placeholder_1: Buffer(placeholder_8: Pointer(int16), int16, [3, 3, 128, 1], [])}
  buffer_map = {placeholder_3: placeholder, placeholder_4: placeholder_1, placeholder_5: placeholder_2, T_cast_1: T_cast} {
  attr [PaddedInput: Pointer(int16)] "storage_scope" = "global";
  allocate(PaddedInput, int16, [430592]);
  attr [DepthwiseConv2d: Pointer(int32)] "storage_scope" = "global";
  allocate(DepthwiseConv2d, int32, [401408]) {
    for (i1: int32, 0, 58) {
      for (i2: int32, 0, 58) {
        for (i3: int32, 0, 128) {
          PaddedInput[(((i1*7424) + (i2*128)) + i3)] = @tir.if_then_else(((((1 <= i1) && (i1 < 57)) && (1 <= i2)) && (i2 < 57)), (int16*)placeholder_7[((((i1*7168) + (i2*128)) + i3) - 7296)], 0i16, dtype=int16)
        }
```

The above TIR snippet shows that two intra operator buffers PaddedInput, DepthwiseConv2d is not visible to Relay Graph Plan Memory to be shared.

* Assumption of local optimization : performing sharing inside the operator first and sub-subsequently sharing that workspace with inter-operator tensors, would be sub-optimal.

Thus, for the embedded use-cases, we'd need a unified static memory planner that performs memory planning of all tensors holistically to achieve the best memory utilization.

# Goals

G1. There would be no TVMBackendAlloc(/Free)Workspace calls generated for tir.allocates that could be evaluated at compile time.

Currently, the TVM codegen and the AoT executor rely on TVMB(A/F)W calls to increment/decrement a pointer into a user-provided workspace buffer. By the end of this set of work, if the backend uses Unified Static Memory Planning, there should be no TVMB(A/F)W calls; rather, the correct offsets into the user-provided buffer should be codegen'd for allocates that could be evaluated at compile time (a sketch illustrating this is given at the end of this section). The dynamically sized allocates will remain untouched, and thus will be lowered as usual.

G2. The static memory planning algorithm should be changeable.

There are a variety of memory planning algorithms in discussion, with different tradeoffs (see https://discuss.tvm.apache.org/t/discussion-alignment-memory-planning/9730 and https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html). Depending on the topology and schedules of intermediary buffers, it should be easy to change the memory planning algorithm. However, the current design ties the algorithm intimately to the IR constructs – making it harder to modularize or change the algorithm without inventing a whole new pass. In reality, the outcome of USMP's algorithm is offsets within a given workspace buffer; to produce them, it should only need to know the size of each tensor and their relative liveness. Therefore, the algorithm interface to USMP should be kept simple, so that more algorithms can be added.

G3. Multiple pool support (including constants)

Ideally, the user would expect to provide these buffers in the granularity of the memories they'd want to pin them to. E.g., if there are two RW memories : DRAM and SRAM, the buffers need to be identified and pooled by the compiler. Similiarly, for constant data, we need to have a mechanism to allow user to pin them to appropriate memories and addresses in the IR would simply be offsets into the constant buffer(s) provided by the user.
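To make G1 concrete, the following is a hand-written sketch (in C) of the intended shape of the change. It is illustrative only – the function names, operator body and buffer size are made up and real codegen'd code looks different – but TVMBackendAllocWorkspace/TVMBackendFreeWorkspace are the actual TVM C backend APIs being referred to :

```c
#include <stdint.h>
#include <tvm/runtime/c_backend_api.h> /* TVMBackendAllocWorkspace / TVMBackendFreeWorkspace */

/* Before USMP : a compile-time-sized tir.allocate lowers to runtime workspace calls. */
int32_t fused_op_without_usmp(int16_t* input, int32_t* output) {
  int16_t* padded_input = (int16_t*)TVMBackendAllocWorkspace(
      /*device_type=*/1, /*device_id=*/0, /*nbytes=*/430592 * sizeof(int16_t),
      /*dtype_code_hint=*/0, /*dtype_bits_hint=*/16);
  if (padded_input == NULL) return -1;
  /* ... operator body ... */
  return TVMBackendFreeWorkspace(/*device_type=*/1, /*device_id=*/0, padded_input);
}

/* After USMP : the same allocate becomes a compile-time-constant offset into a
   workspace pool that the user provides and that is threaded through the call graph. */
int32_t fused_op_with_usmp(int16_t* input, int32_t* output, uint8_t* workspace_buffer) {
  int16_t* padded_input = (int16_t*)&workspace_buffer[0]; /* offset chosen by the planner */
  /* ... operator body ... */
  return 0;
}
```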
# Guide-level explanation

## U1: Most simple use case

### TVMC

```
tvmc compile my_model.tflite --executor=aot --output-format=mlf --target=c
```

### Codegen'd artifacts

```
//Codegen'd artifacts in metadata.c (lib0.c)
const TVMModel my_model = {
    ...
    .entrypoint = &entrypoint,
};

static uint8_t workspace_buffer[WORKSPACE_BUFFER_SIZE];
static const uint8_t parameters_buffer[PARAMETERS_BUFFER_SIZE] = <data>;

static int32_t entrypoint(TVMInputs_my_model* inputs,
                          TVMOutputs_my_model* outputs,
                          TVMContext* context){
    return my_model_main(inputs->input0,
                         outputs->output0,
                         workspace_buffer,
                         parameters_buffer,
                         context->resource_handle);
}
```
```
// metadata.h

typedef struct {
    uint8_t* input0;
} TVMInputs_my_model;

typedef struct {
    uint8_t* output0;
} TVMOutputs_my_model;
```

### User Application
```
// The User Application
extern const TVMModel my_model;
int main(...) {
    ...
    TVMInputs_my_model inputs = {my_data};
    TVMOutputs_my_model outputs = {output_space};
    TVMExecute(&my_model,
               &inputs,
               &outputs,
               NULL);
}
```
## U2: User wants to share workspaces

### TVMC
```
tvmc compile my_model_1.tflite
     --executor=aot
     --output-format=mlf
     --target=accel,c
     --with-workspace-buffer="name=sram;target=c,accel"

tvmc compile my_model_2.tflite
     --executor=aot
     --output-format=mlf
     --target=accel,c
     --with-workspace-buffer="name=sram;target=c,accel"
```
### Codegen'd Artifacts
```
//Codegen'd artifacts in metadata.c (lib0.c)
const TVMModel my_model_1 = {
    ...
    .entrypoint = &entrypoint,
};

static const uint8_t parameters_buffer[PARAMETERS_BUFFER_SIZE] = <data>;

static int32_t entrypoint(TVMInputs_my_model_1* inputs,
                          TVMOutputs_my_model_1* outputs,
                          TVMContext* context){
    return my_model_1_main(inputs->input0,
                           outputs->output0,
                           parameters_buffer,
                           context->workspaces.sram,
                           context->resource_handle);
}
```
```
// metadata.h

#define TVM_MY_MODEL_1_SRAM_WORKSPACE_BUFFER_SIZE xxxx

typedef struct {
    uint8_t* sram;
} TVMWorkspaces_my_model_1;

typedef struct {
    uint8_t* input0;
} TVMInputs_my_model_1;

typedef struct {
    uint8_t* output0;
} TVMOutputs_my_model_1;
```
```
//Codegen'd artifacts in metadata.c (lib0.c)

const TVMModel my_model_2 = {
    ...
    .entrypoint = &entrypoint,
};
```
```
static const uint8_t parameters_buffer[PARAMETERS_BUFFER_SIZE] = <data>;

static int32_t entrypoint(TVMInputs_my_model_2* inputs,
                          TVMOutputs_my_model_2* outputs,
                          TVMContext* context){
    return my_model_2_main(inputs->input0,
                           outputs->output0,
                           parameters_buffer,
                           context->workspaces.sram,
                           context->resource_handle);
}
```
```
// metadata.h

#define TVM_MY_MODEL_2_SRAM_WORKSPACE_BUFFER_SIZE xxxx

typedef struct {
    uint8_t* sram;
} TVMWorkspaces_my_model_2;

typedef struct {
    uint8_t* input0;
} TVMInputs_my_model_2;

typedef struct {
    uint8_t* output0;
} TVMOutputs_my_model_2;
```
### User Application
```
// The User Application
extern const TVMModel my_model_1;
extern const TVMModel my_model_2;

// Please calculate the maximum of TVM_MY_MODEL_1_SRAM_WORKSPACE_BUFFER_SIZE and
// TVM_MY_MODEL_2_SRAM_WORKSPACE_BUFFER_SIZE and define it as TVM_MY_MODELS_COMMON_WORKSPACE_BUFFER_SIZE.
// Alternatively, the user could use a malloc (if permitted and desired) to calculate the maximum at runtime.
static uint8_t workspace_buffer[TVM_MY_MODELS_COMMON_WORKSPACE_BUFFER_SIZE];

int main(...) {
    ...
    TVMContext context;
    TVMInputs_my_model_1 inputs_1 = {my_data_1};
    TVMOutputs_my_model_1 outputs_1 = {output_space_1};
    TVMWorkspaces_my_model_1 workspaces_1 = {
        .sram = workspace_buffer,
    };
    TVMSetWorkspaces(&context, &workspaces_1);
    TVMExecute(&my_model_1, &inputs_1, &outputs_1, &context);
    ...
    TVMInputs_my_model_2 inputs_2 = {my_data_2};
    TVMOutputs_my_model_2 outputs_2 = {output_space_2};
    TVMWorkspaces_my_model_2 workspaces_2 = {
        .sram = workspace_buffer,
    };
    TVMSetWorkspaces(&context, &workspaces_2);
    TVMExecute(&my_model_2, &inputs_2, &outputs_2, &context);
    ...
}
```
## U3 : User wants to pin buffers to different memories

### TVMC
```
tvmc compile my_model.tflite
     --executor=aot
     --target=accel,c
     --with-workspace-buffer="name=dtcm;target=c;size=1000"  # Here the size is more of a hint/guide provided to USMP
     --with-workspace-buffer="name=sram;target=c,accel"
     --with-parameter-buffer="name=itcm;target=c;size=5000"  # Here the size is more of a hint/guide provided to USMP
     --with-parameter-buffer="name=flash;target=c,accel"
```
### Codegen'd Artifacts
```
//Codegen'd artifacts in metadata.c (lib0.c)
const TVMModel my_model = {
    ...
    .entrypoint = &entrypoint,
};

static int32_t entrypoint(TVMInputs_my_model* inputs,
                          TVMOutputs_my_model* outputs,
                          TVMContext* context){
    return my_model_main(inputs->input0,
                         outputs->output0,
                         context->workspaces.dtcm,
                         context->workspaces.sram,
                         context->parameters.itcm,
                         context->parameters.flash,
                         context->resource_handle);
}
```
```
// metadata.h

#define TVM_MY_MODEL_DTCM_WORKSPACE_BUFFER_SIZE xxxx
#define TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE xxxx
#define TVM_MY_MODEL_ITCM_PARAMETER_BUFFER_SIZE xxxx
#define TVM_MY_MODEL_FLASH_PARAMETER_BUFFER_SIZE xxxx

typedef struct {
    uint8_t* dtcm;
    uint8_t* sram;
} TVMWorkspaces_my_model;

typedef struct {
    uint8_t* itcm;
    uint8_t* flash;
} TVMParameters_my_model;

typedef struct {
    uint8_t* input0;
} TVMInputs_my_model;

typedef struct {
    uint8_t* output0;
} TVMOutputs_my_model;
```
### User Application
```
// The User Application
extern const TVMModel my_model;
__attribute__((section("ITCM"))) const uint8_t my_model_params_1[TVM_MY_MODEL_ITCM_PARAMETER_BUFFER_SIZE] = <data>;
__attribute__((section("FLASH"), aligned(16))) const uint8_t my_model_params_2[TVM_MY_MODEL_FLASH_PARAMETER_BUFFER_SIZE] = <data>;
__attribute__((section("DTCM"))) static uint8_t workspace_buffer_1[TVM_MY_MODEL_DTCM_WORKSPACE_BUFFER_SIZE];
__attribute__((section("SRAM"), aligned(16))) static uint8_t workspace_buffer_2[TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE];

int main(...) {
    ...
    TVMContext context;
    TVMInputs_my_model inputs = {input};
    TVMOutputs_my_model outputs = {output};
    TVMWorkspaces_my_model workspaces = {
        .dtcm = workspace_buffer_1,
        .sram = workspace_buffer_2,
    };
    TVMParameters_my_model parameters = {
        .itcm = my_model_params_1,
        .flash = my_model_params_2,
    };
    TVMSetWorkspaces(&context, &workspaces);
    TVMSetParameters(&context, &parameters);
    TVMExecute(&my_model, &inputs, &outputs, &context);
}
```
# Reference-level explanation

## Overview

This should be an IRModule (TIR) → IRModule (TIR) pass.

Inputs :
* AoT TIR PrimFunc (the control function describing the call graph to operators)
* All Operator Functions
* the maximum size for each pool

We could use "pinned_memory" (see below) to tag buffers with a suggested priority order determined by the scheduler. The idea is USMP will try to pool them using the preferred "pinned_memory" and fall back whenever the size exceeds the user-provided max size for each pool (if any).

Outputs :
* AoT TIR PrimFunc accepting pool buffers from the user.
* All Operator functions accepting pool buffers.
    * Each operator function should address tensors using the correct offset in the correct pool buffer

Special Parametric Inputs :
* function : The algorithm to be used for planning. From a component PoV, the algorithm is a special input with a defined interface.

The current proposal for the interface is as follows :
```
struct BufferInfo {
    Integer uid;
    Integer size_bytes;
    Integer alignment;
    Array<BufferInfo> conflicts; // the conflicting BufferInfo objects
    Array<String> pool_candidates;
    String pool_name;
    Integer pool_offset;
}
```
```
void (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)
```
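To illustrate how simple this interface can be kept for an algorithm author, here is a minimal, self-contained sketch (plain C with stand-in types instead of TVM's Array/Map objects; all names are hypothetical) of one algorithm that could sit behind it – the greedy-by-size placement described in the TFLite blog post linked above. Alignment is assumed to be >= 1.

```c
#include <stdlib.h>

/* Stand-in for the BufferInfo object above. */
typedef struct BufferInfo BufferInfo;
struct BufferInfo {
  const char* name;
  size_t size_bytes;
  size_t alignment;       /* must be >= 1 */
  BufferInfo** conflicts; /* buffers that are live simultaneously with this one */
  size_t num_conflicts;
  size_t pool_offset;     /* output : populated by the planner */
};

static int cmp_size_desc(const void* a, const void* b) {
  size_t sa = (*(BufferInfo* const*)a)->size_bytes;
  size_t sb = (*(BufferInfo* const*)b)->size_bytes;
  return (sa < sb) - (sa > sb);
}

static int overlaps(size_t off, size_t size, const BufferInfo* p) {
  return off < p->pool_offset + p->size_bytes && p->pool_offset < off + size;
}

/* Greedy-by-size : place the largest buffers first, each at the lowest aligned
   offset that does not overlap an already-placed conflicting buffer.
   Returns the total pool size the resulting placement needs. */
size_t greedy_by_size_plan(BufferInfo* bufs, size_t n) {
  BufferInfo** order = (BufferInfo**)malloc(n * sizeof(*order));
  for (size_t i = 0; i < n; ++i) order[i] = &bufs[i];
  qsort(order, n, sizeof(*order), cmp_size_desc);

  size_t pool_size = 0;
  for (size_t i = 0; i < n; ++i) {
    BufferInfo* b = order[i];
    size_t off = 0;
    int moved;
    do { /* bump the offset until it is aligned and clear of all placed conflicts */
      moved = 0;
      off = (off + b->alignment - 1) / b->alignment * b->alignment;
      for (size_t j = 0; j < i; ++j)
        for (size_t k = 0; k < b->num_conflicts; ++k)
          if (b->conflicts[k] == order[j] && overlaps(off, b->size_bytes, order[j])) {
            off = order[j]->pool_offset + order[j]->size_bytes;
            moved = 1;
          }
    } while (moved);
    b->pool_offset = off;
    if (off + b->size_bytes > pool_size) pool_size = off + b->size_bytes;
  }
  free(order);
  return pool_size;
}
```

Nothing in this sketch inspects TIR; it consumes sizes, alignments and conflicts and produces offsets, which is exactly the separation G2 asks for.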
### Special Considerations :

* tir.constants : TIR does not have the ability to represent constants – which is limiting and often leads to having side-channels to carry constants between TIR compiler passes, including this one.
Therefore, in this work, as a pre-requisite, we should aim to fix this by supporting tir.constants (similar to relay.constants).
    * Why do we need constants expressed in TIR ?
    * If not, they would have to be represented as inputs to the TIR main function (logic : anything that is not expressible in TIR will become an input). In that case, we would need to associate each such Var with a special tag to indicate it is constant, along with its metadata (e.g., desired pools, alignment requirements, etc.)
* Currently, "with" or "let" scopes are tree-structured and carry a transitive property. E.g., if tensor A is live with tensor B && tensor B is live with tensor C → tensor A is live with tensor C – which may not always be true.
Thus, the current "let" or "with" scopes are unable to express liveness information. Therefore, we'd need a side-channel to express this information.

### How should the input TIR to USMP be lowered ?

##### Step 1 : The bound relay.const in Relay IR should be lowered via TE → TIR as tir.constants
After Step 1 (introducing tir.constants to hold constant data), the TIR code should look as follows :
```
# This snippet shows the format of pre-USMP pseudo TIR code.

def main(input1: ty.handle, output1: ty.handle):
    my_model_fused_op1 = tir.allocate(..., pinned_memory=["dtcm", "sram"])
    my_model_fused_op2 = tir.allocate(..., pinned_memory=["sram"])
    tir.call("my_model_fused_op1", input1, my_model_fused_op1, fused_op1_weights, fused_op1_biases)
    tir.call("my_model_fused_op2", my_model_fused_op1, my_model_fused_op2, fused_op2_weights, fused_op2_biases)

def my_model_fused_op1(input : ty.handle, output : ty.handle):
    tir.func_attr({"global_symbol": "my_model_fused_op1", "tir.noalias": True})
    intermediate_tensor_1 = tir.allocate(..., pinned_memory=["dtcm", "sram"]) # By default they will have all possible memories
    intermediate_tensor_2 = tir.allocate(..., pinned_memory=["dtcm", "sram"]) # unless the scheduler removes them
    weights = tir.allocate_const(..., pinned_memory=["itcm", "flash"])
    biases = tir.allocate_const(..., pinned_memory=["itcm", "flash"])
    ...

def my_model_fused_op2(input : ty.handle, output : ty.handle):
    tir.func_attr({"global_symbol": "my_model_fused_op2", "tir.noalias": True})
    intermediate_tensor_1 = tir.allocate(..., pinned_memory=["dtcm", "sram"])
    intermediate_tensor_2 = tir.allocate(..., pinned_memory=["dtcm", "sram"])
    weights = tir.allocate_const(..., pinned_memory=["itcm", "flash"])
    biases = tir.allocate_const(..., pinned_memory=["itcm", "flash"])
    ...
```
##### Step 2 : Run an analysis pass to populate a Map<tir.StmtNode, BufferInfo> that contains buffer information as defined above (see the struct BufferInfo).

##### Step 3 : Use the updated Map<tir.StmtNode, BufferInfo> to generate Array<BufferInfo> and Map<String, Integer> pool_sizes.

##### Step 4 : Call the provided/default algorithm (void (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)) to populate pool_name and pool_offset.

##### Step 5 : Use the updated Map (now with pool_name and pool_offset) to mutate the IR, which would result in the following :
```
# This snippet shows the format of post-USMP pseudo TIR code.

def main(input1: ty.handle, output1: ty.handle, params_1 : ty.handle, params_2 : ty.handle, workspace_1 : ty.handle, workspace_2 : ty.handle):
    tir.call("my_model_fused_op1", input1, params_1, params_2, workspace_1, workspace_2)
    tir.call("my_model_fused_op2", params_1, params_2, workspace_1, workspace_2)

def my_model_fused_op1(input, params_1, params_2, workspace_1, workspace_2):
    tir.func_attr({"global_symbol": "my_model_fused_op1", "tir.noalias": True})
    intermediate_tensor_1 = tir.load("uint8", workspace_1.data, <offset>)
    intermediate_tensor_2 = tir.load("uint8", workspace_1.data, <offset>)
    output = tir.load("uint8", workspace_1.data, <offset>)
    weights = tir.load("uint8", params_1.data, <offset>)
    biases = tir.load("uint8", params_1.data, <offset>)
    ...

def my_model_fused_op2(params_1, params_2, workspace_1, workspace_2):
    tir.func_attr({"global_symbol": "my_model_fused_op2", "tir.noalias": True})
    input = tir.load("uint8", workspace_1.data, <offset>)
    intermediate_tensor_1 = tir.load("uint8", workspace_1.data, <offset>)
    intermediate_tensor_2 = tir.load("uint8", workspace_2.data, <offset>)
    output = tir.load("uint8", workspace_2.data, <offset>)
    weights = tir.load("uint8", params_1.data, <offset>)
    biases = tir.load("uint8", params_2.data, <offset>)
    ...
```
# Code Structure

* src/tir/usmp/analysis/ -- this is where analysis pases of USMP will live
* src/tir/usmp/transforms/ -- this is where transform pases of USMP will live
* src/tir/usmp/usmp.cc -- this is the main integration of USMP that exposes the full TIR --> TIR transformation as described.
* tests/python/unittest/test_tir_usmp_*.py -- this is where unit tests for each of the passes and the pass pipeline for USMP as a component will live.

NOTE 1: All the above passes will have a mirror in Python.

NOTE 2: to support tir.constants generally, we'll be enhancing the bound relay.constants to be lowered down to tir.constants for codegen. Those changes will appear throughout the stack accordingly.
\ No newline at end of file

From e296e026653d9cdefec992552171f5002c072be1 Mon Sep 17 00:00:00 2001
From: Manupa Karunaratne
Date: Tue, 6 Jul 2021 13:16:25 +0100
Subject: [PATCH 2/8] [RFC] TVM Unified Static Memory Planning

* Updating the RFC with the PR number
---
 ...emory_Planning.md => 0009_Unified_Static_Memory_Planning.md} | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
 rename rfcs/{000y_Unified_Static_Memory_Planning.md => 0009_Unified_Static_Memory_Planning.md} (99%)

diff --git a/rfcs/000y_Unified_Static_Memory_Planning.md b/rfcs/0009_Unified_Static_Memory_Planning.md
similarity index 99%
rename from rfcs/000y_Unified_Static_Memory_Planning.md
rename to rfcs/0009_Unified_Static_Memory_Planning.md
index 3b8e8305..7372a323 100644
--- a/rfcs/000y_Unified_Static_Memory_Planning.md
+++ b/rfcs/0009_Unified_Static_Memory_Planning.md
@@ -1,6 +1,6 @@
  Feature Name: Unified Static Memory Planner
  Start Date: 2021 June 1
- RFC PR: #000y
+ RFC PR: #0009
  GitHub Issue: https://github.com/apache/tvm/issues/8404
 
 # Background

From 9fe84bbdf4598a023addcbc2c70a5e1fad085e8f Mon Sep 17 00:00:00 2001
From: Manupa Karunaratne
Date: Wed, 7 Jul 2021 10:39:00 +0100
Subject: [PATCH 3/8] [RFC] TVM Unified Static Memory Planning

* Addressing Tristan's comments.
Change-Id: Ieb64ae6fc1de12374836c7f754a70b735fe5d379
---
 rfcs/0009_Unified_Static_Memory_Planning.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/rfcs/0009_Unified_Static_Memory_Planning.md b/rfcs/0009_Unified_Static_Memory_Planning.md
index 7372a323..4ba9bf9b 100644
--- a/rfcs/0009_Unified_Static_Memory_Planning.md
+++ b/rfcs/0009_Unified_Static_Memory_Planning.md
@@ -10,6 +10,7 @@
 * A1 : Description of the sequential execution of operators :
     1. If the "executor" is "graph", this would be a JSON
     2. If the "executor" is "aot", this would be a main function describing the call graph of operators
+    3. If the "executor" is "vm", this would be a series of VM bytecode instructions
 * A2 : library of operators (in the form of runtime.Module)
 
 A1 is generally created by lowering the "main" relay function, and A2 is created by lowering fused relay primitive functions → TIR PrimFuncs → C or LLVM artifacts of the operator library.
@@ -361,10 +362,10 @@
 Special Parametric Inputs :
 * function : The algorithm to be used for planning. From a component PoV, the algorithm is a special input with a defined interface.
 
-The current proposal for the interface is as follows :
+The current proposal for the interface of the memory planning algorithm is as follows :
 ```
 struct BufferInfo {
-    Integer uid;
+    String name_hint; // this is the tir.buffer name
     Integer size_bytes;
     Integer alignment;
     Array<BufferInfo> conflicts; // the conflicting BufferInfo objects
@@ -376,6 +377,7 @@
 ```
 void (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)
 ```
+The memory planning algorithm is expected to populate the assigned pool_name and the offset. Additionally, the second argument provides size constraints for each pool (if any).
 ### Special Considerations :
 
 * tir.constants : TIR does not have the ability to represent constants – which is limiting and often leads to having side-channels to carry constants between TIR compiler passes, including this one.
@@ -464,4 +466,8 @@
 
 NOTE 1: All the above passes will have a mirror in Python.
 
-NOTE 2: to support tir.constants generally, we'll be enhancing the bound relay.constants to be lowered down to tir.constants for codegen. Those changes will appear throughout the stack accordingly.
\ No newline at end of file
+NOTE 2: to support tir.constants generally, we'll be enhancing the bound relay.constants to be lowered down to tir.constants for codegen. Those changes will appear throughout the stack accordingly.
+
+# Drawbacks
+
+* The relay "main" function that describes the call order to operator PrimFuncs has to be described in TIR to be able to integrate the USMP into the respective executor codegen. However, we don't view this as a major problem, as the relay "main" function could easily be lowered to TIR.
\ No newline at end of file

From 83bd8588242209f37bf7da3be5f3e1cfa89804c7 Mon Sep 17 00:00:00 2001
From: Manupa Karunaratne
Date: Mon, 12 Jul 2021 14:13:15 +0100
Subject: [PATCH 4/8] [RFC] TVM Unified Static Memory Planning

* Addressing Tristan's further comments

Change-Id: I5eabfda362fa85fa4c377d20043f938ffc6de456
---
 rfcs/0009_Unified_Static_Memory_Planning.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/rfcs/0009_Unified_Static_Memory_Planning.md b/rfcs/0009_Unified_Static_Memory_Planning.md
index 4ba9bf9b..b5e6f8d3 100644
--- a/rfcs/0009_Unified_Static_Memory_Planning.md
+++ b/rfcs/0009_Unified_Static_Memory_Planning.md
@@ -375,9 +375,9 @@
 ```
-void (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)
+Array<BufferInfo> (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)
 ```
-The memory planning algorithm is expected to populate the assigned pool_name and the offset. Additionally, the second argument provides size constraints for each pool (if any).
+The memory planning algorithm is expected to populate the assigned pool_name and the offset, and to return the updated array of BufferInfo objects. Additionally, the second argument provides size constraints for each pool (if any).

From c520a3912279bcd3c432dafd4070661f614882cf Mon Sep 17 00:00:00 2001
From: Manupa Karunaratne
Date: Mon, 9 Aug 2021 15:34:03 +0100
Subject: [PATCH 5/8] [RFC] TVM Unified Static Memory Planning

* addressed comments

Change-Id: I12fa85e5ea10eee328be4c5d51c9a481a90dedb5
---
 rfcs/0009_Unified_Static_Memory_Planning.md | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/rfcs/0009_Unified_Static_Memory_Planning.md b/rfcs/0009_Unified_Static_Memory_Planning.md
index b5e6f8d3..24b8edba 100644
--- a/rfcs/0009_Unified_Static_Memory_Planning.md
+++ b/rfcs/0009_Unified_Static_Memory_Planning.md
@@ -7,11 +7,12 @@
 Currently, given an ML model, TVM will primarily generate two main artifacts :
 
-* A1 : Description of the sequential execution of operators :
+* A1 : executor configuration : the description of the sequential execution of operators
     1. If the "executor" is "graph", this would be a JSON
     2. If the "executor" is "aot", this would be a main function describing the call graph of operators
     3. If the "executor" is "vm", this would be a series of VM bytecode instructions
 * A2 : library of operators (in the form of runtime.Module)
+* A3 : compiled parameters of the model
 
 A1 is generally created by lowering the "main" relay function, and A2 is created by lowering fused relay primitive functions → TIR PrimFuncs → C or LLVM artifacts of the operator library.
@@ -19,7 +20,7 @@
 Yes, there is.
 
-For A1, the inter-(fused) operator tensors are visible in the "main" relay function. Thus, there currently exists a Relay-level pass known as "GraphPlanMemory" that works on the Relay IR to share the space used by tensors which are not live simultaneously and are visible between (fused) operators. Currently, the said pass uses the Shared Memory Buffer Object memory planning scheme (see https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html) to perform the planning.
+For A1, the inter-(fused) operator tensors are visible in the "main" relay function. There currently exists a Relay-level pass known as "GraphPlanMemory" that works on the Relay IR to share the space used by tensors which are not live simultaneously and are visible between (fused) operators. Currently, the said pass uses the Shared Memory Buffer Object memory planning scheme (see https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html) to perform the planning.
+For A1, the inter-(fused) operator tensors are visible in the "main" relay function. There exists currently a Relay level pass known as "GraphPlanMemory" that works on the Relay IR to share the space used by tensors which are not live simultaneously and are visible between (fused) operators . Currently, the said pass will use Shared Memory Buffer Object memory planning scheme (See https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html) to perform the planning. For A2, the operators are lowered to TIR PrimFuncs. There exist a pass called StorageRewrite that more or less does the same thing as "GraphPlanMemory" but on TIR for the tensors visible within (fused) operators and are not live simultaneously. @@ -53,7 +54,7 @@ Example TIR : } ``` -The above TIR snippet shows that two intra operator buffers PaddedInput, DepthwiseConv2d is not visible to Relay Graph Plan Memory to be shared. +The above TIR snippet shows that two intra operator buffers PaddedInput, DepthwiseConv2d are not visible for optimization by the Relay-level GraphPlanMemory approach. * Assumption of local optimization : performing sharing inside the operator first and sub-subsequently sharing that workspace with inter-operator tensors, would be sub-optimal. @@ -63,7 +64,7 @@ Thus, for the embedded use-cases, we'd need a unified static memory planner that G1. There would be no TVMBackendAlloc(/Free)Workspace calls generated for tir.allocates that could be evaluated at compile time. -Currently, the TVM codegen and the AoT executor relies on TVMB(A/F)W calls to increment/decrement a pointer of user provided workspace buffer. By the end of this set of work, if the backend uses Unified Static Memory Planning, there should not be TVMB(A/F)W calls rather correct offset in to the user provided buffer should be codegen'd for allocates that could be evaluated at compile time. The dynamically sized allocates will remain untouched, thus will be lowered as usual. +Currently, the TVM codegen and the AoT executor relies on TVMB(A/F)W calls to increment/decrement a pointer of user provided workspace buffer. By the end of this set of work, if the backend uses Unified Static Memory Planning, there should not be TVMB(A/F)W calls rather correct offset in to the user provided buffer should be codegen'd for allocates for which the size argument could be evaluated at compile time. The dynamically sized allocates will remain untouched, thus will be lowered as usual. G2. The static memory planning algorithm should be changeable. @@ -377,7 +378,7 @@ The current proposal for the interface of the memory planning algorithm is as fo ``` Array (*foo)(Array buffers, Map pool_sizes) ``` -The memory planning algorithm is expected to populate the assigned pool_name with the offset and return the updated array of BufferInfo objects. Additionally, the second argument provides size constraints for each pool (if any). +The memory planning algorithm is expected to populate pool_name and pool_offset and return the updated array of BufferInfo objects. Additionally, the second argument provides size constraints for each pool (if any). ### Special Considerations : * tir.constants : TIR does not have the ability to represent constants – which is limiting and often leads to having side-channels to carry constants between TIR compiler passes including this one. 
@@ -459,14 +460,16 @@ After Step 1 (introducing tir.constants to hold constant data) : the TIR code sh ``` # Code Structure -* src/tir/usmp/analysis/ -- this is where analysis pases of USMP will live -* src/tir/usmp/transforms/ -- this is where transform pases of USMP will live +* src/tir/usmp/analysis/ -- this is where analysis passes of USMP will live + * python/tir/usmp/analysis/ -- python interface to call analysis passes of USMP +* src/tir/usmp/transforms/ -- this is where transform passes of USMP will live + * python/tir/usmp/transform/ -- python interface to call analysis pases of USMP * src/tir/usmp/usmp.cc -- this is main intergration of USMP that exposes the full TIR --> TIR transformation as described. + * python/tir/usmp/ -- python interface to call the integrated the USMP * tests/python/unittest/test_tir_usmp_*.py -- this where unittests for each of the passes and pass pipeline for USMP as a component will live. -NOTE 1: All the above passes will have a mirror in the python. -NOTE 2: to support tir.constants generally, we'll be enhancing the bound relay.constants to be lowered down to tir.constants to codegen. Those changes will appear through out the stack accordingly. +NOTE : to support tir.constants generally, we'll be enhancing the bound relay.constants to be lowered down to tir.constants to codegen. Those changes will appear through out the stack accordingly. # Drawbacks From f10b0e6d528c47c2741588d2bf795b11668bdcc9 Mon Sep 17 00:00:00 2001 From: Manupa Karunaratne Date: Mon, 20 Sep 2021 07:09:21 +0100 Subject: [PATCH 6/8] [RFC] TVM Unified Static Memory Planning *reflecting the partial changes for tir pinned memory representation *addressing Andrew's comments Change-Id: I40019ecb8e75ba46b1bf415ea70718bbeab3d26b --- rfcs/0009_Unified_Static_Memory_Planning.md | 81 ++++++++++++++------- 1 file changed, 54 insertions(+), 27 deletions(-) diff --git a/rfcs/0009_Unified_Static_Memory_Planning.md b/rfcs/0009_Unified_Static_Memory_Planning.md index 24b8edba..28469042 100644 --- a/rfcs/0009_Unified_Static_Memory_Planning.md +++ b/rfcs/0009_Unified_Static_Memory_Planning.md @@ -76,6 +76,8 @@ Ideally, the user would expect to provide these buffers in the granularity of th # Guide-level explanation +NOTE : the embedded runtime interface used in the example are for demonstration purposes and the actual runtime API is defined and discussed [here.](https://discuss.tvm.apache.org/t/rfc-utvm-embedded-c-runtime-interface/9951) + ## U1: Most simple use case ### TVMC @@ -348,12 +350,27 @@ tvmc compile my_model.tflite --executor=aot --output-format=mlf --target=c This should be a IRModule (TIR) → IRModule (TIR) pass. -Inputs : -* AoT TIR PrimFunc ( the control function describing the call graph to operators) -* All Operator Functions -* the maximum size for each pool We could use "pinned_memory" (see below) to tag buffers with suggested priority order determined by the scheduler. 
+Inputs : +* IRModule containing + * AoT TIR PrimFunc (the control function describing the call graph to operators) + * All Operator Functions + * Each tir.allocate in the IRModule annotated with candidate pools ([Using the annotation field of tir.allocate](https://github.com/apache/tvm-rfcs/blob/c447cbfbd5abceaa7623a0f90cc492784e6f0c0b/rfcs/0023-adding-annotation-field-to-tir.allocate.md)) + + +``` +struct PoolInfoNode : public Object { + String pool_name; + Integer size_bytes; + Integer alignment; + Integer pool_offset; + Map target_access; // 'rw' or 'ro' +} +``` + -The idea is USMP will try to pool them using the preferred "pinned_memory" and fallback whenever the size is exceeding the user provided max size for each pool (if any) +We could use "candidate_memory_pools" ([Using the annotation field of tir.allocate](https://github.com/apache/tvm-rfcs/blob/c447cbfbd5abceaa7623a0f90cc492784e6f0c0b/rfcs/0023-adding-annotation-field-to-tir.allocate.md)) to tag buffers with suggested priority order determined by the scheduler. + +The idea is USMP will try to pool them using the preferred "candidate_memory_pools" and fallback whenever the size is exceeding the user provided max size for each pool (if any). The fallback only happens if the user provide more than one candidate memory pool. If the fallback is not desired by the user, the user need not to provide multiple candidate_memory_pools with size constraints or the scheduling. If the fallback is not desired by the scheduler, the scheduling passes could remove the memory pools from the candidate_memory_pools. Outputs : * AoT TIR PrimFunc accepting pool buffers from the user. @@ -370,21 +387,27 @@ The current proposal for the interface of the memory planning algorithm is as fo Integer size_bytes; Integer alignment; Array conflicts; //the conflicting bufferinfo objs - Array pool_candidates;` - String pool_name;` - Integer pool_offset;` + Array pool_candidates; + } +``` + +``` + struct PoolAllocation { + PoolInfo pool; + Integer offset; } ``` + + ``` -Array (*foo)(Array buffers, Map pool_sizes) +Map (*foo)(Array buffers, Map pool_sizes) ``` -The memory planning algorithm is expected to populate pool_name and pool_offset and return the updated array of BufferInfo objects. Additionally, the second argument provides size constraints for each pool (if any). +The memory planning algorithm is expected to return a Map of BufferInfo to PoolAllocation with the planned offsets into respective pool. ### Special Considerations : * tir.constants : TIR does not have the ability to represent constants – which is limiting and often leads to having side-channels to carry constants between TIR compiler passes including this one. -Therefore, in this work as a pre-requisite we should aim to fix this by supporting tir.constants (similiar to relay.constants). - * Why do we need constants expressed in TIR ? - * If not, it should be represented as inputs to TIR main function (logic : anything that is not expressible in TIR will become inputs). In which case, we would need to associate that Var with a special tag to indicate its constant and its metadata (e.g., desired pools, alignment requirements, etc.) +Therefore, in this work as a pre-requisite we should aim to fix this by supporting tir.constants (similiar to relay.constants). Please refer to the [TIR non-scalar constants RFC](https://github.com/apache/tvm-rfcs/pull/22). + * Currently "with" or "let" scopes are tree structured and carry transitive property. 
E.g., if tensor A is live with tensor B && tensor B is live with tensor C → tensor A is live with tensor C – which may not always be true.
Thus, the current "let" or "with" scopes are unable to express liveness information. Therefore, we'd need a side-channel to express this information.

### How should the input TIR to USMP be lowered ?

##### Step 1 : The bound relay.const in Relay IR should be lowered via TE → TIR as tir.constants
After Step 1 (introducing tir.constants to hold constant data), the TIR code should look as follows :
```
# This snippet shows the format of pre-USMP pseudo TIR code.

def main(input1: ty.handle, output1: ty.handle):
    my_model_fused_op1 = tir.allocate(...) # attrs.candidate_memory_pools = ["dtcm", "sram"]
    my_model_fused_op2 = tir.allocate(...) # attrs.candidate_memory_pools = ["dtcm", "sram"]
    tir.call("my_model_fused_op1", input1, my_model_fused_op1, fused_op1_weights, fused_op1_biases)
    tir.call("my_model_fused_op2", my_model_fused_op1, my_model_fused_op2, fused_op2_weights, fused_op2_biases)

def my_model_fused_op1(input : ty.handle, output : ty.handle):
    tir.func_attr({"global_symbol": "my_model_fused_op1", "tir.noalias": True})
    intermediate_tensor_1 = tir.allocate(...) # attrs.candidate_memory_pools = ["dtcm", "sram"]
    intermediate_tensor_2 = tir.allocate(...) # attrs.candidate_memory_pools = ["dtcm", "sram"]
    weights = tir.allocate_const(...) # attrs.candidate_memory_pools = ["itcm", "flash"]
    biases = tir.allocate_const(...) # attrs.candidate_memory_pools = ["itcm", "flash"]
    ...

def my_model_fused_op2(input : ty.handle, output : ty.handle):
    tir.func_attr({"global_symbol": "my_model_fused_op2", "tir.noalias": True})
    intermediate_tensor_1 = tir.allocate(...) # attrs.candidate_memory_pools = ["dtcm", "sram"]
    intermediate_tensor_2 = tir.allocate(...) # attrs.candidate_memory_pools = ["dtcm", "sram"]
    weights = tir.allocate_const(...) # attrs.candidate_memory_pools = ["itcm", "flash"]
    biases = tir.allocate_const(...) # attrs.candidate_memory_pools = ["itcm", "flash"]
    ...
```
 ##### Step 2 : Run an analysis pass to populate a Map<tir.StmtNode, BufferInfo> that contains buffer information as defined above (see the struct BufferInfo).
+
+Note : here tir.StmtNode is treated as a Union[tir.AllocateNode, tir.AllocateConstNode]
+
+This analysis pass would traverse the full TIR program and construct BufferInfo objects that capture the liveness conflicts between allocates that are live together.
 
-##### Step 3 : Use the updated Map<tir.StmtNode, BufferInfo> to generate Array<BufferInfo> and Map<String, Integer> pool_sizes.
+##### Step 3 : Use the updated Map<tir.StmtNode, BufferInfo> to generate Array<BufferInfo>
 
-##### Step 4 : Call the provided/default algorithm (void (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)) to populate pool_name and pool_offset.
+##### Step 4 : Call the provided/default algorithm (foo, as defined above) to generate Map<BufferInfo, PoolAllocation>
+
+##### Step 5 : Use the updated Map<tir.StmtNode, BufferInfo> and Map<BufferInfo, PoolAllocation> to mutate the IR, which would result in the following :

From d42c0d489c4d9d013a208635f148156d8bfc4609 Mon Sep 17 00:00:00 2001
From: Manupa Karunaratne
Date: Mon, 27 Sep 2021 16:46:02 +0100
Subject: [PATCH 7/8] [RFC] TVM Unified Static Memory Planning

* explaining the fallback and candidate_memory_pools

Change-Id: Iab59de953bd931fe44ae77004f8c014e25b126f8
---
 rfcs/0009_Unified_Static_Memory_Planning.md | 25 ++++++++++++---------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/rfcs/0009_Unified_Static_Memory_Planning.md b/rfcs/0009_Unified_Static_Memory_Planning.md
index 28469042..4aa82a1b 100644
--- a/rfcs/0009_Unified_Static_Memory_Planning.md
+++ b/rfcs/0009_Unified_Static_Memory_Planning.md
@@ -357,20 +357,23 @@
     * Each tir.allocate in the IRModule annotated with candidate pools ([Using the annotation field of tir.allocate](https://github.com/apache/tvm-rfcs/blob/c447cbfbd5abceaa7623a0f90cc492784e6f0c0b/rfcs/0023-adding-annotation-field-to-tir.allocate.md))
 
    ```
    struct PoolInfoNode : public Object {
        String pool_name;
        Integer size_bytes;
        Integer alignment;
        Integer pool_offset;
        Map<Target, String> target_access; // 'rw' or 'ro'
    }
    ```

+The input IRModule is expected to have the "candidate_memory_pools" annotation populated with an ordered list of PoolInfo objects. The ordering will indicate to the planner the order of preference in which each allocate should be pinned to a pool. The core compiler will run a pass to assign candidate_memory_pools to each tir.allocate, based on the target each PrimFunc is executed on, prior to invoking the USMP.
+
+The idea is that USMP will try to pool the buffers using the preferred "candidate_memory_pools" and fall back whenever the size exceeds the user-provided max size for each pool (if any). The fallback only happens if the tir.allocate is annotated with more than one candidate memory pool. Initially, it will take the ordering provided to the TVMC interface.
 
-We could use "candidate_memory_pools" ([Using the annotation field of tir.allocate](https://github.com/apache/tvm-rfcs/blob/c447cbfbd5abceaa7623a0f90cc492784e6f0c0b/rfcs/0023-adding-annotation-field-to-tir.allocate.md)) to tag buffers with a suggested priority order determined by the scheduler.
+If the fallback is not desired, the user need not provide multiple candidate_memory_pools with size constraints to the TVMC interface.
 
-The idea is that USMP will try to pool the buffers using the preferred "candidate_memory_pools" and fall back whenever the size exceeds the user-provided max size for each pool (if any). The fallback only happens if the user provides more than one candidate memory pool. If the fallback is not desired by the user, the user need not provide multiple candidate_memory_pools with size constraints to the scheduling. If the fallback is not desired by the scheduler, the scheduling passes could remove memory pools from the candidate_memory_pools.
+If the fallback is not desired by the scheduler, the scheduling passes could remove the memory pools from the candidate_memory_pools.
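A hypothetical sketch of this fallback rule (names invented; a real planner would test the planned extent of each pool after buffer sharing, rather than a simple running total) :

```c
#include <stddef.h>

typedef struct {
  const char* name;
  long max_size_bytes;   /* -1 : unconstrained */
  long committed_bytes;  /* bytes planned into this pool so far */
} PoolState;

/* Walk an allocate's ordered candidate list and return the highest-priority
   pool that can still take `size` more bytes; NULL means even the last
   candidate is exhausted and planning fails. */
PoolState* select_pool(PoolState* candidates[], size_t num_candidates, long size) {
  for (size_t i = 0; i < num_candidates; ++i) {
    PoolState* p = candidates[i];
    if (p->max_size_bytes < 0 || p->committed_bytes + size <= p->max_size_bytes) {
      p->committed_bytes += size;
      return p;
    }
  }
  return NULL;
}
```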
 Outputs :
 * AoT TIR PrimFunc accepting pool buffers from the user.

From 0b298350c288f0e88c03f1e731a633fea79bac04 Mon Sep 17 00:00:00 2001
From: Manupa Karunaratne
Date: Mon, 4 Oct 2021 15:30:26 +0100
Subject: [PATCH 8/8] [RFC] TVM Unified Static Memory Planning

* Improving text
* Adding more specifics on how to handle the fallback pool
* Renamed TVMC arguments to be pools instead of buffers
---
 rfcs/0009_Unified_Static_Memory_Planning.md | 42 +++++++++++++--------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/rfcs/0009_Unified_Static_Memory_Planning.md b/rfcs/0009_Unified_Static_Memory_Planning.md
index 4aa82a1b..517eb314 100644
--- a/rfcs/0009_Unified_Static_Memory_Planning.md
+++ b/rfcs/0009_Unified_Static_Memory_Planning.md
@@ -7,10 +7,10 @@
 Currently, given an ML model, TVM will primarily generate two main artifacts :
 
-* A1 : executor configuration : the description of the sequential execution of operators
-    1. If the "executor" is "graph", this would be a JSON
-    2. If the "executor" is "aot", this would be a main function describing the call graph of operators
-    3. If the "executor" is "vm", this would be a series of VM bytecode instructions
+* A1 : A compilation artifact that describes a sequence of calls to the operators
+    1. If building for the graph executor, this would be a JSON
+    2. If building for the AoT executor, this would be a main function describing the call graph of operators
+    3. If building for the VM executor, this would be a series of VM bytecode instructions
 * A2 : library of operators (in the form of runtime.Module)
 * A3 : compiled parameters of the model
@@ -26,7 +26,7 @@
 # Motivation
 
 For embedded use-cases, it's widely accepted that aggressive memory optimizations are vital. Initially we are looking at enabling memory planning for embedded use-cases using the AoT executor.
 
 Therefore, there exist two main shortcomings of the current approach :
@@ -56,7 +56,7 @@
 The above TIR snippet shows that the two intra-operator buffers PaddedInput and DepthwiseConv2d are not visible for optimization by the Relay-level GraphPlanMemory approach.
 
-* Assumption of local optimization : performing sharing inside the operator first and sub-subsequently sharing that workspace with inter-operator tensors, would be sub-optimal.
+* Assumption of local optimization : performing sharing inside the operator first and subsequently sharing that workspace with inter-operator tensors would be sub-optimal.
@@ -72,7 +72,8 @@
 G3. Multiple pool support (including constants)
 
-Ideally, the user would expect to provide these buffers in the granularity of the memories they'd want to pin them to. E.g., if there are two RW memories : DRAM and SRAM, the buffers need to be identified and pooled by the compiler. Similiarly, for constant data, we need to have a mechanism to allow user to pin them to appropriate memories and addresses in the IR would simply be offsets into the constant buffer(s) provided by the user.
+Ideally, the user would expect to provide these buffers in the granularity of the memories they'd want to pin them to. E.g., if there are two RW memories : DRAM and SRAM, the buffers need to be identified and pooled by the compiler.
Similiarly, for constant data, we need to have a mechanism to allow user to pin them to appropriate memories and addresses in the IR would simply be offsets into the constant buffer(s) provided by the user +Ideally, the user would expect to provide these buffers in the granularity of the memories they'd want to pin them to. E.g., if there are two RW memories : DRAM and SRAM, the buffers need to be identified and pooled by the compiler. Similiarly for constant data, we need to have a mechanism to allow users to pin them to appropriate memories and addresses. In the IR, they would simply be offsets into the constant buffer(s) provided by the user + # Guide-level explanation @@ -144,14 +145,16 @@ tvmc compile my_model.tflite --executor=aot --output-format=mlf --target=c tvmc compile my_model_1.tflite --executor=aot --output-format=mlf - --target=accel,c - --with-workspace-buffer= "name=sram;target=c,accel" + --target=accel,c + --usmp-workspace-pools=sram + --usmp-workspace-pool-sram= "target=c:rw,accel:rw" tvmc compile my_model_2.tflite --executor=aot --output-format=mlf --target=accel,c - --with-workspace-buffer= "name=sram;target=c,accel" + --usmp-workspace-pools=sram + --usmp-workspace-pool-sram= "target=c:rw,accel:rw" ``` ### Codegen'd Artifacts ``` @@ -264,11 +267,13 @@ tvmc compile my_model.tflite --executor=aot --output-format=mlf --target=c ``` tvmc compile my_model.tflite --executor=aot - --target=accel,c - --with-workspace-buffer= "name=dtcm;target=c;size=1000" # Here the size is more of a hint/guide provided to USMP - --with-workspace-buffer= "name=sram;target=c,accel" - --with-parameter-buffer= "name=itcm;target=c;size=5000" # Here the size is more of a hint/guide provided to USMP - --with-parameter-buffer= "name=flash;target=c,accel" + --target=accel,c + --usmp-workspace-pools=dtcm,sram + --usmp-parameter-pools=itcm,flash + --usmp-workspace-pool-dtcm= "target=c;size=1000" # Here the size is more of a hint/guide provided to USMP + --usmp-workspace-pool-sram= "target=c,accel" + --usmp-parameter-pool-itcm= "target=c;size=5000" # Here the size is more of a hint/guide provided to USMP + --usmp-parameter-pool-flash= "target=c,accel" ``` ### Codegen'd Artifacts ``` @@ -375,6 +380,13 @@ If the fallback is not desired, the user need not to provide multiple candidate_ If the fallback is not desired by the scheduler, the scheduling passes could remove the memory pools from the candidate_memory_pools. +* How is the fallback pool decided ? + +Currently, the pool ordering the user provides to TVMC interface will be used as a priority order for determining pools for tir.allocate/tir.allocate_const nodes (i.e. USMP will try to use highest priority pool for each allocate). Each pool specifies the information on whether it could be acessed by a given set of targets. Therefore, the ordered list of pools will be further filtered by for each allocate node that belongs to different targets. + +It is important to note that scheduling stage could remove pools from candidate pools for performance reasons. + + Outputs : * AoT TIR PrimFunc accepting pool buffers from the user. * All Operator functions accepting pool buffers.