- Feature Name: Buffer Physical Layout
- Authors: Eric Lunderberg (@Lunderberg), Wuwei Lin (@vinx13)
- Start Date: 2021-10-05
- RFC PR: [apache/tvm-rfcs#0039](https://github.com/apache/tvm-rfcs/pull/0039)
- GitHub Issue: Not Yet Written

# Summary
[summary]: #summary

This RFC introduces layout transformations that can be applied to a
buffer during the lowering process.  These transformations will be
part of the schedule, allowing the same compute definition to be used
across multiple different layouts.  These transformations can produce
either flat memory buffers or multi-dimensional memory buffers to be
exposed to the low-level code generators.

# Motivation
[motivation]: #motivation

Currently, TVM assumes that all buffers can be treated as flat memory.
That is, while a rank-N tensor requires N values to describe its shape
and N indices to identify a particular value within it, the underlying
buffer allocated by the low-level codegen has a single value defining
the size, and access into that buffer is done using a single index.
This assumption holds for most cases, such as a CPU accessing RAM, but
doesn't hold in all cases.  For example, texture memory on a GPU
requires two indices to access.  These cases are currently handled on
a case-by-case basis, such as by using
`tvm::tir::builtin::texture2d_store` in a `CallNode`.

In addition, computations that are semantically identical (e.g. 2-d
convolution) require independent compute definitions and schedules
(e.g. `conv2d_nchw` and `conv2d_hwcn`) based on the format of the data
accepted as input.

This RFC introduces a mechanism to specify transformations to be
applied to the layout of buffers in memory, along with a unified
method of presenting multiple indices to the low-level code
generators.  This will allow for target-specific handling of non-flat
memory, and will allow for code re-use across compute definitions that
differ only in memory layout.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

A buffer is represented by a `tvm::tir::Buffer` object, and has some
shape associated with it.  This shape is initially defined from the
buffer's shape in the compute definition.  Buffers can either be
allocated within a `tvm::tir::PrimFunc` using a `tvm::tir::Allocate`
node, or can be passed in as parameters to a `PrimFunc`.  Buffer
access is done using `tvm::tir::BufferLoad` and
`tvm::tir::BufferStore` for reads and writes, respectively.

When a TIR graph is passed into the low-level code generator
`tvm::codegen::Build`, the rank of each buffer must be supported by
the target code generator.  Typically, this will mean generating a
single index representing access into flat memory.  Some code
generators may attach alternative semantics to `rank>1`
buffers (e.g. rank-2 buffers to represent texture memory on OpenCL).
A low-level code generator should check the rank of the buffers it is
acting on, and give a diagnostic error for unsupported rank.

To define a layout transformation in a TE schedule, use the
`transform_layout` method of a schedule, as shown below.  The
argument to `transform_layout` is a function that accepts a list of
`tvm.tir.Var` representing a logical index, and outputs a list of
`tvm.tir.PrimExpr` giving the corresponding physical index.  If
`transform_layout` isn't called, then no additional layout
transformations are applied.
For example, the code below defines the reordering from NHWC logical
layout to NCHWc physical layout.

```python
# Compute definition, written in terms of NHWC logical axes
B = te.compute(A.shape, lambda n,h,w,c: A[n,h,w,c])
s = te.create_schedule(B.op)

def nhwc_to_nchwc(n, h, w, c):
    return [n, c//4, h, w, c%4]

transformed_nchwc_axes = s[B].transform_layout(nhwc_to_nchwc)

# Compute definition that would produce an equivalent physical layout
B_equivalent = te.compute(
    [A.shape[0], A.shape[3]//4, A.shape[1], A.shape[2], 4],
    lambda n, c_outer, h, w, c_inner: A[n, h, w, 4*c_outer+c_inner],
)
```

By default, after all explicitly specified layout transformations are
applied, all axes are flattened to a single axis by following a
row-major traversal.  This produces a 1-d buffer, which corresponds to
flat memory.  To produce `rank>1` buffers in the physical layout,
insert `te.AXIS_SEPARATOR` into the axis list returned by the physical
layout function.  These separators define groups of axes, where each
group is combined into a single physical axis.

```python
B = te.compute(shape=(M,N,P,Q), ...)
s = te.create_schedule(B.op)

# Default, produces a 1-d allocation with shape (M*N*P*Q,)
s[B].transform_layout(lambda m,n,p,q: [m,n,p,q])

# One separator, produces a 2-d allocation with shape (M*N, P*Q).
s[B].transform_layout(lambda m,n,p,q: [m, n, te.AXIS_SEPARATOR, p, q])

# Two separators, produces a 3-d allocation with shape (M, N*P, Q).
s[B].transform_layout(lambda m,n,p,q: [m, te.AXIS_SEPARATOR, n, p, te.AXIS_SEPARATOR, q])

# Can be used along with reorders and splits.
s[B].transform_layout(lambda m,n,p,q: [m, q//4, n, te.AXIS_SEPARATOR, p, q%4])
```

The `te.AXIS_SEPARATOR` object exists only within the API interface,
and is not part of the representation of the layout transformation
within the generated TIR graph.  Instead, the TIR graph will contain
an integer list of axis separators, to be used when flattening buffers
to device-supported rank in the `StorageFlatten` or `FlattenBuffer`
passes.

If the tensor whose layout is being transformed is the result of
`te.compute`, then the loop iteration order over that tensor will be
rewritten to follow the updated memory layout.  If the loop iteration
order is modified, the new loop iteration variables will be returned
from `transform_layout()`.

```python
A = te.placeholder(shape=[16,64,128])
B = te.compute(A.shape, lambda i,j,k: 2*A[i,j,k])

s = te.create_schedule(B.op)

# A is an input placeholder, and doesn't have nested loops that
# generate it.  Therefore, while the layout of A is rewritten along
# with any reads/writes into A, there are no loop iterators to be
# rewritten and no loop iterators are returned.
s[A].transform_layout(lambda i,j,k: [i*64 + j, k//4, k%4])

# B is a computed tensor, and is computed inside a sequence of nested
# loops.  Therefore, when B's layout is rewritten, those nested loops
# are also rewritten, and the corresponding loop iterators are
# returned.
i_outer, jk_merged, i_inner = s[B].transform_layout(lambda i,j,k: [i//4, 128*j + k, i%4])

# The loop iterators returned by transform_layout() can be used later
# in the schedule, if the iteration order should be different from the
# layout order of the output tensor.
s[B].reorder(i_outer, i_inner, jk_merged)
```
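As a further illustration, the sketch below shows how the iterators
returned by `transform_layout` might combine with existing TE
primitives such as `vectorize` and `parallel`.  The shapes and the
NCHWc packing are assumptions chosen for this example, and the code
assumes the proposed `transform_layout` behavior described above.

```python
from tvm import te

A = te.placeholder((8, 32, 32, 64), name="A")
B = te.compute(A.shape, lambda n, h, w, c: 2 * A[n, h, w, c], name="B")
s = te.create_schedule(B.op)

# Pack B into NCHWc.  Per this RFC, the loops producing B are
# rewritten to follow the new layout, and the rewritten iterators are
# returned in layout order.
n, c_outer, h, w, c_inner = s[B].transform_layout(
    lambda n, h, w, c: [n, c // 4, h, w, c % 4]
)

# The innermost axis has a fixed extent of 4, making it a natural
# candidate for vectorization; the outermost axis can be parallelized
# as usual.
s[B].vectorize(c_inner)
s[B].parallel(n)
```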
# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

For schedules written in either TE or TIR, the axis separators are stored
in `BufferNode::axis_separators`.  For TIR-based schedules, the
re-indexing of a buffer is performed on demand.  For TE-based schedules,
the mapping used to re-index a buffer is stored in the
`"layout_transform_map"` attribute of the `PrimFunc`, and is applied as
part of lowering.  This attribute is a map whose keys are the buffer
vars to be reshaped, and whose values are the transformations to be
applied.

Many of the utilities needed for this transformation already exist in
`iter_affine_map.h`, and are used in the implementation.  For TIR-based
schedules, the transformation primitive is applied immediately.

A buffer may be allocated with `AllocateNode`, and may be interacted
with using `BufferLoadNode` and `BufferStoreNode`.
`BufferRealizeNode` should only appear in TE-based schedules, and
should be converted to `AllocateNode`.  `LoadNode` and `StoreNode`
are deprecated.

## Impacted TIR Nodes

- BufferNode
  - Describes an N-d buffer.  This may directly represent a tensor (N-d
    buffer produced by TE), a flat memory array (1-d buffer as input
    to the low-level codegen), or intermediates between them.

- BufferRealizeNode
  - Realization of a buffer, in logical layout.
  - For external buffers, serves as an optional annotation.  For
    internal buffers, results in allocation of memory.

- BufferLoadNode/BufferStoreNode
  - Read/write of a buffer.

  - Change from previous behavior: Will exist throughout the lowering
    process, and will be passed to the low-level code generators.
    Transformations that previously created `Load` and `Store` nodes
    will instead create `BufferLoad` and `BufferStore` nodes with 1-d
    indices.

- AllocateNode
  - Declares the allocation of a buffer, in physical layout.

  - Change from previous behavior: Previously, `AllocateNode` held the
    `buffer_var`, datatype, and buffer extents directly.  After
    implementation of this RFC, `AllocateNode` will instead hold the
    `Buffer` that is to be allocated.

- LoadNode/StoreNode
  - Read/write of a 1-d buffer, given a `Var` pointer to the start of
    the buffer and a single index.

  - Deprecated, should instead use `BufferLoad` and `BufferStore` with
    a 1-d index.

## Impacted tir Transformations

- `ApplyBufferTransforms`
  - A new pass that takes as input a TIR graph that may have buffer
    transformations stored in the `PrimFunc` attributes.  Returns
    a TIR graph with all buffer transforms applied as specified.

  - Rewrites `indices` in BufferStore/BufferLoad nodes based on the
    specified transformation.

  - The transformations are stored as a `Map<Var, Array<IndexMap>>` in
    the `"layout_transform_map"` attribute of a primfunc.
    All buffers whose `BufferNode::data` is a key in this map should
    have their physical layout rewritten.  If the array contains
    multiple transformations, they are applied sequentially.  (A
    concrete sketch of such a map follows this list.)

    A possible structure for the `IndexMapNode` is shown
    below.

    ```
    class IndexMapNode : public Object {
     public:
      /*! \brief Variables representing the indices prior to remapping.
       *
       * If initial_index is empty, then final_index should also be
       * empty, and no mapping is applied.
       */
      Array<Var> initial_index;

      /*!
       * \brief Expressions defining the indices after remapping.
       *
       * These expressions should only be in terms of the initial_index,
       * and must be expressible as a `tvm::arith::IterSumExpr`.  The
       * mapping from `initial_index` to `final_index` must be injective.
       *
       * If final_index is empty, then initial_index should also be
       * empty, and the map is an identity function.
       */
      Array<PrimExpr> final_index;
    };
    ```

  - After applying the transformations, the
    `"layout_transform_map"` attribute should be removed.
    This ensures that additional application of
    `ApplyBufferTransforms` has no effect.

- FlattenBuffer/StorageFlatten

  - Existing passes that convert from logical layout to physical
    layout for TE schedules (StorageFlatten) or TensorIR schedules
    (FlattenBuffer).

  - The axis separators are stored in the `Buffer` object as
    `BufferNode::axis_separators`.  All buffers that share the same
    `BufferNode::data` should be flattened to an
    output buffer of rank `axis_separators.size()+1`.  All other
    buffers should be flattened to a 1-d output buffer.

  - After flattening a buffer to an N-d output, the corresponding
    value in the `axis_separators` should be set to `range(N-1)`.
    This ensures that repeated application of the flattening passes
    has no additional effect.  (The list shouldn't be deleted
    entirely, as that would cause a flattened rank-`N` buffer and an
    unflattened rank-`N` buffer to have identical representations.)
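For concreteness, the sketch below (referenced from the list above)
expresses the guide-level NHWC to NCHWc mapping in this
representation, and applies it to a concrete index.  The `IndexMap`
class here is a plain-Python stand-in for the proposed `IndexMapNode`,
which would be a TVM object, so its constructor is an assumption of
this sketch.

```python
from typing import List, NamedTuple

import tvm
from tvm import tir


# Plain-Python stand-in for the proposed IndexMapNode, used only to
# make the example concrete.
class IndexMap(NamedTuple):
    initial_index: List[tir.Var]
    final_index: List[tir.PrimExpr]


# The NHWC -> NCHWc mapping from the guide-level example.
n, h, w, c = [tir.Var(name, "int32") for name in "nhwc"]
nhwc_to_nchwc = IndexMap(
    initial_index=[n, h, w, c],
    final_index=[n, c // 4, h, w, c % 4],
)

# Applying the map substitutes concrete indices for the initial index
# variables, then simplifies.  For the index (11, 37, 23, 101) this
# yields [11, 25, 37, 23, 1], matching the examples below.
vmap = {
    var: tir.IntImm("int32", value)
    for var, value in zip(nhwc_to_nchwc.initial_index, [11, 37, 23, 101])
}
analyzer = tvm.arith.Analyzer()
new_index = [
    analyzer.simplify(tir.stmt_functor.substitute(expr, vmap))
    for expr in nhwc_to_nchwc.final_index
]
```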
## Examples

The following are intended as pseudo-code, and exclude details not
relevant to this RFC (e.g. dtype).  They do not necessarily correspond
to the final version of TensorIR that implements this RFC.  Numeric
values are shown unsimplified to indicate where they come from.

The first example shows a 2-d buffer with no layout transformations
explicitly specified.  The generated `PrimFunc` has no
`"layout_transform_map"` attribute, and so the default
behavior is used, applying a row-major traversal to generate a flat
1-d buffer.

```python
# In TE schedule, no call to transform_layout.

# Initial TIR graph
x = Buffer(name="x", shape=[64,128])
with Allocate(x):
    val = BufferLoad(x, [10, 15])
    BufferStore(x, 7, [20, 23])

# After flattening to 1-d
x = Var(name="x")
with Allocate(x, shape=[64*128]):
    val = BufferLoad(x, [10*128 + 15])
    BufferStore(x, 7, [20*128 + 23])
```

This next example shows a 2-d logical buffer, which is lowered to a
1-d physical buffer.  `transform_layout` has been used to define a
physical layout whose fastest changing dimension corresponds to the
first index in the logical layout.

```python
# In TE schedule
# s[x].transform_layout(lambda i,j: [j,i])

# Initial TIR graph
attrs["layout_transform_map"][x] = lambda i,j: [j,i]
x = Buffer(name="x", shape=[64,128])
with Allocate(x):
    val = BufferLoad(x, [10, 15])
    BufferStore(x, 7, [20, 23])

# After applying the explicit reordering
x = Buffer(name="x", shape=[128,64])
with Allocate(x):
    val = BufferLoad(x, [15, 10])
    BufferStore(x, 7, [23, 20])

# After flattening to 1-d
x = Var(name="x")
with Allocate(x, shape=[128*64]):
    val = BufferLoad(x, [15*64 + 10])
    BufferStore(x, 7, [23*64 + 20])
```

The next example shows a remapping from NHWC logical layout to NCHWc
physical layout.  The 4 logical axes are expanded to 5 logical axes
during the `ApplyBufferTransforms` pass, then flattened into 1 physical
axis during StorageFlatten/FlattenBuffer.

```python
# In TE schedule
# s[x].transform_layout(lambda n,h,w,c: [n, c//4, h, w, c%4])

# Initial TIR graph
attrs["layout_transform_map"][x] = lambda n,h,w,c: [n, c//4, h, w, c%4]
x = Buffer(name="x", shape=[16,64,64,128], axis_separators=[])
with Allocate(x):
    val = BufferLoad(x, [11, 37, 23, 101])

# After applying the explicit reordering
x = Buffer(name="x", shape=[16, 128/4, 64, 64, 4], axis_separators=[])
with Allocate(x):
    val = BufferLoad(x, index=[11, floor(101/4), 37, 23, 101%4])

# After flattening to 1-d
x = Var(name="x")
with Allocate(x, shape=[16 * (128/4) * 64 * 64 * 4]):
    val = BufferLoad(x, index=[(128/4)*64*64*4*11 + 64*64*4*floor(101/4) + 64*4*37 + 4*23 + 101%4])
```

Lastly, an example of remapping from `NHWC` logical layout to `NCHWc`
physical layout, packed into a 2-d physical layout with `NCH` in the
first physical axis and `Wc` in the second physical axis.  This
matches the layout currently used by the `"global.texture"` scope for
texture memory.  The change applied during `ApplyBufferTransforms` is
identical to the previous example, but StorageFlatten produces a 2-d
physical index.  The interpretation of this 2-d index depends on the
target-specific codegen.

```python
# In TE schedule
# s[x].transform_layout(lambda n,h,w,c: [n, c//4, h, te.AXIS_SEPARATOR, w, c%4])

# Initial TIR graph
attrs["layout_transform_map"][x] = lambda n,h,w,c: [n, c//4, h, w, c%4]
x = Buffer(name="x", shape=[16,64,64,128], axis_separators=[2])
with Allocate(x):
    val = BufferLoad(x, [11, 37, 23, 101])

# After applying the explicit reordering.
x = Buffer(name="x", shape=[16, 128/4, 64, 64, 4], axis_separators=[2])
with Allocate(x):
    val = BufferLoad(x, index=[11, floor(101/4), 37, 23, 101%4])

# After applying StorageFlatten or FlattenBuffer.  The final result is
# 2-d, due to the te.AXIS_SEPARATOR used in the `.transform_layout`.
# The `axis_separators` are set to [0], to distinguish this 2-d flattened
# buffer from a 2-d unflattened buffer.
x = Buffer(name="x", shape=[16 * (128/4) * 64, 64*4], axis_separators=[0])
with Allocate(x, shape=[16 * (128/4) * 64, 64 * 4]):
    val = BufferLoad(x, index=[(128/4)*64*11 + 64*floor(101/4) + 37, 4*23 + 101%4])
```
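As a cross-check on the index arithmetic in the last two examples (not
part of the RFC itself), numpy's row-major `np.ravel_multi_index`
reproduces the flattened offsets; with an axis separator, each group
of axes is flattened independently.

```python
import numpy as np

# Logical NHWC index (n, h, w, c) = (11, 37, 23, 101) from the examples.
n, h, w, c = 11, 37, 23, 101

# The transformed 5-d index and shape after the explicit reordering.
nchwc_index = (n, c // 4, h, w, c % 4)
nchwc_shape = (16, 128 // 4, 64, 64, 4)

# Row-major flattening to a single physical axis (third example).
flat_1d = np.ravel_multi_index(nchwc_index, nchwc_shape)
assert flat_1d == (128//4)*64*64*4*11 + 64*64*4*(101//4) + 64*4*37 + 4*23 + 101 % 4

# With an axis separator after the third transformed axis, each axis
# group is flattened independently (fourth example).
flat_2d = (
    np.ravel_multi_index(nchwc_index[:3], nchwc_shape[:3]),
    np.ravel_multi_index(nchwc_index[3:], nchwc_shape[3:]),
)
assert flat_2d == ((128//4)*64*11 + 64*(101//4) + 37, 4*23 + 101 % 4)
```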
# Drawbacks
[drawbacks]: #drawbacks

This change may make it more difficult to reason about the memory
layout when writing the `te.compute` definition.  When the physical
layout differs from the logical layout, it isn't guaranteed that
`A[i]` and `A[i+1]` will be adjacent.  For example, given a tensor
whose compute definition is written in `NHWC` layout and whose layout
transformation to `NCHWc` is defined by `[n, c//4, h, w, c%4]`,
locations `(0,0,0,3)` and `(0,0,0,4)` in the compute definition will
not be adjacent in memory.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

- Can these design goals be met with existing features?

  The `te.compute` function can be used to define an updated layout.
  However, this introduces a new tensor that must be inlined to avoid
  additional memory allocation, and cannot be used for input
  parameters.

  This design applies equally to tensors defined as a result of a
  computation and to input tensors.  In both cases, the
  `transform_layout` causes all reads/writes to that buffer to obey
  the specified layout.  In the case of input tensors, it states that
  the tensors passed in will be in the specified format.
- Should buffer transformations be a node within a TIR graph, or an
  attribute?

  Option 1 is preferred.

  - Option 1: The transformations are stored in attributes of
    `PrimFunc`.

    This makes it clear that the transformations apply to all uses of
    the buffer within the graph, and are not scoped to some region of
    the TIR graph.

  - Option 2: The transformations are stored in a node that inherits
    from `tir::Stmt`.

    This would be easier for other passes to visit using
    `StmtVisitor`, if the layout transformations require modification.
    However, it would add confusion if a `Stmt` impacts buffers far
    outside its own scope.

- When should the `tir::transform::ApplyBufferTransforms` pass be
  applied?

  Applying it at the end of phase-2 in `driver_api.cc::CreatePassList`
  satisfies both of the following conditions.

  - To ensure that host and device have the same definition for buffer
    layout, it should occur before the host/device split in
    `MakePackedAPI`.

  - Since other transformations can make use of buffer
    transformations (e.g. `InjectDoubleBuffer` mapping to a new buffer
    shape), it should otherwise be as late as possible in the
    lowering flow.

- Should buffer transformations re-use functionality of other nodes?

  Option 1 is preferred.

  - Option 1: Add buffer transformations as an attribute to the
    `PrimFunc`.

  - Option 2: In TE-based schedules, `AttrStmtNode` could give the
    buffer to be transformed, along with the transformation to be
    applied, similar to how `buffer_bind_scope` is currently handled.

    However, a `BufferTransform` must contain multiple objects that
    are not derived from `PrimExpr` (the buffer to be transformed and
    the mapping to be applied), while `AttrStmtNode` only allows a
    single `ObjectRef` node and a `PrimExpr` value.

  - Option 3: In TensorIR-based schedules, `MatchBufferRegion` could
    be extended to also include a transformation while performing the
    buffer replacement.

    However, this could make it more difficult to reason about which
    locations in the buffer region are being accessed.

  - Option 4: The `BufferNode` object could contain an array of
    transformations that should be applied to it during the lowering
    process.  This would be convenient and allow for arbitrarily many
    transformations.

    However, this wouldn't follow the TVM convention of keeping
    annotations external to the node itself.

- Where should transformations that apply to the function inputs be
  specified?

  Option 1 is preferred.

  - Option 1: Any `BufferTransform` that describes a buffer in the
    `PrimFuncNode::buffer_map` gets applied to that buffer.

    Would require two traversals, the first to locate all buffer
    transforms, and the second to apply them.

  - Option 2: `BufferTransform` nodes listed in the `PrimFunc::attrs`
    under a `"buffer_argument_transforms"` key apply to the function
    arguments.

    Would only need a single traversal to apply.

    Would require other passes to be aware of where a buffer was first
    defined, in order to add it to the appropriate location.
- What arguments should the function passed to `transform_layout`
  accept?

  In these examples, the tensor is rank `N` prior to the
  transformation.

  Option 3 is preferred.

  - Option 1: Accept a list of length `N`.  Each element of the list
    is a variable corresponding to a coordinate in the input tensor.

    This would be the simplest python implementation, but would
    require additional configuration to have named variables in the
    mapping.

  - Option 2: Accept `N` named positional arguments (`func(i,j,k)`),
    where each argument is a variable corresponding to a coordinate in
    the input tensor.

    This follows the usual method of defining the `fcompute` function
    passed to `te.compute`.  This also allows the named variables to
    be used as the names in TIR, improving readability.

    However, this wouldn't allow utility functions that define
    transformations applying to an arbitrary number of indices, such
    as a layout transformation that changes the last index while
    leaving the other `N-1` indices untouched.

  - Option 3: Accept either `N` named positional arguments
    (`func(i,j,k)`), or a variable number of arguments
    (`func(*indices)`).

    This follows the same convention as the `fcompute` function passed
    to `te.compute`.  This would allow either an explicit listing of
    all indices as named arguments, or an arbitrary number of indices.
    (A sketch of the variadic form follows at the end of this
    section.)

- What convention should be used for buffer indexing?

  Previously, the interpretation of an index into a buffer depended on
  whether the buffer was being accessed with
  `BufferStore`/`BufferLoad` (pre-flattening) or with `Store`/`Load`
  (post-flattening).  Since the same data structures will be used at
  all lowering stages, the indexing should have consistent semantics.

  Option 1 is preferred.

  - Option 1: When accessing a buffer, the type and offset are based on
    `buffer->dtype`.

    The offset of an element is given by `index *
    sizeof(buffer->dtype)`.  The type of the element being accessed is
    `buffer->dtype.with_lanes(index.lanes() * buffer->dtype.lanes())`.

    This is the convention used by user-defined schedules in TE, and
    in BufferLoad/BufferStore objects.  In this convention, scalar
    loads and vectorized loads can be expressed for scalar buffers and
    vectorized buffers.  Accessing a buffer to return a different
    datatype requires declaring an aliasing buffer that shares the
    same backing array.

    ```python
    @T.prim_func
    def scalar_load_from_scalar_buffer(A: T.Buffer[(64,), "float32"]):
        assert A[0].dtype == "float32"


    @T.prim_func
    def vector_load_from_vector_buffer(A: T.Buffer[(16,), "float32x4"]):
        assert A[0].dtype == "float32x4"


    @T.prim_func
    def vector_load_from_aliased_vector_buffer(A: T.Buffer[(16,), "float32x4"]):
        A_vector_2 = T.buffer_decl([32], "float32x2", data=A.data)
        assert A[0].dtype == "float32x4"
        assert A_vector_2[0].dtype == "float32x2"


    @T.prim_func
    def vector_load_from_scalar_buffer_option1(A: T.Buffer[(64,), "float32"]):
        assert A[T.ramp(0, 1, 4)].dtype == "float32x4"


    @T.prim_func
    def vector_load_from_scalar_buffer_option2(A: T.Buffer[(64,), "float32"]):
        A_vector = T.buffer_decl([16], "float32x4", data=A.data)
        assert A_vector[0].dtype == "float32x4"


    @T.prim_func
    def scalar_load_from_vector_buffer(A: T.Buffer[(16,), "float32x4"]):
        A_scalar = T.buffer_decl([64], "float32", data=A.data)
        assert A_scalar[0].dtype == "float32"
    ```

    - Pro: The return type of `buf[0]` is always `buf.dtype`, even
      when `buf.dtype` is a vectorized type.

    - Pro: No changes needed to user-defined schedules.

    - Con: Requires updates to code generators to follow this new
      convention.  However, the code generators will already require
      updates to support BufferLoad/BufferStore.

  - Option 2: When accessing a buffer, the type and offset are based on
    `buffer->dtype.element_of()`.

    The offset of an element is given by `index *
    sizeof(buffer->dtype.element_of())`.  The type of the element
    being accessed is `buffer->dtype.with_lanes(index.lanes())`.

    Prior to this RFC, this is the convention used by Load/Store
    nodes.  In this convention, scalar loads and vectorized loads can
    be expressed for scalar buffers and vectorized buffers.  Accessing
    a buffer to return a vectorized datatype requires using a
    vectorized index, even if the buffer holds a vectorized datatype.

    ```python
    @T.prim_func
    def scalar_load_from_scalar_buffer(A: T.Buffer[(64,), "float32"]):
        assert A[0].dtype == "float32"


    @T.prim_func
    def vector_load_from_vector_buffer(A: T.Buffer[(16,), "float32x4"]):
        assert A[T.ramp(0, 1, 4)].dtype == "float32x4"


    @T.prim_func
    def scalar_load_from_vector_buffer(A: T.Buffer[(16,), "float32x4"]):
        assert A[0].dtype == "float32"


    @T.prim_func
    def vector_load_from_scalar_buffer(A: T.Buffer[(64,), "float32"]):
        assert A[T.ramp(0, 1, 4)].dtype == "float32x4"
    ```

    - Pro: The number of lanes of output can be determined solely from
      the index used to access the buffer.  That is, `A[0]` is
      guaranteed to have one lane of output, and `A[Ramp(0, stride=1,
      lanes=4)]` is guaranteed to have four lanes of output.

    - Con: Access of a buffer with a scalar index does not always have
      the same datatype as the buffer.  If the buffer has a vectorized
      datatype, then `buf[0].dtype != buf.dtype`.

    - Con: Needs an explicit check for vectorized types at the codegen
      level.

    - Con: Requires updates to user-defined schedules.
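To illustrate the variadic form permitted by Option 3 above, here is a
short sketch; `pack_innermost` is a hypothetical helper invented for
this example, not an API defined by this RFC.

```python
# Hypothetical helper using the variadic form of Option 3: it splits
# only the innermost axis by `factor`, leaving the other N-1 indices
# untouched, and so applies to a tensor of any rank.
def pack_innermost(factor):
    def transform(*indices):
        *outer, inner = indices
        return [*outer, inner // factor, inner % factor]
    return transform

# Usage with the proposed API, for tensors of any rank:
#   s[B].transform_layout(pack_innermost(4))
```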
# Prior art
[prior-art]: #prior-art

- cuDNN has an [explicit enumeration of allowed input
  formats](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnTensorFormat_t),
  which are specific to image formatting.

- The reorder/split/flatten sequence is equivalent in numpy to using
  `np.reshape` to split the logical axes, then `np.transpose` to
  reorder them, then `np.reshape` to merge multiple axes into the N-d
  physical axes, as demonstrated in the sketch below.
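The following sketch (not part of the RFC) demonstrates that
equivalence concretely for the NHWC to NCHWc packing with one axis
separator; the shape `(16, 8, 8, 32)` is an arbitrary illustrative
choice.

```python
import numpy as np

# NHWC -> NCHWc via reshape/transpose/reshape, mirroring the
# reorder/split/flatten sequence described above.
a = np.arange(16 * 8 * 8 * 32, dtype="float32").reshape(16, 8, 8, 32)

split = a.reshape(16, 8, 8, 32 // 4, 4)         # split c into (c//4, c%4)
reordered = split.transpose(0, 3, 1, 2, 4)      # reorder to (n, c//4, h, w, c%4)
flat = reordered.reshape(16 * (32 // 4) * 8, 8 * 4)  # merge into the 2-d physical shape

# Each element of the logical NHWC tensor is found at the 2-d physical
# index derived in the texture-memory example.
n, h, w, c = 3, 5, 6, 21
assert flat[(32 // 4) * 8 * n + 8 * (c // 4) + h, 4 * w + c % 4] == a[n, h, w, c]
```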
# Unresolved questions
[unresolved-questions]: #unresolved-questions

- Should the `te.AXIS_SEPARATOR` appear in the TIR graph?

  Option 1 is preferred.

  - Option 1: The `te.AXIS_SEPARATOR` is a TE-specific concept, and
    does not appear in the generated TIR graph.  Instead, it changes
    the `BufferTransform` node that represents the flattening of
    buffers to a device-supported number of indices.

    This would be a unified way to represent all layout
    transformations in the TIR graph, which may or may not change the
    rank of the buffer.  The flattening of buffers to a
    device-supported rank would be handled identically to any other
    layout transformation, rather than having an implicit row-major
    traversal.

  - Option 2: The `te.AXIS_SEPARATOR` is represented in the TIR graph,
    and alters the behavior of the `StorageFlatten` pass.  There is no
    `BufferTransform` node that represents the flattening of buffers.

    In a TIR graph without any other modifications, this would
    maintain the current behavior of the `StorageFlatten` pass, which
    reduces the N-d buffer to a 1-d buffer by a row-major traversal.
    In a TIR graph with some additional annotation to represent the
    `M` axis separators, the N-d buffer could instead be reduced to an
    `M+1`-d buffer.

- What is appropriate terminology for the size/shape/extent of
  physical and logical buffers?

  If Allocate/BufferStore/BufferLoad each hold a reference to the
  buffer they act upon, then this becomes a somewhat irrelevant
  question, as there is only one `BufferNode::shape`.

  - I am partial to using "shape" for both logical and physical N-d
    parameters, and have attempted to use it consistently throughout
    this RFC.
  - "size" implies a 1-d buffer, which wouldn't be appropriate for
    an N-d parameter.
  - "extent" would be a reasonable name, but is currently used by
    `tvm::RangeNode` to indicate a range of values that may start at
    a non-zero value.  Since the indices for logical and physical
    buffers both start at zero, using "extents" for the maximum
    index would misleadingly imply an offset.

- How should loops over an array be handled when re-writing the shape?

  To avoid memory latency issues, loops should iterate over an array
  sequentially when possible.  Iteration that is defined in terms of
  the logical layout may be inappropriate for the physical layout.

  Option 3 is preferred.

  - Option 1: Do nothing, and always keep the same iteration order,
    using the same iteration axes as defined in the compute
    definition.

    This would produce valid code, but not necessarily performant
    code.  This can be a default behavior during development, to be
    improved upon.

  - Option 2: Automatically detect loops that are over the full extent
    of an array in sequential order of the logical layout, and rewrite
    them to be in sequential order of the physical layout.

    This would reduce the memory latency issues, but raises some
    implementation questions.

    - If a loop body references multiple tensors with different
      physical layouts, which should define the loop iteration order?

    - If a series of nested loops contains a `cache_read` or
      `cache_write` stage, can these be recognized and reordered?

  - Option 3: Expose the transformed axes to be used as part of a
    schedule definition.  In TE, the return value from `AA =
    s[A].transform_layout(...)` would be a tensor, and the transformed
    axes `AA.op.axis` could then be used for the remainder of the
    schedule.

    This would allow the greatest flexibility, but would make the
    schedule dependent on the transformed layout, beyond the one
    definition.

# Future possibilities
[future-possibilities]: #future-possibilities

- Could be used to simplify many of the `topi` schedules for image
  processing.
- Could introduce variation of physical layout during `cache_read` and
  `cache_write` steps, as a potential source of optimization.