Skip to content

Conversation

@wrongtest-intellif
Copy link
Contributor

  • After [TE][TIR] Implement layout transformations, non-flat memory buffers #9727, the low level TIR memory access is on Buffer objects. The Buffer has elem_offset

    /*! \brief The offset in terms of number of dtype elements (including lanes) */
    PrimExpr elem_offset;

    Thus the addressing rule for BufferLoad, BufferStore and T.address_of should also take this offset field into consideration.

  • Also, the buffer combine functionality in StorageRewrite pass currently create alias buffers to the alloc buffer var, which denotes the start offset of the merged buffer. But it seems illness since the alias buffer's accessed indices exceed the alias buffer extent.

    # example from ut 
    @T.prim_func
    def func(A: T.Buffer[(4,), "float32"], A4: T.Buffer[(8,), "float32"]) -> None:
        A0 = T.allocate([16], "float32", "global:tag")
        A0_1 = T.buffer_decl([8], dtype="float32", data=A0.data, scope="global:tag")
        A0_2 = T.buffer_decl([8], dtype="float32", data=A0.data, scope="global:tag")
        A0_3 = T.buffer_decl([8], dtype="float32", data=A0.data, scope="global:tag")
        A0_4 = T.buffer_decl([8], dtype="float32", data=A0.data, scope="global:tag")
        for i in T.serial(8):
            A0_1[i] = A[i] + A[0] + T.float32(1)
        for i in T.serial(8):
            A0_2[8 + i] = A0_1[i] + A0_1[0] + T.float32(2)
        for i in T.serial(8):
            A0_3[i] = A0_2[8 + i] + A0_2[8] + T.float32(3)
        for i in T.serial(8):
            A0_4[8 + i] = A0_3[i] + A0_3[0] + T.float32(4)
        for i in T.serial(8):
            A4[i] = A0_4[8 + i] + A0_4[8] + T.float32(5)

    It would be great to set elem_offset to alias buffers, thus each alias buffer's address range is marked explicitly.

    @T.prim_func
    def func(A: T.Buffer[(4,), "float32"], A4: T.Buffer[(8,), "float32"]) -> None:
        A0 = T.allocate([16], "float32", "global:tag")
        A0_1 = T.buffer_decl([8], dtype="float32", data=A0.data, scope="global:tag")
        A0_2 = T.buffer_decl([8], dtype="float32", data=A0.data, elem_offset=8, scope="global:tag")
        A0_3 = T.buffer_decl([8], dtype="float32", data=A0.data, scope="global:tag")
        A0_4 = T.buffer_decl([8], dtype="float32", data=A0.data, elem_offset=8, scope="global:tag")
        for i in T.serial(8):
            A0_1[i] = A[i] + A[0] + T.float32(1)
        for i in T.serial(8):
            A0_2[i] = A0_1[i] + A0_1[0] + T.float32(2)
        for i in T.serial(8):
            A0_3[i] = A0_2[i] + A0_2[0] + T.float32(3)
        for i in T.serial(8):
            A0_4[i] = A0_3[i] + A0_3[0] + T.float32(4)
        for i in T.serial(8):
            A4[i] = A0_4[i] + A0_4[0] + T.float32(5)

@wrongtest-intellif wrongtest-intellif force-pushed the process_buffer_elem_offset_in_codegen branch 3 times, most recently from be779cc to 515133e Compare March 13, 2022 12:36
@junrushao
Copy link
Member

CC @Hzfengsy would you like to review this PR? Thanks a lot!

auto var_value = MakeValue(op->value);
var_map_[op->var.get()] = var_value;
var_value->setName(op->var->name_hint.c_str());
analyzer_->Bind(op->var, op->value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we need bind here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bind means analyzer will always expand the value expression for simplify and other functionalities. It will break the evaluation order specified by lets, which should be respsected in codegen phase. The issue could be triggered on existing testcases by the simplify this PR adds.

I can use a local analyzer on this PR's purpose but I think this is still an issue to resolve.

ExprDeepEqual deep_equal_;
// binding of let variables. Enables duplicate var defs that map to same value
std::unordered_map<Var, const LetNode*, ObjectPtrHash, ObjectPtrEqual> let_binding_;
std::unordered_map<Var, PrimExpr, ObjectPtrHash, ObjectPtrEqual> let_binding_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A local expression can be visited (eg, a simplify res) and some pointer are recorded. But when the local scope ends the backing object is expired.

@wrongtest-intellif wrongtest-intellif force-pushed the process_buffer_elem_offset_in_codegen branch from 515133e to f53961d Compare March 14, 2022 05:04
@wrongtest-intellif
Copy link
Contributor Author

cc @Lunderberg

@tqchen
Copy link
Member

tqchen commented Mar 14, 2022

Thanks @wrongtest . I agree handling elem_offset is going to be useful.

One thing to note here is that we want to ensure that the alignment is handled properly. Specifically when elem_offset is non-zero, then low-level passes and analysis would need to take elem_offset into account when analyzing possible alignment properties of an access (this is something that we should note. Perhaps in the doc).

Something to take note as well in #10505, cc @vinx13

@wrongtest-intellif
Copy link
Contributor Author

Thanks for the remind @tqchen !
For the data_alignment , I have an extra question that whether the data_alignment is about the data ptr address or about
the first element's address?

If the former, I think we should ensure the alignment property every time we create alias buffer to existing buffer vars, no matter whether the elem_offset is non-zero. I would like to change the alignment handling in StorageRewrite pass since it has explicit buffer aliasing impl code path.

/*! \brief Alignment requirement of data pointer in bytes. */
int data_alignment;
/*!
* \brief Factor of elem_offset field,
* elem_offset is guaranteed to be multiple of offset_factor.
*/
int offset_factor;

@wrongtest-intellif wrongtest-intellif force-pushed the process_buffer_elem_offset_in_codegen branch from f53961d to 82d62dc Compare March 16, 2022 07:24
@tqchen
Copy link
Member

tqchen commented Mar 16, 2022

it is about data ptr's address

Copy link
Member

@vinx13 vinx13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffer->elem_offset is already processed in FlattenBuffer / StorageFlatten, it is added to the buffer indices (https://github.com/apache/tvm/blob/main/src/tir/ir/buffer.cc#L307). Since subsequent passes can still declare buffer with elem_offset, I agree they still need to be handled. One option is to change GetFlattenBuffer(https://github.com/apache/tvm/blob/main/src/tir/ir/buffer.cc#L334) to have the output buffer elem_offset erased to ensure elem_offset are not being processed more than once.

cc @Lunderberg

@wrongtest-intellif
Copy link
Contributor Author

wrongtest-intellif commented Mar 17, 2022

Hi, @vinx13, many thanks for the notes:)

it is added to the buffer indices

I think why we may still need elem_offset is that passes after flatten may create aliased buffers, if we only add the indices, the alias buffer accesses may become not well-formed, assume a buffer A[8] is aliased to B + 8, if we add offset to the access index, it comes to

B = T.allocate([16], dtype="float32)
A = T.buffer_decl([8], dtype="float32", data=B.data)
for i in range(8):
    T.evaluate(A[i + 8])  # index out of bound form

Also note that the USMP use the form like

with T.let(A.data, T.address_of(B[8], dtype="handle")):
    for i in range(128):
        T.evaluate(A[i])

However, use let binding and addrees_of seems definitely increase the complexity if we want subsequent analyses on aliasing.

All three alternatives try to represent gep semantics since we can not directly add a buffer var of handle dtype. From my understanding, they all have some pro and cons:

  • add offset to access index

    • pro: cleanest
    • cons: the IR maybe in somewhat strange and ill form
  • bind a new buffer var

    • pro: the buffer access semantic is correct
    • cons: complexity for alias analyzing
  • use elem_offset field

    • pro: the buffer access semantic is correct
    • cons: handling the field is adhoc

@Lunderberg
Copy link
Contributor

I'm overall in favor of keeping logic in the lowering stages, so that it wouldn't need to be repeated across multiple different codegens. I agree that the current state where a non-zero elem_offset can be silently ignored isn't a good state, but I don't think it is good to handle at the codegen level, because the semantics being handled aren't specific to any one codegen.

What if we allow codegens to assume that the elem_offset is zero, but add a lowering pass that validates this assumption? Having such a check would also be a good place to define what other assumptions codegens are allowed to make about the TIR that they receive (e.g. no Prefetch nodes, no builtin::tvm_thread_allreduce, no warp scope memory), along error messages specifying which lowering pass was expected to have lowered those constructs.

@wrongtest-intellif
Copy link
Contributor Author

cc @junrushao1994 Hi, now I think this PR should be closed~, for
(1) we do not want elem_offset left to codegen part
(2) community are going on multi-dim elem offset refactor #10816

For new elem offset occured in lowering phase, I think some verifing pass as @Lunderberg suggested is a great idea, which could be introduced by other pr after refactoring done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants