
Conversation

@Ubospica Ubospica commented Nov 5, 2023

This PR focuses on adding support for various mma intrinsics for matmul scheduling. Specifically, this PR:

  • Adds support for transposed A in ldmatrix and mma_sync
  • Changes all T.launch_thread(tx, 32) annotations to loops of the form for tx in T.thread_binding(0, WARP_SIZE, "threadIdx.x"), which makes later transformations more convenient (see the sketch after this list).
  • Refactors some logic and adds a utility get_mma_intrin_group that returns a group of intrinsics (a usage sketch follows this list):
def get_mma_intrin_group(
    load_scope: Literal["shared", "shared.dyn"],
    store_scope: Literal["global", "shared", "shared.dyn"],
    in_dtype: Literal["float16", "int8"],
    out_dtype: Literal["float16", "float32", "int32"],
    trans_a: bool,
    trans_b: bool,
    not_use_mma_store_intrinic: bool = True,
    store_to_smem_dtype: Optional[Literal["float16", "float32", "int32"]] = None,
) -> Dict[str, str]
  • Avoids using the current mma_store intrinsic; instead, uses plain BufferStore statements.
    • This is because, when swizzling shared-memory accesses, the rearrangement scheme would span areas accessed by different mma_store calls, which makes swizzling quite complex. BufferStore statements do not have this problem.
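To illustrate the thread-binding change in the second item above, here is a minimal, hypothetical TVMScript sketch (not code from this PR; WARP_SIZE, warp_copy, and the copy body are placeholders for illustration only):

from tvm.script import tir as T

WARP_SIZE = 32  # lanes per warp

# Previously, intrinsic implementations annotated the warp lane explicitly:
#   tx = T.env_thread("threadIdx.x")
#   T.launch_thread(tx, WARP_SIZE)
# After this PR, the same binding is written as a loop that later
# transformations can analyze and rewrite like any other loop:

@T.prim_func
def warp_copy(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (WARP_SIZE,), "float16", scope="warp")
    B = T.match_buffer(b, (WARP_SIZE,), "float16", scope="warp")
    for tx in T.thread_binding(0, WARP_SIZE, "threadIdx.x"):
        with T.block("copy"):
            vi = T.axis.spatial(WARP_SIZE, tx)
            B[vi] = A[vi]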
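A hedged usage sketch of get_mma_intrin_group follows; the import path reflects where the CUDA tensor intrinsics are defined (tvm/tir/tensor_intrin/cuda.py), and the dictionary keys named in the comments are assumptions about the returned mapping rather than guaranteed names:

from tvm.tir.tensor_intrin.cuda import get_mma_intrin_group

# Query the intrinsic group for an fp16 x fp16 -> fp32 matmul whose B operand
# is transposed, with fragments loaded from dynamic shared memory.
intrin_group = get_mma_intrin_group(
    load_scope="shared.dyn",
    store_scope="shared.dyn",
    in_dtype="float16",
    out_dtype="float32",
    trans_a=False,
    trans_b=True,
)

# The result maps roles to registered intrinsic names (e.g. keys such as
# "init", "load_a", "load_b", "compute", "store", assumed here), which a
# schedule rule can hand to sch.tensorize on the corresponding blocks:
#   sch.tensorize(a_fragment_loop, intrin_group["load_a"])
#   sch.tensorize(mma_loop, intrin_group["compute"])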

This PR is used and tested in the dlight matmul schedule rule.

@spectrometerHBH @vinx13 @Hzfengsy

@Ubospica Ubospica changed the title from "[TIR] Support more intrinsics and get_mma_intrin_group utility" to "[TIR] Support more mma intrinsics and get_mma_intrin_group utility" on Nov 5, 2023
@vinx13 vinx13 merged commit db4290b into apache:main Nov 7, 2023
Ubospica added a commit to Ubospica/tvm-develop that referenced this pull request Nov 13, 2023