Enable user-customization of GEMM-like kernels and microkernels#583
Enable user-customization of GEMM-like kernels and microkernels#583
Conversation
[ci skip] until I fix the ARM ones.
|
@xrq-phys is there a reason why |
|
@fgvanzee all architectures that I can test are working now. |
|
Thanks, I'll begin my review. |
|
@devinamatthews Mostly done. I still have a few odds and ends to discuss during our next video chat. |
|
Hi, I just want to report that the indent in https://github.com/flame/blis/blob/master/frame/1m/packm/bli_packm_alloc.c is not so coherent. |
|
Thanks @hominhquan. I fixed the tab for the |
Hmm, I see more than the |
|
@hominhquan Hmm, strange. This may be a GitHub rendering issue. When I open the cloned file in Could you try opening the file in your editor to confirm what I see? |
|
That looks like an older version of the file. What are the first 7 digits of the git commit hash you're on? |
|
I took |
|
The PR branch is on 72e6499. So that's probably the reason you're still seeing spaces. |
|
Ah ok, the https://github.com/flame/blis/blob/72e6499/frame/1m/packm/bli_packm_alloc.c looks good now. Sorry my bad. |
|
No problem. I'm glad it was easy to track down. 🙂 Besides, I appreciate that you are looking out for whitespace! |
|
Hi. Sorry for the very late response. The SVE kernels seems working properly from the CI side and the performance shall go up at the end I guess :D. Just two things:
Issue 2 might be also related to @devinamatthews 's question here:
SVE GEMM kernels needs GCC >= 8's assembler to assemble. Separating them could retain old compilers' support for Arm64. |
|
#593 for issue 2 above. Regarding issue 1 it seems that other generic-stride-supporting kernels ( |
I removed the general-stride handling from all of the kernels in order to simplify and streamline kernel maintenance. If you measure better performance with general stride in the assembly then you can revert the changes to the inline ASM and use |
@fgvanzee shouldn't it be possible to unconditionally add |
|
I see. I'll try testing it out.
On Sun, Jan 2, 2022 at 0:25 Devin Matthews ***@***.***> wrote:
1. The pre-Asm macro used in ArmSVE kernels seems to be
GEMM_UKR_SETUP_CT, but SVE supports scatter-store instructions and so do
the kernels. I suppose it should be changed into GEMM_UKR_SETUP_CT_ANY?
I removed the general-stride handling from all of the kernels in order to
simplify and streamline kernel maintenance. If you measure better
performance with general stride in the assembly then you can revert the
changes to the inline ASM and use GEMM_UKR_SETUP_CT_ANY. However, general
stride write is already a big performance hit because a) the prefetch
efficiency is ~0, b) the write BW is 8-16x higher, c) higher nstruction
overhead. Inline ASM really only helps with c).
—
Reply to this email directly, view it on GitHub
<#583 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB4GUGRHUM74PHUZ36I6Z7TUT4MFXANCNFSM5JUJSH5Q>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
--------------------
*Xu* RuQing
*許* ルーキン
東大理学系研究科
物理学専攻藤堂研
所属研究室Eメール: ***@***.***
東京大学生Eメール: ***@***.***
|
|
Bear in mind that most people will be using armclang instead of gcc, because ARM promotes heavily to them that unless they use armclang, they won't get good performance on modern ARM64 targets. That being said, we were unable to get BLIS to compile with armclang, which might render the distinction moot.
Fortunately, even sluggish Red Hat is now using gcc >= 8. We've also seen that you need gcc >= 11 to start getting comparable general performance to clang, with gcc 12 even exceeding clang in performance.
…--
Jeff Diamond
***@***.***
On Sat, Jan 1, 2022, at 9:26 AM, Devin Matthews wrote:
> Issue 2 might be also related to @devinamatthews <https://github.com/devinamatthews> 's question here:
>
>> @xrq-phys <https://github.com/xrq-phys> is there a reason why armsve is not included in arm64?
>>
>
> SVE GEMM kernels needs GCC >= 8's assembler to assemble. Separating them could retain old compilers' support for Arm64.
> After all, I'm using a separate armsve just because the initial SVE256 support by Linaro #396 <#396> does so.
>
@fgvanzee <https://github.com/fgvanzee> shouldn't it be possible to unconditionally add `armsve` to the `arm64` config and then if necessary blacklist it based on compiler version?
—
Reply to this email directly, view it on GitHub <#583 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHGHJK7X5BJMS2BV3G4EORTUT4MLXANCNFSM5JUJSH5Q>.
Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>
|
|
This might be a good example where you need to employ BLIS's use of families and subfamilies, since there's only one specifiable compiler per configuration. For example, "ARM64" isn't a "configuration" - it's a family of configurations like NEON64 and SVE1, or by analogy, x86_64.
…--
Jeff Diamond
***@***.***
On Sat, Jan 1, 2022, at 9:26 AM, Devin Matthews wrote:
> Issue 2 might be also related to @devinamatthews <https://github.com/devinamatthews> 's question here:
>
>> @xrq-phys <https://github.com/xrq-phys> is there a reason why armsve is not included in arm64?
>>
>
> SVE GEMM kernels needs GCC >= 8's assembler to assemble. Separating them could retain old compilers' support for Arm64.
> After all, I'm using a separate armsve just because the initial SVE256 support by Linaro #396 <#396> does so.
>
@fgvanzee <https://github.com/fgvanzee> shouldn't it be possible to unconditionally add `armsve` to the `arm64` config and then if necessary blacklist it based on compiler version?
—
Reply to this email directly, view it on GitHub <#583 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHGHJK7X5BJMS2BV3G4EORTUT4MLXANCNFSM5JUJSH5Q>.
Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>
|
@jdiamondGitHub Please open an issue for this. |
Yes, I think that would work. |
Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather that one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations.
Indeed. Then shall we remove corresponding asm lines from the Armv8 and ArmSVE kernel files? |
|
Well, I though I had. Maybe something got messed up in the merge? If you're willing please submit a PR. |
flame/blis#583 introdues 4 additional user customizable function fields. As BLIS.jl was using an outdated duplication of blis.h:obj_t, upgrading blis_jll v0.8.1 would cause the whole BLAS interface to break down. This commit should resolve #9
Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather that one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations. - (cherry picked from commit 54fa28b) Added m, n dims to gemmd/gemmlike ukernel calls. Details: - Updated the gemmd addon and the gemmlike sandbox code to use the new microkernel calling sequence, which now includes m and n dimensions so that the microkernel has all the information necessary to handle edge cases. Thanks to Jeff Diamond for catching this, which ideally would have been included in commit 54fa28b. - Retired var2 of both gemmd and gemmlike to 'attic' directories and removed their corresponding prototypes. In both cases, var2 was a variant of the block-panel algorithm where edge-case handling was abstracted away to a microkernel wrapper. (Since this is now the official behavior of BLIS microkernels, I saw no need to have it included as a separate code path.) - Comment updates. - (cherry picked from commit 3f2440b) Fixed level-3 performance bug in haswell ukernels. Details: - Fixed a performance regression affecting nearly all level-3 operations that use the 'haswell' sgemm and dgemm microkernels. This regression was introduced in 54fa28b, caused by an ill-formed conditional expression in the assembly code that controls whether cache lines of C should be prefetched as rows or as columns. Essentially, the two branches were reversed, causing incomplete prefetching to occur for both row- and column-stored instances of matrix C. Thanks to Devin Matthews for his help finding and fixing this bug. - (cherry picked from commit 71851a0)
Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather that one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations. - (cherry picked from commit 54fa28b) Added m, n dims to gemmd/gemmlike ukernel calls. Details: - Updated the gemmd addon and the gemmlike sandbox code to use the new microkernel calling sequence, which now includes m and n dimensions so that the microkernel has all the information necessary to handle edge cases. Thanks to Jeff Diamond for catching this, which ideally would have been included in commit 54fa28b. - Retired var2 of both gemmd and gemmlike to 'attic' directories and removed their corresponding prototypes. In both cases, var2 was a variant of the block-panel algorithm where edge-case handling was abstracted away to a microkernel wrapper. (Since this is now the official behavior of BLIS microkernels, I saw no need to have it included as a separate code path.) - Comment updates. - (cherry picked from commit 3f2440b) Fixed level-3 performance bug in haswell ukernels. Details: - Fixed a performance regression affecting nearly all level-3 operations that use the 'haswell' sgemm and dgemm microkernels. This regression was introduced in 54fa28b, caused by an ill-formed conditional expression in the assembly code that controls whether cache lines of C should be prefetched as rows or as columns. Essentially, the two branches were reversed, causing incomplete prefetching to occur for both row- and column-stored instances of matrix C. Thanks to Devin Matthews for his help finding and fixing this bug. - (cherry picked from commit 71851a0) Removed macros guarding m and n ukernel params. Details: - Removed the NEW_DEVIN_MICROKERNEL_MACROS macro guards around the m and n dims that were added to the microkernel signature in this commit (and originally in 54fa28b). These macros helped us use the same ARM microkernel while debugging across both pre- and post-Dec24 (2021) commits, but going forward they will no longer be needed.
Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather that one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations. - (cherry picked from commit 54fa28b) Added m, n dims to gemmd/gemmlike ukernel calls. Details: - Updated the gemmd addon and the gemmlike sandbox code to use the new microkernel calling sequence, which now includes m and n dimensions so that the microkernel has all the information necessary to handle edge cases. Thanks to Jeff Diamond for catching this, which ideally would have been included in commit 54fa28b. - Retired var2 of both gemmd and gemmlike to 'attic' directories and removed their corresponding prototypes. In both cases, var2 was a variant of the block-panel algorithm where edge-case handling was abstracted away to a microkernel wrapper. (Since this is now the official behavior of BLIS microkernels, I saw no need to have it included as a separate code path.) - Comment updates. - (cherry picked from commit 3f2440b) Fixed level-3 performance bug in haswell ukernels. Details: - Fixed a performance regression affecting nearly all level-3 operations that use the 'haswell' sgemm and dgemm microkernels. This regression was introduced in 54fa28b, caused by an ill-formed conditional expression in the assembly code that controls whether cache lines of C should be prefetched as rows or as columns. Essentially, the two branches were reversed, causing incomplete prefetching to occur for both row- and column-stored instances of matrix C. Thanks to Devin Matthews for his help finding and fixing this bug. - (cherry picked from commit 71851a0) Removed macros guarding m and n ukernel params. Details: - Removed the NEW_DEVIN_MICROKERNEL_MACROS macro guards around the m and n dims that were added to the microkernel signature in this commit (and originally in 54fa28b). These macros helped us use the same ARM microkernel while debugging across both pre- and post-Dec24 (2021) commits, but going forward they will no longer be needed. Reactivate general stride assembly in armv8a ukr. Details: - Changed the definitions of the sgemm and dgemm ukernels in bli_gemm_armv8a_asm_d6x8.c to allow the assembly region to handle mismatches between the storage of the microtile and the IO preference of the microkernel. The general stride code already existed within the armv8a kernel; all that was needed was to change the GEMM_UKR_SETUP_CT macro to GEMM_UKR_SETUP_CT_ANY. - Whitespace changes. Set BLIS_STACK_BUF_MAX_SIZE on 'altra' to 384. Details: - Modified the bli_gemm_armv8a_asm_d6x8.c microkernel to use the original GEMM_UKR_SETUP_CT_ANY macro rather than the newer _AUXCT_ANY one that obtains its ct buffer from the auxinfo_t struct. - Set the BLIS_STACK_BUF_MAX_SIZE macro to 384 in bli_family_altra.h. This overrides the default value set in bli_kernel_macro_defs.h, which is currently defined as BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE * 2, which for the 'altra' subconfig comes out to 32*64*2 = 4096 bytes. NOTE: The last two changes above appear to have fixed the mild regression in trmm_r performance on single-socket (80-core) Altra hardware. So many thanks to Leick Robinson for his tireless help in this performance debugging.
Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather that one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations. - (cherry picked from commit 54fa28b) Added m, n dims to gemmd/gemmlike ukernel calls. Details: - Updated the gemmd addon and the gemmlike sandbox code to use the new microkernel calling sequence, which now includes m and n dimensions so that the microkernel has all the information necessary to handle edge cases. Thanks to Jeff Diamond for catching this, which ideally would have been included in commit 54fa28b. - Retired var2 of both gemmd and gemmlike to 'attic' directories and removed their corresponding prototypes. In both cases, var2 was a variant of the block-panel algorithm where edge-case handling was abstracted away to a microkernel wrapper. (Since this is now the official behavior of BLIS microkernels, I saw no need to have it included as a separate code path.) - Comment updates. - (cherry picked from commit 3f2440b) Fixed level-3 performance bug in haswell ukernels. Details: - Fixed a performance regression affecting nearly all level-3 operations that use the 'haswell' sgemm and dgemm microkernels. This regression was introduced in 54fa28b, caused by an ill-formed conditional expression in the assembly code that controls whether cache lines of C should be prefetched as rows or as columns. Essentially, the two branches were reversed, causing incomplete prefetching to occur for both row- and column-stored instances of matrix C. Thanks to Devin Matthews for his help finding and fixing this bug. - (cherry picked from commit 71851a0) Fixed incorrect sizeof(type) in edge case macros. Details: - Fixed a typo in bli_edge_case_macro_defs.h that was causing too many elements to be allocated in the temporary microtile. This fix was first introduced in cb74202. (NOTE: This isn't quite a cherry-pick since cb74202 also fixes code that does not yet exist.) Removed macros guarding m and n ukernel params. Details: - Removed the NEW_DEVIN_MICROKERNEL_MACROS macro guards around the m and n dims that were added to the microkernel signature in this commit (and originally in 54fa28b). These macros helped us use the same ARM microkernel while debugging across both pre- and post-Dec24 (2021) commits, but going forward they will no longer be needed. Reactivate general stride assembly in armv8a ukr. Details: - Changed the definitions of the sgemm and dgemm ukernels in bli_gemm_armv8a_asm_d6x8.c to allow the assembly region to handle mismatches between the storage of the microtile and the IO preference of the microkernel. The general stride code already existed within the armv8a kernel; all that was needed was to change the GEMM_UKR_SETUP_CT macro to GEMM_UKR_SETUP_CT_ANY. - Whitespace changes. Set BLIS_STACK_BUF_MAX_SIZE on 'altra' to 384. Details: - Modified the bli_gemm_armv8a_asm_d6x8.c microkernel to use the original GEMM_UKR_SETUP_CT_ANY macro rather than the newer _AUXCT_ANY one that obtains its ct buffer from the auxinfo_t struct. - Set the BLIS_STACK_BUF_MAX_SIZE macro to 384 in bli_family_altra.h. This overrides the default value set in bli_kernel_macro_defs.h, which is currently defined as BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE * 2, which for the 'altra' subconfig comes out to 32*64*2 = 4096 bytes. NOTE: The last two changes above appear to have fixed the mild regression in trmm_r performance on single-socket (80-core) Altra hardware. So many thanks to Leick Robinson for his tireless help in this performance debugging.
Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather that one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations. - (cherry picked from commit 54fa28b) Added m, n dims to gemmd/gemmlike ukernel calls. Details: - Updated the gemmd addon and the gemmlike sandbox code to use the new microkernel calling sequence, which now includes m and n dimensions so that the microkernel has all the information necessary to handle edge cases. Thanks to Jeff Diamond for catching this, which ideally would have been included in commit 54fa28b. - Retired var2 of both gemmd and gemmlike to 'attic' directories and removed their corresponding prototypes. In both cases, var2 was a variant of the block-panel algorithm where edge-case handling was abstracted away to a microkernel wrapper. (Since this is now the official behavior of BLIS microkernels, I saw no need to have it included as a separate code path.) - Comment updates. - (cherry picked from commit 3f2440b) Fixed level-3 performance bug in haswell ukernels. Details: - Fixed a performance regression affecting nearly all level-3 operations that use the 'haswell' sgemm and dgemm microkernels. This regression was introduced in 54fa28b, caused by an ill-formed conditional expression in the assembly code that controls whether cache lines of C should be prefetched as rows or as columns. Essentially, the two branches were reversed, causing incomplete prefetching to occur for both row- and column-stored instances of matrix C. Thanks to Devin Matthews for his help finding and fixing this bug. - (cherry picked from commit 71851a0) Pass ct into microkernel via auxinfo_t. Details: - Added a .ct field to the auxinfo_t, which is used to pass a buffer local to the macrokernels into the microkernel to use for edge cases and storage/preference mismatches. NOTE: None of the microkernels currently *use* this .ct buffer. Fixed incorrect sizeof(type) in edge case macros. Details: - Fixed a typo in bli_edge_case_macro_defs.h that was causing too many elements to be allocated in the temporary microtile. This fix was first introduced in cb74202. (NOTE: This isn't quite a cherry-pick since cb74202 also fixes code that does not yet exist.) Removed macros guarding m and n ukernel params. Details: - Removed the NEW_DEVIN_MICROKERNEL_MACROS macro guards around the m and n dims that were added to the microkernel signature in this commit (and originally in 54fa28b). These macros helped us use the same ARM microkernel while debugging across both pre- and post-Dec24 (2021) commits, but going forward they will no longer be needed. Reactivate general stride assembly in armv8a ukr. Details: - Changed the definitions of the sgemm and dgemm ukernels in bli_gemm_armv8a_asm_d6x8.c to allow the assembly region to handle mismatches between the storage of the microtile and the IO preference of the microkernel. The general stride code already existed within the armv8a kernel; all that was needed was to change the GEMM_UKR_SETUP_CT macro to GEMM_UKR_SETUP_CT_ANY. - Whitespace changes. Set BLIS_STACK_BUF_MAX_SIZE on 'altra' to 384. Details: - Modified the bli_gemm_armv8a_asm_d6x8.c microkernel to use the original GEMM_UKR_SETUP_CT_ANY macro rather than the newer _AUXCT_ANY one that obtains its ct buffer from the auxinfo_t struct. - Set the BLIS_STACK_BUF_MAX_SIZE macro to 384 in bli_family_altra.h. This overrides the default value set in bli_kernel_macro_defs.h, which is currently defined as BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE * 2, which for the 'altra' subconfig comes out to 32*64*2 = 4096 bytes. NOTE: The last two changes above appear to have fixed the mild regression in trmm_r performance on single-socket (80-core) Altra hardware. So many thanks to Leick Robinson for his tireless help in this performance debugging.



This PR adds the ability for users to override the default "macrokernel" used in the various level-3 operations, as well as change the microkernel dynamically. Currently, users may override the macrokernel for any operation by using
bli_obj_set_ker_fn, but the microkernel (when using the default macrokernel) can only be overridden for GEMM operations by setting the appropriate field in thegemm_ker_params_tstruct.In order to support non-standard microkernels, e.g. those that do not write to C in the same format (or at all!), the responsibility for handling edge cases has been moved into the microkernel. This required updating all existing microkernels. @nicholaiTukanov @xrq-phys can you please specifically check the POWER9/POWER10 and ARMSVE kernels to make sure that I didn't break them? The integer and float-16 kernels got their signatures changed but DO NOT currently handle edge cases due to the limitations of my macro-based approach.