Enable user-customization of GEMM-like kernels and microkernels by devinamatthews · Pull Request #583 · flame/blis

devinamatthews · 2021-12-08T17:04:34Z

This PR adds the ability for users to override the default "macrokernel" used in the various level-3 operations, as well as change the microkernel dynamically. Currently, users may override the macrokernel for any operation by using bli_obj_set_ker_fn, but the microkernel (when using the default macrokernel) can only be overridden for GEMM operations by setting the appropriate field in the gemm_ker_params_t struct.

In order to support non-standard microkernels, e.g. those that do not write to C in the same format (or at all!), the responsibility for handling edge cases has been moved into the microkernel. This required updating all existing microkernels. @nicholaiTukanov @xrq-phys can you please specifically check the POWER9/POWER10 and ARMSVE kernels to make sure that I didn't break them? The integer and float-16 kernels got their signatures changed but DO NOT currently handle edge cases due to the limitations of my macro-based approach.

…ase2

[ci skip]

[ci skip] until I fix the ARM ones.

devinamatthews · 2021-12-08T17:49:40Z

@xrq-phys is there a reason why armsve is not included in arm64?

devinamatthews · 2021-12-08T20:24:49Z

@fgvanzee all architectures that I can test are working now.

fgvanzee · 2021-12-08T21:13:35Z

Thanks, I'll begin my review.

fgvanzee · 2021-12-18T00:43:44Z

@devinamatthews Mostly done. I still have a few odds and ends to discuss during our next video chat.

hominhquan · 2021-12-21T21:53:03Z

Hi, I just want to report that the indent in https://github.com/flame/blis/blob/master/frame/1m/packm/bli_packm_alloc.c is not so coherent.

fgvanzee · 2021-12-21T22:02:16Z

Thanks @hominhquan. I fixed the tab for the return statement. Was there something else?

hominhquan · 2021-12-21T22:05:55Z

Thanks @hominhquan. I fixed the tab for the return statement. Was there something else?

Hmm, I see more than the return. Here is the screenshot of the time of me writing now (line 68 and 74).

fgvanzee · 2021-12-21T22:15:02Z

@hominhquan Hmm, strange. This may be a GitHub rendering issue. When I open the cloned file in vim and search for the tab character, everything looks okay.

Could you try opening the file in your editor to confirm what I see?

hominhquan · 2021-12-21T22:25:50Z

My favorite editor reports a mix between tab and space:

And vim, too:

$ vim https://raw.githubusercontent.com/flame/blis/master/frame/1m/packm/bli_packm_alloc.c
:set list
:set listchars=tab:▸--

fgvanzee · 2021-12-21T22:38:28Z

That looks like an older version of the file. What are the first 7 digits of the git commit hash you're on?

hominhquan · 2021-12-21T22:40:30Z

I took master cf7d616 19 days ago : https://github.com/flame/blis/blob/master/frame/1m/packm/bli_packm_alloc.c

fgvanzee · 2021-12-21T22:41:45Z

The PR branch is on 72e6499. So that's probably the reason you're still seeing spaces.

hominhquan · 2021-12-21T22:43:34Z

Ah ok, the https://github.com/flame/blis/blob/72e6499/frame/1m/packm/bli_packm_alloc.c looks good now. Sorry my bad.

fgvanzee · 2021-12-21T23:21:32Z

No problem. I'm glad it was easy to track down. 🙂 Besides, I appreciate that you are looking out for whitespace!

xrq-phys · 2022-01-01T12:14:56Z

Hi.

Sorry for the very late response.

The SVE kernels seems working properly from the CI side and the performance shall go up at the end I guess :D.

Just two things:

The pre-Asm macro used in ArmSVE kernels seems to be GEMM_UKR_SETUP_CT, but SVE supports scatter-store instructions and so do the kernels. I suppose it should be changed into GEMM_UKR_SETUP_CT_ANY?
Intrinsics svcntd() requires support for <arm_sve.h> by the compiler which is not available before GCC 10. If possible, I'd like SVE GEMM kernels to support older compilers so these lines might serve as a replacement instead. bli_armsve_config_utils.* can be moved to kernels/armsve/.

Issue 2 might be also related to @devinamatthews 's question here:

@xrq-phys is there a reason why armsve is not included in arm64?

SVE GEMM kernels needs GCC >= 8's assembler to assemble. Separating them could retain old compilers' support for Arm64.
After all, I'm using a separate armsve just because the initial SVE256 support by Linaro #396 does so.

xrq-phys · 2022-01-01T12:40:07Z

#593 for issue 2 above.

Regarding issue 1 it seems that other generic-stride-supporting kernels (haswell, etc.) are also referring GEMM_UKR_SETUP_CT not GEMM_UKR_SETUP_CT_ANY? Are the experiences that reducing to D and scattering into C is better then a direct scatter-store?

devinamatthews · 2022-01-01T15:25:04Z

The pre-Asm macro used in ArmSVE kernels seems to be GEMM_UKR_SETUP_CT, but SVE supports scatter-store instructions and so do the kernels. I suppose it should be changed into GEMM_UKR_SETUP_CT_ANY?

I removed the general-stride handling from all of the kernels in order to simplify and streamline kernel maintenance. If you measure better performance with general stride in the assembly then you can revert the changes to the inline ASM and use GEMM_UKR_SETUP_CT_ANY. However, general stride write is already a big performance hit because a) the prefetch efficiency is ~0, b) the write BW is 8-16x higher, c) higher nstruction overhead. Inline ASM really only helps with c).

devinamatthews · 2022-01-01T15:26:37Z

Issue 2 might be also related to @devinamatthews 's question here:

@xrq-phys is there a reason why armsve is not included in arm64?

SVE GEMM kernels needs GCC >= 8's assembler to assemble. Separating them could retain old compilers' support for Arm64.
After all, I'm using a separate armsve just because the initial SVE256 support by Linaro #396 does so.

@fgvanzee shouldn't it be possible to unconditionally add armsve to the arm64 config and then if necessary blacklist it based on compiler version?

xrq-phys · 2022-01-01T15:31:51Z

I see. I'll try testing it out.

On Sun, Jan 2, 2022 at 0:25 Devin Matthews ***@***.***> wrote: 1. The pre-Asm macro used in ArmSVE kernels seems to be GEMM_UKR_SETUP_CT, but SVE supports scatter-store instructions and so do the kernels. I suppose it should be changed into GEMM_UKR_SETUP_CT_ANY? I removed the general-stride handling from all of the kernels in order to simplify and streamline kernel maintenance. If you measure better performance with general stride in the assembly then you can revert the changes to the inline ASM and use GEMM_UKR_SETUP_CT_ANY. However, general stride write is already a big performance hit because a) the prefetch efficiency is ~0, b) the write BW is 8-16x higher, c) higher nstruction overhead. Inline ASM really only helps with c). — Reply to this email directly, view it on GitHub <#583 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB4GUGRHUM74PHUZ36I6Z7TUT4MFXANCNFSM5JUJSH5Q> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- -------------------- *Xu* RuQing *許* ルーキン東大理学系研究科物理学専攻藤堂研所属研究室Eメール: ***@***.*** 東京大学生Eメール: ***@***.***

jdiamondGitHub · 2022-01-02T17:49:03Z

Bear in mind that most people will be using armclang instead of gcc, because ARM promotes heavily to them that unless they use armclang, they won't get good performance on modern ARM64 targets. That being said, we were unable to get BLIS to compile with armclang, which might render the distinction moot. Fortunately, even sluggish Red Hat is now using gcc >= 8. We've also seen that you need gcc >= 11 to start getting comparable general performance to clang, with gcc 12 even exceeding clang in performance.

…

-- Jeff Diamond ***@***.***

On Sat, Jan 1, 2022, at 9:26 AM, Devin Matthews wrote: > Issue 2 might be also related to @devinamatthews <https://github.com/devinamatthews> 's question here: > >> @xrq-phys <https://github.com/xrq-phys> is there a reason why armsve is not included in arm64? >> > > SVE GEMM kernels needs GCC >= 8's assembler to assemble. Separating them could retain old compilers' support for Arm64. > After all, I'm using a separate armsve just because the initial SVE256 support by Linaro #396 <#396> does so. > @fgvanzee <https://github.com/fgvanzee> shouldn't it be possible to unconditionally add `armsve` to the `arm64` config and then if necessary blacklist it based on compiler version? — Reply to this email directly, view it on GitHub <#583 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHGHJK7X5BJMS2BV3G4EORTUT4MLXANCNFSM5JUJSH5Q>. Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

jdiamondGitHub · 2022-01-02T17:52:09Z

This might be a good example where you need to employ BLIS's use of families and subfamilies, since there's only one specifiable compiler per configuration. For example, "ARM64" isn't a "configuration" - it's a family of configurations like NEON64 and SVE1, or by analogy, x86_64.

…

-- Jeff Diamond ***@***.***

On Sat, Jan 1, 2022, at 9:26 AM, Devin Matthews wrote: > Issue 2 might be also related to @devinamatthews <https://github.com/devinamatthews> 's question here: > >> @xrq-phys <https://github.com/xrq-phys> is there a reason why armsve is not included in arm64? >> > > SVE GEMM kernels needs GCC >= 8's assembler to assemble. Separating them could retain old compilers' support for Arm64. > After all, I'm using a separate armsve just because the initial SVE256 support by Linaro #396 <#396> does so. > @fgvanzee <https://github.com/fgvanzee> shouldn't it be possible to unconditionally add `armsve` to the `arm64` config and then if necessary blacklist it based on compiler version? — Reply to this email directly, view it on GitHub <#583 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHGHJK7X5BJMS2BV3G4EORTUT4MLXANCNFSM5JUJSH5Q>. Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

devinamatthews · 2022-01-02T18:46:06Z

That being said, we were unable to get BLIS to compile with armclang, which might render the distinction moot.

@jdiamondGitHub Please open an issue for this.

fgvanzee · 2022-01-03T23:09:21Z

@fgvanzee shouldn't it be possible to unconditionally add armsve to the arm64 config and then if necessary blacklist it based on compiler version?

Yes, I think that would work.

Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather that one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations.

xrq-phys · 2022-01-07T05:28:21Z

I removed the general-stride handling from all of the kernels in order to simplify and streamline kernel maintenance. If you measure better performance with general stride in the assembly then you can revert the changes to the inline ASM and use GEMM_UKR_SETUP_CT_ANY. However, general stride write is already a big performance hit because a) the prefetch efficiency is ~0, b) the write BW is 8-16x higher, c) higher nstruction overhead. Inline ASM really only helps with c).

Indeed. Then shall we remove corresponding asm lines from the Armv8 and ArmSVE kernel files?

devinamatthews · 2022-01-07T15:54:37Z

Well, I though I had. Maybe something got messed up in the merge? If you're willing please submit a PR.

flame/blis#583 introdues 4 additional user customizable function fields. As BLIS.jl was using an outdated duplication of blis.h:obj_t, upgrading blis_jll v0.8.1 would cause the whole BLAS interface to break down. This commit should resolve #9

Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather that one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations. - (cherry picked from commit 54fa28b) Added m, n dims to gemmd/gemmlike ukernel calls. Details: - Updated the gemmd addon and the gemmlike sandbox code to use the new microkernel calling sequence, which now includes m and n dimensions so that the microkernel has all the information necessary to handle edge cases. Thanks to Jeff Diamond for catching this, which ideally would have been included in commit 54fa28b. - Retired var2 of both gemmd and gemmlike to 'attic' directories and removed their corresponding prototypes. In both cases, var2 was a variant of the block-panel algorithm where edge-case handling was abstracted away to a microkernel wrapper. (Since this is now the official behavior of BLIS microkernels, I saw no need to have it included as a separate code path.) - Comment updates. - (cherry picked from commit 3f2440b) Fixed level-3 performance bug in haswell ukernels. Details: - Fixed a performance regression affecting nearly all level-3 operations that use the 'haswell' sgemm and dgemm microkernels. This regression was introduced in 54fa28b, caused by an ill-formed conditional expression in the assembly code that controls whether cache lines of C should be prefetched as rows or as columns. Essentially, the two branches were reversed, causing incomplete prefetching to occur for both row- and column-stored instances of matrix C. Thanks to Devin Matthews for his help finding and fixing this bug. - (cherry picked from commit 71851a0)

Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather that one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations. - (cherry picked from commit 54fa28b) Added m, n dims to gemmd/gemmlike ukernel calls. Details: - Updated the gemmd addon and the gemmlike sandbox code to use the new microkernel calling sequence, which now includes m and n dimensions so that the microkernel has all the information necessary to handle edge cases. Thanks to Jeff Diamond for catching this, which ideally would have been included in commit 54fa28b. - Retired var2 of both gemmd and gemmlike to 'attic' directories and removed their corresponding prototypes. In both cases, var2 was a variant of the block-panel algorithm where edge-case handling was abstracted away to a microkernel wrapper. (Since this is now the official behavior of BLIS microkernels, I saw no need to have it included as a separate code path.) - Comment updates. - (cherry picked from commit 3f2440b) Fixed level-3 performance bug in haswell ukernels. Details: - Fixed a performance regression affecting nearly all level-3 operations that use the 'haswell' sgemm and dgemm microkernels. This regression was introduced in 54fa28b, caused by an ill-formed conditional expression in the assembly code that controls whether cache lines of C should be prefetched as rows or as columns. Essentially, the two branches were reversed, causing incomplete prefetching to occur for both row- and column-stored instances of matrix C. Thanks to Devin Matthews for his help finding and fixing this bug. - (cherry picked from commit 71851a0) Removed macros guarding m and n ukernel params. Details: - Removed the NEW_DEVIN_MICROKERNEL_MACROS macro guards around the m and n dims that were added to the microkernel signature in this commit (and originally in 54fa28b). These macros helped us use the same ARM microkernel while debugging across both pre- and post-Dec24 (2021) commits, but going forward they will no longer be needed.

Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather that one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations. - (cherry picked from commit 54fa28b) Added m, n dims to gemmd/gemmlike ukernel calls. Details: - Updated the gemmd addon and the gemmlike sandbox code to use the new microkernel calling sequence, which now includes m and n dimensions so that the microkernel has all the information necessary to handle edge cases. Thanks to Jeff Diamond for catching this, which ideally would have been included in commit 54fa28b. - Retired var2 of both gemmd and gemmlike to 'attic' directories and removed their corresponding prototypes. In both cases, var2 was a variant of the block-panel algorithm where edge-case handling was abstracted away to a microkernel wrapper. (Since this is now the official behavior of BLIS microkernels, I saw no need to have it included as a separate code path.) - Comment updates. - (cherry picked from commit 3f2440b) Fixed level-3 performance bug in haswell ukernels. Details: - Fixed a performance regression affecting nearly all level-3 operations that use the 'haswell' sgemm and dgemm microkernels. This regression was introduced in 54fa28b, caused by an ill-formed conditional expression in the assembly code that controls whether cache lines of C should be prefetched as rows or as columns. Essentially, the two branches were reversed, causing incomplete prefetching to occur for both row- and column-stored instances of matrix C. Thanks to Devin Matthews for his help finding and fixing this bug. - (cherry picked from commit 71851a0) Removed macros guarding m and n ukernel params. Details: - Removed the NEW_DEVIN_MICROKERNEL_MACROS macro guards around the m and n dims that were added to the microkernel signature in this commit (and originally in 54fa28b). These macros helped us use the same ARM microkernel while debugging across both pre- and post-Dec24 (2021) commits, but going forward they will no longer be needed. Reactivate general stride assembly in armv8a ukr. Details: - Changed the definitions of the sgemm and dgemm ukernels in bli_gemm_armv8a_asm_d6x8.c to allow the assembly region to handle mismatches between the storage of the microtile and the IO preference of the microkernel. The general stride code already existed within the armv8a kernel; all that was needed was to change the GEMM_UKR_SETUP_CT macro to GEMM_UKR_SETUP_CT_ANY. - Whitespace changes. Set BLIS_STACK_BUF_MAX_SIZE on 'altra' to 384. Details: - Modified the bli_gemm_armv8a_asm_d6x8.c microkernel to use the original GEMM_UKR_SETUP_CT_ANY macro rather than the newer _AUXCT_ANY one that obtains its ct buffer from the auxinfo_t struct. - Set the BLIS_STACK_BUF_MAX_SIZE macro to 384 in bli_family_altra.h. This overrides the default value set in bli_kernel_macro_defs.h, which is currently defined as BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE * 2, which for the 'altra' subconfig comes out to 32*64*2 = 4096 bytes. NOTE: The last two changes above appear to have fixed the mild regression in trmm_r performance on single-socket (80-core) Altra hardware. So many thanks to Leick Robinson for his tireless help in this performance debugging.

Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather that one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations. - (cherry picked from commit 54fa28b) Added m, n dims to gemmd/gemmlike ukernel calls. Details: - Updated the gemmd addon and the gemmlike sandbox code to use the new microkernel calling sequence, which now includes m and n dimensions so that the microkernel has all the information necessary to handle edge cases. Thanks to Jeff Diamond for catching this, which ideally would have been included in commit 54fa28b. - Retired var2 of both gemmd and gemmlike to 'attic' directories and removed their corresponding prototypes. In both cases, var2 was a variant of the block-panel algorithm where edge-case handling was abstracted away to a microkernel wrapper. (Since this is now the official behavior of BLIS microkernels, I saw no need to have it included as a separate code path.) - Comment updates. - (cherry picked from commit 3f2440b) Fixed level-3 performance bug in haswell ukernels. Details: - Fixed a performance regression affecting nearly all level-3 operations that use the 'haswell' sgemm and dgemm microkernels. This regression was introduced in 54fa28b, caused by an ill-formed conditional expression in the assembly code that controls whether cache lines of C should be prefetched as rows or as columns. Essentially, the two branches were reversed, causing incomplete prefetching to occur for both row- and column-stored instances of matrix C. Thanks to Devin Matthews for his help finding and fixing this bug. - (cherry picked from commit 71851a0) Fixed incorrect sizeof(type) in edge case macros. Details: - Fixed a typo in bli_edge_case_macro_defs.h that was causing too many elements to be allocated in the temporary microtile. This fix was first introduced in cb74202. (NOTE: This isn't quite a cherry-pick since cb74202 also fixes code that does not yet exist.) Removed macros guarding m and n ukernel params. Details: - Removed the NEW_DEVIN_MICROKERNEL_MACROS macro guards around the m and n dims that were added to the microkernel signature in this commit (and originally in 54fa28b). These macros helped us use the same ARM microkernel while debugging across both pre- and post-Dec24 (2021) commits, but going forward they will no longer be needed. Reactivate general stride assembly in armv8a ukr. Details: - Changed the definitions of the sgemm and dgemm ukernels in bli_gemm_armv8a_asm_d6x8.c to allow the assembly region to handle mismatches between the storage of the microtile and the IO preference of the microkernel. The general stride code already existed within the armv8a kernel; all that was needed was to change the GEMM_UKR_SETUP_CT macro to GEMM_UKR_SETUP_CT_ANY. - Whitespace changes. Set BLIS_STACK_BUF_MAX_SIZE on 'altra' to 384. Details: - Modified the bli_gemm_armv8a_asm_d6x8.c microkernel to use the original GEMM_UKR_SETUP_CT_ANY macro rather than the newer _AUXCT_ANY one that obtains its ct buffer from the auxinfo_t struct. - Set the BLIS_STACK_BUF_MAX_SIZE macro to 384 in bli_family_altra.h. This overrides the default value set in bli_kernel_macro_defs.h, which is currently defined as BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE * 2, which for the 'altra' subconfig comes out to 32*64*2 = 4096 bytes. NOTE: The last two changes above appear to have fixed the mild regression in trmm_r performance on single-socket (80-core) Altra hardware. So many thanks to Leick Robinson for his tireless help in this performance debugging.

Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather that one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations. - (cherry picked from commit 54fa28b) Added m, n dims to gemmd/gemmlike ukernel calls. Details: - Updated the gemmd addon and the gemmlike sandbox code to use the new microkernel calling sequence, which now includes m and n dimensions so that the microkernel has all the information necessary to handle edge cases. Thanks to Jeff Diamond for catching this, which ideally would have been included in commit 54fa28b. - Retired var2 of both gemmd and gemmlike to 'attic' directories and removed their corresponding prototypes. In both cases, var2 was a variant of the block-panel algorithm where edge-case handling was abstracted away to a microkernel wrapper. (Since this is now the official behavior of BLIS microkernels, I saw no need to have it included as a separate code path.) - Comment updates. - (cherry picked from commit 3f2440b) Fixed level-3 performance bug in haswell ukernels. Details: - Fixed a performance regression affecting nearly all level-3 operations that use the 'haswell' sgemm and dgemm microkernels. This regression was introduced in 54fa28b, caused by an ill-formed conditional expression in the assembly code that controls whether cache lines of C should be prefetched as rows or as columns. Essentially, the two branches were reversed, causing incomplete prefetching to occur for both row- and column-stored instances of matrix C. Thanks to Devin Matthews for his help finding and fixing this bug. - (cherry picked from commit 71851a0) Pass ct into microkernel via auxinfo_t. Details: - Added a .ct field to the auxinfo_t, which is used to pass a buffer local to the macrokernels into the microkernel to use for edge cases and storage/preference mismatches. NOTE: None of the microkernels currently *use* this .ct buffer. Fixed incorrect sizeof(type) in edge case macros. Details: - Fixed a typo in bli_edge_case_macro_defs.h that was causing too many elements to be allocated in the temporary microtile. This fix was first introduced in cb74202. (NOTE: This isn't quite a cherry-pick since cb74202 also fixes code that does not yet exist.) Removed macros guarding m and n ukernel params. Details: - Removed the NEW_DEVIN_MICROKERNEL_MACROS macro guards around the m and n dims that were added to the microkernel signature in this commit (and originally in 54fa28b). These macros helped us use the same ARM microkernel while debugging across both pre- and post-Dec24 (2021) commits, but going forward they will no longer be needed. Reactivate general stride assembly in armv8a ukr. Details: - Changed the definitions of the sgemm and dgemm ukernels in bli_gemm_armv8a_asm_d6x8.c to allow the assembly region to handle mismatches between the storage of the microtile and the IO preference of the microkernel. The general stride code already existed within the armv8a kernel; all that was needed was to change the GEMM_UKR_SETUP_CT macro to GEMM_UKR_SETUP_CT_ANY. - Whitespace changes. Set BLIS_STACK_BUF_MAX_SIZE on 'altra' to 384. Details: - Modified the bli_gemm_armv8a_asm_d6x8.c microkernel to use the original GEMM_UKR_SETUP_CT_ANY macro rather than the newer _AUXCT_ANY one that obtains its ct buffer from the auxinfo_t struct. - Set the BLIS_STACK_BUF_MAX_SIZE macro to 384 in bli_family_altra.h. This overrides the default value set in bli_kernel_macro_defs.h, which is currently defined as BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE * 2, which for the 'altra' subconfig comes out to 32*64*2 = 4096 bytes. NOTE: The last two changes above appear to have fixed the mild regression in trmm_r performance on single-socket (80-core) Altra hardware. So many thanks to Leick Robinson for his tireless help in this performance debugging.

devinamatthews added 6 commits December 7, 2021 13:47

Import changes from obj_t_makeover on top of master.

a3a603d

Pick up missed changes for gemm 1m/md.

5f5741f

All working now.

b5e3fa2

Merge remote-tracking branch 'upstream/master' into obj_t_makeover_ph…

489223f

…ase2

Move syrk_diagonal example.

ea6cc1f

Cleanup.

0c91323

[ci skip]

devinamatthews requested a review from fgvanzee December 8, 2021 17:04

devinamatthews added 2 commits December 8, 2021 11:09

Trigger Travis build.

e5c72e0

Missed some kernels.

4638cc6

[ci skip] until I fix the ARM ones.

devinamatthews added 2 commits December 8, 2021 19:37

Fix missed issues in ARM32 and ARMSVE kernels.

0b21b02

Fix munged Sandybridge sgemm ukr.

550bff7

Trivial whitespace, comment, and code changes.

dda657f

fgvanzee added 2 commits December 21, 2021 12:28

Relocate edge-case gemm ukr macros to new file.

e8ce74a

Trivial comment/whitespace changes.

72e6499

fgvanzee merged commit 54fa28b into master Dec 24, 2021

xrq-phys mentioned this pull request Jan 1, 2022

Evict <arm_sve.h> Requirement for SVE GEMM #593

Merged

devinamatthews deleted the obj_t_makeover_phase2 branch January 26, 2022 18:48

xrq-phys mentioned this pull request Feb 3, 2022

Arm64 & ArmSVE: Remove Generic Stride Assembly #608

Merged

3 tasks

xrq-phys mentioned this pull request Feb 19, 2022

Adapt Upstream Object's Costomizable Fields JuliaLinearAlgebra/BLIS.jl#10

Merged

Conversation

devinamatthews commented Dec 8, 2021

Uh oh!

devinamatthews commented Dec 8, 2021

Uh oh!

devinamatthews commented Dec 8, 2021

Uh oh!

fgvanzee commented Dec 8, 2021

Uh oh!

fgvanzee commented Dec 18, 2021

Uh oh!

hominhquan commented Dec 21, 2021

Uh oh!

fgvanzee commented Dec 21, 2021

Uh oh!

hominhquan commented Dec 21, 2021

Uh oh!

fgvanzee commented Dec 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hominhquan commented Dec 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fgvanzee commented Dec 21, 2021

Uh oh!

hominhquan commented Dec 21, 2021

Uh oh!

fgvanzee commented Dec 21, 2021

Uh oh!

hominhquan commented Dec 21, 2021

Uh oh!

fgvanzee commented Dec 21, 2021

Uh oh!

xrq-phys commented Jan 1, 2022

Uh oh!

xrq-phys commented Jan 1, 2022

Uh oh!

devinamatthews commented Jan 1, 2022

Uh oh!

devinamatthews commented Jan 1, 2022

Uh oh!

xrq-phys commented Jan 1, 2022 via email

Uh oh!

jdiamondGitHub commented Jan 2, 2022 via email

Uh oh!

jdiamondGitHub commented Jan 2, 2022 via email

Uh oh!

devinamatthews commented Jan 2, 2022

Uh oh!

fgvanzee commented Jan 3, 2022

Uh oh!

xrq-phys commented Jan 7, 2022

Uh oh!

devinamatthews commented Jan 7, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

fgvanzee commented Dec 21, 2021 •

edited

Loading

hominhquan commented Dec 21, 2021 •

edited

Loading