Skip to content

Enable user-customization of GEMM-like kernels and microkernels#583

Merged
fgvanzee merged 13 commits intomasterfrom
obj_t_makeover_phase2
Dec 24, 2021
Merged

Enable user-customization of GEMM-like kernels and microkernels#583
fgvanzee merged 13 commits intomasterfrom
obj_t_makeover_phase2

Conversation

@devinamatthews
Copy link
Copy Markdown
Member

This PR adds the ability for users to override the default "macrokernel" used in the various level-3 operations, as well as change the microkernel dynamically. Currently, users may override the macrokernel for any operation by using bli_obj_set_ker_fn, but the microkernel (when using the default macrokernel) can only be overridden for GEMM operations by setting the appropriate field in the gemm_ker_params_t struct.

In order to support non-standard microkernels, e.g. those that do not write to C in the same format (or at all!), the responsibility for handling edge cases has been moved into the microkernel. This required updating all existing microkernels. @nicholaiTukanov @xrq-phys can you please specifically check the POWER9/POWER10 and ARMSVE kernels to make sure that I didn't break them? The integer and float-16 kernels got their signatures changed but DO NOT currently handle edge cases due to the limitations of my macro-based approach.

@devinamatthews
Copy link
Copy Markdown
Member Author

@xrq-phys is there a reason why armsve is not included in arm64?

@devinamatthews
Copy link
Copy Markdown
Member Author

@fgvanzee all architectures that I can test are working now.

@fgvanzee
Copy link
Copy Markdown
Member

fgvanzee commented Dec 8, 2021

Thanks, I'll begin my review.

@fgvanzee
Copy link
Copy Markdown
Member

@devinamatthews Mostly done. I still have a few odds and ends to discuss during our next video chat.

@hominhquan
Copy link
Copy Markdown
Contributor

Hi, I just want to report that the indent in https://github.com/flame/blis/blob/master/frame/1m/packm/bli_packm_alloc.c is not so coherent.

@fgvanzee
Copy link
Copy Markdown
Member

Thanks @hominhquan. I fixed the tab for the return statement. Was there something else?

@hominhquan
Copy link
Copy Markdown
Contributor

Thanks @hominhquan. I fixed the tab for the return statement. Was there something else?

Hmm, I see more than the return. Here is the screenshot of the time of me writing now (line 68 and 74).
Screenshot from 2021-12-21 23-03-51

@fgvanzee
Copy link
Copy Markdown
Member

fgvanzee commented Dec 21, 2021

@hominhquan Hmm, strange. This may be a GitHub rendering issue. When I open the cloned file in vim and search for the tab character, everything looks okay.

Could you try opening the file in your editor to confirm what I see?

@hominhquan
Copy link
Copy Markdown
Contributor

hominhquan commented Dec 21, 2021

My favorite editor reports a mix between tab and space:
Screenshot from 2021-12-21 23-24-04

And vim, too:

$ vim https://raw.githubusercontent.com/flame/blis/master/frame/1m/packm/bli_packm_alloc.c
:set list
:set listchars=tab:▸--

Screenshot from 2021-12-21 23-27-26

@fgvanzee
Copy link
Copy Markdown
Member

That looks like an older version of the file. What are the first 7 digits of the git commit hash you're on?

@hominhquan
Copy link
Copy Markdown
Contributor

@fgvanzee
Copy link
Copy Markdown
Member

The PR branch is on 72e6499. So that's probably the reason you're still seeing spaces.

@hominhquan
Copy link
Copy Markdown
Contributor

Ah ok, the https://github.com/flame/blis/blob/72e6499/frame/1m/packm/bli_packm_alloc.c looks good now. Sorry my bad.

@fgvanzee
Copy link
Copy Markdown
Member

No problem. I'm glad it was easy to track down. 🙂 Besides, I appreciate that you are looking out for whitespace!

@fgvanzee fgvanzee merged commit 54fa28b into master Dec 24, 2021
@xrq-phys
Copy link
Copy Markdown
Collaborator

xrq-phys commented Jan 1, 2022

Hi.

Sorry for the very late response.

The SVE kernels seems working properly from the CI side and the performance shall go up at the end I guess :D.

Just two things:

  1. The pre-Asm macro used in ArmSVE kernels seems to be GEMM_UKR_SETUP_CT, but SVE supports scatter-store instructions and so do the kernels. I suppose it should be changed into GEMM_UKR_SETUP_CT_ANY?
  2. Intrinsics svcntd() requires support for <arm_sve.h> by the compiler which is not available before GCC 10. If possible, I'd like SVE GEMM kernels to support older compilers so these lines might serve as a replacement instead. bli_armsve_config_utils.* can be moved to kernels/armsve/.

Issue 2 might be also related to @devinamatthews 's question here:

@xrq-phys is there a reason why armsve is not included in arm64?

SVE GEMM kernels needs GCC >= 8's assembler to assemble. Separating them could retain old compilers' support for Arm64.
After all, I'm using a separate armsve just because the initial SVE256 support by Linaro #396 does so.

@xrq-phys
Copy link
Copy Markdown
Collaborator

xrq-phys commented Jan 1, 2022

#593 for issue 2 above.

Regarding issue 1 it seems that other generic-stride-supporting kernels (haswell, etc.) are also referring GEMM_UKR_SETUP_CT not GEMM_UKR_SETUP_CT_ANY? Are the experiences that reducing to D and scattering into C is better then a direct scatter-store?

@devinamatthews
Copy link
Copy Markdown
Member Author

  1. The pre-Asm macro used in ArmSVE kernels seems to be GEMM_UKR_SETUP_CT, but SVE supports scatter-store instructions and so do the kernels. I suppose it should be changed into GEMM_UKR_SETUP_CT_ANY?

I removed the general-stride handling from all of the kernels in order to simplify and streamline kernel maintenance. If you measure better performance with general stride in the assembly then you can revert the changes to the inline ASM and use GEMM_UKR_SETUP_CT_ANY. However, general stride write is already a big performance hit because a) the prefetch efficiency is ~0, b) the write BW is 8-16x higher, c) higher nstruction overhead. Inline ASM really only helps with c).

@devinamatthews
Copy link
Copy Markdown
Member Author

Issue 2 might be also related to @devinamatthews 's question here:

@xrq-phys is there a reason why armsve is not included in arm64?

SVE GEMM kernels needs GCC >= 8's assembler to assemble. Separating them could retain old compilers' support for Arm64.
After all, I'm using a separate armsve just because the initial SVE256 support by Linaro #396 does so.

@fgvanzee shouldn't it be possible to unconditionally add armsve to the arm64 config and then if necessary blacklist it based on compiler version?

@xrq-phys
Copy link
Copy Markdown
Collaborator

xrq-phys commented Jan 1, 2022 via email

@jdiamondGitHub
Copy link
Copy Markdown
Member

jdiamondGitHub commented Jan 2, 2022 via email

@jdiamondGitHub
Copy link
Copy Markdown
Member

jdiamondGitHub commented Jan 2, 2022 via email

@devinamatthews
Copy link
Copy Markdown
Member Author

That being said, we were unable to get BLIS to compile with armclang, which might render the distinction moot.

@jdiamondGitHub Please open an issue for this.

@fgvanzee
Copy link
Copy Markdown
Member

fgvanzee commented Jan 3, 2022

@fgvanzee shouldn't it be possible to unconditionally add armsve to the arm64 config and then if necessary blacklist it based on compiler version?

Yes, I think that would work.

dzambare pushed a commit to Meghana-vankadari/blis that referenced this pull request Jan 6, 2022
Details:
- Moved edge-case handling into the gemm microkernel. This required
  changing the microkernel API to take m and n dimension parameters.
  This required updating all existing gemm microkernel function pointer
  types, function signatures, and related definitions to take m and n
  dimensions. We also updated all existing kernels in the 'kernels' 
  directory to take m and n dimensions, and implemented edge-case 
  handling within those microkernels via a collection of new C 
  preprocessor macros defined within bli_edge_case_macro_defs.h. Also
  removed the assembly code that formerly would handle general stride 
  IO on the microtile, since this can now be handled by the same code
  that does edge cases.
- Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and
  bli_trsm_cntl_create(), where this function pointer is used in lieu of 
  the default macrokernel when it is non-NULL, and ignored when it is
  NULL.
- Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single
  function using byte pointers rather that one function for each
  floating-point datatype. Also, obtain the microkernel function pointer
  from the .ukr field of the params struct embedded within the obj_t
  for matrix C (assuming params is non-NULL and contains a non-NULL
  value in the .ukr field). Communicate both the gemm microkernel
  pointer to use as well as the params struct to the microkernel via
  the auxinfo_t struct.
- Defined gemm_ker_params_t type (for the aforementioned obj_t.params 
  struct) in bli_gemm_var.h.
- Retired the separate _md macrokernel for mixed datatype computation.
  We now use the reimplemented bli_gemm_ker_var2() instead.
- Updated gemmt macrokernels to pass m and n dimensions into microkernel
  calls.
- Removed edge-case handling from trmm and trsm macrokernels.
- Moved most of bli_packm_alloc() code into a new helper function,
  bli_packm_alloc_ex().
- Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c.
- Added test/syrk_diagonal and test/tensor_contraction directories with
  associated code to test those operations.
@xrq-phys
Copy link
Copy Markdown
Collaborator

xrq-phys commented Jan 7, 2022

I removed the general-stride handling from all of the kernels in order to simplify and streamline kernel maintenance. If you measure better performance with general stride in the assembly then you can revert the changes to the inline ASM and use GEMM_UKR_SETUP_CT_ANY. However, general stride write is already a big performance hit because a) the prefetch efficiency is ~0, b) the write BW is 8-16x higher, c) higher nstruction overhead. Inline ASM really only helps with c).

Indeed. Then shall we remove corresponding asm lines from the Armv8 and ArmSVE kernel files?

@devinamatthews
Copy link
Copy Markdown
Member Author

Well, I though I had. Maybe something got messed up in the merge? If you're willing please submit a PR.

@devinamatthews devinamatthews deleted the obj_t_makeover_phase2 branch January 26, 2022 18:48
xrq-phys pushed a commit to JuliaLinearAlgebra/BLIS.jl that referenced this pull request Feb 19, 2022
flame/blis#583 introdues 4 additional user customizable function fields.
As BLIS.jl was using an outdated duplication of blis.h:obj_t, upgrading
blis_jll v0.8.1 would cause the whole BLAS interface to break down.
This commit should resolve #9
fgvanzee added a commit that referenced this pull request Nov 3, 2022
Details:
- Moved edge-case handling into the gemm microkernel. This required
  changing the microkernel API to take m and n dimension parameters.
  This required updating all existing gemm microkernel function pointer
  types, function signatures, and related definitions to take m and n
  dimensions. We also updated all existing kernels in the 'kernels'
  directory to take m and n dimensions, and implemented edge-case
  handling within those microkernels via a collection of new C
  preprocessor macros defined within bli_edge_case_macro_defs.h. Also
  removed the assembly code that formerly would handle general stride
  IO on the microtile, since this can now be handled by the same code
  that does edge cases.
- Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and
  bli_trsm_cntl_create(), where this function pointer is used in lieu of
  the default macrokernel when it is non-NULL, and ignored when it is
  NULL.
- Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single
  function using byte pointers rather that one function for each
  floating-point datatype. Also, obtain the microkernel function pointer
  from the .ukr field of the params struct embedded within the obj_t
  for matrix C (assuming params is non-NULL and contains a non-NULL
  value in the .ukr field). Communicate both the gemm microkernel
  pointer to use as well as the params struct to the microkernel via
  the auxinfo_t struct.
- Defined gemm_ker_params_t type (for the aforementioned obj_t.params
  struct) in bli_gemm_var.h.
- Retired the separate _md macrokernel for mixed datatype computation.
  We now use the reimplemented bli_gemm_ker_var2() instead.
- Updated gemmt macrokernels to pass m and n dimensions into microkernel
  calls.
- Removed edge-case handling from trmm and trsm macrokernels.
- Moved most of bli_packm_alloc() code into a new helper function,
  bli_packm_alloc_ex().
- Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c.
- Added test/syrk_diagonal and test/tensor_contraction directories with
  associated code to test those operations.
- (cherry picked from commit 54fa28b)

Added m, n dims to gemmd/gemmlike ukernel calls.

Details:
- Updated the gemmd addon and the gemmlike sandbox code to use the new
  microkernel calling sequence, which now includes m and n dimensions so
  that the microkernel has all the information necessary to handle edge
  cases. Thanks to Jeff Diamond for catching this, which ideally would
  have been included in commit 54fa28b.
- Retired var2 of both gemmd and gemmlike to 'attic' directories and
  removed their corresponding prototypes. In both cases, var2 was a
  variant of the block-panel algorithm where edge-case handling was
  abstracted away to a microkernel wrapper. (Since this is now the
  official behavior of BLIS microkernels, I saw no need to have it
  included as a separate code path.)
- Comment updates.
- (cherry picked from commit 3f2440b)

Fixed level-3 performance bug in haswell ukernels.

Details:
- Fixed a performance regression affecting nearly all level-3 operations
  that use the 'haswell' sgemm and dgemm microkernels. This regression
  was introduced in 54fa28b, caused by an ill-formed conditional
  expression in the assembly code that controls whether cache lines of C
  should be prefetched as rows or as columns. Essentially, the two
  branches were reversed, causing incomplete prefetching to occur for
  both row- and column-stored instances of matrix C. Thanks to Devin
  Matthews for his help finding and fixing this bug.
- (cherry picked from commit 71851a0)
fgvanzee added a commit that referenced this pull request Nov 15, 2022
Details:
- Moved edge-case handling into the gemm microkernel. This required
  changing the microkernel API to take m and n dimension parameters.
  This required updating all existing gemm microkernel function pointer
  types, function signatures, and related definitions to take m and n
  dimensions. We also updated all existing kernels in the 'kernels'
  directory to take m and n dimensions, and implemented edge-case
  handling within those microkernels via a collection of new C
  preprocessor macros defined within bli_edge_case_macro_defs.h. Also
  removed the assembly code that formerly would handle general stride
  IO on the microtile, since this can now be handled by the same code
  that does edge cases.
- Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and
  bli_trsm_cntl_create(), where this function pointer is used in lieu of
  the default macrokernel when it is non-NULL, and ignored when it is
  NULL.
- Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single
  function using byte pointers rather that one function for each
  floating-point datatype. Also, obtain the microkernel function pointer
  from the .ukr field of the params struct embedded within the obj_t
  for matrix C (assuming params is non-NULL and contains a non-NULL
  value in the .ukr field). Communicate both the gemm microkernel
  pointer to use as well as the params struct to the microkernel via
  the auxinfo_t struct.
- Defined gemm_ker_params_t type (for the aforementioned obj_t.params
  struct) in bli_gemm_var.h.
- Retired the separate _md macrokernel for mixed datatype computation.
  We now use the reimplemented bli_gemm_ker_var2() instead.
- Updated gemmt macrokernels to pass m and n dimensions into microkernel
  calls.
- Removed edge-case handling from trmm and trsm macrokernels.
- Moved most of bli_packm_alloc() code into a new helper function,
  bli_packm_alloc_ex().
- Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c.
- Added test/syrk_diagonal and test/tensor_contraction directories with
  associated code to test those operations.
- (cherry picked from commit 54fa28b)

Added m, n dims to gemmd/gemmlike ukernel calls.

Details:
- Updated the gemmd addon and the gemmlike sandbox code to use the new
  microkernel calling sequence, which now includes m and n dimensions so
  that the microkernel has all the information necessary to handle edge
  cases. Thanks to Jeff Diamond for catching this, which ideally would
  have been included in commit 54fa28b.
- Retired var2 of both gemmd and gemmlike to 'attic' directories and
  removed their corresponding prototypes. In both cases, var2 was a
  variant of the block-panel algorithm where edge-case handling was
  abstracted away to a microkernel wrapper. (Since this is now the
  official behavior of BLIS microkernels, I saw no need to have it
  included as a separate code path.)
- Comment updates.
- (cherry picked from commit 3f2440b)

Fixed level-3 performance bug in haswell ukernels.

Details:
- Fixed a performance regression affecting nearly all level-3 operations
  that use the 'haswell' sgemm and dgemm microkernels. This regression
  was introduced in 54fa28b, caused by an ill-formed conditional
  expression in the assembly code that controls whether cache lines of C
  should be prefetched as rows or as columns. Essentially, the two
  branches were reversed, causing incomplete prefetching to occur for
  both row- and column-stored instances of matrix C. Thanks to Devin
  Matthews for his help finding and fixing this bug.
- (cherry picked from commit 71851a0)

Removed macros guarding m and n ukernel params.

Details:
- Removed the NEW_DEVIN_MICROKERNEL_MACROS macro guards around the m and
  n dims that were added to the microkernel signature in this commit
  (and originally in 54fa28b). These macros helped us use the same ARM
  microkernel while debugging across both pre- and post-Dec24 (2021)
  commits, but going forward they will no longer be needed.
fgvanzee added a commit that referenced this pull request Mar 17, 2023
Details:
- Moved edge-case handling into the gemm microkernel. This required
  changing the microkernel API to take m and n dimension parameters.
  This required updating all existing gemm microkernel function pointer
  types, function signatures, and related definitions to take m and n
  dimensions. We also updated all existing kernels in the 'kernels'
  directory to take m and n dimensions, and implemented edge-case
  handling within those microkernels via a collection of new C
  preprocessor macros defined within bli_edge_case_macro_defs.h. Also
  removed the assembly code that formerly would handle general stride
  IO on the microtile, since this can now be handled by the same code
  that does edge cases.
- Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and
  bli_trsm_cntl_create(), where this function pointer is used in lieu of
  the default macrokernel when it is non-NULL, and ignored when it is
  NULL.
- Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single
  function using byte pointers rather that one function for each
  floating-point datatype. Also, obtain the microkernel function pointer
  from the .ukr field of the params struct embedded within the obj_t
  for matrix C (assuming params is non-NULL and contains a non-NULL
  value in the .ukr field). Communicate both the gemm microkernel
  pointer to use as well as the params struct to the microkernel via
  the auxinfo_t struct.
- Defined gemm_ker_params_t type (for the aforementioned obj_t.params
  struct) in bli_gemm_var.h.
- Retired the separate _md macrokernel for mixed datatype computation.
  We now use the reimplemented bli_gemm_ker_var2() instead.
- Updated gemmt macrokernels to pass m and n dimensions into microkernel
  calls.
- Removed edge-case handling from trmm and trsm macrokernels.
- Moved most of bli_packm_alloc() code into a new helper function,
  bli_packm_alloc_ex().
- Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c.
- Added test/syrk_diagonal and test/tensor_contraction directories with
  associated code to test those operations.
- (cherry picked from commit 54fa28b)

Added m, n dims to gemmd/gemmlike ukernel calls.

Details:
- Updated the gemmd addon and the gemmlike sandbox code to use the new
  microkernel calling sequence, which now includes m and n dimensions so
  that the microkernel has all the information necessary to handle edge
  cases. Thanks to Jeff Diamond for catching this, which ideally would
  have been included in commit 54fa28b.
- Retired var2 of both gemmd and gemmlike to 'attic' directories and
  removed their corresponding prototypes. In both cases, var2 was a
  variant of the block-panel algorithm where edge-case handling was
  abstracted away to a microkernel wrapper. (Since this is now the
  official behavior of BLIS microkernels, I saw no need to have it
  included as a separate code path.)
- Comment updates.
- (cherry picked from commit 3f2440b)

Fixed level-3 performance bug in haswell ukernels.

Details:
- Fixed a performance regression affecting nearly all level-3 operations
  that use the 'haswell' sgemm and dgemm microkernels. This regression
  was introduced in 54fa28b, caused by an ill-formed conditional
  expression in the assembly code that controls whether cache lines of C
  should be prefetched as rows or as columns. Essentially, the two
  branches were reversed, causing incomplete prefetching to occur for
  both row- and column-stored instances of matrix C. Thanks to Devin
  Matthews for his help finding and fixing this bug.
- (cherry picked from commit 71851a0)

Removed macros guarding m and n ukernel params.

Details:
- Removed the NEW_DEVIN_MICROKERNEL_MACROS macro guards around the m and
  n dims that were added to the microkernel signature in this commit
  (and originally in 54fa28b). These macros helped us use the same ARM
  microkernel while debugging across both pre- and post-Dec24 (2021)
  commits, but going forward they will no longer be needed.

Reactivate general stride assembly in armv8a ukr.

Details:
- Changed the definitions of the sgemm and dgemm ukernels in
  bli_gemm_armv8a_asm_d6x8.c to allow the assembly region to handle
  mismatches between the storage of the microtile and the IO preference
  of the microkernel. The general stride code already existed within the
  armv8a kernel; all that was needed was to change the GEMM_UKR_SETUP_CT
  macro to GEMM_UKR_SETUP_CT_ANY.
- Whitespace changes.

Set BLIS_STACK_BUF_MAX_SIZE on 'altra' to 384.

Details:
- Modified the bli_gemm_armv8a_asm_d6x8.c microkernel to use the
  original GEMM_UKR_SETUP_CT_ANY macro rather than the newer _AUXCT_ANY
  one that obtains its ct buffer from the auxinfo_t struct.
- Set the BLIS_STACK_BUF_MAX_SIZE macro to 384 in bli_family_altra.h.
  This overrides the default value set in bli_kernel_macro_defs.h, which
  is currently defined as BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE * 2,
  which for the 'altra' subconfig comes out to 32*64*2 = 4096 bytes.

NOTE: The last two changes above appear to have fixed the mild
regression in trmm_r performance on single-socket (80-core) Altra
hardware. So many thanks to Leick Robinson for his tireless help in
this performance debugging.
fgvanzee added a commit that referenced this pull request Mar 23, 2023
Details:
- Moved edge-case handling into the gemm microkernel. This required
  changing the microkernel API to take m and n dimension parameters.
  This required updating all existing gemm microkernel function pointer
  types, function signatures, and related definitions to take m and n
  dimensions. We also updated all existing kernels in the 'kernels'
  directory to take m and n dimensions, and implemented edge-case
  handling within those microkernels via a collection of new C
  preprocessor macros defined within bli_edge_case_macro_defs.h. Also
  removed the assembly code that formerly would handle general stride
  IO on the microtile, since this can now be handled by the same code
  that does edge cases.
- Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and
  bli_trsm_cntl_create(), where this function pointer is used in lieu of
  the default macrokernel when it is non-NULL, and ignored when it is
  NULL.
- Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single
  function using byte pointers rather that one function for each
  floating-point datatype. Also, obtain the microkernel function pointer
  from the .ukr field of the params struct embedded within the obj_t
  for matrix C (assuming params is non-NULL and contains a non-NULL
  value in the .ukr field). Communicate both the gemm microkernel
  pointer to use as well as the params struct to the microkernel via
  the auxinfo_t struct.
- Defined gemm_ker_params_t type (for the aforementioned obj_t.params
  struct) in bli_gemm_var.h.
- Retired the separate _md macrokernel for mixed datatype computation.
  We now use the reimplemented bli_gemm_ker_var2() instead.
- Updated gemmt macrokernels to pass m and n dimensions into microkernel
  calls.
- Removed edge-case handling from trmm and trsm macrokernels.
- Moved most of bli_packm_alloc() code into a new helper function,
  bli_packm_alloc_ex().
- Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c.
- Added test/syrk_diagonal and test/tensor_contraction directories with
  associated code to test those operations.
- (cherry picked from commit 54fa28b)

Added m, n dims to gemmd/gemmlike ukernel calls.

Details:
- Updated the gemmd addon and the gemmlike sandbox code to use the new
  microkernel calling sequence, which now includes m and n dimensions so
  that the microkernel has all the information necessary to handle edge
  cases. Thanks to Jeff Diamond for catching this, which ideally would
  have been included in commit 54fa28b.
- Retired var2 of both gemmd and gemmlike to 'attic' directories and
  removed their corresponding prototypes. In both cases, var2 was a
  variant of the block-panel algorithm where edge-case handling was
  abstracted away to a microkernel wrapper. (Since this is now the
  official behavior of BLIS microkernels, I saw no need to have it
  included as a separate code path.)
- Comment updates.
- (cherry picked from commit 3f2440b)

Fixed level-3 performance bug in haswell ukernels.

Details:
- Fixed a performance regression affecting nearly all level-3 operations
  that use the 'haswell' sgemm and dgemm microkernels. This regression
  was introduced in 54fa28b, caused by an ill-formed conditional
  expression in the assembly code that controls whether cache lines of C
  should be prefetched as rows or as columns. Essentially, the two
  branches were reversed, causing incomplete prefetching to occur for
  both row- and column-stored instances of matrix C. Thanks to Devin
  Matthews for his help finding and fixing this bug.
- (cherry picked from commit 71851a0)

Fixed incorrect sizeof(type) in edge case macros.

Details:
- Fixed a typo in bli_edge_case_macro_defs.h that was causing too many
  elements to be allocated in the temporary microtile. This fix was
  first introduced in cb74202. (NOTE: This isn't quite a cherry-pick
  since cb74202 also fixes code that does not yet exist.)

Removed macros guarding m and n ukernel params.

Details:
- Removed the NEW_DEVIN_MICROKERNEL_MACROS macro guards around the m and
  n dims that were added to the microkernel signature in this commit
  (and originally in 54fa28b). These macros helped us use the same ARM
  microkernel while debugging across both pre- and post-Dec24 (2021)
  commits, but going forward they will no longer be needed.

Reactivate general stride assembly in armv8a ukr.

Details:
- Changed the definitions of the sgemm and dgemm ukernels in
  bli_gemm_armv8a_asm_d6x8.c to allow the assembly region to handle
  mismatches between the storage of the microtile and the IO preference
  of the microkernel. The general stride code already existed within the
  armv8a kernel; all that was needed was to change the GEMM_UKR_SETUP_CT
  macro to GEMM_UKR_SETUP_CT_ANY.
- Whitespace changes.

Set BLIS_STACK_BUF_MAX_SIZE on 'altra' to 384.

Details:
- Modified the bli_gemm_armv8a_asm_d6x8.c microkernel to use the
  original GEMM_UKR_SETUP_CT_ANY macro rather than the newer _AUXCT_ANY
  one that obtains its ct buffer from the auxinfo_t struct.
- Set the BLIS_STACK_BUF_MAX_SIZE macro to 384 in bli_family_altra.h.
  This overrides the default value set in bli_kernel_macro_defs.h, which
  is currently defined as BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE * 2,
  which for the 'altra' subconfig comes out to 32*64*2 = 4096 bytes.

NOTE: The last two changes above appear to have fixed the mild
regression in trmm_r performance on single-socket (80-core) Altra
hardware. So many thanks to Leick Robinson for his tireless help in
this performance debugging.
fgvanzee added a commit that referenced this pull request Mar 30, 2023
Details:
- Moved edge-case handling into the gemm microkernel. This required
  changing the microkernel API to take m and n dimension parameters.
  This required updating all existing gemm microkernel function pointer
  types, function signatures, and related definitions to take m and n
  dimensions. We also updated all existing kernels in the 'kernels'
  directory to take m and n dimensions, and implemented edge-case
  handling within those microkernels via a collection of new C
  preprocessor macros defined within bli_edge_case_macro_defs.h. Also
  removed the assembly code that formerly would handle general stride
  IO on the microtile, since this can now be handled by the same code
  that does edge cases.
- Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and
  bli_trsm_cntl_create(), where this function pointer is used in lieu of
  the default macrokernel when it is non-NULL, and ignored when it is
  NULL.
- Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single
  function using byte pointers rather that one function for each
  floating-point datatype. Also, obtain the microkernel function pointer
  from the .ukr field of the params struct embedded within the obj_t
  for matrix C (assuming params is non-NULL and contains a non-NULL
  value in the .ukr field). Communicate both the gemm microkernel
  pointer to use as well as the params struct to the microkernel via
  the auxinfo_t struct.
- Defined gemm_ker_params_t type (for the aforementioned obj_t.params
  struct) in bli_gemm_var.h.
- Retired the separate _md macrokernel for mixed datatype computation.
  We now use the reimplemented bli_gemm_ker_var2() instead.
- Updated gemmt macrokernels to pass m and n dimensions into microkernel
  calls.
- Removed edge-case handling from trmm and trsm macrokernels.
- Moved most of bli_packm_alloc() code into a new helper function,
  bli_packm_alloc_ex().
- Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c.
- Added test/syrk_diagonal and test/tensor_contraction directories with
  associated code to test those operations.
- (cherry picked from commit 54fa28b)

Added m, n dims to gemmd/gemmlike ukernel calls.

Details:
- Updated the gemmd addon and the gemmlike sandbox code to use the new
  microkernel calling sequence, which now includes m and n dimensions so
  that the microkernel has all the information necessary to handle edge
  cases. Thanks to Jeff Diamond for catching this, which ideally would
  have been included in commit 54fa28b.
- Retired var2 of both gemmd and gemmlike to 'attic' directories and
  removed their corresponding prototypes. In both cases, var2 was a
  variant of the block-panel algorithm where edge-case handling was
  abstracted away to a microkernel wrapper. (Since this is now the
  official behavior of BLIS microkernels, I saw no need to have it
  included as a separate code path.)
- Comment updates.
- (cherry picked from commit 3f2440b)

Fixed level-3 performance bug in haswell ukernels.

Details:
- Fixed a performance regression affecting nearly all level-3 operations
  that use the 'haswell' sgemm and dgemm microkernels. This regression
  was introduced in 54fa28b, caused by an ill-formed conditional
  expression in the assembly code that controls whether cache lines of C
  should be prefetched as rows or as columns. Essentially, the two
  branches were reversed, causing incomplete prefetching to occur for
  both row- and column-stored instances of matrix C. Thanks to Devin
  Matthews for his help finding and fixing this bug.
- (cherry picked from commit 71851a0)

Pass ct into microkernel via auxinfo_t.

Details:
- Added a .ct field to the auxinfo_t, which is used to pass a buffer
  local to the macrokernels into the microkernel to use for edge cases
  and storage/preference mismatches. NOTE: None of the microkernels
  currently *use* this .ct buffer.

Fixed incorrect sizeof(type) in edge case macros.

Details:
- Fixed a typo in bli_edge_case_macro_defs.h that was causing too many
  elements to be allocated in the temporary microtile. This fix was
  first introduced in cb74202. (NOTE: This isn't quite a cherry-pick
  since cb74202 also fixes code that does not yet exist.)

Removed macros guarding m and n ukernel params.

Details:
- Removed the NEW_DEVIN_MICROKERNEL_MACROS macro guards around the m and
  n dims that were added to the microkernel signature in this commit
  (and originally in 54fa28b). These macros helped us use the same ARM
  microkernel while debugging across both pre- and post-Dec24 (2021)
  commits, but going forward they will no longer be needed.

Reactivate general stride assembly in armv8a ukr.

Details:
- Changed the definitions of the sgemm and dgemm ukernels in
  bli_gemm_armv8a_asm_d6x8.c to allow the assembly region to handle
  mismatches between the storage of the microtile and the IO preference
  of the microkernel. The general stride code already existed within the
  armv8a kernel; all that was needed was to change the GEMM_UKR_SETUP_CT
  macro to GEMM_UKR_SETUP_CT_ANY.
- Whitespace changes.

Set BLIS_STACK_BUF_MAX_SIZE on 'altra' to 384.

Details:
- Modified the bli_gemm_armv8a_asm_d6x8.c microkernel to use the
  original GEMM_UKR_SETUP_CT_ANY macro rather than the newer _AUXCT_ANY
  one that obtains its ct buffer from the auxinfo_t struct.
- Set the BLIS_STACK_BUF_MAX_SIZE macro to 384 in bli_family_altra.h.
  This overrides the default value set in bli_kernel_macro_defs.h, which
  is currently defined as BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE * 2,
  which for the 'altra' subconfig comes out to 32*64*2 = 4096 bytes.

NOTE: The last two changes above appear to have fixed the mild
regression in trmm_r performance on single-socket (80-core) Altra
hardware. So many thanks to Leick Robinson for his tireless help in
this performance debugging.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants