De-template GEMMT, TRMM, and TRSM macro-kernels.#674

Closed
devinamatthews wants to merge 41 commits into master from detemplate-kernels

@devinamatthews
Member

Depends on #673.

devinamatthews and others added 30 commits January 31, 2022 10:47
These are defined in sub-configuration-specific header files, which are only included by reference kernels.
The gemm reference kernel now uses the configuration-dependent BLIS_MR_x/BLIS_NR_x macros to control unrolling, rather than fixed values. This fixes #259 and replaces PR #547.
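As a rough illustration of what "configuration-dependent macros control unrolling" can mean, here is a minimal sketch of a reference-style microkernel. `SKETCH_MR_D`, `SKETCH_NR_D`, and `sketch_dgemm_ukr_ref` are hypothetical names standing in for the real BLIS_MR_x/BLIS_NR_x macros and the reference kernel; the actual kernel in this PR may be structured differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the configuration-dependent BLIS_MR_x and
   BLIS_NR_x macros; a sub-configuration header would normally define them. */
#ifndef SKETCH_MR_D
#define SKETCH_MR_D 4
#endif
#ifndef SKETCH_NR_D
#define SKETCH_NR_D 6
#endif

/* Computes C := beta*C + alpha*A*B for a single MR x NR microtile.
   a: packed micropanel with MR contiguous elements per k step.
   b: packed micropanel with NR contiguous elements per k step.
   c: output tile with row stride rs_c and column stride cs_c. */
static void sketch_dgemm_ukr_ref(size_t k, double alpha,
                                 const double *a, const double *b,
                                 double beta, double *c,
                                 size_t rs_c, size_t cs_c)
{
    enum { MR = SKETCH_MR_D, NR = SKETCH_NR_D };
    double ab[MR * NR] = { 0.0 };

    assert(a && b && c);

    for (size_t l = 0; l < k; ++l) {
        /* The trip counts are compile-time constants, so the compiler is
           free to fully unroll and vectorize these two loops. */
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ab[i * NR + j] += a[l * MR + i] * b[l * NR + j];
    }

    for (size_t i = 0; i < MR; ++i) {
        for (size_t j = 0; j < NR; ++j) {
            double *cij = &c[i * rs_c + j * cs_c];
            /* With beta == 0, C is overwritten rather than scaled, so any
               NaN or Inf already stored in C is not propagated. */
            double scaled = (beta == 0.0) ? 0.0 : beta * (*cij);
            *cij = scaled + alpha * ab[i * NR + j];
        }
    }
}
```

Because the tile dimensions come from macros rather than fixed literals, the same source can be compiled once per sub-configuration with different register blockings.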
All kernels have been combined into a single array (level-1v/1f, (un)packm, level-3, and sup), and similarly with preferences (only ukr row-storage preferences for now) and block sizes (which now include sup thresholds and block sizes). These changes are necessary for future support of user-defined kernels. The context initialization functions used by bli_cntx_init_* have also been reworked to use a sentinel instead of an explicit count in order to prevent errors. Note that these changes mostly make the cntx_t code oblivious to BLAS level, but some l3-specific functions remain for compatibility.
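The sentinel idea can be sketched as follows. Everything here (`sketch_cntx`, `sketch_cntx_set_ukrs`, `SKETCH_VA_END`, the dummy kernels) is a hypothetical illustration, not the actual BLIS context API: the point is only that a terminator value removes the need to keep an explicit argument count in sync with the argument list.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical sentinel and table size for this sketch. */
enum { SKETCH_VA_END = -1, SKETCH_NUM_KERNELS = 8 };

typedef void (*sketch_fp)(void);

typedef struct {
    sketch_fp ukrs[SKETCH_NUM_KERNELS];
} sketch_cntx;

/* Two placeholder kernels so the example is self-contained. */
static void sketch_kernel_a(void) {}
static void sketch_kernel_b(void) {}

/* Arguments after cntx are (int id, sketch_fp fp) pairs, terminated by
   SKETCH_VA_END. Returns the number of kernels registered. */
static int sketch_cntx_set_ukrs(sketch_cntx *cntx, ...)
{
    va_list args;
    int n = 0;

    va_start(args, cntx);
    for (;;) {
        int id = va_arg(args, int);
        if (id == SKETCH_VA_END)
            break;                       /* sentinel reached: stop reading */
        assert(0 <= id && id < SKETCH_NUM_KERNELS);
        cntx->ukrs[id] = va_arg(args, sketch_fp);
        ++n;
    }
    va_end(args);
    return n;
}
```

A call then looks like `sketch_cntx_set_ukrs(&cntx, 0, sketch_kernel_a, 3, sketch_kernel_b, SKETCH_VA_END);`. Adding or removing a pair no longer requires updating a separate count, although forgetting the sentinel itself is still undefined behavior.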
1. The generic gemm kernel breaks on armsve because there is no
   compile-time MR/NR. The reference gemm kernel has been modified
   to detect this and fall back to a "dumb" version.
2. For some reason, adding an optimization for writing back full
   microtiles in row-major storage to the reference gemm kernel
   results in a segfault for armv7a/gcc-9.3. I can't tell if I'm
   doing something wrong or if there is a compiler bug. This
   optimization has been removed for the time being.
…vailable as macros.

The array of reference packing kernels (0--31) is replaced by exactly two kernels for each config/datatype combination, one to pack MRxK micropanels and one to pack NRxK micropanels. *IMPORTANT*: the "bb" reference kernels have been merged into the "standard" kernels (packm [incl. 1er and unpackm], gemm, trsm, gemmtrsm). The replication factor used by these kernels is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 need testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro.
This change also includes a new level-0 macro: set0s_edge, which helps to simplify the packm kernels.
- bli_packm_struc_cxk has been completely rewritten to combine nat/1m execution and use a special packing kernel for diagonal blocks.
- *all* reference kernels now respect broadcast packing for A and/or B. This works for all l3 operations (even trsm!) and with 1m.
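To make the notion of a replication factor concrete, here is a minimal sketch of packing one micropanel while duplicating each element `bb` times and zero-filling the unused edge. The function name, argument order, and layout are assumptions made for illustration; they are not the signatures of the packm kernels or of set0s_edge in this PR.

```c
#include <assert.h>
#include <stddef.h>

/* Packs a k x n slab of B (strides rs_b, cs_b) into p.
   Each of the k slices in p holds n_max * bb doubles:
   every source element is stored bb times in a row, and the
   columns n..n_max-1 (the edge) are filled with zeros. */
static void sketch_packm_bb(size_t n, size_t n_max, size_t k, size_t bb,
                            const double *b, size_t rs_b, size_t cs_b,
                            double *p)
{
    assert(n <= n_max && bb >= 1);

    for (size_t l = 0; l < k; ++l) {
        double *pl = p + l * n_max * bb;

        /* Copy and replicate the n valid elements of this slice. */
        for (size_t j = 0; j < n; ++j)
            for (size_t d = 0; d < bb; ++d)
                pl[j * bb + d] = b[l * rs_b + j * cs_b];

        /* Zero the edge so a full-size microkernel can run on it. */
        for (size_t j = n * bb; j < n_max * bb; ++j)
            pl[j] = 0.0;
    }
}
```

With `bb == 1` this reduces to ordinary packing, which is why a single kernel can cover both the standard and the broadcast case.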
# Conflicts:
#	ref_kernels/3/bli_gemmtrsm_ref.c
#	ref_kernels/ind/bli_gemmtrsm1m_ref.c
Due to missing `break`s in a switch statement (warn me, dammit!), the virtual gemm ukernels were not getting set to the optimized versions.
Due to missing `break`s in a switch statement (warn me, dammit!), the virtual gemm ukernels were not getting set to the optimized versions. [ci skip]
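The class of bug described here can be reproduced in a few lines. The enum and function names below are hypothetical and the switch is not the one from the BLIS source; it only shows how a missing `break` lets a later case silently overwrite the intended selection.

```c
#include <assert.h>

typedef enum { SKETCH_NAT, SKETCH_1M } sketch_method;
typedef enum { SKETCH_UKR_REF, SKETCH_UKR_OPT } sketch_ukr;

/* Buggy: control falls out of the SKETCH_1M case into default,
   so the optimized choice is immediately overwritten. */
static sketch_ukr sketch_select_buggy(sketch_method m)
{
    sketch_ukr ukr = SKETCH_UKR_REF;
    assert(m == SKETCH_NAT || m == SKETCH_1M);
    switch (m) {
    case SKETCH_1M:
        ukr = SKETCH_UKR_OPT;
        /* missing break */
    default:
        ukr = SKETCH_UKR_REF;
    }
    return ukr;
}

/* Fixed: each case ends with break. */
static sketch_ukr sketch_select_fixed(sketch_method m)
{
    sketch_ukr ukr = SKETCH_UKR_REF;
    assert(m == SKETCH_NAT || m == SKETCH_1M);
    switch (m) {
    case SKETCH_1M:
        ukr = SKETCH_UKR_OPT;
        break;
    default:
        ukr = SKETCH_UKR_REF;
        break;
    }
    return ukr;
}
```

GCC can report this pattern through `-Wimplicit-fallthrough`, which is part of `-Wextra` but not of `-Wall`.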
Beta (as the scalar attached to C) was not seen as reset to 1 after the first iteration of the pc loop, as the wrong pointer was passed to bli_gemm_int.
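Why beta must become 1 after the first pass can be seen in a naive sketch of a loop over the k dimension. The function names and the row-major layout are assumptions for illustration; this is not bli_gemm_int or the actual blocked variant.

```c
#include <assert.h>
#include <stddef.h>

/* Naive row-major update: C := beta*C + A_blk * B_blk. */
static void sketch_gemm_block(size_t m, size_t n, size_t kb,
                              const double *a, size_t lda,
                              const double *b, size_t ldb,
                              double beta, double *c, size_t ldc)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t l = 0; l < kb; ++l)
                sum += a[i * lda + l] * b[l * ldb + j];
            c[i * ldc + j] = beta * c[i * ldc + j] + sum;
        }
    }
}

/* Splits k into chunks of at most kc. Only the first chunk may scale C
   by the caller's beta; later chunks must accumulate with beta = 1,
   otherwise the partial result is rescaled on every iteration.
   (The k == 0 case is not handled in this sketch.) */
static void sketch_gemm_pc_loop(size_t m, size_t n, size_t k, size_t kc,
                                const double *a, const double *b,
                                double beta, double *c)
{
    assert(kc > 0);
    for (size_t pc = 0; pc < k; pc += kc) {
        size_t kb = (k - pc < kc) ? (k - pc) : kc;
        double beta_use = (pc == 0) ? beta : 1.0;
        sketch_gemm_block(m, n, kb, a + pc, k, b + pc * n, n,
                          beta_use, c, n);
    }
}
```

If `beta_use` were left at the caller's beta in every iteration, a 1x1 problem with k = 4, kc = 2, all-ones inputs, C = 10 and beta = 0.5 would produce 5.5 instead of the correct 9.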
# Conflicts:
#	config/a64fx/bli_cntx_init_a64fx.c
#	config/armsve/bli_cntx_init_armsve.c
#	config/bgq/bli_cntx_init_bgq.c
#	config/bulldozer/bli_cntx_init_bulldozer.c
#	config/cortexa15/bli_cntx_init_cortexa15.c
#	config/cortexa53/bli_cntx_init_cortexa53.c
#	config/cortexa57/bli_cntx_init_cortexa57.c
#	config/cortexa9/bli_cntx_init_cortexa9.c
#	config/excavator/bli_cntx_init_excavator.c
#	config/firestorm/bli_cntx_init_firestorm.c
#	config/haswell/bli_cntx_init_haswell.c
#	config/knc/bli_cntx_init_knc.c
#	config/knl/bli_cntx_init_knl.c
#	config/penryn/bli_cntx_init_penryn.c
#	config/piledriver/bli_cntx_init_piledriver.c
#	config/power10/bli_cntx_init_power10.c
#	config/power7/bli_cntx_init_power7.c
#	config/power9/bli_cntx_init_power9.c
#	config/sandybridge/bli_cntx_init_sandybridge.c
#	config/skx/bli_cntx_init_skx.c
#	config/steamroller/bli_cntx_init_steamroller.c
#	config/template/bli_cntx_init_template.c
#	config/thunderx2/bli_cntx_init_thunderx2.c
#	config/zen/bli_cntx_init_zen.c
#	config/zen2/bli_cntx_init_zen2.c
#	config/zen3/bli_cntx_init_zen3.c
#	frame/0/bli_l0_check.h
#	frame/0/bli_l0_oapi.c
#	frame/0/bli_l0_oapi.h
#	frame/0/bli_l0_tapi.h
#	frame/0/copysc/bli_copysc.c
#	frame/1/bli_l1v_oapi.h
#	frame/1/bli_l1v_tapi.c
#	frame/1/bli_l1v_tapi.h
#	frame/1d/bli_l1d_ft.h
#	frame/1d/bli_l1d_oapi.c
#	frame/1d/bli_l1d_oapi.h
#	frame/1d/bli_l1d_tapi.c
#	frame/1d/bli_l1d_tapi.h
#	frame/1f/bli_l1f_check.c
#	frame/1f/bli_l1f_check.h
#	frame/1f/bli_l1f_ft.h
#	frame/1f/bli_l1f_oapi.c
#	frame/1f/bli_l1f_oapi.h
#	frame/1f/bli_l1f_tapi.c
#	frame/1f/bli_l1f_tapi.h
#	frame/1m/bli_l1m_ft.h
#	frame/1m/bli_l1m_oapi.c
#	frame/1m/bli_l1m_oapi.h
#	frame/1m/bli_l1m_oft_var.h
#	frame/1m/bli_l1m_tapi.c
#	frame/1m/bli_l1m_tapi.h
#	frame/1m/packm/bli_packm_alloc.c
#	frame/1m/packm/bli_packm_alloc.h
#	frame/1m/packm/bli_packm_blk_var1.c
#	frame/1m/packm/bli_packm_blk_var1.h
#	frame/1m/packm/bli_packm_cntl.h
#	frame/1m/packm/bli_packm_init.c
#	frame/1m/packm/bli_packm_init.h
#	frame/1m/packm/bli_packm_int.c
#	frame/1m/packm/bli_packm_int.h
#	frame/1m/unpackm/bli_unpackm_blk_var1.c
#	frame/1m/unpackm/bli_unpackm_int.c
#	frame/2/bli_l2_check.c
#	frame/2/bli_l2_check.h
#	frame/2/bli_l2_ft.h
#	frame/2/bli_l2_oapi.c
#	frame/2/bli_l2_oapi.h
#	frame/2/bli_l2_tapi.c
#	frame/2/bli_l2_tapi.h
#	frame/3/bli_l3_blocksize.c
#	frame/3/bli_l3_blocksize.h
#	frame/3/bli_l3_cntl.c
#	frame/3/bli_l3_direct.h
#	frame/3/bli_l3_int.c
#	frame/3/bli_l3_int.h
#	frame/3/bli_l3_oapi.c
#	frame/3/bli_l3_oapi.h
#	frame/3/bli_l3_oapi_ex.c
#	frame/3/bli_l3_oapi_ex.h
#	frame/3/bli_l3_oft.h
#	frame/3/bli_l3_oft_var.h
#	frame/3/bli_l3_packab.c
#	frame/3/bli_l3_packab.h
#	frame/3/bli_l3_sup.c
#	frame/3/bli_l3_sup.h
#	frame/3/bli_l3_sup_oft.h
#	frame/3/bli_l3_sup_packm_a.c
#	frame/3/bli_l3_sup_packm_a.h
#	frame/3/bli_l3_sup_packm_b.c
#	frame/3/bli_l3_sup_packm_b.h
#	frame/3/bli_l3_sup_packm_var.c
#	frame/3/bli_l3_sup_packm_var.h
#	frame/3/bli_l3_sup_var1n2m.c
#	frame/3/bli_l3_sup_vars.h
#	frame/3/bli_l3_tapi_ex.c
#	frame/3/bli_l3_tapi_ex.h
#	frame/3/gemm/bli_gemm_blk_var1.c
#	frame/3/gemm/bli_gemm_blk_var2.c
#	frame/3/gemm/bli_gemm_blk_var3.c
#	frame/3/gemm/bli_gemm_front.c
#	frame/3/gemm/bli_gemm_front.h
#	frame/3/gemm/bli_gemm_ker_var2.c
#	frame/3/gemm/bli_gemm_md.c
#	frame/3/gemm/bli_gemm_md.h
#	frame/3/gemm/bli_gemm_var.h
#	frame/3/gemmt/bli_gemmt_front.c
#	frame/3/gemmt/bli_gemmt_front.h
#	frame/3/gemmt/bli_gemmt_l_ker_var2.c
#	frame/3/gemmt/bli_gemmt_u_ker_var2.c
#	frame/3/gemmt/bli_gemmt_var.h
#	frame/3/gemmt/bli_gemmt_x_ker_var2.c
#	frame/3/hemm/bli_hemm_front.c
#	frame/3/hemm/bli_hemm_front.h
#	frame/3/symm/bli_symm_front.c
#	frame/3/symm/bli_symm_front.h
#	frame/3/trmm/bli_trmm_front.c
#	frame/3/trmm/bli_trmm_front.h
#	frame/3/trmm/bli_trmm_ll_ker_var2.c
#	frame/3/trmm/bli_trmm_lu_ker_var2.c
#	frame/3/trmm/bli_trmm_rl_ker_var2.c
#	frame/3/trmm/bli_trmm_ru_ker_var2.c
#	frame/3/trmm/bli_trmm_var.h
#	frame/3/trmm/bli_trmm_xx_ker_var2.c
#	frame/3/trmm3/bli_trmm3_front.c
#	frame/3/trmm3/bli_trmm3_front.h
#	frame/3/trsm/bli_trsm_blk_var1.c
#	frame/3/trsm/bli_trsm_blk_var2.c
#	frame/3/trsm/bli_trsm_blk_var3.c
#	frame/3/trsm/bli_trsm_front.c
#	frame/3/trsm/bli_trsm_front.h
#	frame/3/trsm/bli_trsm_ll_ker_var2.c
#	frame/3/trsm/bli_trsm_lu_ker_var2.c
#	frame/3/trsm/bli_trsm_rl_ker_var2.c
#	frame/3/trsm/bli_trsm_ru_ker_var2.c
#	frame/3/trsm/bli_trsm_var.h
#	frame/3/trsm/bli_trsm_xx_ker_var2.c
#	frame/base/bli_blksz.c
#	frame/base/bli_blksz.h
#	frame/base/bli_cntl.h
#	frame/base/bli_cntx.c
#	frame/base/bli_cntx.h
#	frame/base/bli_env.c
#	frame/base/bli_gks.c
#	frame/base/bli_gks.h
#	frame/base/bli_ind.h
#	frame/base/bli_info.c
#	frame/base/bli_obj_scalar.c
#	frame/base/bli_obj_scalar.h
#	frame/base/bli_pba.c
#	frame/base/bli_rntm.h
#	frame/base/bli_sba.c
#	frame/base/bli_sba.h
#	frame/base/bli_setgetijm.c
#	frame/base/check/bli_obj_check.c
#	frame/base/check/bli_obj_check.h
#	frame/include/bli_oapi_ex.h
#	frame/include/bli_obj_macro_defs.h
#	frame/include/bli_tapi_ex.h
#	frame/include/bli_type_defs.h
#	frame/thread/bli_l3_decor.h
#	frame/thread/bli_l3_decor_openmp.c
#	frame/thread/bli_l3_decor_pthreads.c
#	frame/thread/bli_l3_decor_single.c
#	frame/thread/bli_l3_sup_decor.h
#	frame/thread/bli_l3_sup_decor_openmp.c
#	frame/thread/bli_l3_sup_decor_pthreads.c
#	frame/thread/bli_l3_sup_decor_single.c
#	frame/thread/bli_thread.c
#	frame/thread/bli_thread.h
#	frame/thread/bli_thrinfo.c
#	frame/thread/bli_thrinfo.h
#	frame/thread/bli_thrinfo_sup.c
#	frame/util/bli_util_check.c
#	frame/util/bli_util_check.h
#	frame/util/bli_util_oapi.c
#	frame/util/bli_util_oapi.h
#	kernels/zen/1/bli_copyv_zen_int.c
#	kernels/zen/1/bli_scalv_zen_int10.c
#	kernels/zen/1f/bli_axpyf_zen_int_4.c
#	kernels/zen/1f/bli_axpyf_zen_int_5.c
#	ref_kernels/1m/bli_packm_cxk_1er_ref.c
#	ref_kernels/3/bli_gemm_ref.c
#	ref_kernels/3/bli_gemmtrsm_ref.c
#	ref_kernels/bli_cntx_ref.c
#	ref_kernels/ind/bli_gemm1m_ref.c
#	ref_kernels/ind/bli_trsm1m_ref.c
#	testsuite/src/test_libblis.c
This enables better debugging since errors will show up based on the un-flattened filename and line number.
# Conflicts:
#	build/flatten-headers.py
#	frame/3/bli_l3_sup_var1n2m.c
# Conflicts:
#	build/flatten-headers.py
#	frame/3/bli_l3_sup_packm.c
#	frame/3/bli_l3_sup_packm.h
#	frame/3/bli_l3_sup_packm_var.c
#	frame/3/bli_l3_sup_packm_var.h
#	frame/3/bli_l3_sup_var1n2m.c
#	frame/3/gemmt/bli_gemmt_front.c
1. Add a check for pool exhaustion when freeing blocks. This detects double-free and other bad conditions without segfault.
2. Make sure to copy *all* block pointers when growing the pool size. Previously, checked-out block pointers were not copied, leading to the presence of uninitialized data.
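The two fixes above can be sketched with a toy block pool. The structure and function names are hypothetical and much simpler than the real pool in frame/base/bli_pool.c; the sketch assumes a layout where entries below `top` are checked out and entries at or above `top` are available.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    void  **blocks;      /* [0, top) checked out, [top, len) available */
    size_t  top;
    size_t  len;
    size_t  block_size;
} sketch_pool;

static int sketch_pool_init(sketch_pool *p, size_t len, size_t block_size)
{
    assert(p && len > 0);
    p->blocks = malloc(len * sizeof *p->blocks);
    if (!p->blocks) return -1;
    for (size_t i = 0; i < len; ++i)
        p->blocks[i] = malloc(block_size);   /* failure checks elided */
    p->top = 0; p->len = len; p->block_size = block_size;
    return 0;
}

static int sketch_pool_grow(sketch_pool *p, size_t extra)
{
    size_t new_len = p->len + extra;
    void **nb = malloc(new_len * sizeof *nb);
    if (!nb) return -1;
    /* Copy ALL old entries, including the checked-out range [0, top),
       so no slot in the new array is left uninitialized. */
    memcpy(nb, p->blocks, p->len * sizeof *nb);
    for (size_t i = p->len; i < new_len; ++i)
        nb[i] = malloc(p->block_size);
    free(p->blocks);
    p->blocks = nb; p->len = new_len;
    return 0;
}

static void *sketch_pool_checkout(sketch_pool *p)
{
    if (p->top == p->len && sketch_pool_grow(p, p->len) != 0)
        return NULL;
    return p->blocks[p->top++];
}

/* Returns -1 instead of corrupting memory when nothing is checked out,
   which is what a double free looks like from the pool's side. */
static int sketch_pool_checkin(sketch_pool *p, void *block)
{
    if (p->top == 0) return -1;
    p->blocks[--p->top] = block;
    return 0;
}

/* Frees available blocks and returns how many were never checked in. */
static size_t sketch_pool_finalize(sketch_pool *p)
{
    size_t leaked = p->top;
    for (size_t i = p->top; i < p->len; ++i) free(p->blocks[i]);
    free(p->blocks);
    p->blocks = NULL; p->top = p->len = 0;
    return leaked;
}
```

Returning the count of outstanding blocks from the finalizer is one simple way to surface a leak at exit; the real pool may report it differently.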
This option (disabled by default) enables compiling and linking with the Address Sanitizer library (ASan), via the -fsanitize=address flag supported by clang, gcc, and probably others. This flag is included for all files *except* optimized kernels, since it usually requires an extra register which violates the constraints for many gemm microkernels.
Reinstate check for checked-out blocks upon finalization. A flag has been added to indicate that the pool is actually under reinitialization (where checked-out blocks are OK), which temporarily disables the check. A memory leak where blocks are not checked back in is now correctly detected upon exit.
# Conflicts:
#	Makefile
#	common.mk
#	configure
#	frame/3/bli_l3_oapi_ex.c
#	frame/3/bli_l3_sup_packm.c
#	frame/3/bli_l3_sup_packm.h
#	frame/3/bli_l3_sup_ref.c
#	frame/3/bli_l3_sup_var1n2m.c
#	frame/base/bli_pool.c
#	frame/base/bli_rntm.h
#	frame/thread/bli_l3_decor.h
#	frame/thread/bli_l3_decor_openmp.c
#	frame/thread/bli_l3_decor_pthreads.c
#	frame/thread/bli_l3_decor_single.c
#	frame/thread/bli_l3_sup_decor.h
#	frame/thread/bli_l3_sup_decor_openmp.c
#	frame/thread/bli_l3_sup_decor_pthreads.c
#	frame/thread/bli_l3_sup_decor_single.c
#	frame/thread/bli_thrcomm.h
#	frame/thread/bli_thrcomm_openmp.c
#	frame/thread/bli_thrcomm_pthreads.c
#	frame/thread/bli_thrcomm_single.c
#	frame/thread/bli_thread.c
#	frame/thread/bli_thrinfo.c
#	frame/thread/bli_thrinfo.h
#	frame/thread/bli_thrinfo_sup.c
@devinamatthews devinamatthews marked this pull request as draft October 5, 2022 20:10
@devinamatthews devinamatthews deleted the detemplate-kernels branch October 28, 2022 16:49

2 participants