De-template GEMMT, TRMM, and TRSM macro-kernels.#674
Closed
devinamatthews wants to merge 41 commits into master
These are defined in sub-configuration-specific header files, which are only included by reference kernels.
All kernels have been combined into a single array (level-1v/1f, (un)packm, level-3, and sup), and similarly with preferences (only ukr row-storage preferences for now) and block sizes (which now include sup thresholds and block sizes). These changes are necessary for future support of user-defined kernels. The context initialization functions used by bli_cntx_init_* have also been reworked to use a sentinel instead of an explicit count in order to prevent errors. Note that these changes make the cntx_t code mostly oblivious to BLAS level, but some l3-specific functions remain for compatibility.
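To illustrate why a sentinel is less error-prone than an explicit count, here is a minimal sketch of a sentinel-terminated variadic setter. All names (`ctx_t`, `ker_id_t`, `KER_VA_END`, `ctx_set_kers`) are hypothetical and chosen for illustration; this is not the actual cntx_t interface.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical kernel IDs sharing one array, regardless of BLAS level. */
typedef enum { KER_ADDV = 0, KER_PACKM, KER_GEMM, KER_GEMMSUP, KER_COUNT } ker_id_t;

/* Hypothetical sentinel that terminates the argument list. */
#define KER_VA_END (-1)

typedef void (*void_fp)(void);

typedef struct { void_fp kers[KER_COUNT]; } ctx_t;

/* Consume (id, function) pairs until the sentinel is seen.
   Returns the number of kernels registered. */
int ctx_set_kers(ctx_t *ctx, ...)
{
    va_list args;
    int n = 0;

    va_start(args, ctx);
    for (;;) {
        int id = va_arg(args, int);
        if (id == KER_VA_END) break;         /* end of list */
        assert(id >= 0 && id < KER_COUNT);   /* reject unknown IDs */
        ctx->kers[id] = va_arg(args, void_fp);
        n++;
    }
    va_end(args);
    return n;
}

/* Placeholder kernels used only to have something to register. */
void my_gemm(void) {}
void my_packm(void) {}
```

A call such as `ctx_set_kers(&ctx, KER_GEMM, my_gemm, KER_PACKM, my_packm, KER_VA_END);` needs no count argument, so adding or removing a pair cannot leave a stale count behind.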
1. The generic gemm kernel breaks on armsve because there is no compile-time MR/NR. The reference gemm kernel has been modified to detect this and fall back to a "dumb" version.
2. For some reason, adding an optimization for writing back full microtiles in row-major storage to the reference gemm kernel results in a segfault for armv7a/gcc-9.3. I can't tell if I'm doing something wrong or if there is a compiler bug. This optimization has been removed for the time being.
…vailable as macros. The array of reference packing kernels (0--31) is replaced by exactly two kernels for each config/datatype combination: one to pack MRxK micropanels and one to pack NRxK micropanels. *IMPORTANT*: the "bb" reference kernels have been merged into the "standard" kernels (packm [incl. 1er and unpackm], gemm, trsm, gemmtrsm). The replication factor formerly handled by the "bb" kernels is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 need testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro.
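As a rough picture of what a single MRxK packing kernel with a replication factor does, consider the sketch below. The function name, argument order, and packed layout are assumptions made for illustration; they are not the reference kernel's actual signature or storage scheme.

```c
#include <assert.h>
#include <stddef.h>

/* Pack an m x k block of A (row stride rs, column stride cs) into a
   micropanel with mr rows, storing each element bb times in a row.
   Assumed layout: the panel is stored column by column, and each packed
   column holds mr*bb doubles. Rows m..mr-1 are zero-filled. */
void pack_mrxk_bb(size_t m, size_t k, size_t mr, size_t bb,
                  const double *a, size_t rs, size_t cs, double *p)
{
    assert(m <= mr && bb >= 1);

    for (size_t j = 0; j < k; j++) {
        double *pj = p + j * mr * bb;          /* start of packed column j */
        for (size_t i = 0; i < mr; i++) {
            /* Edge rows beyond m are padded with zeros. */
            double v = (i < m) ? a[i * rs + j * cs] : 0.0;
            for (size_t d = 0; d < bb; d++)    /* replicate bb times */
                pj[i * bb + d] = v;
        }
    }
}
```

With `bb == 1` this reduces to ordinary packing, which is why one kernel can cover both the "standard" and the former "bb" cases in this sketch.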
This change also includes a new level-0 macro: set0s_edge, which helps to simplify the packm kernels.
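For readers unfamiliar with the idea, an edge-zeroing helper of this kind might look like the following. This is a hypothetical function, `zero_edge`, written for a column-major panel of doubles; the real set0s_edge is a macro and its argument list may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Zero everything outside the leading m x n block of an
   m_max x n_max column-major panel p with leading dimension ldp. */
void zero_edge(size_t m, size_t m_max, size_t n, size_t n_max,
               double *p, size_t ldp)
{
    assert(m <= m_max && n <= n_max && m_max <= ldp);

    /* Rows m..m_max-1 of the columns that hold data. */
    for (size_t j = 0; j < n; j++)
        for (size_t i = m; i < m_max; i++)
            p[i + j * ldp] = 0.0;

    /* All rows of the trailing columns n..n_max-1. */
    for (size_t j = n; j < n_max; j++)
        for (size_t i = 0; i < m_max; i++)
            p[i + j * ldp] = 0.0;
}
```

A packing kernel can then copy only the m x n block that actually exists and make a single call to pad the rest, instead of repeating the padding loops in every kernel.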
- bli_packm_struc_cxk has been completely rewritten to combine nat/1m execution and use a special packing kernel for diagonal blocks.
- *all* reference kernels now respect broadcast packing for A and/or B. This works for all l3 operations (even trsm!) and with 1m.
# Conflicts: # ref_kernels/3/bli_gemmtrsm_ref.c # ref_kernels/ind/bli_gemmtrsm1m_ref.c
Due to missing `break`s in a switch statement (warn me, dammit!), the virtual gemm ukernels were not getting set to the optimized versions.
Due to missing `break`s in a switch statement (warn me, dammit!), the virtual gemm ukernels were not getting set to the optimized versions. [ci skip]
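The failure mode is easy to reproduce in isolation. The sketch below uses hypothetical enums (`method_t`, `ukr_t`) rather than the real types: without `break`, every case falls through to the default, so the optimized choice is always overwritten.

```c
#include <assert.h>

typedef enum { METHOD_NAT = 0, METHOD_1M = 1 } method_t;              /* hypothetical */
typedef enum { UKR_REF = 0, UKR_OPT_NAT = 1, UKR_OPT_1M = 2 } ukr_t;  /* hypothetical */

/* Buggy: each case falls through, so every input ends up with UKR_REF. */
ukr_t select_ukr_buggy(method_t m)
{
    ukr_t u = UKR_REF;
    switch (m) {
    case METHOD_1M:
        u = UKR_OPT_1M;
        /* missing break */
    case METHOD_NAT:
        u = UKR_OPT_NAT;
        /* missing break */
    default:
        u = UKR_REF;
    }
    return u;
}

/* Fixed: each case ends with a break, so the optimized choice survives. */
ukr_t select_ukr_fixed(method_t m)
{
    ukr_t u = UKR_REF;
    switch (m) {
    case METHOD_1M:
        u = UKR_OPT_1M;
        break;
    case METHOD_NAT:
        u = UKR_OPT_NAT;
        break;
    default:
        u = UKR_REF;
        break;
    }
    return u;
}
```

On the "warn me" point: GCC can diagnose this pattern with `-Wimplicit-fallthrough`, which is included in `-Wextra`.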
Beta (as the scalar attached to C) was not seen as reset to 1 after the first iteration of the pc loop, as the wrong pointer was passed to bli_gemm_int.
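The reason beta must become 1 after the first pc iteration can be seen in a plain blocked loop. This is a simplified stand-in with a hypothetical function name and column-major operands; it does not use bli_gemm_int or the object API.

```c
#include <assert.h>
#include <stddef.h>

/* C := beta*C + A*B for column-major m x n C, m x k A, and k x n B,
   with the k dimension processed in blocks of kc (the "pc loop").
   Assumes k > 0. */
void gemm_pc_loop(size_t m, size_t n, size_t k, size_t kc,
                  const double *a, const double *b,
                  double beta, double *c)
{
    assert(kc > 0);
    double beta_use = beta;   /* the caller's beta applies to the first block only */

    for (size_t pc = 0; pc < k; pc += kc) {
        size_t kb = (k - pc < kc) ? (k - pc) : kc;

        for (size_t j = 0; j < n; j++)
            for (size_t i = 0; i < m; i++) {
                double acc = 0.0;
                for (size_t p = 0; p < kb; p++)
                    acc += a[i + (pc + p) * m] * b[(pc + p) + j * k];
                c[i + j * m] = beta_use * c[i + j * m] + acc;
            }

        /* Later blocks must accumulate into C, not rescale it again. */
        beta_use = 1.0;
    }
}
```

If `beta_use` were left at `beta` for every block, the partial products from earlier blocks would be scaled repeatedly, which is the kind of wrong result the fix above addresses.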
# Conflicts: # config/a64fx/bli_cntx_init_a64fx.c # config/armsve/bli_cntx_init_armsve.c # config/bgq/bli_cntx_init_bgq.c # config/bulldozer/bli_cntx_init_bulldozer.c # config/cortexa15/bli_cntx_init_cortexa15.c # config/cortexa53/bli_cntx_init_cortexa53.c # config/cortexa57/bli_cntx_init_cortexa57.c # config/cortexa9/bli_cntx_init_cortexa9.c # config/excavator/bli_cntx_init_excavator.c # config/firestorm/bli_cntx_init_firestorm.c # config/haswell/bli_cntx_init_haswell.c # config/knc/bli_cntx_init_knc.c # config/knl/bli_cntx_init_knl.c # config/penryn/bli_cntx_init_penryn.c # config/piledriver/bli_cntx_init_piledriver.c # config/power10/bli_cntx_init_power10.c # config/power7/bli_cntx_init_power7.c # config/power9/bli_cntx_init_power9.c # config/sandybridge/bli_cntx_init_sandybridge.c # config/skx/bli_cntx_init_skx.c # config/steamroller/bli_cntx_init_steamroller.c # config/template/bli_cntx_init_template.c # config/thunderx2/bli_cntx_init_thunderx2.c # config/zen/bli_cntx_init_zen.c # config/zen2/bli_cntx_init_zen2.c # config/zen3/bli_cntx_init_zen3.c # frame/0/bli_l0_check.h # frame/0/bli_l0_oapi.c # frame/0/bli_l0_oapi.h # frame/0/bli_l0_tapi.h # frame/0/copysc/bli_copysc.c # frame/1/bli_l1v_oapi.h # frame/1/bli_l1v_tapi.c # frame/1/bli_l1v_tapi.h # frame/1d/bli_l1d_ft.h # frame/1d/bli_l1d_oapi.c # frame/1d/bli_l1d_oapi.h # frame/1d/bli_l1d_tapi.c # frame/1d/bli_l1d_tapi.h # frame/1f/bli_l1f_check.c # frame/1f/bli_l1f_check.h # frame/1f/bli_l1f_ft.h # frame/1f/bli_l1f_oapi.c # frame/1f/bli_l1f_oapi.h # frame/1f/bli_l1f_tapi.c # frame/1f/bli_l1f_tapi.h # frame/1m/bli_l1m_ft.h # frame/1m/bli_l1m_oapi.c # frame/1m/bli_l1m_oapi.h # frame/1m/bli_l1m_oft_var.h # frame/1m/bli_l1m_tapi.c # frame/1m/bli_l1m_tapi.h # frame/1m/packm/bli_packm_alloc.c # frame/1m/packm/bli_packm_alloc.h # frame/1m/packm/bli_packm_blk_var1.c # frame/1m/packm/bli_packm_blk_var1.h # frame/1m/packm/bli_packm_cntl.h # frame/1m/packm/bli_packm_init.c # frame/1m/packm/bli_packm_init.h # 
frame/1m/packm/bli_packm_int.c # frame/1m/packm/bli_packm_int.h # frame/1m/unpackm/bli_unpackm_blk_var1.c # frame/1m/unpackm/bli_unpackm_int.c # frame/2/bli_l2_check.c # frame/2/bli_l2_check.h # frame/2/bli_l2_ft.h # frame/2/bli_l2_oapi.c # frame/2/bli_l2_oapi.h # frame/2/bli_l2_tapi.c # frame/2/bli_l2_tapi.h # frame/3/bli_l3_blocksize.c # frame/3/bli_l3_blocksize.h # frame/3/bli_l3_cntl.c # frame/3/bli_l3_direct.h # frame/3/bli_l3_int.c # frame/3/bli_l3_int.h # frame/3/bli_l3_oapi.c # frame/3/bli_l3_oapi.h # frame/3/bli_l3_oapi_ex.c # frame/3/bli_l3_oapi_ex.h # frame/3/bli_l3_oft.h # frame/3/bli_l3_oft_var.h # frame/3/bli_l3_packab.c # frame/3/bli_l3_packab.h # frame/3/bli_l3_sup.c # frame/3/bli_l3_sup.h # frame/3/bli_l3_sup_oft.h # frame/3/bli_l3_sup_packm_a.c # frame/3/bli_l3_sup_packm_a.h # frame/3/bli_l3_sup_packm_b.c # frame/3/bli_l3_sup_packm_b.h # frame/3/bli_l3_sup_packm_var.c # frame/3/bli_l3_sup_packm_var.h # frame/3/bli_l3_sup_var1n2m.c # frame/3/bli_l3_sup_vars.h # frame/3/bli_l3_tapi_ex.c # frame/3/bli_l3_tapi_ex.h # frame/3/gemm/bli_gemm_blk_var1.c # frame/3/gemm/bli_gemm_blk_var2.c # frame/3/gemm/bli_gemm_blk_var3.c # frame/3/gemm/bli_gemm_front.c # frame/3/gemm/bli_gemm_front.h # frame/3/gemm/bli_gemm_ker_var2.c # frame/3/gemm/bli_gemm_md.c # frame/3/gemm/bli_gemm_md.h # frame/3/gemm/bli_gemm_var.h # frame/3/gemmt/bli_gemmt_front.c # frame/3/gemmt/bli_gemmt_front.h # frame/3/gemmt/bli_gemmt_l_ker_var2.c # frame/3/gemmt/bli_gemmt_u_ker_var2.c # frame/3/gemmt/bli_gemmt_var.h # frame/3/gemmt/bli_gemmt_x_ker_var2.c # frame/3/hemm/bli_hemm_front.c # frame/3/hemm/bli_hemm_front.h # frame/3/symm/bli_symm_front.c # frame/3/symm/bli_symm_front.h # frame/3/trmm/bli_trmm_front.c # frame/3/trmm/bli_trmm_front.h # frame/3/trmm/bli_trmm_ll_ker_var2.c # frame/3/trmm/bli_trmm_lu_ker_var2.c # frame/3/trmm/bli_trmm_rl_ker_var2.c # frame/3/trmm/bli_trmm_ru_ker_var2.c # frame/3/trmm/bli_trmm_var.h # frame/3/trmm/bli_trmm_xx_ker_var2.c # frame/3/trmm3/bli_trmm3_front.c 
# frame/3/trmm3/bli_trmm3_front.h # frame/3/trsm/bli_trsm_blk_var1.c # frame/3/trsm/bli_trsm_blk_var2.c # frame/3/trsm/bli_trsm_blk_var3.c # frame/3/trsm/bli_trsm_front.c # frame/3/trsm/bli_trsm_front.h # frame/3/trsm/bli_trsm_ll_ker_var2.c # frame/3/trsm/bli_trsm_lu_ker_var2.c # frame/3/trsm/bli_trsm_rl_ker_var2.c # frame/3/trsm/bli_trsm_ru_ker_var2.c # frame/3/trsm/bli_trsm_var.h # frame/3/trsm/bli_trsm_xx_ker_var2.c # frame/base/bli_blksz.c # frame/base/bli_blksz.h # frame/base/bli_cntl.h # frame/base/bli_cntx.c # frame/base/bli_cntx.h # frame/base/bli_env.c # frame/base/bli_gks.c # frame/base/bli_gks.h # frame/base/bli_ind.h # frame/base/bli_info.c # frame/base/bli_obj_scalar.c # frame/base/bli_obj_scalar.h # frame/base/bli_pba.c # frame/base/bli_rntm.h # frame/base/bli_sba.c # frame/base/bli_sba.h # frame/base/bli_setgetijm.c # frame/base/check/bli_obj_check.c # frame/base/check/bli_obj_check.h # frame/include/bli_oapi_ex.h # frame/include/bli_obj_macro_defs.h # frame/include/bli_tapi_ex.h # frame/include/bli_type_defs.h # frame/thread/bli_l3_decor.h # frame/thread/bli_l3_decor_openmp.c # frame/thread/bli_l3_decor_pthreads.c # frame/thread/bli_l3_decor_single.c # frame/thread/bli_l3_sup_decor.h # frame/thread/bli_l3_sup_decor_openmp.c # frame/thread/bli_l3_sup_decor_pthreads.c # frame/thread/bli_l3_sup_decor_single.c # frame/thread/bli_thread.c # frame/thread/bli_thread.h # frame/thread/bli_thrinfo.c # frame/thread/bli_thrinfo.h # frame/thread/bli_thrinfo_sup.c # frame/util/bli_util_check.c # frame/util/bli_util_check.h # frame/util/bli_util_oapi.c # frame/util/bli_util_oapi.h # kernels/zen/1/bli_copyv_zen_int.c # kernels/zen/1/bli_scalv_zen_int10.c # kernels/zen/1f/bli_axpyf_zen_int_4.c # kernels/zen/1f/bli_axpyf_zen_int_5.c # ref_kernels/1m/bli_packm_cxk_1er_ref.c # ref_kernels/3/bli_gemm_ref.c # ref_kernels/3/bli_gemmtrsm_ref.c # ref_kernels/bli_cntx_ref.c # ref_kernels/ind/bli_gemm1m_ref.c # ref_kernels/ind/bli_trsm1m_ref.c # testsuite/src/test_libblis.c
This enables better debugging since errors will show up based on the un-flattened filename and line number.
# Conflicts: # build/flatten-headers.py # frame/3/bli_l3_sup_var1n2m.c
# Conflicts: # build/flatten-headers.py # frame/3/bli_l3_sup_packm.c # frame/3/bli_l3_sup_packm.h # frame/3/bli_l3_sup_packm_var.c # frame/3/bli_l3_sup_packm_var.h # frame/3/bli_l3_sup_var1n2m.c # frame/3/gemmt/bli_gemmt_front.c
1. Add a check for pool exhaustion when freeing blocks. This detects double-free and other bad conditions without a segfault.
2. Make sure to copy *all* block pointers when growing the pool size. Previously, checked-out block pointers were not copied, leading to the presence of uninitialized data.
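Both fixes can be sketched on a toy stack-style pool. The `pool_t` layout and function names below are assumptions for illustration and do not mirror the real pool implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    void  **ptrs;      /* ptrs[top..len-1] are available blocks */
    size_t  top;       /* number of blocks currently checked out */
    size_t  len;       /* number of blocks owned by the pool */
    size_t  blk_size;  /* size of each block in bytes */
} pool_t;

/* Grow the pointer array to new_len entries and allocate the new blocks. */
int pool_grow(pool_t *p, size_t new_len)
{
    void **np = malloc(new_len * sizeof *np);
    if (!np) return -1;

    /* Copy *all* existing entries, including the slots of checked-out
       blocks, so that no entry of the new array is left uninitialized. */
    if (p->len) memcpy(np, p->ptrs, p->len * sizeof *np);

    for (size_t i = p->len; i < new_len; i++) {
        np[i] = malloc(p->blk_size);
        if (!np[i]) {                      /* undo the partial growth */
            while (i > p->len) free(np[--i]);
            free(np);
            return -1;
        }
    }
    free(p->ptrs);
    p->ptrs = np;
    p->len  = new_len;
    return 0;
}

int pool_init(pool_t *p, size_t nblocks, size_t blk_size)
{
    p->ptrs = NULL; p->top = 0; p->len = 0; p->blk_size = blk_size;
    return pool_grow(p, nblocks);
}

/* Check out a block, doubling the pool if every block is in use. */
void *pool_checkout(pool_t *p)
{
    if (p->top == p->len && pool_grow(p, p->len ? 2 * p->len : 1) != 0)
        return NULL;
    void *blk = p->ptrs[p->top];
    p->ptrs[p->top++] = NULL;
    return blk;
}

/* Return a block. Fails with -1 if nothing is checked out, which catches
   a double free (or a foreign block) instead of corrupting the array. */
int pool_checkin(pool_t *p, void *blk)
{
    if (p->top == 0) return -1;
    p->ptrs[--p->top] = blk;
    return 0;
}

/* Free all available blocks and the pointer array. */
void pool_destroy(pool_t *p)
{
    for (size_t i = p->top; i < p->len; i++) free(p->ptrs[i]);
    free(p->ptrs);
    p->ptrs = NULL; p->top = p->len = 0;
}
```

In this sketch the check in `pool_checkin` turns an extra free into an error code, and `pool_grow` never leaves stale or uninitialized slots behind after a resize.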
This option (disabled by default) enables compiling and linking with the Address Sanitizer library (ASan), via the -fsanitize=address flag supported by clang, gcc, and probably others. This flag is included for all files *except* optimized kernels, since it usually requires an extra register, which violates the constraints for many gemm microkernels.
Reinstate check for checked-out blocks upon finalization. A flag has been added to indicate that the pool is actually under reinitialization (where checked-out blocks are OK), which temporarily disables the check. A memory leak where blocks are not checked back in is now correctly detected upon exit.
# Conflicts: # Makefile # common.mk # configure # frame/3/bli_l3_oapi_ex.c # frame/3/bli_l3_sup_packm.c # frame/3/bli_l3_sup_packm.h # frame/3/bli_l3_sup_ref.c # frame/3/bli_l3_sup_var1n2m.c # frame/base/bli_pool.c # frame/base/bli_rntm.h # frame/thread/bli_l3_decor.h # frame/thread/bli_l3_decor_openmp.c # frame/thread/bli_l3_decor_pthreads.c # frame/thread/bli_l3_decor_single.c # frame/thread/bli_l3_sup_decor.h # frame/thread/bli_l3_sup_decor_openmp.c # frame/thread/bli_l3_sup_decor_pthreads.c # frame/thread/bli_l3_sup_decor_single.c # frame/thread/bli_thrcomm.h # frame/thread/bli_thrcomm_openmp.c # frame/thread/bli_thrcomm_pthreads.c # frame/thread/bli_thrcomm_single.c # frame/thread/bli_thread.c # frame/thread/bli_thrinfo.c # frame/thread/bli_thrinfo.h # frame/thread/bli_thrinfo_sup.c
Merged
Depends on #673.