From aa0cd123eab999af861c2167d4cd251a23cb8c7e Mon Sep 17 00:00:00 2001 From: "Field G. Van Zee" Date: Wed, 2 Nov 2022 20:52:45 -0500 Subject: [PATCH] Fixed trmm[3]/trsm performance bug in cf7d616. Details: - Fixed a performance bug in the packing of micropanels that intersect the diagonal of triangular matrices (i.e., those found in trmm, trmm3, and trsm). This bug was introduced in cf7d616 and stemmed from an ill-formed boolean conditional expression in bli_packm_blk_var1(). This conditional would chose when to use round-robin parallel work allocation, but checked for the triangularity of the submatrix being packed while failing also to check for whether the current micropanel actually intersected the diagonal. The net result of this bug was that *all* micropanels of a triangular matrix, no matter where the upanels resided within the matrix, were assigned to threads via a round-robin policy. This affected some microarchitectures and threading configurations much worse than others, but it seems that overall the effect was universally negative, likely because of the reduced spatial locality during the packing with round-robin. Thanks to Leick Robinson for his tireless efforts in helping track down this issue. --- frame/1m/packm/bli_packm_blk_var1.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/frame/1m/packm/bli_packm_blk_var1.c b/frame/1m/packm/bli_packm_blk_var1.c index 9ac9582db8..05263c4b7f 100644 --- a/frame/1m/packm/bli_packm_blk_var1.c +++ b/frame/1m/packm/bli_packm_blk_var1.c @@ -190,15 +190,16 @@ void bli_packm_blk_var1 inc_t p_inc = ps_p; - // NOTE: We MUST use round-robin partitioning when packing - // micropanels of a triangular matrix. Hermitian/symmetric - // and general packing may use slab or round-robin, depending - // on which was selected at configure-time. - // The definition of bli_packm_my_iter() will depend on whether slab - // or round-robin partitioning was requested at configure-time. - bool my_iter = bli_is_triangular( strucc ) - ? bli_packm_my_iter_rr( it, it_start, it_end, tid, nt ) - : bli_packm_my_iter ( it, it_start, it_end, tid, nt ); + // NOTE: We MUST use round-robin work allocation (bli_packm_my_iter_rr()) + // when packing micropanels of a triangular matrix. Hermitian/symmetric + // and general packing may use slab or round-robin (bli_packm_my_iter()), + // depending on which was selected at configure-time. + bool my_iter = ( bli_is_triangular( strucc ) && + bli_intersects_diag_n( diagoffc_i, panel_dim_i, + panel_len_full ) + ? bli_packm_my_iter_rr( it, it_start, it_end, tid, nt ) + : bli_packm_my_iter ( it, it_start, it_end, tid, nt ) + ); if ( bli_is_triangular( strucc ) && bli_is_unstored_subpart_n( diagoffc_i, uploc, panel_dim_i, panel_len_full ) )