Fixed trmm[3]/trsm performance bug in cf7d616.#685
Merged
Conversation
Details: - Fixed a performance bug in the packing of micropanels that intersect the diagonal of triangular matrices (i.e., those found in trmm, trmm3, and trsm). This bug was introduced in cf7d616 and stemmed from an ill-formed boolean conditional expression in bli_packm_blk_var1(). This conditional would chose when to use round-robin parallel work allocation, but checked for the triangularity of the submatrix being packed while failing also to check for whether the current micropanel actually intersected the diagonal. The net result of this bug was that *all* micropanels of a triangular matrix, no matter where the upanels resided within the matrix, were assigned to threads via a round-robin policy. This affected some microarchitectures and threading configurations much worse than others, but it seems that overall the effect was universally negative, likely because of the reduced spatial locality during the packing with round-robin. Thanks to Leick Robinson for his tireless efforts in helping track down this issue.
fgvanzee
added a commit
that referenced
this pull request
Nov 3, 2022
Details:
- Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and
.ker_params. These fields store pointers to functions and data that
will allow the user to more flexibly create custom operations while
recycling BLIS's existing partitioning infrastructure.
- Updated typed API to packm variant and structure-aware kernels to
replace the diagonal offset with panel offsets, and changed strides
of both C and P to inc/ldim semantics. Updated object API to the packm
variant to include rntm_t*.
- Removed the packm variant function pointer from the packm cntl_t node
definition since it has been replaced by the .pack_fn pointer in the
obj_t.
- Updated bli_packm_int() to read the new packm variant function pointer
from the obj_t and call it instead of from the cntl_t node.
- Moved some of the logic of bli_l3_packm.c to a new file,
bli_packm_alloc.c.
- Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers
instead of typed pointers, allowing a single function to be used
regardless of datatype. This obviated having a separate implementation
in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a
new function, bli_packm_scalar().
- Employed a new standard whereby right-hand matrix operands ("B") are
always packed as column-stored row panels -- that is, identically to
that of left-hand matrix operands ("A"). This means that while we pack
matrix A normally, we actually pack B in a transposed state. This
allowed us to simplify a lot of code throughout the framework, and
also affected some of the logic in bli_l3_packa() and _packb().
- Simplified bli_packm_init.c in light of the new B^T convention
described above. bli_packm_init()--which is now called from within
bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns
a bool that indicates whether packing should be performed (or
skipped).
- Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(),
which, among other things, defaults the new .pack_fn field of the
obj_t to bli_packm_blk_var1() if the field is NULL.
- Defined a new function, bli_obj_reset_origin(), which permanently
refocuses the view of an object so that it "forgets" any offsets from
its original pointer. This function also sets the object's root field
to itself. Calls to bli_obj_reset_origin() for each matrix operand
appear in the _front() functions, after the obj_t's are aliased. This
resetting of the underlying matrices' origins is needed in preparation
for more advanced features from within custom packm kernels.
- Redefined bli_pba_rntm_set_pba() from a regular function to a static
inline function.
- Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use
libblis_test_pobj_create() to create local packed objects. Previously,
these packed objects were created by calling lower-level functions.
- (cherry picked from commit cf7d616)
Fixed trmm[3]/trsm performance bug in cf7d616. (#685)
Details:
- Fixed a performance bug in the packing of micropanels that intersect
the diagonal of triangular matrices (i.e., those found in trmm, trmm3,
and trsm). This bug was introduced in cf7d616 and stemmed from an
ill-formed boolean conditional expression in bli_packm_blk_var1().
This conditional would chose when to use round-robin parallel work
allocation, but checked for the triangularity of the submatrix being
packed while failing also to check for whether the current micropanel
actually intersected the diagonal. The net result of this bug was that
*all* micropanels of a triangular matrix, no matter where the upanels
resided within the matrix, were assigned to threads via a round-robin
policy. This affected some microarchitectures and threading
configurations much worse than others, but it seems that overall the
effect was universally negative, likely because of the reduced spatial
locality during the packing with round-robin. Thanks to Leick Robinson
for his tireless efforts in helping track down this issue.
- (cherry picked from commit 872898d)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Details: