Fix bug where multiple threads pack the same parts of each panel.#680
Fix bug where multiple threads pack the same parts of each panel.#680
Conversation
This was caused by different parts of the code using num_threads/thread_id vs. n_way/work_id. The fix was to standardize on the latter and provide a "fake" thrinfo_t sub-prenode in the thread control tree which consists of single-threaded teams. The single-team with multiple threads node is also required since it and only it can be used to do barriers and broadcasting (e.g. of the packed buffer pointer).
| bli_thrinfo_set_sub_node( thread_cur, thread_par ); | ||
|
|
||
| if ( sub_prenode != NULL ) | ||
| if ( bszid == BLIS_NO_PART ) |
There was a problem hiding this comment.
I wonder if there is a replacement to the condition if ( bszid == BLIS_NO_PART ) (there are tons of it in BLIS) to something clearer, like:
const bool doing_packing = ...; // perhaps giving a new name to BLIS_NO_PART to mean "packing" ?
if ( doing_packing )
{
...
}
PS: you seem to have indent issues in some places, or it was the Github viewer ?
There was a problem hiding this comment.
We have lots of indent issues because I write spaces and then Field changes them to tabs. It's a process.
There was a problem hiding this comment.
But this is really a hack for now to fix performance and potential correctness issues. This will be replaced by something more robust eventually.
There was a problem hiding this comment.
Although do note that the only control tree nodes with BLIS_NO_PART at the moment are packing nodes.
There was a problem hiding this comment.
Although do note that the only control tree nodes with BLIS_NO_PART at the moment are packing nodes.
Yes, that's why I suggested to rename it to something more understandable. I recall myself the first time reading the code to find out what does BLIS_NO_PART mean. But yes, you can take time to refactor it later.
There was a problem hiding this comment.
I was trying to future-proof the enum when naming it BLIS_NO_PART because I envisioned maybe someday there would be an operation that doesn't partition but doesn't pack either. I agree it's not super clear and I'm 100% open to it being improved.
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
commits that affect non-trivial aspects of BLIS. The major highlights:
- Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
cntl_t) to the thrinfo_t object. This allows the rntm_t to be
effectively const (although it is sometimes copied internally and
modified to reflect different ways of parallelism). Moving the mem_t
sets the stage for sharing a global control tree amongst all
threads.
- De-templatized the macrokernels for gemmt, trmm, and trsm to match
the macrokernel for gemm, which has been de-templatized since
54fa28b.
- Reimplemented bli_l3_determine_kc() by separating out the logic for
adjusting KC based on MR/NR for triangular A and/or B into a new
function, bli_l3_adjust_kc(). For now, this function is still called
from bli_l3_determine_kc(), but in the future we plan to have it
called once when constructing the control tree.
- Refactored the level-3 thread decorator into two parts:
- One part deals only with launching threads, each one calling a
generic thread entry function. This code resides in frame/thread
and constitutes the definition of bli_thread_launch(). Note that
it is specific to the threading implementation (OpenMP, pthreads,
single, etc.)
- The other part deals with passing the matrix operands and related
information into bli_thread_launch(). This is the "l3 decorator"
and now resides in frame/3. It is agnostic to the threading
implementation.
- Modified the "level" of the thread control tree passed in at each
operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
passed in a communicator representing the active thread teams which
would share the available work. Now, the *parent* thread comm is
passed in. The operation then grabs the child comm and uses it to
partition the work. The difference is in bli_trsm_blk_var1(), where
there are now two children nodes for this single operation (i.e. the
thread control tree is split one level above where the control tree
is). The sub-prenode is used for the trsm subproblem while the
normal sub-node is used for the gemm part. Importantly, the parent
comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
added back in the future when the control tree's creation is moved so
that it happens much sooner (provided that bli_*_front() have not been
absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
for consistency. This includes _num_threads(), _thread_id(), _n_way(),
_work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
_am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
misspelled.
- (cherry picked from commit aeb5f0c)
Fixed performance bug caused by redundant packing. (#680)
Details:
- Fixed a performance bug whereby multiple threads were redundantly
packing the same (rather than separate) micropanels. This bug was
caused by different parts of the code using the num_threads/thread_id
field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
standardize on the latter and provide a "fake" thrinfo_t sub-prenode
in the thrinfo tree which consists of single-member thread teams. The
single team with multiple threads node is still required since it and
only it can be used to perform barriers and broadcasts (e.g. of the
packed buffer pointer).
- (cherry picked from commit 29f79f0)
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
commits that affect non-trivial aspects of BLIS. The major highlights:
- Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
cntl_t) to the thrinfo_t object. This allows the rntm_t to be
effectively const (although it is sometimes copied internally and
modified to reflect different ways of parallelism). Moving the mem_t
sets the stage for sharing a global control tree amongst all
threads.
- De-templatized the macrokernels for gemmt, trmm, and trsm to match
the macrokernel for gemm, which has been de-templatized since
54fa28b.
- Reimplemented bli_l3_determine_kc() by separating out the logic for
adjusting KC based on MR/NR for triangular A and/or B into a new
function, bli_l3_adjust_kc(). For now, this function is still called
from bli_l3_determine_kc(), but in the future we plan to have it
called once when constructing the control tree.
- Refactored the level-3 thread decorator into two parts:
- One part deals only with launching threads, each one calling a
generic thread entry function. This code resides in frame/thread
and constitutes the definition of bli_thread_launch(). Note that
it is specific to the threading implementation (OpenMP, pthreads,
single, etc.)
- The other part deals with passing the matrix operands and related
information into bli_thread_launch(). This is the "l3 decorator"
and now resides in frame/3. It is agnostic to the threading
implementation.
- Modified the "level" of the thread control tree passed in at each
operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
passed in a communicator representing the active thread teams which
would share the available work. Now, the *parent* thread comm is
passed in. The operation then grabs the child comm and uses it to
partition the work. The difference is in bli_trsm_blk_var1(), where
there are now two children nodes for this single operation (i.e. the
thread control tree is split one level above where the control tree
is). The sub-prenode is used for the trsm subproblem while the
normal sub-node is used for the gemm part. Importantly, the parent
comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
added back in the future when the control tree's creation is moved so
that it happens much sooner (provided that bli_*_front() have not been
absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
for consistency. This includes _num_threads(), _thread_id(), _n_way(),
_work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
_am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
misspelled.
- (cherry picked from commit aeb5f0c)
Fixed performance bug caused by redundant packing. (#680)
Details:
- Fixed a performance bug whereby multiple threads were redundantly
packing the same (rather than separate) micropanels. This bug was
caused by different parts of the code using the num_threads/thread_id
field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
standardize on the latter and provide a "fake" thrinfo_t sub-prenode
in the thrinfo tree which consists of single-member thread teams. The
single team with multiple threads node is still required since it and
only it can be used to perform barriers and broadcasts (e.g. of the
packed buffer pointer).
- (cherry picked from commit 29f79f0)
Fixed random segfault in test/3 drivers. (#788)
Details:
- Fixed a segfault in the non-gemm test drivers in test/3 that was the
result of sometimes leaving either .n_str or .k_str fields of the
params_t struct uninitialized, depending on the operation in question.
For example, in test_hemm.c, init_def_params() would only initialize
the .m_str and .n_str fields, but not the .k_str field. Even though
hemm doesn't use a 'k' dimension, the proc_params() function (called
via parse_cl_params()) universally attempts to convert all three into
integers via sscanf(), which was understandably failing when one of
those strings was a NULL pointer. I'm not sure how this code ever
worked to begin with. Special thanks to Leick Robinson for finding and
reporting this bug.
- (cherry picked from commit 1236dda)
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
commits that affect non-trivial aspects of BLIS. The major highlights:
- Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
cntl_t) to the thrinfo_t object. This allows the rntm_t to be
effectively const (although it is sometimes copied internally and
modified to reflect different ways of parallelism). Moving the mem_t
sets the stage for sharing a global control tree amongst all
threads.
- De-templatized the macrokernels for gemmt, trmm, and trsm to match
the macrokernel for gemm, which has been de-templatized since
54fa28b.
- Reimplemented bli_l3_determine_kc() by separating out the logic for
adjusting KC based on MR/NR for triangular A and/or B into a new
function, bli_l3_adjust_kc(). For now, this function is still called
from bli_l3_determine_kc(), but in the future we plan to have it
called once when constructing the control tree.
- Refactored the level-3 thread decorator into two parts:
- One part deals only with launching threads, each one calling a
generic thread entry function. This code resides in frame/thread
and constitutes the definition of bli_thread_launch(). Note that
it is specific to the threading implementation (OpenMP, pthreads,
single, etc.)
- The other part deals with passing the matrix operands and related
information into bli_thread_launch(). This is the "l3 decorator"
and now resides in frame/3. It is agnostic to the threading
implementation.
- Modified the "level" of the thread control tree passed in at each
operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
passed in a communicator representing the active thread teams which
would share the available work. Now, the *parent* thread comm is
passed in. The operation then grabs the child comm and uses it to
partition the work. The difference is in bli_trsm_blk_var1(), where
there are now two children nodes for this single operation (i.e. the
thread control tree is split one level above where the control tree
is). The sub-prenode is used for the trsm subproblem while the
normal sub-node is used for the gemm part. Importantly, the parent
comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
added back in the future when the control tree's creation is moved so
that it happens much sooner (provided that bli_*_front() have not been
absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
for consistency. This includes _num_threads(), _thread_id(), _n_way(),
_work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
_am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
misspelled.
- (cherry picked from commit aeb5f0c)
Fixed performance bug caused by redundant packing. (#680)
Details:
- Fixed a performance bug whereby multiple threads were redundantly
packing the same (rather than separate) micropanels. This bug was
caused by different parts of the code using the num_threads/thread_id
field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
standardize on the latter and provide a "fake" thrinfo_t sub-prenode
in the thrinfo tree which consists of single-member thread teams. The
single team with multiple threads node is still required since it and
only it can be used to perform barriers and broadcasts (e.g. of the
packed buffer pointer).
- (cherry picked from commit 29f79f0)
Fixed random segfault in test/3 drivers. (#788)
Details:
- Fixed a segfault in the non-gemm test drivers in test/3 that was the
result of sometimes leaving either .n_str or .k_str fields of the
params_t struct uninitialized, depending on the operation in question.
For example, in test_hemm.c, init_def_params() would only initialize
the .m_str and .n_str fields, but not the .k_str field. Even though
hemm doesn't use a 'k' dimension, the proc_params() function (called
via parse_cl_params()) universally attempts to convert all three into
integers via sscanf(), which was understandably failing when one of
those strings was a NULL pointer. I'm not sure how this code ever
worked to begin with. Special thanks to Leick Robinson for finding and
reporting this bug.
- (cherry picked from commit 1236dda)
Fixed staleness in kernels/zen/3/bli_gemm_small.c.
Details:
- Added missing 'const' keyword in function prototypes for
bli_gemm_small() and friends.
- Updated pba usage to reflect new APIs.
- Fixed syntax typo in 'export GOMP_CPU_AFFINITY' line in ul2128
conditional of test/3/runme.sh.
- Thanks to Jeff Diamond for reporting these issues.
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
commits that affect non-trivial aspects of BLIS. The major highlights:
- Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
cntl_t) to the thrinfo_t object. This allows the rntm_t to be
effectively const (although it is sometimes copied internally and
modified to reflect different ways of parallelism). Moving the mem_t
sets the stage for sharing a global control tree amongst all
threads.
- De-templatized the macrokernels for gemmt, trmm, and trsm to match
the macrokernel for gemm, which has been de-templatized since
54fa28b.
- Reimplemented bli_l3_determine_kc() by separating out the logic for
adjusting KC based on MR/NR for triangular A and/or B into a new
function, bli_l3_adjust_kc(). For now, this function is still called
from bli_l3_determine_kc(), but in the future we plan to have it
called once when constructing the control tree.
- Refactored the level-3 thread decorator into two parts:
- One part deals only with launching threads, each one calling a
generic thread entry function. This code resides in frame/thread
and constitutes the definition of bli_thread_launch(). Note that
it is specific to the threading implementation (OpenMP, pthreads,
single, etc.)
- The other part deals with passing the matrix operands and related
information into bli_thread_launch(). This is the "l3 decorator"
and now resides in frame/3. It is agnostic to the threading
implementation.
- Modified the "level" of the thread control tree passed in at each
operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
passed in a communicator representing the active thread teams which
would share the available work. Now, the *parent* thread comm is
passed in. The operation then grabs the child comm and uses it to
partition the work. The difference is in bli_trsm_blk_var1(), where
there are now two children nodes for this single operation (i.e. the
thread control tree is split one level above where the control tree
is). The sub-prenode is used for the trsm subproblem while the
normal sub-node is used for the gemm part. Importantly, the parent
comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
added back in the future when the control tree's creation is moved so
that it happens much sooner (provided that bli_*_front() have not been
absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
for consistency. This includes _num_threads(), _thread_id(), _n_way(),
_work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
_am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
misspelled.
- (cherry picked from commit aeb5f0c)
Fixed performance bug caused by redundant packing. (#680)
Details:
- Fixed a performance bug whereby multiple threads were redundantly
packing the same (rather than separate) micropanels. This bug was
caused by different parts of the code using the num_threads/thread_id
field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
standardize on the latter and provide a "fake" thrinfo_t sub-prenode
in the thrinfo tree which consists of single-member thread teams. The
single team with multiple threads node is still required since it and
only it can be used to perform barriers and broadcasts (e.g. of the
packed buffer pointer).
- (cherry picked from commit 29f79f0)
Fixed random segfault in test/3 drivers. (#788)
Details:
- Fixed a segfault in the non-gemm test drivers in test/3 that was the
result of sometimes leaving either .n_str or .k_str fields of the
params_t struct uninitialized, depending on the operation in question.
For example, in test_hemm.c, init_def_params() would only initialize
the .m_str and .n_str fields, but not the .k_str field. Even though
hemm doesn't use a 'k' dimension, the proc_params() function (called
via parse_cl_params()) universally attempts to convert all three into
integers via sscanf(), which was understandably failing when one of
those strings was a NULL pointer. I'm not sure how this code ever
worked to begin with. Special thanks to Leick Robinson for finding and
reporting this bug.
- (cherry picked from commit 1236dda)
Fixed staleness in kernels/zen/3/bli_gemm_small.c.
Details:
- Added missing 'const' keyword in function prototypes for
bli_gemm_small() and friends.
- Updated pba usage to reflect new APIs.
- Fixed syntax typo in 'export GOMP_CPU_AFFINITY' line in ul2128
conditional of test/3/runme.sh.
- Thanks to Jeff Diamond for reporting these issues.
Allow test/3 drivers to use default ind_t method. (#804)
Details:
- Previously, the standalone performance drivers in test/3 were written
under the assumption that the user would want to explicitly test
either native execution *or* 1m. But because the accompanying runme.sh
script defaults to passing "native" in for the -i command line option
(which explicitly sets the induced method type), running the script
without modification causes the test drivers to use slow reference
microkernels on systems where native complex-domain microkernels are
not registered -- which will yield poor performance for complex-domain
level-3 operations. Furthermore, even if a user was aware of this, the
test drivers did not support any single value for the -i option that
would test BLIS using the library's default behavior -- that is, using
1m on systems where it is needed and native execution on systems that
have native microkernels implemented and registered.
- This commit addresses the aforementioned issue by supporting a new
value for the -i option: "auto". The "auto" value causes the driver
to avoid explicitly setting the induced method altogether, leaving
BLIS's default behavior in place. This "auto" option is also now the
default setting within the runme.sh script. Thanks to Leick Robinson
for finding and reporting this issue.
- Also added support for "nat" as a shorthand for "native", which
the help text already (erroneously) claimed was supported.
- (cherry picked from commit fd1a7e3)
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
commits that affect non-trivial aspects of BLIS. The major highlights:
- Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
cntl_t) to the thrinfo_t object. This allows the rntm_t to be
effectively const (although it is sometimes copied internally and
modified to reflect different ways of parallelism). Moving the mem_t
sets the stage for sharing a global control tree amongst all
threads.
- De-templatized the macrokernels for gemmt, trmm, and trsm to match
the macrokernel for gemm, which has been de-templatized since
54fa28b.
- Reimplemented bli_l3_determine_kc() by separating out the logic for
adjusting KC based on MR/NR for triangular A and/or B into a new
function, bli_l3_adjust_kc(). For now, this function is still called
from bli_l3_determine_kc(), but in the future we plan to have it
called once when constructing the control tree.
- Refactored the level-3 thread decorator into two parts:
- One part deals only with launching threads, each one calling a
generic thread entry function. This code resides in frame/thread
and constitutes the definition of bli_thread_launch(). Note that
it is specific to the threading implementation (OpenMP, pthreads,
single, etc.)
- The other part deals with passing the matrix operands and related
information into bli_thread_launch(). This is the "l3 decorator"
and now resides in frame/3. It is agnostic to the threading
implementation.
- Modified the "level" of the thread control tree passed in at each
operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
passed in a communicator representing the active thread teams which
would share the available work. Now, the *parent* thread comm is
passed in. The operation then grabs the child comm and uses it to
partition the work. The difference is in bli_trsm_blk_var1(), where
there are now two children nodes for this single operation (i.e. the
thread control tree is split one level above where the control tree
is). The sub-prenode is used for the trsm subproblem while the
normal sub-node is used for the gemm part. Importantly, the parent
comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
added back in the future when the control tree's creation is moved so
that it happens much sooner (provided that bli_*_front() have not been
absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
for consistency. This includes _num_threads(), _thread_id(), _n_way(),
_work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
_am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
misspelled.
- (cherry picked from commit aeb5f0c)
Fixed performance bug caused by redundant packing. (#680)
Details:
- Fixed a performance bug whereby multiple threads were redundantly
packing the same (rather than separate) micropanels. This bug was
caused by different parts of the code using the num_threads/thread_id
field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
standardize on the latter and provide a "fake" thrinfo_t sub-prenode
in the thrinfo tree which consists of single-member thread teams. The
single team with multiple threads node is still required since it and
only it can be used to perform barriers and broadcasts (e.g. of the
packed buffer pointer).
- (cherry picked from commit 29f79f0)
Fixed random segfault in test/3 drivers. (#788)
Details:
- Fixed a segfault in the non-gemm test drivers in test/3 that was the
result of sometimes leaving either .n_str or .k_str fields of the
params_t struct uninitialized, depending on the operation in question.
For example, in test_hemm.c, init_def_params() would only initialize
the .m_str and .n_str fields, but not the .k_str field. Even though
hemm doesn't use a 'k' dimension, the proc_params() function (called
via parse_cl_params()) universally attempts to convert all three into
integers via sscanf(), which was understandably failing when one of
those strings was a NULL pointer. I'm not sure how this code ever
worked to begin with. Special thanks to Leick Robinson for finding and
reporting this bug.
- (cherry picked from commit 1236dda)
Fixed staleness in kernels/zen/3/bli_gemm_small.c.
Details:
- Added missing 'const' keyword in function prototypes for
bli_gemm_small() and friends.
- Updated pba usage to reflect new APIs.
- Fixed syntax typo in 'export GOMP_CPU_AFFINITY' line in ul2128
conditional of test/3/runme.sh.
- Thanks to Jeff Diamond for reporting these issues.
This was caused by different parts of the code using num_threads/thread_id vs. n_way/work_id. The fix was to standardize on the latter and provide a "fake" thrinfo_t sub-prenode in the thread control tree which consists of single-threaded teams. The single-team with multiple threads node is also required since it and only it can be used to do barriers and broadcasting (e.g. of the packed buffer pointer).