Skip to content

Fix bug where multiple threads pack the same parts of each panel.#680

Merged
fgvanzee merged 2 commits intomasterfrom
fix-thread-overpacking
Oct 31, 2022
Merged

Fix bug where multiple threads pack the same parts of each panel.#680
fgvanzee merged 2 commits intomasterfrom
fix-thread-overpacking

Conversation

@devinamatthews
Copy link
Copy Markdown
Member

This was caused by different parts of the code using num_threads/thread_id vs. n_way/work_id. The fix was to standardize on the latter and provide a "fake" thrinfo_t sub-prenode in the thread control tree which consists of single-threaded teams. The single-team with multiple threads node is also required since it and only it can be used to do barriers and broadcasting (e.g. of the packed buffer pointer).

This was caused by different parts of the code using num_threads/thread_id vs. n_way/work_id. The fix was to standardize on the latter and provide a "fake" thrinfo_t sub-prenode in the thread control tree which consists of single-threaded teams. The single-team with multiple threads node is also required since it and only it can be used to do barriers and broadcasting (e.g. of the packed buffer pointer).
Comment thread frame/3/bli_l3_thrinfo.c Outdated
bli_thrinfo_set_sub_node( thread_cur, thread_par );

if ( sub_prenode != NULL )
if ( bszid == BLIS_NO_PART )
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a replacement to the condition if ( bszid == BLIS_NO_PART ) (there are tons of it in BLIS) to something clearer, like:

const bool doing_packing = ...; // perhaps giving a new name to BLIS_NO_PART to mean "packing" ?
if ( doing_packing )
{
    ...
}

PS: you seem to have indent issues in some places, or it was the Github viewer ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have lots of indent issues because I write spaces and then Field changes them to tabs. It's a process.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is really a hack for now to fix performance and potential correctness issues. This will be replaced by something more robust eventually.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although do note that the only control tree nodes with BLIS_NO_PART at the moment are packing nodes.

Copy link
Copy Markdown
Contributor

@hominhquan hominhquan Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although do note that the only control tree nodes with BLIS_NO_PART at the moment are packing nodes.

Yes, that's why I suggested to rename it to something more understandable. I recall myself the first time reading the code to find out what does BLIS_NO_PART mean. But yes, you can take time to refactor it later.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to future-proof the enum when naming it BLIS_NO_PART because I envisioned maybe someday there would be an operation that doesn't partition but doesn't pack either. I agree it's not super clear and I'm 100% open to it being improved.

@fgvanzee fgvanzee merged commit 29f79f0 into master Oct 31, 2022
@devinamatthews devinamatthews deleted the fix-thread-overpacking branch November 1, 2022 01:08
fgvanzee added a commit that referenced this pull request Nov 20, 2023
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
  commits that affect non-trivial aspects of BLIS. The major highlights:
  - Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
    cntl_t) to the thrinfo_t object. This allows the rntm_t to be
    effectively const (although it is sometimes copied internally and
    modified to reflect different ways of parallelism). Moving the mem_t
    sets the stage for sharing a global control tree amongst all
    threads.
  - De-templatized the macrokernels for gemmt, trmm, and trsm to match
    the macrokernel for gemm, which has been de-templatized since
    54fa28b.
  - Reimplemented bli_l3_determine_kc() by separating out the logic for
    adjusting KC based on MR/NR for triangular A and/or B into a new
    function, bli_l3_adjust_kc(). For now, this function is still called
    from bli_l3_determine_kc(), but in the future we plan to have it
    called once when constructing the control tree.
  - Refactored the level-3 thread decorator into two parts:
    - One part deals only with launching threads, each one calling a
      generic thread entry function. This code resides in frame/thread
      and constitutes the definition of bli_thread_launch(). Note that
      it is specific to the threading implementation (OpenMP, pthreads,
      single, etc.)
    - The other part deals with passing the matrix operands and related
      information into bli_thread_launch(). This is the "l3 decorator"
      and now resides in frame/3. It is agnostic to the threading
      implementation.
  - Modified the "level" of the thread control tree passed in at each
    operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
    passed in a communicator representing the active thread teams which
    would share the available work. Now, the *parent* thread comm is
    passed in. The operation then grabs the child comm and uses it to
    partition the work. The difference is in bli_trsm_blk_var1(), where
    there are now two children nodes for this single operation (i.e. the
    thread control tree is split one level above where the control tree
    is). The sub-prenode is used for the trsm subproblem while the
    normal sub-node is used for the gemm part. Importantly, the parent
    comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
  added back in the future when the control tree's creation is moved so
  that it happens much sooner (provided that bli_*_front() have not been
  absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
  for consistency. This includes _num_threads(), _thread_id(), _n_way(),
  _work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
  _am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
  misspelled.
- (cherry picked from commit aeb5f0c)

Fixed performance bug caused by redundant packing. (#680)

Details:
- Fixed a performance bug whereby multiple threads were redundantly
  packing the same (rather than separate) micropanels. This bug was
  caused by different parts of the code using the num_threads/thread_id
  field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
  standardize on the latter and provide a "fake" thrinfo_t sub-prenode
  in the thrinfo tree which consists of single-member thread teams. The
  single team with multiple threads node is still required since it and
  only it can be used to perform barriers and broadcasts (e.g. of the
  packed buffer pointer).
- (cherry picked from commit 29f79f0)
fgvanzee added a commit that referenced this pull request Dec 3, 2023
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
  commits that affect non-trivial aspects of BLIS. The major highlights:
  - Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
    cntl_t) to the thrinfo_t object. This allows the rntm_t to be
    effectively const (although it is sometimes copied internally and
    modified to reflect different ways of parallelism). Moving the mem_t
    sets the stage for sharing a global control tree amongst all
    threads.
  - De-templatized the macrokernels for gemmt, trmm, and trsm to match
    the macrokernel for gemm, which has been de-templatized since
    54fa28b.
  - Reimplemented bli_l3_determine_kc() by separating out the logic for
    adjusting KC based on MR/NR for triangular A and/or B into a new
    function, bli_l3_adjust_kc(). For now, this function is still called
    from bli_l3_determine_kc(), but in the future we plan to have it
    called once when constructing the control tree.
  - Refactored the level-3 thread decorator into two parts:
    - One part deals only with launching threads, each one calling a
      generic thread entry function. This code resides in frame/thread
      and constitutes the definition of bli_thread_launch(). Note that
      it is specific to the threading implementation (OpenMP, pthreads,
      single, etc.)
    - The other part deals with passing the matrix operands and related
      information into bli_thread_launch(). This is the "l3 decorator"
      and now resides in frame/3. It is agnostic to the threading
      implementation.
  - Modified the "level" of the thread control tree passed in at each
    operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
    passed in a communicator representing the active thread teams which
    would share the available work. Now, the *parent* thread comm is
    passed in. The operation then grabs the child comm and uses it to
    partition the work. The difference is in bli_trsm_blk_var1(), where
    there are now two children nodes for this single operation (i.e. the
    thread control tree is split one level above where the control tree
    is). The sub-prenode is used for the trsm subproblem while the
    normal sub-node is used for the gemm part. Importantly, the parent
    comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
  added back in the future when the control tree's creation is moved so
  that it happens much sooner (provided that bli_*_front() have not been
  absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
  for consistency. This includes _num_threads(), _thread_id(), _n_way(),
  _work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
  _am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
  misspelled.
- (cherry picked from commit aeb5f0c)

Fixed performance bug caused by redundant packing. (#680)

Details:
- Fixed a performance bug whereby multiple threads were redundantly
  packing the same (rather than separate) micropanels. This bug was
  caused by different parts of the code using the num_threads/thread_id
  field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
  standardize on the latter and provide a "fake" thrinfo_t sub-prenode
  in the thrinfo tree which consists of single-member thread teams. The
  single team with multiple threads node is still required since it and
  only it can be used to perform barriers and broadcasts (e.g. of the
  packed buffer pointer).
- (cherry picked from commit 29f79f0)

Fixed random segfault in test/3 drivers. (#788)

Details:
- Fixed a segfault in the non-gemm test drivers in test/3 that was the
  result of sometimes leaving either .n_str or .k_str fields of the
  params_t struct uninitialized, depending on the operation in question.
  For example, in test_hemm.c, init_def_params() would only initialize
  the .m_str and .n_str fields, but not the .k_str field. Even though
  hemm doesn't use a 'k' dimension, the proc_params() function (called
  via parse_cl_params()) universally attempts to convert all three into
  integers via sscanf(), which was understandably failing when one of
  those strings was a NULL pointer. I'm not sure how this code ever
  worked to begin with. Special thanks to Leick Robinson for finding and
  reporting this bug.
- (cherry picked from commit 1236dda)
fgvanzee added a commit that referenced this pull request Mar 26, 2024
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
  commits that affect non-trivial aspects of BLIS. The major highlights:
  - Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
    cntl_t) to the thrinfo_t object. This allows the rntm_t to be
    effectively const (although it is sometimes copied internally and
    modified to reflect different ways of parallelism). Moving the mem_t
    sets the stage for sharing a global control tree amongst all
    threads.
  - De-templatized the macrokernels for gemmt, trmm, and trsm to match
    the macrokernel for gemm, which has been de-templatized since
    54fa28b.
  - Reimplemented bli_l3_determine_kc() by separating out the logic for
    adjusting KC based on MR/NR for triangular A and/or B into a new
    function, bli_l3_adjust_kc(). For now, this function is still called
    from bli_l3_determine_kc(), but in the future we plan to have it
    called once when constructing the control tree.
  - Refactored the level-3 thread decorator into two parts:
    - One part deals only with launching threads, each one calling a
      generic thread entry function. This code resides in frame/thread
      and constitutes the definition of bli_thread_launch(). Note that
      it is specific to the threading implementation (OpenMP, pthreads,
      single, etc.)
    - The other part deals with passing the matrix operands and related
      information into bli_thread_launch(). This is the "l3 decorator"
      and now resides in frame/3. It is agnostic to the threading
      implementation.
  - Modified the "level" of the thread control tree passed in at each
    operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
    passed in a communicator representing the active thread teams which
    would share the available work. Now, the *parent* thread comm is
    passed in. The operation then grabs the child comm and uses it to
    partition the work. The difference is in bli_trsm_blk_var1(), where
    there are now two children nodes for this single operation (i.e. the
    thread control tree is split one level above where the control tree
    is). The sub-prenode is used for the trsm subproblem while the
    normal sub-node is used for the gemm part. Importantly, the parent
    comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
  added back in the future when the control tree's creation is moved so
  that it happens much sooner (provided that bli_*_front() have not been
  absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
  for consistency. This includes _num_threads(), _thread_id(), _n_way(),
  _work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
  _am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
  misspelled.
- (cherry picked from commit aeb5f0c)

Fixed performance bug caused by redundant packing. (#680)

Details:
- Fixed a performance bug whereby multiple threads were redundantly
  packing the same (rather than separate) micropanels. This bug was
  caused by different parts of the code using the num_threads/thread_id
  field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
  standardize on the latter and provide a "fake" thrinfo_t sub-prenode
  in the thrinfo tree which consists of single-member thread teams. The
  single team with multiple threads node is still required since it and
  only it can be used to perform barriers and broadcasts (e.g. of the
  packed buffer pointer).
- (cherry picked from commit 29f79f0)

Fixed random segfault in test/3 drivers. (#788)

Details:
- Fixed a segfault in the non-gemm test drivers in test/3 that was the
  result of sometimes leaving either .n_str or .k_str fields of the
  params_t struct uninitialized, depending on the operation in question.
  For example, in test_hemm.c, init_def_params() would only initialize
  the .m_str and .n_str fields, but not the .k_str field. Even though
  hemm doesn't use a 'k' dimension, the proc_params() function (called
  via parse_cl_params()) universally attempts to convert all three into
  integers via sscanf(), which was understandably failing when one of
  those strings was a NULL pointer. I'm not sure how this code ever
  worked to begin with. Special thanks to Leick Robinson for finding and
  reporting this bug.
- (cherry picked from commit 1236dda)

Fixed staleness in kernels/zen/3/bli_gemm_small.c.

Details:
- Added missing 'const' keyword in function prototypes for
  bli_gemm_small() and friends.
- Updated pba usage to reflect new APIs.
- Fixed syntax typo in 'export GOMP_CPU_AFFINITY' line in ul2128
  conditional of test/3/runme.sh.
- Thanks to Jeff Diamond for reporting these issues.
fgvanzee added a commit that referenced this pull request Apr 25, 2024
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
  commits that affect non-trivial aspects of BLIS. The major highlights:
  - Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
    cntl_t) to the thrinfo_t object. This allows the rntm_t to be
    effectively const (although it is sometimes copied internally and
    modified to reflect different ways of parallelism). Moving the mem_t
    sets the stage for sharing a global control tree amongst all
    threads.
  - De-templatized the macrokernels for gemmt, trmm, and trsm to match
    the macrokernel for gemm, which has been de-templatized since
    54fa28b.
  - Reimplemented bli_l3_determine_kc() by separating out the logic for
    adjusting KC based on MR/NR for triangular A and/or B into a new
    function, bli_l3_adjust_kc(). For now, this function is still called
    from bli_l3_determine_kc(), but in the future we plan to have it
    called once when constructing the control tree.
  - Refactored the level-3 thread decorator into two parts:
    - One part deals only with launching threads, each one calling a
      generic thread entry function. This code resides in frame/thread
      and constitutes the definition of bli_thread_launch(). Note that
      it is specific to the threading implementation (OpenMP, pthreads,
      single, etc.)
    - The other part deals with passing the matrix operands and related
      information into bli_thread_launch(). This is the "l3 decorator"
      and now resides in frame/3. It is agnostic to the threading
      implementation.
  - Modified the "level" of the thread control tree passed in at each
    operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
    passed in a communicator representing the active thread teams which
    would share the available work. Now, the *parent* thread comm is
    passed in. The operation then grabs the child comm and uses it to
    partition the work. The difference is in bli_trsm_blk_var1(), where
    there are now two children nodes for this single operation (i.e. the
    thread control tree is split one level above where the control tree
    is). The sub-prenode is used for the trsm subproblem while the
    normal sub-node is used for the gemm part. Importantly, the parent
    comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
  added back in the future when the control tree's creation is moved so
  that it happens much sooner (provided that bli_*_front() have not been
  absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
  for consistency. This includes _num_threads(), _thread_id(), _n_way(),
  _work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
  _am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
  misspelled.
- (cherry picked from commit aeb5f0c)

Fixed performance bug caused by redundant packing. (#680)

Details:
- Fixed a performance bug whereby multiple threads were redundantly
  packing the same (rather than separate) micropanels. This bug was
  caused by different parts of the code using the num_threads/thread_id
  field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
  standardize on the latter and provide a "fake" thrinfo_t sub-prenode
  in the thrinfo tree which consists of single-member thread teams. The
  single team with multiple threads node is still required since it and
  only it can be used to perform barriers and broadcasts (e.g. of the
  packed buffer pointer).
- (cherry picked from commit 29f79f0)

Fixed random segfault in test/3 drivers. (#788)

Details:
- Fixed a segfault in the non-gemm test drivers in test/3 that was the
  result of sometimes leaving either .n_str or .k_str fields of the
  params_t struct uninitialized, depending on the operation in question.
  For example, in test_hemm.c, init_def_params() would only initialize
  the .m_str and .n_str fields, but not the .k_str field. Even though
  hemm doesn't use a 'k' dimension, the proc_params() function (called
  via parse_cl_params()) universally attempts to convert all three into
  integers via sscanf(), which was understandably failing when one of
  those strings was a NULL pointer. I'm not sure how this code ever
  worked to begin with. Special thanks to Leick Robinson for finding and
  reporting this bug.
- (cherry picked from commit 1236dda)

Fixed staleness in kernels/zen/3/bli_gemm_small.c.

Details:
- Added missing 'const' keyword in function prototypes for
  bli_gemm_small() and friends.
- Updated pba usage to reflect new APIs.
- Fixed syntax typo in 'export GOMP_CPU_AFFINITY' line in ul2128
  conditional of test/3/runme.sh.
- Thanks to Jeff Diamond for reporting these issues.

Allow test/3 drivers to use default ind_t method. (#804)

Details:
- Previously, the standalone performance drivers in test/3 were written
  under the assumption that the user would want to explicitly test
  either native execution *or* 1m. But because the accompanying runme.sh
  script defaults to passing "native" in for the -i command line option
  (which explicitly sets the induced method type), running the script
  without modification causes the test drivers to use slow reference
  microkernels on systems where native complex-domain microkernels are
  not registered -- which will yield poor performance for complex-domain
  level-3 operations. Furthermore, even if a user was aware of this, the
  test drivers did not support any single value for the -i option that
  would test BLIS using the library's default behavior -- that is, using
  1m on systems where it is needed and native execution on systems that
  have native microkernels implemented and registered.
- This commit addresses the aforementioned issue by supporting a new
  value for the -i option: "auto". The "auto" value causes the driver
  to avoid explicitly setting the induced method altogether, leaving
  BLIS's default behavior in place. This "auto" option is also now the
  default setting within the runme.sh script. Thanks to Leick Robinson
  for finding and reporting this issue.
- Also added support for "nat" as a shorthand for "native", which
  the help text already (erroneously) claimed was supported.
- (cherry picked from commit fd1a7e3)
fgvanzee added a commit that referenced this pull request Apr 30, 2024
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
  commits that affect non-trivial aspects of BLIS. The major highlights:
  - Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
    cntl_t) to the thrinfo_t object. This allows the rntm_t to be
    effectively const (although it is sometimes copied internally and
    modified to reflect different ways of parallelism). Moving the mem_t
    sets the stage for sharing a global control tree amongst all
    threads.
  - De-templatized the macrokernels for gemmt, trmm, and trsm to match
    the macrokernel for gemm, which has been de-templatized since
    54fa28b.
  - Reimplemented bli_l3_determine_kc() by separating out the logic for
    adjusting KC based on MR/NR for triangular A and/or B into a new
    function, bli_l3_adjust_kc(). For now, this function is still called
    from bli_l3_determine_kc(), but in the future we plan to have it
    called once when constructing the control tree.
  - Refactored the level-3 thread decorator into two parts:
    - One part deals only with launching threads, each one calling a
      generic thread entry function. This code resides in frame/thread
      and constitutes the definition of bli_thread_launch(). Note that
      it is specific to the threading implementation (OpenMP, pthreads,
      single, etc.)
    - The other part deals with passing the matrix operands and related
      information into bli_thread_launch(). This is the "l3 decorator"
      and now resides in frame/3. It is agnostic to the threading
      implementation.
  - Modified the "level" of the thread control tree passed in at each
    operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
    passed in a communicator representing the active thread teams which
    would share the available work. Now, the *parent* thread comm is
    passed in. The operation then grabs the child comm and uses it to
    partition the work. The difference is in bli_trsm_blk_var1(), where
    there are now two children nodes for this single operation (i.e. the
    thread control tree is split one level above where the control tree
    is). The sub-prenode is used for the trsm subproblem while the
    normal sub-node is used for the gemm part. Importantly, the parent
    comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
  added back in the future when the control tree's creation is moved so
  that it happens much sooner (provided that bli_*_front() have not been
  absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
  for consistency. This includes _num_threads(), _thread_id(), _n_way(),
  _work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
  _am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
  misspelled.
- (cherry picked from commit aeb5f0c)

Fixed performance bug caused by redundant packing. (#680)

Details:
- Fixed a performance bug whereby multiple threads were redundantly
  packing the same (rather than separate) micropanels. This bug was
  caused by different parts of the code using the num_threads/thread_id
  field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
  standardize on the latter and provide a "fake" thrinfo_t sub-prenode
  in the thrinfo tree which consists of single-member thread teams. The
  single team with multiple threads node is still required since it and
  only it can be used to perform barriers and broadcasts (e.g. of the
  packed buffer pointer).
- (cherry picked from commit 29f79f0)

Fixed random segfault in test/3 drivers. (#788)

Details:
- Fixed a segfault in the non-gemm test drivers in test/3 that was the
  result of sometimes leaving either .n_str or .k_str fields of the
  params_t struct uninitialized, depending on the operation in question.
  For example, in test_hemm.c, init_def_params() would only initialize
  the .m_str and .n_str fields, but not the .k_str field. Even though
  hemm doesn't use a 'k' dimension, the proc_params() function (called
  via parse_cl_params()) universally attempts to convert all three into
  integers via sscanf(), which was understandably failing when one of
  those strings was a NULL pointer. I'm not sure how this code ever
  worked to begin with. Special thanks to Leick Robinson for finding and
  reporting this bug.
- (cherry picked from commit 1236dda)

Fixed staleness in kernels/zen/3/bli_gemm_small.c.

Details:
- Added missing 'const' keyword in function prototypes for
  bli_gemm_small() and friends.
- Updated pba usage to reflect new APIs.
- Fixed syntax typo in 'export GOMP_CPU_AFFINITY' line in ul2128
  conditional of test/3/runme.sh.
- Thanks to Jeff Diamond for reporting these issues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants