Discussed in #616
Originally posted by izard March 30, 2023
I profiled on a recent MacBook Pro and found that significantly more time is spent in the atomic checks of state_shared.has_work in the busy-wait loops than in the actual matrix-multiply work.
So I changed busy waits like:
pthread_mutex_lock(&state->shared->mutex);
while (state->shared->has_work) {
    pthread_cond_wait(&state->shared->cond, &state->shared->mutex);
}
pthread_mutex_unlock(&state->shared->mutex);
and set has_work like this:
pthread_mutex_lock(&state_shared.mutex);
state_shared.has_work = true;
pthread_cond_broadcast(&state_shared.cond);
pthread_mutex_unlock(&state_shared.mutex);
Got a nice 2x speedup in time/token.
I can't post a patch/pull request because everything I do in spare time still belongs to my employer, but the change is trivial as described above. Probably won't provide much benefit (if any) for other platforms though.