~2x perf improvement on Apple Silicon by changing state_shared.has_work access from atomic to mutex/conditional #633

@gjmulder

Discussed in #616

Originally posted by izard March 30, 2023
I profiled on a recent MacBook Pro and found that significantly more time is spent in the atomic checks of state_shared.has_work in the busy-wait while loops than in the actual matrix-multiply work.
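So I changed the busy waits to something like:

pthread_mutex_lock(&state->shared->mutex);
while (state->shared->has_work) {
    // sleep until another thread signals a change to has_work
    pthread_cond_wait(&state->shared->cond, &state->shared->mutex);
}
pthread_mutex_unlock(&state->shared->mutex);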

and changed the code that sets has_work to:

pthread_mutex_lock(&state_shared.mutex);
state_shared.has_work = true;
pthread_cond_broadcast(&state_shared.cond);
pthread_mutex_unlock(&state_shared.mutex);

Got a nice 2x speedup in time/token.

I can't post a patch/pull request because everything I do in my spare time still belongs to my employer, but the change is trivial, as described above. It probably won't provide much benefit (if any) on other platforms, though.
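
For readers who want to try the pattern in isolation, here is a minimal, self-contained sketch of the mutex/condition-variable handoff described above. It is not the actual ggml code: `struct shared_state`, `worker`, `run_demo`, and the `n_done` counter are hypothetical names made up for illustration, and only `has_work`, `mutex`, and `cond` come from the post. In this sketch the workers sleep until `has_work` becomes true; the snippet in the post waits on the opposite condition, and the same pattern applies either way.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORKERS 16

/* Hypothetical stand-in for the shared compute state. Only has_work,
   mutex and cond are taken from the issue; n_done exists for the demo. */
struct shared_state {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            has_work;
    int             n_done;
};

/* Worker: sleeps on the condition variable instead of spinning on an atomic. */
static void *worker(void *arg) {
    struct shared_state *shared = arg;

    pthread_mutex_lock(&shared->mutex);
    /* The loop re-checks the flag because pthread_cond_wait may wake spuriously. */
    while (!shared->has_work) {
        pthread_cond_wait(&shared->cond, &shared->mutex);
    }
    shared->n_done++;            /* placeholder for the real work */
    pthread_mutex_unlock(&shared->mutex);
    return NULL;
}

/* Starts n_workers threads, publishes work once, and returns how many
   workers observed it. */
int run_demo(int n_workers) {
    struct shared_state shared;
    pthread_t threads[MAX_WORKERS];

    if (n_workers < 0) n_workers = 0;
    if (n_workers > MAX_WORKERS) n_workers = MAX_WORKERS;

    pthread_mutex_init(&shared.mutex, NULL);
    pthread_cond_init(&shared.cond, NULL);
    shared.has_work = false;
    shared.n_done = 0;

    for (int i = 0; i < n_workers; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, &shared);
        assert(rc == 0);
        (void)rc;
    }

    /* Publish work: set the flag under the mutex, then wake every waiter. */
    pthread_mutex_lock(&shared.mutex);
    shared.has_work = true;
    pthread_cond_broadcast(&shared.cond);
    pthread_mutex_unlock(&shared.mutex);

    for (int i = 0; i < n_workers; i++) {
        pthread_join(threads[i], NULL);
    }

    int done = shared.n_done;
    pthread_cond_destroy(&shared.cond);
    pthread_mutex_destroy(&shared.mutex);
    return done;
}
```

Calling `run_demo(4)` starts four workers that block in `pthread_cond_wait` without burning CPU, then releases them with a single broadcast. Setting the flag while holding the mutex is what prevents a worker from checking the flag, missing the broadcast, and sleeping forever. Whether this beats spinning on a given machine depends on how short the waits are, so it is worth measuring rather than assuming.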