Discussed in #616
Originally posted by izard March 30, 2023
I profiled on a recent MacBook Pro and found that significantly more time is spent in the atomic checks of state_shared.has_work in the busy-wait loops than in the actual matrix-multiply work.
So I changed busy waits like:
pthread_mutex_lock(&state->shared->mutex);
while (state->shared->has_work) {
    pthread_cond_wait(&state->shared->cond, &state->shared->mutex);
}
pthread_mutex_unlock(&state->shared->mutex);
and set has_work like this:
pthread_mutex_lock(&state_shared.mutex);
state_shared.has_work = true;
pthread_cond_broadcast(&state_shared.cond);
pthread_mutex_unlock(&state_shared.mutex);
Got a nice 2x speedup in time/token.
I can't post a patch/pull request because everything I do in spare time still belongs to my employer, but the change is trivial as described above. Probably won't provide much benefit (if any) for other platforms though.