Multi-threaded ggml_cpy#1035
Conversation
> I'm barely seeing an improvement (AVX2, 4 cores). This is about the run time of

> This path is only used when LoRA is applied using a different base model specified with

> Thanks @slaren. I'm seeing 17s on master and 16s with your PR. Just because the SIMD optimizations were up for discussion: with
ggerganov left a comment:
`ggml_compute_forward_dup_f32()` and `ggml_compute_forward_dup_f16()` need some good refactoring soon 😄
* Multi-threaded ggml_cpy
* Update ggml.c

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Also fix wdata offset in ggml_compute_forward_add_q_f32

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Reduces overall LoRA loading times significantly when using a different base model with `--lora-base`: from 32s to 24s in my test case.

It also seems to improve the general performance of `ggml_cpy` significantly, about twice as fast, but overall this is an insignificant fraction of the eval time, so it isn't really noticeable.

I tried to cover all the paths in `ggml_cpy`, but there are a lot of them and only a few are hit in llama.cpp, so I have not tested every single one.

Perplexity (bs=512):
LoRA (quantize):