
Multi-threaded ggml_cpy #1035

Merged
slaren merged 3 commits into ggml-org:master from slaren:mt-cpy on Apr 18, 2023

Conversation

@slaren (Member) commented Apr 18, 2023

This reduces overall LoRA loading time significantly when using a different base model with --lora-base: from 32s to 24s in my test case.

It also seems to make ggml_cpy itself significantly faster in general, roughly twice as fast, but CPY is such a small fraction of the overall eval time that the difference isn't really noticeable there.

I tried to cover all the paths in ggml_cpy, but there are a lot of them and only a few are exercised by llama.cpp, so I have not tested every single one.

Perplexity (bs=512):

MASTER: perf_total_per_op_us[             CPY] =  309.170 ms
PR:     perf_total_per_op_us[             CPY] =  132.353 ms

LoRA (quantize):

MASTER: perf_total_per_op_us[             CPY] =  45.780 ms
PR:     perf_total_per_op_us[             CPY] =   5.255 ms
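
For context on what "multi-threaded" means here: ggml compute kernels receive ith (thread index) and nth (thread count) in ggml_compute_params and split the tensor's rows between threads. Below is a minimal sketch of that row-partitioning pattern, assuming contiguous f32 tensors; it is illustrative only, not the exact kernel from this PR:

    #include <string.h>
    #include "ggml.h"

    // Illustrative sketch only: ggml_compute_params is internal to ggml.c, and
    // the real ggml_compute_forward_dup_* kernels also handle f16, quantized
    // destination types, and non-contiguous layouts.
    static void dup_f32_rows_sketch(const struct ggml_compute_params * params,
                                    const struct ggml_tensor * src,
                                    struct ggml_tensor * dst) {
        const int ith = params->ith;            // index of this thread
        const int nth = params->nth;            // total number of threads
        const int nr  = (int) ggml_nrows(src);  // total rows = ne[1]*ne[2]*ne[3]

        const int dr  = (nr + nth - 1)/nth;     // rows per thread, rounded up
        const int ir0 = dr*ith;                 // first row for this thread
        const int ir1 = ir0 + dr < nr ? ir0 + dr : nr; // one past the last row

        for (int ir = ir0; ir < ir1; ++ir) {
            memcpy((char *)       dst->data + ir*dst->nb[1],
                   (const char *) src->data + ir*src->nb[1],
                   src->ne[0]*sizeof(float)); // copy one contiguous f32 row
        }
    }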

@sw (Contributor) commented Apr 18, 2023

I'm barely seeing an improvement (AVX2, 4 cores). This is about the run time of llama_apply_lora_from_file_internal, right? Can you show exactly which command line you're using?

@slaren (Member, Author) commented Apr 18, 2023

This path is only used when LoRA is applied on top of a different base model specified with --lora-base; otherwise the quantization is done in a ggml_add instead. You can use a command line similar to this one:

./main -m models/7B/ggml-model-q4_0.bin --lora lora/baize-lora-7B/ggml-adapter-model.bin --lora-base models/7B/ggml-model-f16.bin
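
To see where the cpy comes from: roughly, llama_apply_lora_from_file_internal builds a small graph per tensor along these lines (a simplified sketch; error handling omitted and variable names abbreviated):

    // Simplified sketch of the --lora-base path (not the exact llama.cpp code):
    struct ggml_tensor * BA = ggml_mul_mat(lora_ctx, loraA, loraB); // low-rank delta
    if (scaling != 1.0f) {
        // alpha/rank scaling of the delta
        BA = ggml_scale(lora_ctx, BA, ggml_new_f32(lora_ctx, scaling));
    }
    struct ggml_tensor * r;
    if (base_t == dest_t) {
        // no separate base model: accumulate the delta in place via ggml_add
        r = ggml_add_inplace(lora_ctx, dest_t, BA);
    } else {
        // separate f16 base from --lora-base: add the delta, then quantize-copy
        // the f32 result back into the (e.g. q4_0) model tensor -- this is the
        // ggml_cpy that this PR multi-threads
        r = ggml_add(lora_ctx, base_t, BA);
        r = ggml_cpy(lora_ctx, r, dest_t);
    }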

@sw (Contributor) commented Apr 18, 2023

Thanks @slaren. I'm seeing 17s on master and 16s with this PR.

Just because the SIMD optimizations were up for discussion: with quantize_row_q_reference in ggml_compute_forward_dup_f16, the difference is greater: 33s on master, 20s with this PR.
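
That difference makes sense given what the quantizing branch of ggml_compute_forward_dup_f16 does per row. Schematically (not the exact code), each thread converts its rows to f32 in a per-thread slice of the wdata scratch buffer and then quantizes:

    // Schematic per-row work in the quantizing branch of
    // ggml_compute_forward_dup_f16 (names simplified):
    float * wrow = (float *) params->wdata + ith*ne00; // per-thread f32 scratch
    for (int i = 0; i < ne00; ++i) {
        wrow[i] = GGML_FP16_TO_FP32(src_row[i]);       // f16 -> f32
    }
    quantize_row_q(wrow, dst_row, ne00);               // f32 -> q4_0/q4_1/...

The slower the quantize_row_q implementation, the more of the runtime this loop accounts for, so multi-threading it pays off more with the scalar reference version than with the SIMD one.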

@ggerganov (Member) left a comment on ggml.c (Outdated)

ggml_compute_forward_dup_f32() and ggml_compute_forward_dup_f16() need some good refactoring soon 😄

slaren merged commit 6667401 into ggml-org:master on Apr 18, 2023
slaren deleted the mt-cpy branch on April 18, 2023 at 22:53
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* Multi-threaded ggml_cpy

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Also fix wdata offset in ggml_compute_forward_add_q_f32

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
