
Multi-threaded ggml_cpy #1035

Merged
slaren merged 3 commits into ggml-org:master from slaren:mt-cpy on Apr 18, 2023

Conversation

@slaren (Member) commented Apr 18, 2023

This reduces overall LoRA loading time significantly when using a different base model with --lora-base: from 32s to 24s in my test case.

It also seems to make ggml_cpy itself significantly faster in general, roughly twice as fast, but CPY is such a small fraction of the overall eval time that the difference isn't really noticeable there.

I tried to cover all the paths in ggml_cpy, but there are a lot of them and only a few are exercised by llama.cpp, so I have not tested every single one.

Perplexity (bs=512):

MASTER: perf_total_per_op_us[             CPY] =  309.170 ms
PR:     perf_total_per_op_us[             CPY] =  132.353 ms

LoRA (quantize):

MASTER: perf_total_per_op_us[             CPY] =  45.780 ms
PR:     perf_total_per_op_us[             CPY] =   5.255 ms
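
For context on what "multi-threaded" means here: ggml compute kernels receive ith (thread index) and nth (thread count) in ggml_compute_params and split the tensor's rows between threads. Below is a minimal sketch of that row-partitioning pattern, assuming contiguous f32 tensors; it is illustrative only, not the exact kernel from this PR:

    #include <string.h>
    #include "ggml.h"

    // Illustrative sketch only: ggml_compute_params is internal to ggml.c, and
    // the real ggml_compute_forward_dup_* kernels also handle f16, quantized
    // destination types, and non-contiguous layouts.
    static void dup_f32_rows_sketch(const struct ggml_compute_params * params,
                                    const struct ggml_tensor * src,
                                    struct ggml_tensor * dst) {
        const int ith = params->ith;            // index of this thread
        const int nth = params->nth;            // total number of threads
        const int nr  = (int) ggml_nrows(src);  // total rows = ne[1]*ne[2]*ne[3]

        const int dr  = (nr + nth - 1)/nth;     // rows per thread, rounded up
        const int ir0 = dr*ith;                 // first row for this thread
        const int ir1 = ir0 + dr < nr ? ir0 + dr : nr; // one past the last row

        for (int ir = ir0; ir < ir1; ++ir) {
            memcpy((char *)       dst->data + ir*dst->nb[1],
                   (const char *) src->data + ir*src->nb[1],
                   src->ne[0]*sizeof(float)); // copy one contiguous f32 row
        }
    }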

@sw (Contributor) commented Apr 18, 2023

I'm barely seeing an improvement (AVX2, 4 cores). This is about the run time of llama_apply_lora_from_file_internal, right? Can you show exactly which command line you're using?

@slaren (Member, Author) commented Apr 18, 2023

This path is only used when LoRA is applied on top of a different base model specified with --lora-base; otherwise the quantization is done in a ggml_add instead. You can use a command line similar to this one:

./main -m models/7B/ggml-model-q4_0.bin --lora lora/baize-lora-7B/ggml-adapter-model.bin --lora-base models/7B/ggml-model-f16.bin
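
To see where the cpy comes from: roughly, llama_apply_lora_from_file_internal builds a small graph per tensor along these lines (a simplified sketch; error handling omitted and variable names abbreviated):

    // Simplified sketch of the --lora-base path (not the exact llama.cpp code):
    struct ggml_tensor * BA = ggml_mul_mat(lora_ctx, loraA, loraB); // low-rank delta
    if (scaling != 1.0f) {
        // alpha/rank scaling of the delta
        BA = ggml_scale(lora_ctx, BA, ggml_new_f32(lora_ctx, scaling));
    }
    struct ggml_tensor * r;
    if (base_t == dest_t) {
        // no separate base model: accumulate the delta in place via ggml_add
        r = ggml_add_inplace(lora_ctx, dest_t, BA);
    } else {
        // separate f16 base from --lora-base: add the delta, then quantize-copy
        // the f32 result back into the (e.g. q4_0) model tensor -- this is the
        // ggml_cpy that this PR multi-threads
        r = ggml_add(lora_ctx, base_t, BA);
        r = ggml_cpy(lora_ctx, r, dest_t);
    }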

@sw (Contributor) commented Apr 18, 2023

Thanks @slaren. I'm seeing 17s on master and 16s with this PR.

Just because the SIMD optimizations were up for discussion: with quantize_row_q_reference in ggml_compute_forward_dup_f16, the difference is greater: 33s on master, 20s with this PR.
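
That difference makes sense given what the quantizing branch of ggml_compute_forward_dup_f16 does per row. Schematically (not the exact code), each thread converts its rows to f32 in a per-thread slice of the wdata scratch buffer and then quantizes:

    // Schematic per-row work in the quantizing branch of
    // ggml_compute_forward_dup_f16 (names simplified):
    float * wrow = (float *) params->wdata + ith*ne00; // per-thread f32 scratch
    for (int i = 0; i < ne00; ++i) {
        wrow[i] = GGML_FP16_TO_FP32(src_row[i]);       // f16 -> f32
    }
    quantize_row_q(wrow, dst_row, ne00);               // f32 -> q4_0/q4_1/...

The slower the quantize_row_q implementation, the more of the runtime this loop accounts for, so multi-threading it pays off more with the scalar reference version than with the SIMD one.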

@ggerganov (Member) left a comment on ggml.c (Outdated)

ggml_compute_forward_dup_f32() and ggml_compute_forward_dup_f16() need some good refactoring soon 😄

slaren merged commit 6667401 into ggml-org:master on Apr 18, 2023
slaren deleted the mt-cpy branch on April 18, 2023 at 22:53
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* Multi-threaded ggml_cpy

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Also fix wdata offset in ggml_compute_forward_add_q_f32

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
