ggml : group all experts in a single ggml_mul_mat_id (#6505)
Conversation
cuda : improve mmid row copy
Benchmarked this on a Ryzen 5800X (64GB DDR4@3733MT CL16) and a Tesla P40 24GB, with 28/33 layers offloaded. The model is Bagel Mistery Tour 8x7b (Mixtral).

Alright, wow. PP went down from 11.85 ms/t to 4.95 ms/t (with partial offloading, 5 layers of Mixtral on a 2060). Simply incredible, but I'm not surprised anymore, as slaren always delivers. llama.cpp's MoE implementation is now extremely robust.

I also noticed that this PR uses significantly less CUDA buffer memory (50% less) compared to master, which allowed an extra layer at low context. PR vs. master (build = 2620 (d4f220a)).

Just tested with 8k context... not as much of a saving there.

@JohannesGaessler I had already tested most of these, but on my 3090 I didn't see a meaningful improvement. Anyway, I have pushed my changes, which I think already cover all of that. In the long term the goal is to use a grouped GEMM with CUTLASS without requiring a synchronization. I think that this will also allow removing the row rearrangement entirely, which has a significant cost.
Some quick performance comparisons from me:

I am measuring a performance difference from the changes:

That would definitely help. When you look into this, I think it would also make sense to check whether it is possible to set the input/output type to FP32 to avoid the conversion of some tensors. (With the input I suspect that it's probably not possible, though.)
Definite speedup on the P40 with a larger iq4_xs quant with partial offloading.
@ggerganov I really do not want to have to modify 21 functions in exactly the same way again, so I would rather spend some time refactoring. Did you find any reason that would prevent making the Metal kernels a template?

No reason at all - I simply wasn't able to fit this into templates. Thanks for doing it - it was very ugly before.
…nstead of silu. Do not spend too much time on this function as it will be replaced in #6505.
Hey there, just wondering if there's any reason this isn't ready to be merged yet. I have heard a couple of reports that it's really beneficial for Mixtral PP speed for some people who have tried it.

It's still missing a Metal implementation. It should be good for CPU and CUDA already.
Let's rebase on `master`.
ggerganov left a comment:
Very nice! M2 Ultra results (`-ub 256` is optimal):

`./scripts/compare-commits.sh master sl/moe-rework-2 -m models/mixtral-8x7b-32k-fast/ggml-model-f16.gguf -ub 256 -p 1,2,4,8,16,32,64,128,256,512`

| CPU | Model | Test | t/s master | t/s sl/moe-rework-2 | Speedup |
|---|---|---|---|---|---|
| M2 Ultra | llama 8x7B F16 | pp1 | 22.28 | 23.15 | 1.04 |
| M2 Ultra | llama 8x7B F16 | pp2 | 21.18 | 22.11 | 1.04 |
| M2 Ultra | llama 8x7B F16 | pp4 | 26.67 | 27.40 | 1.03 |
| M2 Ultra | llama 8x7B F16 | pp8 | 30.73 | 44.21 | 1.44 |
| M2 Ultra | llama 8x7B F16 | pp16 | 50.73 | 79.88 | 1.57 |
| M2 Ultra | llama 8x7B F16 | pp32 | 90.07 | 154.87 | 1.72 |
| M2 Ultra | llama 8x7B F16 | pp64 | 155.10 | 263.48 | 1.70 |
| M2 Ultra | llama 8x7B F16 | pp128 | 256.59 | 357.97 | 1.40 |
| M2 Ultra | llama 8x7B F16 | pp256 | 319.72 | 370.37 | 1.16 |
| M2 Ultra | llama 8x7B F16 | pp512 | 319.97 | 370.87 | 1.16 |
| M2 Ultra | llama 8x7B F16 | tg128 | 22.38 | 23.12 | 1.03 |
@NeoZhangJianyu @airMeng This change will break mul_mat_id in SYCL again. Sorry for the inconvenience; the change to the interface was necessary to improve performance.
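For anyone updating a backend: the grouped interface takes all the expert weights stacked along the third dimension of a single tensor, plus the per-token expert ids, in one call instead of one call per expert. A rough sketch of the call site, assuming my reading of the new signature; the shape comments and the wrapper name `moe_ffn_up` are illustrative, not copied from the diff:

```c
#include "ggml.h"

// Rough sketch of the grouped call (shapes are my reading of the change):
//   as : [n_embd, n_ff, n_expert]           - all expert weights in one tensor
//   b  : [n_embd, n_expert_used, n_tokens]  - input rows per selected expert
//   ids: [n_expert_used, n_tokens]          - which expert each row uses
struct ggml_tensor * moe_ffn_up(struct ggml_context * ctx,
                                struct ggml_tensor  * as,
                                struct ggml_tensor  * b,
                                struct ggml_tensor  * ids) {
    // a single op covers every expert, instead of one mul_mat_id per expert
    return ggml_mul_mat_id(ctx, as, b, ids);
}
```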
@ggerganov Do you know why the ggml-ci cuda-v100 failed? The log ends in the middle of a test run.
Yes, it exceeded 30 min. We can either increase the timeout to 40 min, or maybe not run all of the tests.
Got it! I will study and fix this later. Thanks for the reminder!

@slaren can we have a workaround, like macros in llama.cpp or a fallback to CPU, to maintain SYCL capabilities? Then SYCL will not block your merging, and we can have more time for the SYCL kernels (I was just assigned a JIRA ticket about MoE, so maybe I can reuse that effort).
We could disable offloading of MoE models when using SYCL by setting
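The exact setting is cut off above, but as a hedged illustration (not the PR's actual code), such a guard in llama.cpp might look like the sketch below, assuming the existing `GGML_USE_SYCL` build flag and a hypothetical helper that forces the layer count to zero for MoE models:

```cpp
#include <cstdint>

// Illustrative sketch only: decide how many layers to offload, keeping MoE
// models fully on the CPU when built with SYCL, since the SYCL mul_mat_id
// does not support the new grouped interface yet.
static int32_t effective_gpu_layers(int32_t n_gpu_layers, uint32_t n_expert) {
#ifdef GGML_USE_SYCL
    if (n_expert > 0) {
        return 0; // hypothetical guard: keep all layers on the CPU
    }
#endif
    return n_gpu_layers;
}
```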
I think that the problem is that there are too many types. We can run the full tests only for a few types, and a basic test only for the rest. |
Yes. For now, should I bump the timeout to 40 min and figure out a test reduction later on?
I think this is good enough for now. There are full tests with a few types to verify the logic, and then a simple test with the other types to check if they work at all. |
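For illustration, a minimal sketch of that kind of split, with hypothetical helper names (`add_full_mul_mat_tests`, `add_basic_mul_mat_test`) and an assumed `base_types` list; the real registration in `test-backend-ops` is structured differently:

```cpp
#include <algorithm>
#include <vector>
#include "ggml.h"

// Hypothetical helpers standing in for the real test constructors.
static void add_full_mul_mat_tests(ggml_type type) { /* full shape/batch matrix */ }
static void add_basic_mul_mat_test(ggml_type type) { /* one small smoke case   */ }

static void register_mul_mat_tests(const std::vector<ggml_type> & all_types) {
    // Full coverage only for a few representative "base" types; every other
    // quant type gets a single smoke test to check that the kernel runs.
    static const std::vector<ggml_type> base_types = {
        GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_Q4_0, GGML_TYPE_Q8_0,
    };
    for (ggml_type type : all_types) {
        if (std::find(base_types.begin(), base_types.end(), type) != base_types.end()) {
            add_full_mul_mat_tests(type);
        } else {
            add_basic_mul_mat_test(type);
        }
    }
}
```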
* ggml : group all experts in a single ggml_mul_mat_id
* cuda : improve mmid row copy
* cuda : fix bin bcast with non-cont src0
* test-backend-ops : only run all mul mat tests for base types
* llama : disable moe offloading with SYCL

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

This should significantly improve the performance of MoE models with CUDA. The rearrangement of rows in the CUDA backend is also now done with custom kernels instead of memcpys; that accounts for about 50% of the speedup here.
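For illustration, the rearrangement is essentially a gather of rows by an index mapping, which a single kernel can do in one launch instead of one `cudaMemcpyAsync` per row. A minimal CUDA sketch with hypothetical names (`k_copy_rows`, `row_mapping`); the PR's actual kernels are more involved:

```cuda
#include <cstdint>

// Hypothetical sketch of a row-gather kernel: copy src rows into dst in the
// order given by row_mapping, one thread block per destination row. One
// launch replaces a loop of per-row async memcpys.
static __global__ void k_copy_rows(const char    * __restrict__ src,
                                   char          * __restrict__ dst,
                                   const int32_t * __restrict__ row_mapping,
                                   int64_t row_size) {
    const int64_t dst_row = blockIdx.x;
    const int64_t src_row = row_mapping[dst_row];
    const char * s = src + src_row * row_size;
    char       * d = dst + dst_row * row_size;
    // threads of the block copy the row cooperatively
    for (int64_t i = threadIdx.x; i < row_size; i += blockDim.x) {
        d[i] = s[i];
    }
}

// launch (illustrative):
//   k_copy_rows<<<n_rows, 256, 0, stream>>>(src, dst, row_mapping, row_size);
```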