ggml : group all experts in a single ggml_mul_mat_id (#6505)
Conversation
cuda : improve mmid row copy
Benchmarked this on a Ryzen 5800X (64GB DDR4@3733MT CL16) and a Tesla P40 24GB, with 28/33 layers offloaded. The model is Bagel Mistery Tour 8x7b (Mixtral).

Alright, wow. PP went down from 11.85 ms/t to 4.95 ms/t (with partial offloading, 5 layers of Mixtral on a 2060). Simply incredible, but I'm not surprised anymore, as slaren always delivers. llama.cpp's MoE implementation is now extremely robust.

I also noticed that this PR uses significantly less CUDA buffer memory (50% less) compared to master, which allowed an extra layer at low context. PR vs. master (build = 2620 (d4f220a)).

Just tested with 8k context... not as much of a saving there.

@JohannesGaessler I had already tested most of these, but on my 3090 I didn't see a meaningful improvement. Anyway, I have pushed my changes, which I think already cover all of that. In the long term the goal is to use a grouped GEMM with CUTLASS without requiring a synchronization. I think that this will also allow removing the row rearrangement entirely, which has a significant cost.
Some quick performance comparisons from me:

I am measuring a performance difference from the changes:

That would definitely help. When you look into this, I think it would also make sense to check whether it is possible to set the input/output type to FP32 to avoid the conversion of some tensors. (With the input I suspect that it's probably not possible, though.)
Definite speedup on the P40 with a larger iq4_xs quant with partial offloading.
@ggerganov I really do not want to have to modify 21 functions in exactly the same way again, so I would rather spend some time refactoring. Did you find any reason that would prevent making the Metal kernels a template?

No reason at all - I simply wasn't able to fit this into templates. Thanks for doing it - it was very ugly before.
…nstead of silu. Do not spend too much time on this function as it will be replaced in #6505.
Hey there, just wondering if there's any reason this isn't ready to be merged yet. I have heard a couple of reports that it's really beneficial for Mixtral PP speed for some people who have tried it.

It's still missing a Metal implementation. It should be good for CPU and CUDA already.
Let's rebase on `master`.
ggerganov left a comment:
Very nice! M2 Ultra results (`-ub 256` is optimal):

`./scripts/compare-commits.sh master sl/moe-rework-2 -m models/mixtral-8x7b-32k-fast/ggml-model-f16.gguf -ub 256 -p 1,2,4,8,16,32,64,128,256,512`

| CPU | Model | Test | t/s master | t/s sl/moe-rework-2 | Speedup |
|---|---|---|---|---|---|
| M2 Ultra | llama 8x7B F16 | pp1 | 22.28 | 23.15 | 1.04 |
| M2 Ultra | llama 8x7B F16 | pp2 | 21.18 | 22.11 | 1.04 |
| M2 Ultra | llama 8x7B F16 | pp4 | 26.67 | 27.40 | 1.03 |
| M2 Ultra | llama 8x7B F16 | pp8 | 30.73 | 44.21 | 1.44 |
| M2 Ultra | llama 8x7B F16 | pp16 | 50.73 | 79.88 | 1.57 |
| M2 Ultra | llama 8x7B F16 | pp32 | 90.07 | 154.87 | 1.72 |
| M2 Ultra | llama 8x7B F16 | pp64 | 155.10 | 263.48 | 1.70 |
| M2 Ultra | llama 8x7B F16 | pp128 | 256.59 | 357.97 | 1.40 |
| M2 Ultra | llama 8x7B F16 | pp256 | 319.72 | 370.37 | 1.16 |
| M2 Ultra | llama 8x7B F16 | pp512 | 319.97 | 370.87 | 1.16 |
| M2 Ultra | llama 8x7B F16 | tg128 | 22.38 | 23.12 | 1.03 |
@NeoZhangJianyu @airMeng This change will break mul_mat_id in SYCL again. Sorry for the inconvenience; the change to the interface was necessary to improve performance.
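For anyone updating a backend: the grouped interface takes all the expert weights stacked along the third dimension of a single tensor, plus the per-token expert ids, in one call instead of one call per expert. A rough sketch of the call site, assuming my reading of the new signature; the shape comments and the wrapper name `moe_ffn_up` are illustrative, not copied from the diff:

```c
#include "ggml.h"

// Rough sketch of the grouped call (shapes are my reading of the change):
//   as : [n_embd, n_ff, n_expert]           - all expert weights in one tensor
//   b  : [n_embd, n_expert_used, n_tokens]  - input rows per selected expert
//   ids: [n_expert_used, n_tokens]          - which expert each row uses
struct ggml_tensor * moe_ffn_up(struct ggml_context * ctx,
                                struct ggml_tensor  * as,
                                struct ggml_tensor  * b,
                                struct ggml_tensor  * ids) {
    // a single op covers every expert, instead of one mul_mat_id per expert
    return ggml_mul_mat_id(ctx, as, b, ids);
}
```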
@ggerganov Do you know why the ggml-ci cuda-v100 failed? The log ends in the middle of a test run.
Yes, it exceeded 30 min. We can either increase the timeout to 40 min, or maybe not run all of the tests.
Got it! I will study and fix this later. Thanks for the reminder!

@slaren can we have a workaround, like macros in llama.cpp or a fallback to CPU, to maintain SYCL capabilities? Then SYCL will not block your merging, and we can have more time for the SYCL kernels (I was just assigned a JIRA ticket about MoE, so maybe I can reuse that effort).
We could disable offloading of MoE models when using SYCL by setting
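The exact setting is cut off above, but as a hedged illustration (not the PR's actual code), such a guard in llama.cpp might look like the sketch below, assuming the existing `GGML_USE_SYCL` build flag and a hypothetical helper that forces the layer count to zero for MoE models:

```cpp
#include <cstdint>

// Illustrative sketch only: decide how many layers to offload, keeping MoE
// models fully on the CPU when built with SYCL, since the SYCL mul_mat_id
// does not support the new grouped interface yet.
static int32_t effective_gpu_layers(int32_t n_gpu_layers, uint32_t n_expert) {
#ifdef GGML_USE_SYCL
    if (n_expert > 0) {
        return 0; // hypothetical guard: keep all layers on the CPU
    }
#endif
    return n_gpu_layers;
}
```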
I think that the problem is that there are too many types. We can run the full tests only for a few types, and a basic test only for the rest. |
Yes. For now, should I bump the timeout to 40 min and figure out a test reduction later on?
I think this is good enough for now. There are full tests with a few types to verify the logic, and then a simple test with the other types to check if they work at all. |
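For illustration, a minimal sketch of that kind of split, with hypothetical helper names (`add_full_mul_mat_tests`, `add_basic_mul_mat_test`) and an assumed `base_types` list; the real registration in `test-backend-ops` is structured differently:

```cpp
#include <algorithm>
#include <vector>
#include "ggml.h"

// Hypothetical helpers standing in for the real test constructors.
static void add_full_mul_mat_tests(ggml_type type) { /* full shape/batch matrix */ }
static void add_basic_mul_mat_test(ggml_type type) { /* one small smoke case   */ }

static void register_mul_mat_tests(const std::vector<ggml_type> & all_types) {
    // Full coverage only for a few representative "base" types; every other
    // quant type gets a single smoke test to check that the kernel runs.
    static const std::vector<ggml_type> base_types = {
        GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_Q4_0, GGML_TYPE_Q8_0,
    };
    for (ggml_type type : all_types) {
        if (std::find(base_types.begin(), base_types.end(), type) != base_types.end()) {
            add_full_mul_mat_tests(type);
        } else {
            add_basic_mul_mat_test(type);
        }
    }
}
```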
* ggml : group all experts in a single ggml_mul_mat_id
* cuda : improve mmid row copy
* cuda : fix bin bcast with non-cont src0
* test-backend-ops : only run all mul mat tests for base types
* llama : disable moe offloading with SYCL

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

This should significantly improve the performance of MoE models with CUDA. The rearrangement of rows in the CUDA backend is also now done with custom kernels instead of memcpys; that accounts for about 50% of the speedup here.
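For illustration, the rearrangement is essentially a gather of rows by an index mapping, which a single kernel can do in one launch instead of one `cudaMemcpyAsync` per row. A minimal CUDA sketch with hypothetical names (`k_copy_rows`, `row_mapping`); the PR's actual kernels are more involved:

```cuda
#include <cstdint>

// Hypothetical sketch of a row-gather kernel: copy src rows into dst in the
// order given by row_mapping, one thread block per destination row. One
// launch replaces a loop of per-row async memcpys.
static __global__ void k_copy_rows(const char    * __restrict__ src,
                                   char          * __restrict__ dst,
                                   const int32_t * __restrict__ row_mapping,
                                   int64_t row_size) {
    const int64_t dst_row = blockIdx.x;
    const int64_t src_row = row_mapping[dst_row];
    const char * s = src + src_row * row_size;
    char       * d = dst + dst_row * row_size;
    // threads of the block copy the row cooperatively
    for (int64_t i = threadIdx.x; i < row_size; i += blockDim.x) {
        d[i] = s[i];
    }
}

// launch (illustrative):
//   k_copy_rows<<<n_rows, 256, 0, stream>>>(src, dst, row_mapping, row_size);
```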