metal : refactor + optimize by ggerganov · Pull Request #15857 · ggml-org/llama.cpp

ggerganov · 2025-09-07T16:34:04Z

This PR introduces a dynamic way to compile and use Metal kernels. On master, all kernels had to be pre-compiled during backend initialization. Now the kernels that are actually needed can be compiled on the fly during inference.
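To make the idea of on-demand compilation concrete, here is a minimal sketch of a get-or-compile cache. It is not the code from this PR: the `PipelineCache` class, the `fake_compile` stand-in, and the kernel name `fa_kernel_example` are all hypothetical and only illustrate the pattern of compiling a kernel the first time it is requested and reusing it afterwards.

```python
from typing import Callable, Dict


class PipelineCache:
    """Hypothetical cache: compile a kernel on first use, reuse it later."""

    def __init__(self, compile_fn: Callable[[str], str]) -> None:
        self._compile_fn = compile_fn
        self._pipelines: Dict[str, str] = {}
        self.compile_count = 0  # how many real compilations happened

    def get(self, name: str) -> str:
        pipeline = self._pipelines.get(name)
        if pipeline is None:
            # Cache miss: pay the compilation cost once, at first use.
            pipeline = self._compile_fn(name)
            self._pipelines[name] = pipeline
            self.compile_count += 1
        return pipeline


def fake_compile(name: str) -> str:
    # Stand-in for a real shader compilation step (assumption).
    return f"pipeline<{name}>"


cache = PipelineCache(fake_compile)
first = cache.get("fa_kernel_example")   # compiled here
second = cache.get("fa_kernel_example")  # served from the cache
print(cache.compile_count)  # 1
```

With this pattern, kernels that a given model never uses are never compiled, at the price of a one-time compilation delay the first time each kernel is needed.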

This change makes it possible to compile significantly more optimized kernels that are specialized for the specific shapes of the current computation. This is achieved through the MTLFunctionConstant mechanism: with the relevant values fixed when the kernel is compiled, loops can be unrolled much more effectively, which improves overall performance.
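As a loose analogy for why a constant known at compile time helps, the sketch below generates a dot-product function with the vector length baked in, so the loop disappears entirely. This is plain Python code generation, not Metal and not this PR's code; `specialize_dot` is a hypothetical helper used only to illustrate specialization on a known shape.

```python
from typing import Callable, List, Tuple


def specialize_dot(n: int) -> Tuple[Callable[[List[float], List[float]], float], str]:
    """Build a dot product for a fixed length n with the loop fully unrolled."""
    # Because n is known when the function is generated, every term can be
    # written out explicitly instead of being iterated over at run time.
    terms = " + ".join(f"a[{i}] * b[{i}]" for i in range(n)) or "0"
    source = f"def dot(a, b):\n    return {terms}\n"
    namespace: dict = {}
    exec(source, namespace)  # "compile" the specialized variant
    return namespace["dot"], source


dot4, src = specialize_dot(4)
print(src)
print(dot4([1, 2, 3, 4], [5, 6, 7, 8]))  # 70
```

Each distinct value of `n` yields a distinct specialized variant, which is why this kind of specialization pairs naturally with compiling kernels on demand rather than ahead of time.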

The new kernel loading mechanism is currently applied to all Flash Attention (FA) kernels. In follow-up PRs, we will gradually move the remaining kernels to it and use function constants to improve their performance.

A secondary change is improved vector kernels for Q8_0.

These changes improve performance in nearly every benchmark below. The larger the context, the larger the speedup.

Model	Test	t/s master	t/s gg/metal-refactor	Speedup
gemma3 12B Q8_0	pp512	807.32	838.85	1.04
gemma3 12B Q8_0	pp512@d1024	725.71	797.56	1.10
gemma3 12B Q8_0	pp512@d2048	705.75	784.87	1.11
gemma3 12B Q8_0	pp512@d8192	617.82	730.08	1.18
gemma3 12B Q8_0	pp512@d32768	407.97	573.20	1.41
gemma3 12B Q8_0	tg32	42.12	42.97	1.02
gemma3 12B Q8_0	tg32@d1024	40.80	41.80	1.02
gemma3 12B Q8_0	tg32@d32768	35.55	36.26	1.02
gemma3 4B Q8_0	pp512	2473.10	2641.20	1.07
gemma3 4B Q8_0	pp512@d1024	2164.10	2478.27	1.15
gemma3 4B Q8_0	pp512@d2048	2095.75	2414.46	1.15
gemma3 4B Q8_0	pp512@d8192	1790.75	2219.47	1.24
gemma3 4B Q8_0	pp512@d32768	1138.71	1664.12	1.46
gemma3 4B Q8_0	tg32	98.78	98.75	1.00
gemma3 4B Q8_0	tg32@d1024	93.88	95.53	1.02
gemma3 4B Q8_0	tg32@d32768	81.96	83.43	1.02
gpt-oss 120B MXFP4 MoE	pp512	1208.74	1222.50	1.01
gpt-oss 120B MXFP4 MoE	pp512@d1024	1171.21	1196.71	1.02
gpt-oss 120B MXFP4 MoE	pp512@d2048	1112.20	1149.21	1.03
gpt-oss 120B MXFP4 MoE	pp512@d8192	899.23	980.58	1.09
gpt-oss 120B MXFP4 MoE	pp512@d32768	507.91	607.60	1.20
gpt-oss 120B MXFP4 MoE	tg32	80.43	83.52	1.04
gpt-oss 120B MXFP4 MoE	tg32@d1024	79.57	81.16	1.02
gpt-oss 120B MXFP4 MoE	tg32@d32768	57.21	61.25	1.07
gpt-oss 20B MXFP4 MoE	pp512	2352.55	2404.28	1.02
gpt-oss 20B MXFP4 MoE	pp512@d1024	2198.80	2283.64	1.04
gpt-oss 20B MXFP4 MoE	pp512@d2048	2084.15	2188.54	1.05
gpt-oss 20B MXFP4 MoE	pp512@d8192	1591.06	1764.50	1.11
gpt-oss 20B MXFP4 MoE	pp512@d32768	822.74	1000.95	1.22
gpt-oss 20B MXFP4 MoE	tg32	116.02	122.63	1.06
gpt-oss 20B MXFP4 MoE	tg32@d1024	114.13	119.72	1.05
gpt-oss 20B MXFP4 MoE	tg32@d32768	83.72	88.65	1.06
llama 8B Q8_0	pp512	1308.80	1320.41	1.01
llama 8B Q8_0	pp512@d1024	1221.28	1253.95	1.03
llama 8B Q8_0	pp512@d2048	1043.15	1191.10	1.14
llama 8B Q8_0	pp512@d8192	829.73	927.69	1.12
llama 8B Q8_0	pp512@d32768	378.31	478.48	1.26
llama 8B Q8_0	tg32	67.56	71.86	1.06
llama 8B Q8_0	tg32@d1024	64.71	70.27	1.09
llama 8B Q8_0	tg32@d32768	41.17	45.17	1.10
qwen2 1.5B Q8_0	pp512	6106.18	6210.50	1.02
qwen2 1.5B Q8_0	pp512@d1024	5373.06	5613.40	1.04
qwen2 1.5B Q8_0	pp512@d2048	4784.09	5078.52	1.06
qwen2 1.5B Q8_0	pp512@d8192	2596.45	3246.83	1.25
qwen2 1.5B Q8_0	pp512@d32768	940.35	1332.19	1.42
qwen2 1.5B Q8_0	tg32	184.88	187.62	1.01
qwen2 1.5B Q8_0	tg32@d1024	164.85	180.98	1.10
qwen2 1.5B Q8_0	tg32@d32768	114.57	121.77	1.06
qwen2 3B Q8_0	pp512	2961.57	2992.26	1.01
qwen2 3B Q8_0	pp512@d1024	2624.24	2742.73	1.05
qwen2 3B Q8_0	pp512@d2048	2265.07	2531.85	1.12
qwen2 3B Q8_0	pp512@d8192	1398.54	1739.14	1.24
qwen2 3B Q8_0	pp512@d32768	630.45	766.17	1.22
qwen2 3B Q8_0	tg32	117.70	124.62	1.06
qwen2 3B Q8_0	tg32@d1024	107.16	121.23	1.13
qwen2 3B Q8_0	tg32@d32768	66.23	78.30	1.18
qwen2 7B Q8_0	pp512	1419.15	1427.45	1.01
qwen2 7B Q8_0	pp512@d1024	1336.82	1363.87	1.02
qwen2 7B Q8_0	pp512@d2048	1263.52	1303.77	1.03
qwen2 7B Q8_0	pp512@d8192	941.38	1031.58	1.10
qwen2 7B Q8_0	pp512@d32768	460.00	545.48	1.19
qwen2 7B Q8_0	tg32	71.87	76.35	1.06
qwen2 7B Q8_0	tg32@d1024	68.38	74.68	1.09
qwen2 7B Q8_0	tg32@d32768	47.13	53.64	1.14
qwen3 4B Q8_0	pp512	2274.78	2321.97	1.02
qwen3 4B Q8_0	pp512@d1024	1989.08	2105.77	1.06
qwen3 4B Q8_0	pp512@d2048	1773.93	1923.09	1.08
qwen3 4B Q8_0	pp512@d8192	1075.20	1270.23	1.18
qwen3 4B Q8_0	pp512@d32768	355.96	519.51	1.46
qwen3 4B Q8_0	tg32	100.36	105.36	1.05
qwen3 4B Q8_0	tg32@d1024	92.35	101.01	1.09
qwen3 4B Q8_0	tg32@d32768	47.05	53.52	1.14
qwen3moe 30B.A3B Q8_0	pp512	2057.43	2110.77	1.03
qwen3moe 30B.A3B Q8_0	pp512@d1024	1777.20	1889.09	1.06
qwen3moe 30B.A3B Q8_0	pp512@d2048	1549.48	1694.82	1.09
qwen3moe 30B.A3B Q8_0	pp512@d8192	882.27	1050.35	1.19
qwen3moe 30B.A3B Q8_0	pp512@d32768	298.33	404.83	1.36
qwen3moe 30B.A3B Q8_0	tg32	76.90	77.77	1.01
qwen3moe 30B.A3B Q8_0	tg32@d1024	70.76	75.61	1.07
qwen3moe 30B.A3B Q8_0	tg32@d32768	38.77	43.84	1.13
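The Speedup column appears to be the branch throughput divided by the master throughput, rounded to two decimals. The sketch below checks that reading against two rows of the table; the `speedup` helper is hypothetical and not part of any benchmarking tool.

```python
def speedup(master_tps: float, branch_tps: float) -> float:
    """Ratio of branch throughput to master throughput, rounded to 2 decimals."""
    return round(branch_tps / master_tps, 2)


# gemma3 12B Q8_0, pp512@d32768
print(speedup(407.97, 573.20))    # 1.41
# gemma3 4B Q8_0, pp512@d32768
print(speedup(1138.71, 1664.12))  # 1.46
```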

TODO before merge

Add comments

Next PRs

Make the backend async (metal : make the backend async #15832)
Speed up the rest of the vector kernels
Switch to inference-time compilation for the rest of the important kernels

ggml-ci

* metal : refactor ggml-ci
* cont : refactor FA-vec kernel
* cont : print metal library load time
* minor : warn to debug + bettern kernel names ggml-ci
* metal : optimize mul_mv q8_0 ggml-ci
* metal : simplify FA pipeline creation functions ggml-ci
* metal : improve naming consistency
* metal : safer function constants offsets ggml-ci
* metal : comments ggml-ci

github-actions Bot added the testing, ggml, and Apple Metal labels Sep 7, 2025

ggerganov added 6 commits September 8, 2025 12:14

metal : refactor

d5b1feb

ggml-ci

cont : refactor FA-vec kernel

2f2ea92

cont : print metal library load time

e51e14e

minor : warn to debug + bettern kernel names

7586e69

ggml-ci

metal : optimize mul_mv q8_0

e8b8ac2

ggml-ci

metal : simplify FA pipeline creation functions

d034c5d

ggml-ci

ggerganov force-pushed the gg/metal-refactor branch from 2a63ecf to d034c5d September 8, 2025 09:31

ggerganov added 3 commits September 8, 2025 12:56

metal : improve naming consistency

83199a2

metal : safer function constants offsets

5497858

ggml-ci

metal : comments

0f28ee6

ggml-ci

ggerganov merged commit f28d4f4 into master Sep 8, 2025
53 of 55 checks passed

ggerganov deleted the gg/metal-refactor branch September 8, 2025 10:35

ggerganov mentioned this pull request Sep 13, 2025

metal : refactor kernel loading #15964

Merged

ggerganov merged 9 commits into master from gg/metal-refactor
