build : enable link-time optimizations by ggerganov · Pull Request #3859 · ggml-org/llama.cpp

ggerganov · 2023-10-30T13:14:52Z

Try to restore the performance to what it was before the refactoring in #3833.
It seems the ggml_fp16_to_fp32 and ggml_fp32_to_fp16 calls slow down processing significantly, at least with ARM_NEON. I haven't confirmed this for x86 architectures.

ggerganov · 2023-10-30T13:16:03Z

@wro52 Could you test this branch and see if it fixes the performance in your environment?

slaren · 2023-10-30T13:47:30Z

For me, with gcc 12.3.0 under Linux, it doesn't seem to change anything either way, but neither did #3833. It does, however, increase the full build time (with make) by ~20% and adds a lot of warnings from gcc.

model	size	params	backend	threads	test	t/s
llama 7B mostly Q4_0	3.56 GiB	6.74 B	CPU	16	tg 128	13.39 ± 0.37
llama 7B mostly F16	12.55 GiB	6.74 B	CPU	8	tg 32	4.25 ± 0.04

build: 1206b5f (1446) build time: 27.87 secs

model	size	params	backend	threads	test	t/s
llama 7B mostly Q4_0	3.56 GiB	6.74 B	CPU	16	tg 128	13.30 ± 0.36
llama 7B mostly F16	12.55 GiB	6.74 B	CPU	8	tg 32	4.23 ± 0.03

build: 6e08281 (1445) build time: 23.15 secs

model	size	params	backend	threads	test	t/s
llama 7B mostly Q4_0	3.56 GiB	6.74 B	CPU	16	tg 128	13.43 ± 0.05
llama 7B mostly F16	12.55 GiB	6.74 B	CPU	8	tg 32	4.20 ± 0.04

build: 82a6646 (1440) build time: 22.41 secs

Green-Sky · 2023-10-30T13:49:11Z

> adds a lot of warnings from gcc.

Mostly duplicate symbols, right? I think I saw those when I activated LTO locally. Having the same non-private symbols, or the same symbol names, across multiple TUs is bad practice anyway, so those warnings are correct.

slaren · 2023-10-30T13:51:12Z

No, it is just this warning repeated many times:

lto-wrapper: warning: using serial compilation of 3 LTRANS jobs
lto-wrapper: note: see the ‘-flto’ option documentation for more information

ggerganov · 2023-10-30T13:57:35Z

I also get these warnings - not sure how to fix.

Any alternatives?
Without LTO, on master I get:

model	size	params	backend	ngl	threads	test	t/s
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	0	8	tg 128	19.03 ± 0.03

build: 6e08281 (1445)

With LTO and also before #3833 I get:

model	size	params	backend	ngl	threads	test	t/s
llama 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	0	8	tg 128	33.89 ± 0.12

build: a6aba2c (1448)

Almost 2x slowdown (33.89 vs 19.03 t/s, about 1.78x).

Also, on master with this patch the performance is restored:

diff --git a/ggml-quants.c b/ggml-quants.c
index fd4ee1be..5a5ed16f 100644
--- a/ggml-quants.c
+++ b/ggml-quants.c
@@ -6,6 +6,9 @@
 #include <assert.h>
 #include <float.h>
 
+#define ggml_fp16_to_fp32
+#define ggml_fp32_to_fp16
+
 #ifdef __ARM_NEON
 
 // if YCM cannot find <arm_neon.h>, make a symbolic link to it, for example:

But this works only for ARM_NEON, where there is a native F16 <-> F32 cast.
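To make the limitation concrete: defining the two names as empty macros turns each call such as ggml_fp16_to_fp32(x) into the plain expression (x), which is only correct when the compiler can convert the half type by itself. Where the half value is stored as raw 16-bit data, the bits have to be decoded explicitly. The following is a minimal sketch of such a software decode, assuming the IEEE 754 binary16 layout; the name half_bits_to_float is hypothetical and this is not ggml's implementation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of ggml): decode IEEE 754 binary16 bits
 * into a float without relying on any native half-precision type. */
static inline float half_bits_to_float(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16; /* move sign to bit 31 */
    const uint32_t exp  = (h >> 10) & 0x1Fu;             /* 5-bit exponent */
    uint32_t       man  = h & 0x3FFu;                    /* 10-bit mantissa */
    uint32_t       bits;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                                 /* signed zero */
        } else {
            /* Subnormal half: shift until the implicit leading 1 appears. */
            int shift = 0;
            while ((man & 0x400u) == 0) {
                man <<= 1;
                shift++;
            }
            man &= 0x3FFu;                               /* drop the implicit 1 */
            bits = sign | ((uint32_t)(127 - 15 + 1 - shift) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);         /* infinity or NaN */
    } else {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = sign | ((exp + (127 - 15)) << 23) | (man << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);                         /* bit-cast to float */
    return f;
}
```

For example, half_bits_to_float(0x3C00) yields 1.0f. A function like this is small, so whether the compiler can inline it into the hot loops is what decides the cost per element.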

slaren · 2023-10-30T13:59:04Z

The warnings disappear with -flto=auto:

diff --git a/Makefile b/Makefile
index 2a2ac850..348143e0 100644
--- a/Makefile
+++ b/Makefile
@@ -124,7 +124,7 @@ MK_CFLAGS        += -Ofast -flto
 MK_HOST_CXXFLAGS += -Ofast
 MK_CUDA_CXXFLAGS += -O3
 else
-MK_CFLAGS        += -O3 -flto
+MK_CFLAGS        += -O3 -flto=auto
 MK_CXXFLAGS      += -O3
 endif

Green-Sky · 2023-10-30T14:00:11Z

We could leave LTO OFF by default, like before, but set it ON in the CI (not a real solution, though).

slaren · 2023-10-30T14:01:16Z

I guess we could move the fp16 conversion functions to an internal header ggml-impl.h.
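The reasoning behind this suggestion, as a rough sketch: a function defined in a different translation unit generally cannot be inlined without LTO, whereas a static inline definition in a shared internal header is visible in every file that includes it. The example below collapses that layout into one file. Only the header name ggml-impl.h comes from the comment above; the function names demo_fp32_to_fp16 and demo_row_to_fp16 and their bodies are hypothetical, and the conversion assumes IEEE 754 binary16 with round-to-nearest-even.

```c
/* ---- would live in the internal header (contents are hypothetical) ---- */
#include <stdint.h>
#include <string.h>

/* Convert a float to binary16 bits, rounding to nearest even. */
static inline uint16_t demo_fp32_to_fp16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);                        /* bit-cast float to u32 */
    const uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    const uint32_t mag  = x & 0x7FFFFFFFu;

    if (mag > 0x7F800000u) {
        return (uint16_t)(sign | 0x7E00u);           /* NaN stays NaN */
    }

    const int32_t e   = (int32_t)(mag >> 23) - 127 + 15; /* rebias exponent */
    uint32_t      man = mag & 0x007FFFFFu;

    if (e >= 31) {
        return (uint16_t)(sign | 0x7C00u);           /* infinity or overflow */
    }

    uint32_t half, rem, halfway;
    if (e <= 0) {
        if (e < -10) {
            return sign;                             /* too small: signed zero */
        }
        man |= 0x00800000u;                          /* restore implicit 1 */
        const int shift = 14 - e;                    /* 14..24 */
        half    = man >> shift;                      /* subnormal mantissa */
        rem     = man & ((1u << shift) - 1u);
        halfway = 1u << (shift - 1);
    } else {
        half    = ((uint32_t)e << 10) | (man >> 13);
        rem     = man & 0x1FFFu;
        halfway = 0x1000u;
    }

    /* Round to nearest even; a carry may correctly bump the exponent. */
    if (rem > halfway || (rem == halfway && (half & 1u))) {
        half++;
    }
    return (uint16_t)(sign | half);
}

/* ---- would live in a .c file that includes the header ---- */

/* Hot loop: the call can be inlined because the body is visible here. */
static void demo_row_to_fp16(const float *src, uint16_t *dst, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = demo_fp32_to_fp16(src[i]);
    }
}
```

With this layout the compiler sees the conversion body at every call site, so inlining no longer depends on -flto. The trade-off is that each including file carries its own copy of the function.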

slaren · 2023-10-30T14:11:09Z

The inlining issue also used to be a problem with early versions of ggml-cuda, and may still be with ggml-opencl since it inherited this code:
https://github.com/ggerganov/llama.cpp/blob/6e08281e588bbba1a5d180290a94a43f167f3a1a/ggml-opencl.cpp#L1618-L1623

Green-Sky · 2023-10-30T14:18:02Z

We might get away with something like -finline-limit=

Since ggml uses a lot of loops, -faggressive-loop-optimizations sounds interesting too.

ggerganov · 2023-10-30T14:43:40Z

> I guess we could move the fp16 conversion functions to an internal header ggml-impl.h.

This can work. Implemented here: #3861

wro52 · 2023-10-30T16:12:25Z

> @wro52 Could you test this branch and see if it fixes the performance in your environment?

I tried every possible speed setting, with no significant influence.

ggerganov · 2023-10-30T17:19:53Z

Merged #3861 instead.

build : enable link-time optimizations

1206b5f

ggerganov added 2 commits October 30, 2023 15:40

build : disable lto for C++ (make) and enable existing LTO flag (cmake)

6f6b0db

ci : try to fix code coverage build

a6aba2c

ggerganov force-pushed the lto branch from e245f6c to a6aba2c Compare October 30, 2023 13:46

ci : fix focal build

57c4296

make : use -lfto=auto to avoid warnings and maintain perf

bc28aaa

ggerganov closed this Oct 30, 2023
