|
@wro52 Could you test this branch and see if it fixes the performance in your environment? |
|
For me, with gcc 12.3.0 under Linux, it doesn't seem to change anything either way, but neither did #3833. It does, however, increase the full build time (with make) by ~20% and adds a lot of warnings from gcc.
build: 1206b5f (1446) build time: 27.87 secs
build: 6e08281 (1445) build time: 23.15 secs
build: 82a6646 (1440) build time: 22.41 secs |
Mostly duplicate symbols, right? I think I saw those when I activated LTO locally. Having the same non-private symbols, or the same symbol names, across multiple TUs is bad practice anyway, so those warnings are correct. |
|
No, it is just this warning repeated many times: |
|
I also get these warnings - not sure how to fix. Any alternatives?
build: 6e08281 (1445)
With LTO and also before #3833 I get:
build: a6aba2c (1448)
Almost 2x slowdown. Also, on
diff --git a/ggml-quants.c b/ggml-quants.c
index fd4ee1be..5a5ed16f 100644
--- a/ggml-quants.c
+++ b/ggml-quants.c
@@ -6,6 +6,9 @@
#include <assert.h>
#include <float.h>
+#define ggml_fp16_to_fp32
+#define ggml_fp32_to_fp16
+
#ifdef __ARM_NEON
// if YCM cannot find <arm_neon.h>, make a symbolic link to it, for example:
But this works only for ARM_NEON where there is native F16 <-> F32 cast |
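To illustrate why the two empty `#define`s in the patch above remove the call overhead: an object-like macro with an empty replacement turns `name(x)` into plain `(x)`, so the function call disappears and only an implicit conversion remains. The sketch below is a portable stand-in, not ggml code. The names `sketch_half_t`, `sketch_half_to_float` and `sketch_sum` are hypothetical, and `float` is used in place of a natively convertible half type (which is assumed to be what the ARM_NEON build provides).

```c
#include <stddef.h>

/* Assumption: stand-in for a half type that the compiler converts natively. */
typedef float sketch_half_t;

/* Hypothetical out-of-line conversion, declared but deliberately never
 * defined. If any call to it survived, the program would fail to link. */
float sketch_half_to_float(sketch_half_t h);

/* Empty object-like macro: from here on, `sketch_half_to_float(x)`
 * expands to `(x)`, so no call is emitted. */
#define sketch_half_to_float

static float sketch_sum(const sketch_half_t *v, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* Expands to `acc += (v[i]);`, relying on implicit conversion. */
        acc += sketch_half_to_float(v[i]);
    }
    return acc;
}
```

The trick only compiles to something meaningful when the storage type converts implicitly to `float`. If the half type is stored as a plain 16-bit integer, the same macro would silently produce an integer-to-float conversion of the raw bits, which would be wrong.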
|
The warnings disappear with
diff --git a/Makefile b/Makefile
index 2a2ac850..348143e0 100644
--- a/Makefile
+++ b/Makefile
@@ -124,7 +124,7 @@ MK_CFLAGS += -Ofast -flto
MK_HOST_CXXFLAGS += -Ofast
MK_CUDA_CXXFLAGS += -O3
else
-MK_CFLAGS += -O3 -flto
+MK_CFLAGS += -O3 -flto=auto
MK_CXXFLAGS += -O3
endif |
|
We could leave LTO OFF by default, like before, but set it ON in the ci. (not a real solution though) |
|
I guess we could move the fp16 conversion functions to an internal header |
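A rough sketch of what this suggestion could look like: if the conversion is defined as `static inline` in a header that every translation unit includes, the compiler can inline it into hot loops without needing LTO. The code below is an assumption for illustration only. The names `sketch_fp16_t` and `sketch_fp16_to_fp32` are hypothetical, the conversion is a plain bit-manipulation version of IEEE 754 half-to-single, and it is not taken from ggml or from #3861.

```c
/* sketch_fp16.h: hypothetical internal header, not part of ggml */
#include <stdint.h>
#include <string.h>

typedef uint16_t sketch_fp16_t; /* assumption: half stored as raw 16 bits */

/* static inline: each including translation unit sees the body,
 * so the call can be inlined without link-time optimization. */
static inline float sketch_fp16_to_fp32(sketch_fp16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (uint32_t)(h >> 10) & 0x1Fu;
    uint32_t mant       = (uint32_t)h & 0x3FFu;
    uint32_t bits;

    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                      /* signed zero */
        } else {
            /* subnormal half: shift until the implicit bit appears */
            int e = -1;
            do { mant <<= 1; e++; } while ((mant & 0x400u) == 0);
            bits = sign | ((uint32_t)(127 - 15 - e) << 23)
                        | ((mant & 0x3FFu) << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13); /* inf or NaN */
    } else {
        /* normal number: rebias exponent from 15 to 127 */
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);              /* bit cast without UB */
    return f;
}
```

The trade-off is that the function body is compiled into every translation unit that uses it, which is usually acceptable for a helper this small.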
|
The inlining issue also used to be a problem with early versions of |
|
we might get away with something like since ggml uses a lot of loops, |
This can work. Implemented here: #3861 |
Tried every possible speed setting - no significant influence |
|
Merged #3861 instead |
ref #3858
Try to restore the performance to what it was before the refactoring #3833
Seems like the ggml_fp16_to_fp32 and ggml_fp32_to_fp16 calls slow down the processing significantly. At least with ARM_NEON. Haven't confirmed for x86 architectures