Conversation
Newbie question: can you please explain why this white space is dangerous?
Thx!
It is not dangerous. I'm just being sarcastic about a test failing because of one forgotten trailing white space.
ggerganov left a comment
M1 Pro, eval time in ms/token:
| Model | Master | This PR |
|---|---|---|
| 7B | 38.4 | 29.5 |
| 13B | 69.48 | 51.5 |
However, the calculation seems to be incorrect.
Here is a run with this PR - the generated text is quite incoherent:
I believe the meaning of life is to find the be a friend and. do we want to be here in 201932032222222222312222122222222222222222222222
Details
$ ▶ LLAMA_METAL=1 make -j main && ./main -m ./models/13B/ggml-model-q2_k.bin -p "I believe the meaning of life is" -c 128 --ignore-eos -n 64 -t 8 -ngl 1000
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
make: `main' is up to date.
main: build = 858 (417546c)
main: seed = 1689921112
llama.cpp: loading model from ./models/13B/ggml-model-q2_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0,09 MB
llama_model_load_internal: mem required = 7055,00 MB (+ 1608,00 MB per state)
llama_new_context_with_model: kv self size = 100,00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x131f08be0
ggml_metal_init: loaded kernel_mul 0x131f091e0
ggml_metal_init: loaded kernel_mul_row 0x131f09810
ggml_metal_init: loaded kernel_scale 0x131f09d30
ggml_metal_init: loaded kernel_silu 0x131f0a250
ggml_metal_init: loaded kernel_relu 0x131f0a770
ggml_metal_init: loaded kernel_gelu 0x131f0ac90
ggml_metal_init: loaded kernel_soft_max 0x131f0b340
ggml_metal_init: loaded kernel_diag_mask_inf 0x131f0b9a0
ggml_metal_init: loaded kernel_get_rows_f16 0x131f0c020
ggml_metal_init: loaded kernel_get_rows_q4_0 0x131f0c6a0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x131f0ce90
ggml_metal_init: loaded kernel_get_rows_q2_K 0x131f0d510
ggml_metal_init: loaded kernel_get_rows_q3_K 0x131f0db90
ggml_metal_init: loaded kernel_get_rows_q4_K 0x131f0e210
ggml_metal_init: loaded kernel_get_rows_q5_K 0x131f0e890
ggml_metal_init: loaded kernel_get_rows_q6_K 0x131f0ef10
ggml_metal_init: loaded kernel_rms_norm 0x131f0f5d0
ggml_metal_init: loaded kernel_norm 0x131f0fc80
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x131f10650
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x131f10d10
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x131f113d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x131f11a90
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x131f12310
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x131f129d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x131f13070
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x131f13710
ggml_metal_init: loaded kernel_rope 0x131f13e30
ggml_metal_init: loaded kernel_alibi_f32 0x131f14950
ggml_metal_init: loaded kernel_cpy_f32_f16 0x131f151e0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x131f15a70
ggml_metal_init: loaded kernel_cpy_f16_f16 0x131f16300
ggml_metal_init: recommendedMaxWorkingSetSize = 21845,34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 128,17 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 5253,34 MB, ( 5253,80 / 21845,34)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1024,00 MB, ( 6277,80 / 21845,34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 102,00 MB, ( 6379,80 / 21845,34)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 266,00 MB, ( 6645,80 / 21845,34)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512,00 MB, ( 7157,80 / 21845,34)
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 128, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to find the be a friend and. do we want to be here in 201932032222222222312222122222222222222222222222
llama_print_timings: load time = 404,54 ms
llama_print_timings: sample time = 44,56 ms / 64 runs ( 0,70 ms per token, 1436,23 tokens per second)
llama_print_timings: prompt eval time = 598,88 ms / 8 tokens ( 74,86 ms per token, 13,36 tokens per second)
llama_print_timings: eval time = 3248,99 ms / 63 runs ( 51,57 ms per token, 19,39 tokens per second)
llama_print_timings: total time = 3897,98 ms
ggml_metal_free: deallocating
For comparison, on master:
I believe the meaning of life is to find a balance between your own needs and desires while at the same time doing what's best for others.
I would say "I'm here to help" if someone asked me for some advice or guidance. I believe that a person should be happy with who they are and what they do, but
Details
$ ▶ LLAMA_METAL=1 make -j main && ./main -m ./models/13B/ggml-model-q2_k.bin -p "I believe the meaning of life is" -c 128 --ignore-eos -n 64 -t 8 -ngl 1000
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
make: `main' is up to date.
main: build = 856 (e782c9e)
main: seed = 1689921195
llama.cpp: loading model from ./models/13B/ggml-model-q2_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0,09 MB
llama_model_load_internal: mem required = 7055,00 MB (+ 1608,00 MB per state)
llama_new_context_with_model: kv self size = 100,00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x159e0aa90
ggml_metal_init: loaded kernel_mul 0x159e0b090
ggml_metal_init: loaded kernel_mul_row 0x159e0b6c0
ggml_metal_init: loaded kernel_scale 0x159e0bbe0
ggml_metal_init: loaded kernel_silu 0x159e0c100
ggml_metal_init: loaded kernel_relu 0x159e0c620
ggml_metal_init: loaded kernel_gelu 0x159e0cb40
ggml_metal_init: loaded kernel_soft_max 0x159e0d1f0
ggml_metal_init: loaded kernel_diag_mask_inf 0x159e0d850
ggml_metal_init: loaded kernel_get_rows_f16 0x159e0ded0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x159e0e550
ggml_metal_init: loaded kernel_get_rows_q4_1 0x159e0ed40
ggml_metal_init: loaded kernel_get_rows_q2_K 0x159e0f3c0
ggml_metal_init: loaded kernel_get_rows_q3_K 0x159e0fa40
ggml_metal_init: loaded kernel_get_rows_q4_K 0x159e100c0
ggml_metal_init: loaded kernel_get_rows_q5_K 0x159e10740
ggml_metal_init: loaded kernel_get_rows_q6_K 0x159e10dc0
ggml_metal_init: loaded kernel_rms_norm 0x159e11480
ggml_metal_init: loaded kernel_norm 0x159e11b30
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x159e12500
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x159e12bc0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x159e13280
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x159e13960
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x159e141e0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x159e148a0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x159e14f40
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x159e155e0
ggml_metal_init: loaded kernel_rope 0x159e15d00
ggml_metal_init: loaded kernel_alibi_f32 0x159e16820
ggml_metal_init: loaded kernel_cpy_f32_f16 0x159e170b0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x159e17940
ggml_metal_init: loaded kernel_cpy_f16_f16 0x159e181d0
ggml_metal_init: recommendedMaxWorkingSetSize = 21845,34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 128,17 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 5253,34 MB, ( 5253,80 / 21845,34)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1024,00 MB, ( 6277,80 / 21845,34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 102,00 MB, ( 6379,80 / 21845,34)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 266,00 MB, ( 6645,80 / 21845,34)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512,00 MB, ( 7157,80 / 21845,34)
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 128, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to find a balance between your own needs and desires while at the same time doing what's best for others.
I would say "I'm here to help" if someone asked me for some advice or guidance. I believe that a person should be happy with who they are and what they do, but
llama_print_timings: load time = 419,38 ms
llama_print_timings: sample time = 44,66 ms / 64 runs ( 0,70 ms per token, 1433,18 tokens per second)
llama_print_timings: prompt eval time = 604,51 ms / 8 tokens ( 75,56 ms per token, 13,23 tokens per second)
llama_print_timings: eval time = 4374,82 ms / 63 runs ( 69,44 ms per token, 14,40 tokens per second)
llama_print_timings: total time = 5029,98 ms
ggml_metal_free: deallocating
This is the command that I use:
make clean && LLAMA_METAL=1 make -j main && ./main -m ./models/13B/ggml-model-q2_k.bin -p "I believe the meaning of life is" -c 128 --ignore-eos -n 64 -t 8 -ngl 1000

The perplexity results also confirm that something is not OK. To make the perplexity tool run using the Metal kernels, make sure to add the -b 1 command line arg like this:
make clean && LLAMA_METAL=1 make -j && ./perplexity -m ./models/7B/ggml-model-q2_k.bin -f build/wiki.test.raw -t 8 --no-mmap -ngl 100 -b 1
Here are the first three chunks on master and on this PR (note it's very slow):
# master
[1]4.9103,[2]5.5275,[3]6.3980,
# PR
[1]22.6174,[2]27.6240,[3]29.5259,
Full log of the last command:
Details
$ ▶ make clean && LLAMA_METAL=1 make -j && ./perplexity -m ./models/7B/ggml-model-q2_k.bin -f build/wiki.test.raw -t 8 --no-mmap -ngl 100 -b 1
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
rm -vf *.o *.so main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server simple vdot train-text-from-scratch embd-input-test build-info.h
common.o
ggml-metal.o
ggml.o
k_quants.o
llama.o
libembdinput.so
main
quantize
quantize-stats
perplexity
embedding
server
simple
vdot
train-text-from-scratch
embd-input-test
build-info.h
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c examples/common.cpp -o common.o
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c -o k_quants.o k_quants.c
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/main/main.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o main -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/quantize/quantize.cpp ggml.o llama.o k_quants.o ggml-metal.o -o quantize -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o ggml-metal.o -o quantize-stats -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o perplexity -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o embedding -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL pocs/vdot/vdot.cpp ggml.o k_quants.o ggml-metal.o -o vdot -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o k_quants.o ggml-metal.o -o train-text-from-scratch -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o simple -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -Iexamples/server examples/server/server.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o server -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ --shared -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embd-input/embd-input-lib.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o libembdinput.so -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embd-input/embd-input-test.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o embd-input-test -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders -L. -lembdinput
==== Run ./main -h for help. ====
main: build = 858 (417546c)
main: seed = 1689922165
llama.cpp: loading model from ./models/7B/ggml-model-q2_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 2733.65 MB
llama_model_load_internal: mem required = 4303.65 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size = 256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x1346098e0
ggml_metal_init: loaded kernel_mul 0x134609ee0
ggml_metal_init: loaded kernel_mul_row 0x13460a510
ggml_metal_init: loaded kernel_scale 0x13460aa30
ggml_metal_init: loaded kernel_silu 0x13460af50
ggml_metal_init: loaded kernel_relu 0x13460b470
ggml_metal_init: loaded kernel_gelu 0x13460b990
ggml_metal_init: loaded kernel_soft_max 0x13460c040
ggml_metal_init: loaded kernel_diag_mask_inf 0x13460c6a0
ggml_metal_init: loaded kernel_get_rows_f16 0x13460cd20
ggml_metal_init: loaded kernel_get_rows_q4_0 0x13460d3a0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x13460db90
ggml_metal_init: loaded kernel_get_rows_q2_K 0x13460e210
ggml_metal_init: loaded kernel_get_rows_q3_K 0x13460e890
ggml_metal_init: loaded kernel_get_rows_q4_K 0x13460ef10
ggml_metal_init: loaded kernel_get_rows_q5_K 0x13460f590
ggml_metal_init: loaded kernel_get_rows_q6_K 0x13460fc10
ggml_metal_init: loaded kernel_rms_norm 0x1346102d0
ggml_metal_init: loaded kernel_norm 0x134610980
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x134611350
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x134611a10
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x1346120d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x134612790
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x134613010
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x1346136d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x134613d70
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x134614410
ggml_metal_init: loaded kernel_rope 0x134614b30
ggml_metal_init: loaded kernel_alibi_f32 0x134615650
ggml_metal_init: loaded kernel_cpy_f32_f16 0x134615ee0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x134616770
ggml_metal_init: loaded kernel_cpy_f16_f16 0x134617000
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 102.54 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 2733.66 MB, ( 2734.11 / 21845.34)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 770.00 MB, ( 3504.11 / 21845.34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 258.00 MB, ( 3762.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 288.00 MB, ( 4050.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB, ( 4562.11 / 21845.34)
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 655 chunks, batch_size=1
perplexity: 16.02 seconds per pass - ETA 2 hours 54 minutes
[1]22.6174,[2]27.6240,[3]29.5259,^C
If we confirm something is wrong, might be worth doing the same checks for the Metal implementation of the other quantizations to make sure we didn't overlook something.
@ggerganov Great catch, thanks! I was getting a not-too-bad answer on the meaning of life while testing. The bug was that I was always using the mins/scales of the first 128 weights in the super-block. Normally, such a bug produces complete gibberish. With the last commit I now get the same perplexities as before.
* Faster Q2_K on Metal
* Deleting unnoticed and dangerous trailing white space
* Fixed bug in new metal Q2_K implementation

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Following in the footsteps of #2290 and #2294.
TG-128 in ms/t on M2 Max with 30-core GPU: