Add Q4_3 support to cuBLAS #1086
7B q4_3 perplexity with cuBLAS: 6.0617

Details:
main: seed = 1682015944
llama.cpp: loading model from models/7B/ggml-model-q4_3.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 6 (mostly Q4_3)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936267.11 KB
llama_model_load_internal: mem required = 6612.57 MB (+ 1026.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 12 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 9033.50 ms
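(The number above comes from the perplexity tool, presumably invoked along the lines of `./perplexity -m models/7B/ggml-model-q4_3.bin -f wiki.test.raw`.)

For context, Q4_3 packs blocks of 16 weights with an fp16 scale `d` and an fp16 minimum `m`, so each 4-bit quant `q` dequantizes as `x = q*d + m`. Below is a minimal sketch of what the cuBLAS-path dequantization kernel for this format can look like; the struct layout and names follow ggml conventions of the time, but treat the details as illustrative rather than the exact code in this PR:

```cuda
#include <cuda_fp16.h>
#include <stdint.h>

#define QK4_3 16

// Q4_3 block layout (ggml convention at the time): fp16 scale, fp16 min,
// then 16 weights packed as 4-bit nibbles -> 12 bytes per 16 weights.
typedef struct {
    __half  d;              // delta (scale)
    __half  m;              // min
    uint8_t qs[QK4_3 / 2];  // 4-bit quants, two per byte
} block_q4_3;

// Dequantize one Q4_3 block per CUDA block: x = q*d + m for each nibble q.
static __global__ void dequantize_block_q4_3(const void * vx, float * y) {
    const block_q4_3 * x = (const block_q4_3 *) vx;

    const int i = blockIdx.x;

    const float d = __half2float(x[i].d);
    const float m = __half2float(x[i].m);

    const uint8_t * pp = x[i].qs;

    for (int l = 0; l < QK4_3; l += 2) {
        const uint8_t vi = pp[l/2];

        const int8_t vi0 = vi & 0xf;  // low nibble
        const int8_t vi1 = vi >> 4;   // high nibble

        y[i*QK4_3 + l + 0] = vi0*d + m;
        y[i*QK4_3 + l + 1] = vi1*d + m;
    }
}
```

One way to launch it is `dequantize_block_q4_3<<<nb, 1, 0, stream>>>(vx, y)` with `nb = k / QK4_3`; the resulting fp32 buffer then feeds `cublasSgemm` the same way as the other quantization formats.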
Also changed the Makefile to link against the CUDA dynamic libraries; linking is much faster that way, and there is no reason to link statically for local use.
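For illustration, the dynamic-link variant amounts to something like the following; the exact library list and CUDA install path are assumptions, so adjust `LDFLAGS` for your setup:

```makefile
# Link the CUDA shared libraries instead of the static ones (-lcublas_static,
# -lculibos, -lcudart_static, ...): link times drop considerably, at the cost
# of needing the CUDA runtime installed wherever the binary runs.
LDFLAGS += -lcublas -lcudart -L/usr/local/cuda/lib64
```

Static linking only really pays off when shipping binaries to machines without the CUDA toolkit, which is not the local-use case here.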