Add Q4_3 support to cuBLAS #1086
7B q4_3 perplexity with cuBLAS: 6.0617

Details:
main: seed = 1682015944
llama.cpp: loading model from models/7B/ggml-model-q4_3.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 6 (mostly Q4_3)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936267.11 KB
llama_model_load_internal: mem required = 6612.57 MB (+ 1026.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 12 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 9033.50 ms
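(The number above comes from the perplexity tool, presumably invoked along the lines of `./perplexity -m models/7B/ggml-model-q4_3.bin -f wiki.test.raw`.)

For context, Q4_3 packs blocks of 16 weights with an fp16 scale `d` and an fp16 minimum `m`, so each 4-bit quant `q` dequantizes as `x = q*d + m`. Below is a minimal sketch of what the cuBLAS-path dequantization kernel for this format can look like; the struct layout and names follow ggml conventions of the time, but treat the details as illustrative rather than the exact code in this PR:

```cuda
#include <cuda_fp16.h>
#include <stdint.h>

#define QK4_3 16

// Q4_3 block layout (ggml convention at the time): fp16 scale, fp16 min,
// then 16 weights packed as 4-bit nibbles -> 12 bytes per 16 weights.
typedef struct {
    __half  d;              // delta (scale)
    __half  m;              // min
    uint8_t qs[QK4_3 / 2];  // 4-bit quants, two per byte
} block_q4_3;

// Dequantize one Q4_3 block per CUDA block: x = q*d + m for each nibble q.
static __global__ void dequantize_block_q4_3(const void * vx, float * y) {
    const block_q4_3 * x = (const block_q4_3 *) vx;

    const int i = blockIdx.x;

    const float d = __half2float(x[i].d);
    const float m = __half2float(x[i].m);

    const uint8_t * pp = x[i].qs;

    for (int l = 0; l < QK4_3; l += 2) {
        const uint8_t vi = pp[l/2];

        const int8_t vi0 = vi & 0xf;  // low nibble
        const int8_t vi1 = vi >> 4;   // high nibble

        y[i*QK4_3 + l + 0] = vi0*d + m;
        y[i*QK4_3 + l + 1] = vi1*d + m;
    }
}
```

One way to launch it is `dequantize_block_q4_3<<<nb, 1, 0, stream>>>(vx, y)` with `nb = k / QK4_3`; the resulting fp32 buffer then feeds `cublasSgemm` the same way as the other quantization formats.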
Also changed the Makefile to link against the CUDA dynamic libraries; linking is much faster that way, and there is no reason to link statically for local use.
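For illustration, the dynamic-link variant amounts to something like the following; the exact library list and CUDA install path are assumptions, so adjust `LDFLAGS` for your setup:

```makefile
# Link the CUDA shared libraries instead of the static ones (-lcublas_static,
# -lculibos, -lcudart_static, ...): link times drop considerably, at the cost
# of needing the CUDA runtime installed wherever the binary runs.
LDFLAGS += -lcublas -lcudart -L/usr/local/cuda/lib64
```

Static linking only really pays off when shipping binaries to machines without the CUDA toolkit, which is not the local-use case here.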