cuda: Q1_0 initial backend #21629
Conversation
```cpp
    }

    // Q1_0 requires MMA — no DP4A fallback path
    if (type == GGML_TYPE_Q1_0 && !turing_mma_available(cc) && !amd_mfma_available(cc) && !amd_wmma_available(cc)) {
```
`!amd_mfma_available(cc) && !amd_wmma_available(cc)`
Not fully sure about the AMD part; the Copilot review suggested adding that to avoid the cuBLAS fallback on AMD GPUs (I don't have access to an AMD GPU to test myself).
It's wrong, since you guard the kernel with `defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE)` but then select for both MFMA and WMMA GPUs. It should accept AMD_WMMA_AVAILABLE too.
Good catch, will fix. First I'll try to support dp4a; then this probably won't be needed.
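For reference, a rough sketch of the device-side guard change being suggested (hedged: macro names follow the existing ggml-cuda conventions, and NO_DEVICE_CODE is assumed to be the usual fallback macro; exact placement depends on the kernel):

```cpp
// Sketch only: the device-side guard should accept WMMA-capable AMD GPUs as well,
// so it matches the host-side selection above.
#if defined(TURING_MMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
    // ... Q1_0 MMA kernel body ...
#else
    NO_DEVICE_CODE;
#endif
```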
Overall, not supporting the dp4a path is pretty undesirable btw, not just for older GPUs.
@IMbackK fair enough, I can give the dp4a path another try. What's the best way to test it? It runs decently on a T4 (this Google Colab demo). Is this build correct to force the DP4A path?
If you are on CUDA 11 or 12 you can compile with
Have you tried using the LUT in shared memory for unpacking the bits?
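For context, a rough sketch of what such a shared-memory LUT could look like (hypothetical, not code from this PR; `lut_q1` and `unpacked_bytes` are illustrative names):

```cpp
// Hypothetical sketch: a 16-entry table mapping a 4-bit nibble to four packed
// bytes of -1/+1 (0xFF / 0x01), built once per block in shared memory.
__shared__ unsigned int lut_q1[16];
if (threadIdx.x < 16) {
    const unsigned int n = threadIdx.x;
    lut_q1[n] = ((n & 0x01) ? 0x01u : 0xFFu)
              | (((n & 0x02) ? 0x01u : 0xFFu) <<  8)
              | (((n & 0x04) ? 0x01u : 0xFFu) << 16)
              | (((n & 0x08) ? 0x01u : 0xFFu) << 24);
}
__syncthreads();
// ... later, unpacking a nibble becomes a single lookup:
// unpacked_bytes[j] = lut_q1[(qs0 >> shift) & 0x0F];
```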
@IMbackK I think dp4a should be good now; I ran benchmark and KL validation tests. I was also curious what happens if I force the cuBLAS fallback, so I tried that as well. @JohannesGaessler thanks for the suggestion, that was helpful; I had to switch to a 4090 since the JIT was crashing on the 5090, but it works on the 4090. Tried the following build options:
RTX 4090 GPU Benchmarks — Prompt Processing (pp512, tokens/s)
RTX 4090 GPU Benchmarks — Token Generation (tg128, tokens/s)
**RTX 4090 — KL Divergence Summary**
@pl752 I think I tried something similar, but I don't think it helped, or it was slower (I don't remember the details).
I did not see the last commit with dp4a; not sure it would make a difference. But here are some numbers from my machine at 84ab75f:
The numbers are pretty close to the T4 numbers, which makes sense (TU106 vs TU104).
```cpp
static __device__ __forceinline__ void dequantize_q1_0(const void * vx, const int64_t ib, const int iqs, float2 & v){
    const block_q1_0 * x = (const block_q1_0 *) vx;

    const float d = x[ib].d;

    const int bit_index_0 = iqs;
    const int bit_index_1 = iqs + 1;

    const int byte_index_0 = bit_index_0 / 8;
    const int bit_offset_0 = bit_index_0 % 8;

    const int byte_index_1 = bit_index_1 / 8;
    const int bit_offset_1 = bit_index_1 % 8;

    // Extract bits: 1 = +d, 0 = -d (branchless)
    const int bit_0 = (x[ib].qs[byte_index_0] >> bit_offset_0) & 1;
    const int bit_1 = (x[ib].qs[byte_index_1] >> bit_offset_1) & 1;

    v.x = (2*bit_0 - 1) * d;
    v.y = (2*bit_1 - 1) * d;
}
```
The condition `iqs % 2 == 0` should always be true, so you could potentially optimize this function (does not need to be in this PR).
Oh interesting, will give it a try; it removes one division and one modulo, so it might be worth it.
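A rough sketch of what that optimization could look like, assuming `iqs` is always even (so both bits land in the same byte); illustrative only, not the code that was merged:

```cpp
// Sketch: with iqs even, bit_offset <= 6, so bits iqs and iqs+1 share one byte.
const int byte_index = iqs / 8;
const int bit_offset = iqs % 8;                       // even, at most 6
const int bits = x[ib].qs[byte_index] >> bit_offset;  // bit 0 -> v.x, bit 1 -> v.y
v.x = (2*( bits       & 1) - 1) * d;
v.y = (2*((bits >> 1) & 1) - 1) * d;
```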
I tried it and it seems it did not make any change in speed.
```cpp
const int qs0 = bxi->qs[qs_offset + 0] | (bxi->qs[qs_offset + 1] << 8) |
                (bxi->qs[qs_offset + 2] << 16) | (bxi->qs[qs_offset + 3] << 24);
```
Use the function `get_int_b2` instead.
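For reference, `get_int_b2` assembles a 32-bit value from 2-byte-aligned data, so the manual 4-byte assembly above would collapse to roughly the following (hedged: the exact index argument depends on whether `qs_offset` is a byte offset or an int offset):

```cpp
// Sketch: replace the manual byte shuffle with the existing helper.
// Assumes qs_offset is a byte offset into bxi->qs (hence the /4 to get an int index).
const int qs0 = get_int_b2(bxi->qs, qs_offset / 4);
```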
I tried get_int_b2 in both places and with 3 build combinations; there is a large speed degradation on the token generation path for smaller models, see table below.
That was kind of surprising, so I redid the builds twice to make sure there were no other changes in the builds:
get_int_b2 Benchmark Results — L40S (sm_89, CUDA 12.8)
| Build | get_int_b2 | Model | pp512 (t/s) | tg128 (t/s) | pp Δ | tg Δ |
|---|---|---|---|---|---|---|
| default | No | 1.7B | 27,161 | 630 | — | — |
| default | Yes | 1.7B | 26,863 | 513 | -1.1% | -18.6% |
| default | No | 4B | 15,623 | 414 | — | — |
| default | Yes | 4B | 15,481 | 332 | -0.9% | -19.8% |
| default | No | 8B | 10,258 | 339 | — | — |
| default | Yes | 8B | 10,252 | 337 | -0.1% | -0.4% |
| cuBLAS | No | 1.7B | 20,447 | 622 | — | — |
| cuBLAS | Yes | 1.7B | 20,629 | 506 | +0.9% | -18.6% |
| cuBLAS | No | 4B | 10,860 | 409 | — | — |
| cuBLAS | Yes | 4B | 10,988 | 329 | +1.2% | -19.6% |
| cuBLAS | No | 8B | 6,534 | 336 | — | — |
| cuBLAS | Yes | 8B | 6,466 | 334 | -1.0% | -0.5% |
| DP4A | No | 1.7B | 19,767 | 492 | — | — |
| DP4A | Yes | 1.7B | 20,219 | 495 | +2.3% | +0.7% |
| DP4A | No | 4B | 10,016 | 316 | — | — |
| DP4A | Yes | 4B | 9,971 | 319 | -0.4% | +1.0% |
| DP4A | No | 8B | 6,151 | 337 | — | — |
| DP4A | Yes | 8B | 6,125 | 336 | -0.4% | -0.1% |
```cpp
const int bits4 = (qs0 >> shift) & 0x0F;
const int b0 = (bits4 & 0x01) ? 1 : -1;
const int b1 = (bits4 & 0x02) ? 1 : -1;
const int b2 = (bits4 & 0x04) ? 1 : -1;
const int b3 = (bits4 & 0x08) ? 1 : -1;
unpacked_bytes[j] = (b0 & 0xFF) | ((b1 & 0xFF) << 8) | ((b2 & 0xFF) << 16) | ((b3 & 0xFF) << 24);
```
You should be able to optimize this using `__vadd4` (does not need to be in this PR).
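A possible shape for that optimization, sketched with CUDA's per-byte SIMD intrinsics `__vadd4` and `__vsub4`; this is an assumption about what was meant, not code from the PR:

```cpp
// Sketch: spread the 4 sign bits into one byte each (0 or 1), then map
// 0/1 -> -1/+1 per byte by computing 2*b - 1 with SIMD-in-a-word intrinsics.
const int bits4   = (qs0 >> shift) & 0x0F;
const int spread  = ((bits4 & 0x01)      )
                  | ((bits4 & 0x02) <<  7)
                  | ((bits4 & 0x04) << 14)
                  | ((bits4 & 0x08) << 21);           // one 0/1 bit per byte
const int doubled  = __vadd4(spread, spread);         // 0 or 2 per byte
unpacked_bytes[j]  = __vsub4(doubled, 0x01010101);    // 0xFF (-1) or 0x01 (+1) per byte
```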
Interesting, will try it. Will send another PR if it helps a lot.
I tried this as well over the weekend, and it seems it did not make a difference. Will take another pass at tuning after this PR is merged.
```cpp
const int v = bq1_0->qs[offset + 0] | (bq1_0->qs[offset + 1] << 8) |
              (bq1_0->qs[offset + 2] << 16) | (bq1_0->qs[offset + 3] << 24);
```
Use the function `get_int_b2` instead.
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Are there any more changes needed on our side? I see a few actions failed, but I don't think it's due to this PR.
am17an left a comment:
The CI failures look a bit random, not sure what's causing them.
* [cuda] initial Q1_0 backend
* remove unused code, fix AMD MMA guard
* attempt to support dp4a
* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Overview
Follow-up after the merge of the Q1_0 CPU PR; this PR adds the corresponding CUDA backend.
It seems this also works for AMD in some cases, which was a nice surprise :)
See a live demo of Bonsai 8B using these CUDA kernels and llama-server on the Hugging Face Space prism-ml/Bonsai-demo, using an L40S GPU and getting decent speeds. Each request runs on one GPU with a naive load balancer (just for demo purposes).

Models:
Questions:

I could not get DP4A working for these kernels (kept getting wrong results). Is DP4A support required, or is a cuBLAS fallback okay for that path? It seems to target GPUs from a few generations ago.

llama-bench (-fa 1)
Device: NVIDIA RTX 5090 (32 GB), CUDA backend
Bonsai-1.7B (231.13 MiB, 1.72B params)
Bonsai-4B (540.09 MiB, 4.02B params)
Bonsai-8B (1.07 GiB, 8.19B params)
End-to-end testing: KL Divergence (Q1_0 vs unpacked into FP16)
To test accuracy of the CUDA backend, we compare the KL divergence of the Q1_0 model against the unpacked FP16 model. The weights are equivalent so checking the logits gives us a good indication of the accuracy of the CUDA backend. Ran on 20 chunks of wikitext-2-raw, ctx 512.
For each model, testing is done against the unpacked version here: https://huggingface.co/collections/prism-ml/bonsai-auxiliary
Requirements