CUDA performance optimizations #1530
Conversation
I would not worry about HIP right now, it is still in draft, and this is one of the reasons: it's not certain that it is fully compatible with CUDA (so far it has been). I was able to solve the issue with this kind of change:

```diff
- __shfl_xor_sync(0xffffffff, tmp, mask, 32)
+ __shfl_xor(tmp, mask, 32)
```

Well, actually, I changed the 32 to 64, etc., because of the different warp size. It would be nice if it were a define or something.
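A portability define along the lines suggested above could look something like this. This is only a sketch: `GGML_USE_HIPBLAS` is the guard the project already uses for HIP builds, while `WARP_SIZE` and `ggml_shfl_xor` are hypothetical names made up for illustration.

```cpp
// Sketch: hide the CUDA/HIP warp-size difference behind a define so the
// reduction code does not hard-code 32 or 64 lanes.
// WARP_SIZE and ggml_shfl_xor are hypothetical names, not existing API.
#if defined(GGML_USE_HIPBLAS)
#define WARP_SIZE 64
#define ggml_shfl_xor(var, mask) __shfl_xor((var), (mask), WARP_SIZE)
#else
#define WARP_SIZE 32
#define ggml_shfl_xor(var, mask) __shfl_xor_sync(0xffffffff, (var), (mask), WARP_SIZE)
#endif
```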
Excellent work! Interesting analytics:

It would be interesting to see if the performance gain for 65B follows the trend of slight percentage increases as the parameter count increases.
The unroll indeed makes the compilation super long. Not sure we want to support it.

```cpp
for (int i = 0; i < n; ++i) {
    ...
}
```

replace with:

```cpp
assert(n % 32 == 0);
for (int io = 0; io < n; io += 32) {
#pragma unroll
    for (int i = 0; i < 32; ++i) {
        ...
    }
}
```

Or something along these lines.
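Host-side, the shape of that transformation can be illustrated in plain C (the function name is hypothetical). The point is that the inner loop has a fixed trip count of 32, so it can be fully unrolled without the compiler knowing `n`, which keeps compile time bounded:

```c
#include <assert.h>

// Illustrative only: an outer loop that strides by 32 wrapping a
// fixed-size inner loop. On the GPU, #pragma unroll on the inner loop
// unrolls 32 iterations regardless of n.
static float sum_partial_unroll(const float * v, int n) {
    assert(n % 32 == 0);
    float acc = 0.0f;
    for (int io = 0; io < n; io += 32) {
        for (int i = 0; i < 32; ++i) { // #pragma unroll would go here
            acc += v[io + i];
        }
    }
    return acc;
}
```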
The unroll makes the compilation longer, but honestly I don't care about 2 minutes longer compilation if it means I get a few % more performance.

I've added additional performance numbers to the OP. The difference from unrolling is ~5%. Keep in mind that as more performance optimizations are added this number will increase. I personally favor the current solution that just unrolls the big loop (unless there are performance differences). It's easier to maintain, and for debugging purposes it's always possible to just compile without it. I can quickly test the proposed approach though.

I've pushed an alternative version to this branch. The value for
```cmake
option(LLAMA_ACCELERATE "llama: enable Accelerate framework" ON)
option(LLAMA_OPENBLAS   "llama: use OpenBLAS"                OFF)
option(LLAMA_CUBLAS     "llama: use cuBLAS"                  OFF)
set(LLAMA_CUDA_BY "1" CACHE STRING "llama: y block size for dmmv CUDA kernels")
```
Can we avoid introducing more and more options, and instead check the compute capability to decide whether these features should be enabled dynamically?
Automating performance optimizations is something that I would like to do long-term but right now I don't think we have the data necessary to judge which options should be enabled under which circumstances.
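If that data existed, the dynamic selection could be a small host-side helper along these lines. This is purely a sketch: the "compute capability 8.0 and newer gets 2 rows per block" rule is an assumption for illustration, not a measured result.

```cpp
#include <cuda_runtime.h>

// Hypothetical: pick the dmmv y block size from the compute capability.
// The major >= 8 threshold is illustrative only; as noted above, the data
// to justify any particular rule has not been collected yet.
static int pick_dmmv_y(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return 1; // conservative default on error
    }
    return prop.major >= 8 ? 2 : 1;
}
```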
I think there should be more options. There are plenty already for runtime tuning.
I am gonna try to bring #1087 into compatibility with this. The shuffle function seems the biggest hurdle right now, but it doesn't seem impossible.
For AMD this is roughly 20% faster:

```cpp
#define GGML_CUDA_DMMV_BLOCK_X 64 // dmmv = dequantize_mul_mat_vec
// ...
for (int mask = GGML_CUDA_DMMV_BLOCK_X/2; mask > 0; mask >>= 1) {
    tmp += __shfl_xor_sync(0xffffffff, tmp, mask, GGML_CUDA_DMMV_BLOCK_X);
}
```

EDIT: master with
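For intuition, the xor-butterfly reduction above can be modeled on the host in plain C (hypothetical helper; on the GPU each array element is one thread's register and the exchange happens via `__shfl_xor_sync`). After log2(n) steps, every lane holds the sum over all lanes.

```c
// Host-side model of the xor-butterfly reduction: at each step every lane
// adds the value of its xor-partner, and the partner distance halves.
// n is 32 for a CUDA warp or 64 for an AMD wavefront.
static void butterfly_reduce(float * lanes, int n) {
    float next[64];
    for (int mask = n/2; mask > 0; mask >>= 1) {
        for (int i = 0; i < n; ++i) {
            // models: tmp += __shfl_xor_sync(0xffffffff, tmp, mask, n);
            next[i] = lanes[i] + lanes[i ^ mask];
        }
        for (int i = 0; i < n; ++i) {
            lanes[i] = next[i];
        }
    }
}
```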
I think I've found an optimization option that's 5% faster than loop unrolling on my RTX 3090, and where loop unrolling actually degrades performance. It essentially works by processing more values per iteration in the loop instead of unrolling the loop. Implementation could be a little tricky, so you can either wait until I've worked out the kinks or merge this PR without unrolling.
Ok, I have mixed feelings about this unrolling anyway, so an alternative would be welcome.
I ended up doing an implementation kind of similar to what ggerganov proposed. It's now possible to set an option. In terms of features I think this PR is complete, but I just noticed that "BLOCK_X" is maybe a bad name for this value since it does not actually set the block size; that value is always
Maybe just
ggerganov left a comment
Looks great. Will do some tests later
ggerganov left a comment
Minor fixes in Makefile
Works fine on 4080. `LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_DMMV_Y=1` seems to be optimal on this card, but I haven't done extensive testing.
```makefile
else
	NVCCFLAGS += -DGGML_CUDA_DMMV_X=32
endif # LLAMA_CUDA_DMMV_X
ifdef LLAMA_CUDA_BY
```
```diff
-ifdef LLAMA_CUDA_BY
+ifdef LLAMA_CUDA_DMMV_Y
```
| // dmmv = dequantize_mul_mat_vec | ||
| #ifndef GGML_CUDA_DMMV_X | ||
| #define GGML_CUDA_DMMV_X 32 // can by set by compiler option LLAMA_CUDA_BY | ||
| #endif | ||
| #ifndef GGML_CUDA_DMMV_Y | ||
| #define GGML_CUDA_DMMV_Y 1 // can by set by compiler option LLAMA_CUDA_BY | ||
| #endif |
```diff
 // dmmv = dequantize_mul_mat_vec
 #ifndef GGML_CUDA_DMMV_X
-#define GGML_CUDA_DMMV_X 32 // can by set by compiler option LLAMA_CUDA_BY
+#define GGML_CUDA_DMMV_X 32
 #endif
 #ifndef GGML_CUDA_DMMV_Y
-#define GGML_CUDA_DMMV_Y 1 // can by set by compiler option LLAMA_CUDA_BY
+#define GGML_CUDA_DMMV_Y 1
 #endif
```
This looks ready to merge, correct?

From my side yes.

(Sorry about that, I accidentally hit a random key with the page focused.)

This is amazing now, <70 ms/t for 13b Q4_0 on my old graphics card. The variables need to be tuned though for different systems.
* xor hack
* block y dim
* loop unrolling
* Fixed cmake LLAMA_CUDA_BY option
* Removed hipblas compatibility code
* Define GGML_CUDA_DMMV_BLOCK_Y if not defined
* Fewer iters, more ops per iter
* Renamed DMMV X/Y compilation options
This PR adds performance optimizations for GPU accelerated token generation, mostly benefiting fast GPUs like the RTX 3090. Performance optimizations can be enabled via the options `LLAMA_CUDA_BY=2` and `LLAMA_CUDA_UNROLL=1` (make) or `LLAMA_CUDA_UNROLL=ON` (cmake) at compile time. These options degrade performance on my GTX 1070.

Build instructions (Linux):

Implementation details

* … `GGML_USE_HIPBLAS` is not defined.
* `LLAMA_CUDA_BY` sets the number of rows per block. On my RTX 3090 setting this option to 2 is faster, but higher values have slightly worse performance. On my GTX 1070 a value of 2 or higher causes performance degradation.
* Loop unrolling: the matrices used in llama.cpp always have the same size, so the loops used during inference can be unrolled if the compiler is told how large the matrices are. This is done by moving `ncols` from a regular argument to a template argument and adding a switch statement for the various matrix sizes (8 in total). On my RTX 3090 this is faster, but on my GTX 1070 it's slower. Enabling this option significantly increases compile time.
* `LLAMA_CUDA_BX` can be set to determine the block size in x direction. The default value is 32; 64 was faster on the RTX 3090.

Results

For the RTX 3090 I used `LLAMA_CUDA_BY=2 LLAMA_CUDA_UNROLL=1`; for the GTX 1070 I did not use these options.
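The template-argument trick from the implementation details can be sketched roughly as follows. The kernel and launcher names are hypothetical and the body is simplified to a plain float mat-vec (the real kernels dequantize blocks); 4096 and 11008 are examples of widths that occur in the LLaMA models.

```cpp
// Sketch: ncols as a template argument is a compile-time constant, so nvcc
// can fully unroll the column loop; a switch selects the instantiation.
template <int ncols>
__global__ void mul_mat_vec_fixed(const float * x, const float * y, float * dst) {
    const int row = blockIdx.x;
    float tmp = 0.0f;
#pragma unroll
    for (int i = threadIdx.x; i < ncols; i += 32) { // one warp per row
        tmp += x[row*ncols + i] * y[i];
    }
    // warp-reduce partial sums, then lane 0 writes the result
    for (int mask = 16; mask > 0; mask >>= 1) {
        tmp += __shfl_xor_sync(0xffffffff, tmp, mask, 32);
    }
    if (threadIdx.x == 0) {
        dst[row] = tmp;
    }
}

static void launch_mul_mat_vec(const float * x, const float * y, float * dst,
                               int nrows, int ncols, cudaStream_t stream) {
    const dim3 grid(nrows);
    const dim3 block(32);
    switch (ncols) { // one case per matrix width used by the models
        case  4096: mul_mat_vec_fixed< 4096><<<grid, block, 0, stream>>>(x, y, dst); break;
        case 11008: mul_mat_vec_fixed<11008><<<grid, block, 0, stream>>>(x, y, dst); break;
        default:    break; // fall back to a generic, non-unrolled kernel
    }
}
```

The cost of this design is one compiled kernel per width per quantization type, which is exactly why the PR reports significantly longer compile times when unrolling is enabled.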