CUDA: faster large batch FA without tensor cores#7314
JohannesGaessler merged 1 commit into ggml-org:master
Conversation
|
This PR should provide a good speedup for the P100, but unfortunately I don't own one with which I could test the code. I would appreciate it if a P100 owner could post the output of llama-bench with the path to an actual model. |
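For anyone wanting to reproduce, a command along these lines should do (it mirrors the invocation posted further down in this thread; the model path and batch-size list are placeholders):
./llama-bench --model /path/to/model.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 1,2,4,8,16,32,64,128,256,512,1024,2048,4096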
|
I have 4 M40s if that will help. If this works I may just drop the money for 4x P100s. |
|
Here ya' go! I added -ts 1 to restrict it to one P100; I can redo the test without it if you like - I have 5 available. I tried to use a similar model to yours.
Command: ./llama-bench --model ../../mod/gguf/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 -ts 1
Output:
You seem to be getting dramatically faster results with your P40 than my P100, which has me curious. |
|
Using a Dell PowerEdge R730 with dual Intel Xeon E5-2697 v3 2.6 GHz 14-core CPUs. |
|
@dirkson @richginsberg thank you. |
|
Seeing about +5% on the P100; it doesn't matter if 1 or 2 GPUs are used. However, I'm getting very different P40 results from what you've posted above - I wonder, did you run the test with 4xP40? I don't have 4, I only have 2. With 1xP40 I observe a large (30%) improvement at low batch sizes, but past batch 512 it gets a tiny bit slower. With 2xP40 things really open up: the 50% performance improvement is across the board and massive. Well done 🤯 💪

Single P100:
ggml_cuda_init: found 1 CUDA devices:

Dual P100 (master: FA is slower):
ggml_cuda_init: found 2 CUDA devices:

Dual P100 (this branch: FA is 5% faster!):
ggml_cuda_init: found 2 CUDA devices:

Single P40 (faster up to batch 256 only):
ggml_cuda_init: found 1 CUDA devices:

Dual P40 (master: FA slower past ctx 256):
ggml_cuda_init: found 2 CUDA devices:

Dual P40 (this branch: 🤯 🐎):
Device 0: Tesla P40, compute capability 6.1, VMM: yes
|
|
The numbers are for Mistral 7b q4_0 on 1x P40, running on Linux 6.6.26-1-MANJARO. Are you using Windows? |
|
@JohannesGaessler I am running Ubuntu 22. The numbers I posted were for llama2-7b, but switching to mistral-7b doesn't make much difference; I see the same pattern: a single P40 is slower after b=256 and doesn't hit anywhere near the speeds you're reporting:
ggml_cuda_init: found 1 CUDA devices:

For reference, here is an RTX 3060 in the same machine on the same model:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
|
|
Keep in mind that to increase the batch size that is submitted to the CUDA backend, you need to increase the ubatch size alongside the batch size, e.g. by also adding -ub to the command line (see the example below). |
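An illustrative invocation that raises both sizes together (the flag values here are mine, not from the comment above):
./llama-bench --model /path/to/model.gguf -fa 1 -p 4096 -b 4096 -ub 4096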
|
Do you have ECC memory enabled? If it's disabled that could explain part of the difference. Are you disabling the other GPUs via CUDA_VISIBLE_DEVICES? |
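For anyone following along: ECC state can be inspected and toggled with nvidia-smi, and single-GPU runs can be forced with CUDA_VISIBLE_DEVICES. A sketch with illustrative values (toggling ECC requires a reboot to take effect):
nvidia-smi -q -d ECC                # show current/pending ECC mode
sudo nvidia-smi -i 0 -e 0           # disable ECC on GPU 0 (reboot to apply)
CUDA_VISIBLE_DEVICES=0 ./llama-bench --model /path/to/model.gguf -fa 0,1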
|
@slaren thank you for the clarification; in this particular case it luckily does not seem to affect the conclusions:
For very large batch sizes the performance with FlashAttention decreases, but performance seems to be optimal at a batch size of 512 anyway. |
It seems less optimal for qwen2 32B at larger batch sizes. |
|
The closest AMD alternative I know of to NVIDIA NSight Compute would be Radeon GPU Profiler. It's still a bit different, but it may be enough to get started. On the command line,
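The specific command-line suggestion is cut off above; one plausible candidate on ROCm systems (my assumption, not something named in the thread) is rocprof, e.g.:
rocprof --stats ./llama-bench --model /path/to/model.gguf -fa 1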
|
|
Another run using an Asus ESC4000 G4 with an Intel Xeon Gold 6138 (LGA3647, 1.8 GHz, 20 cores / 40 threads).
ECC is disabled.
Yes, via CUDA_VISIBLE_DEVICES, but I just tried via -ts and the results were the same. My P100 numbers match what others are reporting, but your P40 numbers are somehow ~4x mine. I guess we need another set of P40 benchmarks. |
|
@sorasoras I am not able to reproduce the performance issue with qwen 1.5 q4_0:
|
It could be something to do with -DLLAMA_CUDA_DMMV_X=64 -DLLAMA_CUDA_MMV_Y=4. |
Yup, it works as expected when compiled without -DLLAMA_CUDA_DMMV_X=64 -DLLAMA_CUDA_MMV_Y=4.
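For context, those are compile-time CMake tuning options; a plain build without them would look roughly like this (assuming the LLAMA_CUDA option name in use at the time):
cmake -B build -DLLAMA_CUDA=ON
cmake --build build -j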
|
Seeing great results with this PR @JohannesGaessler, thanks! Here are the numbers from a P40 that I've power limited to 130W (because it keeps the card cooler):

P40:

RTX 3060 (power limited to 150W):
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
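As an aside, the power caps mentioned above can be set with nvidia-smi; illustrative values:
sudo nvidia-smi -i 0 -pl 130   # cap GPU 0 at 130 W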
|
|
@JohannesGaessler Looks like you were right and there was something power-limiting the P40s in my main rig to around 70W. I've moved them to the secondary rig and now they're >200W during these tests. My observation from the severely power-limited rig stands: with 2xP40 the performance gains here are HUGE.

Single P40:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no

2xP40, split layer:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no

2xP40, split row:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no

Llama-3-70B-Instruct (not as drastic, but still some very welcome improvements, staying above 8 tok/sec):
CUDA_VISIBLE_DEVICES=0,1 ./llama-bench --model /disk-0/models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf -r 1 -fa 0,1 -b 256,512 -sm layer,row
|
slaren left a comment:
I don't want to block merging this, but I will point out the obvious: there is a lot of code duplication here, and that is going to complicate maintaining this code in the future.
|
@JohannesGaessler This was working great after the merge, but with the new Phi-3 related commits I'm now getting a crash when FA is enabled. Current version from master that's crashing with FA:
Startup command:
Phi-3 Medium GGUF from here: https://huggingface.co/bartowski/Phi-3-medium-128k-instruct-GGUF
Crash output: |
|
There was an incorrect check for precision, which is now fixed on master. However, if Phi-3 uses a head size of 80 like Phi-2 does, the code will still not work. |
Thanks for the quick fix @JohannesGaessler! After merging the latest changes, inference is now working well on the P40 with FA with the Phi-3 model I linked above. |




This PR adds CUDA FlashAttention kernels that do not use tensor cores and are optimized for large batch sizes. On my P40 enabling FlashAttention is now consistently faster:
On my RX 6800 these new kernels unfortunately perform quite poorly, which is why I'm not enabling them for AMD. I don't know what the issue is, and I cannot use NVIDIA NSight Compute to find out either. To my knowledge there is simply no equivalent AMD tool; if it turns out that I am just ignorant, I would love for someone to correct me.
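For readers unfamiliar with the technique, below is a minimal CUDA sketch of attention computed without tensor cores. This is not the PR's kernel (which is tiled, templated per head size, and far more optimized) - just an illustration of the core idea: plain FMA math for the Q·K dot products plus an online softmax, so K/V are streamed in a single pass. All names and launch parameters here are mine.

// One block per query row; blockDim.x == d (head size, power of two, <= 1024).
// Launch: fa_vec_f32<<<n_q, d, d*sizeof(float)>>>(Q, K, V, O, n_kv, d, 1.0f/sqrtf((float) d));
#include <math.h>

__global__ void fa_vec_f32(
        const float * __restrict__ Q,   // [n_q,  d] queries
        const float * __restrict__ K,   // [n_kv, d] keys
        const float * __restrict__ V,   // [n_kv, d] values
        float       * __restrict__ O,   // [n_q,  d] output
        const int n_kv, const int d, const float scale) {
    const int iq  = blockIdx.x;   // query row handled by this block
    const int tid = threadIdx.x;  // one thread per element of the head dimension

    extern __shared__ float smem[]; // d floats for the dot-product reduction

    float m   = -INFINITY; // running maximum of the scores (online softmax)
    float sum = 0.0f;      // running softmax denominator
    float acc = 0.0f;      // running numerator for O[iq*d + tid]

    const float q = Q[iq*d + tid];

    for (int ik = 0; ik < n_kv; ++ik) {
        // score = scale * dot(Q[iq], K[ik]), reduced via shared memory:
        smem[tid] = q * K[ik*d + tid];
        __syncthreads();
        for (int s = d/2; s > 0; s >>= 1) {
            if (tid < s) {
                smem[tid] += smem[tid + s];
            }
            __syncthreads();
        }
        const float score = scale * smem[0];
        __syncthreads(); // smem is overwritten in the next iteration

        // Online softmax update: rescale the old state when the max grows.
        const float m_new = fmaxf(m, score);
        const float c     = expf(m - m_new);   // correction factor for old terms
        const float p     = expf(score - m_new);
        sum = sum*c + p;
        acc = acc*c + p*V[ik*d + tid];
        m   = m_new;
    }

    O[iq*d + tid] = acc/sum;
}

A production kernel processes many queries and K/V columns per iteration to amortize memory traffic, which is exactly where the large-batch tuning in this PR comes in.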