ggml: implement quantized KV cache for FA #7372
Merged
JohannesGaessler merged 1 commit into ggml-org:master on May 19, 2024
Conversation
Member
It would be preferable to use the more generic function pointers in …
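For reference, a minimal sketch of what dispatching through ggml's type-traits function pointers could look like, assuming the `ggml_internal_get_type_traits` API and the `ggml_vec_dot_t` signature as they existed around this PR; the helper name is hypothetical:

```c
#include "ggml.h"

// Hypothetical helper: look up the per-type function pointers once and
// compute one K-row x Q-row dot product through them, instead of
// switching over the KV cache type inside the attention kernel.
static void kq_vec_dot(enum ggml_type k_type, int n,
                       float * s, const void * k_row, const void * q_row) {
    const ggml_type_traits_t traits = ggml_internal_get_type_traits(k_type);

    // vec_dot expects q_row to already be converted to traits.vec_dot_type
    // (e.g. q8_0 for a q8_0 K cache, FP16 for an FP16 K cache).
    traits.vec_dot(n, s, 0, k_row, 0, q_row, 0, 1);
}
```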
Contributor
force-pushed from c1c6d84 to d7168fd
Contributor (Author)
I tested using the same FP32 code for both FP16 and quantized V cache, but the performance was worse.
ggerganov approved these changes on May 19, 2024
Member
Needs just contiguous rows, not the entire tensors
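As a rough illustration of the distinction, a hedged sketch of a per-row contiguity check (a hypothetical helper, not the PR's actual assert), assuming ggml's tensor layout where `nb` holds byte strides:

```c
#include <stdbool.h>
#include "ggml.h"

// Hypothetical check: every row of t occupies one contiguous run of bytes,
// even if the rows themselves are strided (e.g. a view into the KV cache).
// This is weaker than requiring the whole tensor to be contiguous.
static bool rows_are_contiguous(const struct ggml_tensor * t) {
    return t->nb[0] == ggml_type_size(t->type) &&
           t->nb[1] == (t->ne[0]/ggml_blck_size(t->type))*ggml_type_size(t->type);
}
```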
force-pushed from d7168fd to b7da2e8
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request on Apr 28, 2026




This PR implements the use of quantized KV caches for the CPU backend when using FlashAttention. This is done via switch statements in `ggml_compute_forward_flash_attn_ext_f16`; from what I can tell this does not significantly affect performance. I also added comments and did some small performance optimizations, such as converting Q only once to FP16/q8_0/q8_1 and using multiplications for scaling rather than divisions. The amount of `wdata` has been increased, but only on the order of kiB, which should be negligible. On my desktop with a Ryzen 5950X power limited to 95 W the performance changes as follows:

When benchmarking, be mindful of the order in which you run the tests, since a hot CPU will perform worse and this difference can easily be larger than the performance difference from this PR. I did a warmup run prior to the actual measurements.
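A condensed sketch of the two optimizations mentioned above, assuming ggml's type-traits API; the function and variable names are illustrative rather than the PR's actual code:

```c
#include "ggml.h"

// Convert the Q row once per query, to the type vec_dot expects
// (FP16, q8_0 or q8_1 depending on the K cache type), rather than
// re-converting it for every KV cache entry.
static void convert_q_once(enum ggml_type k_type,
                           const float * q, void * q_converted, int64_t n) {
    const ggml_type_traits_t traits = ggml_internal_get_type_traits(k_type);
    traits.from_float(q, q_converted, n);
}

// Scale the accumulated V*KQ row by the reciprocal of the softmax
// denominator: one division total instead of one division per element.
static void scale_by_inv_sum(float * vkq, int n, float kq_sum) {
    const float inv_sum = 1.0f/kq_sum;
    for (int i = 0; i < n; ++i) {
        vkq[i] *= inv_sum;
    }
}
```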
ggml_compute_forward_flash_attn_ext_f16; from what I can tell this does not significantly affect performance. I also added comments and did some small performance optimizations such as converting Q only once to FP16/q8_0/q8_1 and using multiplications for scaling rather than divisions. The amount ofwdatahas been increased but only on the order of kiB which should be negligible. On my desktop with a Ryzen 5950X power limited to 95W the performance changes as follows:When benchmarking, be mindful of the order in which you run the tests since a hot CPU will perform worse and this difference can easily be larger than the performance difference from this PR. I did a warmup run prior to the actual measurements.