CANN: refactor mask handling and improve performance in FA by noemotiovon · Pull Request #15561 · ggml-org/llama.cpp

noemotiovon · 2025-08-25T10:01:34Z

What does this PR do?

1. Refactored the mask computation in Flash Attention (FA) so that prefill and decode share one code path instead of being handled separately.
2. Improved performance in non-ALiBi scenarios by removing one repeat operation.
3. Changed the FA tensor layout from BNSD to BSND. The BSND layout is contiguous in memory, so this reduces tensor movement time and significantly improves performance (15%).
4. Updated operator management to explicitly mark unsupported cases: 310P devices, and dim not divisible by 16.
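
The mask-related points can be pictured with a small sketch. It is not the CANN implementation: plain Python lists stand in for device tensors, and the names `causal_mask`, `per_head_masks` and `shared_mask` are hypothetical. It assumes the usual ggml convention that an ALiBi slope scales the mask per head, which is why a per-head copy (a repeat) is needed with ALiBi and can be skipped without it. The same mask builder covers decode (one query row) and prefill (many query rows).

```python
NEG_INF = float("-inf")

def causal_mask(n_q, n_kv):
    """Additive mask: 0.0 where a query may attend, -inf elsewhere.
    Works for decode (n_q == 1) and prefill (n_q > 1) alike."""
    offset = n_kv - n_q  # queries are the last n_q positions of the KV sequence
    return [[0.0 if k <= q + offset else NEG_INF for k in range(n_kv)]
            for q in range(n_q)]

def per_head_masks(mask, slopes):
    """ALiBi-style case (assumption): each head scales the mask by its own
    slope, so one copy per head is materialized."""
    return [[[v * s for v in row] for row in mask] for s in slopes]

def shared_mask(mask, n_head):
    """Non-ALiBi case: all heads see identical values, so the single mask
    is referenced n_head times without copying."""
    return [mask] * n_head

prefill = causal_mask(2, 3)   # two query rows
decode = causal_mask(1, 3)    # one query row, same function
heads = shared_mask(prefill, 4)
```

In the shared case every entry of `heads` is the same object, which is the saving a dropped repeat would correspond to under these assumptions.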

OP Test

......
  11821/11821 tests passed
  Backend CANN0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

Model Test

......
llama_perf_sampler_print:    sampling time =      58.82 ms /   257 runs   (    0.23 ms per token,  4369.49 tokens per second)
llama_perf_context_print:        load time =    1744.56 ms
llama_perf_context_print: prompt eval time =      30.98 ms /    20 tokens (    1.55 ms per token,   645.58 tokens per second)
llama_perf_context_print:        eval time =    1266.71 ms /   236 runs   (    5.37 ms per token,   186.31 tokens per second)
llama_perf_context_print:       total time =    2386.79 ms /   256 tokens
llama_perf_context_print:    graphs reused =        235

hipudding

Great job! Only one minor modification is required.

noemotiovon · 2025-08-26T11:11:55Z

I also changed the FA tensor layout from BNSD to BSND in this PR. The BSND layout is contiguous in memory, so this reduces tensor movement time and significantly improves performance (15%).
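
Why the layout matters can be sketched with stride arithmetic. This is an illustration under assumptions, not code from the PR: the sizes are made up, the helper names are hypothetical, and it assumes the underlying buffer is stored in (batch, seq, heads, dim) order so that a BNSD tensor is only a permuted view of it.

```python
def row_major_strides(shape):
    """Element strides of a densely packed row-major array."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def is_contiguous(shape, strides):
    """True when the strides match a dense row-major packing of shape."""
    return list(strides) == row_major_strides(shape)

B, S, N, D = 1, 4, 8, 128  # hypothetical batch, sequence, heads, head dim

# Buffer as stored (assumption): batch, sequence, heads, dim.
bsnd_shape = [B, S, N, D]
bsnd_strides = row_major_strides(bsnd_shape)

# Same buffer viewed as BNSD: swap axes 1 and 2 of both shape and strides.
bnsd_shape = [B, N, S, D]
bnsd_strides = [bsnd_strides[0], bsnd_strides[2], bsnd_strides[1], bsnd_strides[3]]

bsnd_ok = is_contiguous(bsnd_shape, bsnd_strides)  # dense, usable as-is
bnsd_ok = is_contiguous(bnsd_shape, bnsd_strides)  # permuted view, needs a copy
```

Under these assumptions an operator that accepts BSND can read the buffer directly, while one that requires dense BNSD input forces a copy first.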

Model Test:

......
llama_perf_sampler_print:    sampling time =      97.20 ms /   409 runs   (    0.24 ms per token,  4207.60 tokens per second)
llama_perf_context_print:        load time =    2098.21 ms
llama_perf_context_print: prompt eval time =      27.14 ms /    20 tokens (    1.36 ms per token,   736.97 tokens per second)
llama_perf_context_print:        eval time =    1708.84 ms /   388 runs   (    4.40 ms per token,   227.05 tokens per second)
llama_perf_context_print:       total time =    3553.72 ms /   408 tokens
llama_perf_context_print:    graphs reused =        386
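
For readers who want to compare the two Model Test logs, the eval lines can be turned into a relative change as below. The figures are copied from the logs in this thread. Treating them as a before/after pair is an assumption, and the runs differ in length (236 vs 388 runs), so this is not a controlled benchmark and need not match the 15% quoted above.

```python
# Eval figures copied from the two llama_perf_context_print logs in this thread.
first_log = {"eval_ms": 1266.71, "runs": 236}
second_log = {"eval_ms": 1708.84, "runs": 388}

def tokens_per_second(log):
    """Decode throughput derived from total eval time and run count."""
    return log["runs"] / (log["eval_ms"] / 1000.0)

def ms_per_token(log):
    """Average decode latency per generated token."""
    return log["eval_ms"] / log["runs"]

# Relative change between the logs (assumes they form a before/after pair).
throughput_gain = tokens_per_second(second_log) / tokens_per_second(first_log) - 1.0
latency_drop = 1.0 - ms_per_token(second_log) / ms_per_token(first_log)

print(f"throughput change: {throughput_gain:+.1%}")
print(f"per-token time change: {-latency_drop:+.1%}")
```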

1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.

Signed-off-by: noemotiovon <757486878@qq.com>
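
The third item amounts to a capability check that the backend answers before accepting the operator. A minimal sketch follows, assuming that "dim" means the attention head dimension (the commit message does not say which dimension) and using a hypothetical function name and illustrative SoC strings. The real check lives in ggml/src/ggml-cann/ggml-cann.cpp and may look different.

```python
def fa_supported(soc_name: str, head_dim: int) -> bool:
    """Hypothetical predicate: report whether Flash Attention may be offloaded."""
    # Explicitly unsupported on 310P devices.
    if soc_name.startswith("Ascend310P"):
        return False
    # Explicitly unsupported when the dimension is not a multiple of 16
    # (assumed here to be the head dimension).
    if head_dim % 16 != 0:
        return False
    return True

print(fa_supported("Ascend910B", 128))
print(fa_supported("Ascend310P3", 128))
print(fa_supported("Ascend910B", 72))
```

Rejecting such cases up front lets the scheduler fall back to another backend instead of failing inside the operator.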

…15561)

* CANN(flash-attn): refactor mask handling and improve performance

  1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
  2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
  3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.

  Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: fix review

  Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: Optimization FA BNSD to BSND

  Signed-off-by: noemotiovon <757486878@qq.com>

Signed-off-by: noemotiovon <757486878@qq.com>

github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and Ascend NPU (issues specific to Ascend NPUs) on Aug 25, 2025

noemotiovon force-pushed the fa branch 2 times, most recently from df36aa8 to 09f0444 Compare August 26, 2025 08:10

noemotiovon changed the title ~~[CANN]Optimization of unnecessary repeat in the FA operator~~ CANN: refactor mask handling and improve performance in FA Aug 26, 2025

hipudding reviewed Aug 26, 2025

Comment thread ggml/src/ggml-cann/ggml-cann.cpp Outdated

hipudding reviewed Aug 26, 2025

Comment thread ggml/src/ggml-cann/ggml-cann.cpp Outdated

hipudding approved these changes Aug 26, 2025

noemotiovon added 3 commits August 27, 2025 02:52

[CANN]: fix review

92e61dd

Signed-off-by: noemotiovon <757486878@qq.com>

[CANN]: Optimization FA BNSD to BSND

db86df3

Signed-off-by: noemotiovon <757486878@qq.com>

noemotiovon force-pushed the fa branch from fcc2613 to db86df3 Compare August 27, 2025 02:52

hipudding merged commit 1e74897 into ggml-org:master Aug 27, 2025
49 checks passed
