
Fix FlashAttention debug test, FP32 assert #7684

Merged
JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:cuda-fa-quant-fixup on Jun 1, 2024
Conversation

@JohannesGaessler
Contributor

Fixup to #7527 (comment).

Removes an incorrect assert for FP32 FlashAttention. Pads the number of elements per KV cache row in the tests to a multiple of the block size. The backend will still report that the head size is not supported, but I think that if something like this were ever to be implemented, padding would be the only sensible way to do it.
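For illustration, a minimal sketch (not taken from this PR; the function name `pad_to_block` and the concrete sizes are made up) of what "padding the per-row element count to a multiple of the block size" means for a quantized KV cache type:

```cpp
// Hypothetical sketch, not the actual test code from this PR.
#include <cstdint>
#include <cstdio>

// Round n up to the next multiple of block_size (block_size > 0).
static int64_t pad_to_block(int64_t n, int64_t block_size) {
    return ((n + block_size - 1) / block_size) * block_size;
}

int main() {
    const int64_t head_size  = 80;  // example head size not divisible by the block size
    const int64_t block_size = 32;  // example block size of a quantized KV cache type
    // The row would be padded from 80 to 96 elements so it holds whole blocks.
    printf("padded row size: %lld\n", (long long) pad_to_block(head_size, block_size));
    return 0;
}
```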

The github-actions bot added the testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 1, 2024
JohannesGaessler merged commit e141ce6 into ggml-org:master on Jun 1, 2024
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
