
Only enable sgemm for prompt processing, not for inference #9330

Merged
ggerganov merged 1 commit into ggml-org:master from netrunnereve:sgemm_pp on Sep 7, 2024

Conversation

@netrunnereve
Collaborator

While sgemm/tinyblas was designed to speed up prompt processing using tiled matrix multiplications, llama.cpp also calls it for inference, where it degenerates into a 1x1 computation. Personally I think it makes more sense for us to use our dedicated ggml_vec_dot functions for the inference dot products and leave sgemm for prompt processing only. That way we can optimize each path for its respective purpose.

See my PR #8049 for an example where sgemm has faster prompt processing while ggml_vec_dot has faster inference.
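The split described above can be sketched roughly as follows. This is an illustrative dispatch, not llama.cpp's actual code: the names `tiled_sgemm`, `vec_dot`, and `mul_mat` are hypothetical stand-ins for the sgemm/tinyblas path, `ggml_vec_dot`, and the matmul entry point.

```cpp
#include <cstddef>

// Stand-in for ggml_vec_dot: one row dotted against one column.
static float vec_dot(const float *a, const float *b, size_t k) {
    float sum = 0.0f;
    for (size_t i = 0; i < k; ++i) sum += a[i] * b[i];
    return sum;
}

// Stand-in for the sgemm/tinyblas path. A real implementation would use
// register tiling and SIMD; here it just loops, to keep the sketch short.
static void tiled_sgemm(const float *A, const float *B, float *C,
                        size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            C[i * n + j] = vec_dot(&A[i * k], &B[j * k], k);
}

// C = A * B^T, with A m-by-k and B n-by-k (row-major).
// Batched prompt processing (m > 1) goes to the tiled GEMM kernel;
// single-token inference (m == 1) falls back to one dot product per output,
// which is the behavior this PR restores.
static void mul_mat(const float *A, const float *B, float *C,
                    size_t m, size_t n, size_t k) {
    if (m > 1) {
        tiled_sgemm(A, B, C, m, n, k);          // prompt processing
    } else {
        for (size_t j = 0; j < n; ++j)          // inference
            C[j] = vec_dot(A, &B[j * k], k);
    }
}
```

The point of the dispatch is that each kernel only ever sees the shape it was tuned for: the tiled kernel never runs with m == 1, and the dot-product path never pays tiling overhead.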

@ggerganov ggerganov merged commit e536426 into ggml-org:master Sep 7, 2024
@netrunnereve netrunnereve deleted the sgemm_pp branch September 8, 2024 01:03
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
