avx2: compute ksigns instead of loading from table #19657
Open
dfriehs wants to merge 5 commits into ggml-org:master from
Conversation
dfriehs (Contributor, Author):

I found one more trick that seems to speed things up a bit; performance now looks better in both cases. I'd appreciate benchmarks on different systems, though. There are still some things left to do (
This is my attempt at implementing the ksign computation like in #19624 for the CPU AVX2 backend. Unfortunately I'm not sure whether it is an improvement: `test-backend-ops perf` shows slightly higher FLOPS (but uses hyperthreading), while `llama-bench` shows a reduction in performance.

Disclaimer: this is the first time I've seriously worked with AVX instructions. The flow matches the one implemented in `ggml_vec_dot_iq2_xs_q8_K` quite a bit (which I realized far too late), but IQ2_XS signs seem to be packed in a way that is more beneficial for SIMD.

Since I'm not sure the computation is faster, and I'm not sure how (or whether it is even possible) to tune this further, I would appreciate a review from someone experienced with AVX2, or benchmarks from someone who runs IQ2_XXS models with layers on the CPU. Right now I'd err on the side of closing this PR.
All performance tests were built without BLAS and with only the CPU backend.
`test-backend-ops test -b CUDA0 -p iq2_xxs` passes for me, so the computation should be correct.

`nice -20 test-backend-ops perf -b CPU -p iq2_xxs`:
- on ff4affb (master)
- on 7fe317f (this branch)
`taskset -c 0-15 nice -20 llama-bench -m Qwen2.5-Coder-32B-Instruct-IQ2_XXS.gguf -fa 0,1 -p 128 -n 64 --threads 16`:
- on ff4affb (master)
- on 7fe317f (this branch)