Optimization: Qwen3 next autoregressive pass #17996
Conversation
before: ggml_cuda_init: found 3 CUDA devices:
after: ggml_cuda_init: found 3 CUDA devices:
Nah, this should be a general optimization. This means there are other bottlenecks in play for the ROCm implementation besides the slow delta-net. Can you run inference with
That looks like a 10% bump, right?
@pwilkin Hopefully this log is what you need :)
CISC left a comment:
There's an excessive amount of conts and asserts here, most of which I'm sure are unnecessary, but I think qwen3next needs a general cleanup of these anyway, so will leave that to you at a later stage.
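To make the kind of redundancy being flagged concrete, here is an illustrative sketch (not taken from the actual diff; `a` and `b` are placeholders): when the producing op already yields a contiguous tensor, both the assert and the cont are pure overhead.

```cpp
// Illustrative only: ggml_mul_mat always produces a contiguous result,
// so the assert below can never fire and the cont just adds an extra
// graph node that copies data it doesn't need to.
ggml_tensor * t = ggml_mul_mat(ctx0, a, b);
GGML_ASSERT(ggml_is_contiguous(t)); // always true for mul_mat output
t = ggml_cont(ctx0, t);             // redundant: t is already contiguous
```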
@IIIIIllllIIIIIlllll can you do a bench for
@pwilkin In case you're wondering, I think the

@pwilkin
Adding some multi-GPU ROCm data with several experts offloaded to CPU:

Setup:
- CPU: Ryzen 9 3950x
- Model: Qwen3-Next-80B-A3B-Thinking-Q4_K_S
- Command: /llama.cpp/build/build/bin/llama-server --host 127.0.0.1 --jinja --min-p 0 --mlock --mmap -ncmoe 20 --port 44163 --repeat-penalty 1.05 --temp 0.5 --top-k 0.20 --top-p 0.95 --warmup --alias Qwen3-Next-80B-A3B-Thinking-Q4_K_S --ctx-size 75000 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --model /models/Qwen3-Next-80B-A3B-Thinking-Q4_K_S.gguf --n-gpu-layers 999 --threads 8 --tensor-split 67,33 --log-verbose

Results:
- ggml-org/main branch: 17.3 tokens/second
- pwilkin:lean_mean_token_machine branch: 22.5 tokens/second

Increase of >5 tokens/second, or ~30% faster token generation.
Some 4x V100 32GB results w/ q8_0 gguf (master vs. lean_mean_token_machine). Before: 38.39 t/s
I was feeling a bit bored and naively asked gemini-cli to make the changes CISC suggested; it seems consistently faster and it seems coherent (only did very brief testing). I do remember it breaking when it changed the sum_row conts, though, but I don't know if any of the rest are needed. cont/assert reduction: gain of 1.19 t/s over this commit (+2.67%), for a total gain of 7.4 t/s (+19.3%) over master. Patch file if you're interested: qwen3.patch
Nice little PP boost.
Worth a few percent on my system: The number of CONT ops for
Please ignore my previous reply. The test results there were run in a PuTTY terminal, and I don't know why they were so bad. It's really strange: changing -DGGML_HIP_ROCWMMA_FATTN to OFF significantly improved pp speed... Perhaps the AI MAX+ 395 has reached its performance limit (this is questionable).

this PR, -DGGML_HIP_ROCWMMA_FATTN=OFF:
this PR, -DGGML_HIP_ROCWMMA_FATTN=ON:
master, -DGGML_HIP_ROCWMMA_FATTN=OFF:
```cpp
// Choose between build_delta_net_chunking, build_delta_net_recurrent, and build_delta_net_autoregressive based on n_tokens
ggml_tensor * attn_out;
if (n_seq_tokens == 1) {
    attn_out = build_delta_net_autoregressive(q_conv, k_conv, v_conv, gate, beta, state, il);
} else if (n_seq_tokens > CHUNK_SIZE) {
    attn_out = build_delta_net_chunking(q_conv, k_conv, v_conv, gate, beta, state, causal_mask, identity, il);
} else {
    attn_out = build_delta_net_recurrent(q_conv, k_conv, v_conv, gate, beta, state, causal_mask, identity, il);
}
```
This is highly discouraged. Instead of adding more branches, we have to figure out how to make the graph static. Start with simplifying the existing graphs by removing redundant ops.
But in this case we can't make the graph static, since the special branch here is one where the decay mask computation doesn't happen (because n_seq_tokens == 1, it all collapses to trivial transformations, which can therefore be optimized out).
I can probably remove the recurrent part now, since I'm not sure there's a realistic case for it; it'll be either chunking or autoregressive.
Maybe a bit off-topic, but I had a quick look at the version on the master branch and it seems like some ggml_cont_* and ggml_transpose calls can potentially be redundant. I suspect something like this can be reduced further:
```cpp
ggml_tensor * k_cumdecay =
    ggml_cont(ctx0, ggml_transpose(ctx0, ggml_mul_mat(ctx0, attn, ggml_cont(ctx0, ggml_transpose(ctx0, kbeta_gexp)))));
```
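One possible reduction, as a sketch (assuming the batch dims of the two operands line up): ggml_mul_mat(a, b) computes AᵀB when tensors are read column-major, so transposing its result is the same as swapping the operands, and the mul_mat output is already contiguous. The outer cont/transpose pair then folds away:

```cpp
// (A^T B)^T == B^T A, so cont(transpose(mul_mat(a, b))) == mul_mat(b, a).
// Only the inner transpose of kbeta_gexp remains, and it may itself be
// removable if the backend accepts a non-contiguous operand here.
ggml_tensor * kbeta_gexp_t = ggml_cont(ctx0, ggml_transpose(ctx0, kbeta_gexp));
ggml_tensor * k_cumdecay   = ggml_mul_mat(ctx0, kbeta_gexp_t, attn);
```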
This trick in the contribution guide sometimes saved me a transpose:

> Otherwise, sometimes you can also use a non-contiguous tensor if the next ops accept it
Also, sometimes unsqueeze(-1) can be just a ggml_view, which costs almost nothing in terms of speed (see the sketch after this comment).
Edit: sometimes you can also transpose the weight when converting to GGUF, which makes it usable in the formula mentioned above.
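For the unsqueeze(-1) point above, a minimal sketch (x is a hypothetical contiguous 3D F32 tensor): ggml orders dimensions in reverse relative to torch, so appending a trailing torch dimension of size 1 means prepending ne0 = 1 here, which a view expresses for free:

```cpp
// torch-style x.unsqueeze(-1) as a zero-cost ggml view: same data, new
// shape [1, ne0, ne1, ne2]; the old strides are reused, nothing is copied.
ggml_tensor * x_unsq = ggml_view_4d(ctx0, x,
        1, x->ne[0], x->ne[1], x->ne[2],
        x->nb[0], x->nb[1], x->nb[2], 0);
```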
Is there any reason why it could have gotten slower for me? I'm compiling it with
Got an interesting finding on Win11 + RTX 5090: compiled with Vulkan support and forced to use the vulkan0 device, pp512 is up 60%+ and tg128 is up 100%+.

vulkan0:
build: c00ff92 (7389)

cuda0:
build: c00ff92 (7389)
Force-pushed from 4a494ab to b739b11.
Alright, I've done the final refactorings. I also removed the recurrent version of the delta_net in favor of the chunked version, since the use case for the recurrent one was very narrow (prompt processing with fewer than 64 tokens) and it didn't make sense to keep it just for that.

Final numbers for the IQ1_M quant on my box:

```
load_backend: loaded BLAS backend from /devel/tools/llama.cpp/build/bin/libggml-blas.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from /devel/tools/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /devel/tools/llama.cpp/build/bin/libggml-cpu-haswell.so
```
```diff
      chunk_size, causal_mask->ne[2], causal_mask->ne[3],
-     causal_mask->nb[1], causal_mask->nb[2], causal_mask->nb[3], 0);
+     causal_mask->nb[1], causal_mask->nb[2], causal_mask->nb[3], 0) :
+     ggml_tri(ctx0, ggml_fill_inplace(ctx0, ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, chunk_size, chunk_size), 1.0f),
```
ggml_new_tensor_2d should be avoided in general, especially inside loops. It creates new tensors, increasing the graph size and the compute buffers. Use it only for input tensors at the beginning of the graph.
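A sketch of the suggested pattern (names are illustrative, following the PR's causal_mask): create constant helpers once as graph inputs, fill them from the host, and hand out views per layer instead of materializing new tensors inside the build functions:

```cpp
// Created once at the start of graph construction, not per layer or loop
// iteration; the host fills it before evaluation and every layer reuses it.
ggml_tensor * causal_mask = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, CHUNK_SIZE, CHUNK_SIZE);
ggml_set_input(causal_mask);
```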
```cpp
ggml_tensor * chunked_mask =
    ggml_view_4d(ctx0, causal_mask, chunk_size,
        n_tokens >= chunk_size ?
```
Can we avoid these branches? The old version is more friendly towards keeping the graph topology static, so if it still works, it would be better to keep it.
As a comparison, fastllm on my machine, running Qwen3-Next-80B-A3B-Thinking-FP8 directly with 5060 Ti offload, keeps TG around 21 t/s at the beginning and drops to 13 t/s at around 45K context length.
@ggerganov aight, I think it's as clean as I can make it at this point.
* It's Qwen3 Next, the lean mean token generation machine!
* Apply patches from thread
* Remove recurrent version, only keep chunked and autoregressive
* Remove unnecessary conts and asserts
* Remove more extra conts and asserts
* Cleanup masking
This change adds a dedicated autoregressive version of delta-net which short-circuits all the recurrent computations for n_seq_tokens == 1. The end result is roughly a 40% bump in token generation speed.
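For context, a sketch of the single-step gated delta rule that this path specializes (notation as in the Gated DeltaNet formulation; the code's exact ordering of the gate $\alpha_t$ and mixing $\beta_t$ terms may differ):

$$
S_t = \alpha_t \, S_{t-1}\left(I - \beta_t \, k_t k_t^\top\right) + \beta_t \, v_t k_t^\top, \qquad o_t = S_t \, q_t
$$

With n_seq_tokens == 1 this is a single rank-1 state update plus one matrix-vector product, so none of the chunked decay-mask machinery needs to be built.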