
[CUDA] Increase number of output elements per-thread block if the K-dimension is small#20635

Merged
am17an merged 3 commits into ggml-org:master from gaugarg-nv:small_k_optimization on Mar 22, 2026
Conversation

Contributor

@gaugarg-nv gaugarg-nv commented Mar 16, 2026

The K-dimension (inner dot product dimension) of the FFN-down matrices can be quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536. The current heuristic uses a group of 4 warps irrespective of the K-dimension, leaving some of the threads idle. This results in poor performance for these matrices.

This change increases the number of output elements per block for such matrices.

This change is also helpful for Tensor parallelism (PR #19378), where FFN-down is split along the K dimension.

Single GPU Performance on 1x RTX Pro 6000 Blackwell
model_type n_ubatch n_prompt master-avg_ts pr-avg_ts Speed-up
qwen3moe 30B.A3B Q4_K - Medium 1 512 231.4418 239.4359 1.03
qwen3moe 30B.A3B Q4_K - Medium 2 512 336.3564 353.3403 1.05
qwen3moe 30B.A3B Q4_K - Medium 4 512 498.7951 544.9048 1.09
qwen3moe 30B.A3B Q4_K - Medium 8 512 579.7136 580.2928 1.00
qwen3moe 30B.A3B Q4_K - Medium 16 512 936.1984 934.2313 1.00
qwen3moe 30B.A3B Q4_K - Medium 32 512 1456.243 1453.281 1.00
qwen3moe 30B.A3B Q4_K - Medium 64 512 2185.851 2185.245 1.00
qwen3moe 30B.A3B Q4_K - Medium 128 512 2970.54 2969.02 1.00
qwen3moe 30B.A3B Q4_K - Medium 256 512 4774.641 4779.619 1.00
qwen3moe 30B.A3B Q4_K - Medium 512 512 6587.268 6592.251 1.00
qwen3moe 30B.A3B Q8_0 1 512 188.6321 189.3348 1.00
qwen3moe 30B.A3B Q8_0 2 512 296.4038 304.8155 1.03
qwen3moe 30B.A3B Q8_0 4 512 446.3545 480.4061 1.08
qwen3moe 30B.A3B Q8_0 8 512 513.8571 513.5698 1.00
qwen3moe 30B.A3B Q8_0 16 512 814.9273 809.3003 0.99
qwen3moe 30B.A3B Q8_0 32 512 1309.532 1310.682 1.00
qwen3moe 30B.A3B Q8_0 64 512 2145.738 2147.491 1.00
qwen3moe 30B.A3B Q8_0 128 512 3039.336 3040.037 1.00
qwen3moe 30B.A3B Q8_0 256 512 4908.882 4912.358 1.00
qwen3moe 30B.A3B Q8_0 512 512 6795.054 6800.975 1.00
qwen3 4B Q4_K - Medium 1 512 270.4391 270.4142 1.00
qwen3 4B Q4_K - Medium 2 512 522.5462 523.2189 1.00
qwen3 4B Q4_K - Medium 4 512 888.7895 891.6788 1.00
qwen3 4B Q4_K - Medium 8 512 1331.554 1333.544 1.00
qwen3 4B Q4_K - Medium 16 512 2609.212 2613.457 1.00
qwen3 4B Q4_K - Medium 32 512 4131.247 4153.166 1.01
qwen3 4B Q4_K - Medium 64 512 6010.69 6040.168 1.00
qwen3 4B Q4_K - Medium 128 512 8336.18 8368.532 1.00
qwen3 4B Q4_K - Medium 256 512 12653.47 12680.27 1.00
qwen3 4B Q4_K - Medium 512 512 16933.91 16990.33 1.00
gpt-oss 20B MXFP4 MoE 1 512 327.1843 327.2503 1.00
gpt-oss 20B MXFP4 MoE 2 512 487.6076 487.2249 1.00
gpt-oss 20B MXFP4 MoE 4 512 722.2551 722.1628 1.00
gpt-oss 20B MXFP4 MoE 8 512 909.277 911.6954 1.00
gpt-oss 20B MXFP4 MoE 16 512 1475.936 1474.678 1.00
gpt-oss 20B MXFP4 MoE 32 512 2448.124 2449.26 1.00
gpt-oss 20B MXFP4 MoE 64 512 4019.604 4021.089 1.00
gpt-oss 20B MXFP4 MoE 128 512 5825.155 5820.645 1.00
gpt-oss 20B MXFP4 MoE 256 512 8901.978 8885.761 1.00
gpt-oss 20B MXFP4 MoE 512 512 11634.01 11628.95 1.00
llama 8B Q4_K - Medium 1 512 218.6202 218.5992 1.00
llama 8B Q4_K - Medium 2 512 426.0842 425.9328 1.00
llama 8B Q4_K - Medium 4 512 753.1047 753.2821 1.00
llama 8B Q4_K - Medium 8 512 1043.164 1042.673 1.00
llama 8B Q4_K - Medium 16 512 2306.093 2301.84 1.00
llama 8B Q4_K - Medium 32 512 3720.924 3730.606 1.00
llama 8B Q4_K - Medium 64 512 5444.508 5457.328 1.00
llama 8B Q4_K - Medium 128 512 7452.762 7408.557 0.99
llama 8B Q4_K - Medium 256 512 10174.56 10179.98 1.00
llama 8B Q4_K - Medium 512 512 12917.97 12923.66 1.00
llama 8B Q4_0 1 512 232.4301 232.74 1.00
llama 8B Q4_0 2 512 461.9919 461.8752 1.00
llama 8B Q4_0 4 512 889.2508 889.4003 1.00
llama 8B Q4_0 8 512 1377.003 1377.244 1.00
llama 8B Q4_0 16 512 2338.211 2335.362 1.00
llama 8B Q4_0 32 512 3822.713 3822.771 1.00
llama 8B Q4_0 64 512 5891.381 5883.02 1.00
llama 8B Q4_0 128 512 7699.878 7715.334 1.00
llama 8B Q4_0 256 512 10874.01 10842.19 1.00
llama 8B Q4_0 512 512 14034.76 14027.07 1.00
Tensor Parallelism Performance on 2x RTX Pro 6000 Blackwell with PR 19378
GPUs model test ae0334f-t/s pr-t/s Speed-up
2xRTX 6000 Pro BW Qwen3-235B-A22B-Q4_0 pp512 2165.37 2167.83 1.00
2xRTX 6000 Pro BW Qwen3-235B-A22B-Q4_0 tg128 71.51 75.37 1.05
2xRTX 6000 Pro BW Qwen3-30B-A3B-Q4_0 pp512 8357.29 8359.93 1.00
2xRTX 6000 Pro BW Qwen3-30B-A3B-Q4_0 tg128 182.1 194.26 1.07
4xRTX 6000 Pro BW Qwen3-235B-A22B-Q4_0 pp512 2367.91 2342.61 0.99
4xRTX 6000 Pro BW Qwen3-235B-A22B-Q4_0 tg128 66.05 71.5 1.08
4xRTX 6000 Pro BW Qwen3-30B-A3B-Q4_0 pp512 8408.73 8415.25 1.00
4xRTX 6000 Pro BW Qwen3-30B-A3B-Q4_0 tg128 155.57 162.79 1.05

@gaugarg-nv requested a review from a team as a code owner, March 16, 2026 11:26
@github-actions added the "Nvidia GPU" and "ggml" labels, Mar 16, 2026
@am17an
Contributor

am17an commented Mar 16, 2026

This also helps in Qwen3.5, which has a down shape of 512.

@gaugarg-nv
Contributor Author

> This also helps in Qwen3.5 which has a down shape of 512

Adding Qwen3.5-35B-A3B data on RTX Pro 6000 BW:

model_type n_ubatch n_prompt master-avg_ts pr-avg_ts Speed-up
qwen35moe 35B.A3B Q4_K - Medium 1 512 201.1068 207.3763 1.03
qwen35moe 35B.A3B Q4_K - Medium 2 512 287.9365 300.4817 1.04
qwen35moe 35B.A3B Q4_K - Medium 4 512 476.962 513.3622 1.08
qwen35moe 35B.A3B Q4_K - Medium 8 512 566.0659 564.7415 1.00
qwen35moe 35B.A3B Q4_K - Medium 16 512 795.9801 797.4574 1.00
qwen35moe 35B.A3B Q4_K - Medium 32 512 1291.454 1291.694 1.00
qwen35moe 35B.A3B Q4_K - Medium 64 512 1979.488 1979.35 1.00
qwen35moe 35B.A3B Q4_K - Medium 128 512 2660.565 2659.532 1.00
qwen35moe 35B.A3B Q4_K - Medium 256 512 4538.735 4538.854 1.00
qwen35moe 35B.A3B Q4_K - Medium 512 512 6729.192 6747.521 1.00

Comment thread ggml/src/ggml-cuda/mmvq.cu Outdated
```cpp
{ \
    constexpr int c_ncols_dst = C_NCOLS_DST; \
    const int nwarps = calc_nwarps(type, c_ncols_dst, table_id); \
    const bool use_small_k = nwarps > 1 && blocks_per_row_x < nwarps * blocks_per_iter_1warp; \
```
Contributor


can this be done inside the cuda kernel using multiplications by re-ordering this expression? It would simplify the code quite a bit

Contributor Author


To be clear, are you suggesting removing the small_k template parameter from the kernel? I think that should be doable.

But we will still need this code on the host as we modify rows_per_block for small_k, which in turn modifies the grid dimensions.

Contributor


Yes, exactly. On the host side we can create a function to return the correct dims instead of doing the if/else and the macro.

Contributor Author


I will try to simplify the host code.

But I think removing small_k as a template parameter from the kernel would mean rows_per_cuda_block can no longer be constexpr, and some of the local register and shared memory allocations depend on this value being constexpr.

Contributor


I see. Then it makes sense to leave the template parameter as it is. My only worry was the compile times/binary sizes. Do you notice any difference in them? If you're using ninja build it should be pretty easy to see via .ninja_log

Contributor Author


@am17an Sorry for the late follow-up on this. I was busy with some other work.

Regarding build times, I am seeing an increase of about 12 seconds in compile time and about 2 MB in libggml-cuda.so.
IMO, given that most SOTA models are MoE, it is worth taking this hit.

Regarding host-side code simplification, I don't think we can avoid the if/else, since small_k is a template parameter.
Let me know if you have any specific ideas for simplifying the code.

Contributor


Are you compiling only for Blackwell? I think this change will slow down the CI. We can limit this change to ncols_dst = 1, since that is 99% of the use case.

```
228.646s  ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmvq.cu.o
169.613s  ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q2_k.cu.o
163.399s  ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-tile-instance-dkq256-dv256.cu.o
153.896s  ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q6_k.cu.o
147.442s  ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_1.cu.o
144.375s  ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q3_k.cu.o
144.037s  ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-iq2_s.cu.o
139.735s  ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q5_0.cu.o
138.991s  ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-iq2_xs.cu.o
132.618s  ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/fattn-mma-f16-instance-ncols1_64-ncols2_1.cu.o
```

Contributor Author


Ok, I have limited the change to ncols_dst = 1 for now. This obviously means no scaling for BS > 1. I will spend more time on this kernel later and see how we can add the small-K specialization without adding too much compilation time and library size.

Contributor

am17an commented Mar 18, 2026


The typical way is to separate ncols into separate template files. But I think it might be overkill for now

Comment thread ggml/src/ggml-cuda/mmvq.cu Outdated
Comment on lines +482 to +484
```cpp
// When K is small, increase rows_per_block to match nwarps so each warp
// processes a different row. This amortizes y-vector reads and reduces block count.
// Trigger when the full thread block covers all K blocks in a single loop iteration.
```
Contributor


If you increase rows_per_block that will not result in different warps working on different src0 rows. Each thread will still work on every row/column assigned to a CUDA block, all that changes is that the inner loop is over more rows so the compiler should in principle be able to recognize that the same data is being loaded multiple times. To avoid wasted work I would suggest you reduce the number of warps per CUDA block instead.

Regardless of the above, it may very well be that increasing rows_per_block is beneficial on Blackwell (in general, I did not test this). Did you test the impact of your small_k config on larger matrices?

Generally speaking, MMVQ is of comparatively poor code quality for historical reasons. It's among the first kernels that I wrote so I was less experienced (llama.cpp/ggml was my first contact with CUDA) and I didn't yet find the time to circle back to it to look into how it could be improved. It may be worthwhile for you to look for optimization opportunities beyond applications for TP.

Contributor Author


> If you increase rows_per_block that will not result in different warps working on different src0 rows. Each thread will still work on every row/column assigned to a CUDA block, all that changes is that the inner loop is over more rows so the compiler should in principle be able to recognize that the same data is being loaded multiple times.

Thanks @JohannesGaessler for the pointers. You're right about the kernel. I will update the comments.

> To avoid wasted work I would suggest you reduce the number of warps per CUDA block instead.

I tried setting n_warps to 1 or 2 for small_k without modifying rows_per_block. Neither showed any perf improvement. In general, I think very small CTAs are not very efficient.

> Regardless of the above, it may very well be that increasing rows_per_block is beneficial on Blackwell (in general, I did not test this). Did you test the impact of your small_k config for larger matrices.

Yes, I tried setting small_k to true for all cases, which would increase rows_per_block everywhere. This gave a regression of 4-6% for a few models in the BS=1 case, so I rejected the idea.

> It may be worthwhile for you to look for optimization opportunities beyond applications for TP.

Yes, the main motivation of this work was tensor parallelism. I will explore more ideas.
In general, for BS=1 the performance of the kernels looks good except for some corner cases like small K. One idea worth pursuing is using one warp per output element for small K, so that we avoid shared memory and use a lightweight warpReduce instead of a block-level reduce. I can try experimenting with this idea.

Would you like to proceed with this PR first, or should we do more exploration around these ideas first?

Contributor


I would say that if this PR empirically improves performance for some cases we should keep it; just change the rationale, since it is misleading.

Contributor Author


@JohannesGaessler updated the comments and PR description.

With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536.
The current heuristic uses a group of 4 warps irrespective of K-dimension size, resulting in some of the threads being idle. This results in poor performance for these matrices.

This change increases the number of output elements per block for such cases.
@gaugarg-nv force-pushed the small_k_optimization branch from 4f20a44 to cfbbfb2 on March 18, 2026 11:27
@gaugarg-nv changed the title from "[CUDA] Use a single warp per element instead of a single block per element if the K-dimension is small" to "[CUDA] Increase number of output elements per-thread block if the K-dimension is small" on Mar 18, 2026
Contributor

@JohannesGaessler left a comment


Currently this PR changes the behavior for all hardware. So either we need to verify that these changes are actually beneficial in those instances, or this PR needs to be limited in terms of what hardware it affects.

@am17an
Contributor

am17an commented Mar 19, 2026

I ran some tests on the GPUs I have available. @IMbackK can you run this PR on your AMD cards? I should have a Strix Halo soon, so I should be able to test some things on that as well.

4090

Model Test t/s master t/s cfbbfb2 Speedup
qwen35moe 35B.A3B Q4_K_S tg128 183.23 185.76 1.01

3090

Model Test t/s master t/s cfbbfb2 Speedup
qwen35moe 35B.A3B Q4_K_S tg128 146.48 151.64 1.04

@ggerganov
Member

Some gains on DGX Spark too:

Model Test t/s master t/s pr/20635 Speedup
kimi-linear 48B.A3B Q4_K_M tg32 69.88 71.03 1.02
qwen35 27B Q4_K_M tg32 11.82 11.81 1.00
qwen3next 80B.A3B Q4_0 tg32 64.27 65.53 1.02

@IMbackK
Collaborator

IMbackK commented Mar 19, 2026

Seems fine:

GPU Model Microbatch size Test t/s master t/s small_k_optimization Speedup
MI100 gpt-oss 20B MXFP4 MoE 1 pp1024 169.06 165.95 0.98
MI100 gpt-oss 20B MXFP4 MoE 4 pp1024 236.31 234.73 0.99
MI100 gpt-oss 20B MXFP4 MoE 8 pp1024 304.73 303.89 1.00
MI100 gpt-oss 20B MXFP4 MoE 512 pp1024 1956.96 1952.06 1.00
MI100 mistral3 14B Q8_0 1 pp1024 26.55 26.44 1.00
MI100 mistral3 14B Q8_0 4 pp1024 75.48 74.37 0.99
MI100 mistral3 14B Q8_0 8 pp1024 106.88 107.57 1.01
MI100 mistral3 14B Q8_0 512 pp1024 756.05 756.43 1.00
MI100 qwen3moe 30B.A3B Q8_0 1 pp1024 102.62 106.37 1.04
MI100 qwen3moe 30B.A3B Q8_0 4 pp1024 145.90 145.76 1.00
MI100 qwen3moe 30B.A3B Q8_0 8 pp1024 168.12 166.81 0.99
MI100 qwen3moe 30B.A3B Q8_0 512 pp1024 1483.53 1481.14 1.00
RX 7900 XTX gpt-oss 20B MXFP4 MoE 1 pp1024 198.24 197.62 1.00
RX 7900 XTX gpt-oss 20B MXFP4 MoE 4 pp1024 418.06 416.64 1.00
RX 7900 XTX gpt-oss 20B MXFP4 MoE 8 pp1024 582.21 579.94 1.00
RX 7900 XTX gpt-oss 20B MXFP4 MoE 512 pp1024 1317.01 1320.76 1.00
RX 7900 XTX mistral3 14B Q8_0 1 pp1024 33.78 33.49 0.99
RX 7900 XTX mistral3 14B Q8_0 4 pp1024 116.07 115.43 0.99
RX 7900 XTX mistral3 14B Q8_0 8 pp1024 185.63 185.71 1.00
RX 7900 XTX mistral3 14B Q8_0 512 pp1024 370.08 372.74 1.01

The slight slowdown with gpt-oss 20B is reproducible

@gaugarg-nv
Contributor Author

> The slight slowdown with gpt-oss 20B is reproducible

I don't have AMD hardware to test and tune the heuristic.

@ggerganov @JohannesGaessler @IMbackK should I limit this change to Nvidia for now?

@IMbackK
Collaborator

IMbackK commented Mar 20, 2026

+4% here, -2% there; I don't think it really matters. From a performance standpoint this can be merged as-is from my side.

Did you check gpt-oss on NVIDIA hardware?

@am17an
Contributor

am17an commented Mar 21, 2026

> did you check gpt oss on nv hardware?

I don't think it should have any effect on gpt-oss, as its down dim is 2880, which this PR leaves unchanged.

@JohannesGaessler
Contributor

I'm currently running more comprehensive benchmarks, don't merge please.

@JohannesGaessler
Contributor

I tested the performance of the original commit with which the PR was opened:

Performance
GPU Model Microbatch size Test t/s 312cf03 t/s cfbbfb2 Speedup
MI60 / MI50 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 1 pp512 90.27 94.05 1.04
MI60 / MI50 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 2 pp512 115.64 134.74 1.17
MI60 / MI50 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 3 pp512 131.13 154.79 1.18
MI60 / MI50 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 4 pp512 147.86 177.64 1.20
MI60 / MI50 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 1 pp512 62.13 59.51 0.96
MI60 / MI50 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 2 pp512 72.68 81.30 1.12
MI60 / MI50 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 3 pp512 81.20 90.82 1.12
MI60 / MI50 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 4 pp512 87.81 100.11 1.14
MI60 / MI50 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 1 pp512 64.74 61.32 0.95
MI60 / MI50 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 2 pp512 77.24 83.77 1.08
MI60 / MI50 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 3 pp512 86.24 94.28 1.09
MI60 / MI50 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 4 pp512 93.81 104.79 1.12
MI60 / MI50 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 1 pp512 65.01 63.92 0.98
MI60 / MI50 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 2 pp512 77.56 83.92 1.08
MI60 / MI50 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 3 pp512 87.95 96.74 1.10
MI60 / MI50 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 4 pp512 93.49 105.58 1.13
MI60 / MI50 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 1 pp512 57.45 56.46 0.98
MI60 / MI50 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 2 pp512 60.60 74.48 1.23
MI60 / MI50 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 3 pp512 65.46 82.57 1.26
MI60 / MI50 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 4 pp512 69.13 89.32 1.29
MI60 / MI50 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 1 pp512 60.06 59.31 0.99
MI60 / MI50 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 2 pp512 62.99 77.84 1.24
MI60 / MI50 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 3 pp512 67.56 85.20 1.26
MI60 / MI50 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 4 pp512 71.30 91.94 1.29
MI60 / MI50 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 1 pp512 56.83 56.60 1.00
MI60 / MI50 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 2 pp512 63.08 76.86 1.22
MI60 / MI50 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 3 pp512 68.42 85.11 1.24
MI60 / MI50 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 4 pp512 72.96 93.01 1.27
MI60 / MI50 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 1 pp512 59.47 58.35 0.98
MI60 / MI50 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 2 pp512 69.74 81.01 1.16
MI60 / MI50 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 3 pp512 76.25 90.74 1.19
MI60 / MI50 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 4 pp512 81.24 98.80 1.22
MI60 / MI50 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 1 pp512 100.71 103.96 1.03
MI60 / MI50 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 2 pp512 128.77 139.69 1.08
MI60 / MI50 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 3 pp512 139.25 153.47 1.10
MI60 / MI50 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 4 pp512 154.44 173.69 1.12
MI60 / MI50 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 1 pp512 94.12 91.44 0.97
MI60 / MI50 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 2 pp512 126.67 128.98 1.02
MI60 / MI50 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 3 pp512 137.25 141.10 1.03
MI60 / MI50 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 4 pp512 156.21 161.02 1.03
MI60 / MI50 qwen3moe 30B.A3B Q2_K_M 1 pp512 78.83 80.52 1.02
MI60 / MI50 qwen3moe 30B.A3B Q2_K_M 2 pp512 100.73 118.63 1.18
MI60 / MI50 qwen3moe 30B.A3B Q2_K_M 3 pp512 108.36 128.89 1.19
MI60 / MI50 qwen3moe 30B.A3B Q2_K_M 4 pp512 120.67 146.09 1.21
MI60 / MI50 qwen3moe 30B.A3B Q3_K_S 1 pp512 68.72 69.85 1.02
MI60 / MI50 qwen3moe 30B.A3B Q3_K_S 2 pp512 87.33 98.04 1.12
MI60 / MI50 qwen3moe 30B.A3B Q3_K_S 3 pp512 93.98 108.21 1.15
MI60 / MI50 qwen3moe 30B.A3B Q3_K_S 4 pp512 101.93 118.35 1.16
MI60 / MI50 qwen3moe 30B.A3B Q4_0 1 pp512 102.70 109.57 1.07
MI60 / MI50 qwen3moe 30B.A3B Q4_0 2 pp512 131.70 147.62 1.12
MI60 / MI50 qwen3moe 30B.A3B Q4_0 3 pp512 140.12 159.53 1.14
MI60 / MI50 qwen3moe 30B.A3B Q4_0 4 pp512 158.34 183.89 1.16
MI60 / MI50 qwen3moe 30B.A3B Q4_1 1 pp512 102.34 110.68 1.08
MI60 / MI50 qwen3moe 30B.A3B Q4_1 2 pp512 131.85 143.64 1.09
MI60 / MI50 qwen3moe 30B.A3B Q4_1 3 pp512 141.05 154.34 1.09
MI60 / MI50 qwen3moe 30B.A3B Q4_1 4 pp512 159.43 179.65 1.13
MI60 / MI50 qwen3moe 30B.A3B Q4_K_S 1 pp512 90.94 95.07 1.05
MI60 / MI50 qwen3moe 30B.A3B Q4_K_S 2 pp512 114.58 118.59 1.03
MI60 / MI50 qwen3moe 30B.A3B Q4_K_S 3 pp512 126.00 130.37 1.03
MI60 / MI50 qwen3moe 30B.A3B Q4_K_S 4 pp512 138.11 146.05 1.06
MI60 / MI50 qwen3moe 30B.A3B Q5_0 1 pp512 92.99 97.07 1.04
MI60 / MI50 qwen3moe 30B.A3B Q5_0 2 pp512 120.91 130.58 1.08
MI60 / MI50 qwen3moe 30B.A3B Q5_0 3 pp512 129.70 142.13 1.10
MI60 / MI50 qwen3moe 30B.A3B Q5_0 4 pp512 146.17 163.85 1.12
MI60 / MI50 qwen3moe 30B.A3B Q5_1 1 pp512 94.89 100.90 1.06
MI60 / MI50 qwen3moe 30B.A3B Q5_1 2 pp512 123.04 133.97 1.09
MI60 / MI50 qwen3moe 30B.A3B Q5_1 3 pp512 132.91 147.36 1.11
MI60 / MI50 qwen3moe 30B.A3B Q5_1 4 pp512 149.63 169.88 1.14
MI60 / MI50 qwen3moe 30B.A3B Q5_K_S 1 pp512 85.36 85.22 1.00
MI60 / MI50 qwen3moe 30B.A3B Q5_K_S 2 pp512 111.76 111.99 1.00
MI60 / MI50 qwen3moe 30B.A3B Q5_K_S 3 pp512 122.38 123.10 1.01
MI60 / MI50 qwen3moe 30B.A3B Q5_K_S 4 pp512 131.69 132.27 1.00
MI60 / MI50 qwen3moe 30B.A3B Q6_K 1 pp512 83.61 87.35 1.04
MI60 / MI50 qwen3moe 30B.A3B Q6_K 2 pp512 107.60 114.49 1.06
MI60 / MI50 qwen3moe 30B.A3B Q6_K 3 pp512 120.24 128.34 1.07
MI60 / MI50 qwen3moe 30B.A3B Q6_K 4 pp512 129.20 138.44 1.07
MI60 / MI50 qwen3moe 30B.A3B Q8_0 1 pp512 91.67 95.29 1.04
MI60 / MI50 qwen3moe 30B.A3B Q8_0 2 pp512 114.66 124.67 1.09
MI60 / MI50 qwen3moe 30B.A3B Q8_0 3 pp512 119.55 129.70 1.08
MI60 / MI50 qwen3moe 30B.A3B Q8_0 4 pp512 136.65 149.57 1.09
MI100 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 1 pp512 102.58 99.76 0.97
MI100 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 2 pp512 137.33 146.02 1.06
MI100 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 3 pp512 164.31 180.87 1.10
MI100 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 4 pp512 147.47 163.11 1.11
MI100 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 1 pp512 75.11 69.66 0.93
MI100 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 2 pp512 96.79 97.40 1.01
MI100 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 3 pp512 111.04 116.32 1.05
MI100 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 4 pp512 104.48 111.29 1.07
MI100 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 1 pp512 78.78 71.93 0.91
MI100 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 2 pp512 100.80 100.70 1.00
MI100 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 3 pp512 116.56 119.33 1.02
MI100 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 4 pp512 110.25 114.24 1.04
MI100 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 1 pp512 80.29 73.66 0.92
MI100 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 2 pp512 101.54 102.81 1.01
MI100 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 3 pp512 118.98 122.38 1.03
MI100 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 4 pp512 112.40 116.16 1.03
MI100 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 1 pp512 72.87 68.20 0.94
MI100 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 2 pp512 83.82 93.07 1.11
MI100 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 3 pp512 94.15 110.16 1.17
MI100 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 4 pp512 90.72 105.24 1.16
MI100 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 1 pp512 75.06 70.47 0.94
MI100 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 2 pp512 86.03 98.35 1.14
MI100 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 3 pp512 96.97 113.85 1.17
MI100 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 4 pp512 92.96 108.49 1.17
MI100 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 1 pp512 71.35 66.70 0.93
MI100 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 2 pp512 83.04 93.01 1.12
MI100 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 3 pp512 94.23 110.34 1.17
MI100 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 4 pp512 90.05 105.91 1.18
MI100 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 1 pp512 73.59 68.44 0.93
MI100 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 2 pp512 84.12 97.52 1.16
MI100 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 3 pp512 95.87 114.90 1.20
MI100 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 4 pp512 91.43 109.26 1.20
MI100 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 1 pp512 106.51 108.01 1.01
MI100 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 2 pp512 141.41 144.43 1.02
MI100 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 3 pp512 166.78 176.55 1.06
MI100 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 4 pp512 148.21 156.24 1.05
MI100 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 1 pp512 102.16 98.67 0.97
MI100 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 2 pp512 137.25 140.56 1.02
MI100 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 3 pp512 165.33 169.78 1.03
MI100 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 4 pp512 149.31 153.78 1.03
MI100 qwen3moe 30B.A3B Q2_K_M 1 pp512 93.31 92.24 0.99
MI100 qwen3moe 30B.A3B Q2_K_M 2 pp512 120.38 130.91 1.09
MI100 qwen3moe 30B.A3B Q2_K_M 3 pp512 136.36 154.15 1.13
MI100 qwen3moe 30B.A3B Q2_K_M 4 pp512 124.72 138.40 1.11
MI100 qwen3moe 30B.A3B Q3_K_S 1 pp512 84.86 86.51 1.02
MI100 qwen3moe 30B.A3B Q3_K_S 2 pp512 109.47 117.89 1.08
MI100 qwen3moe 30B.A3B Q3_K_S 3 pp512 123.41 136.75 1.11
MI100 qwen3moe 30B.A3B Q3_K_S 4 pp512 115.35 125.69 1.09
MI100 qwen3moe 30B.A3B Q4_0 1 pp512 115.03 112.84 0.98
MI100 qwen3moe 30B.A3B Q4_0 2 pp512 145.80 152.20 1.04
MI100 qwen3moe 30B.A3B Q4_0 3 pp512 171.50 185.88 1.08
MI100 qwen3moe 30B.A3B Q4_0 4 pp512 156.13 166.88 1.07
MI100 qwen3moe 30B.A3B Q4_1 1 pp512 111.52 117.03 1.05
MI100 qwen3moe 30B.A3B Q4_1 2 pp512 145.93 156.68 1.07
MI100 qwen3moe 30B.A3B Q4_1 3 pp512 176.35 187.22 1.06
MI100 qwen3moe 30B.A3B Q4_1 4 pp512 155.00 166.99 1.08
MI100 qwen3moe 30B.A3B Q4_K_S 1 pp512 102.40 104.45 1.02
MI100 qwen3moe 30B.A3B Q4_K_S 2 pp512 130.26 131.62 1.01
MI100 qwen3moe 30B.A3B Q4_K_S 3 pp512 154.30 155.47 1.01
MI100 qwen3moe 30B.A3B Q4_K_S 4 pp512 142.22 143.51 1.01
MI100 qwen3moe 30B.A3B Q5_0 1 pp512 102.25 105.08 1.03
MI100 qwen3moe 30B.A3B Q5_0 2 pp512 135.35 141.69 1.05
MI100 qwen3moe 30B.A3B Q5_0 3 pp512 160.03 169.72 1.06
MI100 qwen3moe 30B.A3B Q5_0 4 pp512 146.22 152.27 1.04
MI100 qwen3moe 30B.A3B Q5_1 1 pp512 105.79 105.58 1.00
MI100 qwen3moe 30B.A3B Q5_1 2 pp512 139.70 143.27 1.03
MI100 qwen3moe 30B.A3B Q5_1 3 pp512 161.99 174.16 1.08
MI100 qwen3moe 30B.A3B Q5_1 4 pp512 147.81 155.39 1.05
MI100 qwen3moe 30B.A3B Q5_K_S 1 pp512 98.59 98.88 1.00
MI100 qwen3moe 30B.A3B Q5_K_S 2 pp512 127.59 129.94 1.02
MI100 qwen3moe 30B.A3B Q5_K_S 3 pp512 150.36 156.16 1.04
MI100 qwen3moe 30B.A3B Q5_K_S 4 pp512 137.57 139.29 1.01
MI100 qwen3moe 30B.A3B Q6_K 1 pp512 92.18 93.85 1.02
MI100 qwen3moe 30B.A3B Q6_K 2 pp512 121.71 123.51 1.01
MI100 qwen3moe 30B.A3B Q6_K 3 pp512 143.35 145.80 1.02
MI100 qwen3moe 30B.A3B Q6_K 4 pp512 128.13 132.66 1.04
MI100 qwen3moe 30B.A3B Q8_0 1 pp512 101.01 103.44 1.02
MI100 qwen3moe 30B.A3B Q8_0 2 pp512 130.77 136.30 1.04
MI100 qwen3moe 30B.A3B Q8_0 3 pp512 147.02 156.62 1.07
MI100 qwen3moe 30B.A3B Q8_0 4 pp512 134.52 143.03 1.06
P40 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 1 pp512 70.83 76.51 1.08
P40 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 2 pp512 88.17 115.48 1.31
P40 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 3 pp512 99.24 133.59 1.35
P40 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 4 pp512 105.75 146.94 1.39
P40 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 1 pp512 62.88 58.25 0.93
P40 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 2 pp512 74.83 81.19 1.08
P40 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 3 pp512 83.76 92.92 1.11
P40 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 4 pp512 89.26 99.82 1.12
P40 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 1 pp512 65.09 60.78 0.93
P40 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 2 pp512 77.21 85.80 1.11
P40 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 3 pp512 86.05 98.40 1.14
P40 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 4 pp512 90.69 104.95 1.16
P40 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 1 pp512 67.68 63.56 0.94
P40 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 2 pp512 81.07 92.54 1.14
P40 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 3 pp512 91.42 105.21 1.15
P40 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 4 pp512 95.81 113.05 1.18
P40 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 1 pp512 58.66 47.53 0.81
P40 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 2 pp512 69.16 64.93 0.94
P40 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 3 pp512 76.45 72.78 0.95
P40 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 4 pp512 80.78 77.19 0.96
P40 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 1 pp512 59.44 48.92 0.82
P40 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 2 pp512 71.17 67.93 0.95
P40 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 3 pp512 78.16 75.82 0.97
P40 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 4 pp512 82.84 80.26 0.97
P40 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 1 pp512 59.45 50.51 0.85
P40 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 2 pp512 70.94 68.53 0.97
P40 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 3 pp512 79.24 77.41 0.98
P40 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 4 pp512 83.34 82.28 0.99
P40 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 1 pp512 61.72 56.10 0.91
P40 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 2 pp512 74.85 79.29 1.06
P40 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 3 pp512 83.75 90.94 1.09
P40 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 4 pp512 88.52 97.27 1.10
P40 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 1 pp512 75.63 79.99 1.06
P40 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 2 pp512 98.15 106.92 1.09
P40 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 3 pp512 113.17 126.04 1.11
P40 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 4 pp512 118.34 132.86 1.12
P40 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 1 pp512 68.96 70.81 1.03
P40 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 2 pp512 82.60 106.18 1.29
P40 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 3 pp512 92.93 123.12 1.32
P40 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 4 pp512 96.02 133.42 1.39
P40 qwen3moe 30B.A3B Q2_K_M 1 pp512 56.22 54.84 0.98
P40 qwen3moe 30B.A3B Q2_K_M 2 pp512 67.84 66.99 0.99
P40 qwen3moe 30B.A3B Q2_K_M 3 pp512 74.55 73.84 0.99
P40 qwen3moe 30B.A3B Q2_K_M 4 pp512 77.75 77.04 0.99
P40 qwen3moe 30B.A3B Q3_K_S 1 pp512 46.80 45.62 0.97
P40 qwen3moe 30B.A3B Q3_K_S 2 pp512 58.47 57.51 0.98
P40 qwen3moe 30B.A3B Q3_K_S 3 pp512 63.75 63.11 0.99
P40 qwen3moe 30B.A3B Q3_K_S 4 pp512 66.88 66.65 1.00
P40 qwen3moe 30B.A3B Q4_0 1 pp512 71.93 81.17 1.13
P40 qwen3moe 30B.A3B Q4_0 2 pp512 89.16 106.55 1.20
P40 qwen3moe 30B.A3B Q4_0 3 pp512 100.71 123.03 1.22
P40 qwen3moe 30B.A3B Q4_0 4 pp512 105.71 130.71 1.24
P40 qwen3moe 30B.A3B Q4_1 1 pp512 76.72 84.86 1.11
P40 qwen3moe 30B.A3B Q4_1 2 pp512 93.39 106.73 1.14
P40 qwen3moe 30B.A3B Q4_1 3 pp512 106.70 125.35 1.17
P40 qwen3moe 30B.A3B Q4_1 4 pp512 110.77 132.10 1.19
P40 qwen3moe 30B.A3B Q4_K_S 1 pp512 72.98 78.11 1.07
P40 qwen3moe 30B.A3B Q4_K_S 2 pp512 88.19 97.88 1.11
P40 qwen3moe 30B.A3B Q4_K_S 3 pp512 97.96 110.96 1.13
P40 qwen3moe 30B.A3B Q4_K_S 4 pp512 102.94 117.92 1.15
P40 qwen3moe 30B.A3B Q5_0 1 pp512 64.88 69.30 1.07
P40 qwen3moe 30B.A3B Q5_0 2 pp512 87.31 96.85 1.11
P40 qwen3moe 30B.A3B Q5_0 3 pp512 99.77 112.66 1.13
P40 qwen3moe 30B.A3B Q5_0 4 pp512 105.02 119.03 1.13
P40 qwen3moe 30B.A3B Q5_1 1 pp512 70.22 75.87 1.08
P40 qwen3moe 30B.A3B Q5_1 2 pp512 87.94 98.49 1.12
P40 qwen3moe 30B.A3B Q5_1 3 pp512 100.20 114.40 1.14
P40 qwen3moe 30B.A3B Q5_1 4 pp512 105.59 122.58 1.16
P40 qwen3moe 30B.A3B Q5_K_S 1 pp512 68.96 72.10 1.05
P40 qwen3moe 30B.A3B Q5_K_S 2 pp512 84.64 91.48 1.08
P40 qwen3moe 30B.A3B Q5_K_S 3 pp512 95.10 104.07 1.09
P40 qwen3moe 30B.A3B Q5_K_S 4 pp512 98.49 108.79 1.10
P40 qwen3moe 30B.A3B Q6_K 1 pp512 58.24 64.91 1.11
P40 qwen3moe 30B.A3B Q6_K 2 pp512 71.46 82.72 1.16
P40 qwen3moe 30B.A3B Q6_K 3 pp512 80.03 95.21 1.19
P40 qwen3moe 30B.A3B Q6_K 4 pp512 84.41 101.49 1.20
Radeon 8060S Graphics qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 1 pp512 108.46 108.74 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 2 pp512 148.12 148.53 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 3 pp512 178.58 178.59 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 4 pp512 203.17 202.89 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ2_S - 2.5 bpw 1 pp512 74.35 75.00 1.01
Radeon 8060S Graphics qwen3moe 30B.A3B IQ2_S - 2.5 bpw 2 pp512 97.01 97.99 1.01
Radeon 8060S Graphics qwen3moe 30B.A3B IQ2_S - 2.5 bpw 3 pp512 111.42 112.72 1.01
Radeon 8060S Graphics qwen3moe 30B.A3B IQ2_S - 2.5 bpw 4 pp512 120.27 121.49 1.01
Radeon 8060S Graphics qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 1 pp512 76.90 76.62 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 2 pp512 99.77 99.58 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 3 pp512 114.49 114.46 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 4 pp512 123.62 123.66 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 1 pp512 77.58 77.49 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 2 pp512 101.56 101.46 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 3 pp512 116.47 116.46 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 4 pp512 127.01 127.04 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 1 pp512 74.55 73.68 0.99
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 2 pp512 96.43 95.48 0.99
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 3 pp512 108.99 108.01 0.99
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 4 pp512 118.01 117.48 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 1 pp512 75.76 75.80 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 2 pp512 95.53 95.55 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 3 pp512 107.47 107.58 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 4 pp512 116.24 116.42 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 1 pp512 73.48 73.34 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 2 pp512 96.77 96.90 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 3 pp512 109.72 109.53 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 4 pp512 119.05 119.10 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 1 pp512 73.71 73.42 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 2 pp512 97.75 97.71 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 3 pp512 111.09 110.81 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 4 pp512 119.52 119.70 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 1 pp512 83.19 83.21 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 2 pp512 113.93 114.14 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 3 pp512 139.18 139.10 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 4 pp512 163.77 163.73 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 1 pp512 87.52 87.65 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 2 pp512 120.41 120.44 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 3 pp512 146.51 147.06 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 4 pp512 172.58 173.06 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q2_K_M 1 pp512 91.54 91.32 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q2_K_M 2 pp512 116.76 116.18 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q2_K_M 3 pp512 134.08 133.40 0.99
Radeon 8060S Graphics qwen3moe 30B.A3B Q2_K_M 4 pp512 146.83 146.62 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q3_K_S 1 pp512 83.20 83.00 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q3_K_S 2 pp512 106.28 106.35 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q3_K_S 3 pp512 120.01 120.03 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q3_K_S 4 pp512 131.62 131.49 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q4_0 1 pp512 84.29 84.25 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q4_0 2 pp512 118.77 118.14 0.99
Radeon 8060S Graphics qwen3moe 30B.A3B Q4_0 3 pp512 147.35 146.11 0.99
Radeon 8060S Graphics qwen3moe 30B.A3B Q4_0 4 pp512 173.72 172.29 0.99
Radeon 8060S Graphics qwen3moe 30B.A3B Q4_1 1 pp512 79.19 77.80 0.98
Radeon 8060S Graphics qwen3moe 30B.A3B Q4_1 2 pp512 112.43 108.71 0.97
Radeon 8060S Graphics qwen3moe 30B.A3B Q4_1 3 pp512 140.56 135.21 0.96
Radeon 8060S Graphics qwen3moe 30B.A3B Q4_1 4 pp512 164.42 159.12 0.97
Radeon 8060S Graphics qwen3moe 30B.A3B Q4_K_S 1 pp512 77.44 75.66 0.98
Radeon 8060S Graphics qwen3moe 30B.A3B Q4_K_S 2 pp512 99.38 96.75 0.97
Radeon 8060S Graphics qwen3moe 30B.A3B Q4_K_S 3 pp512 115.26 112.31 0.97
Radeon 8060S Graphics qwen3moe 30B.A3B Q4_K_S 4 pp512 128.98 125.53 0.97
Radeon 8060S Graphics qwen3moe 30B.A3B Q5_0 1 pp512 74.82 74.67 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q5_0 2 pp512 103.56 103.25 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q5_0 3 pp512 127.55 127.10 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q5_0 4 pp512 147.81 147.16 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q5_1 1 pp512 68.87 68.85 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q5_1 2 pp512 96.44 96.39 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q5_1 3 pp512 119.62 119.43 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q5_1 4 pp512 139.10 138.91 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q5_K_S 1 pp512 69.05 69.32 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q5_K_S 2 pp512 90.81 90.99 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q5_K_S 3 pp512 105.92 105.91 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q5_K_S 4 pp512 118.80 118.91 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q6_K 1 pp512 63.25 63.35 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q6_K 2 pp512 85.68 85.64 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q6_K 3 pp512 101.84 101.57 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q6_K 4 pp512 114.62 114.55 1.00
Radeon 8060S Graphics qwen3moe 30B.A3B Q8_0 1 pp512 55.25 56.02 1.01
Radeon 8060S Graphics qwen3moe 30B.A3B Q8_0 2 pp512 79.07 80.32 1.02
Radeon 8060S Graphics qwen3moe 30B.A3B Q8_0 3 pp512 96.36 98.16 1.02
Radeon 8060S Graphics qwen3moe 30B.A3B Q8_0 4 pp512 111.30 113.76 1.02
RTX 3090 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 1 pp512 189.53 213.21 1.12
RTX 3090 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 2 pp512 244.29 300.16 1.23
RTX 3090 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 3 pp512 289.30 358.46 1.24
RTX 3090 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 4 pp512 307.82 392.24 1.27
RTX 3090 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 1 pp512 171.12 173.58 1.01
RTX 3090 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 2 pp512 214.47 236.03 1.10
RTX 3090 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 3 pp512 247.09 283.70 1.15
RTX 3090 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 4 pp512 263.75 306.44 1.16
RTX 3090 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 1 pp512 173.64 179.18 1.03
RTX 3090 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 2 pp512 217.46 242.50 1.12
RTX 3090 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 3 pp512 248.86 289.42 1.16
RTX 3090 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 4 pp512 270.21 317.24 1.17
RTX 3090 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 1 pp512 174.00 189.17 1.09
RTX 3090 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 2 pp512 222.10 260.50 1.17
RTX 3090 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 3 pp512 257.90 312.07 1.21
RTX 3090 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 4 pp512 275.03 340.99 1.24
RTX 3090 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 1 pp512 161.93 153.79 0.95
RTX 3090 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 2 pp512 199.97 212.66 1.06
RTX 3090 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 3 pp512 228.65 250.35 1.09
RTX 3090 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 4 pp512 244.13 271.21 1.11
RTX 3090 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 1 pp512 163.69 155.30 0.95
RTX 3090 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 2 pp512 201.22 214.86 1.07
RTX 3090 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 3 pp512 230.11 252.86 1.10
RTX 3090 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 4 pp512 246.45 274.56 1.11
RTX 3090 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 1 pp512 164.90 160.88 0.98
RTX 3090 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 2 pp512 204.87 218.98 1.07
RTX 3090 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 3 pp512 233.90 258.67 1.11
RTX 3090 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 4 pp512 250.79 281.64 1.12
RTX 3090 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 1 pp512 164.48 165.69 1.01
RTX 3090 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 2 pp512 204.62 228.67 1.12
RTX 3090 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 3 pp512 235.11 272.11 1.16
RTX 3090 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 4 pp512 252.28 297.18 1.18
RTX 3090 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 1 pp512 187.90 201.31 1.07
RTX 3090 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 2 pp512 241.19 263.95 1.09
RTX 3090 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 3 pp512 280.84 314.27 1.12
RTX 3090 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 4 pp512 299.98 339.79 1.13
RTX 3090 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 1 pp512 187.45 197.12 1.05
RTX 3090 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 2 pp512 236.58 266.82 1.13
RTX 3090 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 3 pp512 276.99 318.43 1.15
RTX 3090 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 4 pp512 293.12 345.99 1.18
RTX 3090 qwen3moe 30B.A3B Q2_K_M 1 pp512 158.59 169.98 1.07
RTX 3090 qwen3moe 30B.A3B Q2_K_M 2 pp512 197.48 215.83 1.09
RTX 3090 qwen3moe 30B.A3B Q2_K_M 3 pp512 222.96 246.80 1.11
RTX 3090 qwen3moe 30B.A3B Q2_K_M 4 pp512 236.54 261.46 1.11
RTX 3090 qwen3moe 30B.A3B Q3_K_S 1 pp512 135.48 143.80 1.06
RTX 3090 qwen3moe 30B.A3B Q3_K_S 2 pp512 168.75 180.50 1.07
RTX 3090 qwen3moe 30B.A3B Q3_K_S 3 pp512 187.56 203.20 1.08
RTX 3090 qwen3moe 30B.A3B Q3_K_S 4 pp512 199.89 217.29 1.09
RTX 3090 qwen3moe 30B.A3B Q4_0 1 pp512 195.13 212.76 1.09
RTX 3090 qwen3moe 30B.A3B Q4_0 2 pp512 256.75 284.70 1.11
RTX 3090 qwen3moe 30B.A3B Q4_0 3 pp512 303.51 345.00 1.14
RTX 3090 qwen3moe 30B.A3B Q4_0 4 pp512 321.95 372.90 1.16
RTX 3090 qwen3moe 30B.A3B Q4_1 1 pp512 195.97 205.27 1.05
RTX 3090 qwen3moe 30B.A3B Q4_1 2 pp512 256.50 275.14 1.07
RTX 3090 qwen3moe 30B.A3B Q4_1 3 pp512 307.57 338.74 1.10
RTX 3090 qwen3moe 30B.A3B Q4_1 4 pp512 329.26 369.48 1.12
RTX 3090 qwen3moe 30B.A3B Q4_K_S 1 pp512 183.42 195.24 1.06
RTX 3090 qwen3moe 30B.A3B Q4_K_S 2 pp512 230.32 251.27 1.09
RTX 3090 qwen3moe 30B.A3B Q4_K_S 3 pp512 264.87 296.14 1.12
RTX 3090 qwen3moe 30B.A3B Q4_K_S 4 pp512 281.64 316.24 1.12
RTX 3090 qwen3moe 30B.A3B Q5_0 1 pp512 179.32 186.92 1.04
RTX 3090 qwen3moe 30B.A3B Q5_0 2 pp512 228.03 242.95 1.07
RTX 3090 qwen3moe 30B.A3B Q5_0 3 pp512 269.28 291.09 1.08
RTX 3090 qwen3moe 30B.A3B Q5_0 4 pp512 288.78 312.11 1.08
RTX 3090 qwen3moe 30B.A3B Q5_1 1 pp512 176.02 184.50 1.05
RTX 3090 qwen3moe 30B.A3B Q5_1 2 pp512 228.45 243.46 1.07
RTX 3090 qwen3moe 30B.A3B Q5_1 3 pp512 270.89 296.03 1.09
RTX 3090 qwen3moe 30B.A3B Q5_1 4 pp512 290.62 320.12 1.10
RTX 3090 qwen3moe 30B.A3B Q5_K_S 1 pp512 171.62 181.38 1.06
RTX 3090 qwen3moe 30B.A3B Q5_K_S 2 pp512 213.79 228.32 1.07
RTX 3090 qwen3moe 30B.A3B Q5_K_S 3 pp512 248.31 267.21 1.08
RTX 3090 qwen3moe 30B.A3B Q5_K_S 4 pp512 264.46 287.63 1.09
RTX 3090 qwen3moe 30B.A3B Q6_K 1 pp512 145.94 160.96 1.10
RTX 3090 qwen3moe 30B.A3B Q6_K 2 pp512 185.11 204.42 1.10
RTX 3090 qwen3moe 30B.A3B Q6_K 3 pp512 210.21 235.29 1.12
RTX 3090 qwen3moe 30B.A3B Q6_K 4 pp512 221.56 252.56 1.14
RTX 4090 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 1 pp512 275.62 296.28 1.07
RTX 4090 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 2 pp512 374.01 436.96 1.17
RTX 4090 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 3 pp512 478.13 567.08 1.19
RTX 4090 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 4 pp512 540.34 658.31 1.22
RTX 4090 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 1 pp512 261.33 262.85 1.01
RTX 4090 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 2 pp512 353.78 376.75 1.06
RTX 4090 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 3 pp512 442.49 489.95 1.11
RTX 4090 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 4 pp512 496.61 557.23 1.12
RTX 4090 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 1 pp512 262.60 266.79 1.02
RTX 4090 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 2 pp512 355.30 385.26 1.08
RTX 4090 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 3 pp512 446.19 501.83 1.12
RTX 4090 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 4 pp512 500.68 571.30 1.14
RTX 4090 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 1 pp512 268.52 278.23 1.04
RTX 4090 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 2 pp512 361.60 400.15 1.11
RTX 4090 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 3 pp512 458.39 531.00 1.16
RTX 4090 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 4 pp512 505.78 604.65 1.20
RTX 4090 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 1 pp512 250.82 236.55 0.94
RTX 4090 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 2 pp512 336.82 336.44 1.00
RTX 4090 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 3 pp512 421.28 428.05 1.02
RTX 4090 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 4 pp512 470.61 485.89 1.03
RTX 4090 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 1 pp512 249.33 236.80 0.95
RTX 4090 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 2 pp512 337.83 337.69 1.00
RTX 4090 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 3 pp512 426.12 433.69 1.02
RTX 4090 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 4 pp512 476.22 489.83 1.03
RTX 4090 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 1 pp512 252.01 244.67 0.97
RTX 4090 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 2 pp512 344.06 346.08 1.01
RTX 4090 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 3 pp512 429.39 446.41 1.04
RTX 4090 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 4 pp512 478.61 506.43 1.06
RTX 4090 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 1 pp512 249.26 248.34 1.00
RTX 4090 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 2 pp512 340.82 358.74 1.05
RTX 4090 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 3 pp512 425.90 465.95 1.09
RTX 4090 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 4 pp512 479.76 531.79 1.11
RTX 4090 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 1 pp512 249.43 251.70 1.01
RTX 4090 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 2 pp512 351.69 363.74 1.03
RTX 4090 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 3 pp512 456.52 482.46 1.06
RTX 4090 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 4 pp512 518.89 560.41 1.08
RTX 4090 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 1 pp512 249.47 250.00 1.00
RTX 4090 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 2 pp512 351.88 365.77 1.04
RTX 4090 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 3 pp512 454.37 482.81 1.06
RTX 4090 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 4 pp512 516.34 556.45 1.08
RTX 4090 qwen3moe 30B.A3B Q2_K_M 1 pp512 255.89 265.51 1.04
RTX 4090 qwen3moe 30B.A3B Q2_K_M 2 pp512 349.65 368.14 1.05
RTX 4090 qwen3moe 30B.A3B Q2_K_M 3 pp512 435.84 465.30 1.07
RTX 4090 qwen3moe 30B.A3B Q2_K_M 4 pp512 483.53 519.95 1.08
RTX 4090 qwen3moe 30B.A3B Q3_K_S 1 pp512 237.38 246.43 1.04
RTX 4090 qwen3moe 30B.A3B Q3_K_S 2 pp512 320.04 335.53 1.05
RTX 4090 qwen3moe 30B.A3B Q3_K_S 3 pp512 387.40 412.58 1.06
RTX 4090 qwen3moe 30B.A3B Q3_K_S 4 pp512 426.60 451.53 1.06
RTX 4090 qwen3moe 30B.A3B Q4_0 1 pp512 246.19 253.70 1.03
RTX 4090 qwen3moe 30B.A3B Q4_0 2 pp512 351.35 368.45 1.05
RTX 4090 qwen3moe 30B.A3B Q4_0 3 pp512 454.28 488.26 1.07
RTX 4090 qwen3moe 30B.A3B Q4_0 4 pp512 519.65 566.31 1.09
RTX 4090 qwen3moe 30B.A3B Q4_1 1 pp512 242.45 241.99 1.00
RTX 4090 qwen3moe 30B.A3B Q4_1 2 pp512 346.79 354.83 1.02
RTX 4090 qwen3moe 30B.A3B Q4_1 3 pp512 448.82 473.76 1.06
RTX 4090 qwen3moe 30B.A3B Q4_1 4 pp512 516.61 554.30 1.07
RTX 4090 qwen3moe 30B.A3B Q4_K_S 1 pp512 248.47 250.73 1.01
RTX 4090 qwen3moe 30B.A3B Q4_K_S 2 pp512 347.63 360.06 1.04
RTX 4090 qwen3moe 30B.A3B Q4_K_S 3 pp512 451.47 474.59 1.05
RTX 4090 qwen3moe 30B.A3B Q4_K_S 4 pp512 511.42 538.10 1.05
RTX 4090 qwen3moe 30B.A3B Q5_0 1 pp512 232.24 231.88 1.00
RTX 4090 qwen3moe 30B.A3B Q5_0 2 pp512 330.20 335.94 1.02
RTX 4090 qwen3moe 30B.A3B Q5_0 3 pp512 427.35 440.67 1.03
RTX 4090 qwen3moe 30B.A3B Q5_0 4 pp512 491.22 511.40 1.04
RTX 4090 qwen3moe 30B.A3B Q5_1 1 pp512 223.72 222.66 1.00
RTX 4090 qwen3moe 30B.A3B Q5_1 2 pp512 319.12 323.92 1.02
RTX 4090 qwen3moe 30B.A3B Q5_1 3 pp512 412.22 429.70 1.04
RTX 4090 qwen3moe 30B.A3B Q5_1 4 pp512 477.92 504.43 1.06
RTX 4090 qwen3moe 30B.A3B Q5_K_S 1 pp512 230.24 231.59 1.01
RTX 4090 qwen3moe 30B.A3B Q5_K_S 2 pp512 324.08 328.89 1.01
RTX 4090 qwen3moe 30B.A3B Q5_K_S 3 pp512 417.06 432.65 1.04
RTX 4090 qwen3moe 30B.A3B Q5_K_S 4 pp512 476.98 496.50 1.04
RTX 5090 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 1 pp512 292.76 321.06 1.10
RTX 5090 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 2 pp512 358.68 418.36 1.17
RTX 5090 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 3 pp512 465.03 559.90 1.20
RTX 5090 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 4 pp512 528.75 656.82 1.24
RTX 5090 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 1 pp512 299.32 298.47 1.00
RTX 5090 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 2 pp512 363.61 381.60 1.05
RTX 5090 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 3 pp512 473.57 510.63 1.08
RTX 5090 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 4 pp512 534.68 596.05 1.11
RTX 5090 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 1 pp512 304.93 304.23 1.00
RTX 5090 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 2 pp512 371.36 390.49 1.05
RTX 5090 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 3 pp512 481.55 523.45 1.09
RTX 5090 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 4 pp512 540.47 604.90 1.12
RTX 5090 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 1 pp512 303.53 306.78 1.01
RTX 5090 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 2 pp512 376.67 405.58 1.08
RTX 5090 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 3 pp512 488.87 538.14 1.10
RTX 5090 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 4 pp512 550.05 621.16 1.13
RTX 5090 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 1 pp512 284.00 281.43 0.99
RTX 5090 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 3 pp512 437.69 478.44 1.09
RTX 5090 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 4 pp512 492.00 551.27 1.12
RTX 5090 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 1 pp512 282.59 282.37 1.00
RTX 5090 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 2 pp512 342.50 364.30 1.06
RTX 5090 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 3 pp512 440.37 483.70 1.10
RTX 5090 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 4 pp512 492.99 555.35 1.13
RTX 5090 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 1 pp512 287.75 284.17 0.99
RTX 5090 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 2 pp512 350.33 365.44 1.04
RTX 5090 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 3 pp512 449.62 485.92 1.08
RTX 5090 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 4 pp512 504.92 556.57 1.10
RTX 5090 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 1 pp512 291.71 288.21 0.99
RTX 5090 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 2 pp512 358.55 369.23 1.03
RTX 5090 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 3 pp512 462.43 493.33 1.07
RTX 5090 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 4 pp512 514.64 564.51 1.10
RTX 5090 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 1 pp512 300.45 315.84 1.05
RTX 5090 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 2 pp512 372.28 399.38 1.07
RTX 5090 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 3 pp512 489.67 536.95 1.10
RTX 5090 qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 4 pp512 556.85 623.00 1.12
RTX 5090 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 1 pp512 303.22 309.75 1.02
RTX 5090 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 2 pp512 375.25 398.29 1.06
RTX 5090 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 3 pp512 488.91 538.03 1.10
RTX 5090 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 4 pp512 557.56 626.30 1.12
RTX 5090 qwen3moe 30B.A3B Q2_K_M 1 pp512 296.08 306.54 1.04
RTX 5090 qwen3moe 30B.A3B Q2_K_M 2 pp512 358.71 381.08 1.06
RTX 5090 qwen3moe 30B.A3B Q2_K_M 3 pp512 460.17 497.71 1.08
RTX 5090 qwen3moe 30B.A3B Q2_K_M 4 pp512 516.42 563.74 1.09
RTX 5090 qwen3moe 30B.A3B Q3_K_S 1 pp512 285.90 297.23 1.04
RTX 5090 qwen3moe 30B.A3B Q3_K_S 2 pp512 345.37 365.85 1.06
RTX 5090 qwen3moe 30B.A3B Q3_K_S 3 pp512 438.77 472.66 1.08
RTX 5090 qwen3moe 30B.A3B Q3_K_S 4 pp512 491.52 534.97 1.09
RTX 5090 qwen3moe 30B.A3B Q4_0 1 pp512 304.33 315.88 1.04
RTX 5090 qwen3moe 30B.A3B Q4_0 2 pp512 380.73 405.44 1.06
RTX 5090 qwen3moe 30B.A3B Q4_0 3 pp512 499.66 549.19 1.10
RTX 5090 qwen3moe 30B.A3B Q4_0 4 pp512 571.26 637.18 1.12
RTX 5090 qwen3moe 30B.A3B Q4_1 1 pp512 292.45 305.65 1.05
RTX 5090 qwen3moe 30B.A3B Q4_1 2 pp512 365.35 393.42 1.08
RTX 5090 qwen3moe 30B.A3B Q4_1 3 pp512 481.88 533.45 1.11
RTX 5090 qwen3moe 30B.A3B Q4_1 4 pp512 547.87 616.73 1.13
RTX 5090 qwen3moe 30B.A3B Q4_K_S 1 pp512 293.32 309.13 1.05
RTX 5090 qwen3moe 30B.A3B Q4_K_S 2 pp512 363.53 392.72 1.08
RTX 5090 qwen3moe 30B.A3B Q4_K_S 3 pp512 473.17 522.43 1.10
RTX 5090 qwen3moe 30B.A3B Q4_K_S 4 pp512 520.44 583.45 1.12
RTX 5090 qwen3moe 30B.A3B Q5_0 1 pp512 289.18 295.60 1.02
RTX 5090 qwen3moe 30B.A3B Q5_0 2 pp512 364.73 383.63 1.05
RTX 5090 qwen3moe 30B.A3B Q5_0 3 pp512 480.35 516.30 1.07
RTX 5090 qwen3moe 30B.A3B Q5_0 4 pp512 547.69 601.14 1.10
RTX 5090 qwen3moe 30B.A3B Q5_1 1 pp512 283.34 288.91 1.02
RTX 5090 qwen3moe 30B.A3B Q5_1 2 pp512 361.44 375.99 1.04
RTX 5090 qwen3moe 30B.A3B Q5_1 3 pp512 475.28 508.29 1.07
RTX 5090 qwen3moe 30B.A3B Q5_1 4 pp512 543.60 594.50 1.09
RTX 5090 qwen3moe 30B.A3B Q5_K_S 1 pp512 283.90 290.96 1.02
RTX 5090 qwen3moe 30B.A3B Q5_K_S 2 pp512 360.62 377.53 1.05
RTX 5090 qwen3moe 30B.A3B Q5_K_S 3 pp512 470.84 503.30 1.07
RTX 5090 qwen3moe 30B.A3B Q5_K_S 4 pp512 523.41 566.43 1.08
RTX 5090 qwen3moe 30B.A3B Q6_K 1 pp512 268.47 274.46 1.02
RTX 5090 qwen3moe 30B.A3B Q6_K 2 pp512 338.65 353.35 1.04
RTX 5090 qwen3moe 30B.A3B Q6_K 3 pp512 440.05 468.04 1.06
RTX 5090 qwen3moe 30B.A3B Q6_K 4 pp512 495.19 534.16 1.08
RTX 5090 qwen3moe 30B.A3B Q8_0 1 pp512 249.68 251.10 1.01
RTX 5090 qwen3moe 30B.A3B Q8_0 2 pp512 321.88 331.43 1.03
RTX 5090 qwen3moe 30B.A3B Q8_0 3 pp512 416.73 438.63 1.05
RTX 5090 qwen3moe 30B.A3B Q8_0 4 pp512 474.39 508.06 1.07
RX 6800 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 1 pp512 72.03 71.94 1.00
RX 6800 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 2 pp512 96.51 96.55 1.00
RX 6800 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 3 pp512 117.96 117.91 1.00
RX 6800 qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 4 pp512 137.26 137.24 1.00
RX 6800 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 1 pp512 57.38 57.52 1.00
RX 6800 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 2 pp512 76.47 76.72 1.00
RX 6800 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 3 pp512 92.72 92.84 1.00
RX 6800 qwen3moe 30B.A3B IQ2_S - 2.5 bpw 4 pp512 105.21 105.33 1.00
RX 6800 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 1 pp512 58.56 58.38 1.00
RX 6800 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 2 pp512 78.11 78.21 1.00
RX 6800 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 3 pp512 94.64 94.65 1.00
RX 6800 qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 4 pp512 106.49 106.53 1.00
RX 6800 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 1 pp512 59.16 58.94 1.00
RX 6800 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 2 pp512 79.28 79.43 1.00
RX 6800 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 3 pp512 96.14 96.39 1.00
RX 6800 qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 4 pp512 108.82 108.82 1.00
RX 6800 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 1 pp512 56.63 56.76 1.00
RX 6800 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 2 pp512 75.86 76.07 1.00
RX 6800 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 3 pp512 91.61 91.48 1.00
RX 6800 qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 4 pp512 103.98 104.08 1.00
RX 6800 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 1 pp512 56.38 56.67 1.01
RX 6800 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 2 pp512 76.09 76.22 1.00
RX 6800 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 3 pp512 91.24 91.17 1.00
RX 6800 qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 4 pp512 103.21 103.22 1.00
RX 6800 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 1 pp512 56.77 56.76 1.00
RX 6800 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 2 pp512 76.18 76.08 1.00
RX 6800 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 3 pp512 91.77 91.86 1.00
RX 6800 qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 4 pp512 104.48 104.43 1.00
RX 6800 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 1 pp512 57.35 57.35 1.00
RX 6800 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 2 pp512 77.05 76.85 1.00
RX 6800 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 3 pp512 92.83 92.80 1.00
RX 6800 qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 4 pp512 105.06 104.86 1.00
RX 6800 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 1 pp512 78.01 77.96 1.00
RX 6800 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 2 pp512 110.51 110.88 1.00
RX 6800 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 3 pp512 140.28 140.18 1.00
RX 6800 qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 4 pp512 163.44 163.52 1.00
RX 6800 qwen3moe 30B.A3B Q2_K_M 1 pp512 62.89 62.88 1.00
RX 6800 qwen3moe 30B.A3B Q2_K_M 2 pp512 82.14 82.05 1.00
RX 6800 qwen3moe 30B.A3B Q2_K_M 3 pp512 96.11 96.12 1.00
RX 6800 qwen3moe 30B.A3B Q2_K_M 4 pp512 107.72 107.57 1.00
RX 6800 qwen3moe 30B.A3B Q3_K_S 1 pp512 55.87 55.99 1.00
RX 6800 qwen3moe 30B.A3B Q3_K_S 2 pp512 73.89 73.80 1.00
RX 6800 qwen3moe 30B.A3B Q3_K_S 3 pp512 83.14 83.14 1.00
RX 6800 qwen3moe 30B.A3B Q3_K_S 4 pp512 93.52 93.32 1.00
RX 9060 XT qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 1 pp512 79.47 79.16 1.00
RX 9060 XT qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 2 pp512 105.44 110.52 1.05
RX 9060 XT qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 3 pp512 149.83 153.22 1.02
RX 9060 XT qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 4 pp512 132.42 137.79 1.04
RX 9060 XT qwen3moe 30B.A3B IQ2_S - 2.5 bpw 1 pp512 59.94 59.58 0.99
RX 9060 XT qwen3moe 30B.A3B IQ2_S - 2.5 bpw 2 pp512 82.89 82.75 1.00
RX 9060 XT qwen3moe 30B.A3B IQ2_S - 2.5 bpw 3 pp512 112.28 112.38 1.00
RX 9060 XT qwen3moe 30B.A3B IQ2_S - 2.5 bpw 4 pp512 104.39 104.36 1.00
RX 9060 XT qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 1 pp512 61.40 60.85 0.99
RX 9060 XT qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 2 pp512 82.00 83.88 1.02
RX 9060 XT qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 3 pp512 106.56 106.00 0.99
RX 9060 XT qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 4 pp512 103.02 104.59 1.02
RX 9060 XT qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 1 pp512 61.41 60.80 0.99
RX 9060 XT qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 2 pp512 82.65 85.63 1.04
RX 9060 XT qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 3 pp512 109.97 112.07 1.02
RX 9060 XT qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 4 pp512 104.56 103.58 0.99
RX 9060 XT qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 1 pp512 60.03 59.23 0.99
RX 9060 XT qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 2 pp512 82.60 82.53 1.00
RX 9060 XT qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 3 pp512 109.07 109.04 1.00
RX 9060 XT qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 4 pp512 102.42 101.79 0.99
RX 9060 XT qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 1 pp512 62.78 63.23 1.01
RX 9060 XT qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 2 pp512 77.77 78.37 1.01
RX 9060 XT qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 3 pp512 100.33 102.14 1.02
RX 9060 XT qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 4 pp512 96.67 98.19 1.02
RX 9060 XT qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 1 pp512 59.67 59.77 1.00
RX 9060 XT qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 2 pp512 82.63 82.08 0.99
RX 9060 XT qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 3 pp512 106.82 107.69 1.01
RX 9060 XT qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 4 pp512 101.65 101.28 1.00
RX 9060 XT qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 1 pp512 59.23 59.70 1.01
RX 9060 XT qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 2 pp512 83.48 82.23 0.99
RX 9060 XT qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 3 pp512 109.68 109.29 1.00
RX 9060 XT qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 4 pp512 101.27 101.48 1.00
RX 9060 XT qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 1 pp512 61.35 52.64 0.86
RX 9060 XT qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 2 pp512 69.37 80.50 1.16
RX 9060 XT qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 3 pp512 88.32 107.77 1.22
RX 9060 XT qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 4 pp512 85.97 103.78 1.21
RX 9060 XT qwen3moe 30B.A3B Q2_K_M 1 pp512 64.47 57.64 0.89
RX 9060 XT qwen3moe 30B.A3B Q2_K_M 2 pp512 82.36 89.29 1.08
RX 9060 XT qwen3moe 30B.A3B Q2_K_M 3 pp512 108.10 119.22 1.10
RX 9060 XT qwen3moe 30B.A3B Q2_K_M 4 pp512 102.32 109.79 1.07
RX 9060 XT qwen3moe 30B.A3B Q3_K_S 1 pp512 64.15 63.46 0.99
RX 9060 XT qwen3moe 30B.A3B Q3_K_S 2 pp512 83.62 81.14 0.97
RX 9060 XT qwen3moe 30B.A3B Q3_K_S 3 pp512 105.94 104.98 0.99
RX 9060 XT qwen3moe 30B.A3B Q3_K_S 4 pp512 100.77 99.92 0.99
V100-PCIE-32GB qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 1 pp512 138.28 145.63 1.05
V100-PCIE-32GB qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 2 pp512 171.80 200.93 1.17
V100-PCIE-32GB qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 3 pp512 195.89 235.88 1.20
V100-PCIE-32GB qwen3moe 30B.A3B IQ1_S - 1.5625 bpw 4 pp512 215.38 263.87 1.23
V100-PCIE-32GB qwen3moe 30B.A3B IQ2_S - 2.5 bpw 1 pp512 125.40 120.29 0.96
V100-PCIE-32GB qwen3moe 30B.A3B IQ2_S - 2.5 bpw 2 pp512 159.34 166.95 1.05
V100-PCIE-32GB qwen3moe 30B.A3B IQ2_S - 2.5 bpw 3 pp512 176.35 193.09 1.09
V100-PCIE-32GB qwen3moe 30B.A3B IQ2_S - 2.5 bpw 4 pp512 195.73 214.21 1.09
V100-PCIE-32GB qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 1 pp512 127.83 124.77 0.98
V100-PCIE-32GB qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 2 pp512 160.69 169.52 1.05
V100-PCIE-32GB qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 3 pp512 182.50 198.53 1.09
V100-PCIE-32GB qwen3moe 30B.A3B IQ2_XS - 2.3125 bpw 4 pp512 199.78 218.32 1.09
V100-PCIE-32GB qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 1 pp512 132.82 135.59 1.02
V100-PCIE-32GB qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 2 pp512 164.19 185.61 1.13
V100-PCIE-32GB qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 3 pp512 186.76 216.04 1.16
V100-PCIE-32GB qwen3moe 30B.A3B IQ2_XXS - 2.0625 bpw 4 pp512 204.50 237.48 1.16
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 1 pp512 122.48 111.26 0.91
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 2 pp512 150.14 157.16 1.05
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 3 pp512 169.33 183.38 1.08
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_S - 3.4375 bpw 4 pp512 186.45 200.99 1.08
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 1 pp512 123.90 115.06 0.93
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 2 pp512 150.81 164.19 1.09
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 3 pp512 169.71 187.60 1.11
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_S mix - 3.66 bpw 4 pp512 186.09 207.85 1.12
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 1 pp512 124.15 113.22 0.91
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 2 pp512 152.50 154.15 1.01
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 3 pp512 171.37 182.01 1.06
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_XS - 3.3 bpw 4 pp512 188.00 201.67 1.07
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 1 pp512 123.09 114.06 0.93
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 2 pp512 154.33 162.78 1.05
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 3 pp512 177.34 188.56 1.06
V100-PCIE-32GB qwen3moe 30B.A3B IQ3_XXS - 3.0625 bpw 4 pp512 192.57 208.90 1.08
V100-PCIE-32GB qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 1 pp512 139.36 152.71 1.10
V100-PCIE-32GB qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 2 pp512 178.39 198.02 1.11
V100-PCIE-32GB qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 3 pp512 201.29 226.78 1.13
V100-PCIE-32GB qwen3moe 30B.A3B IQ4_NL - 4.5 bpw 4 pp512 220.58 253.69 1.15
V100-PCIE-32GB qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 1 pp512 137.28 136.01 0.99
V100-PCIE-32GB qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 2 pp512 175.50 184.55 1.05
V100-PCIE-32GB qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 3 pp512 197.67 214.10 1.08
V100-PCIE-32GB qwen3moe 30B.A3B IQ4_XS - 4.25 bpw 4 pp512 218.25 238.79 1.09
V100-PCIE-32GB qwen3moe 30B.A3B Q2_K_M 1 pp512 122.02 128.34 1.05
V100-PCIE-32GB qwen3moe 30B.A3B Q2_K_M 2 pp512 149.48 153.95 1.03
V100-PCIE-32GB qwen3moe 30B.A3B Q2_K_M 3 pp512 167.58 175.63 1.05
V100-PCIE-32GB qwen3moe 30B.A3B Q2_K_M 4 pp512 181.20 191.64 1.06
V100-PCIE-32GB qwen3moe 30B.A3B Q3_K_S 1 pp512 108.15 113.02 1.04
V100-PCIE-32GB qwen3moe 30B.A3B Q3_K_S 2 pp512 132.65 137.68 1.04
V100-PCIE-32GB qwen3moe 30B.A3B Q3_K_S 3 pp512 152.04 158.41 1.04
V100-PCIE-32GB qwen3moe 30B.A3B Q3_K_S 4 pp512 163.44 172.33 1.05
V100-PCIE-32GB qwen3moe 30B.A3B Q4_0 1 pp512 141.46 156.30 1.10
V100-PCIE-32GB qwen3moe 30B.A3B Q4_0 2 pp512 182.64 206.93 1.13
V100-PCIE-32GB qwen3moe 30B.A3B Q4_0 3 pp512 203.25 234.19 1.15
V100-PCIE-32GB qwen3moe 30B.A3B Q4_0 4 pp512 223.09 263.47 1.18
V100-PCIE-32GB qwen3moe 30B.A3B Q4_1 1 pp512 142.73 157.52 1.10
V100-PCIE-32GB qwen3moe 30B.A3B Q4_1 2 pp512 183.68 208.80 1.14
V100-PCIE-32GB qwen3moe 30B.A3B Q4_1 3 pp512 212.27 246.77 1.16
V100-PCIE-32GB qwen3moe 30B.A3B Q4_1 4 pp512 224.62 265.06 1.18
V100-PCIE-32GB qwen3moe 30B.A3B Q4_K_S 1 pp512 136.98 149.14 1.09
V100-PCIE-32GB qwen3moe 30B.A3B Q4_K_S 2 pp512 174.24 195.13 1.12
V100-PCIE-32GB qwen3moe 30B.A3B Q4_K_S 3 pp512 197.72 223.91 1.13
V100-PCIE-32GB qwen3moe 30B.A3B Q4_K_S 4 pp512 218.03 250.08 1.15
V100-PCIE-32GB qwen3moe 30B.A3B Q5_0 1 pp512 134.69 144.25 1.07
V100-PCIE-32GB qwen3moe 30B.A3B Q5_0 2 pp512 174.43 190.19 1.09
V100-PCIE-32GB qwen3moe 30B.A3B Q5_0 3 pp512 195.20 214.88 1.10
V100-PCIE-32GB qwen3moe 30B.A3B Q5_0 4 pp512 214.29 239.66 1.12
V100-PCIE-32GB qwen3moe 30B.A3B Q5_1 1 pp512 134.47 143.74 1.07
V100-PCIE-32GB qwen3moe 30B.A3B Q5_1 2 pp512 174.89 193.25 1.10
V100-PCIE-32GB qwen3moe 30B.A3B Q5_1 3 pp512 196.12 218.08 1.11
V100-PCIE-32GB qwen3moe 30B.A3B Q5_1 4 pp512 214.01 243.87 1.14
V100-PCIE-32GB qwen3moe 30B.A3B Q5_K_S 1 pp512 131.74 139.20 1.06
V100-PCIE-32GB qwen3moe 30B.A3B Q5_K_S 2 pp512 166.07 176.54 1.06
V100-PCIE-32GB qwen3moe 30B.A3B Q5_K_S 3 pp512 189.34 203.55 1.08
V100-PCIE-32GB qwen3moe 30B.A3B Q5_K_S 4 pp512 206.04 224.08 1.09
V100-PCIE-32GB qwen3moe 30B.A3B Q6_K 1 pp512 116.28 125.42 1.08
V100-PCIE-32GB qwen3moe 30B.A3B Q6_K 2 pp512 149.60 164.99 1.10
V100-PCIE-32GB qwen3moe 30B.A3B Q6_K 3 pp512 169.24 187.16 1.11
V100-PCIE-32GB qwen3moe 30B.A3B Q6_K 4 pp512 183.98 205.34 1.12
V100-PCIE-32GB qwen3moe 30B.A3B Q8_0 1 pp512 116.09 125.16 1.08
V100-PCIE-32GB qwen3moe 30B.A3B Q8_0 2 pp512 146.13 160.51 1.10
V100-PCIE-32GB qwen3moe 30B.A3B Q8_0 3 pp512 167.21 185.72 1.11
V100-PCIE-32GB qwen3moe 30B.A3B Q8_0 4 pp512 177.62 197.46 1.11

I found that batch sizes 2-4 in particular benefit from this change. Moreover, even with this change, the contribution of MMVQ to the overall compilation time and binary size is comparatively small. So my opinion is that we should re-enable it. Based on the data I collected, I think the logic for disabling it should be:

  • On NVIDIA Turing or newer disable if: batch size 1 and one of iq3_xxs, iq3_s. Otherwise disable if
  • batch size 1 and one of iq1_s, iq1_m, iq2_xxs, iq2_xs, iq2_s, iq3_xxs, iq3_s, iq4_xs. Or if
  • NVIDIA Pascal or older and one of iq3_s, q2_k, q3_k. Or if
  • AMD RDNA.
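
A minimal sketch of that proposed gating logic, assuming it would be expressed as a single predicate over architecture, batch size, and quantization type (the enum, type strings, and function name here are hypothetical, not llama.cpp's actual API):

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical sketch of the proposed MMVQ-disable heuristic described above.
enum class gpu_arch { nvidia_turing_or_newer, nvidia_pascal_or_older, amd_rdna, other };

bool mmvq_disabled(gpu_arch arch, int batch_size, const std::string & type) {
    static const std::set<std::string> bs1_turing = {"iq3_xxs", "iq3_s"};
    static const std::set<std::string> bs1_other  = {"iq1_s", "iq1_m", "iq2_xxs", "iq2_xs",
                                                     "iq2_s", "iq3_xxs", "iq3_s", "iq4_xs"};
    static const std::set<std::string> pascal_bad = {"iq3_s", "q2_k", "q3_k"};

    if (arch == gpu_arch::amd_rdna) {
        return true; // always disabled on AMD RDNA
    }
    if (arch == gpu_arch::nvidia_turing_or_newer) {
        // Turing or newer: only batch size 1 with iq3_xxs/iq3_s is disabled
        return batch_size == 1 && bs1_turing.count(type) > 0;
    }
    // older architectures: wider batch-size-1 blocklist ...
    if (batch_size == 1 && bs1_other.count(type) > 0) {
        return true;
    }
    // ... plus extra types on Pascal or older
    return arch == gpu_arch::nvidia_pascal_or_older && pascal_bad.count(type) > 0;
}
```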

@am17an
Copy link
Copy Markdown
Contributor

am17an commented Mar 22, 2026

I think it makes sense to separate ncols by template files. I have not enabled fusion for bs > 1 because of the same reason of compilation slowdown. I can refactor the code in a subsequent PR, and we can merge this PR as is for now. What do you think?
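
The template-file split described here could look roughly like the sketch below: a per-`ncols` kernel template declared in a shared header, with explicit instantiations placed in separate translation units so they compile in parallel instead of one huge file. The stub body and file layout are illustrative only, not llama.cpp's actual code:

```cpp
#include <cassert>

// --- would live in a shared header (e.g. a .cuh), declaring the template ---
template <int ncols_dst>
int mul_mat_vec_stub(int k) {
    // stand-in for the real kernel launch: just report work per block
    return ncols_dst * k;
}

// --- each explicit instantiation would live in its own .cu file, so the
//     compiler processes them as independent translation units ---
template int mul_mat_vec_stub<1>(int);
template int mul_mat_vec_stub<2>(int);
```

The payoff is that adding a new `ncols_dst` variant adds one small file to the build rather than lengthening the compile of a single monolithic unit.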

Copy link
Copy Markdown
Contributor

@JohannesGaessler JohannesGaessler left a comment

I'm fine with merging this PR as-is and making a follow-up PR.

@am17an am17an merged commit ccb87fa into ggml-org:master Mar 22, 2026
47 of 48 checks passed
@CISC
Copy link
Copy Markdown
Member

CISC commented Mar 22, 2026

@JohannesGaessler
Copy link
Copy Markdown
Contributor

In any case, this PR did not touch any FlashAttention code that would be causing the CI failure.

@IMbackK
Copy link
Copy Markdown
Collaborator

IMbackK commented Mar 22, 2026

The numbers don't make any sense either: the compiler would not choose to allocate 41 registers and spill 418, and 41 isn't an occupancy boundary, so that makes no sense. This must be a parsing failure in the script; I will take a look.

@gaugarg-nv gaugarg-nv deleted the small_k_optimization branch April 13, 2026 09:51
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
…imension is small (ggml-org#20635)

* Increase per-thread work if the K-dimension is small

With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MOEs. For example, Qwen3-30b-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536.
The current heuristic uses a group of 4 warps irrespective of K-dimension size, resulting in some of the threads being idle. This results in poor performance for these matrices.

This change increases the number of output elements per block for such cases.

* Limit this change to ncols_dst = 1

* tab to space
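
A toy illustration of the heuristic this commit message describes (not the actual kernel code): when K is small relative to a fixed 4-warp block, assign more output rows per block instead of leaving threads idle. The warp size, values-per-thread, and cap below are assumed values:

```cpp
#include <algorithm>
#include <cassert>

// Illustrative sketch of the PR's idea: with a fixed 4-warp group, a small
// K dimension (ncols_x) keeps only a fraction of the block's threads busy,
// so compute more output rows per thread block instead.
int rows_per_block(int ncols_x, int warp_size = 32, int vals_per_thread = 32) {
    // threads that can do useful work along the K dimension
    const int useful_threads = (ncols_x + vals_per_thread - 1) / vals_per_thread;
    const int block_threads  = 4 * warp_size; // the current 4-warp group

    // if K only keeps a fraction of the block busy, give the block more rows
    int rows = std::max(1, block_threads / std::max(useful_threads, 1));
    return std::min(rows, 8); // illustrative cap on rows per block
}
```

With these assumed constants, a K of 4096 keeps the whole block busy (1 row per block), while K = 768 as in Qwen3-30b-A3B would get several rows per block.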
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (issues specific to Nvidia GPUs)
