Optimization: Qwen3 next autoregressive pass #17996
Conversation
before: ggml_cuda_init: found 3 CUDA devices:
after: ggml_cuda_init: found 3 CUDA devices:
Nah, this should be a general optimization. This means there are other bottlenecks in play for the ROCm implementation besides the slow delta-net. Can you run inference with
That looks like a 10% bump, right?
@pwilkin Hopefully this log is what you need :)
CISC left a comment:
There's an excessive amount of conts and asserts here, most of which I'm sure are unnecessary, but I think qwen3next needs a general cleanup of these anyway, so will leave that to you at a later stage.
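To make the kind of redundancy being flagged concrete, here is an illustrative sketch (not taken from the actual diff; `a` and `b` are placeholders): when the producing op already yields a contiguous tensor, both the assert and the cont are pure overhead.

```cpp
// Illustrative only: ggml_mul_mat always produces a contiguous result,
// so the assert below can never fire and the cont just adds an extra
// graph node that copies data it doesn't need to.
ggml_tensor * t = ggml_mul_mat(ctx0, a, b);
GGML_ASSERT(ggml_is_contiguous(t)); // always true for mul_mat output
t = ggml_cont(ctx0, t);             // redundant: t is already contiguous
```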
@IIIIIllllIIIIIlllll can you do a bench for
@pwilkin In case you're wondering, I think the

@pwilkin
Adding some multi-GPU ROCm data with several experts offloaded to CPU:

Setup:
- CPU: Ryzen 9 3950x
- Model: Qwen3-Next-80B-A3B-Thinking-Q4_K_S
- Command: /llama.cpp/build/build/bin/llama-server --host 127.0.0.1 --jinja --min-p 0 --mlock --mmap -ncmoe 20 --port 44163 --repeat-penalty 1.05 --temp 0.5 --top-k 0.20 --top-p 0.95 --warmup --alias Qwen3-Next-80B-A3B-Thinking-Q4_K_S --ctx-size 75000 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --model /models/Qwen3-Next-80B-A3B-Thinking-Q4_K_S.gguf --n-gpu-layers 999 --threads 8 --tensor-split 67,33 --log-verbose

Results:
- ggml-org/main branch: 17.3 tokens/second
- pwilkin:lean_mean_token_machine branch: 22.5 tokens/second

Increase of >5 tokens/second, or ~30% faster token generation.
Some 4x V100 32GB results w/ q8_0 gguf (master vs. lean_mean_token_machine). Before: 38.39 t/s
I was feeling a bit bored and naively asked gemini-cli to make the changes CISC suggested; it seems consistently faster and it seems coherent (only did very brief testing). I do remember it breaking when it changed the sum_row conts, though, but I don't know if any of the rest are needed. cont/assert reduction: gain of 1.19 t/s over this commit (+2.67%), for a total gain of 7.4 t/s (+19.3%) over master. Patch file if you're interested: qwen3.patch
Nice little PP boost.
Worth a few percent on my system: The number of CONT ops for
Please ignore my previous reply. The test results there were run in a PuTTY terminal, and I don't know why they were so bad. It's really strange: changing -DGGML_HIP_ROCWMMA_FATTN to OFF significantly improved pp speed... Perhaps the AI MAX+ 395 has reached its performance limit (this is questionable).

this PR, -DGGML_HIP_ROCWMMA_FATTN=OFF:
this PR, -DGGML_HIP_ROCWMMA_FATTN=ON:
master, -DGGML_HIP_ROCWMMA_FATTN=OFF:
```cpp
// Choose between build_delta_net_chunking, build_delta_net_recurrent, and build_delta_net_autoregressive based on n_tokens
ggml_tensor * attn_out;
if (n_seq_tokens == 1) {
    attn_out = build_delta_net_autoregressive(q_conv, k_conv, v_conv, gate, beta, state, il);
} else if (n_seq_tokens > CHUNK_SIZE) {
    attn_out = build_delta_net_chunking(q_conv, k_conv, v_conv, gate, beta, state, causal_mask, identity, il);
} else {
    attn_out = build_delta_net_recurrent(q_conv, k_conv, v_conv, gate, beta, state, causal_mask, identity, il);
}
```
This is highly discouraged. Instead of adding more branches, we have to figure out how to make the graph static. Start with simplifying the existing graphs by removing redundant ops.
But in this case we can't make the graph static, since the special branch here is one where the decay mask computation doesn't happen (because n_seq_tokens == 1, it all collapses to trivial transformations, which can therefore be optimized out).
I can probably remove the recurrent part now, since I'm not sure there's a realistic case for it; it'll be either chunking or autoregressive.
Maybe a bit off-topic, but I had a quick look at the version on the master branch and it seems like some ggml_cont_* and ggml_transpose calls can potentially be redundant. I suspect something like this can be reduced further:
```cpp
ggml_tensor * k_cumdecay =
    ggml_cont(ctx0, ggml_transpose(ctx0, ggml_mul_mat(ctx0, attn, ggml_cont(ctx0, ggml_transpose(ctx0, kbeta_gexp)))));
```
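One possible reduction, as a sketch (assuming the batch dims of the two operands line up): ggml_mul_mat(a, b) computes AᵀB when tensors are read column-major, so transposing its result is the same as swapping the operands, and the mul_mat output is already contiguous. The outer cont/transpose pair then folds away:

```cpp
// (A^T B)^T == B^T A, so cont(transpose(mul_mat(a, b))) == mul_mat(b, a).
// Only the inner transpose of kbeta_gexp remains, and it may itself be
// removable if the backend accepts a non-contiguous operand here.
ggml_tensor * kbeta_gexp_t = ggml_cont(ctx0, ggml_transpose(ctx0, kbeta_gexp));
ggml_tensor * k_cumdecay   = ggml_mul_mat(ctx0, kbeta_gexp_t, attn);
```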
This trick in the contribution guide sometimes saved me a transpose:

> Otherwise, sometimes you can also use a non-contiguous tensor if the next ops accept it
Also, sometimes unsqueeze(-1) can be just a ggml_view, which costs almost nothing in terms of speed (see the sketch after this comment).
Edit: sometimes you can also transpose the weight when converting to GGUF, which makes it usable in the formula mentioned above.
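For the unsqueeze(-1) point above, a minimal sketch (x is a hypothetical contiguous 3D F32 tensor): ggml orders dimensions in reverse relative to torch, so appending a trailing torch dimension of size 1 means prepending ne0 = 1 here, which a view expresses for free:

```cpp
// torch-style x.unsqueeze(-1) as a zero-cost ggml view: same data, new
// shape [1, ne0, ne1, ne2]; the old strides are reused, nothing is copied.
ggml_tensor * x_unsq = ggml_view_4d(ctx0, x,
        1, x->ne[0], x->ne[1], x->ne[2],
        x->nb[0], x->nb[1], x->nb[2], 0);
```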
Is there any reason why it could have gotten slower for me? I'm compiling it with
Got an interesting finding on Win11 + RTX 5090: compiled with Vulkan support and forced to use the vulkan0 device, pp512 is up 60%+ and tg128 is up 100%+.

vulkan0:
build: c00ff92 (7389)

cuda0:
build: c00ff92 (7389)
Force-pushed from 4a494ab to b739b11.
Alright, I've done the final refactorings. I also removed the recurrent version of the delta_net in favor of the chunked version, since the use case for the recurrent one was very narrow (prompt processing with fewer than 64 tokens) and it didn't make sense to keep it just for that.

Final numbers for the IQ1_M quant on my box:

```
load_backend: loaded BLAS backend from /devel/tools/llama.cpp/build/bin/libggml-blas.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from /devel/tools/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /devel/tools/llama.cpp/build/bin/libggml-cpu-haswell.so
```
```diff
      chunk_size, causal_mask->ne[2], causal_mask->ne[3],
-     causal_mask->nb[1], causal_mask->nb[2], causal_mask->nb[3], 0);
+     causal_mask->nb[1], causal_mask->nb[2], causal_mask->nb[3], 0) :
+     ggml_tri(ctx0, ggml_fill_inplace(ctx0, ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, chunk_size, chunk_size), 1.0f),
```
ggml_new_tensor_2d should be avoided in general, especially inside loops. It creates new tensors, increasing the graph size and the compute buffers. Use it only for input tensors at the beginning of the graph.
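A sketch of the suggested pattern (names are illustrative, following the PR's causal_mask): create constant helpers once as graph inputs, fill them from the host, and hand out views per layer instead of materializing new tensors inside the build functions:

```cpp
// Created once at the start of graph construction, not per layer or loop
// iteration; the host fills it before evaluation and every layer reuses it.
ggml_tensor * causal_mask = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, CHUNK_SIZE, CHUNK_SIZE);
ggml_set_input(causal_mask);
```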
```cpp
ggml_tensor * chunked_mask =
    ggml_view_4d(ctx0, causal_mask, chunk_size,
        n_tokens >= chunk_size ?
```
Can we avoid these branches? The old version is more friendly towards keeping the graph topology static, so if it still works, it would be better to keep it.
As a comparison, fastllm on my machine, running Qwen3-Next-80B-A3B-Thinking-FP8 directly with 5060 Ti offload, keeps TG around 21 t/s at the beginning and drops to 13 t/s at around 45K context length.
@ggerganov aight, I think it's as clean as I can make it at this point.
* It's Qwen3 Next, the lean mean token generation machine!
* Apply patches from thread
* Remove recurrent version, only keep chunked and autoregressive
* Remove unnecessary conts and asserts
* Remove more extra conts and asserts
* Cleanup masking
This change adds a dedicated autoregressive version of delta-net which short-circuits all the recurrent computations for n_seq_tokens == 1. The end result is roughly a 40% bump in token generation speed.
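For context, a sketch of the single-step gated delta rule that this path specializes (notation as in the Gated DeltaNet formulation; the code's exact ordering of the gate $\alpha_t$ and mixing $\beta_t$ terms may differ):

$$
S_t = \alpha_t \, S_{t-1}\left(I - \beta_t \, k_t k_t^\top\right) + \beta_t \, v_t k_t^\top, \qquad o_t = S_t \, q_t
$$

With n_seq_tokens == 1 this is a single rank-1 state update plus one matrix-vector product, so none of the chunked decay-mask machinery needs to be built.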