
ggml-cuda: Repost of 21896: Blackwell native NVFP4 support #22196

Merged
am17an merged 23 commits into ggml-org:master from michaelw9999:nvfp4-blackwell
Apr 28, 2026
Conversation

@michaelw9999
Contributor

This is a restored clone of PR #21896, ggml-cuda: Blackwell native NVFP4 support.
Unfortunately, it was closed during a rebase error and cannot be reopened.
The exact commits are here as they were before. Sorry about this mixup!

michaelw9999 and others added 23 commits April 19, 2026 16:50
…ead of block_nvfp4, removed UE4M3 max cap check, merged use_native_mxfp4/nvfp4 into use_native_fp4, merged quantize_mmq_nvfp4/mxfp4/cuda to quantize_mmq_fp4_Cuda, merged mma/mxfp4/nvfp4 into one templated mma_block_scaled_fp4
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@michaelw9999 michaelw9999 requested review from a team and ggerganov as code owners April 21, 2026 04:33
@github-actions github-actions bot added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) Apr 21, 2026
@anskumar01

@am17an, @JohannesGaessler, the original PR #21896 was approved before it got closed by mistake. Can this PR be merged now in its current form while the design discussion for the remaining gaps continues, with those gaps implemented separately?

@am17an
Contributor

am17an commented Apr 23, 2026

I'm okay merging this; however, I respect @ORippler's (and, I guess, the Nvidia side's as a whole?) reservations. So we should come up with a plan to fix this before merging.

@JohannesGaessler
Contributor

@am17an sorry, which reservations are you talking about?

@am17an
Contributor

am17an commented Apr 25, 2026

This one #21896 (comment)

@michaelw9999
Contributor Author

michaelw9999 commented Apr 25, 2026

I do agree we would benefit from some fixes here with regard to the tensor-scale incorporation.
How, and with exactly what implementation, is still TBD @ORippler

Right now, with this PR, on Qwen3.5-4B:
Mean PPL(Q): 11.658689 ± 0.371901

Details
====== Perplexity statistics ======
Mean PPL(Q)                   :  11.658689 ±   0.371901
Mean PPL(base)                :  10.812440 ±   0.339678
Cor(ln(PPL(Q)), ln(PPL(base))):  98.55%
Mean ln(PPL(Q)/PPL(base))     :   0.075354 ±   0.005421
Mean PPL(Q)/PPL(base)         :   1.078266 ±   0.005845
Mean PPL(Q)-PPL(base)         :   0.846249 ±   0.068650

====== KL divergence statistics ======
Mean    KLD:   0.092041 ±   0.002901
Maximum KLD:  11.383351

With the input scale linked up via build_lora, it's 11.599.

When both the weight and input scales are factored directly into ggml_mul_mat (the exact spot doesn't seem to matter), it's down to:
Mean PPL(Q): 11.557739 ± 0.366391

Details
====== Perplexity statistics ======
Mean PPL(Q)                   :  11.557739 ±   0.366391
Mean PPL(base)                :  10.812440 ±   0.339678
Cor(ln(PPL(Q)), ln(PPL(base))):  98.51%
Mean ln(PPL(Q)/PPL(base))     :   0.066658 ±   0.005458
Mean PPL(Q)/PPL(base)         :   1.068930 ±   0.005834
Mean PPL(Q)-PPL(base)         :   0.745298 ±   0.066528

====== KL divergence statistics ======
Mean    KLD:   0.091930 ±   0.002889
Maximum KLD:  13.888078

So we have a 0.1 difference for this one particular model. The difference is not as large on the larger Nemotron 30B MoE. I just started playing around with the dense Qwen3.6-27B to experiment back to back and see if there are any differences.
On the current generic NVFP4 x Q8's 11.40, changing the weight-scale inclusion point gives 11.39, so that does not seem worth the effort for nvfp4_q8. Whether the larger 11.65 vs. 11.55 difference changes real-world model quality, I do not know, and I'm not sure how big the gap is on other models; I imagine it will be more significant for smaller models.
It also helped on an experimental all-NVFP4 x NVFP4 MMVQ kernel.
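
To make the experiment above concrete, here is a minimal sketch of what folding per-tensor scales in at graph-build time can look like. The helper name and the way the scales are passed in are hypothetical; this is not what the PR currently wires up.

```cpp
#include "ggml.h"

// Hypothetical helper: fold known per-tensor weight/activation scales into the
// result of a mul_mat at graph-build time. How the scales are obtained/stored
// is not shown here.
static ggml_tensor * mul_mat_with_tensor_scales(
        ggml_context * ctx,
        ggml_tensor  * w,        // NVFP4-quantized weight
        ggml_tensor  * x,        // activations
        float          w_scale,  // per-tensor weight scale
        float          a_scale)  // per-tensor activation scale
{
    ggml_tensor * y = ggml_mul_mat(ctx, w, x);
    // Because the scales are per-tensor constants, one multiply on the output is
    // mathematically equivalent to scaling either input beforehand.
    return ggml_scale(ctx, y, w_scale*a_scale);
}
```

The build_lora route mentioned earlier amounts to attaching the same kind of multiply at a different spot, which would be consistent with the observation that the exact placement doesn't change the numbers much.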

@michaelw9999 michaelw9999 marked this pull request as draft April 27, 2026 03:48
@ORippler
Collaborator

I'm okay merging this; however, I respect @ORippler's (and, I guess, the Nvidia side's as a whole?) reservations. So we should come up with a plan to fix this before merging.

Sorry for the radio silence. From our side, proceeding with the split responsibility (the ggml_cgraph constructor is responsible for multiplying per-tensor weight scales onto ops that consume NVFP4 tensors) is fine for the time being. The only part currently missing is the actual TC acceleration for BW GPUs, so it doesn't make sense to stop here. We will monitor NVFP4 and non-NVFP4 quants for a set of models we are interested in to ensure quality stays as expected within llama.cpp.

I do agree we would benefit from some fixes here with regard to the tensor-scale incorporation.
How, and with exactly what implementation, is still TBD @ORippler

Regarding optimizations for quantizing incoming activations from F32->A4 (both perf and quality-wise), we feel these can be addressed in separate follow-up PRs.
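
For readers following along, a rough sketch of what the F32 -> 4-bit activation quantization step looks like in principle, assuming 16-value blocks that share one E4M3 scale. The struct and kernel names are illustrative and do not correspond to the kernels in this PR; the per-tensor FP32 scale discussed above is intentionally left out.

```cuda
#include <cuda_fp16.h>
#include <cuda_fp8.h>
#include <cstdint>

// Illustrative block layout: 16 E2M1 values sharing one E4M3 scale.
struct block_fp4_sketch {
    __nv_fp8_storage_t d;      // per-block scale, E4M3
    uint8_t            qs[8];  // 16 x 4-bit E2M1 values, packed two per byte
};

// Round a (pre-scaled) float to the nearest signed E2M1 code.
__device__ static uint8_t f32_to_e2m1(float x) {
    const float lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    const uint8_t sign = x < 0.0f ? 0x8 : 0x0;
    x = fabsf(x);
    uint8_t best = 0;
    for (int i = 1; i < 8; ++i) {
        if (fabsf(x - lut[i]) < fabsf(x - lut[best])) best = i;
    }
    return sign | best;
}

// One thread quantizes one block of 16 activations.
__global__ void quantize_fp4_sketch(const float * x, block_fp4_sketch * y, int64_t n) {
    const int64_t ib = (int64_t) blockIdx.x*blockDim.x + threadIdx.x;
    if (ib*16 >= n) return;
    const float * xi = x + ib*16;

    float amax = 0.0f;
    for (int j = 0; j < 16; ++j) amax = fmaxf(amax, fabsf(xi[j]));

    // Map the block max onto the largest E2M1 value (6.0) and store the scale as E4M3.
    const __nv_fp8_storage_t ds = __nv_cvt_float_to_fp8(amax/6.0f, __NV_SATFINITE, __NV_E4M3);
    y[ib].d = ds;
    const float d  = __half2float(__nv_cvt_fp8_to_halfraw(ds, __NV_E4M3));
    const float id = d > 0.0f ? 1.0f/d : 0.0f;

    for (int j = 0; j < 8; ++j) {
        const uint8_t lo = f32_to_e2m1(xi[2*j + 0]*id);
        const uint8_t hi = f32_to_e2m1(xi[2*j + 1]*id);
        y[ib].qs[j] = lo | (hi << 4);
    }
}
```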

I will do another round of quality/perf evaluations on DGX Spark and get back to you once I have data available.

@am17an
Contributor

am17an commented Apr 28, 2026

@ORippler then let's merge this when tests are green

@michaelw9999 michaelw9999 marked this pull request as ready for review April 28, 2026 19:31
@am17an am17an merged commit fc2b005 into ggml-org:master Apr 28, 2026
46 of 52 checks passed
cnsiva pushed a commit to saas-home/llama.cpp that referenced this pull request Apr 29, 2026
@ORippler
Collaborator

I will do another round of quality/perf evaluations on DGX Spark and get back to you once I have data available.

FWIW, here are some numbers for Nemotron 3 Super 120B on Spark (NVFP4 ckpt from here, and Q4_K ckpt from here):

Quality:

Q4_K (W4Q8):
Final estimate: PPL = 4.6237 +/- 0.02802

NVFP4 master (W4Q8 Int fallback path):
Final estimate: PPL = 4.6283 +/- 0.02814

NVFP4 branch (W4A4 TC path):
Final estimate: PPL = 4.6577 +/- 0.02838

I see no issues with PPL for the fallback path, though quantizing activations to 4-bit undeniably hurts quality (this is in line with the Qwen3.6 analysis at https://www.reddit.com/r/LocalLLaMA/comments/1svq8lm/qwen3635ba3b_klds_ints_and_nvfps/?show=original).

Perf numbers (omitting Q4_K, as a lot of the NVFP4 checkpoint is in FP8, which our hf converter script converts to F32 instead of failing the conversion):

(base) osimons@spark-9c20:~/llama.cpp$ ./build_50494a/bin/llama-bench -m /gguf/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4_nvfp4.gguf -dio 1 -fa 1 -p 2048
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 122570 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 122570 MiB
| model                          |       size |     params | backend    | ngl | fa | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --: | --------------: | -------------------: |
| nemotron_h_moe 120B.A12B NVFP4 |  73.96 GiB |   120.67 B | CUDA       |  99 |  1 |   1 |          pp2048 |        506.21 ± 1.31 |
| nemotron_h_moe 120B.A12B NVFP4 |  73.96 GiB |   120.67 B | CUDA       |  99 |  1 |   1 |           tg128 |         12.01 ± 0.01 |

(base) osimons@spark-9c20:~/llama.cpp_nvfp4$ ./build_nvfp4/bin/llama-bench -m /gguf/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4_nvfp4.gguf -dio 1 -fa 1 -p 2048
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 122570 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 122570 MiB
| model                          |       size |     params | backend    | ngl | fa | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --: | --------------: | -------------------: |
| nemotron_h_moe 120B.A12B NVFP4 |  73.96 GiB |   120.67 B | CUDA       |  99 |  1 |   1 |          pp2048 |        611.30 ± 1.30 |
| nemotron_h_moe 120B.A12B NVFP4 |  73.96 GiB |   120.67 B | CUDA       |  99 |  1 |   1 |           tg128 |         11.91 ± 0.01 |

We will focus on quality and perf next, likely taking a look at the quantize kernel, as that does take time (~8% in some nsys traces we took in the past).

@michaelw9999
Contributor Author

FWIW, here are some numbers for Nemotron 3 Super 120B on Spark (NVFP4 ckpt from here, and Q4_K ckpt from here): [...]

@ORippler I will quantize this model with my modified llama-quantizer, which does more scale search, and try to upload it to hf if you want to compare. I have not tried to run models this large yet, as I only have a 5090/32 GB, so it may be difficult for me to run; on smaller models thus far, it has better PPL and KLD than those converted with the hf script.
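
In case it helps to picture what "more scale search" means, here is a minimal host-side sketch for a single 16-value block under the same E2M1 block assumptions as the kernel sketch above. The function names and the candidate sweep are invented for illustration and are not the actual quantizer code.

```cpp
#include <cmath>
#include <cstddef>

// Nearest representable signed E2M1 value for x.
static float nearest_e2m1(float x) {
    const float lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    const float ax = std::fabs(x);
    float best = lut[0];
    for (int i = 1; i < 8; ++i) {
        if (std::fabs(ax - lut[i]) < std::fabs(ax - best)) best = lut[i];
    }
    return x < 0.0f ? -best : best;
}

// Pick the block scale that minimizes round-trip squared error instead of
// taking the naive amax/6 scale directly (n == 16 for NVFP4-style blocks).
static float search_block_scale(const float * x, size_t n) {
    float amax = 0.0f;
    for (size_t j = 0; j < n; ++j) amax = std::fmax(amax, std::fabs(x[j]));
    if (amax == 0.0f) return 0.0f;

    float best_d   = amax/6.0f;
    float best_err = HUGE_VALF;
    for (int step = -4; step <= 4; ++step) {               // sweep candidates around amax/6
        const float d  = (amax/6.0f)*(1.0f + 0.05f*step);
        const float id = 1.0f/d;
        float err = 0.0f;
        for (size_t j = 0; j < n; ++j) {
            const float r = d*nearest_e2m1(x[j]*id) - x[j]; // reconstruction error
            err += r*r;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```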
