ggml-cuda: Repost of #21896: Blackwell native NVFP4 support #22196
am17an merged 23 commits into ggml-org:master from
Conversation
…ead of block_nvfp4, removed UE4M3 max cap check, merged use_native_mxfp4/nvfp4 into use_native_fp4, merged quantize_mmq_nvfp4/mxfp4/cuda to quantize_mmq_fp4_Cuda, merged mma/mxfp4/nvfp4 into one templated mma_block_scaled_fp4
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
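As context for the consolidation described in the commit message above (separate MXFP4/NVFP4 code paths merged behind single templated entry points), here is a hedged sketch of what that kind of templating can look like. The enum, trait, and function names below are illustrative only, not the actual ggml-cuda symbols.

```cuda
// Hedged sketch: one template parameterized on the FP4 flavor instead of
// separate MXFP4/NVFP4 entry points. Names are illustrative, not ggml-cuda API.

enum class fp4_flavor { MXFP4, NVFP4 };

template <fp4_flavor F> struct fp4_traits;

template <> struct fp4_traits<fp4_flavor::MXFP4> {
    static constexpr int block_size = 32;   // MXFP4: 32 values per block scale
};
template <> struct fp4_traits<fp4_flavor::NVFP4> {
    static constexpr int block_size = 16;   // NVFP4: 16 values per block scale
};

// One templated entry point instead of mma_mxfp4(...) and mma_nvfp4(...);
// only the block size and scale decoding differ per flavor.
template <fp4_flavor F>
__device__ void mma_block_scaled_fp4(/* tile fragments, scales, ... */) {
    constexpr int bs = fp4_traits<F>::block_size;
    (void) bs; // shared inner loop would go here
}
```

The upside of this design choice is a single code path to maintain and test, with the flavor-specific details confined to the traits.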
@am17an, @JohannesGaessler, the original PR #21896 was approved before it got closed by mistake. Can this PR be merged now in its current form, while the design discussion for the remaining gaps continues and the fixes are implemented separately?
I'm okay with merging this; however, I respect @ORippler's (and, I guess, the Nvidia side's as a whole?) reservations. So we should come up with a plan to fix this before merging.
@am17an sorry, which reservations are you talking about?
This one: #21896 (comment)
I do agree we would benefit from some fixes here with regard to the tensor-scale incorporation. Right now with this PR on Qwen3.5-4B:

With the input scale linked up via build_lora: [collapsed details]

When both the weight and input scale are factored directly in: [collapsed details]

So we have a 0.1 difference for this one particular model. It is not as large of a difference on the larger Nemotron 30B MoE. I just started playing around with Qwen3.6-27B dense to experiment back to back and see if there are any differences.
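To make the two placements being compared above concrete, here is a minimal host-side sketch, assuming the usual NVFP4-style decomposition value ≈ fp4_code × block_scale × tensor_scale. All names are illustrative, not the ggml API.

```cuda
// Hedged illustration of where the per-tensor scale can live.
// Option A keeps it as a separate multiply in the compute graph;
// Option B folds it into the per-block scales at quantization time.

#include <cstdint>
#include <vector>

static const float FP4_E2M1_LUT[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

// Option A: the kernel dequantizes with the block scale only; the per-tensor
// scale is applied afterwards as an explicit scale node on the output.
float dequant_block_scale_only(uint8_t code, float block_scale) {
    return FP4_E2M1_LUT[code & 0xF] * block_scale;
}

// Option B: the per-tensor scale is baked into every block scale when the
// weights are quantized, so the matmul kernel never sees it (but the folded
// scales must still fit the block-scale format's range).
void fold_tensor_scale_into_blocks(std::vector<float> & block_scales, float tensor_scale) {
    for (float & d : block_scales) {
        d *= tensor_scale;
    }
}
```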
Sorry for the radio silence. From our side, proceeding with the split-responsibility (
Regarding optimizations for quantizing incoming activations from F32->A4 (both perf and quality-wise), we feel these can be addressed in separate follow-up PRs. I will do another round of quality/perf evaluations on DGX Spark and get back to you once I have data available.
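For reference, a minimal sketch of what an F32-to-NVFP4-style activation-quantization kernel could look like: 16-value blocks, E2M1 elements, an E4M3 per-block scale, with the per-tensor scale omitted for brevity. The struct and kernel names are illustrative assumptions, not the actual ggml-cuda implementation.

```cuda
#include <cuda_fp8.h>
#include <cstdint>

constexpr int NVFP4_BLOCK = 16;     // values per scaled block
constexpr float E2M1_MAX  = 6.0f;   // largest magnitude representable in FP4 (E2M1)

struct nvfp4_block_t {              // illustrative layout, not the ggml block type
    __nv_fp8_e4m3 d;                // per-block scale (E4M3)
    uint8_t       q[NVFP4_BLOCK / 2]; // 16 packed 4-bit values
};

// Round a scaled value to the nearest E2M1 code (sign bit + 3-bit magnitude).
__device__ uint8_t f32_to_e2m1(float x) {
    const float levels[8] = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};
    const uint8_t sign = x < 0.0f ? 0x8 : 0x0;
    const float   ax   = fabsf(x);
    uint8_t best = 0;
    float best_err = fabsf(ax - levels[0]);
    for (int i = 1; i < 8; ++i) {
        const float err = fabsf(ax - levels[i]);
        if (err < best_err) { best_err = err; best = (uint8_t) i; }
    }
    return sign | best;
}

// One thread quantizes one 16-value block of the activation row.
__global__ void quantize_row_nvfp4(const float * __restrict__ x,
                                   nvfp4_block_t * __restrict__ y, int64_t n) {
    const int64_t ib = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (ib * NVFP4_BLOCK >= n) return;

    const float * xb = x + ib * NVFP4_BLOCK;

    float amax = 0.0f;
    for (int i = 0; i < NVFP4_BLOCK; ++i) amax = fmaxf(amax, fabsf(xb[i]));

    const float d  = amax / E2M1_MAX;            // block scale so amax maps to the FP4 max
    const float id = d > 0.0f ? 1.0f / d : 0.0f;

    y[ib].d = __nv_fp8_e4m3(d);
    for (int i = 0; i < NVFP4_BLOCK; i += 2) {
        const uint8_t lo = f32_to_e2m1(xb[i + 0] * id);
        const uint8_t hi = f32_to_e2m1(xb[i + 1] * id);
        y[ib].q[i / 2] = lo | (hi << 4);
    }
}
```

Both the rounding quality and the per-block amax reduction are the kind of knobs such a kernel exposes for the perf/quality follow-ups mentioned above.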
@ORippler then let's merge this when tests are green |
FWIW, here are some numbers for Nemotron 3 Super 120B on Spark (NVFP4 ckpt from here, and Q4_K ckpt from here):

Quality: I see no issues with PPL for the fallback path, though quantizing activations to 4-bit undeniably hurts quality (this is in line with the Qwen3.6 analysis at https://www.reddit.com/r/LocalLLaMA/comments/1svq8lm/qwen3635ba3b_klds_ints_and_nvfps/?show=original).

Perf numbers (omitting Q4_K, as a lot of the NVFP4 ckpt is in FP8, which we convert to F32 instead of failing the conversion in our hf converter script):

We will focus on quality and perf next, likely taking a look at the quantize kernel, as that does take time (~8% in some nsys traces we took in the past).
@ORippler I will quantize this model with my modified llama-quantizer, which does more scale search, and try to upload it to HF if you want to compare. I have not tried to run models this large yet, as I only have a 5090/32 GB, so it may be difficult for me to run; on smaller models so far, it has better PPL and KLD than those converted with the hf script.
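For context, "more scale search" along these lines could look like the hedged sketch below: instead of taking the block scale directly as amax/6, a few nearby candidates are tried and the one with the smallest reconstruction error wins. This is illustrative only, not the actual modified llama-quantizer logic.

```cuda
#include <cmath>
#include <cstdint>

// Nearest E2M1 magnitude level to x (0, 0.5, 1, 1.5, 2, 3, 4, 6), sign preserved.
static float round_to_e2m1(float x) {
    static const float levels[8] = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};
    const float ax = std::fabs(x);
    float best = levels[0], best_err = std::fabs(ax - levels[0]);
    for (int i = 1; i < 8; ++i) {
        const float err = std::fabs(ax - levels[i]);
        if (err < best_err) { best_err = err; best = levels[i]; }
    }
    return x < 0.0f ? -best : best;
}

// Try a handful of scales around amax/6 and return the one minimizing
// squared reconstruction error for this block (n = 16 for NVFP4-style blocks).
static float search_block_scale(const float * x, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    if (amax == 0.0f) return 0.0f;

    float best_d = amax / 6.0f, best_err = HUGE_VALF;
    for (int step = -4; step <= 4; ++step) {
        const float d  = (amax / 6.0f) * (1.0f + 0.05f * step);   // candidate scale
        const float id = 1.0f / d;
        float err = 0.0f;
        for (int i = 0; i < n; ++i) {
            const float rec = round_to_e2m1(x[i] * id) * d;
            err += (x[i] - rec) * (x[i] - rec);
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```

The candidate grid and error metric are the main tuning points; wider searches trade quantization time for lower PPL/KLD.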
This is a restored clone of PR #21896, ggml-cuda: Blackwell native NVFP4 support.
Unfortunately it was closed during a rebase error and cannot be reopened.
The exact commits are here as they were before. Sorry about this mixup!