ggml-cuda: Blackwell native NVFP4 support #21896
michaelw9999 wants to merge into ggml-org:master
Conversation
```cuda
    float subblock_scale = 0.0f;
    ...
    #pragma unroll // Check +/- 2 to find best code to reduce NVFP4 activation loss. Negligible overhead on Blackwell.
    for (int i = 0; i < 5; i++) {
```
I'm unsure what value this provides. Is this the standard way to quantize to nvfp4? If not, please remove it; avoid novelty wherever possible, please.
I did not invent the idea; I found it in make_qkx2_quants(), which is already in llama.cpp. It is almost free: I don't see any tangible change in perf, with every bench still staying within the same band, perhaps a fraction lower, but negligible enough not to matter.

The real value is that it reduces the activation loss relative to Q8. On Qwen3.5/4B it reduces ppl from 11.80 to 11.65 (Q8 is 11.40), max kld from 12.0 to 11.38 (5.2%), mean ln from 0.087845 to 0.075354 (14%), and mean kld from 0.098035 to 0.092041 (6.5%), and RMS Δp drops from 8.136 to 7.925. Every metric improves, so this will help any model's quality in some way and reduces outliers. I can take it out if you want, split it into its own PR, or just let it go.
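For concreteness, a minimal host-side sketch of what the +/- 2 search amounts to (illustrative only, not the PR's actual CUDA kernel; `e4m3_from_code()` and the other names are placeholders):

```cuda
#include <cmath>

// NVFP4 (E2M1) magnitude codebook.
static const float nvfp4_codebook[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Round one value to the nearest NVFP4 magnitude under a given block scale.
static float nvfp4_roundtrip(float v, float scale) {
    if (scale == 0.0f) {
        return 0.0f;
    }
    const float a = std::fabs(v) / scale;
    float best = 0.0f;
    for (float c : nvfp4_codebook) {
        if (std::fabs(a - c) < std::fabs(a - best)) {
            best = c;
        }
    }
    return std::copysign(best * scale, v);
}

// Try the nominal scale code and its +/- 2 neighbours, keep whichever gives
// the lowest squared reconstruction error over the block.
static float pick_subblock_scale(const float * vals, int n, int nominal_code,
                                 float (*e4m3_from_code)(int)) {
    float best_scale = e4m3_from_code(nominal_code);
    float best_err   = INFINITY;
    for (int d = -2; d <= 2; ++d) {
        const float s = e4m3_from_code(nominal_code + d);
        float err = 0.0f;
        for (int k = 0; k < n; ++k) {
            const float q = nvfp4_roundtrip(vals[k], s);
            err += (vals[k] - q) * (vals[k] - q);
        }
        if (err < best_err) {
            best_err   = err;
            best_scale = s;
        }
    }
    return best_scale;
}
```

It is the same error-search idea as make_qkx2_quants(), just applied to the E4M3 subblock scale instead of the K-quant scale/min pair.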
Afaik dynamic activation quantization would be the standard. However, this would require computing the max value of the whole tensor to figure out the per-tensor scale:
@ORippler I have a lot up my sleeve for NVFP4 that I've been working on, but I did not want to bring out too much all at once; I made that mistake a while back :)) You will like the new repack, I think.
Before merging this PR we can do an eval check like the one done here #17906 (comment) for a Nemotron model that has results for AIME-25 or some similar eval. If we get numbers similar to the original model's, then I think this should be fine.
> The real value is that it reduces the activation loss relative to Q8. On Qwen3.5/4B it reduces ppl from 11.80 to 11.65 (Q8 is 11.40), max kld from 12.0 to 11.38 (5.2%), mean ln from 0.087845 to 0.075354 (14%), and mean kld from 0.098035 to 0.092041 (6.5%), and RMS Δp drops from 8.136 to 7.925. Every metric improves, so this will help any model's quality in some way and reduces outliers. I can take it out if you want, split it into its own PR, or just let it go.
I feel we should reevaluate/compare this heuristic against correctly handling nvfp4 as a derived tensor (i.e. all incoming activations have to be divided by the per-tensor scale before entering this function). An alternative to the heuristic here (and something that could be done in the absence of per-tensor nvfp4 scaling) would be to derive the per-activation max at run-time, and scale it by this value.
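If it helps the discussion, a minimal sketch of deriving the per-tensor amax at run time (my assumptions only; names and launch details are illustrative, not existing ggml-cuda code):

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Grid-stride reduction of |x| into a single per-tensor amax.
// *amax must be zeroed before the launch. atomicMax on the int bit pattern
// is valid here because IEEE-754 ordering matches integer ordering for
// non-negative floats.
__global__ void tensor_amax_kernel(const float * x, const int64_t n, float * amax) {
    float local = 0.0f;
    for (int64_t i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += (int64_t) gridDim.x*blockDim.x) {
        local = fmaxf(local, fabsf(x[i]));
    }
    // reduce within the warp
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        local = fmaxf(local, __shfl_xor_sync(0xFFFFFFFF, local, offset));
    }
    if (threadIdx.x % 32 == 0) {
        atomicMax((int *) amax, __float_as_int(local));
    }
}
```

The FP32 per-tensor scale would then be derived from that amax (e.g. amax/(6*448) in the usual NVFP4 recipe, as I understand it) before the per-block E4M3 scales are computed.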
First of all, thanks for the PR and your contribution!
There remains a small quantization loss from using NVFP4 for activations. On Qwen3.5-4B this moved ppl from 11.40 on the baseline to 11.65 after this PR, and on Nemotron-Cascade-2-30B from 9.81 to 9.85. This was kept in check by doing a small +/- 2 code search during quantization, which improves the subblock scale by picking whichever candidate has the lowest error; the overhead is negligible since it is computed on the GPU. That improved ppl and lowered max kld, for example from 12.24 to 11.65 (vs. the 11.40 Q8 baseline) on Qwen3.5-4B and from 9.88 to 9.85 on Nemotron, and is likely worth any tiny overhead. Like MXFP4, test-backend-ops was updated to add NVFP4 to the same override, as certain tests would otherwise fail for excess error.
Unless I am misreading the PR, we are:
- Missing the per-tensor F32 scale for the activation quantization. (Ideally this should be applied in the GEMM epilogue before write-back. Theoretically we could also grab the per-tensor scale we apply as a separate GGML_OP for the weights and apply both of them at the same time, though this would require adding node fusion (and may have been done by someone else in the meantime).)
- Discarding negative values for the per-block FP8 scales.

My feeling is we should resolve both for functional correctness/best available quality.
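For reference, my understanding of the full two-level NVFP4 decode (hedged; the exact recipe may differ from what the PR implements) is

$$\hat{x}_i = q_i \cdot s_{\text{block}} \cdot s_{\text{tensor}}, \qquad q_i \in \{0, \pm 0.5, \pm 1, \pm 1.5, \pm 2, \pm 3, \pm 4, \pm 6\},$$

where $s_{\text{block}}$ is an E4M3 FP8 scale per 16-element block (signed, hence the point about negative values) and $s_{\text{tensor}}$ is a single FP32 scale per tensor, typically chosen so the largest block scale still fits the E4M3 range, e.g. $s_{\text{tensor}} \approx \mathrm{amax}(x)/(6 \cdot 448)$.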
Hi @ORippler!
They are loaded if present in the GGUF and are just waiting to be used. I updated the HF convert script to pull them in from any source in the same PR above. My own NVFP4 quantizer (not in any PR, but I have published some quantized NVFP4 GGUF models on HF) uses imatrix to derive the input scale.
> Can we evaluate 4/6 scaling for optimal block-scale search?

Yes, I've already used that in a very old implementation POC; it can be added in a future PR.
on a 5090:
| Model | Microbatch size | Test | t/s e21cdc1 | t/s nvfp4-blackwell | Speedup |
|---|---|---|---|---|---|
| nemotron_h_moe 31B.A3.5B NVFP4 | 512 | pp2048 | 6941.51 | 11039.13 | 1.59 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 1024 | pp2048 | 9209.10 | 13265.28 | 1.44 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 2048 | pp2048 | 10423.61 | 14035.40 | 1.35 |
DGX spark:
| Model | Microbatch size | Test | t/s 8dc530b | t/s nvfp4-blackwell | Speedup |
|---|---|---|---|---|---|
| nemotron_h_moe 31B.A3.5B NVFP4 | 512 | pp2048 | 1913.84 | 2900.42 | 1.52 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 1024 | pp2048 | 2458.35 | 3188.62 | 1.30 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 2048 | pp2048 | 2793.08 | 3430.60 | 1.23 |
@stevelikesrhino please wait till the PR is merged
Fixed, pushed, and tested with Qwen_Qwen3.5-27B-Q6_K_L.gguf; now it's working OK.
I used https://huggingface.co/chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 and simply ran the
@am17an |
It does, there is currently no other way; llama-quantize doesn't currently convert NVFP4.

How else would one make GGUFs like this? :)
@CISC that's awesome! I will go try it out, I haven't had time to even do gemma4 yet... and now Qwen3.6... ahh! Did you do anything special, or just enable it in the quantizer the standard way? I've been working on a heavily modified llama-quantizer for NVFP4 for a long time; you can check https://huggingface.co/michaelw9999/Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF. I was having trouble getting the test suites working right the last time I tried it with AIME25 (100% of answers timing out, it just kept thinking), but this has ppl
Nope, it's just a straight-up conversion of LilaRest's quant.
Can I get a second approval? @ggml-org/ggml-cuda
Tested with a custom quant converted from NVIDIA's gemma4 31B NVFP4 quant; I've never heard my 5090 screeching like that. Functionality-wise I didn't have any problems. Windows 11, CUDA 13.1.
I don't know, BLACK MAGIC? It says it doesn't and I believe it: you guys must be using some hacks.
(And by this I mean there should be some sort of fallback that basically says "yes, picking nvfp4 here because it's a repack".)
Well, it's not a valid
We should fix that. The way to do it is to not set the type to nvfp4; just use no args.
There's nothing to fix really,
No, that's obviously also handled by black magic.
I know there's nothing to fix technically, but the error message is misleading. It should say something like "if you want to repack an existing NVFP4-quantized model, run without
The error is output by
I opened a discussion about general issues/limitations of the current NVFP4 support here: #22042, and would advocate for withholding this PR until we have had some discussion over there.
```cuda
    const int64_t i2 = blockIdx.z % ne2;
    const int64_t i3 = blockIdx.z / ne2;
```
I would expect this kernel to be mostly I/O bound anyway; to the extent that compute makes a difference, I would first try replacing the division and modulo with fast_div_modulo, since that should just be free.
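For illustration, the trick behind it is division by an invariant integer via a precomputed multiplier. A self-contained sketch (this is not the signature of ggml's actual fast_div_modulo helper, just the general idea; it assumes the divisor is at least 2):

```cuda
#include <cstdint>

struct fastdiv_t {
    uint64_t m; // ceil(2^64 / d), precomputed on the host
    uint32_t d; // original divisor, needed for the modulo
};

// Host side: done once, e.g. when setting up the kernel arguments (d >= 2).
static fastdiv_t make_fastdiv(uint32_t d) {
    return { UINT64_MAX / d + 1, d };
}

__device__ __forceinline__ uint32_t fast_div(uint32_t n, fastdiv_t f) {
    return (uint32_t) __umul64hi(n, f.m); // equals n / d for any 32-bit n
}

__device__ __forceinline__ uint32_t fast_mod(uint32_t n, fastdiv_t f) {
    return n - fast_div(n, f) * f.d;
}

// In the kernel, with a fastdiv_t for ne2 passed as an argument, the two
// lines above become:
//   const uint32_t i2 = fast_mod(blockIdx.z, fd_ne2);
//   const uint32_t i3 = fast_div(blockIdx.z, fd_ne2);
```

Since ne2 is constant for the whole launch, the magic multiplier is computed once on the host and the per-thread cost is a single wide multiply instead of a hardware divide.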
```cuda
    if (i00 < ne00) {
        const float v = x[base_idx + i00];
        vals_raw[k] = v;
        amax_raw = fmaxf(amax_raw, fabsf(v));
    } else {
        vals_raw[k] = 0.0f;
    }
```
The quantized activations are padded with zeros in order to avoid having to do an out-of-bounds check for the weights. I'm not sure what you mean by support for partially filled blocks, but this should still be needed unless you ensure that src[0]->ne[0] is exactly divisible by the MMQ iteration size in the k direction.
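A short side note on why the zero padding is sufficient (my paraphrase, in case it helps the partially-filled-blocks question): with the activations written as quantized zeros past ne00, the unguarded weight loads are harmless because the extra dot-product terms vanish,

$$\sum_{k=0}^{K_{\text{pad}}-1} w_k x_k = \sum_{k=0}^{ne_{00}-1} w_k x_k + \sum_{k=ne_{00}}^{K_{\text{pad}}-1} w_k \cdot 0,$$

assuming the weight buffer itself is allocated/padded out to the same K so the loads stay in bounds.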
It seems it can? master (a678916):
your branch:
This PR was closed in error after a local rebase/restore did not go correctly, so it lost all commits and history and cannot be reopened. The identical commits and last state were recovered into PR #22196. Sorry for the duplication and trouble.
Oh yeah my bad, I did not think about having to make sure we produce valid K tiles. I was merely thinking about |
**Description:** This update is the first of several upcoming Blackwell NVFP4 features, introducing the first native MMA and MMQ kernel and significantly improving prefill performance over the generic version by using native NVFP4 with hardware-accelerated block scaling.
This path runs as NVFP4 x NVFP4 (NVFP4 for activations) only when the hardware supports it, i.e. when a Blackwell GPU is present on a CUDA-compiled build; it is not executed on any other platform. While this version primarily increases prefill speed, it maintains nearly the same token-generation speed as the previous Q8 version, despite the heavier overhead of quantizing activations to NVFP4 on the fly.
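As an illustration of that gating (a sketch under my assumptions, not the PR's actual dispatch code), the check boils down to the device reporting a Blackwell-class compute capability:

```cuda
#include <cuda_runtime.h>

// True if the device is Blackwell or newer: compute capability 10.x for
// B100/B200-class parts, 12.x for consumer Blackwell. Only then would the
// native NVFP4 x NVFP4 path be selected; everything else keeps using the
// existing generic kernels.
static bool device_is_blackwell_or_newer(int device) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    return prop.major >= 10;
}
```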
Future PRs will add support for NVFP4 activation scaling, MMVQ optimization, an AoSoA block repack (which will also significantly increase performance), and other improvements.
**Performance boost:** There remains a small quantization loss from using NVFP4 for activations. On Qwen3.5-4B this moved ppl from 11.40 on the baseline to 11.65 after this PR, and on Nemotron-Cascade-2-30B from 9.81 to 9.85. This was kept in check by doing a small +/- 2 code search during quantization, which improves the subblock scale by picking whichever candidate has the lowest error; the overhead is negligible since it is computed on the GPU. That improved ppl and lowered max kld, for example from 12.24 to 11.65 (vs. the 11.40 Q8 baseline) on Qwen3.5-4B and from 9.88 to 9.85 on Nemotron, and is likely worth any tiny overhead. Like MXFP4, test-backend-ops was updated to add NVFP4 to the same override, as certain tests would otherwise fail for excess error.

AI assistance was used during this development for help with debugging, optimizing, and creating some portions, especially involving stride/tile/layout calculations. All code has been meticulously reviewed and edited by hand.