fix: clamp NaN/Inf in topk_softmax to prevent duplicate expert IDs (#39391)
Conversation
Code Review
This pull request implements NaN and Inf clamping within the optimized topkGating kernel to prevent duplicate expert IDs, which avoids crashes in FlashInfer's MoE sort. While the change addresses the issue for the optimized path, the reviewer noted that the fallback paths in moeSoftmax and moeSigmoid remain vulnerable and should also be updated to ensure a complete fix for all expert configurations.
```cpp
for (int ii = 0; ii < VPT; ++ii) {
  if (isnan(row_chunk[ii]) || isinf(row_chunk[ii])) {
    row_chunk[ii] = 0.f;
  }
}
```
The fix correctly addresses the issue for the optimized topkGating kernel. However, the same vulnerability exists in the fallback path used for models with a non-standard number of experts (those that are not a power of 2 or a multiple of 64). In topkGatingKernelLauncher (line 711), the default case calls moeSoftmax or moeSigmoid followed by moeTopK. These kernels currently lack NaN/Inf clamping, meaning they will still produce duplicate expert IDs for degenerate inputs, potentially leading to the same illegal memory access crash in downstream kernels like FlashInfer. To ensure a complete fix, NaN/Inf clamping should also be added to the output loops of moeSoftmax (line 125) and moeSigmoid (line 146).
When CUDA graph padding produces degenerate hidden states that result in NaN gating logits, softmax outputs all-NaN. The argmax loop then picks expert 0 for every top-k slot (since NaN > NaN is false per IEEE 754), producing duplicate expert IDs like [0,0,0,0,0,0,0,0]. These duplicates cause FlashInfer's three-step MoE sort (blockExpertPrefixSumKernel) to leave permutation entries uninitialized, leading to wild pointer dereferences in finalizeMoeRoutingKernel and CUDA illegal memory access crashes.

The fix clamps NaN/Inf values to 0 after softmax/sigmoid scoring, before the argmax selection loop. With all-zero scores, the argmax uses index tie-breaking to pick unique experts [0,1,2,...,k-1], preventing duplicates. Normal (non-NaN) inputs are unaffected -- the clamp is a no-op.

Tested: Qwen3.5-397B-A17B-FP8, TP=4 EP=4, CUDA graphs, 8 concurrent requests -- 8/8 HTTP 200 (previously 5/8 OK + crash).

Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
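For readers outside CUDA land, the selection behavior is easy to reproduce in plain Python. This is an illustrative sketch of the loop described above, not the kernel source; `topk_argmax` is a made-up helper name:

```python
import math

def topk_argmax(scores, k):
    # Simplified mirror of the kernel's per-row top-k selection loop.
    scores = list(scores)
    picks = []
    for _ in range(k):
        best, best_val = 0, scores[0]
        for e in range(1, len(scores)):
            if scores[e] > best_val:   # NaN > NaN is False (IEEE 754)
                best, best_val = e, scores[e]
        picks.append(best)
        scores[best] = -10000.0        # retire the winner; -10000 > NaN is also False
    return picks

print(topk_argmax([math.nan] * 8, 4))  # [0, 0, 0, 0]  -- duplicates
print(topk_argmax([0.0] * 8, 4))       # [0, 1, 2, 3]  -- unique via tie-breaking
```

With all-NaN scores every comparison is false, so expert 0 wins every slot; after the clamp, the all-zero scores plus the -10000 winner retirement yield unique ascending indices.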
Force-pushed from 73b8ea7 to 2d6669d.
Just wondering, what actual values do we use for padding?
It's zeros (found by Claude).
Parametrized over {bf16, fp16, fp32} x {NaN, +Inf} x {softmax, sigmoid}
x {topk=3,4} x {num_experts=8,16} -- 48 cases exercising the patched
topkGatingSoftmax warp kernel path.
Each test case poisons 3 of 4 gating rows with NaN or +Inf and asserts:
- poisoned rows produce unique top-k expert IDs (no duplicates)
- poisoned rows produce finite weights
- the clean row still matches the torch.topk reference
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
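For orientation, a condensed sketch of such a test might look like the following. This is hedged: the `fused_topk` import path, signature, and return values are assumed from this PR's snippets and may differ from the actual test; the softmax/sigmoid scoring axis is omitted for brevity:

```python
import pytest
import torch

from vllm.model_executor.layers.fused_moe import fused_topk  # assumed import path

@pytest.mark.parametrize("dtype", [torch.bfloat16, torch.float16, torch.float32])
@pytest.mark.parametrize("bad_value", [float("nan"), float("inf")])
@pytest.mark.parametrize("topk", [3, 4])
@pytest.mark.parametrize("num_experts", [8, 16])
def test_topk_nan_inf_clamp(dtype, bad_value, topk, num_experts):
    hidden = torch.randn(4, 64, dtype=dtype, device="cuda")
    gating = torch.randn(4, num_experts, dtype=dtype, device="cuda")
    gating[1:, :] = bad_value  # poison 3 of 4 gating rows

    topk_weights, topk_ids, _ = fused_topk(hidden, gating, topk, renormalize=False)

    for row in range(1, 4):
        # Poisoned rows: unique expert IDs, finite weights.
        assert topk_ids[row].unique().numel() == topk
        assert torch.isfinite(topk_weights[row]).all()

    # The clean row still matches the torch.topk reference.
    ref_ids = torch.topk(gating[0].float(), topk).indices
    assert set(topk_ids[0].tolist()) == set(ref_ids.tolist())
```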
Hi @jhaotingc, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
The original fix in 2d6669d only patched the topkGatingSoftmax warp kernel, which handles num_experts that are a power of 2 or a multiple of 64. For other num_experts values (e.g., 6), topk_softmax dispatches to a fallback: moeSoftmax/moeSigmoid writes normalized scores into a workspace, then moeTopK runs argmax. Neither kernel clamps NaN/Inf, so the same crash mode (duplicate expert IDs -> uninitialized permutation entries -> OOB in finalizeMoeRoutingKernel) is reachable for models with non-power-of-2 expert counts.

Add a clamp-to-zero at the output of moeSoftmax and moeSigmoid so the workspace moeTopK reads is always finite. With all-zero scores, argmax tie-breaking plus the -10000 winner zeroing produce unique experts [0, 1, ..., k-1], matching the warp-kernel fix.

Extend tests/kernels/moe/test_fused_topk.py::test_fused_topk_nan_inf_clamp to parametrize num_experts over {6, 8, 16} so both paths are covered.

Verified on H200 cu13.0.2 torch2.11.0 with three rebuild cycles:
- All three clamps active: 72/72 pass
- Default-path clamps disabled, warp clamp active: 18/72 fail (all failures are num_experts=6, except sigmoid+Inf, which saturates to 1.0 and coincidentally produces valid output via tie-breaking)
- All clamps disabled (baseline from commit 5f7fab8): a prior cycle with the warp-kernel-only test showed 36/48 fail

Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Hi @jhaotingc, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
@ZJY0516 added tests.
Can we trigger the pipeline 🤔? TY
```diff
 const int idx = thread_row_offset + ii;
 const float val = toFloat(input[idx]);
-const float softmax_val = expf(val - float_max) * normalizing_factor;
+float softmax_val = expf(val - float_max) * normalizing_factor;
```
Do we really need to change this?
https://github.com/vllm-project/vllm/blob/main/csrc/moe/topk_softmax_kernels.cu#L644-L646
Here's the logic that chooses either (1) the topkGatingSoftmax kernel or (2) the moeSoftmax/moeSigmoid kernels when calculating topk.
If the expert count isn't listed in that switch-case, dispatch falls to the second path.
That's why Gemini suggested fixing that path as well, though it's rare for a model to have such an expert count.
One of the unit tests (num_experts=6) covers the second path.
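As a rough Python restatement of that switch-case, based only on this thread's description (power of 2 or multiple of 64), not on the actual source:

```python
def uses_optimized_warp_kernel(num_experts: int) -> bool:
    # Assumption from this discussion: the topkGatingSoftmax warp kernel covers
    # power-of-two expert counts and multiples of 64; anything else (e.g., 6)
    # falls back to moeSoftmax/moeSigmoid + moeTopK.
    is_pow2 = num_experts > 0 and (num_experts & (num_experts - 1)) == 0
    return is_pow2 or num_experts % 64 == 0

assert uses_optimized_warp_kernel(8)       # warp kernel path
assert not uses_optimized_warp_kernel(6)   # fallback path hit by the num_experts=6 test
```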
```diff
 const int idx = thread_row_offset + ii;
 const float val = toFloat(input[idx]);
-const float sigmoid_val = 1.0f / (1.0f + __expf(-val));
+float sigmoid_val = 1.0f / (1.0f + __expf(-val));
```
tlrmchlsmth left a comment
This seems like a reasonable approach. My concern would be around fragility in case some other topk softmax kernel is used that doesn't suppress NaNs.
The softmax values are small, so another reasonable and less fragile approach would be to call torch.nan_to_num.
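For illustration, that alternative would look something like this on the router logits, before any topk kernel runs (hypothetical placement; `sanitize_router_logits` is a made-up helper name):

```python
import torch

def sanitize_router_logits(gating_output: torch.Tensor) -> torch.Tensor:
    # Map NaN/+Inf/-Inf to 0 so every downstream topk kernel path sees
    # finite scores, regardless of which CUDA kernel is dispatched.
    return torch.nan_to_num(gating_output, nan=0.0, posinf=0.0, neginf=0.0)
```

The trade-off is one extra elementwise pass over the logits versus the in-kernel clamp's zero cost, in exchange for covering kernels this PR doesn't touch.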
My only concern is whether the overhead from these additional checks is acceptable.
I hope the fused_topk test will catch these in CI. Btw, I think we should merge this PR before the v0.20 release.
Yeah, I added some kernel-level benchmarks in the Performance Impact section; the overhead seems negligible.
Following the code path, there are only these two paths (power-of-two/multiple-of-64 and others), and both are covered by the unit test.
The CI errors are unrelated. Can we merge?
This test doesn't run in CI (see |
```python
gating_output = torch.randn((num_tokens, num_experts), dtype=dtype, device="cuda")
gating_output[1:, :] = bad_value
```

```python
topk_weights, topk_ids, _ = fused_topk(
```
Could you add a test for fused_topk_bias as well?
Merging it without fixing the comment above because it's needed in v0.20.
Purpose
Fix #39244
Fix CUDA illegal memory access crash when serving MoE models (e.g., Qwen3.5-397B-A17B-FP8) with FlashInfer CUTLASS MoE and CUDA graphs at high concurrency on H200.
CUDA graph replay pads the batch to the nearest capture size. Padded tokens have degenerate hidden states that produce NaN gating logits. The `topkGating` kernel's softmax outputs all-NaN, and the argmax loop picks expert 0 for every top-k slot (IEEE 754: `NaN > NaN` is false), producing duplicate expert IDs `[0,0,0,0,0,0,0,0]`. These duplicates trigger an uninitialized-memory bug in FlashInfer's three-step MoE sort, causing `finalizeMoeRoutingKernel` to dereference wild pointers.

The fix clamps NaN/Inf values to 0 after softmax/sigmoid scoring in `topkGating`, before the argmax selection loop. With all-zero scores, argmax picks unique experts `[0,1,2,...,k-1]` via index tie-breaking. Zero performance overhead.

Test Plan
- Kernel unit test (`tests/kernels/moe/test_fused_topk.py::test_fused_topk_nan_inf_clamp`): verify `topk_softmax` produces unique expert IDs for NaN/Inf/normal gating inputs
- Kernel microbenchmark: compare eager + CUDA graph replay latency for normal vs NaN inputs (batch 1-512, 128/256 experts)
- End-to-end: serve Qwen3.5-397B-A17B-FP8 (TP=4, EP=4, CUDA graphs, `VLLM_USE_FLASHINFER_MOE_FP8=1`) with 8 concurrent requests
- Full sweep: sglang benchmark conc 1-512, ISL=1600, OSL=600, REPEAT=5 on 4x H200

Test Result
Kernel correctness (H200):
| Input | Before fix | After fix |
|---|---|---|
| NaN | `[0,0,0,0,0,0,0,0]` (512/512 dup) | `[0,1,2,3,4,5,6,7]` (0 dup) |
| +Inf | `[0,0,0,0,0,0,0,0]` (512/512 dup) | `[0,1,2,3,4,5,6,7]` (0 dup) |

Kernel perf (H200, CUDA graph replay, median of 1000 runs):
All within noise. Zero measurable overhead.
End-to-end (4x H200, Qwen3.5-397B-A17B-FP8): 8/8 requests return HTTP 200 (previously 5/8 OK, then crash).
Summary
vLLM crashes with `CUDA error: an illegal memory access was encountered` when serving Qwen3.5-397B-A17B-FP8 with `VLLM_USE_FLASHINFER_MOE_FP8=1` and CUDA graphs enabled. The crash occurs at high concurrency (8+ requests) when the MoE batch size exceeds 256 tokens.

Root Cause
CUDA graph replay pads the batch to the nearest capture size (e.g., 300 real tokens padded to 512). Padded tokens have stale/degenerate hidden states that produce NaN gating logits in the MoE router. The `topk_softmax` CUDA kernel then produces duplicate expert IDs for NaN inputs (e.g., `[0,0,0,0,0,0,0,0]` for every padded token), because IEEE 754 `NaN > NaN` is always false, so the argmax never updates from expert 0, and the `-10000` zeroing of the winner also fails (`-10000 > NaN` is false).

These duplicate expert IDs trigger a latent bug in FlashInfer's `blockExpertPrefixSumKernel` (three-step MoE sort path, used when num_tokens > 256): it uses `break` after the first expert match, so duplicate expert slots leave `unpermuted_row_to_permuted_row` entries uninitialized. `finalizeMoeRoutingKernel` then reads garbage values as row indices, causing wild pointer dereferences.

Chain of events

1. CUDA graph replay pads the batch; padded tokens carry degenerate hidden states.
2. The router produces NaN gating logits; softmax outputs all-NaN scores.
3. The argmax loop picks expert 0 for every top-k slot, yielding duplicate expert IDs.
4. `blockExpertPrefixSumKernel` leaves `unpermuted_row_to_permuted_row` entries uninitialized for the duplicate slots.
5. `finalizeMoeRoutingKernel` dereferences garbage row indices, causing the illegal memory access.
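To make the uninitialized-entry claim concrete, here is a deliberately simplified, hypothetical Python model of that scatter step. It is not FlashInfer source; the names only mirror the description above:

```python
# Each (token, slot) pair should receive a permuted destination row, but
# matching with `break` writes only one destination per token per expert.
num_experts, top_k = 8, 4
topk_ids = [[0, 0, 0, 0]]                 # duplicates produced by the NaN bug
perm = [None] * (len(topk_ids) * top_k)   # unpermuted_row_to_permuted_row
dest = 0
for expert in range(num_experts):         # experts claim their slots in order
    for token, ids in enumerate(topk_ids):
        for slot, e in enumerate(ids):
            if e == expert:
                perm[token * top_k + slot] = dest
                dest += 1
                break                      # first match only: duplicate slots skipped

print(perm)  # [0, None, None, None] -- three entries stay uninitialized
```

With unique IDs like `[0, 1, 2, 3]`, every expert bucket claims exactly one slot and the whole permutation is written; in real memory the `None` entries are garbage values that become wild row indices.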
Why it only happens with CUDA graphs
In eager mode, there are no padded tokens -- the batch contains only real tokens with valid hidden states, the router produces unique expert IDs, and the three-step sort works correctly. The crash requires:

1. Padded tokens whose degenerate hidden states produce NaN gating logits (padding happens only on CUDA graph replay).
2. A MoE batch of more than 256 tokens, which routes to FlashInfer's three-step sort path.

Both conditions only occur together during CUDA graph replay at high concurrency.
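One plausible mechanism for condition 1, assuming an RMSNorm-style normalization in the model (an assumption for illustration; the exact model code path may differ), given that the padding values are zeros per the discussion above:

```python
import torch

# Zero-padded rows become NaN once a normalization divides by the row norm,
# and the NaN then propagates into the router logits.
capture_size, real_tokens, hidden = 512, 300, 64
x = torch.zeros(capture_size, hidden)
x[:real_tokens] = torch.randn(real_tokens, hidden)     # only real tokens are written
rms = x.pow(2).mean(dim=-1, keepdim=True).sqrt()       # RMSNorm-style scale
normed = x / rms                                        # zero rows: 0/0 -> NaN
router_logits = normed @ torch.randn(hidden, 8)
print(torch.isnan(router_logits[real_tokens:]).all())   # tensor(True)
```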
Fix
Clamp NaN/Inf values to 0 in `topk_softmax` after softmax/sigmoid scoring, before the argmax selection loop:
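```cpp
// The clamp (the same five lines as the kernel diff earlier in this thread):
// degenerate scores become finite zeros before the argmax selection loop runs.
for (int ii = 0; ii < VPT; ++ii) {
  if (isnan(row_chunk[ii]) || isinf(row_chunk[ii])) {
    row_chunk[ii] = 0.f;
  }
}
```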
With all-zero scores, the argmax uses index tie-breaking to pick unique experts `[0,1,2,...,k-1]`, preventing duplicates. Normal (non-NaN) inputs are unaffected -- the clamp is a no-op.

Why this is the right fix location
The `topk_softmax` kernel (csrc/moe/topk_softmax_kernels.cu:266) is where the NaN propagates into duplicate expert IDs. Fixing it here:

- stops the duplicates at their source, before they reach FlashInfer or any other downstream MoE kernel;
- leaves normal (non-NaN) inputs untouched, since the clamp is a no-op for finite values;
- adds no measurable overhead (see Performance Impact below).
Benchmarked on H200, production MoE configs (128/256 experts, top_k=8). The fix adds `isnan`/`isinf` checks (single PTX predicate instructions) per element. The kernel is memory-bandwidth bound, so the extra comparisons are invisible:

Eager mode (us, median of 1000 runs)
CUDA graph replay mode (us, median of 1000 runs)
All differences are within noise (<2%). Zero measurable overhead.
Verification
Standalone (topk kernel)
Before fix: NaN gating rows select `[0,0,0,0,0,0,0,0]` (duplicate expert IDs).
After fix: NaN gating rows select `[0,1,2,3,4,5,6,7]` (unique expert IDs, finite weights).