fix: clamp NaN/Inf in topk_softmax to prevent duplicate expert IDs (#39391)
Conversation
Code Review
This pull request implements NaN and Inf clamping within the optimized topkGating kernel to prevent duplicate expert IDs, which avoids crashes in FlashInfer's MoE sort. While the change addresses the issue for the optimized path, the reviewer noted that the fallback paths in moeSoftmax and moeSigmoid remain vulnerable and should also be updated to ensure a complete fix for all expert configurations.
```cpp
for (int ii = 0; ii < VPT; ++ii) {
  if (isnan(row_chunk[ii]) || isinf(row_chunk[ii])) {
    row_chunk[ii] = 0.f;
  }
}
```
The fix correctly addresses the issue for the optimized topkGating kernel. However, the same vulnerability exists in the fallback path used for models with a non-standard number of experts (those that are not a power of 2 or a multiple of 64). In topkGatingKernelLauncher (line 711), the default case calls moeSoftmax or moeSigmoid followed by moeTopK. These kernels currently lack NaN/Inf clamping, meaning they will still produce duplicate expert IDs for degenerate inputs, potentially leading to the same illegal memory access crash in downstream kernels like FlashInfer. To ensure a complete fix, NaN/Inf clamping should also be added to the output loops of moeSoftmax (line 125) and moeSigmoid (line 146).
When CUDA graph padding produces degenerate hidden states that result in NaN gating logits, softmax outputs all-NaN. The argmax loop then picks expert 0 for every top-k slot (since NaN > NaN is false per IEEE 754), producing duplicate expert IDs like [0,0,0,0,0,0,0,0]. These duplicates cause FlashInfer's three-step MoE sort (blockExpertPrefixSumKernel) to leave permutation entries uninitialized, leading to wild pointer dereferences in finalizeMoeRoutingKernel and CUDA illegal memory access crashes.

The fix clamps NaN/Inf values to 0 after softmax/sigmoid scoring, before the argmax selection loop. With all-zero scores, the argmax uses index tie-breaking to pick unique experts [0,1,2,...,k-1], preventing duplicates. Normal (non-NaN) inputs are unaffected -- the clamp is a no-op.

Tested: Qwen3.5-397B-A17B-FP8, TP=4 EP=4, CUDA graphs, 8 concurrent requests -- 8/8 HTTP 200 (previously 5/8 OK + crash).

Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
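For readers outside CUDA land, the selection behavior is easy to reproduce in plain Python. This is an illustrative sketch of the loop described above, not the kernel source; `topk_argmax` is a made-up helper name:

```python
import math

def topk_argmax(scores, k):
    # Simplified mirror of the kernel's per-row top-k selection loop.
    scores = list(scores)
    picks = []
    for _ in range(k):
        best, best_val = 0, scores[0]
        for e in range(1, len(scores)):
            if scores[e] > best_val:   # NaN > NaN is False (IEEE 754)
                best, best_val = e, scores[e]
        picks.append(best)
        scores[best] = -10000.0        # retire the winner; -10000 > NaN is also False
    return picks

print(topk_argmax([math.nan] * 8, 4))  # [0, 0, 0, 0]  -- duplicates
print(topk_argmax([0.0] * 8, 4))       # [0, 1, 2, 3]  -- unique via tie-breaking
```

With all-NaN scores every comparison is false, so expert 0 wins every slot; after the clamp, the all-zero scores plus the -10000 winner retirement yield unique ascending indices.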
Force-pushed from 73b8ea7 to 2d6669d.
Just wondering, what actual values do we use for padding?
It's zeros (found by Claude).
Parametrized over {bf16, fp16, fp32} x {NaN, +Inf} x {softmax, sigmoid}
x {topk=3,4} x {num_experts=8,16} -- 48 cases exercising the patched
topkGatingSoftmax warp kernel path.
Each test case poisons 3 of 4 gating rows with NaN or +Inf and asserts:
- poisoned rows produce unique top-k expert IDs (no duplicates)
- poisoned rows produce finite weights
- the clean row still matches the torch.topk reference
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
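For orientation, a condensed sketch of such a test might look like the following. This is hedged: the `fused_topk` import path, signature, and return values are assumed from this PR's snippets and may differ from the actual test; the softmax/sigmoid scoring axis is omitted for brevity:

```python
import pytest
import torch

from vllm.model_executor.layers.fused_moe import fused_topk  # assumed import path

@pytest.mark.parametrize("dtype", [torch.bfloat16, torch.float16, torch.float32])
@pytest.mark.parametrize("bad_value", [float("nan"), float("inf")])
@pytest.mark.parametrize("topk", [3, 4])
@pytest.mark.parametrize("num_experts", [8, 16])
def test_topk_nan_inf_clamp(dtype, bad_value, topk, num_experts):
    hidden = torch.randn(4, 64, dtype=dtype, device="cuda")
    gating = torch.randn(4, num_experts, dtype=dtype, device="cuda")
    gating[1:, :] = bad_value  # poison 3 of 4 gating rows

    topk_weights, topk_ids, _ = fused_topk(hidden, gating, topk, renormalize=False)

    for row in range(1, 4):
        # Poisoned rows: unique expert IDs, finite weights.
        assert topk_ids[row].unique().numel() == topk
        assert torch.isfinite(topk_weights[row]).all()

    # The clean row still matches the torch.topk reference.
    ref_ids = torch.topk(gating[0].float(), topk).indices
    assert set(topk_ids[0].tolist()) == set(ref_ids.tolist())
```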
Hi @jhaotingc, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
The original fix in 2d6669d only patched the topkGatingSoftmax warp kernel, which handles num_experts that are a power of 2 or a multiple of 64. For other num_experts values (e.g., 6), topk_softmax dispatches to a fallback: moeSoftmax/moeSigmoid writes normalized scores into a workspace, then moeTopK runs argmax. Neither kernel clamps NaN/Inf, so the same crash mode (duplicate expert IDs -> uninitialized permutation entries -> OOB in finalizeMoeRoutingKernel) is reachable for models with non-power-of-2 expert counts.

Add a clamp-to-zero at the output of moeSoftmax and moeSigmoid so the workspace moeTopK reads is always finite. With all-zero scores, argmax tie-breaking plus the -10000 winner zeroing produce unique experts [0, 1, ..., k-1], matching the warp-kernel fix.

Extend tests/kernels/moe/test_fused_topk.py::test_fused_topk_nan_inf_clamp to parametrize num_experts over {6, 8, 16} so both paths are covered.

Verified on H200 cu13.0.2 torch2.11.0 with three rebuild cycles:
- All three clamps active: 72/72 pass
- Default-path clamps disabled, warp clamp active: 18/72 fail (all failures are num_experts=6, except sigmoid+Inf, which saturates to 1.0 and coincidentally produces valid output via tie-breaking)
- All clamps disabled (baseline from commit 5f7fab8): a prior cycle with the warp-kernel-only test showed 36/48 fail

Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Hi @jhaotingc, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
@ZJY0516 added tests.
Can we trigger the pipeline 🤔? TY
```diff
 const int idx = thread_row_offset + ii;
 const float val = toFloat(input[idx]);
-const float softmax_val = expf(val - float_max) * normalizing_factor;
+float softmax_val = expf(val - float_max) * normalizing_factor;
```
Do we really need to change this?
https://github.com/vllm-project/vllm/blob/main/csrc/moe/topk_softmax_kernels.cu#L644-L646
Here's the logic that chooses either (1) the topkGatingSoftmax kernel or (2) the moeSoftmax/moeSigmoid kernels when calculating topk.
If the expert count isn't listed in that switch-case, dispatch falls to the second path.
That's why Gemini suggested fixing that path as well, though it's rare for a model to have such an expert count.
One of the unit tests (num_experts=6) covers the second path.
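As a rough Python restatement of that switch-case, based only on this thread's description (power of 2 or multiple of 64), not on the actual source:

```python
def uses_optimized_warp_kernel(num_experts: int) -> bool:
    # Assumption from this discussion: the topkGatingSoftmax warp kernel covers
    # power-of-two expert counts and multiples of 64; anything else (e.g., 6)
    # falls back to moeSoftmax/moeSigmoid + moeTopK.
    is_pow2 = num_experts > 0 and (num_experts & (num_experts - 1)) == 0
    return is_pow2 or num_experts % 64 == 0

assert uses_optimized_warp_kernel(8)       # warp kernel path
assert not uses_optimized_warp_kernel(6)   # fallback path hit by the num_experts=6 test
```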
```diff
 const int idx = thread_row_offset + ii;
 const float val = toFloat(input[idx]);
-const float sigmoid_val = 1.0f / (1.0f + __expf(-val));
+float sigmoid_val = 1.0f / (1.0f + __expf(-val));
```
tlrmchlsmth left a comment
This seems like a reasonable approach. My concern would be around fragility in case some other topk softmax kernel is used that doesn't suppress NaNs.
The softmax values are small, so another reasonable and less fragile approach would be to call torch.nan_to_num.
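For illustration, that alternative would look something like this on the router logits, before any topk kernel runs (hypothetical placement; `sanitize_router_logits` is a made-up helper name):

```python
import torch

def sanitize_router_logits(gating_output: torch.Tensor) -> torch.Tensor:
    # Map NaN/+Inf/-Inf to 0 so every downstream topk kernel path sees
    # finite scores, regardless of which CUDA kernel is dispatched.
    return torch.nan_to_num(gating_output, nan=0.0, posinf=0.0, neginf=0.0)
```

The trade-off is one extra elementwise pass over the logits versus the in-kernel clamp's zero cost, in exchange for covering kernels this PR doesn't touch.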
My only concern is whether the overhead from these additional checks is acceptable.
I hope the fused_topk test will catch these in CI. Btw, I think we should merge this PR before the v0.20 release.
Yeah, I added some kernel-level benchmarks in the Performance Impact section; the overhead seems negligible.
Following the code path, there are only these two paths (power-of-two/multiple-of-64 and others), and both are covered by the unit test.
The CI errors are unrelated. Can we merge?
This test doesn't run in CI (see |
```python
gating_output = torch.randn((num_tokens, num_experts), dtype=dtype, device="cuda")
gating_output[1:, :] = bad_value
```

```python
topk_weights, topk_ids, _ = fused_topk(
```
Could you add a test for fused_topk_bias as well?
Merging it without fixing the comment above because it's needed in v0.20.
Purpose
Fix #39244
Fix CUDA illegal memory access crash when serving MoE models (e.g., Qwen3.5-397B-A17B-FP8) with FlashInfer CUTLASS MoE and CUDA graphs at high concurrency on H200.
CUDA graph replay pads the batch to the nearest capture size. Padded tokens have degenerate hidden states that produce NaN gating logits. The `topkGating` kernel's softmax outputs all-NaN, and the argmax loop picks expert 0 for every top-k slot (IEEE 754: `NaN > NaN` is false), producing duplicate expert IDs `[0,0,0,0,0,0,0,0]`. These duplicates trigger an uninitialized-memory bug in FlashInfer's three-step MoE sort, causing `finalizeMoeRoutingKernel` to dereference wild pointers.

The fix clamps NaN/Inf values to 0 after softmax/sigmoid scoring in `topkGating`, before the argmax selection loop. With all-zero scores, argmax picks unique experts `[0,1,2,...,k-1]` via index tie-breaking. Zero performance overhead.

Test Plan
- Kernel unit test (`tests/kernels/moe/test_fused_topk.py::test_fused_topk_nan_inf_clamp`): verify `topk_softmax` produces unique expert IDs for NaN/Inf/normal gating inputs
- Kernel microbenchmark: compare eager + CUDA graph replay latency for normal vs NaN inputs (batch 1-512, 128/256 experts)
- End-to-end: serve Qwen3.5-397B-A17B-FP8 (TP=4, EP=4, CUDA graphs, `VLLM_USE_FLASHINFER_MOE_FP8=1`) with 8 concurrent requests
- Full sweep: sglang benchmark conc 1-512, ISL=1600, OSL=600, REPEAT=5 on 4x H200

Test Result
Kernel correctness (H200):
| Input | Before fix | After fix |
|---|---|---|
| NaN | `[0,0,0,0,0,0,0,0]` (512/512 dup) | `[0,1,2,3,4,5,6,7]` (0 dup) |
| +Inf | `[0,0,0,0,0,0,0,0]` (512/512 dup) | `[0,1,2,3,4,5,6,7]` (0 dup) |

Kernel perf (H200, CUDA graph replay, median of 1000 runs):
All within noise. Zero measurable overhead.
End-to-end (4x H200, Qwen3.5-397B-A17B-FP8): 8/8 requests return HTTP 200 (previously 5/8 OK, then crash).
Summary
vLLM crashes with `CUDA error: an illegal memory access was encountered` when serving Qwen3.5-397B-A17B-FP8 with `VLLM_USE_FLASHINFER_MOE_FP8=1` and CUDA graphs enabled. The crash occurs at high concurrency (8+ requests) when the MoE batch size exceeds 256 tokens.

Root Cause
CUDA graph replay pads the batch to the nearest capture size (e.g., 300 real tokens padded to 512). Padded tokens have stale/degenerate hidden states that produce NaN gating logits in the MoE router. The `topk_softmax` CUDA kernel then produces duplicate expert IDs for NaN inputs (e.g., `[0,0,0,0,0,0,0,0]` for every padded token), because IEEE 754 `NaN > NaN` is always false, so the argmax never updates from expert 0, and the `-10000` zeroing of the winner also fails (`-10000 > NaN` is false).

These duplicate expert IDs trigger a latent bug in FlashInfer's `blockExpertPrefixSumKernel` (three-step MoE sort path, used when num_tokens > 256): it uses `break` after the first expert match, so duplicate expert slots leave `unpermuted_row_to_permuted_row` entries uninitialized. `finalizeMoeRoutingKernel` then reads garbage values as row indices, causing wild pointer dereferences.

Chain of events

1. CUDA graph replay pads the batch; padded tokens carry degenerate hidden states.
2. The router produces NaN gating logits; softmax outputs all-NaN scores.
3. The argmax loop picks expert 0 for every top-k slot, yielding duplicate expert IDs.
4. `blockExpertPrefixSumKernel` leaves `unpermuted_row_to_permuted_row` entries uninitialized for the duplicate slots.
5. `finalizeMoeRoutingKernel` dereferences garbage row indices, causing the illegal memory access.
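To make the uninitialized-entry claim concrete, here is a deliberately simplified, hypothetical Python model of that scatter step. It is not FlashInfer source; the names only mirror the description above:

```python
# Each (token, slot) pair should receive a permuted destination row, but
# matching with `break` writes only one destination per token per expert.
num_experts, top_k = 8, 4
topk_ids = [[0, 0, 0, 0]]                 # duplicates produced by the NaN bug
perm = [None] * (len(topk_ids) * top_k)   # unpermuted_row_to_permuted_row
dest = 0
for expert in range(num_experts):         # experts claim their slots in order
    for token, ids in enumerate(topk_ids):
        for slot, e in enumerate(ids):
            if e == expert:
                perm[token * top_k + slot] = dest
                dest += 1
                break                      # first match only: duplicate slots skipped

print(perm)  # [0, None, None, None] -- three entries stay uninitialized
```

With unique IDs like `[0, 1, 2, 3]`, every expert bucket claims exactly one slot and the whole permutation is written; in real memory the `None` entries are garbage values that become wild row indices.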
Why it only happens with CUDA graphs
In eager mode, there are no padded tokens -- the batch contains only real tokens with valid hidden states, the router produces unique expert IDs, and the three-step sort works correctly. The crash requires:

1. Padded tokens whose degenerate hidden states produce NaN gating logits (padding happens only on CUDA graph replay).
2. A MoE batch of more than 256 tokens, which routes to FlashInfer's three-step sort path.

Both conditions only occur together during CUDA graph replay at high concurrency.
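One plausible mechanism for condition 1, assuming an RMSNorm-style normalization in the model (an assumption for illustration; the exact model code path may differ), given that the padding values are zeros per the discussion above:

```python
import torch

# Zero-padded rows become NaN once a normalization divides by the row norm,
# and the NaN then propagates into the router logits.
capture_size, real_tokens, hidden = 512, 300, 64
x = torch.zeros(capture_size, hidden)
x[:real_tokens] = torch.randn(real_tokens, hidden)     # only real tokens are written
rms = x.pow(2).mean(dim=-1, keepdim=True).sqrt()       # RMSNorm-style scale
normed = x / rms                                        # zero rows: 0/0 -> NaN
router_logits = normed @ torch.randn(hidden, 8)
print(torch.isnan(router_logits[real_tokens:]).all())   # tensor(True)
```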
Fix
Clamp NaN/Inf values to 0 in `topk_softmax` after softmax/sigmoid scoring, before the argmax selection loop:
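```cpp
// The clamp (the same five lines as the kernel diff earlier in this thread):
// degenerate scores become finite zeros before the argmax selection loop runs.
for (int ii = 0; ii < VPT; ++ii) {
  if (isnan(row_chunk[ii]) || isinf(row_chunk[ii])) {
    row_chunk[ii] = 0.f;
  }
}
```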
With all-zero scores, the argmax uses index tie-breaking to pick unique experts `[0,1,2,...,k-1]`, preventing duplicates. Normal (non-NaN) inputs are unaffected -- the clamp is a no-op.

Why this is the right fix location
The `topk_softmax` kernel (csrc/moe/topk_softmax_kernels.cu:266) is where the NaN propagates into duplicate expert IDs. Fixing it here:

- stops the duplicates at their source, before they reach FlashInfer or any other downstream MoE kernel;
- leaves normal (non-NaN) inputs untouched, since the clamp is a no-op for finite values;
- adds no measurable overhead (see Performance Impact below).
Benchmarked on H200, production MoE configs (128/256 experts, top_k=8). The fix adds `isnan`/`isinf` checks (single PTX predicate instructions) per element. The kernel is memory-bandwidth bound, so the extra comparisons are invisible:

Eager mode (us, median of 1000 runs)
CUDA graph replay mode (us, median of 1000 runs)
All differences are within noise (<2%). Zero measurable overhead.
Verification
Standalone (topk kernel)
Before fix: NaN gating rows select `[0,0,0,0,0,0,0,0]` (duplicate expert IDs).
After fix: NaN gating rows select `[0,1,2,3,4,5,6,7]` (unique expert IDs, finite weights).