chore(serve): bake flashinfer-autotune-off flag into serve-cute.sh

Natfii · claude · Natfii · commit 2b21f3450587 · 2026-04-25T14:14:07.000-04:00
Per memory feedback_flashinfer_autotune_sm120, the SM120/GB10 host hard-reboots when flashinfer.jit's autotuner runs at serve startup (no clean OOM, no traceback; the kernel panics).

The fix is universal: pass --kernel-config '{"enable_flashinfer_autotune":false}' to every vllm serve invocation in this repo. serve-cute.sh was missing it. serve.sh (triton_attn) is unaffected because it doesn't engage the cute_paged + flashinfer codepath.

Refs:
  memory:feedback_flashinfer_autotune_sm120
  Flashinfer issue vllm-project#2884
  vLLM issue vllm-project#36999

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
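The "every vllm serve invocation" rule could be spot-checked with a small audit helper. The sketch below is not part of this commit: the function name check_autotune_flag is hypothetical, and it assumes that serve scripts live in one directory as *.sh files, mention the literal string "vllm serve", and spell the setting exactly as in the message above. Scripts that launch vLLM some other way, or format the JSON differently, would need a different pattern.

```shell
#!/usr/bin/env bash
# Hypothetical audit helper (illustration only, not part of this commit).
# Lists *.sh files in a directory that mention "vllm serve" but do not
# contain the enable_flashinfer_autotune:false setting.
check_autotune_flag() {
  local dir="${1:-scripts}" missing=0 f
  for f in "$dir"/*.sh; do
    [ -f "$f" ] || continue                 # skip when the glob matches nothing
    grep -q 'vllm serve' "$f" || continue   # only audit scripts that launch vllm
    if ! grep -q '"enable_flashinfer_autotune":false' "$f"; then
      echo "missing autotune-off flag: $f"
      missing=1
    fi
  done
  return "$missing"                         # non-zero if any script lacks the flag
}
```

Running check_autotune_flag scripts from the repo root would print one line per offending script and return non-zero if any is found, so it could gate a pre-commit hook. A script such as serve.sh that is intentionally exempt would need to be excluded by hand.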
diff --git a/scripts/serve-cute.sh b/scripts/serve-cute.sh
@@ -96,6 +96,7 @@ docker run -d \
   --trust-remote-code \
   --gpu-memory-utilization "${SERVE_GPU_UTIL:-0.70}" \
   --max-num-batched-tokens 65536 \
+  --kernel-config '{"enable_flashinfer_autotune":false}' \
   "${EXTRA_ARGS[@]}"
 
 echo "Container started: $CONTAINER"