NVFP4 inference on Blackwell GeForce (RTX 5090/5080/5070 Ti/RTX PRO 6000) — SM120 patches for vLLM + FlashInfer + CUTLASS. 175 tok/s on Qwen3.6-35B MoE.
Updated Apr 27, 2026 - Python
(Experimental) A high-throughput and memory-efficient inference and serving engine for LLMs with an optimized GB10 kernel
Lna-Lab production pipeline: GGUF -> modelopt-format NVFP4 + working MTP head for vLLM on RTX PRO 6000 Blackwell (SM120). Stages 2 (NVFP4) and 3 (MTP graft) are Lna-Lab originals; stage 1 (GGUF -> bf16) reuses li-yifei/gguf-to-nvfp4.
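For readers unfamiliar with the NVFP4 format these repos target: it stores tensors as 4-bit E2M1 values (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with a shared scale per 16-element block. The sketch below illustrates only the block-quantization arithmetic; it is a simplification, not any of these repos' code — the per-block scale is kept as a plain Python float rather than the FP8 (E4M3) scale used on hardware, and the tensor-level FP32 scale is omitted.

```python
# Positive magnitudes representable in E2M1 (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one 16-element block to E2M1 values plus a shared scale.

    Simplified sketch: scale is a plain float, not an E4M3 FP8 value.
    """
    assert len(block) == 16
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto E2M1's max value, 6.0.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        # Round to the nearest representable E2M1 magnitude.
        q = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        codes.append(q if x >= 0 else -q)
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate values from E2M1 codes and the block scale."""
    return [c * scale for c in codes]
```

Round-tripping a block whose values already sit on the scaled E2M1 grid is exact; otherwise the error per element is bounded by the scale times half the widest gap in the E2M1 grid (the gap between 4 and 6).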