Use FlashAttention for `multi_query_kv_attention` by WoosukKwon · Pull Request #4 · vllm-project/vllm

WoosukKwon · 2023-03-02T05:05:56Z

This PR is to use FlashAttention kernels for multi_query_kv_attention, which performs masked attention for the prompt inputs.

Pros

FlashAttention is fast and memory-efficient.
FlashAttention supports 1D inputs and only invokes a single kernel to handle multiple sequences with variable lengths.

Cons

FlashAttention does NOT support FP32.
FlashAttention does not support head_size > 128. (This is fine for all models except GPT-J).
- Ref: Support for 256 head dim Dao-AILab/flash-attention#67
FlashAttention does not support attention bias (GPT-J, BLOOM, LLaMA).

Besides, note that FlashAttention does not support cached KV, which is required for interactive generation.

Tested models:

OPT-125M
OPT-350M
OPT-1.3B
OPT-2.7B
OPT-6.7B
OPT-13B

Tested GPUs:

A100

* Init * refine

Support for optimum-intel models

…o-model-executor Adapt OpenVINO CPU plugin implementation

BA-78760: Jamba * Add support for n concat and splitting * change naming * input_metadata is a dict list now in order to pass "n" * clean up code from unecessary changes and prints * Remove kv cache allocation in case of mamba layer * Add the considerations of mamba layer cache into the num of blocks calculation * Delete mamba cache after profile * Remove prints * Cleaning * - and not _ for requirements Approved-by: Tomer Asida

patching for having type su

…ect#4 magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new sparsity config for 2:4 sparsity to neuralmagic-vllm, using the SparseTensor 2:4 support. This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.

fix te error

## Summary Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately. **Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743) ## Cherry-picked commits (chronological order) | # | Upstream PR | Jira | Summary | |---|------------|------|---------| | 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models | | 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend | | 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) | | 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) | | 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models | | 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support | | 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) | | 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint | ## Conflict resolutions <details> <summary>#1 — llama-nemotron-embed / score-template support (vllm-project#30550): Clean cherry-pick, no conflicts</summary> Applied cleanly onto `rhai/0.13.0`. </details> <details> <summary>#2 — Triton Attention (vllm-project#31406): Clean cherry-pick, no conflicts</summary> Applied cleanly onto `rhai/0.13.0`. </details> <details> <summary>#3 — Llama-4 attn quant (vllm-project#34243): Clean cherry-pick, no conflicts</summary> Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block. </details> <details> <summary>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454): Clean cherry-pick, no conflicts</summary> Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes. </details> <details> <summary>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507): Conflicts in 2 files</summary> **`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix: - `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result - `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask` **`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes). </details> <details> <summary>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085): Conflict due to file rename + API change</summary> Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream. </details> <details> <summary>vllm-project#7 — response_format validation for completions (vllm-project#35456): Conflict due to file restructuring</summary> Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by: - Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py` - Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py` - Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`) - Test additions from upstream applied cleanly to `test_completion_error.py` </details> <details> <summary>vllm-project#8 — response_format validation for chat completions (vllm-project#35510): Conflict due to file restructuring</summary> Same file restructuring issue as vllm-project#6. Resolved by: - Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py` - Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py` - Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch) </details> ## Test plan - [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support - [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization - [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content - [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow - [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back - [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions` - [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm [INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

…n-files-check Fix vLLM framework documentation to match actual source code structure

Signed-off-by: omerpaz95 <omerpaz95@gmail.com>

…_like, .contiguous() DecodeKernel.__call__ changes: - query: assert contiguity instead of .contiguous() copy - output: accept caller-provided persistent buffer - wo_global_scale: pass tensor pointer, not .item() float - grid.z: use padded_num_seqs for stable graph capture - gate_ptr/gate_fused: new kernel args for output gate fusion Blockers #1, #2, #3, #4, #5 from the CUDA graphs checklist. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Seven-commit Gemma2 push landed this session. Compiler stays architecture-agnostic — the acid test passed for generality: zero "gemma" strings below the parser, the entire architecture fits in DSL body + JSON configs + library Impls. New sections: - "Next session starts here" rewritten around the Gemma2 state: coherent output on common prompts, known hard-fail on `test_cuda_correctness_gemma2_2b` prompt 7 (a translation prompt with apostrophe-quoted text), ruled-out causes listed (not CUDA graphs, not final softcap, not ferrite-vs-hand-written drift, not the embed-scale/sliding-flip bugs I already fixed). - "Commit history" grew a fresh block of 8 commits. - "Landed since the correctness fix" gained a full item vllm-project#7 for the Gemma2 port, enumerating each sub-piece and its commit. - Gap vllm-project#5 went from "sliding-window / Gemma2" to "Gemma3" (the logical next architecture using the same DSL constructs). - Gap vllm-project#9 repurposed from "per-layer golden diff harness (nice to have)" to "per-layer tensor-dump debug harness" — the concrete tool needed to unblock the prompt-7 failure. Spelled out the approach including the forward_backbone quirk (for Gemma2 it returns pre-softcap logits, not pre-lm_head hidden_states, because the DSL's terminal tile is `tanh_softcap`). - "Already landed" list gained nine new bullets covering the DSL constructs, the scalars mechanism, Gemma2 Impls, kernel `weight_offset` param, Gemma2 DSL + configs, and `Gemma2Ferrite` cuda_worker wiring. - Gap vllm-project#4 (Granite multipliers) updated: the DSL now expresses scalar multiply via the Gemma embed-scale plumbing. Honest accounting of the known-fail: correlation r=0.11 between our post-softcap logits and Python vLLM's on prompt 7. Divergence is upstream of softcap, specific to this prompt's numerics, and likely lives in a shared CUDA kernel (flash-attn softcap + sliding, or cuBLAS gemm on Gemma2's unusual head_dim ratios). Both ferrite and the pre-existing hand-written Gemma2 path exhibit the same failure — the fix, once found, applies to both. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Land the Silly Streaming microbench trail (four pre-run scripts + README) alongside the negative verdict from the gatekeeper probe (#1). - HW backend: nvCOMP 5.2 NVCOMP_DECOMPRESS_BACKEND_HARDWARE raises RuntimeError code=12 on GB10 — Blackwell HW-DE is datacenter-only (B100/B200), not exposed on SM120. - SW fallback: CUDA-backend Deflate on 1 GiB = 3.7 GB/s (uniform- random) / 12-13 GB/s (trivially compressible). NVFP4 weights would sit near the random number, still below the NVMe ceiling from #4. Scripts are preserved as evidence trail; they still reference the removed kvikio.nvcomp_codec.NvCompBatchCodec API — use nvidia.nvcomp directly if someone revives this on future HW. Long live silly stream. o7 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Section 4.3: Update single-request benchmarks with post-fix data (Config 3 TPOT 9.88ms->7.86ms, Config 4 TPOT 8.41ms->7.31ms) - Section 5.2.1: Rewrite from "pending fix" to "fixed and verified" with profile data comparing NCCL ops before/after fix - Section 5.2.2: Update concurrent TPOT worsening percentages with new single-request baselines - Section 5.3.2: Update single-request ITL comparison - Section 7: Add config3_fix profile directory and NCCL comparison table - Section 8: Update conclusion vllm-project#3 and vllm-project#4 with fix results Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Three layers of work, all driving toward "load real-world IQ-quantized GGUFs and run them coherently with quantized weights kept on GPU." GGUF quantized-storage default - Default load path now keeps GGUF weights compressed on GPU instead of dequant-to-BF16, saving ~2x weight memory. FERRITE_DEQUANT_AT_LOAD=1 opts back into the legacy dense path for debugging. - Fix the load-time host_buf race: pinned H2D for quantized weights was reusing host_buf for the next tensor before the previous DMA finished, leaving subsequent quantized weights corrupted on GPU. One-line stream_synchronize before the host_buf reuse. - Unpin Q6_K from max_ncols_y=1 (cargo cult, fixed by the sync above). - mul_mat_vec_q template: align rows_per_cuda_block with upstream (ncols_y < 4 → 1, else 2). IQ kernel coverage - ggml IQ1_M (codebook + dequant + MMVQ + dispatch). - Expanded IQ* support: IQ1_S / IQ2_XXS / IQ4_NL / IQ4_XS dequant + MMVQ kernels, sign-bit machinery shared across types (kmask_iq2xs / ksigns_iq2xs / unpack_ksigns / __vcmpne4 / __vsub4 / ggml_cuda_dp4a). - Add IQ2_XS (GgufDType tag 17): GgufDType + GgmlDType variants, the 512-entry iq2xs_grid codebook (uint64), block_iq2_xs struct, dequantize_block_iq2_xs (+ DEQUANTIZE_K macro), vec_dot_iq2_xs_q8_1, mul_mat_vec_iq2_xs_q8_1_cuda1, the three launch wrappers, extern "C" declarations and dispatch arms in ggml.rs. Mistral-7B-Instruct-v0.3 -IQ2_S.gguf (which mixes IQ2_XS for attention with IQ2_S for FFN) now loads and both `[ggml dispatch] first matmul via IQ2_XS` and `IQ3_S` notifiers fire during forward. - Batch IQ prefill: 56 new mul_mat_vec_iq*_q8_1_cuda2 through _cuda8 template instantiations across 8 IQ types, each launcher updated to switch on ncols_y exactly like Q4_K, max_ncols_y cap lifted from is_iq_quant() ? 1 : 8 to a flat 8. Reduces prefill kernel launches ~8x for IQ-family weights. Probe infrastructure (stays in the tree) - crates/ferrite-kernels/src/layers.rs has a ggml_probe module behind two env vars, off by default: - FERRITE_PROBE_MMVQ=1 — runs both the quantized kernel AND a dequant+cuBLAS F32 reference per GgmlLinear::forward, prints per-call max_abs_diff / rel / NaN counts. Used to isolate the host_buf race; now used to validate the cuda2..cuda8 IQ kernels. - FERRITE_USE_REFERENCE=1 — replaces ggml_matmul with the F32 cuBLAS reference at every linear. If inference becomes coherent under this flag, kernels are the bug; if it stays broken, look upstream. End-to-end coverage - Llama-3.2-1B chat tests for Q4_K_M / IQ1_S / IQ1_M / IQ2_XXS / IQ4_NL / IQ4_XS pass with FERRITE_PROBE_MMVQ=1 showing rel <= 2.9% at M=42 prefill (IQ4_NL <= 0.82%, IQ4_XS <= 1.1%, IQ1_M <= 0.81%, IQ1_S <= 2.9%, IQ2_XXS <= 1.3%); for every type the batched cuda2..cuda8 path is at least as accurate as the cuda1 baseline. - assert_coherent_text strengthened. - Mistral-7B IQ2_S / IQ3_S regression-marker tests added under #[ignore]. They go green when the tokenizer-from-GGUF fallback (vllm-serve/src/llm.rs:tokenize_text) is fixed — currently every prompt's ASCII bytes get used as token IDs because bartowski's GGUF-only repo ships no tokenizer.json, which is what the documented "x_max=0 from L0 q_proj" symptom actually was. Docs - HANDOFF_QUANTIZED.md captures the current state: items vllm-project#2 (IQ2_XS) and vllm-project#3 (IQ batched prefill) closed, item vllm-project#1 reclassified as a tokenizer-fallback bug (not a quant-path bug), item vllm-project#4 (IQ decode DMMV) reclassified do-not-port — upstream llama.cpp deleted ggml/src/ggml-cuda/dmmv.cu entirely; ik_llama.cpp only ports DMMV for its custom IQ*_KT types not the standard ones. - vllm-parity/parity.csv: GGUF IQ row updated to enumerate all 8 wired types with per-type rel numbers; new rows for "GGUF dequant kernels (IQ types)", "IQ decode DMMV (wontfix)", and "Tokenizer from GGUF metadata". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Each of the 16 full_attention layers in Qwen3.5-27B attaches its own PhaseE_Beta_Kernel instance with its own `self._compiled_phase_coop_full = None`, so `cute.compile()` fires once per layer on first request — 16 × ~23 s ≈ ~6 min cold-start stall. Fix: module-level `_PHASE_E_COOP_FULL_COMPILE_CACHE` keyed by the tuple of all 22 `self.` constexprs read inside `_jit_launch_phase_0_to_4` (audited via grep; key covers them all + 12 safe-redundant derived fields). Instances with matching config share one compiled kernel. Evidence (`benchmarks/nvllm/traces/phase_e_1/2026-04-24-coop-compile-cache/`): - 16 β-coop attachments → 1 compile event (was 16). - Cold Q1 = 79.4 s (compile + decode); warm Q2-Q8 = 22.7-23.2 s each. - Projected savings ≈ 310 s (~5 min) shaved off first-request latency. - GSM8K sanity PASS 7/8 (Q2 is a regex-extractor artifact on '120/12', not a kernel regression — reproduces on baseline without this fix). Unit tests (`tests/kernels/cute/test_phase_e_compile_cache.py`): - 6 new tests covering dict existence, key equivalence for matching configs, key distinctness for different configs, 16-instance → 1-compile behavior, distinct-config → N-compiles, and back-compat instance attr population. - 33/33 existing Phase E tests still pass. Next in Phase E.1: #3 record_function spans (this PR), #2 β-coop SMEM shrink + #4 matched-concurrency baseline bench (follow-up session), #5 cudaProfilerApi hook (infra). Base: 7bc5773 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The existing Phase E baseline leg ran concurrent=4 max_tokens=256 while β-lite ran concurrent=8 max_tokens=64 (per Caveat #1 in benchmarks/nvllm/traces/phase_e/2026-04-23-initial/summary.md). The per-kernel μs comparison wasn't apples-to-apples. This script re-captures a baseline leg (CUTE_PHASE_E_FUSION=0) at the same workload as the β-lite leg — num_seqs=8, concurrent=8, max_tokens=64, warmup=4, timed=5 — so β-lite vs baseline kernel-duration deltas can be read directly from the CSVs produced by extract_e2e_kernels.py. Mirrors the structure of capture_beta_only.sh (same profiler config, memory watchdog, readiness gate, CUPTI flush delay). Runs on the current nvllm:gb10 image; FUSION=0 bypasses all Phase E code paths so no rebuild is required for this leg. Output: benchmarks/nvllm/traces/phase_e_1/2026-04-24-baseline-matched/ Evidence bundle (summary.md + kernel CSV) lands in the follow-up session that ships E.1 #2 (β-coop SMEM shrink). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ched concurrency Matched-concurrency baseline (CUTE_PHASE_E_FUSION=0, num_seqs=8) vs existing β-lite leg. Same model, PIECEWISE, FP8 KV, active_iterations=200. Finding: Phase_D_MLP_Kernel fires 2× per full_attn layer per decode step in β-lite (n_calls=2016) vs 1× in baseline (n_calls=1008). Per-call MLP is 13.5% faster (90,408 vs 104,499 μs), but the 2× firing swamps the win. Net: +76,349 μs/layer/step, i.e. +62.8% slower per-full-attn-layer decode cost. Raises Phase E.1 #2 (β-coop SMEM shrink → num_seqs≥2) priority from "lower leverage if num_seqs=1 is 95%" to "regression fix for the user's steady-state workload." See memory updates for num_seqs=2 target. Extends .gitignore to mirror the phase_e/** policy to phase_e_1/** (raw .pt.trace.json.gz local-only; CSV + logs + md + txt + json committed) plus pre-ships phase_f/** rules for upcoming Phase F.1. Evidence bundle: benchmarks/nvllm/traces/phase_e_1/2026-04-24-baseline-matched/ ├── baseline_matched_kernels.csv (67 kernels, per-call + totals) ├── baseline_matched_serve.log (EngineCore — confirms FUSION=0) ├── baseline_matched_mem.log (host + docker mem watchdog) ├── profiler_out_0.txt └── summary.md (apples-to-apples comparison) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gate Spec + implementation plan for a coupled fix discovered when working through the β-lite 2× firing regression from E.1 #4: * **Phase E.2** — fresh-eyes spec-reviewer audit caught that both β kernels multiply by raw γ while Qwen3_5RMSNorm uses x * (1 + γ). Latent because the consume branch at qwen3_5.py:473 dead-branches under PIECEWISE (@support_torch_compile + fullgraph=True + skip_all_guards_unsafe), so β's wrong output is orphaned and the legacy path produces the actual answer. Fix: (1 + γ) at mlp_kernel.py:1502 and phase_e_kernel.py:641 (audit Phase 4 epilogue during implementation), plus a reference-diff harness cross-checking against Qwen3_5RMSNorm.forward_native. * **Phase F.1** — opaque-gate refactor (first slice of Phase F, FULL-graph enablement). Two new custom ops: cute_phase_e_dispatch (replaces qwen3_5.py:473-481 consume gate) and cute_phase_e_skip_input_layernorm (wraps qwen3_5.py:386 so layer N+1 skips its input_layernorm when β pre-applied it). Fail-loud on consume-branch errors; nested op calls into cute_mlp_forward on the NOT-consumed path. Audit also surfaced that Phase E's reported 51.7% speedup measures β kernel launch latency, not end-to-end decode wall time — the legacy full path fires alongside β due to the same dead-branch. Memory updated accordingly (project_phase_e_phantom_speedup, project_phase_e_beta_math_bug). True end-to-end numbers pending post-E.2+F.1 re-trace. Audit Finding #13 (flagged, deferred to Phase F.N): _fusion_active at qwen3_5.py:423,445 is the same bug class — Phase B/C attention fusion wins from project_cute_paged_bench may also be phantom. Files: docs/superpowers/specs/2026-04-24-phase-f1-opaque-gate-refactor-design.md docs/superpowers/plans/2026-04-24-phase-e2-f1-beta-correctness-opaque-gate.md Force-add past docs/superpowers/ gitignore rule (consistent with existing specs/plans which are grandfathered-in and tracked). Execution next session via superpowers:subagent-driven-development. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…+ Embedding) Tensor-parallel safetensors → GPU loaders matching Python vLLM's ColumnParallelLinear / RowParallelLinear / VocabParallelEmbedding weight_loader semantics. Used by codegen at tp>1 (task vllm-project#5 wires the dispatch from the macro side; this commit lands only the runtime helpers). Added on the kernel-side `LinearLayer` / `Linear` / `Embedding`: - `Linear::load_sharded(weights, prefix, dim, rank, world)`. The load-bearing bias rules: - `dim = 0` (column-parallel: q/k/v/gate/up/lm_head/embed): bias shards along dim 0 too — each rank holds its own slice. Mirrors Python `ColumnParallelLinear.weight_loader` → `loaded_weight.narrow(output_dim=0, …)`. - `dim = 1` (row-parallel: o_proj, down_proj): bias is **replicated full-size on rank 0 only**, `None` on other ranks. The forward path adds bias before the cross-rank AllReduce-sum; only rank 0's contribution survives the sum, giving exactly one bias add to the residual stream. Mirrors Python `RowParallelLinear.forward` line 1543 `bias_ = None if (self.tp_rank > 0 …) else self.bias`. - `LinearLayer::load_dense_sharded(weights, prefix, dim, rank, world)`. Thin wrapper over `Linear::load_sharded`. The codegen entry point for non-fused (single-prefix) sharded loads. - `LinearLayer::load_dense_concat_sharded(weights, prefixes, stream, rank, world)`. Sharded variant of `load_dense_concat` for the fused QKV / gate_up paths. Always column-parallel (no row-parallel concat exists in any current arch). Each source weight slices along dim 0 to `[out_i / world, hidden]` then packs into one contiguous `[(sum out_i) / world, hidden]` GPU buffer via per-source `take_shard_into`. Biases follow the column-parallel rule (sliced along dim 0) — matches Python `MergedColumnParallelLinear` / `QKVParallelLinear`. Per-rank divisibility is guaranteed by the macro's outer-loop fanout `skip` of indivisible (variant, tp) tuples (commit `889c44b2f`). - `Embedding::load_sharded(weights, prefix, rank, world)`. Vocab- parallel: slices the embedding table along dim 0 (`[vocab_size, hidden]` → `[vocab_size / world, hidden]`). Mirrors Python `VocabParallelEmbedding`. Same dim-0 cut as `Linear::load_sharded(dim=0)` — that's what makes `tie_weights(lm_head.weight = embed_tokens.weight)` self-consistent at tp>1. Defers FP8 / Marlin / BNB sharded variants — the verify model (commandr) is dense bf16. World == 1 short-circuits to byte-equivalent behavior with the existing unsharded paths in every helper, plus shard-kind-aware bias rules. No tests added at this layer (CUDA stream + safetensors fixtures aren't worth the infra spend; the integration test is task vllm-project#8). `take_shard` / `take_shard_into` on `GpuWeights` already exist (used by the prior hand-written TP path); these wrappers are pure call-site plumbing on top. Build clean: ferrite-kernels checks + clippy at default features. The macro-side consumer that chooses sharded vs unsharded based on shard_kind comes in tasks vllm-project#4 + vllm-project#5; until then these helpers have no runtime caller (intentionally — wholesale codegen migration per the no_piecemeal_codegen_migration rule). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous debug commit (`aabb517a9`) printed bindings for ICB compute commands but skipped the Gemm path (which goes through `resolve_gemm_buffers` rather than `record_compute_dispatch`). This left bug vllm-project#4 — the MPS-side `offset: 0u64` hardcoding fixed in the previous commit — invisible to the diagnostic. Add the matching print for `BucketStep::Gemm`: kernel/dims and `(buf, off)` for each of a / b / c. Drop with the rest of `aabb517a9`'s instrumentation when the metal correctness work is merged-ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…oject#4 still open Update FERRITE_METAL_PROGRESS.md to reflect the third bug fix this session: `MPSMatrix.initWithBuffer:offset:` was hardcoded to 0, silently undoing the per-binding offsets the lowering + worker had threaded all the way to the Gemm encoder. Caught by extending `FERRITE_METAL_BAKE_DEBUG=1` to dump Gemm bindings — the offset discrepancy fell out as soon as we could see the bound `(buf, off)` vs MPS' offset=0 invariant. Post-fix the model output is finally a function of the input: "hi" samples token 29966 (`<`) and continues `<assistant>\n…`; different prompts produce different sequences. But the model still loops on chat-template-like tokens — at least one more bug remains, likely in attention math, RoPE, or kernel arithmetic at runtime conditions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a Phase 5.K section documenting this session's startup-time work: 16 commits between 79da7db and f9ae0df, dropping warm init engine from 2.91s to 240ms on Llama-3.2-3B (~20% under vllm-mlx's 300ms baseline). Also closes Phase 5.J's bug vllm-project#4 (bf16 KV slot-0 race) which landed in 3658f2b, and updates the Notes block. Cross-references project_metal_unpacked_mlp_next.md for the structural follow-up (real MetalMulImpl + Instruction::ElementwiseMul) that would eliminate the remaining ~200ms gate_up pack memcpy and drop startup to ~50ms. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

WoosukKwon added 8 commits March 2, 2023 04:20

Add a FlashAttention test

5685bac

Define MAX_SEQ_LEN

320b20c

Minor

4932c71

Use FlashAttention for multi_query_kv_attention

c302754

Add more head sizes for test

95e5c0f

Add error msgs

1c28f4f

Enhance the server script

eb9e9a0

Add flash-attn to README

72052b7

WoosukKwon merged commit 3e9f991 into main Mar 2, 2023

WoosukKwon deleted the flash-attn branch March 2, 2023 05:13

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

tmm1 mentioned this pull request Aug 3, 2023

Fix the rushed out multi-query kernel #44

Closed

xiangyuT added a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023

Add BigDL Llama worker for batching on decoding (vllm-project#4)

02b4cac

* Init * refine

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang referenced this pull request in hongxiayang/vllm Feb 13, 2024

Use FlashAttention for multi_query_kv_attention (#4)

e7c912b

luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 12, 2024

Merge pull request vllm-project#4 from slyalin/optimum_models

8a9862f

Support for optimum-intel models

luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024

Merge pull request vllm-project#4 from luo-cheng2021/luocheng/openvin…

658407a

…o-model-executor Adapt OpenVINO CPU plugin implementation

dlopes78 mentioned this pull request May 8, 2024

[Bug]: VLLM + tritonserver #4695

Closed

linxihui added a commit to linxihui/vllm that referenced this pull request May 14, 2024

Merge pull request vllm-project#4 from beagleski/bapatra/patching-for-su

7646e00

patching for having type su

Alexei-V-Ivanov-AMD mentioned this pull request May 16, 2024

[Speculative decoding][Re-take] Enable TP>1 speculative decoding #4840

Merged

afeldman-nm mentioned this pull request May 21, 2024

[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) #4837

Merged

yuhuixu1993 mentioned this pull request Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

afeldman-nm mentioned this pull request Jun 3, 2024

[Bug]: VLLM_ATTENTION_BACKEND set to ROCM_FLASH only in GHA environment, overriding automatic backend selection; this breaks other kernel unit tests. #5208

Closed

oliver-li mentioned this pull request Jul 5, 2024

[Bug]: NCCL hangs and causes timeout #5484

Closed

haichuan1221 mentioned this pull request Jul 5, 2024

Support W4A8 quantization for vllm #5218

Merged

Isotr0py mentioned this pull request Feb 25, 2026

[Bug]: Large Video Request cause vLLM Progress Core Dump #35285

Open

1 task

JGSweets mentioned this pull request Mar 9, 2026

[Bug]: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. #28028

Open

1 task

IWantFight pushed a commit to IWantFight/vllm that referenced this pull request Mar 11, 2026

Merge pull request vllm-project#4 from IWantFight/bf_ant_group

825cc98

fix te error

LironKesem mentioned this pull request Mar 12, 2026

[Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries #36835

Closed

1 task

lavanyabollepalli mentioned this pull request Mar 12, 2026

[Bug]: GPU failure during repeated model loading when using --enable-prefix-caching with KV transfer (LMCacheConnectorV1) #36852

Open

1 task

mahaocong90 mentioned this pull request Mar 17, 2026

[Bug]: QWEN 3.5-397B-A17B report "RPC call to sample_tokens timed out" #37250

Closed

1 task

watch-Ultra mentioned this pull request Mar 18, 2026

[Bug]:推理时报错，模型关闭了。部署的Qwen3.5-122B-A10B-FP8模型 #37392

Open

1 task

RocketRider mentioned this pull request Mar 21, 2026

Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 #37431

Open

Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026

Merge pull request vllm-project#4 from bcacdwk/copilot/update-markdow…

8b8201c

…n-files-check Fix vLLM framework documentation to match actual source code structure

omerpaz95 referenced this pull request in omerpaz95/vllm Apr 7, 2026

Round #4 for Or's review fixes.

fe5305b

Signed-off-by: omerpaz95 <omerpaz95@gmail.com>

starpit added a commit to starpit/vllm that referenced this pull request Apr 16, 2026

HANDOFF: Granite landed — 4 goldens green, gap vllm-project#4 closed

49b2471

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

SongXiaoMao mentioned this pull request May 13, 2026

[Bug]: MTP speculative decoding crash with illegal memory access on long sequences (Qwen3.6-27B-FP8, v0.19.1) #40756

Open

1 task

DoradusResearch mentioned this pull request May 13, 2026

cumem allocator: double-free and stale error codes during sleep/wake cycles #36651

Open

1 task

maeehart mentioned this pull request May 18, 2026

[ROCm][DSv4] Functional fixes for DeepSeek V4 on MI300X (gfx942) #42893

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use FlashAttention for `multi_query_kv_attention`#4

Use FlashAttention for `multi_query_kv_attention`#4
WoosukKwon merged 8 commits into
mainfrom
flash-attn

WoosukKwon commented Mar 2, 2023 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

WoosukKwon commented Mar 2, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pros

Cons

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

WoosukKwon commented Mar 2, 2023 •

edited

Loading