Use FlashAttention for multi_query_kv_attention#4
Merged
Conversation
xiangyuT
added a commit
to xiangyuT/vllm
that referenced
this pull request
Oct 24, 2023
hongxiayang
referenced
this pull request
in hongxiayang/vllm
Feb 13, 2024
luo-cheng2021
pushed a commit
to luo-cheng2021/vllm
that referenced
this pull request
Mar 12, 2024
Support for optimum-intel models
luo-cheng2021
pushed a commit
to luo-cheng2021/vllm
that referenced
this pull request
Mar 25, 2024
…o-model-executor Adapt OpenVINO CPU plugin implementation
mzusman
pushed a commit
to mzusman/vllm
that referenced
this pull request
Apr 16, 2024
BA-78760: Jamba * Add support for n concat and splitting * change naming * input_metadata is a dict list now in order to pass "n" * clean up code from unecessary changes and prints * Remove kv cache allocation in case of mamba layer * Add the considerations of mamba layer cache into the num of blocks calculation * Delete mamba cache after profile * Remove prints * Cleaning * - and not _ for requirements Approved-by: Tomer Asida
linxihui
added a commit
to linxihui/vllm
that referenced
this pull request
May 14, 2024
patching for having type su
yukavio
pushed a commit
to yukavio/vllm
that referenced
this pull request
Jul 3, 2024
…ect#4 magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new sparsity config for 2:4 sparsity to neuralmagic-vllm, using the SparseTensor 2:4 support. This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.
yukavio
pushed a commit
to yukavio/vllm
that referenced
this pull request
Jul 3, 2024
…ect#4 magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new sparsity config for 2:4 sparsity to neuralmagic-vllm, using the SparseTensor 2:4 support. This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.
1 task
IWantFight
pushed a commit
to IWantFight/vllm
that referenced
this pull request
Mar 11, 2026
1 task
1 task
1 task
1 task
khairulkabir1661
pushed a commit
to khairulkabir1661/vllm
that referenced
this pull request
Mar 26, 2026
## Summary Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately. **Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743) ## Cherry-picked commits (chronological order) | # | Upstream PR | Jira | Summary | |---|------------|------|---------| | 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models | | 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend | | 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) | | 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) | | 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models | | 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support | | 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) | | 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint | ## Conflict resolutions <details> <summary><b>#1 — llama-nemotron-embed / score-template support (vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary> Applied cleanly onto `rhai/0.13.0`. </details> <details> <summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no conflicts</summary> Applied cleanly onto `rhai/0.13.0`. </details> <details> <summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no conflicts</summary> Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block. </details> <details> <summary><b>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no conflicts</summary> Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes. </details> <details> <summary><b>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2 files</summary> **`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix: - `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result - `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask` **`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes). </details> <details> <summary><b>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict due to file rename + API change</summary> Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream. </details> <details> <summary><b>vllm-project#7 — response_format validation for completions (vllm-project#35456)</b>: Conflict due to file restructuring</summary> Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by: - Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py` - Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py` - Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`) - Test additions from upstream applied cleanly to `test_completion_error.py` </details> <details> <summary><b>vllm-project#8 — response_format validation for chat completions (vllm-project#35510)</b>: Conflict due to file restructuring</summary> Same file restructuring issue as vllm-project#6. Resolved by: - Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py` - Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py` - Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch) </details> ## Test plan - [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support - [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization - [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content - [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow - [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back - [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions` - [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm [INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Damon-Salvetore
pushed a commit
to Damon-Salvetore/vllm
that referenced
this pull request
Mar 31, 2026
…n-files-check Fix vLLM framework documentation to match actual source code structure
omerpaz95
referenced
this pull request
in omerpaz95/vllm
Apr 7, 2026
Signed-off-by: omerpaz95 <omerpaz95@gmail.com>
Natfii
referenced
this pull request
in Navi-AI-Lab/nvllm
Apr 14, 2026
…_like, .contiguous() DecodeKernel.__call__ changes: - query: assert contiguity instead of .contiguous() copy - output: accept caller-provided persistent buffer - wo_global_scale: pass tensor pointer, not .item() float - grid.z: use padded_num_seqs for stable graph capture - gate_ptr/gate_fused: new kernel args for output gate fusion Blockers #1, #2, #3, #4, #5 from the CUDA graphs checklist. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 16, 2026
Seven-commit Gemma2 push landed this session. Compiler stays architecture-agnostic — the acid test passed for generality: zero "gemma" strings below the parser, the entire architecture fits in DSL body + JSON configs + library Impls. New sections: - "Next session starts here" rewritten around the Gemma2 state: coherent output on common prompts, known hard-fail on `test_cuda_correctness_gemma2_2b` prompt 7 (a translation prompt with apostrophe-quoted text), ruled-out causes listed (not CUDA graphs, not final softcap, not ferrite-vs-hand-written drift, not the embed-scale/sliding-flip bugs I already fixed). - "Commit history" grew a fresh block of 8 commits. - "Landed since the correctness fix" gained a full item vllm-project#7 for the Gemma2 port, enumerating each sub-piece and its commit. - Gap vllm-project#5 went from "sliding-window / Gemma2" to "Gemma3" (the logical next architecture using the same DSL constructs). - Gap vllm-project#9 repurposed from "per-layer golden diff harness (nice to have)" to "per-layer tensor-dump debug harness" — the concrete tool needed to unblock the prompt-7 failure. Spelled out the approach including the forward_backbone quirk (for Gemma2 it returns pre-softcap logits, not pre-lm_head hidden_states, because the DSL's terminal tile is `tanh_softcap`). - "Already landed" list gained nine new bullets covering the DSL constructs, the scalars mechanism, Gemma2 Impls, kernel `weight_offset` param, Gemma2 DSL + configs, and `Gemma2Ferrite` cuda_worker wiring. - Gap vllm-project#4 (Granite multipliers) updated: the DSL now expresses scalar multiply via the Gemma embed-scale plumbing. Honest accounting of the known-fail: correlation r=0.11 between our post-softcap logits and Python vLLM's on prompt 7. Divergence is upstream of softcap, specific to this prompt's numerics, and likely lives in a shared CUDA kernel (flash-attn softcap + sliding, or cuBLAS gemm on Gemma2's unusual head_dim ratios). Both ferrite and the pre-existing hand-written Gemma2 path exhibit the same failure — the fix, once found, applies to both. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 16, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Natfii
referenced
this pull request
in Navi-AI-Lab/nvllm
Apr 20, 2026
Land the Silly Streaming microbench trail (four pre-run scripts + README) alongside the negative verdict from the gatekeeper probe (#1). - HW backend: nvCOMP 5.2 NVCOMP_DECOMPRESS_BACKEND_HARDWARE raises RuntimeError code=12 on GB10 — Blackwell HW-DE is datacenter-only (B100/B200), not exposed on SM120. - SW fallback: CUDA-backend Deflate on 1 GiB = 3.7 GB/s (uniform- random) / 12-13 GB/s (trivially compressible). NVFP4 weights would sit near the random number, still below the NVMe ceiling from #4. Scripts are preserved as evidence trail; they still reference the removed kvikio.nvcomp_codec.NvCompBatchCodec API — use nvidia.nvcomp directly if someone revives this on future HW. Long live silly stream. o7 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii
referenced
this pull request
in Navi-AI-Lab/nvllm
Apr 22, 2026
Land the Silly Streaming microbench trail (four pre-run scripts + README) alongside the negative verdict from the gatekeeper probe (#1). - HW backend: nvCOMP 5.2 NVCOMP_DECOMPRESS_BACKEND_HARDWARE raises RuntimeError code=12 on GB10 — Blackwell HW-DE is datacenter-only (B100/B200), not exposed on SM120. - SW fallback: CUDA-backend Deflate on 1 GiB = 3.7 GB/s (uniform- random) / 12-13 GB/s (trivially compressible). NVFP4 weights would sit near the random number, still below the NVMe ceiling from #4. Scripts are preserved as evidence trail; they still reference the removed kvikio.nvcomp_codec.NvCompBatchCodec API — use nvidia.nvcomp directly if someone revives this on future HW. Long live silly stream. o7 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jianzs
pushed a commit
to jianzs/vllm
that referenced
this pull request
Apr 23, 2026
- Section 4.3: Update single-request benchmarks with post-fix data (Config 3 TPOT 9.88ms->7.86ms, Config 4 TPOT 8.41ms->7.31ms) - Section 5.2.1: Rewrite from "pending fix" to "fixed and verified" with profile data comparing NCCL ops before/after fix - Section 5.2.2: Update concurrent TPOT worsening percentages with new single-request baselines - Section 5.3.2: Update single-request ITL comparison - Section 7: Add config3_fix profile directory and NCCL comparison table - Section 8: Update conclusion vllm-project#3 and vllm-project#4 with fix results Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 25, 2026
Three layers of work, all driving toward "load real-world IQ-quantized
GGUFs and run them coherently with quantized weights kept on GPU."
GGUF quantized-storage default
- Default load path now keeps GGUF weights compressed on GPU instead
of dequant-to-BF16, saving ~2x weight memory. FERRITE_DEQUANT_AT_LOAD=1
opts back into the legacy dense path for debugging.
- Fix the load-time host_buf race: pinned H2D for quantized weights
was reusing host_buf for the next tensor before the previous DMA
finished, leaving subsequent quantized weights corrupted on GPU.
One-line stream_synchronize before the host_buf reuse.
- Unpin Q6_K from max_ncols_y=1 (cargo cult, fixed by the sync above).
- mul_mat_vec_q template: align rows_per_cuda_block with upstream
(ncols_y < 4 → 1, else 2).
IQ kernel coverage
- ggml IQ1_M (codebook + dequant + MMVQ + dispatch).
- Expanded IQ* support: IQ1_S / IQ2_XXS / IQ4_NL / IQ4_XS dequant +
MMVQ kernels, sign-bit machinery shared across types
(kmask_iq2xs / ksigns_iq2xs / unpack_ksigns / __vcmpne4 / __vsub4
/ ggml_cuda_dp4a).
- Add IQ2_XS (GgufDType tag 17): GgufDType + GgmlDType variants, the
512-entry iq2xs_grid codebook (uint64), block_iq2_xs struct,
dequantize_block_iq2_xs (+ DEQUANTIZE_K macro), vec_dot_iq2_xs_q8_1,
mul_mat_vec_iq2_xs_q8_1_cuda1, the three launch wrappers, extern "C"
declarations and dispatch arms in ggml.rs. Mistral-7B-Instruct-v0.3
-IQ2_S.gguf (which mixes IQ2_XS for attention with IQ2_S for FFN)
now loads and both `[ggml dispatch] first matmul via IQ2_XS` and
`IQ3_S` notifiers fire during forward.
- Batch IQ prefill: 56 new mul_mat_vec_iq*_q8_1_cuda2 through _cuda8
template instantiations across 8 IQ types, each launcher updated to
switch on ncols_y exactly like Q4_K, max_ncols_y cap lifted from
is_iq_quant() ? 1 : 8 to a flat 8. Reduces prefill kernel launches
~8x for IQ-family weights.
Probe infrastructure (stays in the tree)
- crates/ferrite-kernels/src/layers.rs has a ggml_probe module behind
two env vars, off by default:
- FERRITE_PROBE_MMVQ=1 — runs both the quantized kernel AND a
dequant+cuBLAS F32 reference per GgmlLinear::forward, prints
per-call max_abs_diff / rel / NaN counts. Used to isolate the
host_buf race; now used to validate the cuda2..cuda8 IQ kernels.
- FERRITE_USE_REFERENCE=1 — replaces ggml_matmul with the F32
cuBLAS reference at every linear. If inference becomes coherent
under this flag, kernels are the bug; if it stays broken, look
upstream.
End-to-end coverage
- Llama-3.2-1B chat tests for Q4_K_M / IQ1_S / IQ1_M / IQ2_XXS /
IQ4_NL / IQ4_XS pass with FERRITE_PROBE_MMVQ=1 showing rel <= 2.9%
at M=42 prefill (IQ4_NL <= 0.82%, IQ4_XS <= 1.1%, IQ1_M <= 0.81%,
IQ1_S <= 2.9%, IQ2_XXS <= 1.3%); for every type the batched
cuda2..cuda8 path is at least as accurate as the cuda1 baseline.
- assert_coherent_text strengthened.
- Mistral-7B IQ2_S / IQ3_S regression-marker tests added under
#[ignore]. They go green when the tokenizer-from-GGUF fallback
(vllm-serve/src/llm.rs:tokenize_text) is fixed — currently every
prompt's ASCII bytes get used as token IDs because bartowski's
GGUF-only repo ships no tokenizer.json, which is what the
documented "x_max=0 from L0 q_proj" symptom actually was.
Docs
- HANDOFF_QUANTIZED.md captures the current state: items vllm-project#2 (IQ2_XS)
and vllm-project#3 (IQ batched prefill) closed, item vllm-project#1 reclassified as a
tokenizer-fallback bug (not a quant-path bug), item vllm-project#4 (IQ decode
DMMV) reclassified do-not-port — upstream llama.cpp deleted
ggml/src/ggml-cuda/dmmv.cu entirely; ik_llama.cpp only ports DMMV
for its custom IQ*_KT types not the standard ones.
- vllm-parity/parity.csv: GGUF IQ row updated to enumerate all 8
wired types with per-type rel numbers; new rows for "GGUF dequant
kernels (IQ types)", "IQ decode DMMV (wontfix)", and "Tokenizer
from GGUF metadata".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii
referenced
this pull request
in Navi-AI-Lab/nvllm
Apr 25, 2026
Each of the 16 full_attention layers in Qwen3.5-27B attaches its own PhaseE_Beta_Kernel instance with its own `self._compiled_phase_coop_full = None`, so `cute.compile()` fires once per layer on first request — 16 × ~23 s ≈ ~6 min cold-start stall. Fix: module-level `_PHASE_E_COOP_FULL_COMPILE_CACHE` keyed by the tuple of all 22 `self.` constexprs read inside `_jit_launch_phase_0_to_4` (audited via grep; key covers them all + 12 safe-redundant derived fields). Instances with matching config share one compiled kernel. Evidence (`benchmarks/nvllm/traces/phase_e_1/2026-04-24-coop-compile-cache/`): - 16 β-coop attachments → 1 compile event (was 16). - Cold Q1 = 79.4 s (compile + decode); warm Q2-Q8 = 22.7-23.2 s each. - Projected savings ≈ 310 s (~5 min) shaved off first-request latency. - GSM8K sanity PASS 7/8 (Q2 is a regex-extractor artifact on '120/12', not a kernel regression — reproduces on baseline without this fix). Unit tests (`tests/kernels/cute/test_phase_e_compile_cache.py`): - 6 new tests covering dict existence, key equivalence for matching configs, key distinctness for different configs, 16-instance → 1-compile behavior, distinct-config → N-compiles, and back-compat instance attr population. - 33/33 existing Phase E tests still pass. Next in Phase E.1: #3 record_function spans (this PR), #2 β-coop SMEM shrink + #4 matched-concurrency baseline bench (follow-up session), #5 cudaProfilerApi hook (infra). Base: 7bc5773 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii
referenced
this pull request
in Navi-AI-Lab/nvllm
Apr 25, 2026
The existing Phase E baseline leg ran concurrent=4 max_tokens=256 while β-lite ran concurrent=8 max_tokens=64 (per Caveat #1 in benchmarks/nvllm/traces/phase_e/2026-04-23-initial/summary.md). The per-kernel μs comparison wasn't apples-to-apples. This script re-captures a baseline leg (CUTE_PHASE_E_FUSION=0) at the same workload as the β-lite leg — num_seqs=8, concurrent=8, max_tokens=64, warmup=4, timed=5 — so β-lite vs baseline kernel-duration deltas can be read directly from the CSVs produced by extract_e2e_kernels.py. Mirrors the structure of capture_beta_only.sh (same profiler config, memory watchdog, readiness gate, CUPTI flush delay). Runs on the current nvllm:gb10 image; FUSION=0 bypasses all Phase E code paths so no rebuild is required for this leg. Output: benchmarks/nvllm/traces/phase_e_1/2026-04-24-baseline-matched/ Evidence bundle (summary.md + kernel CSV) lands in the follow-up session that ships E.1 #2 (β-coop SMEM shrink). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii
referenced
this pull request
in Navi-AI-Lab/nvllm
Apr 25, 2026
…ched concurrency Matched-concurrency baseline (CUTE_PHASE_E_FUSION=0, num_seqs=8) vs existing β-lite leg. Same model, PIECEWISE, FP8 KV, active_iterations=200. Finding: Phase_D_MLP_Kernel fires 2× per full_attn layer per decode step in β-lite (n_calls=2016) vs 1× in baseline (n_calls=1008). Per-call MLP is 13.5% faster (90,408 vs 104,499 μs), but the 2× firing swamps the win. Net: +76,349 μs/layer/step, i.e. +62.8% slower per-full-attn-layer decode cost. Raises Phase E.1 #2 (β-coop SMEM shrink → num_seqs≥2) priority from "lower leverage if num_seqs=1 is 95%" to "regression fix for the user's steady-state workload." See memory updates for num_seqs=2 target. Extends .gitignore to mirror the phase_e/** policy to phase_e_1/** (raw .pt.trace.json.gz local-only; CSV + logs + md + txt + json committed) plus pre-ships phase_f/** rules for upcoming Phase F.1. Evidence bundle: benchmarks/nvllm/traces/phase_e_1/2026-04-24-baseline-matched/ ├── baseline_matched_kernels.csv (67 kernels, per-call + totals) ├── baseline_matched_serve.log (EngineCore — confirms FUSION=0) ├── baseline_matched_mem.log (host + docker mem watchdog) ├── profiler_out_0.txt └── summary.md (apples-to-apples comparison) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii
referenced
this pull request
in Navi-AI-Lab/nvllm
Apr 25, 2026
…gate Spec + implementation plan for a coupled fix discovered when working through the β-lite 2× firing regression from E.1 #4: * **Phase E.2** — fresh-eyes spec-reviewer audit caught that both β kernels multiply by raw γ while Qwen3_5RMSNorm uses x * (1 + γ). Latent because the consume branch at qwen3_5.py:473 dead-branches under PIECEWISE (@support_torch_compile + fullgraph=True + skip_all_guards_unsafe), so β's wrong output is orphaned and the legacy path produces the actual answer. Fix: (1 + γ) at mlp_kernel.py:1502 and phase_e_kernel.py:641 (audit Phase 4 epilogue during implementation), plus a reference-diff harness cross-checking against Qwen3_5RMSNorm.forward_native. * **Phase F.1** — opaque-gate refactor (first slice of Phase F, FULL-graph enablement). Two new custom ops: cute_phase_e_dispatch (replaces qwen3_5.py:473-481 consume gate) and cute_phase_e_skip_input_layernorm (wraps qwen3_5.py:386 so layer N+1 skips its input_layernorm when β pre-applied it). Fail-loud on consume-branch errors; nested op calls into cute_mlp_forward on the NOT-consumed path. Audit also surfaced that Phase E's reported 51.7% speedup measures β kernel launch latency, not end-to-end decode wall time — the legacy full path fires alongside β due to the same dead-branch. Memory updated accordingly (project_phase_e_phantom_speedup, project_phase_e_beta_math_bug). True end-to-end numbers pending post-E.2+F.1 re-trace. Audit Finding #13 (flagged, deferred to Phase F.N): _fusion_active at qwen3_5.py:423,445 is the same bug class — Phase B/C attention fusion wins from project_cute_paged_bench may also be phantom. Files: docs/superpowers/specs/2026-04-24-phase-f1-opaque-gate-refactor-design.md docs/superpowers/plans/2026-04-24-phase-e2-f1-beta-correctness-opaque-gate.md Force-add past docs/superpowers/ gitignore rule (consistent with existing specs/plans which are grandfathered-in and tracked). Execution next session via superpowers:subagent-driven-development. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
…+ Embedding) Tensor-parallel safetensors → GPU loaders matching Python vLLM's ColumnParallelLinear / RowParallelLinear / VocabParallelEmbedding weight_loader semantics. Used by codegen at tp>1 (task vllm-project#5 wires the dispatch from the macro side; this commit lands only the runtime helpers). Added on the kernel-side `LinearLayer` / `Linear` / `Embedding`: - `Linear::load_sharded(weights, prefix, dim, rank, world)`. The load-bearing bias rules: - `dim = 0` (column-parallel: q/k/v/gate/up/lm_head/embed): bias shards along dim 0 too — each rank holds its own slice. Mirrors Python `ColumnParallelLinear.weight_loader` → `loaded_weight.narrow(output_dim=0, …)`. - `dim = 1` (row-parallel: o_proj, down_proj): bias is **replicated full-size on rank 0 only**, `None` on other ranks. The forward path adds bias before the cross-rank AllReduce-sum; only rank 0's contribution survives the sum, giving exactly one bias add to the residual stream. Mirrors Python `RowParallelLinear.forward` line 1543 `bias_ = None if (self.tp_rank > 0 …) else self.bias`. - `LinearLayer::load_dense_sharded(weights, prefix, dim, rank, world)`. Thin wrapper over `Linear::load_sharded`. The codegen entry point for non-fused (single-prefix) sharded loads. - `LinearLayer::load_dense_concat_sharded(weights, prefixes, stream, rank, world)`. Sharded variant of `load_dense_concat` for the fused QKV / gate_up paths. Always column-parallel (no row-parallel concat exists in any current arch). Each source weight slices along dim 0 to `[out_i / world, hidden]` then packs into one contiguous `[(sum out_i) / world, hidden]` GPU buffer via per-source `take_shard_into`. Biases follow the column-parallel rule (sliced along dim 0) — matches Python `MergedColumnParallelLinear` / `QKVParallelLinear`. Per-rank divisibility is guaranteed by the macro's outer-loop fanout `skip` of indivisible (variant, tp) tuples (commit `889c44b2f`). - `Embedding::load_sharded(weights, prefix, rank, world)`. Vocab- parallel: slices the embedding table along dim 0 (`[vocab_size, hidden]` → `[vocab_size / world, hidden]`). Mirrors Python `VocabParallelEmbedding`. Same dim-0 cut as `Linear::load_sharded(dim=0)` — that's what makes `tie_weights(lm_head.weight = embed_tokens.weight)` self-consistent at tp>1. Defers FP8 / Marlin / BNB sharded variants — the verify model (commandr) is dense bf16. World == 1 short-circuits to byte-equivalent behavior with the existing unsharded paths in every helper, plus shard-kind-aware bias rules. No tests added at this layer (CUDA stream + safetensors fixtures aren't worth the infra spend; the integration test is task vllm-project#8). `take_shard` / `take_shard_into` on `GpuWeights` already exist (used by the prior hand-written TP path); these wrappers are pure call-site plumbing on top. Build clean: ferrite-kernels checks + clippy at default features. The macro-side consumer that chooses sharded vs unsharded based on shard_kind comes in tasks vllm-project#4 + vllm-project#5; until then these helpers have no runtime caller (intentionally — wholesale codegen migration per the no_piecemeal_codegen_migration rule). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
May 9, 2026
The previous debug commit (`aabb517a9`) printed bindings for ICB compute commands but skipped the Gemm path (which goes through `resolve_gemm_buffers` rather than `record_compute_dispatch`). This left bug vllm-project#4 — the MPS-side `offset: 0u64` hardcoding fixed in the previous commit — invisible to the diagnostic. Add the matching print for `BucketStep::Gemm`: kernel/dims and `(buf, off)` for each of a / b / c. Drop with the rest of `aabb517a9`'s instrumentation when the metal correctness work is merged-ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
May 9, 2026
…oject#4 still open Update FERRITE_METAL_PROGRESS.md to reflect the third bug fix this session: `MPSMatrix.initWithBuffer:offset:` was hardcoded to 0, silently undoing the per-binding offsets the lowering + worker had threaded all the way to the Gemm encoder. Caught by extending `FERRITE_METAL_BAKE_DEBUG=1` to dump Gemm bindings — the offset discrepancy fell out as soon as we could see the bound `(buf, off)` vs MPS' offset=0 invariant. Post-fix the model output is finally a function of the input: "hi" samples token 29966 (`<`) and continues `<assistant>\n…`; different prompts produce different sequences. But the model still loops on chat-template-like tokens — at least one more bug remains, likely in attention math, RoPE, or kernel arithmetic at runtime conditions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
May 9, 2026
Adds a Phase 5.K section documenting this session's startup-time work: 16 commits between 79da7db and f9ae0df, dropping warm init engine from 2.91s to 240ms on Llama-3.2-3B (~20% under vllm-mlx's 300ms baseline). Also closes Phase 5.J's bug vllm-project#4 (bf16 KV slot-0 race) which landed in 3658f2b, and updates the Notes block. Cross-references project_metal_unpacked_mlp_next.md for the structural follow-up (real MetalMulImpl + Instruction::ElementwiseMul) that would eliminate the remaining ~200ms gate_up pack memcpy and drop startup to ~50ms. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 task
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR is to use FlashAttention kernels for
multi_query_kv_attention, which performs masked attention for the prompt inputs.Pros
Cons
Besides, note that FlashAttention does not support cached KV, which is required for interactive generation.
Tested models:
Tested GPUs: