Skip to content

Use FlashAttention for multi_query_kv_attention#4

Merged
WoosukKwon merged 8 commits into
mainfrom
flash-attn
Mar 2, 2023
Merged

Use FlashAttention for multi_query_kv_attention#4
WoosukKwon merged 8 commits into
mainfrom
flash-attn

Conversation

@WoosukKwon
Copy link
Copy Markdown
Collaborator

@WoosukKwon WoosukKwon commented Mar 2, 2023

This PR is to use FlashAttention kernels for multi_query_kv_attention, which performs masked attention for the prompt inputs.

Pros

  • FlashAttention is fast and memory-efficient.
  • FlashAttention supports 1D inputs and only invokes a single kernel to handle multiple sequences with variable lengths.

Cons

Besides, note that FlashAttention does not support cached KV, which is required for interactive generation.

Tested models:

  • OPT-125M
  • OPT-350M
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-13B

Tested GPUs:

  • A100

@WoosukKwon WoosukKwon merged commit 3e9f991 into main Mar 2, 2023
@WoosukKwon WoosukKwon deleted the flash-attn branch March 2, 2023 05:13
xiangyuT added a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 12, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024
…o-model-executor

Adapt OpenVINO CPU plugin implementation
mzusman pushed a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
BA-78760: Jamba

* Add support for n concat and splitting

* change naming

* input_metadata is a dict list now in order to pass "n"

* clean up code from unecessary changes and prints

* Remove kv cache allocation in case of mamba layer

* Add the considerations of mamba layer cache into the num of blocks
calculation

* Delete mamba cache after profile

* Remove prints

* Cleaning

* - and not _ for requirements

Approved-by: Tomer Asida
linxihui added a commit to linxihui/vllm that referenced this pull request May 14, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
…ect#4

magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new sparsity config for 2:4 sparsity to neuralmagic-vllm, using the SparseTensor 2:4 support.

This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
…ect#4

magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new sparsity config for 2:4 sparsity to neuralmagic-vllm, using the SparseTensor 2:4 support.

This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.
IWantFight pushed a commit to IWantFight/vllm that referenced this pull request Mar 11, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 26, 2026
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All
fixes are from upstream vLLM `main` and address critical bugs affecting
RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:**
[INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) |
[INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) |
Support using chat template as custom score template for reranking
models |
| 2 | [vllm-project#31406](vllm-project#31406) |
[INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add
encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) |
[INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix
Llama-4 attn quantization by correctly permuting scales for rope (int8,
fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) |
[INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix
structured output in multi-turn GPT-OSS (content:null with json_object)
|
| 5 | [vllm-project#34507](vllm-project#34507) |
[INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix
fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) |
[INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) |
Gracefully disable AllReduceFusionPass on GPUs without multicast support
|
| 7 | [vllm-project#35456](vllm-project#35456) |
[INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) |
Replace assert with ValueError for response_format validation
(completions) |
| 8 | [vllm-project#35510](vllm-project#35510) |
[INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add
response_format validation to chat completions endpoint |


## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support
(vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but
the fix targets a self-contained block.
</details>

<details>
<summary><b>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored
imports in `gptoss_reasoning_parser.py`. The fix logic (adding
`eom_token_id` early-exit check in `is_reasoning_end`) was independent
of the import changes.
</details>

<details>
<summary><b>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2
files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30
intermediate upstream commits refactored `fused_moe_kernel` with
conditional `naive_block_assignment` logic that doesn't exist in
`rhai/0.13.0`. Resolved by keeping our simpler code and applying only
the int64 cast fix:
- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()`
result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)`
before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on
`make_dummy_moe_config()` from intermediate refactors. Resolved by
keeping our existing test code (no test changes).
</details>

<details>
<summary><b>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict
due to file rename + API change</summary>

Upstream moved `collective_fusion.py` →
`compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API
from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to
`create_allreduce_fusion_workspace()`. Resolved by applying the
try/except wrapper around our existing
`trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in
`collective_fusion.py`. The error handling logic (catching RuntimeError
with "multicast" in message, logging warning, returning early) is
identical to upstream.
</details>

<details>
<summary><b>vllm-project#7 — response_format validation for completions
(vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and
`chat_completion/protocol.py`. Our branch still has the monolithic
`protocol.py`. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which
doesn't exist in our branch; `ValueError` is already handled as 400 Bad
Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to
`test_completion_error.py`
</details>

<details>
<summary><b>vllm-project#8 — response_format validation for chat completions
(vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as vllm-project#6. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema`
test from the conflict (discarding ~140 lines of intermediate upstream
tests that reference non-existent paths in our branch)
</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the
backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8
attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format
returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32
overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast)
gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without
`json_schema` field returns 400 (not 500) for both `/v1/completions` and
`/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention
backend on ROCm


[INFERENG-4743]:
https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]:
https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]:
https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]:
https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]:
https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

[INFERENG-5106]:
https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026
…n-files-check

Fix vLLM framework documentation to match actual source code structure
omerpaz95 referenced this pull request in omerpaz95/vllm Apr 7, 2026
Signed-off-by: omerpaz95 <omerpaz95@gmail.com>
Natfii referenced this pull request in Navi-AI-Lab/nvllm Apr 14, 2026
…_like, .contiguous()

DecodeKernel.__call__ changes:
- query: assert contiguity instead of .contiguous() copy
- output: accept caller-provided persistent buffer
- wo_global_scale: pass tensor pointer, not .item() float
- grid.z: use padded_num_seqs for stable graph capture
- gate_ptr/gate_fused: new kernel args for output gate fusion

Blockers #1, #2, #3, #4, #5 from the CUDA graphs checklist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 16, 2026
Seven-commit Gemma2 push landed this session. Compiler stays
architecture-agnostic — the acid test passed for generality:
zero "gemma" strings below the parser, the entire architecture
fits in DSL body + JSON configs + library Impls.

New sections:
- "Next session starts here" rewritten around the Gemma2 state:
  coherent output on common prompts, known hard-fail on
  `test_cuda_correctness_gemma2_2b` prompt 7 (a translation
  prompt with apostrophe-quoted text), ruled-out causes listed
  (not CUDA graphs, not final softcap, not ferrite-vs-hand-written
  drift, not the embed-scale/sliding-flip bugs I already fixed).
- "Commit history" grew a fresh block of 8 commits.
- "Landed since the correctness fix" gained a full item vllm-project#7 for
  the Gemma2 port, enumerating each sub-piece and its commit.
- Gap vllm-project#5 went from "sliding-window / Gemma2" to "Gemma3" (the
  logical next architecture using the same DSL constructs).
- Gap vllm-project#9 repurposed from "per-layer golden diff harness (nice
  to have)" to "per-layer tensor-dump debug harness" — the
  concrete tool needed to unblock the prompt-7 failure. Spelled
  out the approach including the forward_backbone quirk (for
  Gemma2 it returns pre-softcap logits, not pre-lm_head
  hidden_states, because the DSL's terminal tile is
  `tanh_softcap`).
- "Already landed" list gained nine new bullets covering the
  DSL constructs, the scalars mechanism, Gemma2 Impls, kernel
  `weight_offset` param, Gemma2 DSL + configs, and
  `Gemma2Ferrite` cuda_worker wiring.
- Gap vllm-project#4 (Granite multipliers) updated: the DSL now expresses
  scalar multiply via the Gemma embed-scale plumbing.

Honest accounting of the known-fail: correlation r=0.11 between
our post-softcap logits and Python vLLM's on prompt 7. Divergence
is upstream of softcap, specific to this prompt's numerics, and
likely lives in a shared CUDA kernel (flash-attn softcap + sliding,
or cuBLAS gemm on Gemma2's unusual head_dim ratios). Both ferrite
and the pre-existing hand-written Gemma2 path exhibit the same
failure — the fix, once found, applies to both.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 16, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Natfii referenced this pull request in Navi-AI-Lab/nvllm Apr 20, 2026
Land the Silly Streaming microbench trail (four pre-run scripts + README)
alongside the negative verdict from the gatekeeper probe (#1).

  - HW backend: nvCOMP 5.2 NVCOMP_DECOMPRESS_BACKEND_HARDWARE raises
    RuntimeError code=12 on GB10 — Blackwell HW-DE is datacenter-only
    (B100/B200), not exposed on SM120.
  - SW fallback: CUDA-backend Deflate on 1 GiB = 3.7 GB/s (uniform-
    random) / 12-13 GB/s (trivially compressible). NVFP4 weights would
    sit near the random number, still below the NVMe ceiling from #4.

Scripts are preserved as evidence trail; they still reference the
removed kvikio.nvcomp_codec.NvCompBatchCodec API — use nvidia.nvcomp
directly if someone revives this on future HW.

Long live silly stream. o7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii referenced this pull request in Navi-AI-Lab/nvllm Apr 22, 2026
Land the Silly Streaming microbench trail (four pre-run scripts + README)
alongside the negative verdict from the gatekeeper probe (#1).

  - HW backend: nvCOMP 5.2 NVCOMP_DECOMPRESS_BACKEND_HARDWARE raises
    RuntimeError code=12 on GB10 — Blackwell HW-DE is datacenter-only
    (B100/B200), not exposed on SM120.
  - SW fallback: CUDA-backend Deflate on 1 GiB = 3.7 GB/s (uniform-
    random) / 12-13 GB/s (trivially compressible). NVFP4 weights would
    sit near the random number, still below the NVMe ceiling from #4.

Scripts are preserved as evidence trail; they still reference the
removed kvikio.nvcomp_codec.NvCompBatchCodec API — use nvidia.nvcomp
directly if someone revives this on future HW.

Long live silly stream. o7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jianzs pushed a commit to jianzs/vllm that referenced this pull request Apr 23, 2026
- Section 4.3: Update single-request benchmarks with post-fix data
  (Config 3 TPOT 9.88ms->7.86ms, Config 4 TPOT 8.41ms->7.31ms)
- Section 5.2.1: Rewrite from "pending fix" to "fixed and verified"
  with profile data comparing NCCL ops before/after fix
- Section 5.2.2: Update concurrent TPOT worsening percentages
  with new single-request baselines
- Section 5.3.2: Update single-request ITL comparison
- Section 7: Add config3_fix profile directory and NCCL comparison table
- Section 8: Update conclusion vllm-project#3 and vllm-project#4 with fix results

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 25, 2026
Three layers of work, all driving toward "load real-world IQ-quantized
GGUFs and run them coherently with quantized weights kept on GPU."

GGUF quantized-storage default
- Default load path now keeps GGUF weights compressed on GPU instead
  of dequant-to-BF16, saving ~2x weight memory. FERRITE_DEQUANT_AT_LOAD=1
  opts back into the legacy dense path for debugging.
- Fix the load-time host_buf race: pinned H2D for quantized weights
  was reusing host_buf for the next tensor before the previous DMA
  finished, leaving subsequent quantized weights corrupted on GPU.
  One-line stream_synchronize before the host_buf reuse.
- Unpin Q6_K from max_ncols_y=1 (cargo cult, fixed by the sync above).
- mul_mat_vec_q template: align rows_per_cuda_block with upstream
  (ncols_y < 4 → 1, else 2).

IQ kernel coverage
- ggml IQ1_M (codebook + dequant + MMVQ + dispatch).
- Expanded IQ* support: IQ1_S / IQ2_XXS / IQ4_NL / IQ4_XS dequant +
  MMVQ kernels, sign-bit machinery shared across types
  (kmask_iq2xs / ksigns_iq2xs / unpack_ksigns / __vcmpne4 / __vsub4
  / ggml_cuda_dp4a).
- Add IQ2_XS (GgufDType tag 17): GgufDType + GgmlDType variants, the
  512-entry iq2xs_grid codebook (uint64), block_iq2_xs struct,
  dequantize_block_iq2_xs (+ DEQUANTIZE_K macro), vec_dot_iq2_xs_q8_1,
  mul_mat_vec_iq2_xs_q8_1_cuda1, the three launch wrappers, extern "C"
  declarations and dispatch arms in ggml.rs. Mistral-7B-Instruct-v0.3
  -IQ2_S.gguf (which mixes IQ2_XS for attention with IQ2_S for FFN)
  now loads and both `[ggml dispatch] first matmul via IQ2_XS` and
  `IQ3_S` notifiers fire during forward.
- Batch IQ prefill: 56 new mul_mat_vec_iq*_q8_1_cuda2 through _cuda8
  template instantiations across 8 IQ types, each launcher updated to
  switch on ncols_y exactly like Q4_K, max_ncols_y cap lifted from
  is_iq_quant() ? 1 : 8 to a flat 8. Reduces prefill kernel launches
  ~8x for IQ-family weights.

Probe infrastructure (stays in the tree)
- crates/ferrite-kernels/src/layers.rs has a ggml_probe module behind
  two env vars, off by default:
  - FERRITE_PROBE_MMVQ=1 — runs both the quantized kernel AND a
    dequant+cuBLAS F32 reference per GgmlLinear::forward, prints
    per-call max_abs_diff / rel / NaN counts. Used to isolate the
    host_buf race; now used to validate the cuda2..cuda8 IQ kernels.
  - FERRITE_USE_REFERENCE=1 — replaces ggml_matmul with the F32
    cuBLAS reference at every linear. If inference becomes coherent
    under this flag, kernels are the bug; if it stays broken, look
    upstream.

End-to-end coverage
- Llama-3.2-1B chat tests for Q4_K_M / IQ1_S / IQ1_M / IQ2_XXS /
  IQ4_NL / IQ4_XS pass with FERRITE_PROBE_MMVQ=1 showing rel <= 2.9%
  at M=42 prefill (IQ4_NL <= 0.82%, IQ4_XS <= 1.1%, IQ1_M <= 0.81%,
  IQ1_S <= 2.9%, IQ2_XXS <= 1.3%); for every type the batched
  cuda2..cuda8 path is at least as accurate as the cuda1 baseline.
- assert_coherent_text strengthened.
- Mistral-7B IQ2_S / IQ3_S regression-marker tests added under
  #[ignore]. They go green when the tokenizer-from-GGUF fallback
  (vllm-serve/src/llm.rs:tokenize_text) is fixed — currently every
  prompt's ASCII bytes get used as token IDs because bartowski's
  GGUF-only repo ships no tokenizer.json, which is what the
  documented "x_max=0 from L0 q_proj" symptom actually was.

Docs
- HANDOFF_QUANTIZED.md captures the current state: items vllm-project#2 (IQ2_XS)
  and vllm-project#3 (IQ batched prefill) closed, item vllm-project#1 reclassified as a
  tokenizer-fallback bug (not a quant-path bug), item vllm-project#4 (IQ decode
  DMMV) reclassified do-not-port — upstream llama.cpp deleted
  ggml/src/ggml-cuda/dmmv.cu entirely; ik_llama.cpp only ports DMMV
  for its custom IQ*_KT types not the standard ones.
- vllm-parity/parity.csv: GGUF IQ row updated to enumerate all 8
  wired types with per-type rel numbers; new rows for "GGUF dequant
  kernels (IQ types)", "IQ decode DMMV (wontfix)", and "Tokenizer
  from GGUF metadata".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii referenced this pull request in Navi-AI-Lab/nvllm Apr 25, 2026
Each of the 16 full_attention layers in Qwen3.5-27B attaches its own
PhaseE_Beta_Kernel instance with its own `self._compiled_phase_coop_full
= None`, so `cute.compile()` fires once per layer on first request —
16 × ~23 s ≈ ~6 min cold-start stall.

Fix: module-level `_PHASE_E_COOP_FULL_COMPILE_CACHE` keyed by the tuple
of all 22 `self.` constexprs read inside `_jit_launch_phase_0_to_4`
(audited via grep; key covers them all + 12 safe-redundant derived
fields). Instances with matching config share one compiled kernel.

Evidence (`benchmarks/nvllm/traces/phase_e_1/2026-04-24-coop-compile-cache/`):
- 16 β-coop attachments → 1 compile event (was 16).
- Cold Q1 = 79.4 s (compile + decode); warm Q2-Q8 = 22.7-23.2 s each.
- Projected savings ≈ 310 s (~5 min) shaved off first-request latency.
- GSM8K sanity PASS 7/8 (Q2 is a regex-extractor artifact on '120/12',
  not a kernel regression — reproduces on baseline without this fix).

Unit tests (`tests/kernels/cute/test_phase_e_compile_cache.py`):
- 6 new tests covering dict existence, key equivalence for matching
  configs, key distinctness for different configs, 16-instance → 1-compile
  behavior, distinct-config → N-compiles, and back-compat instance attr
  population.
- 33/33 existing Phase E tests still pass.

Next in Phase E.1: #3 record_function spans (this PR), #2 β-coop SMEM
shrink + #4 matched-concurrency baseline bench (follow-up session),
#5 cudaProfilerApi hook (infra).

Base: 7bc5773

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii referenced this pull request in Navi-AI-Lab/nvllm Apr 25, 2026
The existing Phase E baseline leg ran concurrent=4 max_tokens=256 while
β-lite ran concurrent=8 max_tokens=64 (per Caveat #1 in
benchmarks/nvllm/traces/phase_e/2026-04-23-initial/summary.md). The
per-kernel μs comparison wasn't apples-to-apples.

This script re-captures a baseline leg (CUTE_PHASE_E_FUSION=0) at the
same workload as the β-lite leg — num_seqs=8, concurrent=8,
max_tokens=64, warmup=4, timed=5 — so β-lite vs baseline kernel-duration
deltas can be read directly from the CSVs produced by
extract_e2e_kernels.py.

Mirrors the structure of capture_beta_only.sh (same profiler config,
memory watchdog, readiness gate, CUPTI flush delay). Runs on the
current nvllm:gb10 image; FUSION=0 bypasses all Phase E code paths so
no rebuild is required for this leg.

Output: benchmarks/nvllm/traces/phase_e_1/2026-04-24-baseline-matched/
Evidence bundle (summary.md + kernel CSV) lands in the follow-up
session that ships E.1 #2 (β-coop SMEM shrink).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii referenced this pull request in Navi-AI-Lab/nvllm Apr 25, 2026
…ched concurrency

Matched-concurrency baseline (CUTE_PHASE_E_FUSION=0, num_seqs=8) vs
existing β-lite leg. Same model, PIECEWISE, FP8 KV, active_iterations=200.

Finding: Phase_D_MLP_Kernel fires 2× per full_attn layer per decode
step in β-lite (n_calls=2016) vs 1× in baseline (n_calls=1008).
Per-call MLP is 13.5% faster (90,408 vs 104,499 μs), but the 2×
firing swamps the win. Net: +76,349 μs/layer/step, i.e. +62.8%
slower per-full-attn-layer decode cost.

Raises Phase E.1 #2 (β-coop SMEM shrink → num_seqs≥2) priority from
"lower leverage if num_seqs=1 is 95%" to "regression fix for the
user's steady-state workload." See memory updates for num_seqs=2
target.

Extends .gitignore to mirror the phase_e/** policy to phase_e_1/**
(raw .pt.trace.json.gz local-only; CSV + logs + md + txt + json
committed) plus pre-ships phase_f/** rules for upcoming Phase F.1.

Evidence bundle:
  benchmarks/nvllm/traces/phase_e_1/2026-04-24-baseline-matched/
    ├── baseline_matched_kernels.csv   (67 kernels, per-call + totals)
    ├── baseline_matched_serve.log     (EngineCore — confirms FUSION=0)
    ├── baseline_matched_mem.log       (host + docker mem watchdog)
    ├── profiler_out_0.txt
    └── summary.md                     (apples-to-apples comparison)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii referenced this pull request in Navi-AI-Lab/nvllm Apr 25, 2026
…gate

Spec + implementation plan for a coupled fix discovered when working
through the β-lite 2× firing regression from E.1 #4:

* **Phase E.2** — fresh-eyes spec-reviewer audit caught that both β
  kernels multiply by raw γ while Qwen3_5RMSNorm uses x * (1 + γ).
  Latent because the consume branch at qwen3_5.py:473 dead-branches
  under PIECEWISE (@support_torch_compile + fullgraph=True +
  skip_all_guards_unsafe), so β's wrong output is orphaned and the
  legacy path produces the actual answer. Fix: (1 + γ) at
  mlp_kernel.py:1502 and phase_e_kernel.py:641 (audit Phase 4
  epilogue during implementation), plus a reference-diff harness
  cross-checking against Qwen3_5RMSNorm.forward_native.

* **Phase F.1** — opaque-gate refactor (first slice of Phase F,
  FULL-graph enablement). Two new custom ops: cute_phase_e_dispatch
  (replaces qwen3_5.py:473-481 consume gate) and
  cute_phase_e_skip_input_layernorm (wraps qwen3_5.py:386 so layer
  N+1 skips its input_layernorm when β pre-applied it). Fail-loud
  on consume-branch errors; nested op calls into cute_mlp_forward
  on the NOT-consumed path.

Audit also surfaced that Phase E's reported 51.7% speedup measures
β kernel launch latency, not end-to-end decode wall time — the
legacy full path fires alongside β due to the same dead-branch.
Memory updated accordingly (project_phase_e_phantom_speedup,
project_phase_e_beta_math_bug). True end-to-end numbers pending
post-E.2+F.1 re-trace.

Audit Finding #13 (flagged, deferred to Phase F.N): _fusion_active
at qwen3_5.py:423,445 is the same bug class — Phase B/C attention
fusion wins from project_cute_paged_bench may also be phantom.

Files:
  docs/superpowers/specs/2026-04-24-phase-f1-opaque-gate-refactor-design.md
  docs/superpowers/plans/2026-04-24-phase-e2-f1-beta-correctness-opaque-gate.md

Force-add past docs/superpowers/ gitignore rule (consistent with
existing specs/plans which are grandfathered-in and tracked).

Execution next session via superpowers:subagent-driven-development.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
…+ Embedding)

Tensor-parallel safetensors → GPU loaders matching Python vLLM's
ColumnParallelLinear / RowParallelLinear / VocabParallelEmbedding
weight_loader semantics. Used by codegen at tp>1 (task vllm-project#5 wires
the dispatch from the macro side; this commit lands only the
runtime helpers).

Added on the kernel-side `LinearLayer` / `Linear` / `Embedding`:

- `Linear::load_sharded(weights, prefix, dim, rank, world)`. The
  load-bearing bias rules:
  - `dim = 0` (column-parallel: q/k/v/gate/up/lm_head/embed): bias
    shards along dim 0 too — each rank holds its own slice. Mirrors
    Python `ColumnParallelLinear.weight_loader` →
    `loaded_weight.narrow(output_dim=0, …)`.
  - `dim = 1` (row-parallel: o_proj, down_proj): bias is **replicated
    full-size on rank 0 only**, `None` on other ranks. The forward
    path adds bias before the cross-rank AllReduce-sum; only rank 0's
    contribution survives the sum, giving exactly one bias add to
    the residual stream. Mirrors Python `RowParallelLinear.forward`
    line 1543 `bias_ = None if (self.tp_rank > 0 …) else self.bias`.

- `LinearLayer::load_dense_sharded(weights, prefix, dim, rank, world)`.
  Thin wrapper over `Linear::load_sharded`. The codegen entry point
  for non-fused (single-prefix) sharded loads.

- `LinearLayer::load_dense_concat_sharded(weights, prefixes, stream,
  rank, world)`. Sharded variant of `load_dense_concat` for the fused
  QKV / gate_up paths. Always column-parallel (no row-parallel concat
  exists in any current arch). Each source weight slices along dim 0
  to `[out_i / world, hidden]` then packs into one contiguous
  `[(sum out_i) / world, hidden]` GPU buffer via per-source
  `take_shard_into`. Biases follow the column-parallel rule (sliced
  along dim 0) — matches Python `MergedColumnParallelLinear` /
  `QKVParallelLinear`. Per-rank divisibility is guaranteed by the
  macro's outer-loop fanout `skip` of indivisible (variant, tp)
  tuples (commit `889c44b2f`).

- `Embedding::load_sharded(weights, prefix, rank, world)`. Vocab-
  parallel: slices the embedding table along dim 0
  (`[vocab_size, hidden]` → `[vocab_size / world, hidden]`).
  Mirrors Python `VocabParallelEmbedding`. Same dim-0 cut as
  `Linear::load_sharded(dim=0)` — that's what makes
  `tie_weights(lm_head.weight = embed_tokens.weight)` self-consistent
  at tp>1.

Defers FP8 / Marlin / BNB sharded variants — the verify model
(commandr) is dense bf16. World == 1 short-circuits to
byte-equivalent behavior with the existing unsharded paths in
every helper, plus shard-kind-aware bias rules. No tests added at
this layer (CUDA stream + safetensors fixtures aren't worth the
infra spend; the integration test is task vllm-project#8). `take_shard` /
`take_shard_into` on `GpuWeights` already exist (used by the prior
hand-written TP path); these wrappers are pure call-site
plumbing on top.

Build clean: ferrite-kernels checks + clippy at default features.
The macro-side consumer that chooses sharded vs unsharded based
on shard_kind comes in tasks vllm-project#4 + vllm-project#5; until then these helpers
have no runtime caller (intentionally — wholesale codegen
migration per the no_piecemeal_codegen_migration rule).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request May 9, 2026
The previous debug commit (`aabb517a9`) printed bindings for ICB
compute commands but skipped the Gemm path (which goes through
`resolve_gemm_buffers` rather than `record_compute_dispatch`). This
left bug vllm-project#4 — the MPS-side `offset: 0u64` hardcoding fixed in the
previous commit — invisible to the diagnostic. Add the matching
print for `BucketStep::Gemm`: kernel/dims and `(buf, off)` for each
of a / b / c.

Drop with the rest of `aabb517a9`'s instrumentation when the metal
correctness work is merged-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request May 9, 2026
…oject#4 still open

Update FERRITE_METAL_PROGRESS.md to reflect the third bug fix this
session: `MPSMatrix.initWithBuffer:offset:` was hardcoded to 0,
silently undoing the per-binding offsets the lowering + worker had
threaded all the way to the Gemm encoder. Caught by extending
`FERRITE_METAL_BAKE_DEBUG=1` to dump Gemm bindings — the offset
discrepancy fell out as soon as we could see the bound `(buf, off)`
vs MPS' offset=0 invariant.

Post-fix the model output is finally a function of the input: "hi"
samples token 29966 (`<`) and continues `<assistant>\n…`; different
prompts produce different sequences. But the model still loops on
chat-template-like tokens — at least one more bug remains, likely
in attention math, RoPE, or kernel arithmetic at runtime conditions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request May 9, 2026
Adds a Phase 5.K section documenting this session's startup-time work: 16 commits between 79da7db and f9ae0df, dropping warm init engine from 2.91s to 240ms on Llama-3.2-3B (~20% under vllm-mlx's 300ms baseline). Also closes Phase 5.J's bug vllm-project#4 (bf16 KV slot-0 race) which landed in 3658f2b, and updates the Notes block.

Cross-references project_metal_unpacked_mlp_next.md for the structural follow-up (real MetalMulImpl + Instruction::ElementwiseMul) that would eliminate the remaining ~200ms gate_up pack memcpy and drop startup to ~50ms.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant