Skip to content

Fix a bug in 1D input shape#5

Merged
WoosukKwon merged 4 commits into
mainfrom
bugfix
Mar 6, 2023
Merged

Fix a bug in 1D input shape#5
WoosukKwon merged 4 commits into
mainfrom
bugfix

Conversation

@WoosukKwon
Copy link
Copy Markdown
Collaborator

This PR fixes a miscalculation of the input shape when iteration-level scheduling is used.

@WoosukKwon WoosukKwon merged commit 04e5acc into main Mar 6, 2023
@WoosukKwon WoosukKwon deleted the bugfix branch March 6, 2023 18:05
v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request Sep 12, 2023
xiangyuT added a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023
* finish changing scheduler

* finish merge

* fix model

* Fix (vllm-project#5)

* fix problems

* fix

* delete unused params

* remove redundant comments

---------

Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 14, 2024
Align optimum-intel based model signature with vLLM signature
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024
…imum

Install optimum-intel from latest main
mzusman added a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
* Drop indecies when finish

* min 1 attention layer

* CG is working on forward pass passing

* Remove comments

* cosmetics - rename indecies -> indices, organize some whitespaces

* Add some TODOs

* Adding mamba cache for cg

* Remove useless vars from input_metadata

* Remove unused import

* Set the seqlen offset to boolean

* Return only hidden state

* Return only hidden states

* Add padding to match forward pass bs

* Is prompt instead of seqlen offset

* Remove mamba cache class (not used)

* Another remove

* Remove

* Use mamba4gc

* Fix mamba forward, run update only on non prompt

* Use 1 index after the maximal index

* Remove import

* Remove import

* typo

* typo

* place holder

* Padding and empty token takes it from the first empty place

* reformat

* Apply suggestions from code review

Whitespaces

---------

Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
linxihui added a commit to linxihui/vllm that referenced this pull request May 14, 2024
…3small

 [Model][Kernels] Support Phi3small architecture, blocksparse attnention prefilling kernel, CUDA+Triton paged attn kernels
Starmys pushed a commit to Starmys/vllm that referenced this pull request May 20, 2024
Faster v2 hopper fused moe kernel configs
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024
IWantFight pushed a commit to IWantFight/vllm that referenced this pull request Mar 12, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 26, 2026
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All
fixes are from upstream vLLM `main` and address critical bugs affecting
RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:**
[INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) |
[INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) |
Support using chat template as custom score template for reranking
models |
| 2 | [vllm-project#31406](vllm-project#31406) |
[INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add
encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) |
[INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix
Llama-4 attn quantization by correctly permuting scales for rope (int8,
fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) |
[INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix
structured output in multi-turn GPT-OSS (content:null with json_object)
|
| 5 | [vllm-project#34507](vllm-project#34507) |
[INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix
fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) |
[INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) |
Gracefully disable AllReduceFusionPass on GPUs without multicast support
|
| 7 | [vllm-project#35456](vllm-project#35456) |
[INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) |
Replace assert with ValueError for response_format validation
(completions) |
| 8 | [vllm-project#35510](vllm-project#35510) |
[INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add
response_format validation to chat completions endpoint |


## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support
(vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but
the fix targets a self-contained block.
</details>

<details>
<summary><b>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored
imports in `gptoss_reasoning_parser.py`. The fix logic (adding
`eom_token_id` early-exit check in `is_reasoning_end`) was independent
of the import changes.
</details>

<details>
<summary><b>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2
files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30
intermediate upstream commits refactored `fused_moe_kernel` with
conditional `naive_block_assignment` logic that doesn't exist in
`rhai/0.13.0`. Resolved by keeping our simpler code and applying only
the int64 cast fix:
- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()`
result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)`
before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on
`make_dummy_moe_config()` from intermediate refactors. Resolved by
keeping our existing test code (no test changes).
</details>

<details>
<summary><b>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict
due to file rename + API change</summary>

Upstream moved `collective_fusion.py` →
`compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API
from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to
`create_allreduce_fusion_workspace()`. Resolved by applying the
try/except wrapper around our existing
`trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in
`collective_fusion.py`. The error handling logic (catching RuntimeError
with "multicast" in message, logging warning, returning early) is
identical to upstream.
</details>

<details>
<summary><b>vllm-project#7 — response_format validation for completions
(vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and
`chat_completion/protocol.py`. Our branch still has the monolithic
`protocol.py`. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which
doesn't exist in our branch; `ValueError` is already handled as 400 Bad
Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to
`test_completion_error.py`
</details>

<details>
<summary><b>vllm-project#8 — response_format validation for chat completions
(vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as vllm-project#6. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema`
test from the conflict (discarding ~140 lines of intermediate upstream
tests that reference non-existent paths in our branch)
</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the
backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8
attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format
returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32
overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast)
gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without
`json_schema` field returns 400 (not 500) for both `/v1/completions` and
`/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention
backend on ROCm


[INFERENG-4743]:
https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]:
https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]:
https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]:
https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]:
https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

[INFERENG-5106]:
https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
roy-shih referenced this pull request in UnieAI/vllm Mar 31, 2026
逐一 grep 驗證所有已完成項目的整合程式碼確實存在:
- #3 spec decode: _batch_precompute_spec_decode() 已在 scheduler.py
- #5 builtin hash: 已在 config/cache.py Literal type
- vllm-project#15 batch spec decode: _precomputed_spec 快速路徑已在迴圈中

清除 strikethrough 噪音,統一為乾淨的「已完成/未完成」兩表格式。

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jinhuang12 pushed a commit to jinhuang12/vllm that referenced this pull request Apr 8, 2026
…ssions)

Kill vllm-project#2: E2E 1.41-1.46% at BS=32 (needs ≥1.5%)
Kill vllm-project#5: down_proj 0.754x warm, in_proj_qkvz 0.925x warm

Catch-22: selective dispatch avoids warm regressions but gives 0.5% E2E;
all-shapes dispatch gets 1.46% E2E but has warm regressions.
No configuration satisfies both criteria.

Translation factor ~0.33 appears to be the ceiling for GEMM-only
optimizations on this model/hardware (Qwen3.5-4B, L40S, BF16, TP=1).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jinhuang12 pushed a commit to jinhuang12/vllm that referenced this pull request Apr 8, 2026
- SKILL.md: add Non-Negotiable vllm-project#5 (GPU isolation via CVD prefix), add
  PostToolUse hook row, add gpu_status.py + gpu_force_clear.py to helper
  scripts, SHOULD→MUST for session_id, add Resume Protocol step 3b,
  remove soft-reservation line for debate GPUs
- parallel-tracks.md: replace flat GPU table with TP-Aware GPU Partitioning
  section (exclusive/shared modes, state.json gpu_assignment schema), update
  spawn prompt templates to kernel_cvd + e2e_cvd format, replace GPU
  Allocation During Overlap soft-reservation note with CPU-only mandate
- ammo-impl-champion.md: replace GPU Coordination section with GPU Usage
  (CVD prefix pattern, hook auto-management, blocking behavior)
- ammo-impl-validator.md: add GPU Usage section (same pattern as champion)

Agents no longer need to manually reserve/release GPUs — hooks handle it
automatically when commands include CUDA_VISIBLE_DEVICES from spawn prompt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Natfii referenced this pull request in Navi-AI-Lab/nvllm Apr 14, 2026
…_like, .contiguous()

DecodeKernel.__call__ changes:
- query: assert contiguity instead of .contiguous() copy
- output: accept caller-provided persistent buffer
- wo_global_scale: pass tensor pointer, not .item() float
- grid.z: use padded_num_seqs for stable graph capture
- gate_ptr/gate_fused: new kernel args for output gate fusion

Blockers #1, #2, #3, #4, #5 from the CUDA graphs checklist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 15, 2026
…elete LlamaSolver/Qwen2Solver

The old macro-generated Llama/Qwen2 `Model` + `solver_hidden_states`
paths (CudaModel::LlamaSolver, CudaModel::Qwen2Solver) were the
pre-ferrite-forward solver output. With ferrite-forward's Weights
enum + forward now wired up, they're redundant. Delete both.

vllm-executor/src/cuda_worker.rs:
- Remove LlamaSolver / Qwen2Solver variants + all their match arms
  (num_layers, num_kv_heads, head_dim, vocab_size, hidden_size,
  hidden_states, forward).
- Dense bf16 Llama always routes to CudaModel::LlamaFerrite (no
  more VLLM_FERRITE=1 env-var gate).
- Dense bf16 Qwen2 routes to the hand-written
  vllm_cuda::model::qwen2::Qwen2ForCausalLM::load (quant/tp/pp
  paths it already used) since Qwen2 through ferrite-forward is
  blocked on HANDOFF Step C (CublasFusedQkvGemmWithBiasImpl, gap
  vllm-project#5).

vllm-cuda/src/model/llama.rs: drop the `ferrite_macros::forward!{}`
block at line 3667 + the cutlass_gemm_ffi! FFI declarations that
only existed for the old macro's output (lines 3630-3665).

vllm-cuda/src/model/qwen2.rs: same — drop the forward!{} block at
line 275, its cutlass FFI block, and now-unused imports
(attention_helpers, driver, kernels, Embedding, Linear, LinearLayer,
RmsNorm, RotaryCache, GpuTensor).

vllm-cuda/Cargo.toml: drop the `ferrite-macros` dep (nothing in
vllm-cuda consumes it anymore; ferrite-models/src/qwen2.rs still
uses it and carries its own dep).

Compile-time impact: the two `forward!{}` invocations ran the full
old ferrite pipeline (parse → DAG → solver → codegen) for all
llama + qwen2 configs on every build, producing ~hundreds of lines
of generated Rust. Their removal cuts vllm-cuda's proc-macro time
significantly.

Verified: cargo check -p vllm-executor --features cuda green in
1.55s (cold: 3:26); fmt + clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jinhuang12 pushed a commit to jinhuang12/vllm that referenced this pull request Apr 15, 2026
…retry + no-kill rule

Root cause: gpu_reservation.py defaulted session_id to "cli" when CLAUDE_SESSION_ID
was unset, causing auto_release=True to silently evict other agents' GPU reservations.

Changes:
- gpu-pool.md: canonical pattern now includes --session-id and --no-auto-release,
  explicit retry loop for pool exhaustion, process isolation rules
- parallel-tracks.md: spawn prompt uses agent-scoped --session-id {op_id}
- SKILL.md: rule vllm-project#5 updated with new flags, retry, and no-kill references
- ammo-gpu-release.sh: tighter trigger regex, extracts --session-id from command

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 16, 2026
Seven-commit Gemma2 push landed this session. Compiler stays
architecture-agnostic — the acid test passed for generality:
zero "gemma" strings below the parser, the entire architecture
fits in DSL body + JSON configs + library Impls.

New sections:
- "Next session starts here" rewritten around the Gemma2 state:
  coherent output on common prompts, known hard-fail on
  `test_cuda_correctness_gemma2_2b` prompt 7 (a translation
  prompt with apostrophe-quoted text), ruled-out causes listed
  (not CUDA graphs, not final softcap, not ferrite-vs-hand-written
  drift, not the embed-scale/sliding-flip bugs I already fixed).
- "Commit history" grew a fresh block of 8 commits.
- "Landed since the correctness fix" gained a full item vllm-project#7 for
  the Gemma2 port, enumerating each sub-piece and its commit.
- Gap vllm-project#5 went from "sliding-window / Gemma2" to "Gemma3" (the
  logical next architecture using the same DSL constructs).
- Gap vllm-project#9 repurposed from "per-layer golden diff harness (nice
  to have)" to "per-layer tensor-dump debug harness" — the
  concrete tool needed to unblock the prompt-7 failure. Spelled
  out the approach including the forward_backbone quirk (for
  Gemma2 it returns pre-softcap logits, not pre-lm_head
  hidden_states, because the DSL's terminal tile is
  `tanh_softcap`).
- "Already landed" list gained nine new bullets covering the
  DSL constructs, the scalars mechanism, Gemma2 Impls, kernel
  `weight_offset` param, Gemma2 DSL + configs, and
  `Gemma2Ferrite` cuda_worker wiring.
- Gap vllm-project#4 (Granite multipliers) updated: the DSL now expresses
  scalar multiply via the Gemma embed-scale plumbing.

Honest accounting of the known-fail: correlation r=0.11 between
our post-softcap logits and Python vLLM's on prompt 7. Divergence
is upstream of softcap, specific to this prompt's numerics, and
likely lives in a shared CUDA kernel (flash-attn softcap + sliding,
or cuBLAS gemm on Gemma2's unusual head_dim ratios). Both ferrite
and the pre-existing hand-written Gemma2 path exhibit the same
failure — the fix, once found, applies to both.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JlPang863 added a commit to yizhongzoe-cloud/my-vllm-serving-system that referenced this pull request Apr 22, 2026
Adds an env-var-controlled experimental knob for comparing the cost of the
host-memory KV reload recovery path against two cheaper alternatives:

  - reload    (default): existing behavior — restore KV from host checkpoint.
  - restart            : drop checkpoint AND already-decoded tokens, fresh
                         re-prefill of original prompt. Stream consistency
                         BROKEN — only for ablation.
  - reprefill          : drop checkpoint, preserve decoded tokens by
                         extending the prompt. Stream consistency preserved.

Two patches because the centralized policy (ft_benders_centralized) bypasses
RecoveryManager and does its own recovery in ft_client.py:

  - vllm/v1/engine/ft_client.py            (centralized recovery loop)
  - vllm/v1/core/recovery_manager.py       (fault_tolerant + non-centralized
                                            ft_benders policies)

Default is "reload" so the main experiment paths are unchanged.

Validation (W1_Chat/Heavy/F2_Mid, dp=2, 8B, 462 reqs):

  | mode      | goodput   | TTFT p50  | SLO viol | failover_gap p50 |
  |-----------|-----------|-----------|----------|------------------|
  | No-FT     | 321.6 t/s |  406 ms   |   20.3 % |        –         |
  | reload    | 124.2 t/s | 18.6 s    |   65.6 % |     8.6 s        |
  | restart   |  47.9 t/s | 50.4 s    |   87.4 % |     8.6 s        |
  | reprefill |  39.5 t/s | 42.8 s    |   88.5 % |    24.6 s        |

Reload is 2.6-3.1x better goodput than restart/reprefill — KV reload converts
recovery cost from compute (the scarce resource on the surviving engine post-
fault) to I/O (which is plentiful), avoiding prefill capacity contention.
Original hypothesis "reload is overkill" disproved.

Full ablation writeup added as Follow-up finding vllm-project#5 in
experiments_v2/docs/e1a_quick_diagnosis.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Natfii referenced this pull request in Navi-AI-Lab/nvllm Apr 25, 2026
Each of the 16 full_attention layers in Qwen3.5-27B attaches its own
PhaseE_Beta_Kernel instance with its own `self._compiled_phase_coop_full
= None`, so `cute.compile()` fires once per layer on first request —
16 × ~23 s ≈ ~6 min cold-start stall.

Fix: module-level `_PHASE_E_COOP_FULL_COMPILE_CACHE` keyed by the tuple
of all 22 `self.` constexprs read inside `_jit_launch_phase_0_to_4`
(audited via grep; key covers them all + 12 safe-redundant derived
fields). Instances with matching config share one compiled kernel.

Evidence (`benchmarks/nvllm/traces/phase_e_1/2026-04-24-coop-compile-cache/`):
- 16 β-coop attachments → 1 compile event (was 16).
- Cold Q1 = 79.4 s (compile + decode); warm Q2-Q8 = 22.7-23.2 s each.
- Projected savings ≈ 310 s (~5 min) shaved off first-request latency.
- GSM8K sanity PASS 7/8 (Q2 is a regex-extractor artifact on '120/12',
  not a kernel regression — reproduces on baseline without this fix).

Unit tests (`tests/kernels/cute/test_phase_e_compile_cache.py`):
- 6 new tests covering dict existence, key equivalence for matching
  configs, key distinctness for different configs, 16-instance → 1-compile
  behavior, distinct-config → N-compiles, and back-compat instance attr
  population.
- 33/33 existing Phase E tests still pass.

Next in Phase E.1: #3 record_function spans (this PR), #2 β-coop SMEM
shrink + #4 matched-concurrency baseline bench (follow-up session),
#5 cudaProfilerApi hook (infra).

Base: 7bc5773

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
… slot

Regression tripwire for the load-bearing assumption in
project_tp_design_notes: `Instruction::AllReduce` is one-tile in-
place same-shape, so shape-aware coloring must place its output at
the input's slot with no `View` row. At runtime that lets
`NcclGroup::all_reduce_inplace` operate on the gemm output tile
directly — no alias-row preamble in the bucket slice.

The generic `coloring_same_shape_alias_collapses_to_source` test
already validates the colorer's behavior for any one-tile in-place
same-shape Impl, so this is a duplicate hypothesis. The value is:
1. Naming — future grep for "allreduce" in coloring lands here.
2. Realistic shape — `[N=4, H=16]` instead of `[1]` so coloring
   regressions that interact with `coloring_disjoint_lifetimes_dont
   _share_across_shapes` (the K=8192 N=5120 cublas-panic-driven
   test) trip both invariants.
3. Three-node row-parallel chain — gemm → all_reduce → consumer —
   matches what the upcoming task vllm-project#5 lowering pass will produce.

OpKind on the all_reduce node is `OpKind::Add` for now; task vllm-project#5
adds `OpKind::AllReduce` proper alongside the lowering pass.

Macro test suite: 210 → 211 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
Two regression tests assert that `CutlassGemmAddImpl::matches()` and
`FusedAddRmsNormImpl::matches()` refuse to claim a fused (Gemm, Add)
or (Add, RmsNorm) subgraph across an intermediate tile — the
topology the upcoming task vllm-project#5 lowering pass produces by inserting
`Instruction::AllReduce` after every row-parallel gemm.

Without this property at tp>1 the cutlass_gemm_add epilogue would
fuse the residual addition into the gemm's beta=1.0 path, mixing
pre-reduce per-rank gemm outputs into the residual stream BEFORE
all-reduce reconciled them across ranks → silently wrong math.

The "guard" is emergent — `consumes_tile(add, gemm)` returns false
once any intermediate node sits between the seed and the candidate
fusion partner. Both tests pin that property explicitly so a future
matches() change (e.g. following alias chains across in-place ops)
can't reintroduce the cross-AllReduce fusion.

Models AllReduce as a 1-input Add; `OpKind::AllReduce` proper lands
with the lowering pass in task vllm-project#5. The `inputs.len() != 2` filter
in CutlassGemmAddImpl::matches() ensures the stub itself isn't
mis-claimed as the residual Add.

Macro test suite: 211 → 213 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
Sub-step 1 of task vllm-project#5 (lowering pass). Lands the structural pieces
the lowering pass will produce, with no FUF currently carrying
`OpKind::AllReduce`:

- `classified::OpKind::AllReduce` variant. Deliberately omitted
  from `OpKind::from_name` so a user can't write `all_reduce(...)`
  in a `#[forward]` body — the canonical path is the post-FUF
  lowering pass that injects nodes with this op kind.
- `shape::apply_signature` and `shape::weight_arg_ranks` arms.
  Identity-shape one-input via `sig_unary_elementwise`, no tensor
  weight inputs.
- `AllReduceImpl` in `impl_lib.rs`: single-tile claim of any
  `OpKind::AllReduce` node with exactly one tile input. `output_alias`
  declares dst→input collapse so shape-aware coloring places the
  AllReduce row at the input's slot with no `View` indirection
  (validated by `coloring_allreduce_collapses_to_input_slot`). Cost
  model is bandwidth-bound 2x of local memory, falls back to
  `UNCALIBRATED_COST_US` on profiles missing measured cost data.
- Registered in `starter_library()` — at tp=1 the lowering pass
  emits zero AllReduce FufNodes so the matcher is a no-op in the
  solver.

One unit test pinning the matcher contract (single tile input
required, zero/multi rejected) and the output_alias collapse.

Macro test suite: 213 → 214 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
Sub-step 2 of task vllm-project#5. Lands the FUF-lowering pass module + the
no-op fast path at tp_world_size=1. Wires `insert_all_reduces` into
`compile()` between `fuf::unroll` and `solver::solve`, called with a
`tp=1` literal so every existing SolvedModel sees a byte-identical
FUF flowing into the solver. Task #5c lands the actual insertion
logic + per-arch shard-kind table; task vllm-project#7 swaps the literal for
`solved_model.tp_world_size` once canonical fanout exists.

Tests:
- `lowering_no_allreduce_at_tp_eq_1` — gemm → residual_add FUF, pass
  at tp=1, asserts node count + OpKind preserved tile-by-tile and
  zero `OpKind::AllReduce` nodes added.
- `lowering_no_allreduce_at_tp_eq_1_empty_fuf` — degenerate case,
  prevents a future implementation from indexing into nodes[0]
  without a length check.

Macro test suite: 214 → 216 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
Sub-step 3 of task vllm-project#5. Lands the actual insertion logic + the
generic shard-kind name-pattern table. At tp_world_size > 1 the
pass:

1. Walks every `OpKind::Gemm` node, looks up its weight in
   `program.weights`, classifies via `shard_kind_for_weight_path` —
   `q_proj | k_proj | v_proj | gate_proj | up_proj` → `ShardDim0`
   (column-parallel, no comm), `o_proj | down_proj` → `ShardDim1`
   (row-parallel, needs all-reduce), everything else → `Replicate`.
2. For each `ShardDim1` gemm, appends a fresh `OpKind::AllReduce`
   FufNode with `new_id = nodes.len()`, reading the gemm output as
   its single tile input.
3. Rewires every OTHER node's `FufInput::Tile { id: gemm_id }` to
   `FufInput::Tile { id: all_reduce_id }`. After this rewrite, every
   consumer of the gemm's output sees the post-reduce value — the
   load-bearing correctness invariant.

The shard-kind matcher is generic across the whole ferrite tree
since every existing arch (llama / qwen* / mistral / gemma* /
commandr / phi3 / granite / deepseek_v* / mixtral) uses the
HF-standard names. Per-arch overrides can layer on later when an
arch with non-standard names lands; until then keeping the matcher
generic avoids per-arch boilerplate.

Tests:
- `shard_kind_table_covers_hf_standard_names` — explicit table
  pin so a future per-arch override can't silently break the
  generic fallback.
- `lowering_inserts_allreduce_after_row_parallel_gemm_at_tp_gt_1`
  — the load-bearing tp>1 assertion: 1 AllReduce inserted, residual
  add rewired, raw gemm output never read by a downstream consumer.
- `lowering_skips_column_parallel_gemms_at_tp_gt_1` — q_proj must
  not trigger AllReduce; false insertion would clobber the per-rank
  shape contract.

Macro test suite: 216 → 219 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
…llel

Matches Python vLLM. `VocabParallelEmbedding` shards the embedding
table along dim 0 (vocab dim); `ParallelLMHead` inherits from it,
so both layers carry the same sharded layout — that's what makes
`tie_weights` (lm_head.weight = embed_tokens.weight) self-consistent
at tp>1 (ref vllm/model_executor/layers/vocab_parallel_embedding.py
lines 186, 500, 553).

Adds:
- `shard_kind_for_last_segment(&str)` so codegen can lookup without
  pre-splitting the path. Refactor of the existing match arm.
- `shard_kind_for_dotted_prefix(&str)` convenience for codegen,
  which has the dotted layer-0 prefix string in hand. `dead_code`
  allow until task vllm-project#5 wires the load helpers; pinned by tests now.
- `embed_tokens` and `lm_head` map to `ShardDim0` (was `Replicate`).

Behavior at the existing call site (`insert_all_reduces`) is
byte-identical: the lowering pass walks only `OpKind::Gemm` nodes
filtered by `ShardDim1`. `embed_tokens` is consumed by `Embed`,
not `Gemm`. `lm_head` is a Gemm but ShardDim0, not ShardDim1, so
the row-parallel filter still excludes it. AllReduce-after-Embed
+ AllGather-after-lm_head land in task vllm-project#6.

Macro tests: 195/195 (was 194/194) under both default `--features
cuda` and `--features nccl`. New tests:
- `shard_kind_for_dotted_prefix_pins_runtime_paths` covers the
  load-bearing per-arch paths from `default_required_weights`.
- Existing `shard_kind_table_covers_hf_standard_names` updated to
  expect `ShardDim0` for `embed_tokens` / `lm_head`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
…+ Embedding)

Tensor-parallel safetensors → GPU loaders matching Python vLLM's
ColumnParallelLinear / RowParallelLinear / VocabParallelEmbedding
weight_loader semantics. Used by codegen at tp>1 (task vllm-project#5 wires
the dispatch from the macro side; this commit lands only the
runtime helpers).

Added on the kernel-side `LinearLayer` / `Linear` / `Embedding`:

- `Linear::load_sharded(weights, prefix, dim, rank, world)`. The
  load-bearing bias rules:
  - `dim = 0` (column-parallel: q/k/v/gate/up/lm_head/embed): bias
    shards along dim 0 too — each rank holds its own slice. Mirrors
    Python `ColumnParallelLinear.weight_loader` →
    `loaded_weight.narrow(output_dim=0, …)`.
  - `dim = 1` (row-parallel: o_proj, down_proj): bias is **replicated
    full-size on rank 0 only**, `None` on other ranks. The forward
    path adds bias before the cross-rank AllReduce-sum; only rank 0's
    contribution survives the sum, giving exactly one bias add to
    the residual stream. Mirrors Python `RowParallelLinear.forward`
    line 1543 `bias_ = None if (self.tp_rank > 0 …) else self.bias`.

- `LinearLayer::load_dense_sharded(weights, prefix, dim, rank, world)`.
  Thin wrapper over `Linear::load_sharded`. The codegen entry point
  for non-fused (single-prefix) sharded loads.

- `LinearLayer::load_dense_concat_sharded(weights, prefixes, stream,
  rank, world)`. Sharded variant of `load_dense_concat` for the fused
  QKV / gate_up paths. Always column-parallel (no row-parallel concat
  exists in any current arch). Each source weight slices along dim 0
  to `[out_i / world, hidden]` then packs into one contiguous
  `[(sum out_i) / world, hidden]` GPU buffer via per-source
  `take_shard_into`. Biases follow the column-parallel rule (sliced
  along dim 0) — matches Python `MergedColumnParallelLinear` /
  `QKVParallelLinear`. Per-rank divisibility is guaranteed by the
  macro's outer-loop fanout `skip` of indivisible (variant, tp)
  tuples (commit `889c44b2f`).

- `Embedding::load_sharded(weights, prefix, rank, world)`. Vocab-
  parallel: slices the embedding table along dim 0
  (`[vocab_size, hidden]` → `[vocab_size / world, hidden]`).
  Mirrors Python `VocabParallelEmbedding`. Same dim-0 cut as
  `Linear::load_sharded(dim=0)` — that's what makes
  `tie_weights(lm_head.weight = embed_tokens.weight)` self-consistent
  at tp>1.

Defers FP8 / Marlin / BNB sharded variants — the verify model
(commandr) is dense bf16. World == 1 short-circuits to
byte-equivalent behavior with the existing unsharded paths in
every helper, plus shard-kind-aware bias rules. No tests added at
this layer (CUDA stream + safetensors fixtures aren't worth the
infra spend; the integration test is task vllm-project#8). `take_shard` /
`take_shard_into` on `GpuWeights` already exist (used by the prior
hand-written TP path); these wrappers are pure call-site
plumbing on top.

Build clean: ferrite-kernels checks + clippy at default features.
The macro-side consumer that chooses sharded vs unsharded based
on shard_kind comes in tasks vllm-project#4 + vllm-project#5; until then these helpers
have no runtime caller (intentionally — wholesale codegen
migration per the no_piecemeal_codegen_migration rule).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
Adds the `load_layered_*_sharded` wrappers that the codegen macro
will route to at tp>1 (task vllm-project#5). Mirrors the existing layered
helpers — same per-canonical accessor → single-line-of-expanded
fold, just routing through the kernel-side `_sharded` variants
landed in `b8ceca55a`.

- `load_layered_embedding_sharded(rank, world)` → vocab-parallel
  `Embedding::load_sharded`. No layered Embedding accessor exists
  in any current arch; helper is here for symmetry so the codegen
  can route every layered FieldLoad through a `_sharded` variant
  uniformly.
- `load_layered_linear_dense_sharded(dim, rank, world)` → wraps
  `LinearLayer::load_dense_sharded`. `dim = 0` for column-parallel
  (q/k/v/gate/up); `dim = 1` for row-parallel (o/down). Per-dim
  bias rules live in the kernel-side `Linear::load_sharded`.
- `load_layered_linear_dense_concat_sharded(rank, world)` → wraps
  `LinearLayer::load_dense_concat_sharded`. No `dim` arg — fused
  QKV / gate_up are always column-parallel.

Defers `_sharded` variants of the marlin / fp8 / bnb4 layered
helpers — the verify model is dense bf16. Per the
`no_codegen_in_proc_macro` rule the helpers stay narrow (one per
FieldLoad arm); the macro maps each (variant, dim) → exactly one
helper at codegen time.

Build clean: `ferrite-forward` checks + clippy at `--features
cuda` and `--features nccl`. Helpers have no consumers yet (task
vllm-project#5 wires them) but are `pub` at the crate surface so dead-code
lint is silent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
…a_worker

Prerequisite for task vllm-project#5 (codegen-side shard-kind dispatch). The
emitted `_sharded` loader calls need `rank` as a runtime value in
scope inside the `Weights::load_with` body — `tp_world_size` is
already a baked literal per (model, tp), but `rank` is per-process.

Touches:
- `ferrite_forward::dispatcher::ArchTryLoadFn` + `try_load(...,
  tp_rank: u8, ...)` — adds rank between `tp_world_size` and
  `max_model_len` in the public API. Forwarded to the matched
  registration's closure.
- Macro: `inventory::submit!`'s closure body becomes
  `|gw, stream, max_model_len, tp_rank, hf|` and passes
  `tp_rank` into `Weights::load(..., #tp_lit, tp_rank)`. The
  per-arch `Weights::load(...)` and `load_match_arms` body gain
  the `tp_rank: u8` parameter and forward it to
  `#model_ident::load(gw, stream, max_model_len, tp_rank)`.
- Codegen: per-model `load_with(..., tp_rank: u8)` and `load(...,
  tp_rank: u8)` (both canonical and Shim variants). `tp_rank` is
  in scope inside the `load_with` body — task vllm-project#5's emitted
  `_sharded` calls reference it as `tp_rank as usize`.
- `cuda_worker.rs`: passes `self.config.tp_rank as u8` into
  `try_load`. The `!use_tp` ferrite eligibility gate stays in
  place — until task vllm-project#5 lands the codegen-side shard-kind
  dispatch, the per-arch `Weights::load` at tp>1 still calls the
  unsharded loaders and would shape-mismatch the sharded
  `<W>::*` constants from `b8c25e5d2`. The gate lifts when vllm-project#5 +
  vllm-project#6 (embed masking + lowering AllReduce-after-Embed +
  AllGather-after-lm_head) are wired end-to-end.

Macro tests 196/196 under both `--features cuda` and `--features
nccl`. Full ferrite-models umbrella build clean at `--features
cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in 7m51s.
vllm-executor + vllm-cuda + ferrite-models compile clean at
`--features cuda` in 1m58s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant