Fix a bug in 1D input shape by WoosukKwon · Pull Request #5 · vllm-project/vllm

WoosukKwon · 2023-03-03T04:18:38Z

This PR fixes a miscalculation of the input shape when iteration-level scheduling is used.

More improvements awq

* finish changing scheduler * finish merge * fix model * Fix (vllm-project#5) * fix problems * fix * delete unused params * remove redundant comments --------- Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>

Align optimum-intel based model signature with vLLM signature

…imum Install optimum-intel from latest main

* Drop indecies when finish * min 1 attention layer * CG is working on forward pass passing * Remove comments * cosmetics - rename indecies -> indices, organize some whitespaces * Add some TODOs * Adding mamba cache for cg * Remove useless vars from input_metadata * Remove unused import * Set the seqlen offset to boolean * Return only hidden state * Return only hidden states * Add padding to match forward pass bs * Is prompt instead of seqlen offset * Remove mamba cache class (not used) * Another remove * Remove * Use mamba4gc * Fix mamba forward, run update only on non prompt * Use 1 index after the maximal index * Remove import * Remove import * typo * typo * place holder * Padding and empty token takes it from the first empty place * reformat * Apply suggestions from code review Whitespaces --------- Co-authored-by: Mor Zusman <morz@ai21.com> Co-authored-by: Tomer Asida <tomera@ai21.com> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>

…3small [Model][Kernels] Support Phi3small architecture, blocksparse attnention prefilling kernel, CUDA+Triton paged attn kernels

Faster v2 hopper fused moe kernel configs

Kuntai disagg refactor

fea: support rfork

## Summary Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately. **Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743) ## Cherry-picked commits (chronological order) | # | Upstream PR | Jira | Summary | |---|------------|------|---------| | 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models | | 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend | | 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) | | 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) | | 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models | | 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support | | 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) | | 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint | ## Conflict resolutions <details> <summary>#1 — llama-nemotron-embed / score-template support (vllm-project#30550): Clean cherry-pick, no conflicts</summary> Applied cleanly onto `rhai/0.13.0`. </details> <details> <summary>#2 — Triton Attention (vllm-project#31406): Clean cherry-pick, no conflicts</summary> Applied cleanly onto `rhai/0.13.0`. </details> <details> <summary>#3 — Llama-4 attn quant (vllm-project#34243): Clean cherry-pick, no conflicts</summary> Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block. </details> <details> <summary>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454): Clean cherry-pick, no conflicts</summary> Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes. </details> <details> <summary>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507): Conflicts in 2 files</summary> **`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix: - `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result - `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask` **`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes). </details> <details> <summary>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085): Conflict due to file rename + API change</summary> Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream. </details> <details> <summary>vllm-project#7 — response_format validation for completions (vllm-project#35456): Conflict due to file restructuring</summary> Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by: - Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py` - Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py` - Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`) - Test additions from upstream applied cleanly to `test_completion_error.py` </details> <details> <summary>vllm-project#8 — response_format validation for chat completions (vllm-project#35510): Conflict due to file restructuring</summary> Same file restructuring issue as vllm-project#6. Resolved by: - Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py` - Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py` - Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch) </details> ## Test plan - [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support - [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization - [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content - [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow - [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back - [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions` - [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm [INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

逐一 grep 驗證所有已完成項目的整合程式碼確實存在： - #3 spec decode: _batch_precompute_spec_decode() 已在 scheduler.py - #5 builtin hash: 已在 config/cache.py Literal type - vllm-project#15 batch spec decode: _precomputed_spec 快速路徑已在迴圈中清除 strikethrough 噪音，統一為乾淨的「已完成/未完成」兩表格式。 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ssions) Kill vllm-project#2: E2E 1.41-1.46% at BS=32 (needs ≥1.5%) Kill vllm-project#5: down_proj 0.754x warm, in_proj_qkvz 0.925x warm Catch-22: selective dispatch avoids warm regressions but gives 0.5% E2E; all-shapes dispatch gets 1.46% E2E but has warm regressions. No configuration satisfies both criteria. Translation factor ~0.33 appears to be the ceiling for GEMM-only optimizations on this model/hardware (Qwen3.5-4B, L40S, BF16, TP=1). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- SKILL.md: add Non-Negotiable vllm-project#5 (GPU isolation via CVD prefix), add PostToolUse hook row, add gpu_status.py + gpu_force_clear.py to helper scripts, SHOULD→MUST for session_id, add Resume Protocol step 3b, remove soft-reservation line for debate GPUs - parallel-tracks.md: replace flat GPU table with TP-Aware GPU Partitioning section (exclusive/shared modes, state.json gpu_assignment schema), update spawn prompt templates to kernel_cvd + e2e_cvd format, replace GPU Allocation During Overlap soft-reservation note with CPU-only mandate - ammo-impl-champion.md: replace GPU Coordination section with GPU Usage (CVD prefix pattern, hook auto-management, blocking behavior) - ammo-impl-validator.md: add GPU Usage section (same pattern as champion) Agents no longer need to manually reserve/release GPUs — hooks handle it automatically when commands include CUDA_VISIBLE_DEVICES from spawn prompt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…_like, .contiguous() DecodeKernel.__call__ changes: - query: assert contiguity instead of .contiguous() copy - output: accept caller-provided persistent buffer - wo_global_scale: pass tensor pointer, not .item() float - grid.z: use padded_num_seqs for stable graph capture - gate_ptr/gate_fused: new kernel args for output gate fusion Blockers #1, #2, #3, #4, #5 from the CUDA graphs checklist. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…elete LlamaSolver/Qwen2Solver The old macro-generated Llama/Qwen2 `Model` + `solver_hidden_states` paths (CudaModel::LlamaSolver, CudaModel::Qwen2Solver) were the pre-ferrite-forward solver output. With ferrite-forward's Weights enum + forward now wired up, they're redundant. Delete both. vllm-executor/src/cuda_worker.rs: - Remove LlamaSolver / Qwen2Solver variants + all their match arms (num_layers, num_kv_heads, head_dim, vocab_size, hidden_size, hidden_states, forward). - Dense bf16 Llama always routes to CudaModel::LlamaFerrite (no more VLLM_FERRITE=1 env-var gate). - Dense bf16 Qwen2 routes to the hand-written vllm_cuda::model::qwen2::Qwen2ForCausalLM::load (quant/tp/pp paths it already used) since Qwen2 through ferrite-forward is blocked on HANDOFF Step C (CublasFusedQkvGemmWithBiasImpl, gap vllm-project#5). vllm-cuda/src/model/llama.rs: drop the `ferrite_macros::forward!{}` block at line 3667 + the cutlass_gemm_ffi! FFI declarations that only existed for the old macro's output (lines 3630-3665). vllm-cuda/src/model/qwen2.rs: same — drop the forward!{} block at line 275, its cutlass FFI block, and now-unused imports (attention_helpers, driver, kernels, Embedding, Linear, LinearLayer, RmsNorm, RotaryCache, GpuTensor). vllm-cuda/Cargo.toml: drop the `ferrite-macros` dep (nothing in vllm-cuda consumes it anymore; ferrite-models/src/qwen2.rs still uses it and carries its own dep). Compile-time impact: the two `forward!{}` invocations ran the full old ferrite pipeline (parse → DAG → solver → codegen) for all llama + qwen2 configs on every build, producing ~hundreds of lines of generated Rust. Their removal cuts vllm-cuda's proc-macro time significantly. Verified: cargo check -p vllm-executor --features cuda green in 1.55s (cold: 3:26); fmt + clippy -D warnings clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…retry + no-kill rule Root cause: gpu_reservation.py defaulted session_id to "cli" when CLAUDE_SESSION_ID was unset, causing auto_release=True to silently evict other agents' GPU reservations. Changes: - gpu-pool.md: canonical pattern now includes --session-id and --no-auto-release, explicit retry loop for pool exhaustion, process isolation rules - parallel-tracks.md: spawn prompt uses agent-scoped --session-id {op_id} - SKILL.md: rule vllm-project#5 updated with new flags, retry, and no-kill references - ammo-gpu-release.sh: tighter trigger regex, extracts --session-id from command Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Seven-commit Gemma2 push landed this session. Compiler stays architecture-agnostic — the acid test passed for generality: zero "gemma" strings below the parser, the entire architecture fits in DSL body + JSON configs + library Impls. New sections: - "Next session starts here" rewritten around the Gemma2 state: coherent output on common prompts, known hard-fail on `test_cuda_correctness_gemma2_2b` prompt 7 (a translation prompt with apostrophe-quoted text), ruled-out causes listed (not CUDA graphs, not final softcap, not ferrite-vs-hand-written drift, not the embed-scale/sliding-flip bugs I already fixed). - "Commit history" grew a fresh block of 8 commits. - "Landed since the correctness fix" gained a full item vllm-project#7 for the Gemma2 port, enumerating each sub-piece and its commit. - Gap vllm-project#5 went from "sliding-window / Gemma2" to "Gemma3" (the logical next architecture using the same DSL constructs). - Gap vllm-project#9 repurposed from "per-layer golden diff harness (nice to have)" to "per-layer tensor-dump debug harness" — the concrete tool needed to unblock the prompt-7 failure. Spelled out the approach including the forward_backbone quirk (for Gemma2 it returns pre-softcap logits, not pre-lm_head hidden_states, because the DSL's terminal tile is `tanh_softcap`). - "Already landed" list gained nine new bullets covering the DSL constructs, the scalars mechanism, Gemma2 Impls, kernel `weight_offset` param, Gemma2 DSL + configs, and `Gemma2Ferrite` cuda_worker wiring. - Gap vllm-project#4 (Granite multipliers) updated: the DSL now expresses scalar multiply via the Gemma embed-scale plumbing. Honest accounting of the known-fail: correlation r=0.11 between our post-softcap logits and Python vLLM's on prompt 7. Divergence is upstream of softcap, specific to this prompt's numerics, and likely lives in a shared CUDA kernel (flash-attn softcap + sliding, or cuBLAS gemm on Gemma2's unusual head_dim ratios). Both ferrite and the pre-existing hand-written Gemma2 path exhibit the same failure — the fix, once found, applies to both. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds an env-var-controlled experimental knob for comparing the cost of the host-memory KV reload recovery path against two cheaper alternatives: - reload (default): existing behavior — restore KV from host checkpoint. - restart : drop checkpoint AND already-decoded tokens, fresh re-prefill of original prompt. Stream consistency BROKEN — only for ablation. - reprefill : drop checkpoint, preserve decoded tokens by extending the prompt. Stream consistency preserved. Two patches because the centralized policy (ft_benders_centralized) bypasses RecoveryManager and does its own recovery in ft_client.py: - vllm/v1/engine/ft_client.py (centralized recovery loop) - vllm/v1/core/recovery_manager.py (fault_tolerant + non-centralized ft_benders policies) Default is "reload" so the main experiment paths are unchanged. Validation (W1_Chat/Heavy/F2_Mid, dp=2, 8B, 462 reqs): | mode | goodput | TTFT p50 | SLO viol | failover_gap p50 | |-----------|-----------|-----------|----------|------------------| | No-FT | 321.6 t/s | 406 ms | 20.3 % | – | | reload | 124.2 t/s | 18.6 s | 65.6 % | 8.6 s | | restart | 47.9 t/s | 50.4 s | 87.4 % | 8.6 s | | reprefill | 39.5 t/s | 42.8 s | 88.5 % | 24.6 s | Reload is 2.6-3.1x better goodput than restart/reprefill — KV reload converts recovery cost from compute (the scarce resource on the surviving engine post- fault) to I/O (which is plentiful), avoiding prefill capacity contention. Original hypothesis "reload is overkill" disproved. Full ablation writeup added as Follow-up finding vllm-project#5 in experiments_v2/docs/e1a_quick_diagnosis.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Each of the 16 full_attention layers in Qwen3.5-27B attaches its own PhaseE_Beta_Kernel instance with its own `self._compiled_phase_coop_full = None`, so `cute.compile()` fires once per layer on first request — 16 × ~23 s ≈ ~6 min cold-start stall. Fix: module-level `_PHASE_E_COOP_FULL_COMPILE_CACHE` keyed by the tuple of all 22 `self.` constexprs read inside `_jit_launch_phase_0_to_4` (audited via grep; key covers them all + 12 safe-redundant derived fields). Instances with matching config share one compiled kernel. Evidence (`benchmarks/nvllm/traces/phase_e_1/2026-04-24-coop-compile-cache/`): - 16 β-coop attachments → 1 compile event (was 16). - Cold Q1 = 79.4 s (compile + decode); warm Q2-Q8 = 22.7-23.2 s each. - Projected savings ≈ 310 s (~5 min) shaved off first-request latency. - GSM8K sanity PASS 7/8 (Q2 is a regex-extractor artifact on '120/12', not a kernel regression — reproduces on baseline without this fix). Unit tests (`tests/kernels/cute/test_phase_e_compile_cache.py`): - 6 new tests covering dict existence, key equivalence for matching configs, key distinctness for different configs, 16-instance → 1-compile behavior, distinct-config → N-compiles, and back-compat instance attr population. - 33/33 existing Phase E tests still pass. Next in Phase E.1: #3 record_function spans (this PR), #2 β-coop SMEM shrink + #4 matched-concurrency baseline bench (follow-up session), #5 cudaProfilerApi hook (infra). Base: 7bc5773 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… slot Regression tripwire for the load-bearing assumption in project_tp_design_notes: `Instruction::AllReduce` is one-tile in- place same-shape, so shape-aware coloring must place its output at the input's slot with no `View` row. At runtime that lets `NcclGroup::all_reduce_inplace` operate on the gemm output tile directly — no alias-row preamble in the bucket slice. The generic `coloring_same_shape_alias_collapses_to_source` test already validates the colorer's behavior for any one-tile in-place same-shape Impl, so this is a duplicate hypothesis. The value is: 1. Naming — future grep for "allreduce" in coloring lands here. 2. Realistic shape — `[N=4, H=16]` instead of `[1]` so coloring regressions that interact with `coloring_disjoint_lifetimes_dont _share_across_shapes` (the K=8192 N=5120 cublas-panic-driven test) trip both invariants. 3. Three-node row-parallel chain — gemm → all_reduce → consumer — matches what the upcoming task vllm-project#5 lowering pass will produce. OpKind on the all_reduce node is `OpKind::Add` for now; task vllm-project#5 adds `OpKind::AllReduce` proper alongside the lowering pass. Macro test suite: 210 → 211 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two regression tests assert that `CutlassGemmAddImpl::matches()` and `FusedAddRmsNormImpl::matches()` refuse to claim a fused (Gemm, Add) or (Add, RmsNorm) subgraph across an intermediate tile — the topology the upcoming task vllm-project#5 lowering pass produces by inserting `Instruction::AllReduce` after every row-parallel gemm. Without this property at tp>1 the cutlass_gemm_add epilogue would fuse the residual addition into the gemm's beta=1.0 path, mixing pre-reduce per-rank gemm outputs into the residual stream BEFORE all-reduce reconciled them across ranks → silently wrong math. The "guard" is emergent — `consumes_tile(add, gemm)` returns false once any intermediate node sits between the seed and the candidate fusion partner. Both tests pin that property explicitly so a future matches() change (e.g. following alias chains across in-place ops) can't reintroduce the cross-AllReduce fusion. Models AllReduce as a 1-input Add; `OpKind::AllReduce` proper lands with the lowering pass in task vllm-project#5. The `inputs.len() != 2` filter in CutlassGemmAddImpl::matches() ensures the stub itself isn't mis-claimed as the residual Add. Macro test suite: 211 → 213 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sub-step 1 of task vllm-project#5 (lowering pass). Lands the structural pieces the lowering pass will produce, with no FUF currently carrying `OpKind::AllReduce`: - `classified::OpKind::AllReduce` variant. Deliberately omitted from `OpKind::from_name` so a user can't write `all_reduce(...)` in a `#[forward]` body — the canonical path is the post-FUF lowering pass that injects nodes with this op kind. - `shape::apply_signature` and `shape::weight_arg_ranks` arms. Identity-shape one-input via `sig_unary_elementwise`, no tensor weight inputs. - `AllReduceImpl` in `impl_lib.rs`: single-tile claim of any `OpKind::AllReduce` node with exactly one tile input. `output_alias` declares dst→input collapse so shape-aware coloring places the AllReduce row at the input's slot with no `View` indirection (validated by `coloring_allreduce_collapses_to_input_slot`). Cost model is bandwidth-bound 2x of local memory, falls back to `UNCALIBRATED_COST_US` on profiles missing measured cost data. - Registered in `starter_library()` — at tp=1 the lowering pass emits zero AllReduce FufNodes so the matcher is a no-op in the solver. One unit test pinning the matcher contract (single tile input required, zero/multi rejected) and the output_alias collapse. Macro test suite: 213 → 214 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sub-step 2 of task vllm-project#5. Lands the FUF-lowering pass module + the no-op fast path at tp_world_size=1. Wires `insert_all_reduces` into `compile()` between `fuf::unroll` and `solver::solve`, called with a `tp=1` literal so every existing SolvedModel sees a byte-identical FUF flowing into the solver. Task #5c lands the actual insertion logic + per-arch shard-kind table; task vllm-project#7 swaps the literal for `solved_model.tp_world_size` once canonical fanout exists. Tests: - `lowering_no_allreduce_at_tp_eq_1` — gemm → residual_add FUF, pass at tp=1, asserts node count + OpKind preserved tile-by-tile and zero `OpKind::AllReduce` nodes added. - `lowering_no_allreduce_at_tp_eq_1_empty_fuf` — degenerate case, prevents a future implementation from indexing into nodes[0] without a length check. Macro test suite: 214 → 216 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sub-step 3 of task vllm-project#5. Lands the actual insertion logic + the generic shard-kind name-pattern table. At tp_world_size > 1 the pass: 1. Walks every `OpKind::Gemm` node, looks up its weight in `program.weights`, classifies via `shard_kind_for_weight_path` — `q_proj | k_proj | v_proj | gate_proj | up_proj` → `ShardDim0` (column-parallel, no comm), `o_proj | down_proj` → `ShardDim1` (row-parallel, needs all-reduce), everything else → `Replicate`. 2. For each `ShardDim1` gemm, appends a fresh `OpKind::AllReduce` FufNode with `new_id = nodes.len()`, reading the gemm output as its single tile input. 3. Rewires every OTHER node's `FufInput::Tile { id: gemm_id }` to `FufInput::Tile { id: all_reduce_id }`. After this rewrite, every consumer of the gemm's output sees the post-reduce value — the load-bearing correctness invariant. The shard-kind matcher is generic across the whole ferrite tree since every existing arch (llama / qwen* / mistral / gemma* / commandr / phi3 / granite / deepseek_v* / mixtral) uses the HF-standard names. Per-arch overrides can layer on later when an arch with non-standard names lands; until then keeping the matcher generic avoids per-arch boilerplate. Tests: - `shard_kind_table_covers_hf_standard_names` — explicit table pin so a future per-arch override can't silently break the generic fallback. - `lowering_inserts_allreduce_after_row_parallel_gemm_at_tp_gt_1` — the load-bearing tp>1 assertion: 1 AllReduce inserted, residual add rewired, raw gemm output never read by a downstream consumer. - `lowering_skips_column_parallel_gemms_at_tp_gt_1` — q_proj must not trigger AllReduce; false insertion would clobber the per-rank shape contract. Macro test suite: 216 → 219 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…llel Matches Python vLLM. `VocabParallelEmbedding` shards the embedding table along dim 0 (vocab dim); `ParallelLMHead` inherits from it, so both layers carry the same sharded layout — that's what makes `tie_weights` (lm_head.weight = embed_tokens.weight) self-consistent at tp>1 (ref vllm/model_executor/layers/vocab_parallel_embedding.py lines 186, 500, 553). Adds: - `shard_kind_for_last_segment(&str)` so codegen can lookup without pre-splitting the path. Refactor of the existing match arm. - `shard_kind_for_dotted_prefix(&str)` convenience for codegen, which has the dotted layer-0 prefix string in hand. `dead_code` allow until task vllm-project#5 wires the load helpers; pinned by tests now. - `embed_tokens` and `lm_head` map to `ShardDim0` (was `Replicate`). Behavior at the existing call site (`insert_all_reduces`) is byte-identical: the lowering pass walks only `OpKind::Gemm` nodes filtered by `ShardDim1`. `embed_tokens` is consumed by `Embed`, not `Gemm`. `lm_head` is a Gemm but ShardDim0, not ShardDim1, so the row-parallel filter still excludes it. AllReduce-after-Embed + AllGather-after-lm_head land in task vllm-project#6. Macro tests: 195/195 (was 194/194) under both default `--features cuda` and `--features nccl`. New tests: - `shard_kind_for_dotted_prefix_pins_runtime_paths` covers the load-bearing per-arch paths from `default_required_weights`. - Existing `shard_kind_table_covers_hf_standard_names` updated to expect `ShardDim0` for `embed_tokens` / `lm_head`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…+ Embedding) Tensor-parallel safetensors → GPU loaders matching Python vLLM's ColumnParallelLinear / RowParallelLinear / VocabParallelEmbedding weight_loader semantics. Used by codegen at tp>1 (task vllm-project#5 wires the dispatch from the macro side; this commit lands only the runtime helpers). Added on the kernel-side `LinearLayer` / `Linear` / `Embedding`: - `Linear::load_sharded(weights, prefix, dim, rank, world)`. The load-bearing bias rules: - `dim = 0` (column-parallel: q/k/v/gate/up/lm_head/embed): bias shards along dim 0 too — each rank holds its own slice. Mirrors Python `ColumnParallelLinear.weight_loader` → `loaded_weight.narrow(output_dim=0, …)`. - `dim = 1` (row-parallel: o_proj, down_proj): bias is **replicated full-size on rank 0 only**, `None` on other ranks. The forward path adds bias before the cross-rank AllReduce-sum; only rank 0's contribution survives the sum, giving exactly one bias add to the residual stream. Mirrors Python `RowParallelLinear.forward` line 1543 `bias_ = None if (self.tp_rank > 0 …) else self.bias`. - `LinearLayer::load_dense_sharded(weights, prefix, dim, rank, world)`. Thin wrapper over `Linear::load_sharded`. The codegen entry point for non-fused (single-prefix) sharded loads. - `LinearLayer::load_dense_concat_sharded(weights, prefixes, stream, rank, world)`. Sharded variant of `load_dense_concat` for the fused QKV / gate_up paths. Always column-parallel (no row-parallel concat exists in any current arch). Each source weight slices along dim 0 to `[out_i / world, hidden]` then packs into one contiguous `[(sum out_i) / world, hidden]` GPU buffer via per-source `take_shard_into`. Biases follow the column-parallel rule (sliced along dim 0) — matches Python `MergedColumnParallelLinear` / `QKVParallelLinear`. Per-rank divisibility is guaranteed by the macro's outer-loop fanout `skip` of indivisible (variant, tp) tuples (commit `889c44b2f`). - `Embedding::load_sharded(weights, prefix, rank, world)`. Vocab- parallel: slices the embedding table along dim 0 (`[vocab_size, hidden]` → `[vocab_size / world, hidden]`). Mirrors Python `VocabParallelEmbedding`. Same dim-0 cut as `Linear::load_sharded(dim=0)` — that's what makes `tie_weights(lm_head.weight = embed_tokens.weight)` self-consistent at tp>1. Defers FP8 / Marlin / BNB sharded variants — the verify model (commandr) is dense bf16. World == 1 short-circuits to byte-equivalent behavior with the existing unsharded paths in every helper, plus shard-kind-aware bias rules. No tests added at this layer (CUDA stream + safetensors fixtures aren't worth the infra spend; the integration test is task vllm-project#8). `take_shard` / `take_shard_into` on `GpuWeights` already exist (used by the prior hand-written TP path); these wrappers are pure call-site plumbing on top. Build clean: ferrite-kernels checks + clippy at default features. The macro-side consumer that chooses sharded vs unsharded based on shard_kind comes in tasks vllm-project#4 + vllm-project#5; until then these helpers have no runtime caller (intentionally — wholesale codegen migration per the no_piecemeal_codegen_migration rule). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the `load_layered_*_sharded` wrappers that the codegen macro will route to at tp>1 (task vllm-project#5). Mirrors the existing layered helpers — same per-canonical accessor → single-line-of-expanded fold, just routing through the kernel-side `_sharded` variants landed in `b8ceca55a`. - `load_layered_embedding_sharded(rank, world)` → vocab-parallel `Embedding::load_sharded`. No layered Embedding accessor exists in any current arch; helper is here for symmetry so the codegen can route every layered FieldLoad through a `_sharded` variant uniformly. - `load_layered_linear_dense_sharded(dim, rank, world)` → wraps `LinearLayer::load_dense_sharded`. `dim = 0` for column-parallel (q/k/v/gate/up); `dim = 1` for row-parallel (o/down). Per-dim bias rules live in the kernel-side `Linear::load_sharded`. - `load_layered_linear_dense_concat_sharded(rank, world)` → wraps `LinearLayer::load_dense_concat_sharded`. No `dim` arg — fused QKV / gate_up are always column-parallel. Defers `_sharded` variants of the marlin / fp8 / bnb4 layered helpers — the verify model is dense bf16. Per the `no_codegen_in_proc_macro` rule the helpers stay narrow (one per FieldLoad arm); the macro maps each (variant, dim) → exactly one helper at codegen time. Build clean: `ferrite-forward` checks + clippy at `--features cuda` and `--features nccl`. Helpers have no consumers yet (task vllm-project#5 wires them) but are `pub` at the crate surface so dead-code lint is silent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…a_worker Prerequisite for task vllm-project#5 (codegen-side shard-kind dispatch). The emitted `_sharded` loader calls need `rank` as a runtime value in scope inside the `Weights::load_with` body — `tp_world_size` is already a baked literal per (model, tp), but `rank` is per-process. Touches: - `ferrite_forward::dispatcher::ArchTryLoadFn` + `try_load(..., tp_rank: u8, ...)` — adds rank between `tp_world_size` and `max_model_len` in the public API. Forwarded to the matched registration's closure. - Macro: `inventory::submit!`'s closure body becomes `|gw, stream, max_model_len, tp_rank, hf|` and passes `tp_rank` into `Weights::load(..., #tp_lit, tp_rank)`. The per-arch `Weights::load(...)` and `load_match_arms` body gain the `tp_rank: u8` parameter and forward it to `#model_ident::load(gw, stream, max_model_len, tp_rank)`. - Codegen: per-model `load_with(..., tp_rank: u8)` and `load(..., tp_rank: u8)` (both canonical and Shim variants). `tp_rank` is in scope inside the `load_with` body — task vllm-project#5's emitted `_sharded` calls reference it as `tp_rank as usize`. - `cuda_worker.rs`: passes `self.config.tp_rank as u8` into `try_load`. The `!use_tp` ferrite eligibility gate stays in place — until task vllm-project#5 lands the codegen-side shard-kind dispatch, the per-arch `Weights::load` at tp>1 still calls the unsharded loaders and would shape-mismatch the sharded `<W>::*` constants from `b8c25e5d2`. The gate lifts when vllm-project#5 + vllm-project#6 (embed masking + lowering AllReduce-after-Embed + AllGather-after-lm_head) are wired end-to-end. Macro tests 196/196 under both `--features cuda` and `--features nccl`. Full ferrite-models umbrella build clean at `--features cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in 7m51s. vllm-executor + vllm-cuda + ferrite-models compile clean at `--features cuda` in 1m58s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

WoosukKwon added 4 commits March 3, 2023 04:16

Fix a bug in 1D shape

e5a1fa8

Minor

342275f

Minor

b91a2fa

Test iteration-level scheduling

4db2916

WoosukKwon merged commit 04e5acc into main Mar 6, 2023

WoosukKwon deleted the bugfix branch March 6, 2023 18:05

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request Sep 12, 2023

Merge pull request vllm-project#5 from ri938/more_improvements_awq

73db30f

More improvements awq

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Fix a bug in 1D input shape (vllm-project#5)

c639d4c

luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 14, 2024

Merge pull request vllm-project#5 from slyalin/fixed_parameter_types

30605c8

Align optimum-intel based model signature with vLLM signature

luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024

Merge pull request vllm-project#5 from ilya-lavrenov/update-intel-opt…

8f016e0

…imum Install optimum-intel from latest main

dlopes78 mentioned this pull request May 8, 2024

[Bug]: VLLM + tritonserver #4695

Closed

Starmys pushed a commit to Starmys/vllm that referenced this pull request May 20, 2024

Merge pull request vllm-project#5 from wenxcs/fit-cluster-tests

de23377

Faster v2 hopper fused moe kernel configs

yuhuixu1993 mentioned this pull request Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

oliver-li mentioned this pull request Jul 5, 2024

[Bug]: NCCL hangs and causes timeout #5484

Closed

This was referenced Jul 5, 2024

Support W4A8 quantization for vllm #5218

Merged

[Bug]: call for stack trace for "Watchdog caught collective operation timeout" #6042

Closed

ehuaa mentioned this pull request Jul 19, 2024

[Bug]: The vllm is disconnected after running for some time #5084

Closed

xinzaifeixiang1992 mentioned this pull request Jul 24, 2024

[Bug]: vllm-0.5.3.post1部署Qwen2-72b-instruct-awq模型，刚开始服务正常，但是并发高的时候就报错 #6734

Closed

alixiaodi mentioned this pull request Aug 2, 2024

[Bug]: #7072

Closed

Minami-su mentioned this pull request Aug 11, 2024

[Bug]: vllm is crashed on v0.5.3.post1 #7161

Closed

wangwensuo mentioned this pull request Aug 22, 2024

[Bug]: llama3-405b-fp8 NCCL communication #7775

Closed

zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024

Merge pull request vllm-project#5 from KuntaiDu/kuntai-disagg-refactor

4db6446

Kuntai disagg refactor

JGSweets mentioned this pull request Mar 9, 2026

[Bug]: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. #28028

Open

1 task

LironKesem mentioned this pull request Mar 12, 2026

[Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries #36835

Closed

1 task

lavanyabollepalli mentioned this pull request Mar 12, 2026

[Bug]: GPU failure during repeated model loading when using --enable-prefix-caching with KV transfer (LMCacheConnectorV1) #36852

Open

1 task

IWantFight pushed a commit to IWantFight/vllm that referenced this pull request Mar 12, 2026

Merge pull request vllm-project#5 from IWantFight/bf_ant_group

1888945

fea: support rfork

mahaocong90 mentioned this pull request Mar 17, 2026

[Bug]: QWEN 3.5-397B-A17B report "RPC call to sample_tokens timed out" #37250

Closed

1 task

watch-Ultra mentioned this pull request Mar 18, 2026

[Bug]:推理时报错，模型关闭了。部署的Qwen3.5-122B-A10B-FP8模型 #37392

Open

1 task

RocketRider mentioned this pull request Mar 21, 2026

Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 #37431

Open

lishunyang12 mentioned this pull request Apr 11, 2026

[Attention Backend] TurboQuant: 2-bit KV cache compression with 4x capacity #38479

Merged

7 tasks

haosdent mentioned this pull request Apr 25, 2026

[Feature] TRITON_MLA_SPARSE backend for SM8x/11x/12x DSA Sparse MLA Support #38476

Open

SongXiaoMao mentioned this pull request May 13, 2026

[Bug]: MTP speculative decoding crash with illegal memory access on long sequences (Qwen3.6-27B-FP8, v0.19.1) #40756

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix a bug in 1D input shape#5

Fix a bug in 1D input shape#5
WoosukKwon merged 4 commits into
mainfrom
bugfix

WoosukKwon commented Mar 3, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

WoosukKwon commented Mar 3, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant