Fix a bug in 1D input shape#5
Merged
Merged
Conversation
v1nc3nt27
pushed a commit
to v1nc3nt27/vllm
that referenced
this pull request
Sep 12, 2023
More improvements awq
xiangyuT
added a commit
to xiangyuT/vllm
that referenced
this pull request
Oct 24, 2023
* finish changing scheduler * finish merge * fix model * Fix (vllm-project#5) * fix problems * fix * delete unused params * remove redundant comments --------- Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
hongxiayang
pushed a commit
to hongxiayang/vllm
that referenced
this pull request
Feb 13, 2024
luo-cheng2021
pushed a commit
to luo-cheng2021/vllm
that referenced
this pull request
Mar 14, 2024
Align optimum-intel based model signature with vLLM signature
luo-cheng2021
pushed a commit
to luo-cheng2021/vllm
that referenced
this pull request
Mar 25, 2024
…imum Install optimum-intel from latest main
mzusman
added a commit
to mzusman/vllm
that referenced
this pull request
Apr 16, 2024
* Drop indecies when finish * min 1 attention layer * CG is working on forward pass passing * Remove comments * cosmetics - rename indecies -> indices, organize some whitespaces * Add some TODOs * Adding mamba cache for cg * Remove useless vars from input_metadata * Remove unused import * Set the seqlen offset to boolean * Return only hidden state * Return only hidden states * Add padding to match forward pass bs * Is prompt instead of seqlen offset * Remove mamba cache class (not used) * Another remove * Remove * Use mamba4gc * Fix mamba forward, run update only on non prompt * Use 1 index after the maximal index * Remove import * Remove import * typo * typo * place holder * Padding and empty token takes it from the first empty place * reformat * Apply suggestions from code review Whitespaces --------- Co-authored-by: Mor Zusman <morz@ai21.com> Co-authored-by: Tomer Asida <tomera@ai21.com> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
linxihui
added a commit
to linxihui/vllm
that referenced
this pull request
May 14, 2024
…3small [Model][Kernels] Support Phi3small architecture, blocksparse attnention prefilling kernel, CUDA+Triton paged attn kernels
Starmys
pushed a commit
to Starmys/vllm
that referenced
this pull request
May 20, 2024
Faster v2 hopper fused moe kernel configs
This was referenced Jul 5, 2024
Closed
zeroorhero
pushed a commit
to zeroorhero/vllm
that referenced
this pull request
Sep 23, 2024
Kuntai disagg refactor
1 task
1 task
IWantFight
pushed a commit
to IWantFight/vllm
that referenced
this pull request
Mar 12, 2026
fea: support rfork
1 task
1 task
khairulkabir1661
pushed a commit
to khairulkabir1661/vllm
that referenced
this pull request
Mar 26, 2026
## Summary Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately. **Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743) ## Cherry-picked commits (chronological order) | # | Upstream PR | Jira | Summary | |---|------------|------|---------| | 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models | | 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend | | 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) | | 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) | | 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models | | 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support | | 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) | | 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint | ## Conflict resolutions <details> <summary><b>#1 — llama-nemotron-embed / score-template support (vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary> Applied cleanly onto `rhai/0.13.0`. </details> <details> <summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no conflicts</summary> Applied cleanly onto `rhai/0.13.0`. </details> <details> <summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no conflicts</summary> Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block. </details> <details> <summary><b>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no conflicts</summary> Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes. </details> <details> <summary><b>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2 files</summary> **`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix: - `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result - `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask` **`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes). </details> <details> <summary><b>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict due to file rename + API change</summary> Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream. </details> <details> <summary><b>vllm-project#7 — response_format validation for completions (vllm-project#35456)</b>: Conflict due to file restructuring</summary> Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by: - Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py` - Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py` - Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`) - Test additions from upstream applied cleanly to `test_completion_error.py` </details> <details> <summary><b>vllm-project#8 — response_format validation for chat completions (vllm-project#35510)</b>: Conflict due to file restructuring</summary> Same file restructuring issue as vllm-project#6. Resolved by: - Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py` - Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py` - Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch) </details> ## Test plan - [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support - [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization - [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content - [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow - [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back - [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions` - [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm [INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
roy-shih
referenced
this pull request
in UnieAI/vllm
Mar 31, 2026
逐一 grep 驗證所有已完成項目的整合程式碼確實存在: - #3 spec decode: _batch_precompute_spec_decode() 已在 scheduler.py - #5 builtin hash: 已在 config/cache.py Literal type - vllm-project#15 batch spec decode: _precomputed_spec 快速路徑已在迴圈中 清除 strikethrough 噪音,統一為乾淨的「已完成/未完成」兩表格式。 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jinhuang12
pushed a commit
to jinhuang12/vllm
that referenced
this pull request
Apr 8, 2026
…ssions) Kill vllm-project#2: E2E 1.41-1.46% at BS=32 (needs ≥1.5%) Kill vllm-project#5: down_proj 0.754x warm, in_proj_qkvz 0.925x warm Catch-22: selective dispatch avoids warm regressions but gives 0.5% E2E; all-shapes dispatch gets 1.46% E2E but has warm regressions. No configuration satisfies both criteria. Translation factor ~0.33 appears to be the ceiling for GEMM-only optimizations on this model/hardware (Qwen3.5-4B, L40S, BF16, TP=1). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jinhuang12
pushed a commit
to jinhuang12/vllm
that referenced
this pull request
Apr 8, 2026
- SKILL.md: add Non-Negotiable vllm-project#5 (GPU isolation via CVD prefix), add PostToolUse hook row, add gpu_status.py + gpu_force_clear.py to helper scripts, SHOULD→MUST for session_id, add Resume Protocol step 3b, remove soft-reservation line for debate GPUs - parallel-tracks.md: replace flat GPU table with TP-Aware GPU Partitioning section (exclusive/shared modes, state.json gpu_assignment schema), update spawn prompt templates to kernel_cvd + e2e_cvd format, replace GPU Allocation During Overlap soft-reservation note with CPU-only mandate - ammo-impl-champion.md: replace GPU Coordination section with GPU Usage (CVD prefix pattern, hook auto-management, blocking behavior) - ammo-impl-validator.md: add GPU Usage section (same pattern as champion) Agents no longer need to manually reserve/release GPUs — hooks handle it automatically when commands include CUDA_VISIBLE_DEVICES from spawn prompt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7 tasks
Natfii
referenced
this pull request
in Navi-AI-Lab/nvllm
Apr 14, 2026
…_like, .contiguous() DecodeKernel.__call__ changes: - query: assert contiguity instead of .contiguous() copy - output: accept caller-provided persistent buffer - wo_global_scale: pass tensor pointer, not .item() float - grid.z: use padded_num_seqs for stable graph capture - gate_ptr/gate_fused: new kernel args for output gate fusion Blockers #1, #2, #3, #4, #5 from the CUDA graphs checklist. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 15, 2026
…elete LlamaSolver/Qwen2Solver The old macro-generated Llama/Qwen2 `Model` + `solver_hidden_states` paths (CudaModel::LlamaSolver, CudaModel::Qwen2Solver) were the pre-ferrite-forward solver output. With ferrite-forward's Weights enum + forward now wired up, they're redundant. Delete both. vllm-executor/src/cuda_worker.rs: - Remove LlamaSolver / Qwen2Solver variants + all their match arms (num_layers, num_kv_heads, head_dim, vocab_size, hidden_size, hidden_states, forward). - Dense bf16 Llama always routes to CudaModel::LlamaFerrite (no more VLLM_FERRITE=1 env-var gate). - Dense bf16 Qwen2 routes to the hand-written vllm_cuda::model::qwen2::Qwen2ForCausalLM::load (quant/tp/pp paths it already used) since Qwen2 through ferrite-forward is blocked on HANDOFF Step C (CublasFusedQkvGemmWithBiasImpl, gap vllm-project#5). vllm-cuda/src/model/llama.rs: drop the `ferrite_macros::forward!{}` block at line 3667 + the cutlass_gemm_ffi! FFI declarations that only existed for the old macro's output (lines 3630-3665). vllm-cuda/src/model/qwen2.rs: same — drop the forward!{} block at line 275, its cutlass FFI block, and now-unused imports (attention_helpers, driver, kernels, Embedding, Linear, LinearLayer, RmsNorm, RotaryCache, GpuTensor). vllm-cuda/Cargo.toml: drop the `ferrite-macros` dep (nothing in vllm-cuda consumes it anymore; ferrite-models/src/qwen2.rs still uses it and carries its own dep). Compile-time impact: the two `forward!{}` invocations ran the full old ferrite pipeline (parse → DAG → solver → codegen) for all llama + qwen2 configs on every build, producing ~hundreds of lines of generated Rust. Their removal cuts vllm-cuda's proc-macro time significantly. Verified: cargo check -p vllm-executor --features cuda green in 1.55s (cold: 3:26); fmt + clippy -D warnings clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jinhuang12
pushed a commit
to jinhuang12/vllm
that referenced
this pull request
Apr 15, 2026
…retry + no-kill rule
Root cause: gpu_reservation.py defaulted session_id to "cli" when CLAUDE_SESSION_ID
was unset, causing auto_release=True to silently evict other agents' GPU reservations.
Changes:
- gpu-pool.md: canonical pattern now includes --session-id and --no-auto-release,
explicit retry loop for pool exhaustion, process isolation rules
- parallel-tracks.md: spawn prompt uses agent-scoped --session-id {op_id}
- SKILL.md: rule vllm-project#5 updated with new flags, retry, and no-kill references
- ammo-gpu-release.sh: tighter trigger regex, extracts --session-id from command
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 16, 2026
Seven-commit Gemma2 push landed this session. Compiler stays architecture-agnostic — the acid test passed for generality: zero "gemma" strings below the parser, the entire architecture fits in DSL body + JSON configs + library Impls. New sections: - "Next session starts here" rewritten around the Gemma2 state: coherent output on common prompts, known hard-fail on `test_cuda_correctness_gemma2_2b` prompt 7 (a translation prompt with apostrophe-quoted text), ruled-out causes listed (not CUDA graphs, not final softcap, not ferrite-vs-hand-written drift, not the embed-scale/sliding-flip bugs I already fixed). - "Commit history" grew a fresh block of 8 commits. - "Landed since the correctness fix" gained a full item vllm-project#7 for the Gemma2 port, enumerating each sub-piece and its commit. - Gap vllm-project#5 went from "sliding-window / Gemma2" to "Gemma3" (the logical next architecture using the same DSL constructs). - Gap vllm-project#9 repurposed from "per-layer golden diff harness (nice to have)" to "per-layer tensor-dump debug harness" — the concrete tool needed to unblock the prompt-7 failure. Spelled out the approach including the forward_backbone quirk (for Gemma2 it returns pre-softcap logits, not pre-lm_head hidden_states, because the DSL's terminal tile is `tanh_softcap`). - "Already landed" list gained nine new bullets covering the DSL constructs, the scalars mechanism, Gemma2 Impls, kernel `weight_offset` param, Gemma2 DSL + configs, and `Gemma2Ferrite` cuda_worker wiring. - Gap vllm-project#4 (Granite multipliers) updated: the DSL now expresses scalar multiply via the Gemma embed-scale plumbing. Honest accounting of the known-fail: correlation r=0.11 between our post-softcap logits and Python vLLM's on prompt 7. Divergence is upstream of softcap, specific to this prompt's numerics, and likely lives in a shared CUDA kernel (flash-attn softcap + sliding, or cuBLAS gemm on Gemma2's unusual head_dim ratios). Both ferrite and the pre-existing hand-written Gemma2 path exhibit the same failure — the fix, once found, applies to both. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JlPang863
added a commit
to yizhongzoe-cloud/my-vllm-serving-system
that referenced
this pull request
Apr 22, 2026
Adds an env-var-controlled experimental knob for comparing the cost of the
host-memory KV reload recovery path against two cheaper alternatives:
- reload (default): existing behavior — restore KV from host checkpoint.
- restart : drop checkpoint AND already-decoded tokens, fresh
re-prefill of original prompt. Stream consistency
BROKEN — only for ablation.
- reprefill : drop checkpoint, preserve decoded tokens by
extending the prompt. Stream consistency preserved.
Two patches because the centralized policy (ft_benders_centralized) bypasses
RecoveryManager and does its own recovery in ft_client.py:
- vllm/v1/engine/ft_client.py (centralized recovery loop)
- vllm/v1/core/recovery_manager.py (fault_tolerant + non-centralized
ft_benders policies)
Default is "reload" so the main experiment paths are unchanged.
Validation (W1_Chat/Heavy/F2_Mid, dp=2, 8B, 462 reqs):
| mode | goodput | TTFT p50 | SLO viol | failover_gap p50 |
|-----------|-----------|-----------|----------|------------------|
| No-FT | 321.6 t/s | 406 ms | 20.3 % | – |
| reload | 124.2 t/s | 18.6 s | 65.6 % | 8.6 s |
| restart | 47.9 t/s | 50.4 s | 87.4 % | 8.6 s |
| reprefill | 39.5 t/s | 42.8 s | 88.5 % | 24.6 s |
Reload is 2.6-3.1x better goodput than restart/reprefill — KV reload converts
recovery cost from compute (the scarce resource on the surviving engine post-
fault) to I/O (which is plentiful), avoiding prefill capacity contention.
Original hypothesis "reload is overkill" disproved.
Full ablation writeup added as Follow-up finding vllm-project#5 in
experiments_v2/docs/e1a_quick_diagnosis.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Natfii
referenced
this pull request
in Navi-AI-Lab/nvllm
Apr 25, 2026
Each of the 16 full_attention layers in Qwen3.5-27B attaches its own PhaseE_Beta_Kernel instance with its own `self._compiled_phase_coop_full = None`, so `cute.compile()` fires once per layer on first request — 16 × ~23 s ≈ ~6 min cold-start stall. Fix: module-level `_PHASE_E_COOP_FULL_COMPILE_CACHE` keyed by the tuple of all 22 `self.` constexprs read inside `_jit_launch_phase_0_to_4` (audited via grep; key covers them all + 12 safe-redundant derived fields). Instances with matching config share one compiled kernel. Evidence (`benchmarks/nvllm/traces/phase_e_1/2026-04-24-coop-compile-cache/`): - 16 β-coop attachments → 1 compile event (was 16). - Cold Q1 = 79.4 s (compile + decode); warm Q2-Q8 = 22.7-23.2 s each. - Projected savings ≈ 310 s (~5 min) shaved off first-request latency. - GSM8K sanity PASS 7/8 (Q2 is a regex-extractor artifact on '120/12', not a kernel regression — reproduces on baseline without this fix). Unit tests (`tests/kernels/cute/test_phase_e_compile_cache.py`): - 6 new tests covering dict existence, key equivalence for matching configs, key distinctness for different configs, 16-instance → 1-compile behavior, distinct-config → N-compiles, and back-compat instance attr population. - 33/33 existing Phase E tests still pass. Next in Phase E.1: #3 record_function spans (this PR), #2 β-coop SMEM shrink + #4 matched-concurrency baseline bench (follow-up session), #5 cudaProfilerApi hook (infra). Base: 7bc5773 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
… slot Regression tripwire for the load-bearing assumption in project_tp_design_notes: `Instruction::AllReduce` is one-tile in- place same-shape, so shape-aware coloring must place its output at the input's slot with no `View` row. At runtime that lets `NcclGroup::all_reduce_inplace` operate on the gemm output tile directly — no alias-row preamble in the bucket slice. The generic `coloring_same_shape_alias_collapses_to_source` test already validates the colorer's behavior for any one-tile in-place same-shape Impl, so this is a duplicate hypothesis. The value is: 1. Naming — future grep for "allreduce" in coloring lands here. 2. Realistic shape — `[N=4, H=16]` instead of `[1]` so coloring regressions that interact with `coloring_disjoint_lifetimes_dont _share_across_shapes` (the K=8192 N=5120 cublas-panic-driven test) trip both invariants. 3. Three-node row-parallel chain — gemm → all_reduce → consumer — matches what the upcoming task vllm-project#5 lowering pass will produce. OpKind on the all_reduce node is `OpKind::Add` for now; task vllm-project#5 adds `OpKind::AllReduce` proper alongside the lowering pass. Macro test suite: 210 → 211 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
Two regression tests assert that `CutlassGemmAddImpl::matches()` and `FusedAddRmsNormImpl::matches()` refuse to claim a fused (Gemm, Add) or (Add, RmsNorm) subgraph across an intermediate tile — the topology the upcoming task vllm-project#5 lowering pass produces by inserting `Instruction::AllReduce` after every row-parallel gemm. Without this property at tp>1 the cutlass_gemm_add epilogue would fuse the residual addition into the gemm's beta=1.0 path, mixing pre-reduce per-rank gemm outputs into the residual stream BEFORE all-reduce reconciled them across ranks → silently wrong math. The "guard" is emergent — `consumes_tile(add, gemm)` returns false once any intermediate node sits between the seed and the candidate fusion partner. Both tests pin that property explicitly so a future matches() change (e.g. following alias chains across in-place ops) can't reintroduce the cross-AllReduce fusion. Models AllReduce as a 1-input Add; `OpKind::AllReduce` proper lands with the lowering pass in task vllm-project#5. The `inputs.len() != 2` filter in CutlassGemmAddImpl::matches() ensures the stub itself isn't mis-claimed as the residual Add. Macro test suite: 211 → 213 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
Sub-step 1 of task vllm-project#5 (lowering pass). Lands the structural pieces the lowering pass will produce, with no FUF currently carrying `OpKind::AllReduce`: - `classified::OpKind::AllReduce` variant. Deliberately omitted from `OpKind::from_name` so a user can't write `all_reduce(...)` in a `#[forward]` body — the canonical path is the post-FUF lowering pass that injects nodes with this op kind. - `shape::apply_signature` and `shape::weight_arg_ranks` arms. Identity-shape one-input via `sig_unary_elementwise`, no tensor weight inputs. - `AllReduceImpl` in `impl_lib.rs`: single-tile claim of any `OpKind::AllReduce` node with exactly one tile input. `output_alias` declares dst→input collapse so shape-aware coloring places the AllReduce row at the input's slot with no `View` indirection (validated by `coloring_allreduce_collapses_to_input_slot`). Cost model is bandwidth-bound 2x of local memory, falls back to `UNCALIBRATED_COST_US` on profiles missing measured cost data. - Registered in `starter_library()` — at tp=1 the lowering pass emits zero AllReduce FufNodes so the matcher is a no-op in the solver. One unit test pinning the matcher contract (single tile input required, zero/multi rejected) and the output_alias collapse. Macro test suite: 213 → 214 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
Sub-step 2 of task vllm-project#5. Lands the FUF-lowering pass module + the no-op fast path at tp_world_size=1. Wires `insert_all_reduces` into `compile()` between `fuf::unroll` and `solver::solve`, called with a `tp=1` literal so every existing SolvedModel sees a byte-identical FUF flowing into the solver. Task #5c lands the actual insertion logic + per-arch shard-kind table; task vllm-project#7 swaps the literal for `solved_model.tp_world_size` once canonical fanout exists. Tests: - `lowering_no_allreduce_at_tp_eq_1` — gemm → residual_add FUF, pass at tp=1, asserts node count + OpKind preserved tile-by-tile and zero `OpKind::AllReduce` nodes added. - `lowering_no_allreduce_at_tp_eq_1_empty_fuf` — degenerate case, prevents a future implementation from indexing into nodes[0] without a length check. Macro test suite: 214 → 216 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
Sub-step 3 of task vllm-project#5. Lands the actual insertion logic + the generic shard-kind name-pattern table. At tp_world_size > 1 the pass: 1. Walks every `OpKind::Gemm` node, looks up its weight in `program.weights`, classifies via `shard_kind_for_weight_path` — `q_proj | k_proj | v_proj | gate_proj | up_proj` → `ShardDim0` (column-parallel, no comm), `o_proj | down_proj` → `ShardDim1` (row-parallel, needs all-reduce), everything else → `Replicate`. 2. For each `ShardDim1` gemm, appends a fresh `OpKind::AllReduce` FufNode with `new_id = nodes.len()`, reading the gemm output as its single tile input. 3. Rewires every OTHER node's `FufInput::Tile { id: gemm_id }` to `FufInput::Tile { id: all_reduce_id }`. After this rewrite, every consumer of the gemm's output sees the post-reduce value — the load-bearing correctness invariant. The shard-kind matcher is generic across the whole ferrite tree since every existing arch (llama / qwen* / mistral / gemma* / commandr / phi3 / granite / deepseek_v* / mixtral) uses the HF-standard names. Per-arch overrides can layer on later when an arch with non-standard names lands; until then keeping the matcher generic avoids per-arch boilerplate. Tests: - `shard_kind_table_covers_hf_standard_names` — explicit table pin so a future per-arch override can't silently break the generic fallback. - `lowering_inserts_allreduce_after_row_parallel_gemm_at_tp_gt_1` — the load-bearing tp>1 assertion: 1 AllReduce inserted, residual add rewired, raw gemm output never read by a downstream consumer. - `lowering_skips_column_parallel_gemms_at_tp_gt_1` — q_proj must not trigger AllReduce; false insertion would clobber the per-rank shape contract. Macro test suite: 216 → 219 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
…llel Matches Python vLLM. `VocabParallelEmbedding` shards the embedding table along dim 0 (vocab dim); `ParallelLMHead` inherits from it, so both layers carry the same sharded layout — that's what makes `tie_weights` (lm_head.weight = embed_tokens.weight) self-consistent at tp>1 (ref vllm/model_executor/layers/vocab_parallel_embedding.py lines 186, 500, 553). Adds: - `shard_kind_for_last_segment(&str)` so codegen can lookup without pre-splitting the path. Refactor of the existing match arm. - `shard_kind_for_dotted_prefix(&str)` convenience for codegen, which has the dotted layer-0 prefix string in hand. `dead_code` allow until task vllm-project#5 wires the load helpers; pinned by tests now. - `embed_tokens` and `lm_head` map to `ShardDim0` (was `Replicate`). Behavior at the existing call site (`insert_all_reduces`) is byte-identical: the lowering pass walks only `OpKind::Gemm` nodes filtered by `ShardDim1`. `embed_tokens` is consumed by `Embed`, not `Gemm`. `lm_head` is a Gemm but ShardDim0, not ShardDim1, so the row-parallel filter still excludes it. AllReduce-after-Embed + AllGather-after-lm_head land in task vllm-project#6. Macro tests: 195/195 (was 194/194) under both default `--features cuda` and `--features nccl`. New tests: - `shard_kind_for_dotted_prefix_pins_runtime_paths` covers the load-bearing per-arch paths from `default_required_weights`. - Existing `shard_kind_table_covers_hf_standard_names` updated to expect `ShardDim0` for `embed_tokens` / `lm_head`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
…+ Embedding) Tensor-parallel safetensors → GPU loaders matching Python vLLM's ColumnParallelLinear / RowParallelLinear / VocabParallelEmbedding weight_loader semantics. Used by codegen at tp>1 (task vllm-project#5 wires the dispatch from the macro side; this commit lands only the runtime helpers). Added on the kernel-side `LinearLayer` / `Linear` / `Embedding`: - `Linear::load_sharded(weights, prefix, dim, rank, world)`. The load-bearing bias rules: - `dim = 0` (column-parallel: q/k/v/gate/up/lm_head/embed): bias shards along dim 0 too — each rank holds its own slice. Mirrors Python `ColumnParallelLinear.weight_loader` → `loaded_weight.narrow(output_dim=0, …)`. - `dim = 1` (row-parallel: o_proj, down_proj): bias is **replicated full-size on rank 0 only**, `None` on other ranks. The forward path adds bias before the cross-rank AllReduce-sum; only rank 0's contribution survives the sum, giving exactly one bias add to the residual stream. Mirrors Python `RowParallelLinear.forward` line 1543 `bias_ = None if (self.tp_rank > 0 …) else self.bias`. - `LinearLayer::load_dense_sharded(weights, prefix, dim, rank, world)`. Thin wrapper over `Linear::load_sharded`. The codegen entry point for non-fused (single-prefix) sharded loads. - `LinearLayer::load_dense_concat_sharded(weights, prefixes, stream, rank, world)`. Sharded variant of `load_dense_concat` for the fused QKV / gate_up paths. Always column-parallel (no row-parallel concat exists in any current arch). Each source weight slices along dim 0 to `[out_i / world, hidden]` then packs into one contiguous `[(sum out_i) / world, hidden]` GPU buffer via per-source `take_shard_into`. Biases follow the column-parallel rule (sliced along dim 0) — matches Python `MergedColumnParallelLinear` / `QKVParallelLinear`. Per-rank divisibility is guaranteed by the macro's outer-loop fanout `skip` of indivisible (variant, tp) tuples (commit `889c44b2f`). - `Embedding::load_sharded(weights, prefix, rank, world)`. Vocab- parallel: slices the embedding table along dim 0 (`[vocab_size, hidden]` → `[vocab_size / world, hidden]`). Mirrors Python `VocabParallelEmbedding`. Same dim-0 cut as `Linear::load_sharded(dim=0)` — that's what makes `tie_weights(lm_head.weight = embed_tokens.weight)` self-consistent at tp>1. Defers FP8 / Marlin / BNB sharded variants — the verify model (commandr) is dense bf16. World == 1 short-circuits to byte-equivalent behavior with the existing unsharded paths in every helper, plus shard-kind-aware bias rules. No tests added at this layer (CUDA stream + safetensors fixtures aren't worth the infra spend; the integration test is task vllm-project#8). `take_shard` / `take_shard_into` on `GpuWeights` already exist (used by the prior hand-written TP path); these wrappers are pure call-site plumbing on top. Build clean: ferrite-kernels checks + clippy at default features. The macro-side consumer that chooses sharded vs unsharded based on shard_kind comes in tasks vllm-project#4 + vllm-project#5; until then these helpers have no runtime caller (intentionally — wholesale codegen migration per the no_piecemeal_codegen_migration rule). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
Adds the `load_layered_*_sharded` wrappers that the codegen macro will route to at tp>1 (task vllm-project#5). Mirrors the existing layered helpers — same per-canonical accessor → single-line-of-expanded fold, just routing through the kernel-side `_sharded` variants landed in `b8ceca55a`. - `load_layered_embedding_sharded(rank, world)` → vocab-parallel `Embedding::load_sharded`. No layered Embedding accessor exists in any current arch; helper is here for symmetry so the codegen can route every layered FieldLoad through a `_sharded` variant uniformly. - `load_layered_linear_dense_sharded(dim, rank, world)` → wraps `LinearLayer::load_dense_sharded`. `dim = 0` for column-parallel (q/k/v/gate/up); `dim = 1` for row-parallel (o/down). Per-dim bias rules live in the kernel-side `Linear::load_sharded`. - `load_layered_linear_dense_concat_sharded(rank, world)` → wraps `LinearLayer::load_dense_concat_sharded`. No `dim` arg — fused QKV / gate_up are always column-parallel. Defers `_sharded` variants of the marlin / fp8 / bnb4 layered helpers — the verify model is dense bf16. Per the `no_codegen_in_proc_macro` rule the helpers stay narrow (one per FieldLoad arm); the macro maps each (variant, dim) → exactly one helper at codegen time. Build clean: `ferrite-forward` checks + clippy at `--features cuda` and `--features nccl`. Helpers have no consumers yet (task vllm-project#5 wires them) but are `pub` at the crate surface so dead-code lint is silent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
…a_worker Prerequisite for task vllm-project#5 (codegen-side shard-kind dispatch). The emitted `_sharded` loader calls need `rank` as a runtime value in scope inside the `Weights::load_with` body — `tp_world_size` is already a baked literal per (model, tp), but `rank` is per-process. Touches: - `ferrite_forward::dispatcher::ArchTryLoadFn` + `try_load(..., tp_rank: u8, ...)` — adds rank between `tp_world_size` and `max_model_len` in the public API. Forwarded to the matched registration's closure. - Macro: `inventory::submit!`'s closure body becomes `|gw, stream, max_model_len, tp_rank, hf|` and passes `tp_rank` into `Weights::load(..., #tp_lit, tp_rank)`. The per-arch `Weights::load(...)` and `load_match_arms` body gain the `tp_rank: u8` parameter and forward it to `#model_ident::load(gw, stream, max_model_len, tp_rank)`. - Codegen: per-model `load_with(..., tp_rank: u8)` and `load(..., tp_rank: u8)` (both canonical and Shim variants). `tp_rank` is in scope inside the `load_with` body — task vllm-project#5's emitted `_sharded` calls reference it as `tp_rank as usize`. - `cuda_worker.rs`: passes `self.config.tp_rank as u8` into `try_load`. The `!use_tp` ferrite eligibility gate stays in place — until task vllm-project#5 lands the codegen-side shard-kind dispatch, the per-arch `Weights::load` at tp>1 still calls the unsharded loaders and would shape-mismatch the sharded `<W>::*` constants from `b8c25e5d2`. The gate lifts when vllm-project#5 + vllm-project#6 (embed masking + lowering AllReduce-after-Embed + AllGather-after-lm_head) are wired end-to-end. Macro tests 196/196 under both `--features cuda` and `--features nccl`. Full ferrite-models umbrella build clean at `--features cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in 7m51s. vllm-executor + vllm-cuda + ferrite-models compile clean at `--features cuda` in 1m58s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR fixes a miscalculation of the input shape when iteration-level scheduling is used.