Automatically configure KV cache size #6
Merged
xiangyuT
added a commit
to xiangyuT/vllm
that referenced
this pull request
Oct 24, 2023
* finish changing scheduler
* finish merge
* fix model
* Fix (vllm-project#5)
* fix problems
* fix
* delete unused params
* remove redundant comments

---------

Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
hongxiayang
pushed a commit
to hongxiayang/vllm
that referenced
this pull request
Feb 13, 2024
slyalin
pushed a commit
to slyalin/vllm
that referenced
this pull request
Mar 21, 2024
Add missing Python requirements
mzusman
added a commit
to mzusman/vllm
that referenced
this pull request
Apr 16, 2024
Co-authored-by: Mor Zusman <morz@ai21.com>
dtrifiro
referenced
this pull request
in dtrifiro/vllm
Apr 26, 2024
[CI/Build] Dockerfile.ubi : Remove test stage
Starmys
pushed a commit
to Starmys/vllm
that referenced
this pull request
May 20, 2024
FP8 on A100 for PHIMOE
khairulkabir1661
pushed a commit
to khairulkabir1661/vllm
that referenced
this pull request
Mar 26, 2026
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support (vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block.

</details>

<details>
<summary><b>#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.

</details>

<details>
<summary><b>#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:

- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).

</details>

<details>
<summary><b>#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream.

</details>

<details>
<summary><b>#7 — response_format validation for completions (vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to `test_completion_error.py`

</details>

<details>
<summary><b>#8 — response_format validation for chat completions (vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as #6. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch)

</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm

[INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
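The int64 cast in the fused MoE fix matters because the product of a token offset and a stride can exceed the signed 32-bit range on large models. The sketch below only illustrates that arithmetic; it emulates 32-bit wraparound with `ctypes.c_int32`, and the offset and stride values are illustrative assumptions, not numbers taken from the PR or from the Triton kernel.

```python
import ctypes

INT32_MAX = 2**31 - 1


def mul_as_int32(a: int, b: int) -> int:
    """Multiply and keep only the low 32 bits, as a signed 32-bit register would."""
    return ctypes.c_int32(a * b).value


def mul_as_int64(a: int, b: int) -> int:
    """Multiply with 64-bit signed semantics."""
    return ctypes.c_int64(a * b).value


# Hypothetical values: a flattened token offset and a row stride in elements.
offs_token = 600_000
stride = 4096

wrapped = mul_as_int32(offs_token, stride)  # wraps to a negative address offset
exact = mul_as_int64(offs_token, stride)    # the offset the kernel actually needs

print(f"int32: {wrapped}, int64: {exact}, int32 max: {INT32_MAX}")
```

Casting one operand to 64 bits before the multiplication, which is what the `.to(tl.int64)` calls described above do, keeps the product exact.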
Damon-Salvetore
pushed a commit
to Damon-Salvetore/vllm
that referenced
this pull request
Mar 31, 2026
…rk-slidesparse Add comprehensive SlideSparse integration documentation for vLLM
googlercolin
referenced
this pull request
in hyscale-lab/vllm-thought-eviction
Apr 7, 2026
- wrap_stream async generator middleware with finally cleanup
- _accumulate: differential L2 norms, reasoning content extraction, offset computation
- _maybe_schedule_cycle: time-based and token-based triggers via asyncio.create_task
- _run_eviction_cycle: guard conditions (ENG-09, ENG-10, Pitfall #6), strategy dispatch
- Reasoning-relative to absolute offset conversion (D-05)
- merge_overlapping_ranges + apply_retention_window + align_ranges_to_blocks pipeline
- engine_client.update_request_mask call (D-04)
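The range pipeline named in this commit can be pictured with a small sketch. The two functions below reuse the names from the commit message, but their bodies are hypothetical: the half-open `(start, end)` representation, merging of touching ranges, and inward alignment (keeping only blocks fully covered by a range) are all assumptions, and `apply_retention_window` is left out because the message does not describe its semantics.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open token range [start, end); an assumption


def merge_overlapping_ranges(ranges: List[Range]) -> List[Range]:
    """Merge ranges that overlap or touch, returning them sorted by start."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def align_ranges_to_blocks(ranges: List[Range], block_size: int) -> List[Range]:
    """Shrink each range inward to block boundaries; drop ranges covering no full block."""
    aligned: List[Range] = []
    for start, end in ranges:
        aligned_start = -(-start // block_size) * block_size  # round up
        aligned_end = (end // block_size) * block_size        # round down
        if aligned_start < aligned_end:
            aligned.append((aligned_start, aligned_end))
    return aligned


merged = merge_overlapping_ranges([(5, 20), (0, 10), (30, 40)])
aligned = align_ranges_to_blocks(merged, block_size=16)
print(merged, aligned)
```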
googlercolin
referenced
this pull request
in hyscale-lab/vllm-thought-eviction
Apr 7, 2026
- 21 tests covering accumulation, guard conditions, cycle scheduling, passthrough
- Tests: ENG-09, ENG-10, Pitfall #6, D-05 offset, ENG-06 permanent ranges, ENG-07 isolation
- Fix: apply_retention_window only when floor > 0 to avoid discarding all ranges
- Fix: used asyncio.run() for async tests (no pytest-asyncio installed)
xinyu-intel
pushed a commit
to xinyu-intel/vllm
that referenced
this pull request
Apr 11, 2026
…-spec fix(turboquant): TQFullAttentionSpec Python page size override
carlosfundora
pushed a commit
to carlosfundora/vllm-1-bit-turbo
that referenced
this pull request
Apr 13, 2026
[codex][Kernel] TurboQuant SM86 support, TP metadata slicing, and docs
Natfii
referenced
this pull request
in Navi-AI-Lab/nvllm
Apr 14, 2026
…impl

CutePagedAttentionImpl becomes a pipeline state object:
- bind_fusion_weights() stores static weights + allocates persistent I/O buffers with fixed addresses (graph-safe)
- forward() reads from self instead of per-forward side-channels
- gate_buf added for output gate fusion (Qwen3NextAttention)

Blockers #6, #7, #8 from the CUDA graphs checklist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wkin-t
pushed a commit
to wkin-t/vllm
that referenced
this pull request
Apr 15, 2026
Root cause: k_cached_trim is float16 (Triton kernel output) while qdtype is bfloat16. The .to(qdtype) call on a (96K, 2, 512) tensor creates a 188 MB copy. Allocating k_full then v_full sequentially while k_full (188 MB) remains live requires ~376 MB new allocation, but RTX 4090 has only ~260 MB headroom -> OOM.

Fix: branch on D <= _FA_MAX_HEAD_DIM BEFORE concatenating k/v:
- FA2 path (D=256, SWA layers): keep full concat; flash_attn requires contiguous tensors and SWA footprint is within budget.
- Chunked-softmax path (D=512, global layers): read k_cached_trim and v_cached_trim per CHUNK_K rows with lazy dtype conversion. Peak extra allocation stays < 16 MB regardless of context length.

K/V segment routing in inner loop:
  [0, cached_len) -> k_cached_trim[ki:kj].to(qdtype) (cache)
  [cached_len, seq_len) -> key_chunk[off:off+cs_k] (current)
  boundary -> torch.cat (at most once, CHUNK_K rows)

Result: 96K context now works on RTX 4090. Previous 64K limit was from the earlier fix (bug vllm-project#6); this fix extends it to the full KV cache capacity (~109K tokens with turboquant_3bit_nc).

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
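The chunked-softmax path above avoids materializing the full K/V tensors by visiting keys and values `CHUNK_K` rows at a time. The commit does not show its recurrence, so the sketch below assumes the standard online-softmax formulation (running maximum, running normalizer, rescaled accumulator) for a single query vector in pure Python. The function names and the `chunk_k` parameter are illustrative, not the commit's code.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_full(q: Vector, keys: List[Vector], values: List[Vector]) -> List[float]:
    """Reference: softmax over all scores at once."""
    scores = [_dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_chunked(q: Vector, keys: List[Vector], values: List[Vector],
                      chunk_k: int) -> List[float]:
    """Same result, but only chunk_k rows of K/V are touched per step."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk_k):
        k_chunk = keys[start:start + chunk_k]
        v_chunk = values[start:start + chunk_k]
        scores = [_dot(q, k) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first chunk).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(len(acc)):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = attention_full(q, keys, values)
chunked = attention_chunked(q, keys, values, chunk_k=2)
```

Because each step only needs the current chunk, any dtype conversion can be applied per chunk, which is the property the commit relies on to keep peak extra allocation bounded.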
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 15, 2026
Task vllm-project#6. The #[forward] macro now emits, for each model config, a pub Layer + pub Model struct and an unsafe Model::load that reads weights from ferrite-cuda-core's GpuWeights holder. Data-driven from the classified DSL program — no arch-specific fields or load patterns in the compiler.

Emission logic (new: codegen_model.rs):
- Walks the classified Program to classify every WeightId:
  * per-layer vs global: any Expr::Weight{index: Some(_)} → per-layer
  * runtime type: derived from the op the weight flows into as an arg — embed(ids, W) → Embedding, rmsnorm(x, W) → RmsNorm, gemm(x, W) → Linear. A weight used in incompatible positions is a hard error.
- Layer struct: one field per per-layer weight, typed as the corresponding ferrite-kernels layer. Model struct: Vec<Layer> + one field per global weight.
- Model::NUM_LAYERS baked in from the model's num_hidden_layers config field. Model::load walks 0..NUM_LAYERS, loading per-layer weights via format!("model.layers.{i}.<path>") prefixes. Global weights loaded via weight_conventions::hf_weight_prefix.
- Tie-word-embeddings: if the safetensors file doesn't contain lm_head.weight, Model::load aliases embed_tokens.weight as lm_head (GpuTensor: Copy). Runtime decision, not a compile flag.

HF convention:
- weight_conventions::hf_weight_prefix(path, layer) returns the HF safetensors prefix for a DSL weight reference. Encodes:
  * per-layer → "model.layers.{i}.{dotted}"
  * global under backbone → "model.{dotted}"
  * lm_head → "lm_head" (HF puts it at the root, not under model.)
  Tests cover all three cases.

Feature gating:
- Emitted Model/Layer/Model::load are #[cfg(feature = "cuda")]. Without cuda, ferrite-kernels layer types aren't in scope, so the types can't be referenced. This mirrors how ferrite-models gates its `pub mod llama` today. Consumers without cuda still get the numeric pipeline observations (NUM_TILES etc).
- ferrite-forward crate grew a cuda feature that forwards to ferrite-kernels/cuda.

Tests:
- codegen_model::tests: llama_body_classification proves per-layer/global + runtime type assignment are right for a representative body. weight_used_in_incompatible_ops_errors proves the type-conflict check fires.
- weight_conventions::tests: hf_weight_prefix cases (per-layer, global under model., lm_head at root).
- phase7_end_to_end: cuda-feature-gated type-check functions that assert Layer/Model have the expected fields with the expected types; Model::NUM_LAYERS baked in for 1B/8B/70B; Model::load signature matches the documented interface.

Counts:
- Without cuda: 77 unit + 6 integration tests pass.
- With cuda: 77 unit + 10 integration tests pass (+4 cuda-only).
- fmt + clippy clean on both feature configs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
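The HF naming convention and the tied-embedding fallback described in this commit can be restated as a short sketch. The original is Rust; the Python below only mirrors the three prefix cases from the message. The dotted-path argument form, the `resolve_lm_head_key` helper, and the dictionary standing in for a safetensors file are assumptions made for illustration.

```python
from typing import Dict, Optional


def hf_weight_prefix(path: str, layer: Optional[int] = None) -> str:
    """Return the Hugging Face safetensors prefix for a dotted weight path."""
    if path == "lm_head":
        # HF stores lm_head at the root, not under "model."
        return "lm_head"
    if layer is not None:
        return f"model.layers.{layer}.{path}"
    return f"model.{path}"


def resolve_lm_head_key(tensors: Dict[str, object]) -> str:
    """Hypothetical helper: pick the tensor backing lm_head at load time.

    If the checkpoint has no lm_head.weight (tied word embeddings), fall back
    to the embedding table, as the commit describes for Model::load.
    """
    if "lm_head.weight" in tensors:
        return "lm_head.weight"
    return hf_weight_prefix("embed_tokens") + ".weight"


per_layer = hf_weight_prefix("self_attn.q_proj", layer=3)
global_weight = hf_weight_prefix("embed_tokens")
root_weight = hf_weight_prefix("lm_head")
tied = resolve_lm_head_key({"model.embed_tokens.weight": object()})
```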
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 19, 2026
Documents commit 8539124 (overlay) at the top of "Next session starts here": preset × dense fan-out, cross-variant dedup, MarlinFormat runtime dispatch, g_idx fingerprint disambiguation, ferrite-models debug=0 RAM ceiling, plus the two goldens that prove the overlay path. Also notes pre-existing flashinfer.rs clippy lints + suspected flashinfer-induced top-N drift on AWQ Llama / Qwen2-GPTQ / Qwen3-dense (goldens were generated pre-flashinfer and need re-baselining). Marks the CT INT4 caveat as RESOLVED via GptqLayout::WeightPacked + the Ok(None) fallback (task vllm-project#6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 21, 2026
Adds six items to the FP8 open-work list that surfaced during the quant×arch coverage session:

5. `fp8_exclusion` × any new quant that ships `.weight_scale` needs to be added to the exclusion tuple (CT-INT4's skip landed in 9ef35d6; future formats like any wN-a-16 variant with a bf16/fp16 `.weight_scale` tensor would need the same treatment).
6. CUDA-graph vs eager drift — structurally the strongest candidate for the threshold=1 ULP drift (goldens use enforce_eager=True, ferrite uses graphs).
7. Scale-value comparison (ferrite's f32 scale tensor vs Python's on the same input) not done — top suspect if vllm-project#6 falls through.
8. Python `cutlass_scaled_mm` Triton fallback when `b.shape[0/1] % 16 != 0` — not hit by current arches but worth pre-checking new arch shapes.
9. Static-scaled FP8 group-shape kernel (`scaled_fp8_quant_kernel_strided_group_shape`) not ported; not currently reachable.
10. Qwen3/Gemma3 × FP8 QK-norm runs via singletons; a dedicated `Fp8FusedQkvQkNormRope*Impl` would close the fused-launch gap (matching the already-staged-but-unregistered dense `FusedQkvQkNormRopeCacheImpl`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JlPang863
added a commit
to yizhongzoe-cloud/my-vllm-serving-system
that referenced
this pull request
Apr 22, 2026
… blocking root cause + fix

Documents the 2026-04-09 overnight investigation that finally localized the framework baseline overhead source after 6 prior negative results (drop mode, lazy reload, admission throttling, snapshot bypass, log demotion, Benders profile fix, ckpt step throttle).

Root cause: API server's _maybe_ft_checkpoint() at line 740 calls _ft_ckpt_future.result() which BLOCKS waiting for the previous step's checkpoint RPC. Under W1_Chat/Heavy fault load, the worker's RPC takes ~70 ms (per-req torch.save + 3× fsync × ~14 reqs) while step time is ~30 ms. API server step rate drops from ~30/sec to ~14/sec — exactly the 60% goodput drop observed (117 tok/s reload vs 292 tok/s NoCkpt).

The cProfile data was misleading: it only instruments schedule(), but the blocking happens AFTER schedule() in _ft_post_process → _maybe_ft_checkpoint. cProfile showed schedule() at only 0.6 ms total, which we (correctly) concluded wasn't the bottleneck — but we missed that the bottleneck was in a different function entirely.

Fix in commit 2044810 (env-var-gated, default off):
  FT_CKPT_NONBLOCK=1 non-blocking pipeline (peek with .done())
  FT_FAST_TMPFS_WRITE=1 skip fsync on tmpfs (provably correct)

Phase 7 result (3 seeds): goodput 117.3±12.6 → 247.4±127.1 (+130 tok/s, +111%). Trade-off: completion drops 98% → 93% because some checkpoint cycles are dropped under load.

Adds Follow-up finding vllm-project#6 section + 1 verdict row + 2 future-work items (decide on default, run full ablation matrix).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
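The difference between waiting on `.result()` and peeking with `.done()` can be shown with `concurrent.futures` alone. The `CheckpointPipeline` class below is a hypothetical stand-in, not the project's `_maybe_ft_checkpoint`; it only illustrates how a non-blocking pipeline trades dropped checkpoint cycles for an unblocked step loop.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class CheckpointPipeline:
    """Hypothetical sketch of a one-deep checkpoint pipeline."""

    def __init__(self, executor: ThreadPoolExecutor, nonblock: bool) -> None:
        self.executor = executor
        self.nonblock = nonblock
        self.pending: Optional[Future] = None
        self.dropped = 0

    def maybe_checkpoint(self, save_fn: Callable[[], object]) -> None:
        if self.pending is not None:
            if self.nonblock and not self.pending.done():
                # Previous checkpoint still in flight: skip this cycle
                # instead of stalling the caller's step loop.
                self.dropped += 1
                return
            # Blocking mode waits here; in non-blocking mode the future is
            # already done, so this returns immediately and surfaces errors.
            self.pending.result()
        self.pending = self.executor.submit(save_fn)


# Demo: the "save" waits on an event so the first checkpoint stays in flight.
gate = threading.Event()
with ThreadPoolExecutor(max_workers=1) as pool:
    pipeline = CheckpointPipeline(pool, nonblock=True)
    pipeline.maybe_checkpoint(gate.wait)  # starts the first checkpoint
    pipeline.maybe_checkpoint(gate.wait)  # still running, so this cycle is dropped
    dropped_while_busy = pipeline.dropped
    gate.set()                            # let the in-flight checkpoint finish
```

The dropped cycle in the demo corresponds to the trade-off the commit reports: higher goodput, with some checkpoint cycles skipped under load.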
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
…on phase

Adds two sections to HANDOFF_INTERPRETER.md:
- "Tensor-parallel foundation (since `f7edb7032`)" under "## Where we are" — 8-commit table covering Instruction::AllReduce variant + dedup_tp_sig threading + coloring/comm-boundary tests + OpKind::AllReduce + AllReduceImpl + tp_lowering pass + insertion logic + cuda_worker NcclGroup plumbing. Notes that the machinery is built and tested but not activated — the lowering pass is wired with a tp=1 literal so it's a strict no-op until canonical fanout flips the switch.
- "§4. Tensor-parallel — activation phase" under "## What's left" — covers the three remaining tasks (vllm-project#7 canonical fanout, vllm-project#6 loader sharding + KV replication, vllm-project#9 verify on commandr at tp=2) with enough detail to act on without re-reading the design memory. Bumps the original "Optional follow-ups" to §5.

Reference points (`crates/ferrite-forward/src/instr.rs`, `ferrite-forward-macro/src/lib.rs`) get TP-relevant entries: the cfg-gated `Instruction::AllReduce` variant, `dedup_tp_sig`, `tp_lowering::insert_all_reduces`, and `AllReduceImpl`.

Top-of-file abstract acknowledges the foundation landing and points at §4. Macro test count updated 207→190 (the unrelated `1dcea5ce1`/`7eb8e995b` cleanup excised dead pivot machinery before TP work; the TP foundation added 8 tests to that smaller baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
…s by tp

Adds `tp_world_size: u8` parameter to `emit_canonical_params_impl` and `emit_model`, threading it through `compile()`'s emission loop from the SolvedModel. The shard math floor-divides every column-parallel dim — `num_q_heads`, `num_kv_heads`, `intermediate_size` — by `tp_world_size`; derived sizes (`q_size`, `kv_size`) follow. `head_dim` is per-head and never sharded.

At `tp_world_size = 1` (every emission until task vllm-project#7's outer-loop fanout lands) sharding is identity — output is byte-identical to single-rank builds. The 3 new codegen unit tests pin:
* `canonical_params_at_tp_eq_1_is_identity` — Llama-2-7B-ish bounds at tp=1 keep NUM_Q_HEADS=32, INTERMEDIATE_SIZE=11008, Q_SIZE=4096.
* `canonical_params_at_tp_eq_2_shards_column_parallel_dims` — same bounds at tp=2 yield 16 / 5504 / 2048.
* `canonical_params_at_tp_eq_8_shards_column_parallel_dims` — Llama-3-8B-ish bounds at tp=8 (the upper bound of the compile-time set) yield NUM_Q_HEADS=4, NUM_KV_HEADS=1, INTERMEDIATE_SIZE=1792.

KV replication when `num_kv_heads < tp_size` is task vllm-project#6's loader-sharding work — `compile()`'s outer loop in task vllm-project#7 will skip indivisible (variant, tp) tuples until then.

Macro test suite: 194/194 (was 191/191).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
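The shard math in this commit is plain floor division, so the three pinned test cases can be reproduced with a few lines. The `CanonicalParams` dataclass and `shard_params` function below are hypothetical Python stand-ins for the Rust codegen. The "Llama-2-7B-ish" and "Llama-3-8B-ish" bounds use `head_dim = 128` and the KV-head counts 32 and 8, which are assumptions here, as is the Llama-3 intermediate size of 14336; the commit states only the resulting values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalParams:
    num_q_heads: int
    num_kv_heads: int
    head_dim: int
    intermediate_size: int

    @property
    def q_size(self) -> int:
        return self.num_q_heads * self.head_dim

    @property
    def kv_size(self) -> int:
        return self.num_kv_heads * self.head_dim


def shard_params(params: CanonicalParams, tp_world_size: int) -> CanonicalParams:
    """Floor-divide column-parallel dims by tp_world_size; head_dim is never sharded."""
    return CanonicalParams(
        num_q_heads=params.num_q_heads // tp_world_size,
        num_kv_heads=params.num_kv_heads // tp_world_size,
        head_dim=params.head_dim,
        intermediate_size=params.intermediate_size // tp_world_size,
    )


llama2_7b_ish = CanonicalParams(num_q_heads=32, num_kv_heads=32, head_dim=128,
                                intermediate_size=11008)
llama3_8b_ish = CanonicalParams(num_q_heads=32, num_kv_heads=8, head_dim=128,
                                intermediate_size=14336)

tp1 = shard_params(llama2_7b_ish, 1)
tp2 = shard_params(llama2_7b_ish, 2)
tp8 = shard_params(llama3_8b_ish, 8)
```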
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
Activation phase. When `ferrite-forward-macro` is built with its own
`nccl` cargo feature on (the per-arch crate's `nccl` chains down via
ferrite-forward → ferrite-forward-macro), `compile()` iterates the
compile-time TP set `[1, 2, 4, 8]` over every variant, producing one
SolvedModel per (model, tp) tuple. Variant idents at tp>1 get a
`_tp{N}` suffix; ditto source stems for canonical-selection ordering.
Indivisible (variant, tp) tuples are skipped at the outer loop —
column-parallel dims `num_attention_heads`, `num_key_value_heads`,
`intermediate_size` must each divide evenly. KV replication when
`num_kv_heads < tp_size` is task vllm-project#6's loader-sharding work.
`emit_arch_dispatcher` now keys arms on `(model_ident, bounds, tp)`
via the new `DispatchArm` struct. The emitted `Weights` enum carries
all (variant × tp) variants; `Weights::load` gains a
`tp_world_size: u8` parameter that match-dispatches to the right
fingerprint-arms set. One `inventory::submit!` per distinct tp value
hits `FerriteArchRegistration` with that tp, so the runtime
`try_load(arch_hint, tp_world_size)` from commit B routes correctly.
`cfg!(feature = "nccl")` reads `ferrite-forward-macro`'s own feature
state — cargo recompiles the proc-macro per-feature-set, so consumers
without nccl still get the fast tp=1-only macro. No env-var dance,
no build.rs.
Verified on commandr at `--features nccl`:
c4ai-command-r-v01 · 564 tiles · tp=1
command-r-1-layer · 18 tiles · tp=1
c4ai-command-r-v01_tp2 · 644 tiles · tp=2 (+80 from per-layer AllReduce)
command-r-1-layer_tp2 · 20 tiles · tp=2 (+2)
c4ai-command-r-v01_tp4 · 644 tiles · tp=4
command-r-1-layer_tp4 · 20 tiles · tp=4
c4ai-command-r-v01_tp8 · 644 tiles · tp=8
command-r-1-layer_tp8 · 20 tiles · tp=8
At default features (no nccl): byte-identical to pre-fanout — only
tp=1 emits, single `inventory::submit!` block. Macro test suite:
194/194 pass under both `cargo test -p ferrite-forward-macro --lib`
and `--lib --features nccl`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
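The outer-loop fanout described in this commit reduces to iterating a fixed TP set and skipping tuples whose column-parallel dims do not divide evenly. A minimal Python sketch of that rule follows; the `fanout` function and its return shape are hypothetical, and only the TP set, the divisibility check, and the `_tp{N}` suffix come from the commit message.

```python
from typing import List, Tuple

TP_SET = (1, 2, 4, 8)  # compile-time TP set named in the commit


def fanout(variant: str, num_attention_heads: int, num_key_value_heads: int,
           intermediate_size: int) -> List[Tuple[str, int]]:
    """Return (ident, tp) for every tp whose column-parallel dims divide evenly."""
    emitted: List[Tuple[str, int]] = []
    for tp in TP_SET:
        dims = (num_attention_heads, num_key_value_heads, intermediate_size)
        if any(dim % tp != 0 for dim in dims):
            continue  # indivisible (variant, tp) tuple: skipped at the outer loop
        ident = variant if tp == 1 else f"{variant}_tp{tp}"
        emitted.append((ident, tp))
    return emitted


all_four = fanout("model_a", 32, 8, 14336)
kv_limited = fanout("model_b", 32, 4, 11008)  # 4 KV heads cannot split 8 ways
```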
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
…hedule fix
Updates the TP section of HANDOFF_INTERPRETER.md to reflect the
activation phase landing in this session:
- 6 activation commits on top of foundation: SolvedModel.tp_world_size
threading, FerriteArchRegistration tp filter, per-arch nccl features,
emit_canonical_params_impl sharding, compile() outer-loop fanout,
schedule-topo fix.
- §4 ("What's left") drops task vllm-project#7 (done), keeps task vllm-project#6 (loader
sharding — the live-tp blocker) and task vllm-project#9 (commandr-tp=2 verify).
- Verified clean under --features nccl on 10 of 11 arches; default
build of ferrite-models all-arches clean in 1m47s. Macro test
suite: 194/194 (was 190/190).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
…t pickup
The previous handoff commit added the activation-phase commit table
but left several stale references upstream of it:
- Top abstract said "Tensor-parallel **foundation** also landed...
strict no-op until canonical fanout activates it" — now reads
"Tensor-parallel **fanout is active** at --features nccl".
- "Where we are" said tip = `03ce7668e`, macro tests 190/190; now
tip = `85ef0006f`, 194/194 under both default and `--features
nccl`. Calls out the 10/11 arch verification at nccl.
- Pre-commit checklist's "207 pass / 0 fail" baseline updated to
the current 194; adds an explicit `--features nccl` test run.
Adds a `cargo build -p ferrite-models --features nccl` clean
step so future work can't silently regress the fanout build.
- "Reference points" entry for `tp_lowering::insert_all_reduces`
added a load-bearing caveat about the schedule-topo break that
`d11085d0b` fixed — so a future hand can't re-introduce an
SubgraphId-order assumption.
- §4 ("What's left") trimmed to just task vllm-project#6 (loader sharding)
and task vllm-project#9 (commandr-tp=2 verify); points at memory for the
full commit map.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
…llel

Matches Python vLLM. `VocabParallelEmbedding` shards the embedding table along dim 0 (vocab dim); `ParallelLMHead` inherits from it, so both layers carry the same sharded layout — that's what makes `tie_weights` (lm_head.weight = embed_tokens.weight) self-consistent at tp>1 (ref vllm/model_executor/layers/vocab_parallel_embedding.py lines 186, 500, 553).

Adds:
- `shard_kind_for_last_segment(&str)` so codegen can lookup without pre-splitting the path. Refactor of the existing match arm.
- `shard_kind_for_dotted_prefix(&str)` convenience for codegen, which has the dotted layer-0 prefix string in hand. `dead_code` allow until task vllm-project#5 wires the load helpers; pinned by tests now.
- `embed_tokens` and `lm_head` map to `ShardDim0` (was `Replicate`).

Behavior at the existing call site (`insert_all_reduces`) is byte-identical: the lowering pass walks only `OpKind::Gemm` nodes filtered by `ShardDim1`. `embed_tokens` is consumed by `Embed`, not `Gemm`. `lm_head` is a Gemm but ShardDim0, not ShardDim1, so the row-parallel filter still excludes it. AllReduce-after-Embed + AllGather-after-lm_head land in task vllm-project#6.

Macro tests: 195/195 (was 194/194) under both default `--features cuda` and `--features nccl`. New tests:
- `shard_kind_for_dotted_prefix_pins_runtime_paths` covers the load-bearing per-arch paths from `default_required_weights`.
- Existing `shard_kind_table_covers_hf_standard_names` updated to expect `ShardDim0` for `embed_tokens` / `lm_head`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
…ather

Foundation for vocab-parallel lm_head at tp>1. Per Python vLLM, ParallelLMHead inherits from VocabParallelEmbedding (both shard along dim 0), so the per-rank lm_head matmul produces partial logits `[N, vocab/tp]` and the sampler needs `[N, vocab]` — LogitsProcessor.gather_logits in Python; an in-FUF AllGather node in ferrite.

Mirrors the shape of foundation's `AllReduce` add (`f7edb7032` + `cc75942f7`), inverted on the output-alias axis. Adds:
- `OpKind::AllGather` in classified.rs. Deliberately omitted from `from_name` (only the lowering pass produces it). Display string `"all_gather"`.
- `Instruction::AllGather(in_slot: u32, out_slot: u32)` in instr.rs, cfg-gated on `nccl`. Eval arm calls `ctx.fwd.tp_group.expect(..).all_gather_last_dim(view, &mut ctx.device.caching)` and stores the returned OwnedTensor at out_slot. Two-tile, fresh output buffer (last dim grows by `tp_world_size`); doesn't alias the input slot.
- `shape::apply_signature` arm + `weight_arg_ranks` arm. The signature arm returns `ShapeError::BadArgs` like `Reshape` — output shape isn't a function of input shapes alone (depends on the build's `tp_world_size` literal); the lowering pass writes the concrete output shape onto the FufNode directly when it inserts the node post-FUF-build, so this arm is never reached.
- `AllGatherImpl` in impl_lib.rs, registered in `starter_library`. Single-tile claim shape mirrors `AllReduceImpl::matches`. The load-bearing distinction: NO override of `output_alias` — the default returns `None` for every output, which the coloring pass interprets as "give this output its own fresh slot". A regression that overrode `output_alias` to point at the input would mis-size the output buffer and corrupt logits.
- Tripwire test `all_gather_impl_claims_single_tile_input_with_fresh_output_slot` asserts both the claim shape and the no-alias property. Inverts the AllReduce in-place collapse invariant pinned by `coloring_allreduce_collapses_to_input_slot`.

At tp=1 the lowering pass produces zero AllGather FufNodes (the `insert_all_reduces` pass doesn't yet emit them; task vllm-project#6 wires it for the lm_head Gemm), so this Impl never matches and is free in the solver. AllGatherImpl is registered unconditionally regardless of `nccl` feature — same convention as AllReduceImpl; `cost_us` returns `UNCALIBRATED_COST_US` on profiles without an `all_gather` cost row.

Macro tests: 196/196 (was 195/195) under both default `--features cuda` and `--features nccl`. Full ferrite-models umbrella build clean at `--features cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in 7m21s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
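The reason an all-gather along the last dimension is enough for a vocab-parallel lm_head can be checked with a toy example. The pure-Python sketch below simulates the ranks in one process with nested lists. All function names are illustrative and no real collective library is involved; `all_gather_last_dim` here just concatenates per-rank partial logits in rank order.

```python
from typing import List

Matrix = List[List[int]]


def matmul_transposed(x: Matrix, w: Matrix) -> Matrix:
    """Compute x @ w.T for x of shape [N, H] and w of shape [V, H]."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]


def shard_dim0(w: Matrix, rank: int, tp: int) -> Matrix:
    """Rows [rank * V/tp, (rank + 1) * V/tp) of a [V, H] weight (vocab-parallel)."""
    rows_per_rank = len(w) // tp
    return w[rank * rows_per_rank:(rank + 1) * rows_per_rank]


def all_gather_last_dim(parts: List[Matrix]) -> Matrix:
    """Concatenate per-rank [N, V/tp] partial logits into [N, V], in rank order."""
    num_rows = len(parts[0])
    return [[value for part in parts for value in part[i]] for i in range(num_rows)]


tp = 2
hidden = [[1, 2, 3], [0, -1, 4]]                              # [N=2, H=3]
lm_head = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]        # [V=4, H=3]

full_logits = matmul_transposed(hidden, lm_head)              # [N, V]
partials = [matmul_transposed(hidden, shard_dim0(lm_head, r, tp)) for r in range(tp)]
gathered = all_gather_last_dim(partials)                      # [N, V] again
```

Each rank's partial output has a smaller last dimension than the gathered result, which is why the commit insists on a fresh output slot rather than aliasing the input.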
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
…a_worker

Prerequisite for task vllm-project#5 (codegen-side shard-kind dispatch). The emitted `_sharded` loader calls need `rank` as a runtime value in scope inside the `Weights::load_with` body — `tp_world_size` is already a baked literal per (model, tp), but `rank` is per-process.

Touches:
- `ferrite_forward::dispatcher::ArchTryLoadFn` + `try_load(..., tp_rank: u8, ...)` — adds rank between `tp_world_size` and `max_model_len` in the public API. Forwarded to the matched registration's closure.
- Macro: `inventory::submit!`'s closure body becomes `|gw, stream, max_model_len, tp_rank, hf|` and passes `tp_rank` into `Weights::load(..., #tp_lit, tp_rank)`. The per-arch `Weights::load(...)` and `load_match_arms` body gain the `tp_rank: u8` parameter and forward it to `#model_ident::load(gw, stream, max_model_len, tp_rank)`.
- Codegen: per-model `load_with(..., tp_rank: u8)` and `load(..., tp_rank: u8)` (both canonical and Shim variants). `tp_rank` is in scope inside the `load_with` body — task vllm-project#5's emitted `_sharded` calls reference it as `tp_rank as usize`.
- `cuda_worker.rs`: passes `self.config.tp_rank as u8` into `try_load`.

The `!use_tp` ferrite eligibility gate stays in place — until task vllm-project#5 lands the codegen-side shard-kind dispatch, the per-arch `Weights::load` at tp>1 still calls the unsharded loaders and would shape-mismatch the sharded `<W>::*` constants from `b8c25e5d2`. The gate lifts when vllm-project#5 + vllm-project#6 (embed masking + lowering AllReduce-after-Embed + AllGather-after-lm_head) are wired end-to-end.

Macro tests 196/196 under both `--features cuda` and `--features nccl`. Full ferrite-models umbrella build clean at `--features cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in 7m51s. vllm-executor + vllm-cuda + ferrite-models compile clean at `--features cuda` in 1m58s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
The wholesale change. `emit_unindexed_let` and
`emit_layered_load_body` now take `tp_world_size: u8` and dispatch
each `FieldLoad` arm to the matching `_sharded` variant when
`tp_world_size > 1`. At `tp_world_size == 1` every arm emits the
same call shape as before — byte-equivalent expansion, defended
by the new `emit_layered_load_body_at_tp_eq_1_emits_no_sharded_call`
test.
Per the no_piecemeal_codegen_migration rule this lands wholesale
across every dense FieldLoad arm in one commit:
- `FieldLoad::Embedding` → `Embedding::load_sharded` /
`load_layered_embedding_sharded` when the prefix's last segment
is `embed_tokens` (vocab-parallel via Python's
`VocabParallelEmbedding`); else falls back to unsharded.
- `FieldLoad::LinearDense` → `LinearLayer::load_dense_sharded`
with `dim = 0` for column-parallel (q/k/v/gate/up/lm_head/embed)
or `dim = 1` for row-parallel (o/down). Replicate paths (norms
etc.) keep the unsharded call. The shard kind comes from
`tp_lowering::shard_kind_for_dotted_prefix` — the same table the
lowering pass uses for `AllReduce` insertion (single source of
truth).
- `FieldLoad::LinearConcat` → `load_dense_concat_sharded`
always (fused QKV / gate_up are always column-parallel; no
row-parallel concat exists in any current arch).
- `FieldLoad::LinearTiedToEmbedding` — unchanged. At tp>1
`embed_tokens.weight` is already vocab-sharded by the
`Embedding::load_sharded` call site that ran first; the tied
`Linear::new(embed.weight, None)` here just wraps the
already-sharded tensor. That's how Python's
`tie_weights(lm_head.weight = embed_tokens.weight)` stays
self-consistent at tp>1 — both layers point at the same dim-0
sharded slice.
- `FieldLoad::RmsNorm` / `CohereLayerNorm` — unchanged. Replicate.
- Quantized arms (Marlin/Fp8/Bnb4) — unchanged. Defer per the
dense-only scope decision; would need `_sharded` quant variants
in ferrite-kernels first. Quantized arches DO still register at
tp>1 in inventory (the macro outer loop doesn't skip them yet),
but `cuda_worker`'s `!use_tp` ferrite gate prevents anyone from
reaching them; lifting that gate without quant `_sharded`
helpers would shape-mismatch.
Threading: `tp_world_size` flows through
`emit_weights_struct(...)` → `emit_group_let(...)` →
`emit_unindexed_let(.., tp)` and
`emit_layered_load_body(.., tp)`. Both `emit_weights_struct`
callers (canonical + `emit_shim_model`) pass through the value.
The Shim variant doesn't actually consume it (Shims delegate to
canonical's `load_with`, which carries its own per-(model, tp)
specialization), but the API stays uniform so a future change
that needs tp on the Shim side has nothing to plumb. The two
`emit_*` fns get `#[allow(clippy::too_many_arguments)]` since the
existing 8-arg surface predates this commit.
Re-exports: added `load_layered_linear_dense_sharded`,
`load_layered_linear_dense_concat_sharded`, and
`load_layered_embedding_sharded` to `ferrite_forward`'s
`pub use loaders::{...}` list. The macro-emitted code references
them through the `::ferrite_forward::` namespace.
Macro tests: 198/198 (was 196/196 at task vllm-project#2). Two new tests:
- `emit_layered_load_body_at_tp_eq_1_emits_no_sharded_call` —
pins the negative direction across q_proj / o_proj /
layernorm / concat / embedding plans. Defends against a future
regression that drops the `tp == 1` short-circuit.
- `emit_layered_load_body_at_tp_gt_1_dispatches_by_shard_kind` —
pins the dim baked into the emitted call: q_proj → 0,
o_proj → 1, layernorm → unsharded, concat → concat_sharded with
`4 as usize` baked literal.
Full ferrite-models umbrella build clean at both `--features cuda`
(1m45s) and `--features cuda,nccl` (10m07s, all 11 arches ×
{1,2,4,8} tp variants). cuda_worker's `!use_tp` gate stays — the
remaining blocker for end-to-end tp>1 is task vllm-project#6 (embed masking
+ tp_lowering's AllReduce-after-Embed +
AllGather-after-lm_head insertions).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
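The shard-kind dispatch described in this commit can be illustrated with a small Python sketch. It is not the macro's emitted Rust: the table contents, the `gate_proj` / `up_proj` / `down_proj` names, and the `shard_dim_for` / `load_dense` helpers are assumptions made for illustration, and nested lists stand in for weight tensors. It only shows the rule stated above: `dim = 0` for column-parallel weights, `dim = 1` for row-parallel weights, and the unsharded path for replicated weights or `tp_world_size == 1`.

```python
# Hypothetical shard table keyed by the last segment of a dotted prefix.
# The real table (`tp_lowering::shard_kind_for_dotted_prefix`) is not shown
# in the commit message, so these entries are assumptions.
COLUMN_PARALLEL = {"q_proj", "k_proj", "v_proj", "gate_proj", "up_proj",
                   "lm_head", "embed_tokens"}
ROW_PARALLEL = {"o_proj", "down_proj"}


def shard_dim_for(prefix):
    """Return 0 (column-parallel), 1 (row-parallel), or None (replicate)."""
    last = prefix.split(".")[-1]
    if last in COLUMN_PARALLEL:
        return 0
    if last in ROW_PARALLEL:
        return 1
    return None


def load_dense(weight, prefix, tp_world_size, tp_rank):
    """Return this rank's slice of `weight` (a list of rows)."""
    # At tp_world_size == 1 every weight takes the unsharded path.
    dim = shard_dim_for(prefix) if tp_world_size > 1 else None
    if dim is None:
        return weight
    if dim == 0:
        rows = len(weight) // tp_world_size
        return weight[tp_rank * rows:(tp_rank + 1) * rows]
    cols = len(weight[0]) // tp_world_size
    return [row[tp_rank * cols:(tp_rank + 1) * cols] for row in weight]
```

For a 4×4 weight at `tp_world_size = 2`, rank 1 would receive the last two rows of a `q_proj` weight and the last two columns of an `o_proj` weight, while a layernorm weight is returned whole on every rank.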
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
…mbed)

Phase 1 of task vllm-project#6: embed side. The lm_head AllGather lands in the next commit.

Matches Python `VocabParallelEmbedding.forward_native` (vllm/model_executor/layers/vocab_parallel_embedding.py:464–484): masked-input gather → zero out-of-range outputs → AllReduce sum.

Kernel side (`embedding_kernels.cu` + `kernels.rs`):
- Both `embedding_gather_kernel` and `embedding_gather_vec_kernel` gain `(vocab_offset, vocab_per_rank)` args. A token's `local_id = id - vocab_offset` is range-checked against `vocab_per_rank`; out-of-range tokens get zero-filled output rows. Underflow on `id < vocab_offset` is handled (the unsigned subtraction yields a huge value that fails the upper-bound check). The vec kernel zero-fills via memset on the int4/int2/int vector type.
- Rust extern declarations updated to match the new `extern "C"` C++ signatures (FFI wrappers `embedding_gather_{f16,bf16,f32}`).
- New `kernels::embedding_gather_masked(weight, ids, vocab_offset, vocab_per_rank, alloc, stream)` is the Rust entry point. The existing `kernels::embedding_gather(weight, ids, alloc, stream)` becomes a thin wrapper passing `(0, weight.dim(0) as u32)`; byte-equivalent at tp=1, no churn at the ~30 unrelated call sites in vllm-cuda hand-written models.

`Instruction::Embed` (instr.rs):
- Now reads `weight.dim(0)` for `vocab_per_rank` (= `vocab_size / tp` per rank's sharded slice; full vocab at tp=1) and computes `vocab_offset = rank * vocab_per_rank` from the NcclGroup. At tp=1 (`tp_group = None`) the offset stays 0 and the mask never trips. cfg-gated `tp_group` access: the field doesn't exist without `nccl`, so the offset path is `#[cfg(not(feature = "nccl"))]` constant-zero.

Lowering pass (tp_lowering.rs):
- `insert_all_reduces` extends the producer-walk from `OpKind::Gemm with ShardDim1 weight` to also include `OpKind::Embed with ShardDim0 weight` (i.e. `embed_tokens`). Same fan_out shape: append AllReduce node, rewire downstream consumers to read post-AllReduce value. Without this, ranks 1..N would silently contribute zero embeddings (from the masked gather) without summing back to the global value.
- New tests: `lowering_inserts_allreduce_after_vocab_parallel_embed_at_tp_gt_1` pins the embed insertion shape; `lowering_skips_replicate_embed_at_tp_gt_1` defends the ShardKind gate so a future shard-table change can't silently emit stray AllReduces.

Macro tests: 200/200 (was 198/198) at both default `--features cuda` and `--features nccl`. Full ferrite-models umbrella build clean at `--features cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in 7m18s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
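The masked gather followed by an AllReduce sum can be sketched in plain Python. This is an illustration under stated assumptions, not the CUDA kernel or the Rust entry point: lists stand in for tensors, all ranks are simulated in one process, and the function names `masked_gather`, `all_reduce_sum`, and `vocab_parallel_embed` are hypothetical. The 32-bit mask emulates the unsigned subtraction the commit relies on, so that ids below `vocab_offset` wrap to a large value and fail the single upper-bound check.

```python
U32_MASK = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic (assumed width)


def masked_gather(shard, ids, vocab_offset, vocab_per_rank):
    """Gather rows from one rank's vocab shard; out-of-range ids give zero rows."""
    hidden = len(shard[0])
    out = []
    for token_id in ids:
        # ids below vocab_offset wrap around and fail the upper-bound check,
        # so one comparison covers both ends of the range.
        local_id = (token_id - vocab_offset) & U32_MASK
        if local_id < vocab_per_rank:
            out.append(list(shard[local_id]))
        else:
            out.append([0.0] * hidden)
    return out


def all_reduce_sum(per_rank_outputs):
    """Element-wise sum across ranks (a stand-in for the NCCL AllReduce)."""
    return [
        [sum(values) for values in zip(*rows)]
        for rows in zip(*per_rank_outputs)
    ]


def vocab_parallel_embed(full_weight, ids, tp):
    """Simulate every rank's masked gather, then sum the partial results."""
    vocab_per_rank = len(full_weight) // tp
    partials = []
    for rank in range(tp):
        vocab_offset = rank * vocab_per_rank
        shard = full_weight[vocab_offset:vocab_offset + vocab_per_rank]
        partials.append(masked_gather(shard, ids, vocab_offset, vocab_per_rank))
    return all_reduce_sum(partials)
```

Because each token id falls inside exactly one rank's range, every output row receives one real embedding and zeros from the other ranks, so the sum reproduces the unsharded lookup. Dropping the AllReduce would leave the zero rows in place on the ranks that do not own the token.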
starpit
added a commit
to starpit/vllm
that referenced
this pull request
Apr 27, 2026
Phase 2 of task vllm-project#6: lm_head side. Closes the all-gather hole the Instruction::AllGather + AllGatherImpl + OpKind::AllGather foundation in `9d70ff563` left for the lowering pass to fill.

Lowering (tp_lowering.rs):
- New `insert_lm_head_allgather(fuf, program, tp_world_size)`. At tp>1 walks the FUF for `OpKind::Gemm` nodes whose weight path's last segment is `"lm_head"`, and appends an `OpKind::AllGather` reading the gemm's output. Rewires every consumer of the lm_head Gemm to read the AllGather instead. At tp=1 it's a strict no-op.
- Refactor: extract `rewire_consumers(fuf, old, new)` so the AllReduce and AllGather inserters share the consumer-rewiring walk (was duplicated inline in the AllReduce loop). Behavior unchanged.
- Wired into `compile()` at the activation site right after `insert_all_reduces`; both passes are gated on tp_world_size > 1 internally, no extra outer-loop branch.

backbone_output_for (codegen.rs):
- Updated to walk past the AllGather node when present. lm_head's hidden-state input was `last_node.inputs.first()`; with the AllGather inserted, `last_node` is now the AllGather, and its first input is the lm_head Gemm. Skip one hop back to recover the lm_head Gemm, then read its first input as before. At tp=1 the unchanged path is taken (no AllGather node exists). Without this, `forward_backbone` (used by pipeline-parallel intermediate ranks) would mistakenly return the lm_head gemm output instead of the hidden state.

FUF output shape on the AllGather node is left equal to the lm_head Gemm's output. The FUF carries pre-shard SYMBOLIC dims (e.g. `vocab_size` Bound, not `vocab_size / tp`); the runtime allocation comes from the kernel's `alloc_tensor` call, which reads `weight.dim(0)` (sharded) for the gemm and the gather's own world-size multiplier internally. The fresh slot for AllGather output is enforced by `AllGatherImpl::output_alias` returning `None` (already pinned by the `all_gather_impl_claims_single_tile_input_with_fresh_output_slot` test).

Macro tests: 203/203 (was 200/200) at both default `--features cuda` and `--features nccl`. Three new tests:
- `lowering_inserts_allgather_after_lm_head_gemm_at_tp_gt_1`
- `lowering_no_allgather_at_tp_eq_1`
- `lowering_skips_non_lm_head_gemms_for_allgather`: defends the name gate so a future change can't silently start emitting AllGathers on q_proj / o_proj / etc. (only one match per FUF in every current arch; multiple lm_head gemms would still be handled, but no arch produces them).

Full ferrite-models umbrella build clean at `--features cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in 7m19s. cuda_worker's `!use_tp` ferrite gate stays for one more commit; lifting it is the last step before task vllm-project#8 verify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
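The append-then-rewire shape of the lowering pass can be sketched on a toy graph in Python. This is an assumption-laden illustration, not the Rust pass: the `Node` class, the index-based `inputs` lists, and the string `kind` tags are hypothetical stand-ins for the FUF structures, and only the two function names are borrowed from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                    # e.g. "Gemm", "AllGather"
    inputs: list = field(default_factory=list)   # indices of producer nodes
    weight_path: str = ""                        # dotted weight path, if any


def rewire_consumers(nodes, old, new):
    """Point every consumer of node `old` at node `new` (except `new` itself)."""
    for idx, node in enumerate(nodes):
        if idx == new:
            continue  # the inserted node must keep reading the original producer
        node.inputs = [new if i == old else i for i in node.inputs]


def insert_lm_head_allgather(nodes, tp_world_size):
    """Append an AllGather after each lm_head Gemm; strict no-op at tp == 1."""
    if tp_world_size <= 1:
        return
    # range() is evaluated once, so nodes appended below are not revisited.
    for idx in range(len(nodes)):
        node = nodes[idx]
        if node.kind == "Gemm" and node.weight_path.split(".")[-1] == "lm_head":
            nodes.append(Node("AllGather", [idx]))
            rewire_consumers(nodes, idx, len(nodes) - 1)
```

Appending rather than inserting in the middle keeps existing node indices stable, which is why only the consumers' input lists need to change. The name gate on the last path segment is what keeps `q_proj` or `o_proj` Gemms from receiving an AllGather.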
SandishKumarHN
added a commit
to SandishKumarHN/vllm
that referenced
this pull request
May 11, 2026
vllm/distributed/eplb/rebalance_execute.py:586 had a device-wide GPU sync with a NOTE(bowen) comment admitting the original author didn't know why it was needed. After investigation, the line is dead code in the SYNC path (rearrange_expert_weights_inplace).

Why it's safe: The SYNC path runs entirely on the default CUDA stream end-to-end: torch.empty_like, move_to_buffer's b.copy_(w, non_blocking=True), and NCCL Send/Recv (default stream=None -> current_stream()) all share it. No cross-stream hazard exists. PyTorch's ProcessGroupNCCL correctly calls record_stream() on input/output tensors, so the caching allocator is also safe across iterations.

The ASYNC path (transfer_layer + async_worker) uses its own design, cuda_stream.synchronize() (async_worker.py:134) plus CpuGpuEvent for thread handoff (eplb_utils.py), and is unaffected by this change.

Fixes: vllm-project#32028 (Item vllm-project#6)

Signed-off-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com>
This PR adds an OPT memory analyzer to the system and uses it to automatically determine the KV cache size.
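As a rough illustration of what sizing the KV cache from a memory estimate involves, the sketch below computes how many cache blocks fit once weights and peak activations are accounted for. It is not the analyzer added by this PR: the function names, the `utilization` parameter, and the per-block formula (one key and one value vector per token per layer) are assumptions chosen for the example.

```python
def kv_block_bytes(block_size, num_layers, hidden_size, dtype_bytes):
    """Bytes for one cache block: a key and a value vector per token per layer."""
    return 2 * block_size * num_layers * hidden_size * dtype_bytes


def num_gpu_blocks(total_gpu_bytes, weight_bytes, peak_activation_bytes,
                   block_bytes, utilization=0.9):
    """Blocks that fit in the memory left after weights and activations.

    `utilization` is a hypothetical safety margin on total GPU memory.
    """
    usable = int(total_gpu_bytes * utilization)
    free = usable - weight_bytes - peak_activation_bytes
    # Never report a negative block count when the model does not fit.
    return max(free // block_bytes, 0)
```

With these definitions, a model whose weights and activations exceed the usable memory yields zero blocks instead of a negative count, which a caller could treat as a configuration error.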
Tested models:
Tested GPUs: