Automatically configure KV cache size by WoosukKwon · Pull Request #6 · vllm-project/vllm

WoosukKwon · 2023-03-03T10:05:40Z

This PR adds an OPT memory analyzer to the system and uses it to automatically determine the KV cache size.
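As a rough illustration of what sizing a KV cache from a memory budget involves, the sketch below derives a block count from GPU memory, parameter size, and peak activation size. The function name, the utilization factor, the block size, and the sample numbers are all assumptions for illustration; this is not the analyzer the PR adds.

```python
def num_kv_cache_blocks(
    gpu_memory_bytes: int,
    param_bytes: int,
    peak_activation_bytes: int,
    num_layers: int,
    num_heads: int,
    head_dim: int,
    block_size: int = 16,      # tokens per KV block (assumed)
    dtype_bytes: int = 2,      # fp16 storage (assumed)
    utilization: float = 0.9,  # fraction of GPU memory we allow ourselves (assumed)
) -> int:
    """Return how many KV cache blocks fit after weights and activations."""
    # One block holds keys and values (factor 2) for block_size tokens in every layer.
    bytes_per_block = 2 * num_layers * num_heads * head_dim * block_size * dtype_bytes
    budget = int(gpu_memory_bytes * utilization)
    free_for_cache = budget - param_bytes - peak_activation_bytes
    return max(free_for_cache // bytes_per_block, 0)


# Hypothetical numbers: a 40 GiB GPU and a small 12-layer, 12-head model.
blocks = num_kv_cache_blocks(
    gpu_memory_bytes=40 * 2**30,
    param_bytes=250 * 2**20,
    peak_activation_bytes=1 * 2**30,
    num_layers=12,
    num_heads=12,
    head_dim=64,
)
```

The point of the sketch is only that the cache size falls out of what is left after the model's own memory needs, which is why a per-model memory analysis is the prerequisite.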

Tested models:

OPT-125M
OPT-350M
OPT-1.3B
OPT-2.7B
OPT-6.7B
OPT-13B

Tested GPUs:

A100

* finish changing scheduler
* finish merge
* fix model
* Fix (vllm-project#5)
* fix problems
* fix
* delete unused params
* remove redundant comments

---------

Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>

…ect#6)

Add missing Python requirements

Co-authored-by: Mor Zusman <morz@ai21.com>

[CI/Build] Dockerfile.ubi : Remove test stage

FP8 on A100 for PHIMOE

## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary>#1 — llama-nemotron-embed / score-template support (vllm-project#30550): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#2 — Triton Attention (vllm-project#31406): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#3 — Llama-4 attn quant (vllm-project#34243): Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block.

</details>

<details>
<summary>#4 — GPT-OSS multi-turn (vllm-project#34454): Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.

</details>

<details>
<summary>#5 — Fused MoE int32 overflow (vllm-project#34507): Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:

- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).

</details>

<details>
<summary>#6 — AllReduceFusionPass multicast (vllm-project#35085): Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream.

</details>

<details>
<summary>#7 — response_format validation for completions (vllm-project#35456): Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to `test_completion_error.py`

</details>

<details>
<summary>#8 — response_format validation for chat completions (vllm-project#35510): Conflict due to file restructuring</summary>

Same file restructuring issue as #7. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch)

</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm

[INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
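To make the fused MoE int32 overflow concrete, here is a small sketch that emulates 32-bit signed arithmetic in plain Python and shows a `stride * offset` product wrapping negative, then the same product widened to 64 bits. The token offset and stride values are invented for illustration and the helper names are hypothetical; this is not the Triton kernel code.

```python
INT32_MAX = 2**31 - 1


def wrap_signed(value: int, bits: int) -> int:
    """Reduce an arbitrary Python int to a two's-complement signed integer of `bits` bits."""
    half = 1 << (bits - 1)
    return ((value + half) % (1 << bits)) - half


def mul_int32(a: int, b: int) -> int:
    """Multiply as a 32-bit signed kernel would, wrapping on overflow."""
    return wrap_signed(a * b, 32)


def mul_int64(a: int, b: int) -> int:
    """Multiply after widening both operands to 64 bits."""
    return wrap_signed(a * b, 64)


# Hypothetical values: a token offset and a row stride whose product exceeds INT32_MAX.
offs_token = 70_000
stride = 32_768

wrapped = mul_int32(offs_token, stride)  # wraps to a negative pointer offset
widened = mul_int64(offs_token, stride)  # correct once the offset is int64
```

Casting the offset to a 64-bit type before the multiply, as the cherry-pick does with `.to(tl.int64)`, keeps the product exact for any tensor that fits in memory.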

…rk-slidesparse Add comprehensive SlideSparse integration documentation for vLLM

- wrap_stream async generator middleware with finally cleanup
- _accumulate: differential L2 norms, reasoning content extraction, offset computation
- _maybe_schedule_cycle: time-based and token-based triggers via asyncio.create_task
- _run_eviction_cycle: guard conditions (ENG-09, ENG-10, Pitfall #6), strategy dispatch
- Reasoning-relative to absolute offset conversion (D-05)
- merge_overlapping_ranges + apply_retention_window + align_ranges_to_blocks pipeline
- engine_client.update_request_mask call (D-04)
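The range pipeline named above can be pictured with a sketch like the following. Only the function names come from the commit message; their bodies, the half-open range convention, and the inward block alignment are assumptions made for illustration, and `apply_retention_window` is left out because its semantics are not described here.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, end) token offsets (assumed convention)


def merge_overlapping_ranges(ranges: List[Range]) -> List[Range]:
    """Sort ranges and fuse any that overlap or touch."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def align_ranges_to_blocks(ranges: List[Range], block_size: int) -> List[Range]:
    """Shrink each range inward to block boundaries, dropping ranges that cover no full block.

    Assumption: only whole KV blocks can be masked, so partial blocks are kept.
    """
    aligned: List[Range] = []
    for start, end in ranges:
        aligned_start = -(-start // block_size) * block_size  # round up
        aligned_end = (end // block_size) * block_size        # round down
        if aligned_start < aligned_end:
            aligned.append((aligned_start, aligned_end))
    return aligned


candidates = [(10, 20), (5, 12), (18, 40), (100, 105)]
merged = merge_overlapping_ranges(candidates)
blocks = align_ranges_to_blocks(merged, block_size=16)
```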

- 21 tests covering accumulation, guard conditions, cycle scheduling, passthrough
- Tests: ENG-09, ENG-10, Pitfall #6, D-05 offset, ENG-06 permanent ranges, ENG-07 isolation
- Fix: apply_retention_window only when floor > 0 to avoid discarding all ranges
- Fix: used asyncio.run() for async tests (no pytest-asyncio installed)

…-spec fix(turboquant): TQFullAttentionSpec Python page size override

[codex][Kernel] TurboQuant SM86 support, TP metadata slicing, and docs

…impl

CutePagedAttentionImpl becomes a pipeline state object:

- bind_fusion_weights() stores static weights + allocates persistent I/O buffers with fixed addresses (graph-safe)
- forward() reads from self instead of per-forward side-channels
- gate_buf added for output gate fusion (Qwen3NextAttention)

Blockers #6, #7, #8 from the CUDA graphs checklist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Root cause: k_cached_trim is float16 (Triton kernel output) while qdtype is bfloat16. The .to(qdtype) call on a (96K, 2, 512) tensor creates a 188 MB copy. Allocating k_full then v_full sequentially while k_full (188 MB) remains live requires ~376 MB new allocation, but RTX 4090 has only ~260 MB headroom -> OOM.

Fix: branch on D <= _FA_MAX_HEAD_DIM BEFORE concatenating k/v:

- FA2 path (D=256, SWA layers): keep full concat; flash_attn requires contiguous tensors and SWA footprint is within budget.
- Chunked-softmax path (D=512, global layers): read k_cached_trim and v_cached_trim per CHUNK_K rows with lazy dtype conversion. Peak extra allocation stays < 16 MB regardless of context length.

K/V segment routing in inner loop:

    [0, cached_len)       -> k_cached_trim[ki:kj].to(qdtype)  (cache)
    [cached_len, seq_len) -> key_chunk[off:off+cs_k]          (current)
    boundary              -> torch.cat (at most once, CHUNK_K rows)

Result: 96K context now works on RTX 4090. Previous 64K limit was from the earlier fix (bug vllm-project#6); this fix extends it to the full KV cache capacity (~109K tokens with turboquant_3bit_nc).

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
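The chunked-softmax idea can be illustrated with a pure-Python, single-query sketch: keys and values are consumed `chunk_k` rows at a time while a running maximum, denominator, and weighted sum are rescaled, so no full-length converted copy is ever needed. Function names are hypothetical, the usual 1/sqrt(D) scaling is omitted, and plain lists stand in for tensors. The first line also reproduces the copy size under the assumptions that 96K means 96,000 rows and that each element takes 2 bytes.

```python
import math

# 96,000 x 2 x 512 elements at 2 bytes each, in MiB (matches the ~188 MB quoted above).
copy_mib = 96_000 * 2 * 512 * 2 / 2**20


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend_full(q, keys, values):
    """Reference: softmax over all keys at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attend_chunked(q, keys, values, chunk_k):
    """Online softmax: visit chunk_k rows at a time, rescaling the running state."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk_k):
        # In the real code path this slice is where the lazy dtype conversion would happen.
        k_chunk = keys[start:start + chunk_k]
        v_chunk = values[start:start + chunk_k]
        scores = [dot(q, k) for k in k_chunk]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first chunk
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            denom += w
            for d in range(len(acc)):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.5, -1.0]
keys = [[float(i % 3), float(i % 5) - 2.0] for i in range(10)]
values = [[float(i), float(-i)] for i in range(10)]
full = attend_full(q, keys, values)
chunked = attend_chunked(q, keys, values, chunk_k=3)
```

Because the running state is a handful of scalars plus one output row, the extra memory is bounded by the chunk size and not by the context length.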

Task vllm-project#6. The #[forward] macro now emits, for each model config, a pub Layer + pub Model struct and an unsafe Model::load that reads weights from ferrite-cuda-core's GpuWeights holder. Data- driven from the classified DSL program — no arch-specific fields or load patterns in the compiler. Emission logic (new: codegen_model.rs): - Walks the classified Program to classify every WeightId: * per-layer vs global: any Expr::Weight{index: Some(_)} → per-layer * runtime type: derived from the op the weight flows into as an arg — embed(ids, W) → Embedding, rmsnorm(x, W) → RmsNorm, gemm(x, W) → Linear. A weight used in incompatible positions is a hard error. - Layer struct: one field per per-layer weight, typed as the corresponding ferrite-kernels layer. Model struct: Vec<Layer> + one field per global weight. - Model::NUM_LAYERS baked in from the model's num_hidden_layers config field. Model::load walks 0..NUM_LAYERS, loading per-layer weights via format!("model.layers.{i}.<path>") prefixes. Global weights loaded via weight_conventions::hf_weight_prefix. - Tie-word-embeddings: if the safetensors file doesn't contain lm_head.weight, Model::load aliases embed_tokens.weight as lm_head (GpuTensor: Copy). Runtime decision, not a compile flag. HF convention: - weight_conventions::hf_weight_prefix(path, layer) returns the HF safetensors prefix for a DSL weight reference. Encodes: * per-layer → "model.layers.{i}.{dotted}" * global under backbone → "model.{dotted}" * lm_head → "lm_head" (HF puts it at the root, not under model.) Tests cover all three cases. Feature gating: - Emitted Model/Layer/Model::load are #[cfg(feature = "cuda")]. Without cuda, ferrite-kernels layer types aren't in scope, so the types can't be referenced. This mirrors how ferrite-models gates its `pub mod llama` today. Consumers without cuda still get the numeric pipeline observations (NUM_TILES etc). - ferrite-forward crate grew a cuda feature that forwards to ferrite-kernels/cuda. 
Tests: - codegen_model::tests: llama_body_classification proves per-layer/global + runtime type assignment are right for a representative body. weight_used_in_incompatible_ops_errors proves the type-conflict check fires. - weight_conventions::tests: hf_weight_prefix cases (per-layer, global under model., lm_head at root). - phase7_end_to_end: cuda-feature-gated type-check functions that assert Layer/Model have the expected fields with the expected types; Model::NUM_LAYERS baked in for 1B/8B/70B; Model::load signature matches the documented interface. Counts: - Without cuda: 77 unit + 6 integration tests pass. - With cuda: 77 unit + 10 integration tests pass (+4 cuda-only). - fmt + clippy clean on both feature configs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
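The weight-naming convention described above can be mirrored in a few lines of Python. This is an illustrative re-expression of the three documented cases and the tied-embedding fallback, not the Rust implementation; the function signatures and the `resolve_lm_head` helper are assumptions.

```python
from typing import Iterable, List, Optional


def hf_weight_prefix(path: List[str], layer: Optional[int]) -> str:
    """Map a DSL weight path to its Hugging Face safetensors prefix."""
    dotted = ".".join(path)
    if layer is not None:
        return f"model.layers.{layer}.{dotted}"  # per-layer weight
    if path == ["lm_head"]:
        return "lm_head"                         # HF keeps lm_head at the root
    return f"model.{dotted}"                     # global weight under the backbone


def resolve_lm_head(tensor_names: Iterable[str]) -> str:
    """Pick the tensor backing lm_head, aliasing the embedding when weights are tied."""
    names = set(tensor_names)
    if "lm_head.weight" in names:
        return "lm_head.weight"
    return "model.embed_tokens.weight"


per_layer = hf_weight_prefix(["self_attn", "q_proj"], layer=3)
global_w = hf_weight_prefix(["embed_tokens"], layer=None)
head = hf_weight_prefix(["lm_head"], layer=None)
```

Deciding the tie at load time from the file contents, as the commit describes, means the same compiled model handles checkpoints with and without a separate `lm_head.weight`.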

Documents commit 8539124 (overlay) at the top of "Next session starts here": preset × dense fan-out, cross-variant dedup, MarlinFormat runtime dispatch, g_idx fingerprint disambiguation, ferrite-models debug=0 RAM ceiling, plus the two goldens that prove the overlay path. Also notes pre-existing flashinfer.rs clippy lints + suspected flashinfer-induced top-N drift on AWQ Llama / Qwen2-GPTQ / Qwen3-dense (goldens were generated pre-flashinfer and need re-baselining). Marks the CT INT4 caveat as RESOLVED via GptqLayout::WeightPacked + the Ok(None) fallback (task vllm-project#6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds six items to the FP8 open-work list that surfaced during the quant×arch coverage session:

5. `fp8_exclusion` × any new quant that ships `.weight_scale` needs to be added to the exclusion tuple (CT-INT4's skip landed in 9ef35d6; future formats like any wN-a-16 variant with a bf16/fp16 `.weight_scale` tensor would need the same treatment).
6. CUDA-graph vs eager drift — structurally the strongest candidate for the threshold=1 ULP drift (goldens use enforce_eager=True, ferrite uses graphs).
7. Scale-value comparison (ferrite's f32 scale tensor vs Python's on the same input) not done — top suspect if vllm-project#6 falls through.
8. Python `cutlass_scaled_mm` Triton fallback when `b.shape[0/1] % 16 != 0` — not hit by current arches but worth pre-checking new arch shapes.
9. Static-scaled FP8 group-shape kernel (`scaled_fp8_quant_kernel_strided_group_shape`) not ported; not currently reachable.
10. Qwen3/Gemma3 × FP8 QK-norm runs via singletons; a dedicated `Fp8FusedQkvQkNormRope*Impl` would close the fused-launch gap (matching the already-staged-but-unregistered dense `FusedQkvQkNormRopeCacheImpl`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… blocking root cause + fix Documents the 2026-04-09 overnight investigation that finally localized the framework baseline overhead source after 6 prior negative results (drop mode, lazy reload, admission throttling, snapshot bypass, log demotion, Benders profile fix, ckpt step throttle). Root cause: API server's _maybe_ft_checkpoint() at line 740 calls _ft_ckpt_future.result() which BLOCKS waiting for the previous step's checkpoint RPC. Under W1_Chat/Heavy fault load, the worker's RPC takes ~70 ms (per-req torch.save + 3× fsync × ~14 reqs) while step time is ~30 ms. API server step rate drops from ~30/sec to ~14/sec — exactly the 60% goodput drop observed (117 tok/s reload vs 292 tok/s NoCkpt). The cProfile data was misleading: it only instruments schedule(), but the blocking happens AFTER schedule() in _ft_post_process → _maybe_ft_checkpoint. cProfile showed schedule() at only 0.6 ms total, which we (correctly) concluded wasn't the bottleneck — but we missed that the bottleneck was in a different function entirely. Fix in commit 2044810 (env-var-gated, default off): FT_CKPT_NONBLOCK=1 non-blocking pipeline (peek with .done()) FT_FAST_TMPFS_WRITE=1 skip fsync on tmpfs (provably correct) Phase 7 result (3 seeds): goodput 117.3±12.6 → 247.4±127.1 (+130 tok/s, +111%). Trade-off: completion drops 98% → 93% because some checkpoint cycles are dropped under load. Adds Follow-up finding vllm-project#6 section + 1 verdict row + 2 future-work items (decide on default, run full ablation matrix). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
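The difference between waiting on the previous checkpoint with `.result()` and peeking with `.done()` can be sketched with the standard library's `concurrent.futures`. The `CheckpointPipeline` class and `demo` function are hypothetical stand-ins, not the code referenced in the commit; a `threading.Event` replaces the slow RPC so the behavior is deterministic.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class CheckpointPipeline:
    """Submit one checkpoint at a time; optionally skip instead of waiting."""

    def __init__(self, executor: ThreadPoolExecutor,
                 save_fn: Callable[[int], None], nonblock: bool) -> None:
        self._executor = executor
        self._save_fn = save_fn
        self._nonblock = nonblock
        self._pending: Optional[Future] = None
        self.dropped = 0

    def maybe_checkpoint(self, step: int) -> bool:
        prev = self._pending
        if prev is not None:
            if self._nonblock and not prev.done():
                # Previous save still in flight: drop this cycle instead of stalling the step loop.
                self.dropped += 1
                return False
            prev.result()  # blocking mode waits here; also surfaces exceptions
        self._pending = self._executor.submit(self._save_fn, step)
        return True

    def drain(self) -> None:
        if self._pending is not None:
            self._pending.result()


def demo():
    gate = threading.Event()
    saved = []

    def slow_save(step: int) -> None:
        gate.wait()  # stands in for a checkpoint RPC slower than one step
        saved.append(step)

    with ThreadPoolExecutor(max_workers=1) as pool:
        pipe = CheckpointPipeline(pool, slow_save, nonblock=True)
        results = [pipe.maybe_checkpoint(0), pipe.maybe_checkpoint(1)]
        gate.set()
        pipe.drain()
        results.append(pipe.maybe_checkpoint(2))
    return results, saved, pipe.dropped
```

The sketch also shows the trade-off noted above: step 1 is never checkpointed, which is the kind of dropped cycle that lowers completion while the step loop keeps its pace.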

…on phase Adds two sections to HANDOFF_INTERPRETER.md: - "Tensor-parallel foundation (since `f7edb7032`)" under "## Where we are" — 8-commit table covering Instruction::AllReduce variant + dedup_tp_sig threading + coloring/comm-boundary tests + OpKind::AllReduce + AllReduceImpl + tp_lowering pass + insertion logic + cuda_worker NcclGroup plumbing. Notes that the machinery is built and tested but not activated — the lowering pass is wired with a tp=1 literal so it's a strict no-op until canonical fanout flips the switch. - "§4. Tensor-parallel — activation phase" under "## What's left" — covers the three remaining tasks (vllm-project#7 canonical fanout, vllm-project#6 loader sharding + KV replication, vllm-project#9 verify on commandr at tp=2) with enough detail to act on without re-reading the design memory. Bumps the original "Optional follow-ups" to §5. Reference points (`crates/ferrite-forward/src/instr.rs`, `ferrite-forward-macro/src/lib.rs`) get TP-relevant entries: the cfg-gated `Instruction::AllReduce` variant, `dedup_tp_sig`, `tp_lowering::insert_all_reduces`, and `AllReduceImpl`. Top-of-file abstract acknowledges the foundation landing and points at §4. Macro test count updated 207→190 (the unrelated `1dcea5ce1`/`7eb8e995b` cleanup excised dead pivot machinery before TP work; the TP foundation added 8 tests to that smaller baseline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s by tp Adds `tp_world_size: u8` parameter to `emit_canonical_params_impl` and `emit_model`, threading it through `compile()`'s emission loop from the SolvedModel. The shard math floor-divides every column- parallel dim — `num_q_heads`, `num_kv_heads`, `intermediate_size` — by `tp_world_size`; derived sizes (`q_size`, `kv_size`) follow. `head_dim` is per-head and never sharded. At `tp_world_size = 1` (every emission until task vllm-project#7's outer-loop fanout lands) sharding is identity — output is byte-identical to single-rank builds. The 3 new codegen unit tests pin: * `canonical_params_at_tp_eq_1_is_identity` — Llama-2-7B-ish bounds at tp=1 keep NUM_Q_HEADS=32, INTERMEDIATE_SIZE=11008, Q_SIZE=4096. * `canonical_params_at_tp_eq_2_shards_column_parallel_dims` — same bounds at tp=2 yield 16 / 5504 / 2048. * `canonical_params_at_tp_eq_8_shards_column_parallel_dims` — Llama-3-8B-ish bounds at tp=8 (the upper bound of the compile-time set) yield NUM_Q_HEADS=4, NUM_KV_HEADS=1, INTERMEDIATE_SIZE=1792. KV replication when `num_kv_heads < tp_size` is task vllm-project#6's loader- sharding work — `compile()`'s outer loop in task vllm-project#7 will skip indivisible (variant, tp) tuples until then. Macro test suite: 194/194 (was 191/191). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
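The shard math is simple enough to check by hand. The Python sketch below floor-divides the column-parallel dimensions by `tp_world_size` and leaves `head_dim` alone, then reproduces the figures pinned by the three tests. The dataclass and function names are hypothetical, and `head_dim = 128` plus the unsharded KV head counts are assumptions chosen to be consistent with the quoted numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalParams:
    num_q_heads: int
    num_kv_heads: int
    intermediate_size: int
    head_dim: int  # per-head, never sharded

    @property
    def q_size(self) -> int:
        return self.num_q_heads * self.head_dim

    @property
    def kv_size(self) -> int:
        return self.num_kv_heads * self.head_dim


def shard(params: CanonicalParams, tp_world_size: int) -> CanonicalParams:
    """Floor-divide every column-parallel dimension by the TP world size."""
    return CanonicalParams(
        num_q_heads=params.num_q_heads // tp_world_size,
        num_kv_heads=params.num_kv_heads // tp_world_size,
        intermediate_size=params.intermediate_size // tp_world_size,
        head_dim=params.head_dim,
    )


# Assumed unsharded bounds (head_dim and KV head counts are not stated in the commit).
llama2_7b_ish = CanonicalParams(32, 32, 11008, 128)
llama3_8b_ish = CanonicalParams(32, 8, 14336, 128)

tp1 = shard(llama2_7b_ish, 1)
tp2 = shard(llama2_7b_ish, 2)
tp8 = shard(llama3_8b_ish, 8)
```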

Activation phase. When `ferrite-forward-macro` is built with its own `nccl` cargo feature on (the per-arch crate's `nccl` chains down via ferrite-forward → ferrite-forward-macro), `compile()` iterates the compile-time TP set `[1, 2, 4, 8]` over every variant, producing one SolvedModel per (model, tp) tuple. Variant idents at tp>1 get a `_tp{N}` suffix; ditto source stems for canonical-selection ordering. Indivisible (variant, tp) tuples are skipped at the outer loop — column-parallel dims `num_attention_heads`, `num_key_value_heads`, `intermediate_size` must each divide evenly. KV replication when `num_kv_heads < tp_size` is task vllm-project#6's loader-sharding work. `emit_arch_dispatcher` now keys arms on `(model_ident, bounds, tp)` via the new `DispatchArm` struct. The emitted `Weights` enum carries all (variant × tp) variants; `Weights::load` gains a `tp_world_size: u8` parameter that match-dispatches to the right fingerprint-arms set. One `inventory::submit!` per distinct tp value hits `FerriteArchRegistration` with that tp, so the runtime `try_load(arch_hint, tp_world_size)` from commit B routes correctly. `cfg!(feature = "nccl")` reads `ferrite-forward-macro`'s own feature state — cargo recompiles the proc-macro per-feature-set, so consumers without nccl still get the fast tp=1-only macro. No env-var dance, no build.rs. Verified on commandr at `--features nccl`: c4ai-command-r-v01 · 564 tiles · tp=1 command-r-1-layer · 18 tiles · tp=1 c4ai-command-r-v01_tp2 · 644 tiles · tp=2 (+80 from per-layer AllReduce) command-r-1-layer_tp2 · 20 tiles · tp=2 (+2) c4ai-command-r-v01_tp4 · 644 tiles · tp=4 command-r-1-layer_tp4 · 20 tiles · tp=4 c4ai-command-r-v01_tp8 · 644 tiles · tp=8 command-r-1-layer_tp8 · 20 tiles · tp=8 At default features (no nccl): byte-identical to pre-fanout — only tp=1 emits, single `inventory::submit!` block. Macro test suite: 194/194 pass under both `cargo test -p ferrite-forward-macro --lib` and `--lib --features nccl`. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
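The outer-loop fanout described in this commit can be summarized in a short Python sketch: iterate the compile-time TP set over every variant, skip tuples where a column-parallel dimension does not divide evenly, and suffix identifiers at tp>1. The `fan_out` function and the sample config are hypothetical; the real logic lives in a Rust proc-macro.

```python
from typing import Dict, List, Tuple

TP_SET = (1, 2, 4, 8)
COLUMN_PARALLEL_DIMS = ("num_attention_heads", "num_key_value_heads", "intermediate_size")


def fan_out(variants: Dict[str, Dict[str, int]], nccl: bool) -> List[Tuple[str, int]]:
    """Return (variant_ident, tp) tuples that would be emitted."""
    tp_values = TP_SET if nccl else (1,)  # without nccl only tp=1 is emitted
    emitted: List[Tuple[str, int]] = []
    for name, config in variants.items():
        for tp in tp_values:
            if any(config[dim] % tp != 0 for dim in COLUMN_PARALLEL_DIMS):
                continue  # indivisible (variant, tp) tuple: skipped
            ident = name if tp == 1 else f"{name}_tp{tp}"
            emitted.append((ident, tp))
    return emitted


# Hypothetical variant with 4 KV heads, so tp=8 is indivisible.
variants = {
    "demo-model": {
        "num_attention_heads": 32,
        "num_key_value_heads": 4,
        "intermediate_size": 11008,
    }
}
with_nccl = fan_out(variants, nccl=True)
without_nccl = fan_out(variants, nccl=False)
```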

…hedule fix Updates the TP section of HANDOFF_INTERPRETER.md to reflect the activation phase landing in this session: - 6 activation commits on top of foundation: SolvedModel.tp_world_size threading, FerriteArchRegistration tp filter, per-arch nccl features, emit_canonical_params_impl sharding, compile() outer-loop fanout, schedule-topo fix. - §4 ("What's left") drops task vllm-project#7 (done), keeps task vllm-project#6 (loader sharding — the live-tp blocker) and task vllm-project#9 (commandr-tp=2 verify). - Verified clean under --features nccl on 10 of 11 arches; default build of ferrite-models all-arches clean in 1m47s. Macro test suite: 194/194 (was 190/190). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t pickup

The previous handoff commit added the activation-phase commit table but left several stale references upstream of it:

- Top abstract said "Tensor-parallel **foundation** also landed... strict no-op until canonical fanout activates it" — now reads "Tensor-parallel **fanout is active** at --features nccl".
- "Where we are" said tip = `03ce7668e`, macro tests 190/190; now tip = `85ef0006f`, 194/194 under both default and `--features nccl`. Calls out the 10/11 arch verification at nccl.
- Pre-commit checklist's "207 pass / 0 fail" baseline updated to the current 194; adds an explicit `--features nccl` test run. Adds a `cargo build -p ferrite-models --features nccl` clean step so future work can't silently regress the fanout build.
- "Reference points" entry for `tp_lowering::insert_all_reduces` added a load-bearing caveat about the schedule-topo break that `d11085d0b` fixed — so a future hand can't re-introduce a SubgraphId-order assumption.
- §4 ("What's left") trimmed to just task vllm-project#6 (loader sharding) and task vllm-project#9 (commandr-tp=2 verify); points at memory for the full commit map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…llel Matches Python vLLM. `VocabParallelEmbedding` shards the embedding table along dim 0 (vocab dim); `ParallelLMHead` inherits from it, so both layers carry the same sharded layout — that's what makes `tie_weights` (lm_head.weight = embed_tokens.weight) self-consistent at tp>1 (ref vllm/model_executor/layers/vocab_parallel_embedding.py lines 186, 500, 553). Adds: - `shard_kind_for_last_segment(&str)` so codegen can lookup without pre-splitting the path. Refactor of the existing match arm. - `shard_kind_for_dotted_prefix(&str)` convenience for codegen, which has the dotted layer-0 prefix string in hand. `dead_code` allow until task vllm-project#5 wires the load helpers; pinned by tests now. - `embed_tokens` and `lm_head` map to `ShardDim0` (was `Replicate`). Behavior at the existing call site (`insert_all_reduces`) is byte-identical: the lowering pass walks only `OpKind::Gemm` nodes filtered by `ShardDim1`. `embed_tokens` is consumed by `Embed`, not `Gemm`. `lm_head` is a Gemm but ShardDim0, not ShardDim1, so the row-parallel filter still excludes it. AllReduce-after-Embed + AllGather-after-lm_head land in task vllm-project#6. Macro tests: 195/195 (was 194/194) under both default `--features cuda` and `--features nccl`. New tests: - `shard_kind_for_dotted_prefix_pins_runtime_paths` covers the load-bearing per-arch paths from `default_required_weights`. - Existing `shard_kind_table_covers_hf_standard_names` updated to expect `ShardDim0` for `embed_tokens` / `lm_head`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ather Foundation for vocab-parallel lm_head at tp>1. Per Python vLLM, ParallelLMHead inherits from VocabParallelEmbedding (both shard along dim 0), so the per-rank lm_head matmul produces partial logits `[N, vocab/tp]` and the sampler needs `[N, vocab]` — LogitsProcessor.gather_logits in Python; an in-FUF AllGather node in ferrite. Mirrors the shape of foundation's `AllReduce` add (`f7edb7032` + `cc75942f7`), inverted on the output-alias axis. Adds: - `OpKind::AllGather` in classified.rs. Deliberately omitted from `from_name` (only the lowering pass produces it). Display string `"all_gather"`. - `Instruction::AllGather(in_slot: u32, out_slot: u32)` in instr.rs, cfg-gated on `nccl`. Eval arm calls `ctx.fwd.tp_group.expect(..).all_gather_last_dim(view, &mut ctx.device.caching)` and stores the returned OwnedTensor at out_slot. Two-tile, fresh output buffer (last dim grows by `tp_world_size`); doesn't alias the input slot. - `shape::apply_signature` arm + `weight_arg_ranks` arm. The signature arm returns `ShapeError::BadArgs` like `Reshape` — output shape isn't a function of input shapes alone (depends on the build's `tp_world_size` literal); the lowering pass writes the concrete output shape onto the FufNode directly when it inserts the node post-FUF-build, so this arm is never reached. - `AllGatherImpl` in impl_lib.rs, registered in `starter_library`. Single-tile claim shape mirrors `AllReduceImpl::matches`. The load-bearing distinction: NO override of `output_alias` — the default returns `None` for every output, which the coloring pass interprets as "give this output its own fresh slot". A regression that overrode `output_alias` to point at the input would mis-size the output buffer and corrupt logits. - Tripwire test `all_gather_impl_claims_single_tile_input_with_ fresh_output_slot` asserts both the claim shape and the no-alias property. Inverts the AllReduce in-place collapse invariant pinned by `coloring_allreduce_collapses_to_input_slot`. 
At tp=1 the lowering pass produces zero AllGather FufNodes (the `insert_all_reduces` pass doesn't yet emit them; task vllm-project#6 wires it for the lm_head Gemm), so this Impl never matches and is free in the solver. AllGatherImpl is registered unconditionally regardless of `nccl` feature — same convention as AllReduceImpl; `cost_us` returns `UNCALIBRATED_COST_US` on profiles without an `all_gather` cost row. Macro tests: 196/196 (was 195/195) under both default `--features cuda` and `--features nccl`. Full ferrite-models umbrella build clean at `--features cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in 7m21s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
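Why a vocab-parallel lm_head needs an AllGather can be shown with a list-based Python sketch: each rank multiplies the hidden states by its dim-0 slice of the weight, yielding `[N, vocab/tp]` partial logits, and concatenating along the last dimension in rank order restores `[N, vocab]`. All names are hypothetical and nested lists stand in for tensors; no NCCL is involved.

```python
def matmul_t(x, w):
    """x: [N][hidden], w: [rows][hidden] -> [N][rows], i.e. x @ w.T."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]


def shard_dim0(w, rank, tp_world_size):
    """Return this rank's contiguous slice of rows (the vocab dimension)."""
    per_rank = len(w) // tp_world_size
    return w[rank * per_rank:(rank + 1) * per_rank]


def all_gather_last_dim(per_rank_outputs):
    """Concatenate each rank's [N][vocab/tp] output along the last dimension."""
    num_rows = len(per_rank_outputs[0])
    return [
        [value for output in per_rank_outputs for value in output[row]]
        for row in range(num_rows)
    ]


vocab, hidden, tp = 8, 2, 4
lm_head = [[v, v + 1] for v in range(vocab)]  # [vocab][hidden], integer-valued
hidden_states = [[1, 2], [3, -1]]             # [N][hidden]

full_logits = matmul_t(hidden_states, lm_head)
partials = [matmul_t(hidden_states, shard_dim0(lm_head, r, tp)) for r in range(tp)]
gathered = all_gather_last_dim(partials)
```

The gathered result is wider than any single rank's input, which is why the output needs its own fresh buffer instead of aliasing the input slot.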

…a_worker Prerequisite for task vllm-project#5 (codegen-side shard-kind dispatch). The emitted `_sharded` loader calls need `rank` as a runtime value in scope inside the `Weights::load_with` body — `tp_world_size` is already a baked literal per (model, tp), but `rank` is per-process. Touches: - `ferrite_forward::dispatcher::ArchTryLoadFn` + `try_load(..., tp_rank: u8, ...)` — adds rank between `tp_world_size` and `max_model_len` in the public API. Forwarded to the matched registration's closure. - Macro: `inventory::submit!`'s closure body becomes `|gw, stream, max_model_len, tp_rank, hf|` and passes `tp_rank` into `Weights::load(..., #tp_lit, tp_rank)`. The per-arch `Weights::load(...)` and `load_match_arms` body gain the `tp_rank: u8` parameter and forward it to `#model_ident::load(gw, stream, max_model_len, tp_rank)`. - Codegen: per-model `load_with(..., tp_rank: u8)` and `load(..., tp_rank: u8)` (both canonical and Shim variants). `tp_rank` is in scope inside the `load_with` body — task vllm-project#5's emitted `_sharded` calls reference it as `tp_rank as usize`. - `cuda_worker.rs`: passes `self.config.tp_rank as u8` into `try_load`. The `!use_tp` ferrite eligibility gate stays in place — until task vllm-project#5 lands the codegen-side shard-kind dispatch, the per-arch `Weights::load` at tp>1 still calls the unsharded loaders and would shape-mismatch the sharded `<W>::*` constants from `b8c25e5d2`. The gate lifts when vllm-project#5 + vllm-project#6 (embed masking + lowering AllReduce-after-Embed + AllGather-after-lm_head) are wired end-to-end. Macro tests 196/196 under both `--features cuda` and `--features nccl`. Full ferrite-models umbrella build clean at `--features cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in 7m51s. vllm-executor + vllm-cuda + ferrite-models compile clean at `--features cuda` in 1m58s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The wholesale change. `emit_unindexed_let` and `emit_layered_load_body` now take `tp_world_size: u8` and dispatch each `FieldLoad` arm to the matching `_sharded` variant when `tp_world_size > 1`. At `tp_world_size == 1` every arm emits the same call shape as before — byte-equivalent expansion, defended by the new `emit_layered_load_body_at_tp_eq_1_emits_no_sharded_call` test. Per the no_piecemeal_codegen_migration rule this lands wholesale across every dense FieldLoad arm in one commit: - `FieldLoad::Embedding` → `Embedding::load_sharded` / `load_layered_embedding_sharded` when the prefix's last segment is `embed_tokens` (vocab-parallel via Python's `VocabParallelEmbedding`); else falls back to unsharded. - `FieldLoad::LinearDense` → `LinearLayer::load_dense_sharded` with `dim = 0` for column-parallel (q/k/v/gate/up/lm_head/embed) or `dim = 1` for row-parallel (o/down). Replicate paths (norms etc.) keep the unsharded call. The shard kind comes from `tp_lowering::shard_kind_for_dotted_prefix` — the same table the lowering pass uses for `AllReduce` insertion (single source of truth). - `FieldLoad::LinearConcat` → `load_dense_concat_sharded` always (fused QKV / gate_up are always column-parallel; no row-parallel concat exists in any current arch). - `FieldLoad::LinearTiedToEmbedding` — unchanged. At tp>1 `embed_tokens.weight` is already vocab-sharded by the `Embedding::load_sharded` call site that ran first; the tied `Linear::new(embed.weight, None)` here just wraps the already-sharded tensor. That's how Python's `tie_weights(lm_head.weight = embed_tokens.weight)` stays self-consistent at tp>1 — both layers point at the same dim-0 sharded slice. - `FieldLoad::RmsNorm` / `CohereLayerNorm` — unchanged. Replicate. - Quantized arms (Marlin/Fp8/Bnb4) — unchanged. Defer per the dense-only scope decision; would need `_sharded` quant variants in ferrite-kernels first. 
Quantized arches DO still register at tp>1 in inventory (the macro outer loop doesn't skip them yet), but `cuda_worker`'s `!use_tp` ferrite gate prevents anyone from reaching them; lifting that gate without quant `_sharded` helpers would shape-mismatch. Threading: `tp_world_size` flows through `emit_weights_struct(...)` → `emit_group_let(...)` → `emit_unindexed_let(.., tp)` and `emit_layered_load_body(.., tp)`. Both `emit_weights_struct` callers (canonical + `emit_shim_model`) pass through the value. The Shim variant doesn't actually consume it (Shims delegate to canonical's `load_with`, which carries its own per-(model, tp) specialization), but the API stays uniform so a future change that needs tp on the Shim side has nothing to plumb. The two `emit_*` fns get `#[allow(clippy::too_many_arguments)]` since the existing 8-arg surface predates this commit. Re-exports: added `load_layered_linear_dense_sharded`, `load_layered_linear_dense_concat_sharded`, and `load_layered_embedding_sharded` to `ferrite_forward`'s `pub use loaders::{...}` list. The macro-emitted code references them through the `::ferrite_forward::` namespace. Macro tests: 198/198 (was 196/196 at task vllm-project#2). Two new tests: - `emit_layered_load_body_at_tp_eq_1_emits_no_sharded_call` — pins the negative direction across q_proj / o_proj / layernorm / concat / embedding plans. Defends against a future regression that drops the `tp == 1` short-circuit. - `emit_layered_load_body_at_tp_gt_1_dispatches_by_shard_kind` — pins the dim baked into the emitted call: q_proj → 0, o_proj → 1, layernorm → unsharded, concat → concat_sharded with `4 as usize` baked literal. Full ferrite-models umbrella build clean at both `--features cuda` (1m45s) and `--features cuda,nccl` (10m07s, all 11 arches × {1,2,4,8} tp variants). cuda_worker's `!use_tp` gate stays — the remaining blocker for end-to-end tp>1 is task vllm-project#6 (embed masking + tp_lowering's AllReduce-after-Embed + AllGather-after-lm_head insertions). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
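
The shard-kind dispatch pinned by the second test above (q_proj → dim 0, o_proj → dim 1, layernorm → unsharded, plus the `tp == 1` short-circuit) can be illustrated roughly as follows. This is a plain-Python sketch, not the macro's emitted Rust: `shard_weight` and the nested-list weight layout are hypothetical names chosen for illustration, and even division of the sharded dimension by `tp` is assumed.

```python
def shard_weight(weight, dim, rank, tp):
    """Return this rank's slice of a 2-D weight stored as a list of rows.

    dim=None -> replicated weight (layernorm-style), returned unchanged.
    dim=0    -> split along rows (q_proj-style in the commit message).
    dim=1    -> split along columns (o_proj-style in the commit message).
    Hypothetical helper for illustration only.
    """
    if tp == 1 or dim is None:
        # Short-circuit: no sharded path is taken at tp == 1.
        return weight
    rows, cols = len(weight), len(weight[0])
    if dim == 0:
        if rows % tp != 0:
            raise ValueError("rows must divide evenly by tp")
        per_rank = rows // tp
        return weight[rank * per_rank:(rank + 1) * per_rank]
    if dim == 1:
        if cols % tp != 0:
            raise ValueError("cols must divide evenly by tp")
        per_rank = cols // tp
        return [row[rank * per_rank:(rank + 1) * per_rank] for row in weight]
    raise ValueError("dim must be None, 0, or 1")


# A 4x4 weight split across two ranks.
w = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
row_shard = shard_weight(w, 0, 1, 2)   # rank 1 keeps the last two rows
col_shard = shard_weight(w, 1, 0, 2)   # rank 0 keeps the first two columns
```
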

…mbed)

Phase 1 of task vllm-project#6: embed side. The lm_head AllGather lands in the next commit. Matches Python `VocabParallelEmbedding.forward_native` (vllm/model_executor/layers/vocab_parallel_embedding.py:464–484): masked-input gather → zero out-of-range outputs → AllReduce sum.

Kernel side (`embedding_kernels.cu` + `kernels.rs`):

- Both `embedding_gather_kernel` and `embedding_gather_vec_kernel` gain `(vocab_offset, vocab_per_rank)` args. A token's `local_id = id - vocab_offset` is range-checked against `vocab_per_rank`; out-of-range tokens get zero-filled output rows. Underflow on `id < vocab_offset` is handled (the unsigned subtraction yields a huge value that fails the upper-bound check). The vec kernel zero-fills via memset on the int4/int2/int vector type.
- Rust extern declarations updated to match the new `extern "C"` C++ signatures (FFI wrappers `embedding_gather_{f16,bf16,f32}`).
- New `kernels::embedding_gather_masked(weight, ids, vocab_offset, vocab_per_rank, alloc, stream)` is the Rust entry point. The existing `kernels::embedding_gather(weight, ids, alloc, stream)` becomes a thin wrapper passing `(0, weight.dim(0) as u32)`: byte-equivalent at tp=1, no churn at the ~30 unrelated call sites in vllm-cuda hand-written models.

`Instruction::Embed` (instr.rs):

- Now reads `weight.dim(0)` for `vocab_per_rank` (= `vocab_size / tp` per rank's sharded slice; full vocab at tp=1) and computes `vocab_offset = rank * vocab_per_rank` from the NcclGroup. At tp=1 (`tp_group = None`) the offset stays 0 and the mask never trips. cfg-gated `tp_group` access: the field doesn't exist without `nccl`, so the offset path is `#[cfg(not(feature = "nccl"))]` constant-zero.

Lowering pass (tp_lowering.rs):

- `insert_all_reduces` extends the producer-walk from `OpKind::Gemm with ShardDim1 weight` to also include `OpKind::Embed with ShardDim0 weight` (i.e. `embed_tokens`). Same fan_out shape: append AllReduce node, rewire downstream consumers to read post-AllReduce value. Without this, ranks 1..N would silently contribute zero embeddings (from the masked gather) without summing back to the global value.
- New tests: `lowering_inserts_allreduce_after_vocab_parallel_embed_at_tp_gt_1` pins the embed insertion shape; `lowering_skips_replicate_embed_at_tp_gt_1` defends the ShardKind gate so a future shard-table change can't silently emit stray AllReduces.

Macro tests: 200/200 (was 198/198) at both default `--features cuda` and `--features nccl`. Full ferrite-models umbrella build clean at `--features cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in 7m18s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
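
The masked gather followed by an AllReduce sum described in this commit can be simulated on the CPU as a rough sketch. The code below is plain Python, not the CUDA kernel or the Rust entry point: `masked_gather`, `all_reduce_sum`, and `vocab_parallel_embed` are hypothetical names, the unsigned underflow is emulated with a 32-bit mask, and the AllReduce is modeled as an elementwise sum over per-rank outputs held in one process.

```python
MASK32 = 0xFFFFFFFF  # emulate u32 wraparound of the kernel's unsigned subtraction


def masked_gather(shard, ids, vocab_offset, vocab_per_rank):
    """Gather rows from this rank's shard; zero-fill tokens owned by other ranks."""
    hidden = len(shard[0])
    out = []
    for token_id in ids:
        # If token_id < vocab_offset the subtraction wraps to a huge value,
        # so the single upper-bound check below rejects it as well.
        local_id = (token_id - vocab_offset) & MASK32
        if local_id < vocab_per_rank:
            out.append(list(shard[local_id]))
        else:
            out.append([0.0] * hidden)
    return out


def all_reduce_sum(per_rank_outputs):
    """Stand-in for an AllReduce sum: elementwise sum across ranks."""
    n_tokens = len(per_rank_outputs[0])
    hidden = len(per_rank_outputs[0][0])
    return [
        [sum(rank_out[t][h] for rank_out in per_rank_outputs) for h in range(hidden)]
        for t in range(n_tokens)
    ]


def vocab_parallel_embed(table, ids, tp):
    """Split the table by rows across tp ranks, gather with masking, then sum."""
    vocab_per_rank = len(table) // tp  # assumes vocab divides evenly by tp
    partials = []
    for rank in range(tp):
        vocab_offset = rank * vocab_per_rank
        shard = table[vocab_offset:vocab_offset + vocab_per_rank]
        partials.append(masked_gather(shard, ids, vocab_offset, vocab_per_rank))
    return all_reduce_sum(partials)


table = [[float(i), float(i) * 10.0] for i in range(8)]  # vocab=8, hidden=2
ids = [0, 5, 7, 3]
sharded_result = vocab_parallel_embed(table, ids, tp=4)
reference = [table[i] for i in ids]
```

Each token is in range on exactly one rank, so the sum restores the full-table lookup; dropping the sum would leave every other rank holding zero rows.
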

Phase 2 of task vllm-project#6: lm_head side. Closes the all-gather hole the Instruction::AllGather + AllGatherImpl + OpKind::AllGather foundation in `9d70ff563` left for the lowering pass to fill.

Lowering (tp_lowering.rs):

- New `insert_lm_head_allgather(fuf, program, tp_world_size)`. At tp>1 walks the FUF for `OpKind::Gemm` nodes whose weight path's last segment is `"lm_head"`, and appends an `OpKind::AllGather` reading the gemm's output. Rewires every consumer of the lm_head Gemm to read the AllGather instead. At tp=1 it's a strict no-op.
- Refactor: extract `rewire_consumers(fuf, old, new)` so the AllReduce and AllGather inserters share the consumer-rewiring walk (was duplicated inline in the AllReduce loop). Behavior unchanged.
- Wired into `compile()` at the activation site right after `insert_all_reduces`. Both passes are gated on tp_world_size > 1 internally, no extra outer-loop branch.

backbone_output_for (codegen.rs):

- Updated to walk past the AllGather node when present. lm_head's hidden-state input was `last_node.inputs.first()`; with the AllGather inserted, `last_node` is now the AllGather, and its first input is the lm_head Gemm. Skip one hop back to recover the lm_head Gemm, then read its first input as before. At tp=1 the unchanged path is taken (no AllGather node exists). Without this, `forward_backbone` (used by pipeline-parallel intermediate ranks) would mistakenly return the lm_head gemm output instead of the hidden state.

FUF output shape on the AllGather node is left equal to the lm_head Gemm's output. The FUF carries pre-shard SYMBOLIC dims (e.g. `vocab_size` Bound, not `vocab_size / tp`); the runtime allocation comes from the kernel's `alloc_tensor` call, which reads `weight.dim(0)` (sharded) for the gemm and the gather's own world-size multiplier internally. The fresh slot for AllGather output is enforced by `AllGatherImpl::output_alias` returning `None` (already pinned by the `all_gather_impl_claims_single_tile_input_with_fresh_output_slot` test).

Macro tests: 203/203 (was 200/200) at both default `--features cuda` and `--features nccl`. Three new tests:

- `lowering_inserts_allgather_after_lm_head_gemm_at_tp_gt_1`
- `lowering_no_allgather_at_tp_eq_1`
- `lowering_skips_non_lm_head_gemms_for_allgather`: defends the name gate so a future change can't silently start emitting AllGathers on q_proj / o_proj / etc. (only one match per FUF in every current arch; multiple lm_head gemms would still be handled, but no arch produces them).

Full ferrite-models umbrella build clean at `--features cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in 7m19s. cuda_worker's `!use_tp` ferrite gate stays for one more commit; lifting it is the last step before task vllm-project#8 verify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
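
The insert-then-rewire step described for the lowering pass can be sketched on a toy graph. This is an illustrative Python model, not the Rust pass: the `Node` class, the index-based `inputs` list, and the `.`-separated weight path are all assumptions made for the example, and the function names only mirror those in the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy graph node; inputs are indices of producer nodes (hypothetical layout)."""
    kind: str
    inputs: list = field(default_factory=list)
    weight_path: str = ""


def rewire_consumers(nodes, old, new):
    """Point every consumer of node `old` at node `new` instead."""
    for idx, node in enumerate(nodes):
        if idx == new:
            continue  # the inserted node must keep reading the original producer
        node.inputs = [new if i == old else i for i in node.inputs]


def insert_lm_head_allgather(nodes, tp_world_size):
    """Append an AllGather after each Gemm whose weight path ends in 'lm_head'."""
    if tp_world_size <= 1:
        return  # strict no-op at tp=1
    targets = [
        i for i, n in enumerate(nodes)
        if n.kind == "Gemm" and n.weight_path.split(".")[-1] == "lm_head"
    ]
    for gemm in targets:
        nodes.append(Node("AllGather", [gemm]))
        rewire_consumers(nodes, gemm, len(nodes) - 1)


def build_graph():
    return [
        Node("Embed"),
        Node("Gemm", [0], "model.layers.0.q_proj"),
        Node("Gemm", [1], "lm_head"),
        Node("Sample", [2]),
    ]


tp2 = build_graph()
insert_lm_head_allgather(tp2, tp_world_size=2)

tp1 = build_graph()
insert_lm_head_allgather(tp1, tp_world_size=1)
```

After the pass at tp=2 the `Sample` node reads the AllGather, the AllGather reads the lm_head Gemm, and the q_proj Gemm is untouched because its path does not end in `lm_head`.
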

vllm/distributed/eplb/rebalance_execute.py:586 had a device-wide GPU sync with a NOTE(bowen) comment admitting the original author didn't know why it was needed. After investigation, the line is dead code in the SYNC path (rearrange_expert_weights_inplace).

Why it's safe: The SYNC path runs entirely on the default CUDA stream end-to-end: torch.empty_like, move_to_buffer's b.copy_(w, non_blocking=True), and NCCL Send/Recv (default stream=None -> current_stream()) all share it. No cross-stream hazard exists. PyTorch's ProcessGroupNCCL correctly calls record_stream() on input/output tensors, so the caching allocator is also safe across iterations.

The ASYNC path (transfer_layer + async_worker) uses its own design, cuda_stream.synchronize() (async_worker.py:134) plus CpuGpuEvent for thread handoff (eplb_utils.py), and is unaffected by this change.

Fixes: vllm-project#32028 (Item vllm-project#6)

Signed-off-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com>

WoosukKwon added 17 commits March 3, 2023 04:16

Fix a bug in 1D shape

e5a1fa8

Minor

342275f

Minor

b91a2fa

[WIP] Add memory analyzer

d78e2fb

Automatically config GPU/CPU blocks

2649eb5

Remove TODO

1ae7420

Merge branch 'main' into autoconfig

6654b34

Merge branch 'main' into autoconfig

fcbf027

Add max_num_batched_tokens argument

350ed27

Minor

6f5b41b

Minor

2d03918

Refactor model utils

8ec00fe

Re-implement memory analyzer

84203fc

Fix __init__

96b216c

Use memory analyzer in server.py

c89d440

Add psutil to README

f5d1e2c

Fix comment

cc63c24

WoosukKwon merged commit e9d3f2f into main Mar 12, 2023

WoosukKwon deleted the autoconfig branch March 12, 2023 07:23

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

shanshanpt mentioned this pull request Nov 17, 2023

Run long context error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Add memory analyzer & automatically configure KV cache size (vllm-proj…

de10960

…ect#6)

slyalin pushed a commit to slyalin/vllm that referenced this pull request Mar 21, 2024

Merge pull request vllm-project#6 from mzegla/extended_requirements

2922b06

Add missing Python requirements

mzusman added a commit to mzusman/vllm that referenced this pull request Apr 16, 2024

dtype (vllm-project#6)

00bce1f

Co-authored-by: Mor Zusman <morz@ai21.com>

dtrifiro referenced this pull request in dtrifiro/vllm Apr 26, 2024

Merge pull request #6 from z103cb/ibm_main_docker_ubi_updates

91e4a51

[CI/Build] Dockerfile.ubi : Remove test stage

dlopes78 mentioned this pull request May 8, 2024

[Bug]: VLLM + tritonserver #4695

Closed

Starmys pushed a commit to Starmys/vllm that referenced this pull request May 20, 2024

Merge pull request vllm-project#6 from wenxcs/wenxh/fp8-on-a100

4e56e27

FP8 on A100 for PHIMOE

oliver-li mentioned this pull request Jul 5, 2024

[Bug]: NCCL hangs and causes timeout #5484

Closed

Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026

Merge pull request vllm-project#6 from bcacdwk/copilot/create-framewo…

64ac2fb

…rk-slidesparse Add comprehensive SlideSparse integration documentation for vLLM

xinyu-intel pushed a commit to xinyu-intel/vllm that referenced this pull request Apr 11, 2026

Merge pull request vllm-project#6 from Alberto-Codes/fix/tq-page-size…

8a7764a

…-spec fix(turboquant): TQFullAttentionSpec Python page size override

carlosfundora pushed a commit to carlosfundora/vllm-1-bit-turbo that referenced this pull request Apr 13, 2026

Merge pull request vllm-project#6 from mitkox/turboquant-a6000-v019

c6b2ee9

[codex][Kernel] TurboQuant SM86 support, TP metadata slicing, and docs

SandishKumarHN mentioned this pull request Apr 18, 2026

[RFC]: EPLB Implementation Refactoring #32028

Open

1 task

tonyliu312 mentioned this pull request Apr 27, 2026

[Bug]: DeepSeek-V4-Flash hangs after ~6 requests with cudagraph_mode=FULL_AND_PIECEWISE + chunked prefill on SM 12.x (GB10) #40969

Open

1 task

Natfii mentioned this pull request Apr 29, 2026

feat: uber-kernel migration (nvllm-v0.3.0) — β-coop as production decode path Navi-AI-Lab/nvllm#4

Merged

6 tasks

SongXiaoMao mentioned this pull request May 13, 2026

[Bug]: MTP speculative decoding crash with illegal memory access on long sequences (Qwen3.6-27B-FP8, v0.19.1) #40756

Open

1 task

jasl mentioned this pull request May 16, 2026

[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes #41834

Open

maeehart mentioned this pull request May 17, 2026

[ROCm][DSv4][WIP] Sparse-MLA bring-up on MI300X (FP8 encoder/decoder symmetry + cudagraph fixes) maeehart/vllm#1

Open

5 tasks
