Add miscellaneous updates #8

Merged: WoosukKwon merged 6 commits into main from minor on Mar 13, 2023
@WoosukKwon
Collaborator

This PR contains several miscellaneous updates to the system, with two notable changes:

  1. The size of the CPU KV cache is now calculated based on the swap_space size provided by the user (defaulting to 20 GiB).
  2. The default value for max_num_batched_tokens has been increased from 2048 to 2560.
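
As a rough illustration of the first change, the number of CPU cache blocks can be obtained by dividing the swap space by the size of one KV block. The sketch below is an assumption about that arithmetic, not the code from this PR; the function names, `block_size`, and the model dimensions are all hypothetical.

```python
GiB = 1 << 30


def kv_block_bytes(block_size: int, num_heads: int, head_size: int,
                   num_layers: int, dtype_bytes: int = 2) -> int:
    """Bytes needed to hold keys and values for one block across all layers."""
    per_layer = block_size * num_heads * head_size * dtype_bytes
    return 2 * per_layer * num_layers  # factor 2: one key block + one value block


def num_cpu_blocks(swap_space_gib: int, block_bytes: int) -> int:
    """How many whole KV blocks fit in the user-provided swap space."""
    return (swap_space_gib * GiB) // block_bytes


# Hypothetical model shape: 40 layers, 40 heads of size 128, fp16, 16 tokens per block.
block_bytes = kv_block_bytes(block_size=16, num_heads=40, head_size=128, num_layers=40)
print(block_bytes)                      # 13107200
print(num_cpu_blocks(20, block_bytes))  # 1638
```

With these assumed dimensions, the default 20 GiB of swap space holds 1638 blocks; a larger `swap_space` scales the count linearly.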

WoosukKwon merged commit cfae35b into main on Mar 13, 2023
WoosukKwon deleted the minor branch on March 13, 2023 20:48
v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request Sep 12, 2023
xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
mzusman added a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
* Return support for other models apart from jamba

* Support n>1

* A little cleanup

* Rename

* Apply whitespace suggestions from code review

* Add max batch size to the main func

* Fixed attention kv cache bug

* Log where request IDs are deleted from the dict at debug level

* Fix typo

* Align with v0.3.3 vllm code

* Remove comments

* Take out model config from CUDAGraph object

* Fix

* Fix typo

* Make the kv cache selection cleaner

* Another typo

* Took the num layers calc outside

* Remove the -1

* Set as num layer / period

---------

Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
sfc-gh-hazhang pushed a commit to sfc-gh-hazhang/vllm that referenced this pull request May 7, 2024
ykim362 pushed a commit to ykim362/vllm that referenced this pull request Jun 17, 2024
…128k

Support Phi3SuScaledRotaryEmbedding for 128k model
alixiaodi mentioned this pull request Aug 2, 2024
zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024
tjtanaa pushed a commit to tjtanaa/vllm that referenced this pull request Jan 29, 2026
Add PR and issue templates from vLLM project
Srinivasoo7 pushed a commit to Srinivasoo7/vllm that referenced this pull request Mar 4, 2026
…Manager

- Add store_threshold >= 2 validation in FilterReusedOffloadingManager
  constructor (mirrors the existing max_tracker_size >= 1 guard)
- Fix cpu.py gate from > 1 to >= 2; update comment to clarify that
  values < 2 disable filtering
- Add internal assertions to test_filter_reused_manager to verify
  tracker eviction and count reset (Comments vllm-project#8 and vllm-project#9)
- Remove tests/v1/kv_offload/__init__.py (not needed for pytest discovery)
- Remove accidentally tracked dev-workflow files (.patch, diff*.txt,
  error.txt, log files, mypy/test output files)

Signed-off-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com>
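
The constructor guard and the gate described in the commit above might look roughly like the following. This is a minimal sketch under stated assumptions: `FilterReusedManagerSketch` and `should_enable_filter` are hypothetical names, and only the two threshold checks come from the commit message.

```python
class FilterReusedManagerSketch:
    """Hypothetical stand-in showing only the constructor validation."""

    def __init__(self, store_threshold: int, max_tracker_size: int):
        if store_threshold < 2:
            raise ValueError(
                f"store_threshold must be >= 2, got {store_threshold}")
        if max_tracker_size < 1:
            raise ValueError(
                f"max_tracker_size must be >= 1, got {max_tracker_size}")
        self.store_threshold = store_threshold
        self.max_tracker_size = max_tracker_size


def should_enable_filter(store_threshold: int) -> bool:
    """Gate sketch: values below 2 disable filtering."""
    return store_threshold >= 2


print(should_enable_filter(1))  # False
print(should_enable_filter(2))  # True
```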
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 26, 2026
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All
fixes are from upstream vLLM `main` and address critical bugs affecting
RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:**
[INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |


## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support
(vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but
the fix targets a self-contained block.
</details>

<details>
<summary><b>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored
imports in `gptoss_reasoning_parser.py`. The fix logic (adding
`eom_token_id` early-exit check in `is_reasoning_end`) was independent
of the import changes.
</details>

<details>
<summary><b>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2
files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30
intermediate upstream commits refactored `fused_moe_kernel` with
conditional `naive_block_assignment` logic that doesn't exist in
`rhai/0.13.0`. Resolved by keeping our simpler code and applying only
the int64 cast fix:
- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()`
result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)`
before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on
`make_dummy_moe_config()` from intermediate refactors. Resolved by
keeping our existing test code (no test changes).
</details>
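
To see why the int64 cast matters, the following sketch emulates 32-bit signed multiplication in plain Python. It is not the Triton kernel; the `offs_token` and `stride` values are made up for illustration, and `ctypes` is used only to reproduce the wrap-around.

```python
import ctypes

INT32_MAX = 2**31 - 1


def mul_int32(a: int, b: int) -> int:
    """Multiply and wrap the result as 32-bit signed arithmetic would."""
    return ctypes.c_int32(a * b).value


def mul_int64(a: int, b: int) -> int:
    """Multiply with 64-bit signed arithmetic (what the cast enables)."""
    return ctypes.c_int64(a * b).value


offs_token = 600_000  # hypothetical token offset
stride = 4_096        # hypothetical stride in elements

print(offs_token * stride > INT32_MAX)  # True
print(mul_int32(offs_token, stride))    # -1837367296
print(mul_int64(offs_token, stride))    # 2457600000
```

The 32-bit product wraps to a negative offset, which would address memory before the start of the buffer; widening the offset before the multiply keeps the product exact.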

<details>
<summary><b>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict
due to file rename + API change</summary>

Upstream moved `collective_fusion.py` →
`compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API
from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to
`create_allreduce_fusion_workspace()`. Resolved by applying the
try/except wrapper around our existing
`trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in
`collective_fusion.py`. The error handling logic (catching RuntimeError
with "multicast" in message, logging warning, returning early) is
identical to upstream.
</details>

<details>
<summary><b>vllm-project#7 — response_format validation for completions
(vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and
`chat_completion/protocol.py`. Our branch still has the monolithic
`protocol.py`. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which
doesn't exist in our branch; `ValueError` is already handled as 400 Bad
Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to
`test_completion_error.py`
</details>

<details>
<summary><b>vllm-project#8 — response_format validation for chat completions
(vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as item #7 above. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema`
test from the conflict (discarding ~140 lines of intermediate upstream
tests that reference non-existent paths in our branch)
</details>
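
The validation added in items 7 and 8 can be approximated without Pydantic as a plain function that raises `ValueError`, which the serving layer then maps to a 400 response. The sketch below is a simplification under stated assumptions: the request is a plain `dict`, and `handle` is a hypothetical stand-in for the error mapping in `serving_engine.py`.

```python
def validate_response_format(request: dict) -> dict:
    """Reject a json_schema response_format that carries no schema."""
    response_format = request.get("response_format")
    if response_format is None:
        return request
    if (response_format.get("type") == "json_schema"
            and response_format.get("json_schema") is None):
        raise ValueError(
            "response_format.json_schema is required when type is 'json_schema'")
    return request


def handle(request: dict) -> int:
    """Return an HTTP status: validation errors become 400 rather than 500."""
    try:
        validate_response_format(request)
    except ValueError:
        return 400
    return 200


print(handle({"response_format": {"type": "json_schema"}}))  # 400
print(handle({"response_format": {"type": "json_object"}}))  # 200
```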

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the
backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8
attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format
returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32
overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast)
gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without
`json_schema` field returns 400 (not 500) for both `/v1/completions` and
`/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention
backend on ROCm


yuezhu1 pushed a commit to yuezhu1/vllm that referenced this pull request Mar 30, 2026
…llm-project#8)

Add optional `get_desired_lora_slots()` method to the `LoRAResolver` ABC
with a default `return None` so all existing subclasses remain unaffected.

The engine will call this hook between batches when dynamic_lora_slots=True
to let resolver implementations signal a desired GPU slot count. The returned
value is clamped to [min_loras, max_loras] by the engine (implemented in vllm-project#13).

Closes vllm-project#8

Co-authored-by: Claude
Signed-off-by: Chen Wang <Chen.Wang1@ibm.com>
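
A minimal sketch of the optional hook and the engine-side clamp described in the commit above. The class and function here are simplified stand-ins: `LoRAResolverSketch`, its `resolve_lora` signature, and `clamp_slots` are assumptions for illustration, and only the default `return None` and the `[min_loras, max_loras]` clamp come from the commit message.

```python
from abc import ABC, abstractmethod
from typing import Optional


class LoRAResolverSketch(ABC):
    """Simplified stand-in for the resolver ABC."""

    @abstractmethod
    def resolve_lora(self, lora_name: str) -> Optional[str]:
        """Return a path for the adapter, or None if unknown."""

    def get_desired_lora_slots(self) -> Optional[int]:
        """Optional hook; the default means 'no preference'."""
        return None


class ExistingResolver(LoRAResolverSketch):
    """A subclass written before the hook existed keeps working unchanged."""

    def resolve_lora(self, lora_name: str) -> Optional[str]:
        return f"/adapters/{lora_name}"


def clamp_slots(desired: Optional[int], current: int,
                min_loras: int, max_loras: int) -> int:
    """Engine-side sketch: keep the current count when no preference is given."""
    if desired is None:
        return current
    return max(min_loras, min(max_loras, desired))


resolver = ExistingResolver()
print(resolver.get_desired_lora_slots())                  # None
print(clamp_slots(16, current=4, min_loras=1, max_loras=8))  # 8
```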
Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026
…rk-slidesparse

Update framework_slidesparse.md: restructure into a seven-stage engineering workflow and flesh out implementation details
jinhuang12 pushed a commit to jinhuang12/vllm that referenced this pull request Apr 8, 2026
…d check

Replace all "diminishing returns" / discretionary language with mechanical
f-threshold stop condition across SKILL.md, orchestration docs, hooks,
and conformance tests. Key changes:

- Stage 7 marked AUTONOMOUS with decision tree (no user interaction)
- Non-Negotiable vllm-project#8 + Campaign Stop Condition already in place; align
  all downstream references (Task Graph, Example 1, Resume Protocol)
- Escalation Protocol: STOP → HALT (clarify ≠ campaign termination)
- Resume Protocol step 9: prohibit autonomous pause (user-request only)
- Stop hook: add paused-status exit + replace stale nudge language
- Gate hook + test: update "diminishing returns" labels
- README: fix stale 3% default → 1.0%, add Non-Negotiable vllm-project#8
- integration-logic.md: fix 5 discretionary-language spots
- test-orchestrator.md: update all § references and expected behaviors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Natfii referenced this pull request in Navi-AI-Lab/nvllm Apr 14, 2026
…impl

CutePagedAttentionImpl becomes a pipeline state object:
- bind_fusion_weights() stores static weights + allocates persistent
  I/O buffers with fixed addresses (graph-safe)
- forward() reads from self instead of per-forward side-channels
- gate_buf added for output gate fusion (Qwen3NextAttention)

Blockers #6, #7, #8 from the CUDA graphs checklist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Apr 17, 2026
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
…nccl`

First step of TP support per project_tp_design_notes. Adds the
universal `Instruction::AllReduce(u32)` variant + eval arm + the
`ForwardCtx::tp_group: Option<&Arc<NcclGroup>>` field, all gated
behind a new `nccl` cargo feature on `ferrite-forward`. At tp=1 the
upcoming lowering pass emits zero AllReduce rows, so this is a
strict superset of the current `cuda` build.

Variant placement mirrors `Add` / `FusedAddRmsNorm` — one-tile in-
place same-shape, so shape-aware coloring will collapse it to the
input slot with no `View` row (validated by task vllm-project#3's coloring
test). Eval arm calls `NcclGroup::all_reduce_inplace` and `expect`s
both the group reference and the call result; the `None` case is
unreachable when canonical fanout (task vllm-project#7) only emits AllReduce
rows for tp_world_size > 1 canonicals.

Plumbs the feature forward through `vllm-executor`'s `nccl` feature
so the cuda_worker `ForwardCtx` construction sites compile under
the full feature set; `tp_group: None` for now (task vllm-project#8 wires the
real `Arc<NcclGroup>` through). Also stubs the missing
`Self::Ferrite(_) => {}` arm in `CudaModel::set_tp_group` — that
match was non-exhaustive under `--features nccl` because the
ferrite stack was previously TP-oblivious and nobody compiled the
nccl path through it.

Verified: `cargo check -p ferrite-forward --features cuda` (variant
absent) and `--features nccl` (variant present) both green;
`cargo check -p vllm-executor --features cuda` and `--features
nccl` both green; clippy -D warnings clean on both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
…tp_group

Task vllm-project#8 plumbing. The Ferrite forward path now actually receives
the worker's NCCL communicator instead of swallowing it.

- `FerriteModel` gains a `tp_group: Option<Arc<NcclGroup>>` field
  (gated on `feature = "nccl"`), mirroring the same shape the hand-
  written CudaModel arms already carry.
- `CudaModel::set_tp_group` arm `Self::Ferrite(m) => m.tp_group =
  Some(group)` replaces the task-vllm-project#1 stub.
- Both `ForwardCtx` construction sites (forward + forward_backbone)
  pass `tp_group: m.tp_group.as_ref()` so the universal
  `Instruction::AllReduce` eval arm has the group reference it
  expects when the lowering pass starts emitting AllReduce rows
  (task vllm-project#7's canonical fanout will activate that).
- `FerriteModel` construction in cuda_worker initializes
  `tp_group: None`; the worker's later `set_tp_group` call wires it.

Also cleans up an `AllReduceImpl::interpreter_arm` method I had
dropped into `impl Implementation for AllReduceImpl` — that method
isn't on the `Implementation` trait (the universal-eval pivot in
`de15e035a` left only `opcode_shape` + `fan_out` as the codegen
override surface). Removed with a comment pointing at the
production eval path.

Verified: `cargo check -p vllm-executor --features cuda` and
`--features nccl` both green; `cargo clippy -D warnings` clean on
ferrite-forward-macro and vllm-executor; macro tests 190/190 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
…+ Embedding)

Tensor-parallel safetensors → GPU loaders matching Python vLLM's
ColumnParallelLinear / RowParallelLinear / VocabParallelEmbedding
weight_loader semantics. Used by codegen at tp>1 (task vllm-project#5 wires
the dispatch from the macro side; this commit lands only the
runtime helpers).

Added on the kernel-side `LinearLayer` / `Linear` / `Embedding`:

- `Linear::load_sharded(weights, prefix, dim, rank, world)`. The
  load-bearing bias rules:
  - `dim = 0` (column-parallel: q/k/v/gate/up/lm_head/embed): bias
    shards along dim 0 too — each rank holds its own slice. Mirrors
    Python `ColumnParallelLinear.weight_loader` →
    `loaded_weight.narrow(output_dim=0, …)`.
  - `dim = 1` (row-parallel: o_proj, down_proj): bias is **replicated
    full-size on rank 0 only**, `None` on other ranks. The forward
    path adds bias before the cross-rank AllReduce-sum; only rank 0's
    contribution survives the sum, giving exactly one bias add to
    the residual stream. Mirrors Python `RowParallelLinear.forward`
    line 1543 `bias_ = None if (self.tp_rank > 0 …) else self.bias`.

- `LinearLayer::load_dense_sharded(weights, prefix, dim, rank, world)`.
  Thin wrapper over `Linear::load_sharded`. The codegen entry point
  for non-fused (single-prefix) sharded loads.

- `LinearLayer::load_dense_concat_sharded(weights, prefixes, stream,
  rank, world)`. Sharded variant of `load_dense_concat` for the fused
  QKV / gate_up paths. Always column-parallel (no row-parallel concat
  exists in any current arch). Each source weight slices along dim 0
  to `[out_i / world, hidden]` then packs into one contiguous
  `[(sum out_i) / world, hidden]` GPU buffer via per-source
  `take_shard_into`. Biases follow the column-parallel rule (sliced
  along dim 0) — matches Python `MergedColumnParallelLinear` /
  `QKVParallelLinear`. Per-rank divisibility is guaranteed by the
  macro's outer-loop fanout `skip` of indivisible (variant, tp)
  tuples (commit `889c44b2f`).

- `Embedding::load_sharded(weights, prefix, rank, world)`. Vocab-
  parallel: slices the embedding table along dim 0
  (`[vocab_size, hidden]` → `[vocab_size / world, hidden]`).
  Mirrors Python `VocabParallelEmbedding`. Same dim-0 cut as
  `Linear::load_sharded(dim=0)` — that's what makes
  `tie_weights(lm_head.weight = embed_tokens.weight)` self-consistent
  at tp>1.

Defers FP8 / Marlin / BNB sharded variants — the verify model
(commandr) is dense bf16. World == 1 short-circuits to
byte-equivalent behavior with the existing unsharded paths in
every helper, plus shard-kind-aware bias rules. No tests added at
this layer (CUDA stream + safetensors fixtures aren't worth the
infra spend; the integration test is task vllm-project#8). `take_shard` /
`take_shard_into` on `GpuWeights` already exist (used by the prior
hand-written TP path); these wrappers are pure call-site
plumbing on top.

Build clean: ferrite-kernels checks + clippy at default features.
The macro-side consumer that chooses sharded vs unsharded based
on shard_kind comes in tasks vllm-project#4 + vllm-project#5; until then these helpers
have no runtime caller (intentionally — wholesale codegen
migration per the no_piecemeal_codegen_migration rule).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
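
The bias rules in the commit above can be checked with a small simulation. This is a sketch in plain Python rather than the Rust helpers it describes: weights are nested lists of integers in `[out, in]` layout, and `load_sharded`, `linear`, and `row_parallel_forward` are hypothetical names that only mimic the slicing and the AllReduce-sum.

```python
from typing import List, Optional, Tuple

Matrix = List[List[int]]
Vector = List[int]


def load_sharded(weight: Matrix, bias: Optional[Vector], dim: int,
                 rank: int, world: int) -> Tuple[Matrix, Optional[Vector]]:
    """Return this rank's slice of a [out, in] weight and its bias."""
    if world == 1:
        return weight, bias  # unsharded path is unchanged
    if dim == 0:
        # Column-parallel: cut the output rows; bias is cut the same way.
        n = len(weight) // world
        rows = slice(rank * n, (rank + 1) * n)
        return weight[rows], (bias[rows] if bias is not None else None)
    if dim == 1:
        # Row-parallel: cut the input columns; only rank 0 keeps the full bias.
        n = len(weight[0]) // world
        cols = slice(rank * n, (rank + 1) * n)
        return [row[cols] for row in weight], (bias if rank == 0 else None)
    raise ValueError(f"dim must be 0 or 1, got {dim}")


def linear(weight: Matrix, x: Vector, bias: Optional[Vector]) -> Vector:
    """y = W @ x (+ bias), using plain Python integers."""
    y = [sum(w * v for w, v in zip(row, x)) for row in weight]
    return y if bias is None else [a + b for a, b in zip(y, bias)]


def row_parallel_forward(weight: Matrix, x: Vector,
                         bias: Optional[Vector], world: int) -> Vector:
    """Simulate every rank's partial output, then an AllReduce-sum."""
    n = len(x) // world
    partials = []
    for rank in range(world):
        w_r, b_r = load_sharded(weight, bias, dim=1, rank=rank, world=world)
        partials.append(linear(w_r, x[rank * n:(rank + 1) * n], b_r))
    return [sum(vals) for vals in zip(*partials)]


weight = [[1, 2, 3, 4], [5, 6, 7, 8]]
x = [1, 1, 1, 1]
bias = [10, 20]
print(linear(weight, x, bias))                         # [20, 46]
print(row_parallel_forward(weight, x, bias, world=2))  # [20, 46]
```

Because only rank 0 carries the bias, the summed result matches the unsharded output; if every rank added the bias, it would be counted `world` times.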
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
Phase 2 of task vllm-project#6 — lm_head side. Closes the all-gather hole the
Instruction::AllGather + AllGatherImpl + OpKind::AllGather
foundation in `9d70ff563` left for the lowering pass to fill.

Lowering (tp_lowering.rs):
- New `insert_lm_head_allgather(fuf, program, tp_world_size)`. At
  tp>1 walks the FUF for `OpKind::Gemm` nodes whose weight path's
  last segment is `"lm_head"`, and appends an `OpKind::AllGather`
  reading the gemm's output. Rewires every consumer of the
  lm_head Gemm to read the AllGather instead. At tp=1 it's a
  strict no-op.
- Refactor: extract `rewire_consumers(fuf, old, new)` so the
  AllReduce and AllGather inserters share the consumer-rewiring
  walk (was duplicated inline in the AllReduce loop). Behavior
  unchanged.
- Wired into `compile()` at the activation site right after
  `insert_all_reduces` — both passes are gated on tp_world_size > 1
  internally, no extra outer-loop branch.

backbone_output_for (codegen.rs):
- Updated to walk past the AllGather node when present. lm_head's
  hidden-state input was `last_node.inputs.first()`; with the
  AllGather inserted, `last_node` is now the AllGather, and its
  first input is the lm_head Gemm. Skip one hop back to recover
  the lm_head Gemm, then read its first input as before. At tp=1
  the unchanged path is taken (no AllGather node exists). Without
  this, `forward_backbone` (used by pipeline-parallel intermediate
  ranks) would mistakenly return the lm_head gemm output instead
  of the hidden state.

FUF output shape on the AllGather node is left equal to the
lm_head Gemm's output. The FUF carries pre-shard SYMBOLIC dims
(e.g. `vocab_size` Bound, not `vocab_size / tp`); the runtime
allocation comes from the kernel's `alloc_tensor` call, which
reads `weight.dim(0)` (sharded) for the gemm and the gather's
own world-size multiplier internally. The fresh slot for
AllGather output is enforced by `AllGatherImpl::output_alias`
returning `None` (already pinned by the
`all_gather_impl_claims_single_tile_input_with_fresh_output_slot`
test).

Macro tests: 203/203 (was 200/200) at both default `--features
cuda` and `--features nccl`. Three new tests:
- `lowering_inserts_allgather_after_lm_head_gemm_at_tp_gt_1`
- `lowering_no_allgather_at_tp_eq_1`
- `lowering_skips_non_lm_head_gemms_for_allgather` — defends the
  name gate so a future change can't silently start emitting
  AllGathers on q_proj / o_proj / etc. (only one match per FUF
  in every current arch — multiple lm_head gemms would still be
  handled, but no arch produces them).

Full ferrite-models umbrella build clean at `--features
cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in
7m19s. cuda_worker's `!use_tp` ferrite gate stays for one more
commit — lifting it is the last step before task vllm-project#8 verify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
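
The lowering step described in the commit above can be pictured on a toy graph. This is a Python sketch under stated assumptions, not the Rust pass: `Node` is a hypothetical structure, nodes refer to their inputs by list index, and weight paths are assumed to be dot-separated.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str
    inputs: List[int] = field(default_factory=list)
    weight_path: str = ""


def rewire_consumers(nodes: List[Node], old: int, new: int) -> None:
    """Point every consumer of `old` at `new`, except `new` itself."""
    for idx, node in enumerate(nodes):
        if idx != new:
            node.inputs = [new if i == old else i for i in node.inputs]


def insert_lm_head_allgather(nodes: List[Node], tp_world_size: int) -> None:
    """Append an AllGather after each lm_head Gemm; a no-op at tp=1."""
    if tp_world_size <= 1:
        return
    targets = [i for i, n in enumerate(nodes)
               if n.kind == "Gemm" and n.weight_path.split(".")[-1] == "lm_head"]
    for gemm in targets:
        nodes.append(Node("AllGather", [gemm]))
        rewire_consumers(nodes, gemm, len(nodes) - 1)


def toy_graph() -> List[Node]:
    return [
        Node("Embed"),
        Node("Gemm", [0], "model.layers.0.q_proj"),
        Node("Gemm", [1], "lm_head"),
        Node("Sample", [2]),
    ]


graph = toy_graph()
insert_lm_head_allgather(graph, tp_world_size=2)
print(graph[4])         # Node(kind='AllGather', inputs=[2], weight_path='')
print(graph[3].inputs)  # [4]
```

The `q_proj` Gemm is left alone because its last path segment is not `lm_head`, which is the name gate the third new test defends.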
starpit added a commit to starpit/vllm that referenced this pull request Apr 27, 2026
End-to-end TP is now wired: codegen routes load calls through
`_sharded` helpers (52615e881), lowering injects AllReduce after
vocab-parallel Embed (aefa1de37), AllGather after lm_head Gemm
(8a3ea25bc), and tp_rank threads through the full
try_load → Weights::load → load_with chain (60d9b9d4e). At tp=1
every code path is byte-equivalent to the pre-TP build via the
sharded-helpers' `world == 1` short-circuits.

cuda_worker's `!use_tp` ferrite eligibility gate served as
belt-and-suspenders during the multi-commit landing. With the
chain complete, drop the gate so `vllm chat ... --tensor-parallel-
size 2` reaches `try_load` with the matching `(arch, tp_world_size,
tp_rank)` triple and gets the per-(model, tp) sharded
registration.

Build clean: vllm-executor + vllm-cuda check at `--features cuda`
(1m44s) and `--features cuda,nccl` (7m35s, full ferrite-models
umbrella for the latter).

Next: task vllm-project#8 — verify on commandr at tp=2 with `vllm chat
CohereForAI/c4ai-command-r-v01 --tensor-parallel-size 2`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>