Add miscellaneous updates by WoosukKwon · Pull Request #8 · vllm-project/vllm

WoosukKwon · 2023-03-13T20:48:11Z

This PR contains several miscellaneous updates to the system, with two notable changes:

- The size of the CPU KV cache is now calculated based on the `swap_space` size provided by the user (defaulting to 20 GiB).
- The default value for `max_num_batched_tokens` has been increased from 2048 to 2560.
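As a rough illustration of how a swap-space budget could translate into a number of CPU KV cache blocks, here is a minimal sketch. The helper names, the block layout (keys plus values for every layer), and the model dimensions are assumptions chosen for illustration; they are not taken from this PR's diff.

```python
GiB = 1 << 30


def kv_block_bytes(block_size, num_heads, head_size, num_layers, dtype_bytes):
    """Bytes needed to hold one KV block across all layers (hypothetical layout)."""
    # Factor 2: one tensor for keys and one for values.
    per_layer = 2 * block_size * num_heads * head_size * dtype_bytes
    return per_layer * num_layers


def num_cpu_blocks(swap_space_gib, block_bytes):
    """Number of whole blocks that fit in the user-provided swap space."""
    return (swap_space_gib * GiB) // block_bytes


# Illustrative dimensions only (assumed, not from the PR):
# 16 tokens per block, 40 heads of size 128, 40 layers, 2-byte (fp16) elements.
block_bytes = kv_block_bytes(block_size=16, num_heads=40, head_size=128,
                             num_layers=40, dtype_bytes=2)
cpu_blocks = num_cpu_blocks(swap_space_gib=20, block_bytes=block_bytes)
print(block_bytes, cpu_blocks)
```

With these assumed dimensions each block takes 12.5 MiB, so the default 20 GiB of swap space holds 1638 whole blocks.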

Organise

* Return support for other models apart from jamba
* Support n>1
* A little cleanup
* Rename
* Apply whitespace suggestions from code review
* Add max batch size to the main func
* Fixed attention kv cache bug
* log where requests id are deleted from the dict to debug mode
* Fix typo
* Align with v0.3.3 vllm code
* Remove comments
* Take out model config from CUDAGraph object
* Fix
* Fix typo
* Make the kv cache selection cleaner
* Another typo
* Took the num layers calc outside
* Remove the -1
* Set as num layer / period

---------

Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>

remove dummy path in arctic

…128k Support Phi3SuScaledRotaryEmbedding for 128k model

update overhead benchmark

Add PR and issue templates from vLLM project

…Manager

- Add store_threshold >= 2 validation in FilterReusedOffloadingManager constructor (mirrors the existing max_tracker_size >= 1 guard)
- Fix cpu.py gate from > 1 to >= 2; update comment to clarify that values < 2 disable filtering
- Add internal assertions to test_filter_reused_manager to verify tracker eviction and count reset (Comments vllm-project#8 and vllm-project#9)
- Remove tests/v1/kv_offload/__init__.py (not needed for pytest discovery)
- Remove accidentally tracked dev-workflow files (.patch, diff*.txt, error.txt, log files, mypy/test output files)

Signed-off-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com>
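The constructor guards described above can be pictured with a small stand-in class. This is a hypothetical sketch, not the actual `FilterReusedOffloadingManager`: the class name `ReuseFilter`, the `should_store` method, and the exact eviction and reset behaviour are assumptions made only to illustrate a threshold check, a bounded tracker, and a count reset.

```python
from collections import OrderedDict


class ReuseFilter:
    """Hypothetical stand-in: store a block only after it has been seen
    `store_threshold` times, tracking at most `max_tracker_size` blocks."""

    def __init__(self, store_threshold: int = 2, max_tracker_size: int = 1024):
        # Guards analogous to the ones named in the commit message.
        if store_threshold < 2:
            raise ValueError("store_threshold must be >= 2")
        if max_tracker_size < 1:
            raise ValueError("max_tracker_size must be >= 1")
        self.store_threshold = store_threshold
        self.max_tracker_size = max_tracker_size
        self._counts = OrderedDict()  # block hash -> times seen so far

    def should_store(self, block_hash) -> bool:
        count = self._counts.pop(block_hash, 0) + 1
        if count >= self.store_threshold:
            return True  # count reset: the entry stays removed from the tracker
        self._counts[block_hash] = count
        if len(self._counts) > self.max_tracker_size:
            self._counts.popitem(last=False)  # evict the oldest tracked block
        return False
```

In this sketch a threshold of 1 would make the filter a no-op, which is one plausible reason to reject values below 2 in the constructor and treat them as "filtering disabled" at the call site.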

## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary>#1 — llama-nemotron-embed / score-template support (vllm-project#30550): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#2 — Triton Attention (vllm-project#31406): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#3 — Llama-4 attn quant (vllm-project#34243): Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block.

</details>

<details>
<summary>#4 — GPT-OSS multi-turn (vllm-project#34454): Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.

</details>

<details>
<summary>#5 — Fused MoE int32 overflow (vllm-project#34507): Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:
- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).

</details>

<details>
<summary>#6 — AllReduceFusionPass multicast (vllm-project#35085): Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream.

</details>

<details>
<summary>#7 — response_format validation for completions (vllm-project#35456): Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:
- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to `test_completion_error.py`

</details>

<details>
<summary>#8 — response_format validation for chat completions (vllm-project#35510): Conflict due to file restructuring</summary>

Same file restructuring issue as #7. Resolved by:
- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch)

</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm

[INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
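The int32 overflow behind the fused MoE fix (a `stride * offset` product computed in 32-bit arithmetic) can be reproduced with plain integer arithmetic. The sketch below only simulates two's-complement wraparound in Python; the stride and token offset are illustrative assumptions, not values taken from the kernel.

```python
INT32_MAX = (1 << 31) - 1


def wrap_int32(value: int) -> int:
    """Reinterpret an integer as a two's-complement 32-bit value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value > INT32_MAX else value


# Illustrative numbers only: a row stride of 8192 elements and a token
# offset of 300,000 give a product that no longer fits in int32.
stride = 8192
token_offset = 300_000

exact = stride * token_offset      # what 64-bit arithmetic would produce
wrapped = wrap_int32(exact)        # what 32-bit arithmetic would produce

print(exact > INT32_MAX, wrapped)
```

The wrapped product is negative, so a pointer computed from it would address memory before the start of the buffer. Casting the offsets to a 64-bit type before the multiply, as the cherry-picked fix does with `tl.int64`, keeps the exact value.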

…llm-project#8)

Add optional `get_desired_lora_slots()` method to the `LoRAResolver` ABC with a default `return None` so all existing subclasses remain unaffected. The engine will call this hook between batches when dynamic_lora_slots=True to let resolver implementations signal a desired GPU slot count. The returned value is clamped to [min_loras, max_loras] by the engine (implemented in vllm-project#13).

Closes vllm-project#8

Co-authored-by: Claude
Signed-off-by: Chen Wang <Chen.Wang1@ibm.com>
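The optional-hook pattern described in this commit message can be sketched as follows. `LoRAResolverSketch`, its `resolve_lora` signature, and the `clamp_slots` helper are hypothetical stand-ins written for illustration; only the hook name `get_desired_lora_slots`, its `None` default, and the `[min_loras, max_loras]` clamp come from the message.

```python
from abc import ABC, abstractmethod
from typing import Optional


class LoRAResolverSketch(ABC):
    """Stand-in for a resolver ABC; not the real vLLM class or signature."""

    @abstractmethod
    def resolve_lora(self, base_model_name: str, lora_name: str):
        ...

    def get_desired_lora_slots(self) -> Optional[int]:
        # Default: no opinion, so existing subclasses keep working unchanged.
        return None


def clamp_slots(desired: Optional[int], current: int,
                min_loras: int, max_loras: int) -> int:
    """Engine-side step (hypothetical helper): keep the current slot count
    when the resolver returns None, otherwise clamp into the allowed range."""
    if desired is None:
        return current
    return max(min_loras, min(max_loras, desired))


class LegacyResolver(LoRAResolverSketch):
    def resolve_lora(self, base_model_name, lora_name):
        return None


class GreedyResolver(LoRAResolverSketch):
    def resolve_lora(self, base_model_name, lora_name):
        return None

    def get_desired_lora_slots(self) -> Optional[int]:
        return 64  # asks for more slots than the engine allows below


print(clamp_slots(LegacyResolver().get_desired_lora_slots(), 4, 1, 8))
print(clamp_slots(GreedyResolver().get_desired_lora_slots(), 4, 1, 8))
```

Making the hook a concrete method with a `None` default, rather than a new abstract method, is what lets subclasses written before the change instantiate without modification.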

…rk-slidesparse Update framework_slidesparse.md: restructure into a seven-stage engineering workflow and refine implementation details

…d check

Replace all "diminishing returns" / discretionary language with mechanical f-threshold stop condition across SKILL.md, orchestration docs, hooks, and conformance tests.

Key changes:
- Stage 7 marked AUTONOMOUS with decision tree (no user interaction)
- Non-Negotiable vllm-project#8 + Campaign Stop Condition already in place; align all downstream references (Task Graph, Example 1, Resume Protocol)
- Escalation Protocol: STOP → HALT (clarify ≠ campaign termination)
- Resume Protocol step 9: prohibit autonomous pause (user-request only)
- Stop hook: add paused-status exit + replace stale nudge language
- Gate hook + test: update "diminishing returns" labels
- README: fix stale 3% default → 1.0%, add Non-Negotiable vllm-project#8
- integration-logic.md: fix 5 discretionary-language spots
- test-orchestrator.md: update all § references and expected behaviors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…impl

CutePagedAttentionImpl becomes a pipeline state object:
- bind_fusion_weights() stores static weights + allocates persistent I/O buffers with fixed addresses (graph-safe)
- forward() reads from self instead of per-forward side-channels
- gate_buf added for output gate fusion (Qwen3NextAttention)

Blockers #6, #7, #8 from the CUDA graphs checklist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MTP large logprob fixes

…nccl`

First step of TP support per project_tp_design_notes. Adds the universal `Instruction::AllReduce(u32)` variant + eval arm + the `ForwardCtx::tp_group: Option<&Arc<NcclGroup>>` field, all gated behind a new `nccl` cargo feature on `ferrite-forward`. At tp=1 the upcoming lowering pass emits zero AllReduce rows, so this is a strict superset of the current `cuda` build.

Variant placement mirrors `Add` / `FusedAddRmsNorm` — one-tile in-place same-shape, so shape-aware coloring will collapse it to the input slot with no `View` row (validated by task vllm-project#3's coloring test). Eval arm calls `NcclGroup::all_reduce_inplace` and `expect`s both the group reference and the call result; the `None` case is unreachable when canonical fanout (task vllm-project#7) only emits AllReduce rows for tp_world_size > 1 canonicals.

Plumbs the feature forward through `vllm-executor`'s `nccl` feature so the cuda_worker `ForwardCtx` construction sites compile under the full feature set; `tp_group: None` for now (task vllm-project#8 wires the real `Arc<NcclGroup>` through).

Also stubs the missing `Self::Ferrite(_) => {}` arm in `CudaModel::set_tp_group` — that match was non-exhaustive under `--features nccl` because the ferrite stack was previously TP-oblivious and nobody compiled the nccl path through it.

Verified: `cargo check -p ferrite-forward --features cuda` (variant absent) and `--features nccl` (variant present) both green; `cargo check -p vllm-executor --features cuda` and `--features nccl` both green; clippy -D warnings clean on both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tp_group

Task vllm-project#8 plumbing. The Ferrite forward path now actually receives the worker's NCCL communicator instead of swallowing it.

- `FerriteModel` gains a `tp_group: Option<Arc<NcclGroup>>` field (gated on `feature = "nccl"`), mirroring the same shape the hand-written CudaModel arms already carry.
- `CudaModel::set_tp_group` arm `Self::Ferrite(m) => m.tp_group = Some(group)` replaces the task-vllm-project#1 stub.
- Both `ForwardCtx` construction sites (forward + forward_backbone) pass `tp_group: m.tp_group.as_ref()` so the universal `Instruction::AllReduce` eval arm has the group reference it expects when the lowering pass starts emitting AllReduce rows (task vllm-project#7's canonical fanout will activate that).
- `FerriteModel` construction in cuda_worker initializes `tp_group: None`; the worker's later `set_tp_group` call wires it.

Also cleans up an `AllReduceImpl::interpreter_arm` method I had dropped into `impl Implementation for AllReduceImpl` — that method isn't on the `Implementation` trait (the universal-eval pivot in `de15e035a` left only `opcode_shape` + `fan_out` as the codegen override surface). Removed with a comment pointing at the production eval path.

Verified: `cargo check -p vllm-executor --features cuda` and `--features nccl` both green; `cargo clippy -D warnings` clean on ferrite-forward-macro and vllm-executor; macro tests 190/190 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…+ Embedding)

Tensor-parallel safetensors → GPU loaders matching Python vLLM's ColumnParallelLinear / RowParallelLinear / VocabParallelEmbedding weight_loader semantics. Used by codegen at tp>1 (task vllm-project#5 wires the dispatch from the macro side; this commit lands only the runtime helpers).

Added on the kernel-side `LinearLayer` / `Linear` / `Embedding`:

- `Linear::load_sharded(weights, prefix, dim, rank, world)`. The load-bearing bias rules:
  - `dim = 0` (column-parallel: q/k/v/gate/up/lm_head/embed): bias shards along dim 0 too — each rank holds its own slice. Mirrors Python `ColumnParallelLinear.weight_loader` → `loaded_weight.narrow(output_dim=0, …)`.
  - `dim = 1` (row-parallel: o_proj, down_proj): bias is **replicated full-size on rank 0 only**, `None` on other ranks. The forward path adds bias before the cross-rank AllReduce-sum; only rank 0's contribution survives the sum, giving exactly one bias add to the residual stream. Mirrors Python `RowParallelLinear.forward` line 1543 `bias_ = None if (self.tp_rank > 0 …) else self.bias`.
- `LinearLayer::load_dense_sharded(weights, prefix, dim, rank, world)`. Thin wrapper over `Linear::load_sharded`. The codegen entry point for non-fused (single-prefix) sharded loads.
- `LinearLayer::load_dense_concat_sharded(weights, prefixes, stream, rank, world)`. Sharded variant of `load_dense_concat` for the fused QKV / gate_up paths. Always column-parallel (no row-parallel concat exists in any current arch). Each source weight slices along dim 0 to `[out_i / world, hidden]` then packs into one contiguous `[(sum out_i) / world, hidden]` GPU buffer via per-source `take_shard_into`. Biases follow the column-parallel rule (sliced along dim 0) — matches Python `MergedColumnParallelLinear` / `QKVParallelLinear`. Per-rank divisibility is guaranteed by the macro's outer-loop fanout `skip` of indivisible (variant, tp) tuples (commit `889c44b2f`).
- `Embedding::load_sharded(weights, prefix, rank, world)`. Vocab-parallel: slices the embedding table along dim 0 (`[vocab_size, hidden]` → `[vocab_size / world, hidden]`). Mirrors Python `VocabParallelEmbedding`. Same dim-0 cut as `Linear::load_sharded(dim=0)` — that's what makes `tie_weights(lm_head.weight = embed_tokens.weight)` self-consistent at tp>1.

Defers FP8 / Marlin / BNB sharded variants — the verify model (commandr) is dense bf16. World == 1 short-circuits to byte-equivalent behavior with the existing unsharded paths in every helper, plus shard-kind-aware bias rules.

No tests added at this layer (CUDA stream + safetensors fixtures aren't worth the infra spend; the integration test is task vllm-project#8). `take_shard` / `take_shard_into` on `GpuWeights` already exist (used by the prior hand-written TP path); these wrappers are pure call-site plumbing on top.

Build clean: ferrite-kernels checks + clippy at default features. The macro-side consumer that chooses sharded vs unsharded based on shard_kind comes in tasks vllm-project#4 + vllm-project#5; until then these helpers have no runtime caller (intentionally — wholesale codegen migration per the no_piecemeal_codegen_migration rule).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
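The bias rules for column-parallel and row-parallel shards can be checked with a toy example in pure Python. This is a sketch on nested lists, not the Rust helpers from the commit: the function names are made up for illustration, and the cross-rank all-gather and AllReduce-sum are simulated with a list concatenation and an element-wise sum.

```python
def shard_rows(weight, rank, world):
    """Column-parallel (dim = 0): each rank keeps a block of output rows."""
    rows = len(weight) // world
    return weight[rank * rows:(rank + 1) * rows]


def shard_cols(weight, rank, world):
    """Row-parallel (dim = 1): each rank keeps a block of input columns."""
    cols = len(weight[0]) // world
    return [row[rank * cols:(rank + 1) * cols] for row in weight]


def matvec(weight, x, bias=None):
    out = [sum(w * v for w, v in zip(row, x)) for row in weight]
    return out if bias is None else [o + b for o, b in zip(out, bias)]


def column_parallel(weight, bias, x, world):
    out = []
    rows = len(weight) // world
    for rank in range(world):
        # The bias is sliced along dim 0, like the weight.
        local_bias = bias[rank * rows:(rank + 1) * rows]
        out.extend(matvec(shard_rows(weight, rank, world), x, local_bias))
    return out  # concatenation stands in for an all-gather


def row_parallel(weight, bias, x, world):
    cols = len(x) // world
    partials = []
    for rank in range(world):
        # Full bias on rank 0 only, None elsewhere.
        local_bias = bias if rank == 0 else None
        x_shard = x[rank * cols:(rank + 1) * cols]
        partials.append(matvec(shard_cols(weight, rank, world), x_shard, local_bias))
    return [sum(vals) for vals in zip(*partials)]  # stands in for AllReduce-sum


weight = [[1, 2, 3, 4], [5, 6, 7, 8]]
bias = [10, 20]
x = [1, 1, 1, 1]
reference = matvec(weight, x, bias)
print(reference, column_parallel(weight, bias, x, 2), row_parallel(weight, bias, x, 2))
```

All three results agree. If every rank added the full bias in the row-parallel case, the summed output would include the bias `world` times, which is why only rank 0 carries it.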

Phase 2 of task vllm-project#6 — lm_head side. Closes the all-gather hole the Instruction::AllGather + AllGatherImpl + OpKind::AllGather foundation in `9d70ff563` left for the lowering pass to fill.

Lowering (tp_lowering.rs):
- New `insert_lm_head_allgather(fuf, program, tp_world_size)`. At tp>1 walks the FUF for `OpKind::Gemm` nodes whose weight path's last segment is `"lm_head"`, and appends an `OpKind::AllGather` reading the gemm's output. Rewires every consumer of the lm_head Gemm to read the AllGather instead. At tp=1 it's a strict no-op.
- Refactor: extract `rewire_consumers(fuf, old, new)` so the AllReduce and AllGather inserters share the consumer-rewiring walk (was duplicated inline in the AllReduce loop). Behavior unchanged.
- Wired into `compile()` at the activation site right after `insert_all_reduces` — both passes are gated on tp_world_size > 1 internally, no extra outer-loop branch.

backbone_output_for (codegen.rs):
- Updated to walk past the AllGather node when present. lm_head's hidden-state input was `last_node.inputs.first()`; with the AllGather inserted, `last_node` is now the AllGather, and its first input is the lm_head Gemm. Skip one hop back to recover the lm_head Gemm, then read its first input as before. At tp=1 the unchanged path is taken (no AllGather node exists). Without this, `forward_backbone` (used by pipeline-parallel intermediate ranks) would mistakenly return the lm_head gemm output instead of the hidden state.

FUF output shape on the AllGather node is left equal to the lm_head Gemm's output. The FUF carries pre-shard SYMBOLIC dims (e.g. `vocab_size` Bound, not `vocab_size / tp`); the runtime allocation comes from the kernel's `alloc_tensor` call, which reads `weight.dim(0)` (sharded) for the gemm and the gather's own world-size multiplier internally. The fresh slot for AllGather output is enforced by `AllGatherImpl::output_alias` returning `None` (already pinned by the `all_gather_impl_claims_single_tile_input_with_fresh_output_slot` test).

Macro tests: 203/203 (was 200/200) at both default `--features cuda` and `--features nccl`. Three new tests:
- `lowering_inserts_allgather_after_lm_head_gemm_at_tp_gt_1`
- `lowering_no_allgather_at_tp_eq_1`
- `lowering_skips_non_lm_head_gemms_for_allgather` — defends the name gate so a future change can't silently start emitting AllGathers on q_proj / o_proj / etc. (only one match per FUF in every current arch — multiple lm_head gemms would still be handled, but no arch produces them).

Full ferrite-models umbrella build clean at `--features cuda,nccl` across all 11 arches × {1,2,4,8} tp variants in 7m19s. cuda_worker's `!use_tp` ferrite gate stays for one more commit — lifting it is the last step before task vllm-project#8 verify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
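The lowering step described above (append an AllGather after the lm_head Gemm, then rewire its consumers) can be sketched on a toy graph. The `Node` type, the index-based input encoding, and the dotted weight-path format are assumptions made for this illustration; the real pass operates on the project's FUF structure in Rust.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str
    inputs: List[int] = field(default_factory=list)  # indices of producer nodes
    weight_path: str = ""


def rewire_consumers(nodes, old, new):
    """Point every consumer of node `old` at node `new` (except `new` itself)."""
    for idx, node in enumerate(nodes):
        if idx != new:
            node.inputs = [new if i == old else i for i in node.inputs]


def insert_lm_head_allgather(nodes, tp_world_size):
    if tp_world_size <= 1:
        return  # strict no-op without tensor parallelism
    # range() is evaluated once, so nodes appended below are not revisited.
    for idx in range(len(nodes)):
        node = nodes[idx]
        if node.kind == "Gemm" and node.weight_path.split(".")[-1] == "lm_head":
            nodes.append(Node("AllGather", [idx]))
            rewire_consumers(nodes, idx, len(nodes) - 1)


def toy_graph():
    return [
        Node("Embed"),
        Node("Gemm", [0], "model.layers.0.q_proj"),
        Node("Gemm", [1], "lm_head"),
        Node("Sample", [2]),
    ]


graph = toy_graph()
insert_lm_head_allgather(graph, tp_world_size=2)
print([(n.kind, n.inputs) for n in graph])
```

After the pass, the `Sample` node reads from the new `AllGather`, which in turn reads the lm_head Gemm. The `q_proj` Gemm is left untouched because of the name gate on the last path segment.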

End-to-end TP is now wired: codegen routes load calls through `_sharded` helpers (52615e881), lowering injects AllReduce after vocab-parallel Embed (aefa1de37), AllGather after lm_head Gemm (8a3ea25bc), and tp_rank threads through the full try_load → Weights::load → load_with chain (60d9b9d4e). At tp=1 every code path is byte-equivalent to the pre-TP build via the sharded-helpers' `world == 1` short-circuits.

cuda_worker's `!use_tp` ferrite eligibility gate served as belt-and-suspenders during the multi-commit landing. With the chain complete, drop the gate so `vllm chat ... --tensor-parallel-size 2` reaches `try_load` with the matching `(arch, tp_world_size, tp_rank)` triple and gets the per-(model, tp) sharded registration.

Build clean: vllm-executor + vllm-cuda check at `--features cuda` (1m44s) and `--features cuda,nccl` (7m35s, full ferrite-models umbrella for the latter).

Next: task vllm-project#8 — verify on commandr at tp=2 with `vllm chat CohereForAI/c4ai-command-r-v01 --tensor-parallel-size 2`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

WoosukKwon added 6 commits March 13, 2023 18:44

Handle empty inputs

e58e731

Minor

63ba824

Add namespace

d7eb2e0

Default batch size 2048 -> 2560

d02d394

memory utilization -> swap space

532365e

Fetch requests every step

d87b2b0

WoosukKwon merged commit cfae35b into main Mar 13, 2023

WoosukKwon deleted the minor branch March 13, 2023 20:48

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request Sep 12, 2023

Merge pull request vllm-project#8 from ri938/organise

2617c55

Organise

xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023

Comments & minor changes (vllm-project#8)

a8561b8

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Add miscellaneous updates (vllm-project#8)

cd9f1ac

sfc-gh-hazhang pushed a commit to sfc-gh-hazhang/vllm that referenced this pull request May 7, 2024

Merge pull request vllm-project#8 from Snowflake-Labs/remove-dummy

15de0c2

remove dummy path in arctic

yuhuixu1993 mentioned this pull request Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

ykim362 pushed a commit to ykim362/vllm that referenced this pull request Jun 17, 2024

Merge pull request vllm-project#8 from Starmys/dev/chengzhang/phi3moe…

dfaba7c

…128k Support Phi3SuScaledRotaryEmbedding for 128k model

This was referenced Jul 5, 2024

Support W4A8 quantization for vllm #5218

Merged

[Bug]: call for stack trace for "Watchdog caught collective operation timeout" #6042

Closed

xinzaifeixiang1992 mentioned this pull request Jul 24, 2024

[Bug]: Deploying the Qwen2-72b-instruct-awq model with vllm-0.5.3.post1: the service works normally at first, but errors occur under high concurrency #6734

Closed

alixiaodi mentioned this pull request Aug 2, 2024

[Bug]: #7072

Closed

Minami-su mentioned this pull request Aug 11, 2024

[Bug]: vllm is crashed on v0.5.3.post1 #7161

Closed

zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024

Merge pull request vllm-project#8 from KuntaiDu/jiayi-dev-v2

0dd3571

update overhead benchmark

liulisi16323 mentioned this pull request Sep 24, 2024

[Bug]: v0.5.5 crash: "AssertionError: expected running sequences" #8016

Closed

1 task

SpaceHunterInf mentioned this pull request Sep 30, 2024

[Bug]: Bus error (core dumped) #8974

Closed

1 task

This was referenced Jan 27, 2026

[Feature] Emit journey events to core spans (PR #4/9) #33136

Closed

[Feature] Add API parent span lifecycle management (PR #6/9) #33182

Closed

[Feature] Add API↔Engine context propagation for journey tracing (PR #7/9) #33190

Closed

tjtanaa pushed a commit to tjtanaa/vllm that referenced this pull request Jan 29, 2026

Merge pull request vllm-project#8 from hsliuustc0106/hsliu-dev-C

c150346

Add PR and issue templates from vLLM project

Lrcx mentioned this pull request Jan 29, 2026

[Bug]: Crash when using presence_penalty with Qwen3-VL in v0.11.0 #33338

Open

1 task

HervorTao mentioned this pull request Feb 3, 2026

[Bug]: [CPU Backend] AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' #33675

Closed

1 task

JGSweets mentioned this pull request Mar 9, 2026

[Bug]: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. #28028

Open

1 task

LironKesem mentioned this pull request Mar 12, 2026

[Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries #36835

Closed

1 task

lavanyabollepalli mentioned this pull request Mar 12, 2026

[Bug]: GPU failure during repeated model loading when using --enable-prefix-caching with KV transfer (LMCacheConnectorV1) #36852

Open

1 task

mahaocong90 mentioned this pull request Mar 17, 2026

[Bug]: QWEN 3.5-397B-A17B report "RPC call to sample_tokens timed out" #37250

Closed

1 task

watch-Ultra mentioned this pull request Mar 18, 2026

[Bug]: Error during inference and the model shut down. Deployed model: Qwen3.5-122B-A10B-FP8 #37392

Open

1 task

This was referenced Mar 20, 2026

Fix XPU segfault when tensor_parallel_size exceeds available devices hongbolv/vllm#5

Closed

Fix XPU Level Zero crash by setting per-worker ZE_AFFINITY_MASK hongbolv/vllm#6

Closed

RocketRider mentioned this pull request Mar 21, 2026

Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 #37431

Open

Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026

Merge pull request vllm-project#8 from bcacdwk/copilot/update-framewo…

4447366

…rk-slidesparse Update framework_slidesparse.md: restructure into a seven-stage engineering workflow and refine implementation details

djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Apr 17, 2026

Merge pull request vllm-project#8 from de-inf/nemo-mtp-logprob-fixed

2549422

MTP large logprob fixes

SongXiaoMao mentioned this pull request May 13, 2026

[Bug]: MTP speculative decoding crash with illegal memory access on long sequences (Qwen3.6-27B-FP8, v0.19.1) #40756

Open

1 task

maeehart mentioned this pull request May 17, 2026

[ROCm][DSv4][WIP] Sparse-MLA bring-up on MI300X (FP8 encoder/decoder symmetry + cudagraph fixes) maeehart/vllm#1

Open

5 tasks

WoosukKwon merged 6 commits into main from minor
