Fix a bug in tying OPT embeddings by WoosukKwon · Pull Request #1 · vllm-project/vllm

WoosukKwon · 2023-02-25T00:27:11Z

This PR fixes a bug in supporting OPT-350m/OPT-6.7b/OPT-13b and OPT-IML models.

The bug happened because our model code didn't include some methods that were required to tie the input and output embeddings.
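As a minimal sketch of what "tying" means here: the output projection reuses the same weight object as the input embedding table instead of holding its own copy. The description above does not name the missing methods, so the accessor names below (`get_input_embeddings`, `get_output_embeddings`, `set_output_embeddings`, `tie_weights`) are assumptions modelled on the Hugging Face Transformers convention, and `TinyLM` is a hypothetical dependency-free stand-in, not the vLLM OPT code.

```python
class TinyLM:
    """Hypothetical toy model used only to illustrate weight tying."""

    def __init__(self, vocab_size, hidden_size):
        # Input embedding table: one row of hidden_size floats per token id.
        self.embed_tokens = [
            [0.01 * (i + j) for j in range(hidden_size)]
            for i in range(vocab_size)
        ]
        # Output projection with the same shape, initialised separately.
        # Until the weights are tied, it ignores the embedding values.
        self.lm_head = [[0.0] * hidden_size for _ in range(vocab_size)]

    # Accessors that a generic tying routine can rely on (assumed names).
    def get_input_embeddings(self):
        return self.embed_tokens

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, weight):
        self.lm_head = weight

    def tie_weights(self):
        # Share one object rather than copying, so both ends stay in sync.
        self.set_output_embeddings(self.get_input_embeddings())

    def logits(self, hidden):
        # Dot product of the hidden state with every output row.
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.lm_head]


model = TinyLM(vocab_size=4, hidden_size=3)
model.tie_weights()
# After tying, the output projection is the embedding table itself.
print(model.get_output_embeddings() is model.get_input_embeddings())  # True
```

If the accessors are missing, a routine like `tie_weights` has nothing to call, and the output projection keeps its separately initialised weights. That would be consistent with the failure described above for checkpoints that expect tied embeddings.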

BA-78554: Jurassic 2.5

* worked on jurassic2.5 configuration file, updated jurassic2_5 modeling file to support alternating experts/attn layers
* finished working the forward pass of jurassic3.py
* jurassic_3 modeling file works, uses dummy weights initialized by "dummy" flag. Tokenizer raises issues, for now copying the mixtral tokenizer
* changed default tokenizer vocab values, loading of custom .pt weight files works.
* removed notebook
* merging master to jurassic-2.5 to reset head
* Merge branch 'master' into jurassic-2.5
* align to master

Approved-by: Tomer Asida
Approved-by: Mor Zusman

Rebase of PR vllm-project#33315 onto current main.

Adds max_tokens_per_doc parameter to rerank requests, matching Cohere and Jina rerank APIs. Documents longer than this limit are truncated before scoring.

Handles all three cross-encoder code paths:
- Cross-encoder with sep token (tokenizer built-in truncation)
- Chat template / Jinja path (text truncation before template)
- Score template path (text truncation before template)

Also supports offline usage via PoolingParams(extra_kwargs={"max_tokens_per_doc": N}).

Addresses reviewer feedback from original PR:
- Offline support via PoolingParams (noooop vllm-project#1)
- Score template compatibility tests (noooop vllm-project#2)
- Tests across BAAI/bge-reranker-base, BAAI/bge-reranker-v2-gemma, and Qwen/Qwen3-Reranker-0.6B

Original PR: vllm-project#33315
Original author: hustxiayang
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jesus Federico <jefp@amazon.com>

Fix OPT errors

44735b4

WoosukKwon merged commit cbf8779 into main Feb 25, 2023

WoosukKwon deleted the fix-opt branch February 25, 2023 00:29

murongweibo mentioned this pull request Jul 11, 2023

NCCL Error 5: invalid usage #427

Closed

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

CZT0 referenced this pull request in semedia-tech/vllm Sep 11, 2023

#1 Test deploying vllm

cc4f1ce

orangetin referenced this pull request in togethercomputer/vllm-ttgi Sep 14, 2023

Merge pull request #1 from winglian/longchat-args

b9012fb

add rope scaling as a cli arg so openai server can load rope scaled models

xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 18, 2023

Add function invoke call for underlying models (vllm-project#1)

9895bbd

bigPYJ1151 referenced this pull request in bigPYJ1151/vllm Oct 30, 2023

Merge pull request #1 from bigPYJ1151/fix_ans

b5e7066

Fix key cache block shape.

l1cacheDell pushed a commit to CaspianFang/vllm that referenced this pull request Nov 15, 2023

blora LlaMa support vllm-project#1

424df61

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang referenced this pull request in hongxiayang/vllm Feb 13, 2024

Fix a bug in tying OPT embeddings (#1)

2cb721d

kvikk mentioned this pull request Feb 15, 2024

ERROR: Could not build wheels for vllm, which is required to install pyproject.toml-based projects #2735

Closed

ilya-lavrenov referenced this pull request in ilya-lavrenov/vllm Feb 19, 2024

Merge pull request #1 from ilya-lavrenov/cpu-works

e3d65e0

Deterministic OpenVINO inference

daniel-geon-park added a commit to gmlwns2000/vllm-timber that referenced this pull request Apr 15, 2024

Merge pull request vllm-project#1 from DeepAuto-AI/geon-dev

d9d746e

merge code

afeldman-nm mentioned this pull request Apr 30, 2024

Adding support for encoder-decoder models, like T5 or BART #187

Closed

dlopes78 mentioned this pull request May 8, 2024

[Bug]: VLLM + tritonserver #4695

Closed

fmmoret mentioned this pull request May 8, 2024

[Bug]: Chunked prefill returning gibberish in some cases. #4697

Closed

Bellk17 added a commit to Bellk17/vllm that referenced this pull request May 10, 2024

Merge pull request vllm-project#1 from Bellk17/main

b36d574

Triton compilation fix

yuhuixu1993 mentioned this pull request Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

afeldman-nm mentioned this pull request Jun 3, 2024

[Bug]: VLLM_ATTENTION_BACKEND set to ROCM_FLASH only in GHA environment, overriding automatic backend selection; this breaks other kernel unit tests. #5208

Closed

ykim362 referenced this pull request in ykim362/vllm Jun 17, 2024

Wenxh/fp8 on a100 v5 (#1)

aca4a33

Group Gemm Version

xiejibing mentioned this pull request Jun 24, 2024

[Bug]: vLLM 0.4.2 8xH100 init failed #5785

Closed

llmpros mentioned this pull request Jun 27, 2024

[Frontend]: Support base64 embedding #5935

Merged

Juelianqvq mentioned this pull request Jul 3, 2024

[Bug]: Flashinfer stuck with CUDA Graph #6086

Closed

oliver-li mentioned this pull request Jul 5, 2024

[Bug]: NCCL hangs and causes timeout #5484

Closed

This was referenced Jul 5, 2024

Support W4A8 quantization for vllm #5218

Merged

[Bug]: call for stack trace for "Watchdog caught collective operation timeout" #6042

Closed

This was referenced Apr 2, 2026

feat: add max_tokens_per_doc in rerank request. #38827

Merged

feat: add max tokens per doc in rerank request #33315

Closed

varjoranta mentioned this pull request Apr 4, 2026

[Attention Backend] TurboQuant: 2-bit KV cache compression with 4x capacity #38479

Merged

7 tasks

AlexanderValentini mentioned this pull request Apr 5, 2026

Qwen-3.5 9B often producing repetitive/garbled output with Intel Backend #38994

Open

1 task

1220856302 mentioned this pull request Apr 7, 2026

[Bug]: Segfault in Triton LLVM (MachineCSE / translateLLVMIRToASM) when serving Qwen3.5-4B on RTX 4090 (WSL2) with vLLM 0.19.0 #39149

Open

1 task

ianliuy mentioned this pull request Apr 9, 2026

[Bug]: /v1/responses: Protocol drift and malformed tool aggregation breaking official OpenAI SDK compatibility #39426

Open

1 task

alexm-redhat mentioned this pull request Apr 15, 2026

[Attention][SM90] Add CUTLASS FA3 sparse MLA attention backend for Hopper GPUs #39941

Open

6 tasks

TheDuyIT mentioned this pull request Apr 19, 2026

[Docker] Non-root support for vllm-openai; add opt-in vllm-openai-nonroot target #40275

Open

4 tasks

JartX mentioned this pull request Apr 21, 2026

[Feature] KV cache per-token-head Int2/Int4 Quantization + Triton_Quant_KV Interface #39074

Open

inheaden-admin mentioned this pull request Apr 22, 2026

[Bug]: Kimi 2.6 on 8x A100 SMX4 leads to NVLink Crash Coredump #40652

Closed

1 task

cferra mentioned this pull request Apr 30, 2026

[Tracking] TurboQuant + Gemma 4 multimodal: 5-gate blocker stack #41403

Open

vaderyang mentioned this pull request May 1, 2026

[LoRA][MoE] Fix PEFT 0.18+ target_parameters LoRA loading for 3D MoE experts (Qwen3.5) #41384

Open

MidasMining mentioned this pull request May 3, 2026

[Bug]: TurboQuant _continuation_prefill workspace allocation fails at long context — v0.20.0 regression #41565

Open

aabbccddwasd mentioned this pull request May 5, 2026

[DSv4][Nvidia] SM12x DeepSeek V4 support #40991

Closed

JartX mentioned this pull request May 5, 2026

[Kernel][ROCm] Native W4A16 kernel for AMD RDNA3 (gfx1100) — fp16 + bf16 #41394

Open

14 tasks

juhi10071998 mentioned this pull request May 6, 2026

[Quantization] Add ModelOpt NVFP4 W4A16 (4-bit weights, fp16/bf16 activations) support #41769

Merged

7 tasks

dmvevents mentioned this pull request May 8, 2026

NixlConnector hardcodes backends=["UCX"] default; no env-var override path; LIBFABRIC/EFA operators must discover kv_connector_extra_config.backends from source #41814

Open

shanyulu mentioned this pull request May 11, 2026

[Attention][MLA] Add Triton-fused TurboQuant decode backend #41803

Open

4 tasks

maeehart mentioned this pull request May 11, 2026

[ROCm][MLA] FP8 ASM prefill on gfx950 for AITER MLA backend #42294

Closed

6 tasks

SongXiaoMao mentioned this pull request May 13, 2026

[Bug]: MTP speculative decoding crash with illegal memory access on long sequences (Qwen3.6-27B-FP8, v0.19.1) #40756

Open

1 task

DoradusResearch mentioned this pull request May 13, 2026

cumem allocator: double-free and stale error codes during sleep/wake cycles #36651

Open

1 task

alexbi29 mentioned this pull request May 16, 2026

[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes #41834

Open

kliukovkin mentioned this pull request May 16, 2026

[RFC]: Cache-affinity-aware request ordering for the V1 scheduler #42185

Open

panpan0000 mentioned this pull request May 18, 2026

[kv_offload] Fix OffloadingSpec crash on DeepSeek-V4-Flash and hybrid KV cache models #42992

Open

3 tasks

This was referenced May 19, 2026

[Core] Add shmem-aware autotune pruner for non-H100 Triton kernels #43044

Closed

[Core] Add shmem-aware autotune pruner for non-H100 Triton kernels #43047

Open
