[Bug] vLLM+ATOM_OOT (gpt-oss-120b) server crashes for particular sequence lengths #623

@divakar-amd

Description

I'm noticing that the server runs fine with `--random-input-len 1024 --random-output-len 1024 --max-concurrency 8` but crashes with `--random-input-len` **4096** `--random-output-len 1024 --max-concurrency 8`.

Error:

(EngineCore pid=34835) File "/app/ATOM/atom/plugin/attention.py", line 349, in build
(EngineCore pid=34835) query_lens_cpu[num_decodes + num_extends :].max().item()
(EngineCore pid=34835) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=34835) RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
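For context, the crash happens because the slice `query_lens_cpu[num_decodes + num_extends :]` can be empty for some batch compositions, and calling `.max()` on an empty tensor raises. A pure-Python analogue of the failing line (function name and the `default=0` guard are hypothetical, not ATOM's actual fix):

```python
def max_prefill_query_len(query_lens, num_decodes, num_extends):
    # Mirrors the slice in atom/plugin/attention.py line 349: take the
    # query lengths past the decode and extend requests.
    tail = query_lens[num_decodes + num_extends:]
    # When every request in the batch is a decode/extend, `tail` is empty
    # and a bare max(tail) would raise, just like tensor.max() does in the
    # traceback above. `default=0` sidesteps that in this sketch.
    return max(tail, default=0)

print(max_prefill_query_len([8, 8, 8], 2, 1))          # 0 (empty tail, no crash)
print(max_prefill_query_len([8, 8, 512, 4096], 1, 1))  # 4096
```

In the real torch code, the equivalent guard would be checking `numel() > 0` on the slice before reducing; with 4096-token inputs and concurrency 8 the scheduler apparently produces batches where that slice is empty.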

Docker used:
rocm/atom-dev:vllm-latest
ATOM commit: 58af3e4
vLLM commit: 0.19.1.dev0+g2a69949bd.d20260420.rocm722

Machine used: mi355

logs_client.txt
logs_server.txt

Server Launch cmd:

export ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1
export VLLM_ROCM_USE_AITER=1

vllm serve /data/models/gpt-oss-120b/ -tp 1 --disable-uvicorn-access-log --no-enable-prefix-caching --port 8004 --kv-cache-dtype=fp8

Client Launch cmd:

vllm bench serve --model /data/models/gpt-oss-120b/  --dataset-name random --random-input-len 4096 --random-output-len 1024 --max-concurrency 8 --num-prompts 80 --percentile-metrics ttft,tpot,itl,e2el --metric-percentiles 99 --ignore-eos --temperature 0 --seed 0 --trust-remote-code

Metadata

Labels

ATOM (added to frameworks-internal ATOM GitHub board)
