**Describe the bug**

When scaling the vLLM production stack to more than 1,000 concurrent users, we observed the router crashing and restarting.

The final logs are:
```
INFO: Shutting down
INFO: 127.0.0.1:42496 - "POST /v1/chat/completions HTTP/1.0" 200 OK
INFO: Waiting for connections to close. (CTRL+C to force quit)
INFO: 127.0.0.1:42526 - "POST /v1/chat/completions HTTP/1.0" 200 OK
INFO: 127.0.0.1:42524 - "POST /v1/chat/completions HTTP/1.0" 200 OK
INFO: 127.0.0.1:42546 - "POST /v1/chat/completions HTTP/1.0" 200 OK
INFO: 127.0.0.1:42562 - "POST /v1/chat/completions HTTP/1.0" 200 OK
INFO: 127.0.0.1:42536 - "POST /v1/chat/completions HTTP/1.0" 200 OK
INFO: Waiting for application shutdown.
[2025-05-21 11:44:46,898] INFO: httpx async_client.is_closed(): False - Now close it. Id (will be unchanged): 139913824959280 (httpx_client.py:35:vllm_router.httpx_client)
[2025-05-21 11:44:46,899] INFO: httpx async_client.is_closed(): True. Id (will be unchanged): 139913824959280 (httpx_client.py:39:vllm_router.httpx_client)
[2025-05-21 11:44:46,899] INFO: httpx AsyncClient closed (httpx_client.py:43:vllm_router.httpx_client)
INFO: Closing engine stats scraper
INFO: Closing service discovery module
INFO: Application shutdown complete.
INFO: Finished server process [1]
```
**To Reproduce**

Setup: `values.yaml`
```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
    - name: "llama3"
      repository: "lmcache/vllm-openai"
      tag: "latest"
      # tag: "2025-03-10"
      modelURL: "meta-llama/Llama-3.1-8B-Instruct"
      replicaCount: 2
      requestCPU: 4
      requestMemory: "8Gi"
      requestGPU: 1
      # pvcStorage: "50Gi"
      # pvcAccessMode:
      #   - ReadWriteOnce
      vllmConfig:
        maxModelLen: 30000
        v1: 1
        extraArgs: ["--no-enable-prefix-caching"]
      lmcacheConfig:
        enabled: true
        cpuOffloadingBufferSize: "6"
      env:
        - name: LMCACHE_LOG_LEVEL
          value: "DEBUG"
      hf_token: "my_token"

cacheserverSpec:
  # -- Number of replicas
  replicaCount: 1
  # -- Container port
  containerPort: 8080
  # -- Service port
  servicePort: 81
  # -- Serializer/Deserializer type
  serde: "naive"
  # -- Cache server image (reusing the vllm image)
  repository: "lmcache/vllm-openai"
  tag: "latest"
  # tag: "2025-03-10"
  # TODO (Jiayi): please adjust this once we have evictor
  # -- router resource requests and limits
  resources:
    requests:
      cpu: "4"
      memory: "4G"
    limits:
      cpu: "4"
      memory: "8G"
  # -- Customized labels for the cache server deployment
  labels:
    environment: "cacheserver"
    release: "cacheserver"

routerSpec:
  resources:
    requests:
      cpu: "8"
      memory: "8G"
    limits:
      cpu: "8"
      memory: "8G"
  routingLogic: "session"
  sessionKey: "x-user-id"
```

Deploy with:

```
helm install vllm vllm/vllm-stack -f values.yaml
```

Port-forward the router service:

```
kubectl port-forward svc/vllm-router-service 30080:80
```
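Before starting the load test, the port-forward can be sanity-checked with a single request. A minimal stdlib sketch (the endpoint, model, and payload mirror the setup above; the script itself is illustrative and not part of the stack):

```python
import json
import urllib.error
import urllib.request

# Same payload as llama_request.json used in the load test below.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
    "max_tokens": 100,
}

req = urllib.request.Request(
    "http://localhost:30080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:
    # Requires the `kubectl port-forward` from the previous step to be active.
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
        print(resp.status, body["choices"][0]["message"]["content"])
except urllib.error.URLError as exc:
    print("request failed (is the port-forward running?):", exc)
```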
Testing:

```
ab -n 10000 -c 2000 -p llama_request.json -T application/json http://localhost:30080/v1/chat/completions
```

with `llama_request.json`:

```json
{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Who won the world series in 2020?"
        }
    ],
    "max_tokens": 100
}
```

**Expected behavior**
The router should not shut down under this load.
**Additional context**

- Applying the same load directly to a serving engine (worker) pod works fine.
- We observed the following error being raised and not handled correctly. Fixing it did not resolve the crash described above, although it did allow for higher concurrency:
```
Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/usr/local/lib/python3.12/site-packages/httpx/_transports/default.py", line 394, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 256, in handle_async_request
    raise exc from None
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 236, in handle_async_request
    response = await connection.handle_async_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/connection.py", line 103, in handle_async_request
    return await self._connection.handle_async_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/http11.py", line 136, in handle_async_request
    raise exc
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/http11.py", line 106, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/http11.py", line 177, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/http11.py", line 217, in _receive_event
    data = await self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpcore/_backends/anyio.py", line 32, in read
    with map_exceptions(exc_map):
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/usr/local/lib/python3.12/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm_router/routers/main_router.py", line 67, in route_chat_completion
    return await route_general_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm_router/services/request_service/request.py", line 249, in route_general_request
    headers, status_code = await anext(stream_generator)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm_router/services/request_service/request.py", line 109, in process_request
    async with request.app.state.httpx_client_wrapper().stream(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 1583, in stream
    response = await self.send(
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 1629, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 1657, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 1694, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 1730, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_transports/default.py", line 393, in handle_async_request
    with map_httpcore_exceptions():
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/usr/local/lib/python3.12/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadError
```
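For reference, one pattern for keeping a dropped backend connection from propagating as an unhandled ASGI exception is to catch the transport error where the upstream request is opened and answer with a 502 instead. The sketch below is a stand-alone illustration of that pattern, not the router's actual code: in `vllm_router` the try/except would wrap the `httpx_client_wrapper().stream(...)` call in `request.py`, and the caught exception would be `httpx.ReadError` (here replaced by a stand-in class so the example is self-contained):

```python
import asyncio
import json

class UpstreamReadError(Exception):
    """Stand-in for httpx.ReadError: the backend dropped the connection before responding."""

async def route_request(open_upstream):
    """Forward a request to the backend; map transport failures to an HTTP 502 response."""
    try:
        # In the real router this is where the proxied httpx stream is opened.
        status, body = await open_upstream()
        return status, body
    except UpstreamReadError as exc:
        error_body = json.dumps(
            {"error": {"message": f"upstream read failed: {exc}", "type": "bad_gateway"}}
        ).encode()
        return 502, error_body

async def broken_upstream():
    # Simulates the backend resetting the connection under load.
    raise UpstreamReadError("connection reset by backend")

status, body = asyncio.run(route_request(broken_upstream))
print(status)  # 502
```

Catching the error before any response headers are sent is what makes a clean 502 possible; once streaming has started, the only option is to terminate the stream. As noted above, handling this exception alone did not stop the router from shutting down under load.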