Skip to content

bug: Router crashing when under load (1k+ concurrent requests) #448

@ericschreiber

Description

@ericschreiber

Describe the bug

When scaling vLLM production stack to more then 1k concurrent users, we observed the router crashing and restarting.

The final logs are

INFO:     Shutting down
INFO:     127.0.0.1:42496 - "POST /v1/chat/completions HTTP/1.0" 200 OK
INFO:     Waiting for connections to close. (CTRL+C to force quit)
INFO:     127.0.0.1:42526 - "POST /v1/chat/completions HTTP/1.0" 200 OK
INFO:     127.0.0.1:42524 - "POST /v1/chat/completions HTTP/1.0" 200 OK
INFO:     127.0.0.1:42546 - "POST /v1/chat/completions HTTP/1.0" 200 OK
INFO:     127.0.0.1:42562 - "POST /v1/chat/completions HTTP/1.0" 200 OK
INFO:     127.0.0.1:42536 - "POST /v1/chat/completions HTTP/1.0" 200 OK
INFO:     Waiting for application shutdown.
�[32;20m[2025-05-21 11:44:46,898] INFO:�[0m httpx async_client.is_closed(): False - Now close it. Id (will be unchanged): 139913824959280 �[3m(httpx_client.py:35:vllm_router.httpx_client)�[0m
�[32;20m[2025-05-21 11:44:46,899] INFO:�[0m httpx async_client.is_closed(): True. Id (will be unchanged): 139913824959280 �[3m(httpx_client.py:39:vllm_router.httpx_client)�[0m
�[32;20m[2025-05-21 11:44:46,899] INFO:�[0m httpx AsyncClient closed �[3m(httpx_client.py:43:vllm_router.httpx_client)�[0m
INFO:     Closing engine stats scraper
INFO:     Closing service discovery module
INFO:     Application shutdown complete.
INFO:     Finished server process [1]

To Reproduce

Setup

values.yaml

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama3"
    repository: "lmcache/vllm-openai"
    tag: "latest"
    # tag: "2025-03-10"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 2
    requestCPU: 4
    requestMemory: "8Gi"
    requestGPU: 1
    # pvcStorage: "50Gi"
    # pvcAccessMode:
    #   - ReadWriteOnce
    vllmConfig:
      maxModelLen: 30000
      v1: 1
      extraArgs: ["--no-enable-prefix-caching"]

    lmcacheConfig:
      enabled: true
      cpuOffloadingBufferSize: "6"
    env:
      - name: LMCACHE_LOG_LEVEL
        value: "DEBUG"
    hf_token: "my_token"

cacheserverSpec:
  # -- Number of replicas
  replicaCount: 1

  # -- Container port
  containerPort: 8080

  # -- Service port
  servicePort: 81

  # -- Serializer/Deserializer type
  serde: "naive"

  # -- Cache server image (reusing the vllm image)
  repository: "lmcache/vllm-openai"
  tag: "latest"
  # tag: "2025-03-10"

  # TODO (Jiayi): please adjust this once we have evictor
  # -- router resource requests and limits
  resources:
    requests:
      cpu: "4"
      memory: "4G"
    limits:
      cpu: "4"
      memory: "8G"

  # -- Customized labels for the cache server deployment
  labels:
    environment: "cacheserver"
    release: "cacheserver"

routerSpec:
  resources:
    requests:
      cpu: "8"
      memory: "8G"
    limits:
      cpu: "8"
      memory: "8G"
  routingLogic: "session"
  sessionKey: "x-user-id"

Deploy with helm install vllm vllm/vllm-stack -f values.yaml
Port forwarding kubectl port-forward svc/vllm-router-service 30080:80

Testing

ab -n 10000 -c 2000 -p llama_request.json -T application/json http://localhost:30080/v1/chat/completions

with llama_request.josn

{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Who won the world series in 2020?"
        }
    ],
    "max_tokens": 100
}

Expected behavior

Router does not shut down

Additional context

  1. If you apply this load directly to the worker it's fine.
  2. We observed the following error being raised and not handled correctly. However, fixing it did not solve the issue above although it allowed for more concurrency
Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/usr/local/lib/python3.12/site-packages/httpx/_transports/default.py", line 394, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 256, in handle_async_request
    raise exc from None
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 236, in handle_async_request
    response = await connection.handle_async_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/connection.py", line 103, in handle_async_request
    return await self._connection.handle_async_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/http11.py", line 136, in handle_async_request
    raise exc
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/http11.py", line 106, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/http11.py", line 177, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpcore/_async/http11.py", line 217, in _receive_event
    data = await self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpcore/_backends/anyio.py", line 32, in read
    with map_exceptions(exc_map):
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/usr/local/lib/python3.12/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm_router/routers/main_router.py", line 67, in route_chat_completion
    return await route_general_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm_router/services/request_service/request.py", line 249, in route_general_request
    headers, status_code = await anext(stream_generator)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm_router/services/request_service/request.py", line 109, in process_request
    async with request.app.state.httpx_client_wrapper().stream(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 1583, in stream
    response = await self.send(
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 1629, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 1657, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 1694, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 1730, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/httpx/_transports/default.py", line 393, in handle_async_request
    with map_httpcore_exceptions():
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/usr/local/lib/python3.12/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadError

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions