LMServe is a lightweight and fast LLM serving framework for academic purposes. It includes the following features:
- KV block management with PagedAttention for memory efficiency
- Prefix KV sharing to cut redundant compute and memory pressure across multiple requests
- Multi-level KV caching across GPU, host DRAM, and SSD to avoid recomputation for historical KVs
- Request reordering to mitigate head-of-line blocking and improve tail latency
- Chunked prefill to reduce decode delays from long prefills
- Disaggregated inference to isolate prefill and decode phases
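The block-based KV management in the first feature can be sketched as follows. This is a minimal, hypothetical illustration of the PagedAttention idea (fixed-size KV blocks with a per-sequence block table), not LMServe's actual implementation; all names (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are made up for this example:

```python
# Minimal sketch of PagedAttention-style KV block management.
# All names here are illustrative, not LMServe's real API.

BLOCK_SIZE = 16  # tokens stored per KV block


class BlockAllocator:
    """Hands out fixed-size KV blocks from a free pool."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("out of KV blocks")
        return self.free_blocks.pop()

    def free(self, block_id):
        self.free_blocks.append(block_id)


class Sequence:
    """Maps a request's logical token positions to physical KV blocks."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new block only when the current one is full, so
        # memory is reserved at BLOCK_SIZE-token granularity instead of
        # one contiguous region per request.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(20):  # 20 tokens -> ceil(20 / 16) = 2 blocks
    seq.append_token()
print(len(seq.block_table))  # 2
```

Because blocks are indirected through the block table, two requests sharing a prefix could point their leading table entries at the same physical blocks, which is the basis of the prefix KV sharing feature above.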
Note that LMServe includes key ideas proposed in our ASPLOS 2025 paper, Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management [Paper].
LMServe requires the following dependencies:
- CUDA
- OpenSSL
- Protobuf compiler
To install OpenSSL and the Protobuf compiler on Ubuntu, run:
$ apt install -y pkg-config libssl-dev protobuf-compiler
LMServe supports the following model families:
- Llama: 2, 3, 3.1, 3.2
- Qwen: 2, 2.5
You can easily build this project by running:
$ make
Before running the server, you must set the LMSERVE_HOME environment variable to the root directory of the project:
$ export LMSERVE_HOME=/path/to/LMServe
Additionally, LMServe includes a monitoring daemon built on a pub/sub architecture to track the status of each node (e.g., the number of running or pending requests).
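The node-status reporting behind the monitoring daemon could look roughly like the following sketch. The subject name, message fields, and the tiny in-process broker are all hypothetical stand-ins (a real deployment would publish to NATS, as configured below), not LMServe's actual schema:

```python
import json

# Hypothetical sketch of pub/sub status reporting: each serving node
# publishes its queue depths, and the monitoring daemon subscribes.
# Subject name and field names are illustrative only.


class Broker:
    """Tiny in-process stand-in for a NATS server."""

    def __init__(self):
        self.subscribers = {}  # subject -> list of callbacks

    def subscribe(self, subject, callback):
        self.subscribers.setdefault(subject, []).append(callback)

    def publish(self, subject, payload):
        for cb in self.subscribers.get(subject, []):
            cb(payload)


broker = Broker()
latest_status = {}  # node id -> most recent status message


def on_status(msg):
    # Monitoring daemon keeps only the latest report per node.
    status = json.loads(msg)
    latest_status[status["node"]] = status


broker.subscribe("lmserve.status", on_status)

# A serving node periodically publishes its running/pending counts.
broker.publish(
    "lmserve.status",
    json.dumps({"node": "node-0", "running": 12, "pending": 3}),
)
print(latest_status["node-0"]["pending"])  # 3
```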
Before launching LMServe, you must start the NATS server. You can run it with Docker:
$ docker network create nats
$ docker run -d --name nats --network nats --rm -p 4222:4222 -p 8222:8222 nats --http_port 8222
Then, launch the server with:
$ bin/launcher --config configs/default.yaml
Once the server is running, you can measure its performance using the following benchmark scripts:
Single-turn benchmark:
$ python3 benchmarks/benchmark_server.py --dataset sharegpt
Multi-turn benchmark:
$ python3 benchmarks/benchmark_server_chat.py --dataset sharegpt_chat --num-clients 50
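When interpreting benchmark runs, a common pattern is to reduce per-request measurements to throughput and tail latency (the metric the request-reordering feature targets). The sketch below is illustrative only; the input format is hypothetical and independent of what the benchmark scripts actually emit:

```python
# Illustrative summary of benchmark measurements: throughput plus
# P50/P99 latency via nearest-rank percentiles. The record format is
# hypothetical, not benchmark_server.py's actual output.


def summarize(latencies_ms, total_tokens, duration_s):
    """Return throughput and P50/P99 latency from raw measurements."""
    ordered = sorted(latencies_ms)

    def pct(p):
        # Nearest-rank percentile over the sorted latencies.
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[idx]

    return {
        "throughput_tok_s": total_tokens / duration_s,
        "p50_ms": pct(50),
        "p99_ms": pct(99),
    }


stats = summarize([80, 95, 100, 120, 400], total_tokens=5000, duration_s=10)
print(stats["throughput_tok_s"])  # 500.0
```

Note how a single slow request (400 ms here) dominates P99 while barely moving P50, which is why the README calls out head-of-line blocking and tail latency separately from throughput.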