A lightweight control-plane service for LLM KV-cache placement and routing.
This system tracks where cached attention prefixes live across distributed inference nodes and routes requests to maximize cache reuse and minimize recomputation.
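Large Language Model inference can be significantly accelerated through KV-cache reuse: attention keys and values already computed for a shared prompt prefix do not need to be recomputed.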
In distributed serving systems, this introduces key challenges:
• Where are cached prefixes stored?
• Which nodes hold reusable KV blocks?
• How should requests be routed for maximum reuse?
• How do we avoid redundant recomputation?

This project implements a centralized, metadata-driven control plane to solve these problems.
llm-serving-cache is the inference serving layer of a layered AI infrastructure stack:
| Layer | Project | Role |
|---|---|---|
| Storage | VeriStore | WAL durability, crash recovery, Raft replication, object storage |
| Inference Serving | llm-serving-cache (this project) | KV-cache placement and routing control plane, backed by VeriStore |
| Workload Orchestration | Veriflow | GPU-aware scheduler for training, inference, and evaluation jobs |
This project depends on VeriStore (github.com/NasitSony/VeriStore) as its durable metadata backend. All cache entries, session routes, and node registry state are persisted using VeriStore's WAL-backed KV engine, so they inherit its crash-consistency and deterministic recovery guarantees.
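To make the persistence model concrete, here is a minimal sketch of how cache entries, session routes, and node state could be laid out as key/value pairs and rebuilt by replaying a write-ahead log in order. It is an illustration under stated assumptions, not the project's code: `ToyWalStore`, the key prefixes (`cache/`, `session/`, `node/`), and the value encodings are all hypothetical, and none of this is VeriStore's actual API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using LogRecord = std::pair<std::string, std::string>;  // (key, value)

// In-memory stand-in for a WAL-backed KV engine. Hypothetical; NOT VeriStore's API.
class ToyWalStore {
public:
    // Append the write to the log first, then apply it to the table.
    void put(const std::string& key, const std::string& value) {
        wal_.emplace_back(key, value);
        table_[key] = value;
    }

    // Returns nullptr when the key is absent.
    const std::string* get(const std::string& key) const {
        auto it = table_.find(key);
        return it == table_.end() ? nullptr : &it->second;
    }

    const std::vector<LogRecord>& wal() const { return wal_; }

    // Recovery: replaying the same log in the same order always
    // yields the same table, which is what makes it deterministic.
    static ToyWalStore replay(const std::vector<LogRecord>& log) {
        ToyWalStore s;
        for (const auto& [k, v] : log) s.put(k, v);
        assert(s.wal_.size() == log.size());
        return s;
    }

private:
    std::vector<LogRecord> wal_;
    std::map<std::string, std::string> table_;
};

// Hypothetical key layout for the three kinds of metadata.
std::string cache_key(const std::string& model_id, const std::string& prefix_hash) {
    return "cache/" + model_id + "/" + prefix_hash;
}
std::string session_key(const std::string& session_id) { return "session/" + session_id; }
std::string node_key(const std::string& node_id) { return "node/" + node_id; }
```

Because later log records overwrite earlier ones for the same key, a node's capacity record after replay reflects the last update written before the crash.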
Tracks:
• cache entries (prefix → node mapping)
• session routing decisions
• cache access patterns (for eviction)

Maintains:
• available serving nodes
• capacity and utilization
• node liveness

Responsible for request routing:
exact cache hit → reuse cached result
prefix match → reuse partial computation
cache miss → select node + trigger cache fill

Orchestrates control-plane operations:
• cache registration
• node capacity updates
• routing delegation
• lifecycle management

Determines where new cache blocks are placed:
• least-loaded node selection
• capacity-aware routing

Client Request
↓
Router
↓
Session Affinity (if exists)
↓
Exact Cache Hit?
↓
Prefix Match?
↓
Cache Miss
↓
Node Selection (least-loaded)
↓
[If full] Evict old cache entry
↓
Route request to node
↓
Inference completes
↓
Register new cache entry
↓
Update node capacity
↓
KV-backed Metadata Store (WAL)

- ✅ Exact cache lookup
- ✅ Longest-prefix cache reuse
- ✅ Session-affinity routing
- ✅ Cache fill after miss
- ✅ Node capacity tracking
- ✅ Capacity-aware placement
- ✅ Eviction on full node (LRU-style via access time)
- ✅ WAL-backed persistent metadata
- ✅ Crash recovery via deterministic WAL replay

model_id: llama-70b
prefix_hash: 9fa21ab
block_id: block-134
node_id: node-a

session_id: sess-101
model_id: llama-70b
node_id: node-a

llm-serving-cache/
├── CMakeLists.txt
├── README.md
│
├── include/
│   └── cache/
│       ├── cache_types.h
│       ├── metadata_store.h
│       ├── placement_policy.h
│       ├── router.h
│       ├── node_registry.h
│       └── coordinator.h
│
├── src/
│   ├── metadata_store.cpp
│   ├── placement_policy.cpp
│   ├── router.cpp
│   ├── node_registry.cpp
│   ├── coordinator.cpp
│   └── main.cpp
│
├── demos/
│   ├── cache_register_demo.cpp
│   ├── routing_demo.cpp
│   └── node_failure_demo.cpp
│
├── tests/
│   ├── metadata_store_test.cpp
│   ├── routing_test.cpp
│   └── placement_policy_test.cpp
│
└── docs/
    ├── architecture.md
    ├── api.md
    └── roadmap.md

Exact request routed to: node-a
Cache hit: yes
Prefix-match request routed to: node-a
Cache hit: yes
Miss request routed to: node-b
Cache hit: no
Registered new cache entry on: node-b
node-b used_capacity: 1/1
After fill routed to: node-b
Cache hit: yes
Evicted cache block from node: node-b

• In-memory metadata store
• Cache registration
• Session routing
• Node registry

• Exact cache hits
• Prefix-aware routing
• Session-affinity routing
• Cache fill after miss

• Node capacity tracking
• Capacity-aware placement
• Eviction on full nodes
• Complete cache lifecycle

• KV-backed metadata store using KV Shuttle
• WAL-based durability for cache and session metadata
• Crash recovery via WAL replay
• Deterministic reconstruction after restart

• Raft-backed metadata replication
• Leader-based coordination
• Fault-tolerant cache placement decisions

This project models a real AI inference control plane, focusing on:
• reducing redundant inference computation
• maximizing cache reuse across nodes
• ensuring correctness under failure
• maintaining consistent distributed state

v0.4 complete: persistent, crash-consistent control plane prototype

This project builds on:
- VeriStore: WAL-based, crash-consistent KV storage engine with Raft replication
https://github.com/NasitSony/VeriStore
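
The request flow described earlier in this README (session affinity first, then exact hit, then longest-prefix match, then least-loaded placement on a miss, with LRU-style eviction when the chosen node is full) can be sketched roughly as follows. This is a simplified illustration, not the contents of `router.h` or `placement_policy.h`: every type and member name is hypothetical, and it assumes each request carries hashes of successively longer prefixes, which the README does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration; not taken from the project's headers.
struct Node { std::size_t used = 0, capacity = 0; };
struct Entry { std::string node_id; std::uint64_t last_access = 0; };
enum class Outcome { SessionAffinity, ExactHit, PrefixHit, Miss };
struct Decision { Outcome outcome; std::string node_id; };

struct Request {
    std::string session_id, model_id;
    // Assumption: hashes of successively longer prefixes, shortest first;
    // the last element covers the whole prompt.
    std::vector<std::string> prefix_hashes;
};

class Router {
public:
    std::map<std::string, Node> nodes;            // node_id -> load
    std::map<std::string, std::string> sessions;  // session_id -> node_id
    std::map<std::string, Entry> cache;           // "model_id:prefix_hash" -> entry

    // Returns std::nullopt only when no nodes are registered.
    std::optional<Decision> route(const Request& req) {
        ++clock_;
        // 1. Session affinity wins if a route already exists.
        if (auto s = sessions.find(req.session_id); s != sessions.end())
            return Decision{Outcome::SessionAffinity, s->second};
        // 2. Scan from the longest prefix to the shortest; a match on the
        //    last hash is an exact hit, anything shorter is a prefix hit.
        for (std::size_t i = req.prefix_hashes.size(); i-- > 0;) {
            auto it = cache.find(key(req.model_id, req.prefix_hashes[i]));
            if (it == cache.end()) continue;
            it->second.last_access = clock_;  // record access for eviction
            bool exact = (i + 1 == req.prefix_hashes.size());
            return Decision{exact ? Outcome::ExactHit : Outcome::PrefixHit,
                            it->second.node_id};
        }
        // 3. Miss: pick the node with the fewest cached blocks.
        auto best = nodes.end();
        for (auto it = nodes.begin(); it != nodes.end(); ++it)
            if (best == nodes.end() || it->second.used < best->second.used) best = it;
        if (best == nodes.end()) return std::nullopt;
        return Decision{Outcome::Miss, best->first};
    }

    // Cache fill after a miss. If the node is full, the entry on that node
    // with the oldest access time is evicted first; its key is returned.
    std::optional<std::string> fill(const std::string& model_id,
                                    const std::string& prefix_hash,
                                    const std::string& node_id) {
        auto n = nodes.find(node_id);
        assert(n != nodes.end() && n->second.capacity > 0);
        std::optional<std::string> evicted;
        if (n->second.used >= n->second.capacity) {
            auto victim = cache.end();
            for (auto it = cache.begin(); it != cache.end(); ++it)
                if (it->second.node_id == node_id &&
                    (victim == cache.end() ||
                     it->second.last_access < victim->second.last_access))
                    victim = it;
            if (victim != cache.end()) {
                evicted = victim->first;
                cache.erase(victim);
                --n->second.used;
            }
        }
        const std::string k = key(model_id, prefix_hash);
        assert(cache.find(k) == cache.end());  // sketch assumes a new entry
        cache[k] = Entry{node_id, ++clock_};
        ++n->second.used;
        return evicted;
    }

private:
    static std::string key(const std::string& model_id, const std::string& prefix_hash) {
        return model_id + ":" + prefix_hash;
    }
    std::uint64_t clock_ = 0;  // logical clock standing in for access time
};
```

The sketch uses `std::map` so that ties between equally loaded nodes break deterministically by node ID. It uses a logical counter rather than wall-clock time, which keeps eviction order reproducible.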