A correctness-first replicated storage system in C++
This project builds a crash-consistent and replicated key–value storage engine from first principles. It focuses on durability, deterministic recovery, and consensus-based replication — the core foundations behind modern distributed databases and large-scale infrastructure systems.
The system evolves step-by-step from a local storage engine to a replicated distributed service, and further extends into a metadata-driven object storage layer built on top of the KV engine.
- Ensures crash-consistent recovery, guaranteeing zero data loss up to the last durability boundary (fsync / explicit FLUSH)
- Implements group commit batching, improving write throughput by ~2.7× compared to flush-per-write durability
- Achieves deterministic state reconstruction through ordered WAL replay, ensuring identical recovery outcomes
- Detects and safely ignores partial or corrupted WAL records using end-to-end CRC validation
- Reduces recovery time via snapshot-based checkpointing, eliminating the need for full log replay after crashes
- Implements Raft-based replicated state machine semantics, enabling leader election, majority-based commit, and consistent log replication across nodes
- Demonstrates fault-tolerant failover and recovery, including leader crash, re-election, follower catch-up, and client redirection to the active leader
- Builds a mini object storage layer on top of the KV engine supporting bucket creation, chunked object storage, and metadata-based commit semantics
- Implements prefix-based object listing, enabling hierarchical-style scans over object metadata indexes
- Supports safe object overwrite semantics, where new metadata commits atomically replace previous object versions
- Adds mark-and-sweep garbage collection to reclaim unreachable object chunks created by overwrites or deletes
VeriStore is the storage foundation of a layered AI infrastructure stack:
| Layer | Project | Role |
|---|---|---|
| Storage | VeriStore (this project) | WAL durability, crash recovery, Raft replication, object storage |
| Inference Serving | llm-serving-cache | KV-cache placement and routing control plane, backed by VeriStore |
| Workload Orchestration | Veriflow | GPU-aware scheduler for training, inference, and evaluation jobs |
VeriStore's WAL-backed KV engine and object storage layer serve as the durable metadata backend for both llm-serving-cache and Veriflow — providing crash-consistent state across the entire stack.
Workload: 10,000 sequential PUTs + final FLUSH (local disk)
| Mode | Setting | total_ms | ops/sec |
|---|---|---|---|
| Immediate flush | SETBATCH 1 | 255 ms | 39,216 ops/s |
| Group commit | SETBATCH 5 | 96 ms | 104,167 ops/s |
Result: Group commit improves write throughput by ~2.66× by batching WAL flushes (reducing flush frequency compared to flushing every write).
Modern storage and distributed systems rely on strong durability and recovery guarantees. Databases, stream processors, and consensus systems depend on write-ahead logging (WAL), crash consistency, and deterministic state reconstruction to maintain correctness under failures.
This project implements these core primitives from first principles, mirroring the foundations of production systems such as RocksDB, etcd, and distributed databases. WAL, batching (group commit), snapshotting, and recovery form the backbone of reliable data infrastructure.
By focusing on real failure scenarios (crashes, partial writes, corruption), this system demonstrates how correctness and durability are enforced in production backends, forming the basis for replication, consensus, and fault-tolerant distributed systems.
WAL durability • Crash recovery • Raft replication • Object storage • Garbage collection
After a crash, the system recovers to the last valid durable state using snapshot + WAL replay.
Replaying the same WAL produces the same state.
Incomplete or corrupted WAL records are detected and ignored.
Writes are batched and become durable on:
- periodic flush (group commit), or
- manual FLUSH command
+----------------------+
| Client |
| PUT / GET / DEL |
+----------+-----------+
|
v
+-----------------------+
| KV Engine |
| Thread-safe Map |
+----------+------------+
|
v
+-----------------------+
| Write-Ahead Log |
| Append-only |
+----------+------------+
|
v
+-----------------------+
| Durability Layer |
| fsync / Group Commit |
+----------+------------+
|
v
+-----------------------+
| Crash Recovery |
| WAL Replay |
+----------+------------+
|
v
+-----------------------+
| Snapshot & Compaction|
+----------+------------+
|
v
+-----------------------+
| Raft Replication |
| Leader / Quorum Commit|
+----------+------------+
|
v
+-----------------------+
| Object Storage |
| Buckets / Metadata |
| Chunked Object Data |
+----------+------------+
|
v
+-----------------------+
| Metadata Indexing |
| Prefix Scan / Listing |
+----------+------------+
|
v
+-----------------------+
| Garbage Collection |
| Remove old chunks |
+-----------------------+
✅ v0.1 — In-Memory KV Store
- Thread-safe key–value map
- PUT / GET / DEL operations
- Reader–writer locking with shared_mutex
✅ v0.2 — Write-Ahead Log (WAL) + Crash Recovery
- Append-only WAL
- fsync durability guarantees
- Crash consistency under process kill
- Deterministic state reconstruction via WAL replay
- Corruption detection (partial/torn writes ignored)
Guarantee: If a write returns OK, it survives crashes.
✅ v0.3 — Snapshots & Log Compaction
- Periodic snapshot of in-memory state
- WAL truncation after snapshot
- Faster restarts
- Reduced disk growth
✅ v0.4 — Group Commit & Performance
- Batched WAL flushes (group commit)
- Reduced fsync frequency
- Lock contention reduction
- Maintains durability while improving throughput
✅ v0.5 — Raft Consensus Replication
- Leader election
- Heartbeat mechanism
- Log replication across nodes
- Majority quorum commit
- Commit index advancement
- Follower catch-up after crash
- Client redirection to leader
Guarantee: Cluster remains consistent despite node failures.
✅ v0.6 — Mini Object Storage Layer
- Bucket creation
- Object PUT / GET / DELETE
- Chunked storage for larger objects
- Object metadata management
- Logical delete via metadata state
- Object reconstruction after restart using the underlying WAL-backed KV engine
Guarantee:
Committed objects remain recoverable after restart, inheriting the durability and crash-recovery semantics of the storage engine.
Object writes follow a correctness-first design:
- Object data is split into chunks
- Each chunk is written as a KV entry
- Object metadata is written last and acts as the commit point
An object is considered valid only if committed metadata exists. During recovery, objects without committed metadata are ignored.
- bucket:<bucket-name> → bucket metadata
- objmeta:<bucket>:<object-key> → object metadata
- objchunk:<object-id>:<chunk-index> → object chunk data
- bucketidx:<bucket>:<object-key> → object index entry
✅ v0.7 — Prefix Scan + ListObjects
- Prefix-based key scanning in the KV engine
- Bucket object listing via metadata index
- Prefix filtering support for hierarchical-style paths
- Deterministic lexicographic listing order
- Logical delete filtering during listing
Guarantee:
- Object listings reflect the committed metadata state of the system. Only objects with valid committed metadata entries are returned during listing.
Object Listing Semantics

Object listing works by scanning the bucket index namespace:

bucketidx:<bucket>:<object-key>

The KV engine performs a prefix scan over this namespace to retrieve matching object keys. For each entry:

- The object key is extracted from the index entry
- Object metadata is loaded from objmeta:<bucket>:<object-key>
- Deleted objects are filtered out
- Remaining objects are returned in lexicographic order
This enables operations equivalent to:
- ListObjects(bucket)
- ListObjects(bucket, prefix)
Storage Index Layout
- bucket:<bucket-name> → bucket metadata
- objmeta:<bucket>:<object-key> → object metadata
- objchunk:<object-id>:<chunk-index> → object data chunks
- bucketidx:<bucket>:<object-key> → object index entry used for prefix scans

✅ v0.8 — Overwrite Semantics + Garbage Collection
- Object overwrite support for existing keys
- Metadata replacement as the object commit boundary
- Detection of superseded object versions
- Garbage collection of orphaned object chunks
- Lifecycle-safe object updates under crash recovery
Guarantee: When an object key is overwritten, the new object becomes the authoritative version once its metadata is committed. Previous object data becomes unreachable and can be safely reclaimed by garbage collection.
Object Overwrite Semantics
When an object is written with an existing key:
- The new object data is chunked and written to storage
- A new object identifier is generated
- Object metadata is written last and becomes the commit point
- The bucket index entry is updated to reference the new object
Any previous object chunks remain unreachable after the index update and are considered garbage.
Garbage Collection
Garbage collection identifies and removes object chunks that are no longer referenced by committed object metadata. This prevents storage growth from repeated overwrites.
Storage Layout
bucket:<bucket-name> → bucket metadata
objmeta:<bucket>:<object-key> → object metadata (latest committed version)
objchunk:<object-id>:<chunk-index> → object chunk data
bucketidx:<bucket>:<object-key>   → object index entry

Fault scenarios exercised:

✔ Process crash (kill -9)
✔ Partial disk writes
✔ Leader node crash
✔ Follower crash & recovery
✔ Log divergence repair
✔ Replica catch-up via log backtracking
PUT / DEL
- append to WAL buffer
- apply to in-memory map
- fsync on batch boundary / FLUSH
On startup:
- load snapshot
- replay WAL
- validate records (CRC)
- apply valid entries
- stop at first corrupted entry
Handled:
- Process crashes (kill -9)
- Partial/torn writes
- Interrupted disk writes
Not yet handled:
- Distributed replication
- Byzantine faults
- Performance tuning beyond batching
cmake -S . -B build
cmake --build build
./build/kv_cli

Start KV CLI:

./build/kv_cli

Commands:
PUT key value
GET key
DEL key
SIZE
SNAP
FLUSH
EXIT

This demo shows leader election, log replication, and fault-tolerant recovery using a 3-node Raft cluster.
Features Demonstrated
- Automatic leader election
- Strong consistency via majority commit
- Log replication across nodes
- Leader crash and re-election
- Follower log catch-up after restart
- Client redirection to active leader
Run Raft Simulation
./build/raft_demo

Example Output:
[raft] node 3 became LEADER term=1
Leader is node 3
ProposePut(a=100) -> true
s1.get(a)=100
s2.get(a)=100
s3.get(a)=100
=== Simulating leader crash ===
[raft] node 2 became LEADER term=2
New leader is node 2
ProposePut(b=200) -> true
s1.get(b)=200
s2.get(b)=200
s3.get(b)=200

This demonstrates Raft's guarantees of:
- Leader-based coordination
- Majority-based commit
- Safety under node failures
- Deterministic state convergence
./build/kv_cli
PUT x 100
PUT y 200
FLUSH

Crash:

pkill -9 kv_cli

Restart:
./build/kv_cli
GET x # 100
GET y # 200

Write an object:

./build/object_store_write_demo

Recover after restart:

./build/object_store_recover_demo

Run benchmark with different durability settings:
./build/kv_cli
SETBATCH 1
BENCH 10000

./build/kv_cli
SETBATCH 5
BENCH 10000

Files on disk:

/tmp/kv.wal      → Write-Ahead Log
/tmp/kv.snapshot → Snapshot file

Build:

cmake -S . -B build
cmake --build build
Storage Engine Concepts
- WAL durability
- Crash consistency
- Deterministic replay
- Snapshotting & compaction
Distributed Systems Concepts
- Consensus protocols
- Leader election
- Log replication
- Quorum commits
- Failure recovery
Systems Engineering
- Correctness-first design
- Failure-aware architecture
- Deterministic state machines
- Clean layering (storage → replication → clients)
Planned — Replicated Metadata Service
A control-plane metadata store for distributed systems:
- Shard ownership metadata
- Leader leases
- Rebalancing coordination
Planned — LLM Serving Cache Coordinator
- KV-cache placement metadata
- Node failure reassignment
- Session routing
- AI inference infrastructure support
Planned — Formal Correctness & Testing
- Failure injection tests
- Jepsen-style validation
- Deterministic replay testing
This project is not about building the fastest KV store. It is about understanding:
- How data survives crashes
- How write durability actually works
- How distributed replicas stay consistent
- How consensus protocols coordinate state
- How storage engines underpin AI & cloud infrastructure
This project serves as a hands-on exploration of:
- Database internals
- Distributed consensus
- Systems reliability engineering
- AI infrastructure foundations
Nasit Sarwar Sony — Distributed Systems & AI Infrastructure Engineer
Built as part of a deep exploration into: Distributed Systems • Storage Engines • Consensus Protocols • Fault Tolerance • AI Infrastructure
MIT License