A correctness-first replicated storage system in C++
This project builds a crash-consistent and replicated key–value storage engine from first principles. It focuses on durability, deterministic recovery, and consensus-based replication — the core foundations behind modern distributed databases and large-scale infrastructure systems.
The system evolves step-by-step from a local storage engine to a replicated distributed service, and further extends into a metadata-driven object storage layer built on top of the KV engine.
- Ensures crash-consistent recovery, guaranteeing zero data loss up to the last durability boundary (fsync / explicit FLUSH)
- Implements group commit batching, improving write throughput by ~2.7× compared to flush-per-write durability
- Achieves deterministic state reconstruction through ordered WAL replay, ensuring identical recovery outcomes
- Detects and safely ignores partial or corrupted WAL records using end-to-end CRC validation
- Reduces recovery time via snapshot-based checkpointing, eliminating the need for full log replay after crashes
- Implements Raft-based replicated state machine semantics, enabling leader election, majority-based commit, and consistent log replication across nodes
- Demonstrates fault-tolerant failover and recovery, including leader crash, re-election, follower catch-up, and client redirection to the active leader
- Builds a mini object storage layer on top of the KV engine supporting bucket creation, chunked object storage, and metadata-based commit semantics
- Implements prefix-based object listing, enabling hierarchical-style scans over object metadata indexes
- Supports safe object overwrite semantics, where new metadata commits atomically replace previous object versions
- Adds mark-and-sweep garbage collection to reclaim unreachable object chunks created by overwrites or deletes
VeriStore is the storage foundation of a layered AI infrastructure stack:
| Layer | Project | Role |
|---|---|---|
| Storage | VeriStore (this project) | WAL durability, crash recovery, Raft replication, object storage |
| Inference Serving | llm-serving-cache | KV-cache placement and routing control plane, backed by VeriStore |
| Workload Orchestration | Veriflow | GPU-aware scheduler for training, inference, and evaluation jobs |
VeriStore's WAL-backed KV engine and object storage layer serve as the durable metadata backend for both llm-serving-cache and Veriflow — providing crash-consistent state across the entire stack.
Workload: 10,000 sequential PUTs + final FLUSH (local disk)
| Mode | Setting | total_ms | ops/sec |
|---|---|---|---|
| Immediate flush | SETBATCH 1 | 255 ms | 39,216 ops/s |
| Group commit | SETBATCH 5 | 96 ms | 104,167 ops/s |
Result: Group commit improves write throughput by ~2.66× by batching WAL flushes (reducing flush frequency compared to flushing every write).
Modern storage and distributed systems rely on strong durability and recovery guarantees. Databases, stream processors, and consensus systems depend on write-ahead logging (WAL), crash consistency, and deterministic state reconstruction to maintain correctness under failures.
This project implements these core primitives from first principles, mirroring the foundations of production systems such as RocksDB, etcd, and distributed databases. WAL, batching (group commit), snapshotting, and recovery form the backbone of reliable data infrastructure.
By focusing on real failure scenarios (crashes, partial writes, corruption), this system demonstrates how correctness and durability are enforced in production backends, forming the basis for replication, consensus, and fault-tolerant distributed systems.
WAL durability • Crash recovery • Raft replication • Object storage • Garbage collection
After a crash, the system recovers to the last valid durable state using snapshot + WAL replay.
Replaying the same WAL produces the same state.
Incomplete or corrupted WAL records are detected and ignored.
Writes are batched and become durable on:
- periodic flush (group commit), or
- manual FLUSH command
+----------------------+
| Client |
| PUT / GET / DEL |
+----------+-----------+
|
v
+-----------------------+
| KV Engine |
| Thread-safe Map |
+----------+------------+
|
v
+-----------------------+
| Write-Ahead Log |
| Append-only |
+----------+------------+
|
v
+-----------------------+
| Durability Layer |
| fsync / Group Commit |
+----------+------------+
|
v
+-----------------------+
| Crash Recovery |
| WAL Replay |
+----------+------------+
|
v
+-----------------------+
| Snapshot & Compaction|
+----------+------------+
|
v
+-----------------------+
| Raft Replication |
| Leader / Quorum Commit|
+----------+------------+
|
v
+-----------------------+
| Object Storage |
| Buckets / Metadata |
| Chunked Object Data |
+----------+------------+
|
v
+-----------------------+
| Metadata Indexing |
| Prefix Scan / Listing |
+----------+------------+
|
v
+-----------------------+
| Garbage Collection |
| Remove old chunks |
+-----------------------+
✅ v0.1 — In-Memory KV Store
- Thread-safe key–value map
- PUT / GET / DEL operations
- Reader–writer locking with shared_mutex
✅ v0.2 — Write-Ahead Log (WAL) + Crash Recovery
- Append-only WAL
- fsync durability guarantees
- Crash consistency under process kill
- Deterministic state reconstruction via WAL replay
- Corruption detection (partial/torn writes ignored)
Guarantee: If a write returns OK, it survives crashes.
✅ v0.3 — Snapshots & Log Compaction
- Periodic snapshot of in-memory state
- WAL truncation after snapshot
- Faster restarts
- Reduced disk growth
✅ v0.4 — Group Commit & Performance
- Batched WAL flushes (group commit)
- Reduced fsync frequency
- Lock contention reduction
- Maintains durability while improving throughput
✅ v0.5 — Raft Consensus Replication
- Leader election
- Heartbeat mechanism
- Log replication across nodes
- Majority quorum commit
- Commit index advancement
- Follower catch-up after crash
- Client redirection to leader
Guarantee: Cluster remains consistent despite node failures.
✅ v0.6 — Mini Object Storage Layer
- Bucket creation
- Object PUT / GET / DELETE
- Chunked storage for larger objects
- Object metadata management
- Logical delete via metadata state
- Object reconstruction after restart using the underlying WAL-backed KV engine
Guarantee:
Committed objects remain recoverable after restart, inheriting the durability and crash-recovery semantics of the storage engine.
Object writes follow a correctness-first design:
- Object data is split into chunks
- Each chunk is written as a KV entry
- Object metadata is written last and acts as the commit point
An object is considered valid only if committed metadata exists. During recovery, objects without committed metadata are ignored.
- bucket:<bucket-name> → bucket metadata
- objmeta:<bucket>:<object-key> → object metadata
- objchunk:<object-id>:<chunk-index> → object chunk data
- bucketidx:<bucket>:<object-key> → object index entry
✅ v0.7 — Prefix Scan + ListObjects
- Prefix-based key scanning in the KV engine
- Bucket object listing via metadata index
- Prefix filtering support for hierarchical-style paths
- Deterministic lexicographic listing order
- Logical delete filtering during listing
Guarantee:
- Object listings reflect the committed metadata state of the system. Only objects with valid committed metadata entries are returned during listing.
Object Listing Semantics

Object listing works by scanning the bucket index namespace:

bucketidx:<bucket>:<object-key>

The KV engine performs a prefix scan over this namespace to retrieve matching object keys. For each entry:

- The object key is extracted from the index entry
- Object metadata is loaded from objmeta:<bucket>:<object-key>
- Deleted objects are filtered out
- Remaining objects are returned in lexicographic order
This enables operations equivalent to:
- ListObjects(bucket)
- ListObjects(bucket, prefix)
Storage Index Layout
- bucket:<bucket-name> → bucket metadata
- objmeta:<bucket>:<object-key> → object metadata
- objchunk:<object-id>:<chunk-index> → object data chunks
- bucketidx:<bucket>:<object-key> → object index entry used for prefix scans

✅ v0.8 — Overwrite Semantics + Garbage Collection
- Object overwrite support for existing keys
- Metadata replacement as the object commit boundary
- Detection of superseded object versions
- Garbage collection of orphaned object chunks
- Lifecycle-safe object updates under crash recovery
Guarantee: When an object key is overwritten, the new object becomes the authoritative version once its metadata is committed. Previous object data becomes unreachable and can be safely reclaimed by garbage collection.
Object Overwrite Semantics
When an object is written with an existing key:
- The new object data is chunked and written to storage
- A new object identifier is generated
- Object metadata is written last and becomes the commit point
- The bucket index entry is updated to reference the new object
Any previous object chunks remain unreachable after the index update and are considered garbage.
Garbage Collection
Garbage collection identifies and removes object chunks that are no longer referenced by committed object metadata. This prevents storage growth from repeated overwrites.
Storage Layout
bucket:<bucket-name> → bucket metadata
objmeta:<bucket>:<object-key> → object metadata (latest committed version)
objchunk:<object-id>:<chunk-index> → object chunk data
bucketidx:<bucket>:<object-key>   → object index entry

Fault scenarios exercised:

✔ Process crash (kill -9)
✔ Partial disk writes
✔ Leader node crash
✔ Follower crash & recovery
✔ Log divergence repair
✔ Replica catch-up via log backtracking
PUT / DEL
- append to WAL buffer
- apply to in-memory map
- fsync on batch boundary / FLUSH
On startup:
- load snapshot
- replay WAL
- validate records (CRC)
- apply valid entries
- stop at first corrupted entry
Handled:
- Process crashes (kill -9)
- Partial/torn writes
- Interrupted disk writes
Not yet handled:
- Distributed replication
- Byzantine faults
- Performance tuning beyond batching
cmake -S . -B build
cmake --build build
./build/kv_cli

Start KV CLI:

./build/kv_cli

Commands:
PUT key value
GET key
DEL key
SIZE
SNAP
FLUSH
EXIT

This demo shows leader election, log replication, and fault-tolerant recovery using a 3-node Raft cluster.
Features Demonstrated
- Automatic leader election
- Strong consistency via majority commit
- Log replication across nodes
- Leader crash and re-election
- Follower log catch-up after restart
- Client redirection to active leader
Run Raft Simulation
./build/raft_demo

Example Output:
[raft] node 3 became LEADER term=1
Leader is node 3
ProposePut(a=100) -> true
s1.get(a)=100
s2.get(a)=100
s3.get(a)=100
=== Simulating leader crash ===
[raft] node 2 became LEADER term=2
New leader is node 2
ProposePut(b=200) -> true
s1.get(b)=200
s2.get(b)=200
s3.get(b)=200

This demonstrates Raft's guarantees of:
- Leader-based coordination
- Majority-based commit
- Safety under node failures
- Deterministic state convergence
./build/kv_cli
PUT x 100
PUT y 200
FLUSH

Crash:

pkill -9 kv_cli

Restart:
./build/kv_cli
GET x # 100
GET y # 200

Write an object:

./build/object_store_write_demo

Recover after restart:

./build/object_store_recover_demo

Run benchmark with different durability settings:
./build/kv_cli
SETBATCH 1
BENCH 10000

./build/kv_cli
SETBATCH 5
BENCH 10000

Files on disk:

/tmp/kv.wal      → Write-Ahead Log
/tmp/kv.snapshot → Snapshot file

Build:

cmake -S . -B build
cmake --build build
Storage Engine Concepts
- WAL durability
- Crash consistency
- Deterministic replay
- Snapshotting & compaction
Distributed Systems Concepts
- Consensus protocols
- Leader election
- Log replication
- Quorum commits
- Failure recovery
Systems Engineering
- Correctness-first design
- Failure-aware architecture
- Deterministic state machines
- Clean layering (storage → replication → clients)
Planned — Replicated Metadata Service
A control-plane metadata store for distributed systems:
- Shard ownership metadata
- Leader leases
- Rebalancing coordination
Planned — LLM Serving Cache Coordinator
- KV-cache placement metadata
- Node failure reassignment
- Session routing
- AI inference infrastructure support
Planned — Formal Correctness & Testing
- Failure injection tests
- Jepsen-style validation
- Deterministic replay testing
This project is not about building the fastest KV store. It is about understanding:
- How data survives crashes
- How write durability actually works
- How distributed replicas stay consistent
- How consensus protocols coordinate state
- How storage engines underpin AI & cloud infrastructure
This project serves as a hands-on exploration of:
- Database internals
- Distributed consensus
- Systems reliability engineering
- AI infrastructure foundations
Nasit Sarwar Sony — Distributed Systems & AI Infrastructure Engineer
Built as part of a deep exploration into: Distributed Systems • Storage Engines • Consensus Protocols • Fault Tolerance • AI Infrastructure
MIT License