🚀 VeriStore

A correctness-first replicated storage system in C++

This project builds a crash-consistent and replicated key–value storage engine from first principles. It focuses on durability, deterministic recovery, and consensus-based replication — the core foundations behind modern distributed databases and large-scale infrastructure systems.

The system evolves step-by-step from a local storage engine to a replicated distributed service, and further extends into a metadata-driven object storage layer built on top of the KV engine.

⚡ Key Results

  • Ensures crash-consistent recovery, guaranteeing zero data loss up to the last durability boundary (fsync / explicit FLUSH)
  • Implements group commit batching, improving write throughput by ~2.7× compared to flush-per-write durability
  • Achieves deterministic state reconstruction through ordered WAL replay, ensuring identical recovery outcomes
  • Detects and safely ignores partial or corrupted WAL records using end-to-end CRC validation
  • Reduces recovery time via snapshot-based checkpointing, eliminating the need for full log replay after crashes
  • Implements Raft-based replicated state machine semantics, enabling leader election, majority-based commit, and consistent log replication across nodes
  • Demonstrates fault-tolerant failover and recovery, including leader crash, re-election, follower catch-up, and client redirection to the active leader
  • Builds a mini object storage layer on top of the KV engine supporting bucket creation, chunked object storage, and metadata-based commit semantics
  • Implements prefix-based object listing, enabling hierarchical-style scans over object metadata indexes
  • Supports safe object overwrite semantics, where new metadata commits atomically replace previous object versions
  • Adds mark-and-sweep garbage collection to reclaim unreachable object chunks created by overwrites or deletes

🔗 Part of the Veri* AI Infrastructure Stack

VeriStore is the storage foundation of a layered AI infrastructure stack:

| Layer | Project | Role |
| --- | --- | --- |
| Storage | VeriStore (this project) | WAL durability, crash recovery, Raft replication, object storage |
| Inference Serving | llm-serving-cache | KV-cache placement and routing control plane, backed by VeriStore |
| Workload Orchestration | Veriflow | GPU-aware scheduler for training, inference, and evaluation jobs |

VeriStore's WAL-backed KV engine and object storage layer serve as the durable metadata backend for both llm-serving-cache and Veriflow — providing crash-consistent state across the entire stack.

📊 Performance Snapshot (Group Commit vs Immediate Flush)

Workload: 10,000 sequential PUTs + final FLUSH (local disk)

| Mode | Setting | Total time | Throughput |
| --- | --- | --- | --- |
| Immediate flush | SETBATCH 1 | 255 ms | 39,216 ops/s |
| Group commit | SETBATCH 5 | 96 ms | 104,167 ops/s |

Result: Group commit improves write throughput by ~2.66× by batching WAL flushes, cutting fsync frequency relative to flush-per-write durability.
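The batching effect can be sketched with a toy in-memory WAL writer (the `WalWriter` class and its methods are illustrative, not the project's actual API); the expensive fsync is modeled as a flush counter so the trade-off is visible directly:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Minimal sketch of group commit: appends accumulate in a buffer and the
// (expensive) flush happens only once per `flush_every` writes.
class WalWriter {
public:
    explicit WalWriter(std::size_t flush_every) : flush_every_(flush_every) {}

    // Append one record; flush when the batch boundary is reached.
    void append(const std::string& record) {
        buffer_.push_back(record);
        if (buffer_.size() >= flush_every_) flush();
    }

    // Stands in for fsync: in the real engine this is the durability point.
    void flush() {
        durable_.insert(durable_.end(), buffer_.begin(), buffer_.end());
        buffer_.clear();
        ++flush_count_;
    }

    std::size_t flush_count() const { return flush_count_; }
    std::size_t durable_records() const { return durable_.size(); }

private:
    std::size_t flush_every_;
    std::size_t flush_count_ = 0;
    std::vector<std::string> buffer_;   // not yet durable
    std::vector<std::string> durable_;  // durable after flush
};
```

With `flush_every = 5` (the analogue of SETBATCH 5), ten writes cost two flushes instead of ten, which is where the throughput gain comes from.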

🏭 Why This Matters (Industry Context)

Modern storage and distributed systems rely on strong durability and recovery guarantees. Databases, stream processors, and consensus systems depend on write-ahead logging (WAL), crash consistency, and deterministic state reconstruction to maintain correctness under failures.

This project implements these core primitives from first principles, mirroring the foundations of production systems such as RocksDB, etcd, and distributed databases. WAL, batching (group commit), snapshotting, and recovery form the backbone of reliable data infrastructure.

By focusing on real failure scenarios (crashes, partial writes, corruption), this system demonstrates how correctness and durability are enforced in production backends, forming the basis for replication, consensus, and fault-tolerant distributed systems.

🚀 Current Version: v0.8

WAL durability • Crash recovery • Raft replication • Object storage • Garbage collection


✅ Guarantees

Crash Consistency

After a crash, the system recovers to the last valid durable state using snapshot + WAL replay.

Deterministic Recovery

Replaying the same WAL produces the same state.

Corruption Handling

Incomplete or corrupted WAL records are detected and ignored.

Durability Boundary (v0.4)

Writes are batched and become durable on:

  • periodic flush (group commit), or
  • manual FLUSH command

🧱 Architecture Overview

                 +----------------------+
                 |        Client        |
                 |   PUT / GET / DEL    |
                 +----------+-----------+
                            |
                            v
                +-----------------------+
                |       KV Engine       |
                |    Thread-safe Map    |
                +----------+------------+
                           |
                           v
                +-----------------------+
                |   Write-Ahead Log     |
                |     Append-only       |
                +----------+------------+
                           |
                           v
                +-----------------------+
                |   Durability Layer    |
                |  fsync / Group Commit |
                +----------+------------+
                           |
                           v
                +-----------------------+
                |    Crash Recovery     |
                |      WAL Replay       |
                +----------+------------+
                           |
                           v
                +-----------------------+
                |  Snapshot & Compaction|
                +----------+------------+
                           |
                           v
                +-----------------------+
                |   Raft Replication    |
                | Leader / Quorum Commit|
                +----------+------------+
                           |
                           v
                +-----------------------+
                |   Object Storage      |
                | Buckets / Metadata    |
                | Chunked Object Data   |
                +----------+------------+
                           |
                           v
                +-----------------------+
                |   Metadata Indexing   |
                | Prefix Scan / Listing |
                +----------+------------+
                           |
                           v
                +-----------------------+
                | Garbage Collection    |
                | Remove old chunks     |
                +-----------------------+

✨ Features Implemented

✅ v0.1 — In-Memory KV Store

  • Thread-safe key–value map
  • PUT / GET / DEL operations
  • Reader–writer locking with shared_mutex

✅ v0.2 — Write-Ahead Log (WAL) + Crash Recovery

  • Append-only WAL
  • fsync durability guarantees
  • Crash consistency under process kill
  • Deterministic state reconstruction via WAL replay
  • Corruption detection (partial/torn writes ignored)

Guarantee: If a write returns OK, it survives crashes.
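One plausible record framing that yields this behavior — a length and CRC32 header ahead of each payload — can be sketched as follows. This is an illustration of the technique, not the project's actual on-disk format: a torn tail write fails either the length check or the checksum check and is rejected on replay.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Tiny CRC32 (reflected, polynomial 0xEDB88320).
inline uint32_t crc32(const std::string& data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char c : data) {
        crc ^= c;
        for (int i = 0; i < 8; ++i)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Encode one record into the framed form: [len][crc][payload].
inline std::string encode_record(const std::string& payload) {
    uint32_t len = static_cast<uint32_t>(payload.size());
    uint32_t crc = crc32(payload);
    std::string out;
    out.append(reinterpret_cast<const char*>(&len), 4);
    out.append(reinterpret_cast<const char*>(&crc), 4);
    out.append(payload);
    return out;
}

// Decode the record at the start of `buf`; returns true only if the record
// is complete and the CRC matches, so partial/torn tails are rejected.
inline bool decode_record(const std::string& buf, std::string& payload) {
    if (buf.size() < 8) return false;       // truncated header
    uint32_t len, crc;
    std::memcpy(&len, buf.data(), 4);
    std::memcpy(&crc, buf.data() + 4, 4);
    if (buf.size() < 8 + len) return false; // torn payload
    payload = buf.substr(8, len);
    return crc32(payload) == crc;           // corruption check
}
```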

✅ v0.3 — Snapshots & Log Compaction

  • Periodic snapshot of in-memory state
  • WAL truncation after snapshot
  • Faster restarts
  • Reduced disk growth

✅ v0.4 — Group Commit & Performance

  • Batched WAL flushes (group commit)
  • Reduced fsync frequency
  • Lock contention reduction
  • Maintains durability while improving throughput

✅ v0.5 — Raft Consensus Replication

  • Leader election
  • Heartbeat mechanism
  • Log replication across nodes
  • Majority quorum commit
  • Commit index advancement
  • Follower catch-up after crash
  • Client redirection to leader

Guarantee: Cluster remains consistent despite node failures.
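One core piece of quorum commit can be shown in isolation: the leader may advance its commit index to the highest log index replicated on a majority, which after sorting the per-node matchIndex values is the lower median. This sketch omits Raft's additional requirement that the entry be from the leader's current term; it is illustrative, not the project's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// match_index[i] = highest log index known to be replicated on node i
// (the leader's own last index is included in the vector).
inline uint64_t quorum_commit_index(std::vector<uint64_t> match_index) {
    std::sort(match_index.begin(), match_index.end());
    // With n nodes, the entry at position (n-1)/2 after sorting is present
    // on at least ceil((n+1)/2) nodes: a majority.
    return match_index[(match_index.size() - 1) / 2];
}
```

For a 3-node cluster with matchIndex {5, 3, 5}, index 5 is on two of three nodes, so commitIndex can advance to 5 even though one follower lags.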

✅ v0.6 — Mini Object Storage Layer

  • Bucket creation
  • Object PUT / GET / DELETE
  • Chunked storage for larger objects
  • Object metadata management
  • Logical delete via metadata state
  • Object reconstruction after restart using the underlying WAL-backed KV engine

Guarantee:
Committed objects remain recoverable after restart, inheriting the durability and crash-recovery semantics of the storage engine.

Object Write Commit Semantics

Object writes follow a correctness-first design:

  1. Object data is split into chunks
  2. Each chunk is written as a KV entry
  3. Object metadata is written last and acts as the commit point

An object is considered valid only if committed metadata exists. During recovery, objects without committed metadata are ignored.
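The three steps above can be sketched over a plain ordered map standing in for the KV engine. The helper names and the metadata encoding (`<id>|<chunk-count>`) are illustrative, not the project's actual format; the point is that the metadata write is the single commit point.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

using Kv = std::map<std::string, std::string>;

// Write path: chunks first, metadata last (the commit point).
inline void put_object(Kv& kv, const std::string& bucket,
                       const std::string& key, const std::string& id,
                       const std::string& data, std::size_t chunk_size) {
    // 1-2. Split into chunks, each stored as its own KV entry.
    std::size_t n = 0;
    for (std::size_t off = 0; off < data.size(); off += chunk_size, ++n)
        kv["objchunk:" + id + ":" + std::to_string(n)] =
            data.substr(off, chunk_size);
    // 3. Metadata last: a crash before this line leaves only orphan chunks.
    kv["objmeta:" + bucket + ":" + key] = id + "|" + std::to_string(n);
}

// Read path: an object is readable only if committed metadata exists.
inline bool get_object(const Kv& kv, const std::string& bucket,
                       const std::string& key, std::string& out) {
    auto it = kv.find("objmeta:" + bucket + ":" + key);
    if (it == kv.end()) return false;  // no commit point: not an object
    std::string id = it->second.substr(0, it->second.find('|'));
    std::size_t n = std::stoul(it->second.substr(it->second.find('|') + 1));
    out.clear();
    for (std::size_t i = 0; i < n; ++i)
        out += kv.at("objchunk:" + id + ":" + std::to_string(i));
    return true;
}
```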

Storage Layout

  • bucket:<bucket-name> → bucket metadata
  • objmeta:<bucket>:<object-key> → object metadata
  • objchunk:<object-id>:<chunk-index> → object chunk data
  • bucketidx:<bucket>:<object-key> → object index entry

✅ v0.7 — Prefix Scan + ListObjects

  • Prefix-based key scanning in the KV engine
  • Bucket object listing via metadata index
  • Prefix filtering support for hierarchical-style paths
  • Deterministic lexicographic listing order
  • Logical delete filtering during listing

Guarantee: Object listings reflect the committed metadata state of the system. Only objects with valid committed metadata entries are returned during listing.

Object Listing Semantics

Object listing works by scanning the bucket index namespace:

bucketidx:<bucket>:<object-key>

The KV engine performs a prefix scan over this namespace to retrieve matching object keys. For each entry:

  1. The object key is extracted from the index entry
  2. Object metadata is loaded from objmeta:<bucket>:<object-key>
  3. Deleted objects are filtered out
  4. Remaining objects are returned in lexicographic order

This enables operations equivalent to:

  • ListObjects(bucket)
  • ListObjects(bucket, prefix)
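Assuming an ordered map as a stand-in for the KV engine, the prefix scan behind ListObjects can be sketched like this. `std::map` keeps keys sorted, so `lower_bound` on the scan prefix visits matches in lexicographic order; the `"DELETED"` tombstone value is an illustrative convention, not the project's actual encoding.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Kv = std::map<std::string, std::string>;

// List object keys in `bucket`, optionally restricted to a key prefix,
// by scanning the bucketidx:<bucket>:<object-key> namespace.
inline std::vector<std::string> list_objects(const Kv& kv,
                                             const std::string& bucket,
                                             const std::string& prefix = "") {
    std::string scan = "bucketidx:" + bucket + ":" + prefix;
    std::vector<std::string> keys;
    for (auto it = kv.lower_bound(scan);
         it != kv.end() && it->first.compare(0, scan.size(), scan) == 0;
         ++it) {
        // Extract the object key after "bucketidx:<bucket>:".
        std::string key = it->first.substr(10 + bucket.size() + 1);
        // Logical deletes are filtered out during listing.
        if (it->second != "DELETED") keys.push_back(key);
    }
    return keys;                     // already in lexicographic order
}
```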

Storage Index Layout

- bucket:<bucket-name> → bucket metadata  
- objmeta:<bucket>:<object-key> → object metadata  
- objchunk:<object-id>:<chunk-index> → object data chunks  
- bucketidx:<bucket>:<object-key> → object index entry used for prefix scans

✅ v0.8 — Overwrite Semantics + Garbage Collection

  • Object overwrite support for existing keys
  • Metadata replacement as the object commit boundary
  • Detection of superseded object versions
  • Garbage collection of orphaned object chunks
  • Lifecycle-safe object updates under crash recovery

Guarantee: When an object key is overwritten, the new object becomes the authoritative version once its metadata is committed. Previous object data becomes unreachable and can be safely reclaimed by garbage collection.

Object Overwrite Semantics

When an object is written with an existing key:

  1. The new object data is chunked and written to storage
  2. A new object identifier is generated
  3. Object metadata is written last and becomes the commit point
  4. The bucket index entry is updated to reference the new object

Any previous object chunks remain unreachable after the index update and are considered garbage.
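The metadata/index swap at the heart of the overwrite path can be sketched as follows; for brevity the metadata value is just the object id, and all names are illustrative rather than the project's actual interface.

```cpp
#include <cassert>
#include <map>
#include <string>

using Kv = std::map<std::string, std::string>;

// Commit an overwrite of bucket/key to the freshly generated `new_id`
// (chunks for new_id are assumed already written). Returns the superseded
// object id ("" on first write) whose chunks are now garbage.
inline std::string commit_overwrite(Kv& kv, const std::string& bucket,
                                    const std::string& key,
                                    const std::string& new_id) {
    std::string meta_key = "objmeta:" + bucket + ":" + key;
    auto it = kv.find(meta_key);
    std::string old_id = (it == kv.end()) ? "" : it->second;
    kv[meta_key] = new_id;                          // commit point
    kv["bucketidx:" + bucket + ":" + key] = new_id; // index follows metadata
    return old_id;                                  // now unreachable
}
```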

Garbage Collection

Garbage collection identifies and removes object chunks that are no longer referenced by committed object metadata. This prevents storage growth from repeated overwrites.
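A minimal mark-and-sweep pass over the key layout above might look like this. It assumes objmeta values begin with the object id terminated by `|` — an illustrative encoding, not the project's actual format.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using Kv = std::map<std::string, std::string>;

// Mark: collect object ids referenced by committed metadata.
// Sweep: delete objchunk entries whose id is unmarked.
// Returns the number of chunks reclaimed.
inline std::size_t sweep_orphan_chunks(Kv& kv) {
    std::set<std::string> live;
    for (const auto& entry : kv)
        if (entry.first.rfind("objmeta:", 0) == 0)
            live.insert(entry.second.substr(0, entry.second.find('|')));
    std::vector<std::string> dead;
    for (const auto& entry : kv)
        if (entry.first.rfind("objchunk:", 0) == 0) {
            std::string rest = entry.first.substr(9);   // "<id>:<chunk-index>"
            std::string id = rest.substr(0, rest.rfind(':'));
            if (!live.count(id)) dead.push_back(entry.first);
        }
    for (const auto& key : dead) kv.erase(key);
    return dead.size();
}
```

Because deletion only touches chunks no committed metadata can reach, the pass is safe to run at any time, including after a crash mid-overwrite.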

Storage Layout

bucket:<bucket-name> → bucket metadata  
objmeta:<bucket>:<object-key> → object metadata (latest committed version)  
objchunk:<object-id>:<chunk-index> → object chunk data  
bucketidx:<bucket>:<object-key> → object index entry

🧪 Failure Scenarios Tested

✔ Process crash (kill -9)
✔ Partial disk writes
✔ Leader node crash
✔ Follower crash & recovery
✔ Log divergence repair
✔ Replica catch-up via log backtracking

Write Path (v0.4)

PUT / DEL

  • append to WAL buffer
  • apply to in-memory map
  • fsync on batch boundary / FLUSH

Recovery Path

On startup:

  • load snapshot
  • replay WAL
  • validate records (CRC)
  • apply valid entries
  • stop at first corrupted entry
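The replay loop can be sketched like this, with the CRC/length check abstracted into a per-record `valid` flag (the record representation and command text are illustrative): recovery starts from the snapshot and applies records in order until the first invalid one, so the result is a deterministic function of snapshot plus WAL prefix.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Kv = std::map<std::string, std::string>;

struct WalRecord {
    std::string op;    // e.g. "PUT key value" or "DEL key"
    bool valid;        // outcome of the CRC/length validation
};

// Rebuild state: snapshot first, then ordered WAL replay,
// stopping at the first corrupted entry (torn tail).
inline Kv recover(Kv snapshot, const std::vector<WalRecord>& wal) {
    for (const auto& rec : wal) {
        if (!rec.valid) break;          // keep the valid prefix only
        std::istringstream in(rec.op);
        std::string cmd, key, value;
        in >> cmd >> key;
        if (cmd == "PUT" && (in >> value)) snapshot[key] = value;
        else if (cmd == "DEL") snapshot.erase(key);
    }
    return snapshot;                    // same snapshot + WAL => same state
}
```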

🧪 Failure Model

Handled:

  • Process crashes (kill -9)
  • Partial/torn writes
  • Interrupted disk writes
  • Node failures via Raft replication (v0.5+)

Not yet handled:

  • Byzantine faults
  • Performance tuning beyond batching

⚙️ How to Run

cmake -S . -B build
cmake --build build
./build/kv_cli

🖥️ CLI Usage

Start KV CLI

./build/kv_cli

Commands

PUT key value
GET key
DEL key
SIZE
SNAP
FLUSH
EXIT

🌐 Raft Replication Demo

This demo shows leader election, log replication, and fault-tolerant recovery using a 3-node Raft cluster.

Features Demonstrated

  • Automatic leader election
  • Strong consistency via majority commit
  • Log replication across nodes
  • Leader crash and re-election
  • Follower log catch-up after restart
  • Client redirection to active leader

Run Raft Simulation

./build/raft_demo

Example Output

[raft] node 3 became LEADER term=1
Leader is node 3
ProposePut(a=100) -> true
s1.get(a)=100
s2.get(a)=100
s3.get(a)=100

=== Simulating leader crash ===
[raft] node 2 became LEADER term=2
New leader is node 2
ProposePut(b=200) -> true
s1.get(b)=200
s2.get(b)=200
s3.get(b)=200

This demonstrates Raft’s guarantees of:

  • Leader-based coordination
  • Majority-based commit
  • Safety under node failures
  • Deterministic state convergence

💥 Crash Recovery Demo

./build/kv_cli
PUT x 100
PUT y 200
FLUSH

Crash:

pkill -9 kv_cli

Restart:

./build/kv_cli
GET x   # 100
GET y   # 200

Object Storage Recovery Demo

Write an object:

./build/object_store_write_demo

Recover after restart:

./build/object_store_recover_demo

📊 Performance Benchmark Demo

Run benchmark with different durability settings:

Immediate flush (no batching)

./build/kv_cli
SETBATCH 1
BENCH 10000

Group commit (batched flush)

./build/kv_cli
SETBATCH 5
BENCH 10000

📁 Storage Files

/tmp/kv.wal        → Write-Ahead Log
/tmp/kv.snapshot   → Snapshot file

🛠️ Build Instructions

cmake -S . -B build
cmake --build build


🧠 What This Project Demonstrates

Storage Engine Concepts

  • WAL durability
  • Crash consistency
  • Deterministic replay
  • Snapshotting & compaction

Distributed Systems Concepts

  • Consensus protocols
  • Leader election
  • Log replication
  • Quorum commits
  • Failure recovery

Systems Engineering

  • Correctness-first design
  • Failure-aware architecture
  • Deterministic state machines
  • Clean layering (storage → replication → clients)

🔮 Roadmap

Replicated Metadata Service

A control-plane metadata store for distributed systems:

  • Shard ownership metadata
  • Leader leases
  • Rebalancing coordination

LLM Serving Cache Coordinator

  • KV-cache placement metadata
  • Node failure reassignment
  • Session routing
  • AI inference infrastructure support

Formal Correctness & Testing

  • Failure injection tests
  • Jepsen-style validation
  • Deterministic replay testing

🎯 Project Goals

This project is not about building the fastest KV store. It is about understanding:

  • How data survives crashes
  • How write durability actually works
  • How distributed replicas stay consistent
  • How consensus protocols coordinate state
  • How storage engines underpin AI & cloud infrastructure

🎓 Educational Value

This project serves as a hands-on exploration of:

  • Database internals
  • Distributed consensus
  • Systems reliability engineering
  • AI infrastructure foundations

👤 Author

Nasit Sarwar Sony
Distributed Systems & AI Infrastructure Engineer

Built as part of a deep exploration into: Distributed Systems • Storage Engines • Consensus Protocols • Fault Tolerance • AI Infrastructure

📜 License

MIT License
