Status: Phase 5 (Complete: Raft Consensus & Log Replication)
License: MIT
Code: https://github.com/MChandrahas/Distributed-KV-Store
This is an academic-grade exploration of distributed systems fundamentals. The goal is to build, break, and learn by implementing the core algorithms from scratch, in order to understand why systems like etcd, CockroachDB, and ZooKeeper are as complex as they are.
Core Architecture: 3-Node Cluster with Raft Consensus
Key Lesson: Availability is easy; Consistency is hard. You cannot have a distributed database without a Write-Ahead Log and a Consensus Algorithm.
```mermaid
graph TD
    Client[Client] -->|1. PUT key=500| Leader[Leader Node]
    subgraph "Raft Cluster (gRPC/Protobuf)"
        Leader -->|2. Write WAL & RAM| LeaderStore[(Leader Storage)]
        Leader -->|3. Replicate Data| Follower1[Follower Node 1]
        Leader -->|3. Replicate Data| Follower2[Follower Node 2]
        Follower1 -->|4. Write WAL & RAM| F1Store[(Follower Storage)]
        Follower2 -->|4. Write WAL & RAM| F2Store[(Follower Storage)]
    end
    Client -.->|5. GET key=500| Follower1
```
- Consensus: Raft Algorithm (Leader Election, Randomized Timeouts, Term Numbers)
- Replication: Log Replication (Leader pushes updates to followers; strong consistency)
- Persistence: Write-Ahead Log (WAL) ensures crash recovery
- Storage Engine: In-memory B+ Tree (Order 128) backed by disk
- Communication: gRPC (Protobuf) over Netty
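As a rough illustration of the Persistence point above, here is a minimal, self-contained sketch of the write-ahead-log idea (the class name and log format are illustrative, not the project's actual code): every mutation is appended and synced to disk before it is applied in memory, so a restarted node can rebuild its state by replaying the log.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.List;
import java.util.TreeMap;

// Illustrative WAL: append each PUT to disk before applying it in memory.
// In the real cluster this append happens on the leader (step 2 in the diagram)
// and again on each follower after replication (step 4).
public class WalSketch {
    private final Path logFile;
    private final TreeMap<String, String> memTable = new TreeMap<>(); // stand-in for the B+ tree

    public WalSketch(Path logFile) throws IOException {
        this.logFile = logFile;
        replay(); // crash recovery: rebuild in-memory state from the log
    }

    public void put(String key, String value) throws IOException {
        String record = "PUT " + key + " " + value + System.lineSeparator();
        // Durability first: acknowledge the write only after it is on disk.
        Files.writeString(logFile, record,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND, StandardOpenOption.SYNC);
        memTable.put(key, value);
    }

    public String get(String key) {
        return memTable.get(key);
    }

    private void replay() throws IOException {
        if (!Files.exists(logFile)) return;
        List<String> lines = Files.readAllLines(logFile, StandardCharsets.UTF_8);
        for (String line : lines) {
            String[] parts = line.split(" ", 3);
            if (parts.length == 3 && parts[0].equals("PUT")) {
                memTable.put(parts[1], parts[2]);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        WalSketch store = new WalSketch(Path.of("wal-sketch.log"));
        store.put("100", "Alice");
        System.out.println(store.get("100")); // survives a restart thanks to replay()
    }
}
```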
To meet modern observability standards, I implemented a Sidecar Pattern using Go and Kubernetes.
- Architecture: The Java application runs in a Pod alongside a Go-based Prometheus Exporter.
- Mechanism: The Go sidecar scrapes the Java app's internal JSON state (`/status`) over localhost and converts it into the OpenMetrics standard format (`/metrics`) for Prometheus (a minimal Java-side sketch follows the diagram below).
- Orchestration: Deployed on MicroK8s using StatefulSets to maintain stable network identity for Raft consensus.
```mermaid
graph LR
    subgraph "Kubernetes Pod"
        Java[Java KV Store] -- "JSON (localhost:8080)" --> Go[Go Exporter]
    end
    Go -- "Metrics (Port 9100)" --> Prometheus[Prometheus / Grafana]
```
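A minimal sketch of the Java side of this handoff (the JSON fields and handler code are assumptions for illustration; only the `/status` path, localhost, and port 8080 come from the description above): the node serves its internal state as JSON, and the Go exporter polls it and republishes the values on port 9100 in OpenMetrics format.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Illustrative /status endpoint: the JSON shape here is an assumption,
// not the project's real schema. The Go sidecar scrapes this over
// localhost and rewrites it as OpenMetrics for Prometheus.
public class StatusEndpointSketch {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 8080), 0);
        server.createContext("/status", exchange -> {
            String json = "{\"role\":\"LEADER\",\"term\":7,\"commitIndex\":42,\"keys\":1024}";
            byte[] body = json.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        System.out.println("Serving node state on http://127.0.0.1:8080/status");
    }
}
```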
- Java 17+
- Docker & Docker Compose
```bash
# 1. Start 3 nodes (Raft Cluster)
docker-compose up -d

# 2. View leader election in real time
docker-compose logs -f | grep "LEADER"

# 3. Write to the Leader (e.g., Port 9093)
./gradlew runClient --args="9093 PUT 100 Alice"

# 4. Read from a Follower (e.g., Port 9091)
./gradlew runClient --args="9091 GET 100"
```
I used this project to verify distributed systems theory. Below is the critical failure that forced the Raft implementation.
Phase: Week 2 (Naive Replication)
Status: SOLVED (Phase 4)
1. Started cluster (Nodes 1, 2, 3)
2. Killed the leader (Node 1)
3. Result: Nodes 2 and 3 both promoted themselves to leader
4. Impact:
   - Node 2 accepted `Key=A`
   - Node 3 accepted `Key=B`
   - Data diverged permanently
I implemented Raft Leader Election. Now, when a leader dies:
- Followers wait for a randomized timeout (e.g., 1000-2000 ms to account for Docker latency)
- They request votes with a term number
- Only one node can achieve majority (2/3)
- Result: No two nodes can win the same term, so split-brain is impossible (see the sketch below)
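A compact sketch of these election rules (class and method names are illustrative, not the project's implementation): each node waits a random 1000-2000 ms before campaigning, a candidate bumps its term and requests votes, and every peer grants at most one vote per term, so only one candidate can reach the 2-of-3 majority.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative election rules only: no networking, no log-up-to-date check.
public class ElectionSketch {
    final int id;
    int currentTerm = 0;
    Integer votedFor = null; // at most one vote per term: the key to avoiding split-brain

    ElectionSketch(int id) { this.id = id; }

    // Randomized timeout (1000-2000 ms) so two followers rarely time out together.
    static long randomElectionTimeoutMs() {
        return ThreadLocalRandom.current().nextLong(1000, 2001);
    }

    // Peer-side rule: adopt a newer term, then grant the vote only if we have
    // not already voted for someone else in that term.
    boolean grantVote(int candidateTerm, int candidateId) {
        if (candidateTerm > currentTerm) {
            currentTerm = candidateTerm;
            votedFor = null;
        }
        if (candidateTerm == currentTerm && (votedFor == null || votedFor == candidateId)) {
            votedFor = candidateId;
            return true;
        }
        return false;
    }

    // Candidate-side rule: start a new term, vote for self, ask the peers.
    boolean runForLeader(Map<Integer, ElectionSketch> peers) {
        currentTerm++;
        votedFor = id;
        int votes = 1; // own vote
        for (ElectionSketch peer : peers.values()) {
            if (peer.id != id && peer.grantVote(currentTerm, id)) votes++;
        }
        return votes > peers.size() / 2; // needs 2 of 3
    }

    public static void main(String[] args) {
        Map<Integer, ElectionSketch> cluster = new HashMap<>();
        for (int i = 1; i <= 3; i++) cluster.put(i, new ElectionSketch(i));
        System.out.println("Node 2 timeout: " + randomElectionTimeoutMs() + " ms");
        System.out.println("Node 2 elected: " + cluster.get(2).runForLeader(cluster));
        // If Node 3 campaigned for the same term, its peers' votedFor would already
        // be set, so it could not also reach a majority in that term.
    }
}
```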
Test Setup: Docker Desktop (WSL2) on Local Machine. Single-threaded blocking client.
| Operation | Scenario | Latency (Avg) | Throughput |
|---|---|---|---|
| B+ Tree Insert | In-Memory Unit Test | ~0.004 ms | ~250,000 op/s |
| Cluster Write | 3-Node Raft (Leader) | ~15.0 ms* | ~60 op/s |
| Cluster Read | 3-Node Raft (Follower) | ~2.0 ms | ~500 op/s |
*Note: Write latency is intentionally high. It includes Synchronous Replication to 3 nodes and Disk I/O (WAL) to ensure Strong Consistency.
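For context on the first row, a rough sketch of how a per-operation insert latency like this is typically measured in a unit test (`java.util.TreeMap` stands in here for the project's B+ tree, so the absolute numbers will differ):

```java
import java.util.TreeMap;

// Rough single-threaded micro-benchmark; TreeMap is only a stand-in for the B+ tree.
// Not a rigorous harness (no JMH), just the shape of the measurement.
public class InsertBenchSketch {
    public static void main(String[] args) {
        TreeMap<Integer, String> tree = new TreeMap<>();
        int n = 250_000;

        // Warm-up pass so the JIT has compiled the hot path before timing.
        for (int i = 0; i < n; i++) tree.put(i, "warmup");
        tree.clear();

        long start = System.nanoTime();
        for (int i = 0; i < n; i++) tree.put(i, "value-" + i);
        long elapsedNs = System.nanoTime() - start;

        double msPerOp = (elapsedNs / 1e6) / n;
        double opsPerSec = n / (elapsedNs / 1e9);
        System.out.printf("avg %.6f ms/op, %.0f op/s%n", msPerOp, opsPerSec);
    }
}
```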
Explicit gaps compared to production systems (e.g., etcd):
- Log Compaction: WAL grows indefinitely; snapshotting needed
- Dynamic Membership: Static peer list via env vars
- Batching: Each `PUT` is replicated individually
- Sharding: No horizontal partitioning (replicated state machine)
- Java 17: Core logic & B+ Tree engine
- gRPC / Protobuf: RPC layer for node communication
- Netty: Underlying non-blocking I/O framework
- Docker Compose: Cluster orchestration
- Gradle 8.5: Build tool
MIT. Use this to learn from my mistakes.