feat: Raft clustering with TLS, streaming snapshots, and production hardening #74

Merged
ApiliumDevTeam merged 14 commits into dev from feature/clustering-v0.5.0
Mar 13, 2026

@ApiliumDevTeam
Contributor

Summary

  • WAL (Write-Ahead Log): New aingle_wal crate with segment rotation, CRC32 checksums, and crash-safe durability
  • Raft consensus: New aingle_raft crate — leader election, log replication, streaming snapshots (512KB chunks with ACK), blake3 snapshot checksums, atomic dual-lock state restore
  • Cluster orchestration: cluster_init module with TLS support (self-signed via rcgen or custom PEM certs), HMAC-authenticated inter-node RPCs, exponential backoff join retries
  • CRDT conflict resolution: LWW registers and OR-Set for concurrent writes across nodes
  • Cluster endpoints: /cluster/status, /members, /join, /leave, /wal/stats, /wal/verify
  • Quorum reads: Optional X-Consistency: quorum header for strong consistency
  • Ineru memory replication: LTM replicated via Raft, snapshot transfer between nodes
  • Production hardening: Constant-time secret comparison (subtle), Raft shutdown timeout, config validation, learner rollback on failed membership changes, snapshot buffer TTL eviction (5min/256MB cap)
  • README: Added clustering section, Consensus Layer in architecture diagram, Mayros AI badge, Rust version bump to 1.83
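
The constant-time secret comparison listed under production hardening can be illustrated with a minimal, dependency-free sketch. The PR states that it uses the `subtle` crate; the `constant_time_eq` helper below is hypothetical and only shows the idea of folding all byte differences together instead of returning at the first mismatch. An optimizing compiler may weaken this guarantee, which is one reason to prefer a vetted crate in real code.

```rust
/// Hypothetical helper (not from the PR): compare two byte slices without
/// returning early on the first mismatching byte.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    // Assumption: the length of the secret is not itself sensitive,
    // so a length mismatch may return immediately.
    if a.len() != b.len() {
        return false;
    }
    // OR together the XOR of every byte pair; the result stays 0 only
    // if every pair is equal.
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```

A caller would pass the received shared secret and the configured one, and reject the RPC when the function returns `false`.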

Crates added

| Crate | Purpose |
|---|---|
| aingle_wal | Write-Ahead Log with segment management |
| aingle_raft | Raft consensus (openraft 0.10.0-alpha.17) |

Test plan

  • cargo check --workspace passes
  • cargo check -p aingle_cortex --features cluster passes
  • cargo test -p aingle_raft — 33/33 tests pass
  • cargo test -p aingle_cortex --features cluster --lib — 144/144 tests pass
  • cargo test -p aingle_cortex --features cluster --test cluster_integration_test — 3/3 integration tests pass (single-node bootstrap, 3-node replication, WAL stats/verify)

ApiliumDevTeam and others added 14 commits March 11, 2026 22:44
New crate `aingle_wal` with segment-based WAL, hash chain integrity,
thread-safe writer, reader with replay/verification, and segment rotation.
WAL integrated into AppState and mutation paths (triples, memory) behind
`#[cfg(feature = "cluster")]`. All 20 WAL tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
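
As an illustration of the CRC32 checksums mentioned in the summary, here is a minimal sketch of a checksummed WAL record. The `[len][crc][payload]` layout and the names `encode_record` and `decode_record` are assumptions made for this example and are not taken from `aingle_wal`; the hash chain described in the commit message is not modeled.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones if the low bit is set, otherwise zero.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical record layout: [len: u32 LE][crc: u32 LE][payload].
pub fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Returns the payload only if the record is complete and its checksum
/// matches; a torn or corrupted record yields `None`.
pub fn decode_record(buf: &[u8]) -> Option<&[u8]> {
    if buf.len() < 8 {
        return None;
    }
    let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let crc = u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]);
    let end = 8usize.checked_add(len)?;
    let payload = buf.get(8..end)?;
    if crc32(payload) == crc {
        Some(payload)
    } else {
        None
    }
}
```

During replay, a reader built this way would stop at the first record for which `decode_record` returns `None`, treating it as the end of the valid log.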
New crate `aingle_raft` with openraft TypeConfig, log store, state machine,
network layer, and consistency levels. Cluster REST endpoints added
(status, join, leave, members, WAL stats/verify). P2pMessage extended
with Raft + cluster variants. CLI flags for cluster mode. All 15 raft
tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ConsistencyLevel enum (Local/Quorum/Linearizable) with header parsing.
Read endpoints (get_triple, list_triples) now accept X-Consistency header
and route through appropriate consistency logic when cluster feature is
enabled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
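
The header parsing described above might look roughly like the following sketch. The variant names come from the commit message; the `from_header` function, the case-insensitive matching, and the fallback to `Local` when the header is absent are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConsistencyLevel {
    Local,
    Quorum,
    Linearizable,
}

impl ConsistencyLevel {
    /// Parse the value of an `X-Consistency` header.
    /// Assumption: a missing or empty header means `Local`, and unknown
    /// values are rejected rather than silently downgraded.
    pub fn from_header(value: Option<&str>) -> Result<Self, String> {
        match value.map(|v| v.trim().to_ascii_lowercase()).as_deref() {
            None | Some("") | Some("local") => Ok(Self::Local),
            Some("quorum") => Ok(Self::Quorum),
            Some("linearizable") => Ok(Self::Linearizable),
            Some(other) => Err(format!("unsupported X-Consistency value: {}", other)),
        }
    }
}
```

Rejecting unknown values keeps a typo in the header from quietly producing a weaker read than the client asked for.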
LwwTriple (Last-Writer-Wins with deterministic tie-break by node_id)
and OrSet (Observed-Remove Set for triple existence) implemented in
aingle_graph behind `#[cfg(feature = "crdt")]`. Merge is commutative,
associative, and idempotent. All 9 CRDT tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
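
A Last-Writer-Wins merge with a deterministic tie-break by node ID can be sketched as below. `LwwRegister` and its fields are hypothetical stand-ins rather than the actual `LwwTriple` type, and the sketch assumes each `(timestamp, node_id)` pair identifies a single write.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LwwRegister<T> {
    pub value: T,
    pub timestamp: u64,
    pub node_id: u64,
}

impl<T: Clone> LwwRegister<T> {
    /// Keep whichever write has the greater (timestamp, node_id) pair.
    /// Comparing the pair as a tuple makes the node ID the tie-break
    /// when two timestamps are equal.
    pub fn merge(&self, other: &Self) -> Self {
        if (other.timestamp, other.node_id) > (self.timestamp, self.node_id) {
            other.clone()
        } else {
            self.clone()
        }
    }
}
```

Because the merge simply selects the maximum under a total order, it is commutative, associative, and idempotent, which are the properties the commit message claims for the real implementation.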
…sfer)

ClusterSnapshot with TripleSnapshot wire format for full state transfer.
STM explicitly excluded (node-local). HNSW index rebuilt locally from
replicated LTM. LTM WAL entry kinds (LtmEntityCreate, LtmLinkCreate,
LtmEntityDelete) already present from Phase 1. Snapshot serialization
roundtrip tested. All 18 raft tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bump all 10 product crates from 0.4.2 → 0.5.0 and update internal
dependency version ranges from "0.4" → "0.5" to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement RaftLogReader and RaftLogStorage for CortexLogStore with
WAL-backed persistence. Vote and committed state persisted to JSON files.
Recovery on restart reads WAL segments to rebuild the in-memory BTreeMap.
Add RaftEntry and Noop variants to WalEntryKind.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
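
Rebuilding an in-memory `BTreeMap` from a WAL on restart can be sketched as a simple replay loop. The `WalRecord` enum here is hypothetical and much simpler than the crate's `WalEntryKind`; it only models appends plus truncation and purge markers to show how replay order determines the final log.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified WAL record used only for this sketch.
pub enum WalRecord {
    Append { index: u64, payload: Vec<u8> },
    /// Drop every entry with index >= `from` (conflicting suffix).
    Truncate { from: u64 },
    /// Drop every entry with index <= `up_to` (already snapshotted).
    Purge { up_to: u64 },
}

/// Replay records in write order to rebuild the in-memory log.
pub fn replay(records: impl IntoIterator<Item = WalRecord>) -> BTreeMap<u64, Vec<u8>> {
    let mut log = BTreeMap::new();
    for rec in records {
        match rec {
            WalRecord::Append { index, payload } => {
                log.insert(index, payload);
            }
            WalRecord::Truncate { from } => {
                // split_off returns the entries >= `from`; discard them.
                let _ = log.split_off(&from);
            }
            WalRecord::Purge { up_to } => {
                // Keep only the entries strictly greater than `up_to`.
                log = log.split_off(&(up_to + 1));
            }
        }
    }
    log
}
```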
Connect CortexStateMachine to real GraphDB and IneruMemory so Raft-committed
mutations are applied: TripleInsert/Delete to graph, MemoryStore/Forget to
Ineru LTM. Add CortexSnapshotBuilder for full-state snapshots.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement RaftNetworkFactory and RaftNetworkV2 for CortexNetworkConnection.
Add RaftRpcSender trait to decouple from QUIC transport, enabling stub
senders during bootstrap and real P2P transport at runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bootstrap openraft::Raft in main.rs with CortexLogStore, CortexStateMachine,
and CortexNetworkFactory. Add raft and cluster_node_id fields to AppState.
Single-node cluster auto-initializes when no peers are configured.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Route triple and memory writes through Raft in cluster mode. Add
ensure_linearizable guards to GET handlers honoring X-Consistency header
(linearizable via ReadIndex, quorum via LeaseRead, local passthrough).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite cluster status/join/leave/members endpoints to use real Raft
metrics (role, term, leader, membership). Join adds learner then promotes
to voter; leave removes node from voter set via change_membership.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d Mayros badge

- Add Clustering section with 3-node quickstart, TLS, and endpoint reference
- Add Consensus Layer (Raft, WAL, Streaming Snapshots, TLS) to architecture diagram
- Add aingle_raft and aingle_wal to platform components table
- Add "Powers Mayros AI" badge linking to ApiliumCode/mayros
- Update Rust version badge and prerequisites from 1.70 to 1.83
- Add cluster build command to quickstart
Refactors cluster initialization into a dedicated, reusable module.
Implements robust HTTP-based Raft RPC with TLS encryption, shared secret
authentication, and exponential backoff for inter-node communication.
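
Exponential backoff for join retries can be sketched with the standard library alone. The `backoff_delay` function, its parameters, and the cap are assumptions for illustration; the PR does not state the base delay, the maximum, or whether jitter is applied.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt,
/// capped at `max`. Hypothetical helper, not taken from the PR.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // checked_pow and checked_mul avoid overflow for large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

With a base of 100 ms and a cap of 5 s, the delays grow as 100 ms, 200 ms, 400 ms, and so on until they reach the cap. Adding random jitter on top is a common refinement when many nodes retry at once.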

Adds automatic leader redirection (HTTP 307) for client requests and
cluster management operations to improve client routing and cluster availability.

Introduces chunked snapshot transfers with Blake3 integrity checksums
for efficient and reliable state replication, especially for large datasets.
Improves WAL durability by persisting purge and truncation boundaries.
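
The chunked snapshot transfer with per-chunk acknowledgement and a whole-snapshot checksum can be sketched as follows. This is an in-process model only: `SnapshotReceiver`, `transfer`, and `checksum` are hypothetical names, the chunk size assumes 512 KB means 512 * 1024 bytes, and FNV-1a is used purely as a dependency-free stand-in for the Blake3 checksum the PR describes.

```rust
/// Chunk size from the PR summary, assumed here to mean 512 * 1024 bytes.
pub const CHUNK_SIZE: usize = 512 * 1024;

/// FNV-1a 64-bit hash: a stand-in for blake3, NOT what the PR uses.
pub fn checksum(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Receiver side: accepts chunks strictly in order and acknowledges each.
#[derive(Default)]
pub struct SnapshotReceiver {
    buf: Vec<u8>,
    next_index: usize,
}

impl SnapshotReceiver {
    /// Returns `Some(index)` as the ACK, or `None` for an out-of-order chunk.
    pub fn accept(&mut self, index: usize, chunk: &[u8]) -> Option<usize> {
        if index != self.next_index {
            return None;
        }
        self.buf.extend_from_slice(chunk);
        self.next_index += 1;
        Some(index)
    }

    /// Verify the whole-snapshot checksum before releasing the bytes.
    pub fn finish(self, expected: u64) -> Result<Vec<u8>, String> {
        if checksum(&self.buf) == expected {
            Ok(self.buf)
        } else {
            Err("snapshot checksum mismatch".to_string())
        }
    }
}

/// Sender side: send one chunk at a time and require an ACK for each.
/// Returns the number of chunks sent.
pub fn transfer(snapshot: &[u8], rx: &mut SnapshotReceiver) -> Result<usize, String> {
    let mut sent = 0;
    for (i, chunk) in snapshot.chunks(CHUNK_SIZE).enumerate() {
        match rx.accept(i, chunk) {
            Some(ack) if ack == i => sent += 1,
            _ => return Err(format!("chunk {} was not acknowledged", i)),
        }
    }
    Ok(sent)
}
```

Verifying the checksum only in `finish` means a corrupted transfer is rejected before any state is restored, which matches the intent of pairing chunking with an integrity check.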

Ensures data consistency by routing all write operations through Raft when
clustering is enabled, preventing direct writes and potential split-brain.

Includes comprehensive integration tests for cluster functionality.
@ApiliumDevTeam ApiliumDevTeam merged commit 32a62b3 into dev Mar 13, 2026
4 checks passed