
feat: 14.4 replication, failover & recovery #59

Merged

vieiralucas merged 11 commits into main from feat/14.4-replication-failover-recovery on Mar 18, 2026

Conversation

@vieiralucas (Member) commented Mar 11, 2026

Summary

  • Raft data replication: Queue-level Raft state machines now apply committed entries (enqueue, ack, nack) to broker storage on ALL nodes — not just the leader. Followers have replicated data at all times, enabling zero-loss failover.
  • Leader change detection: a watch_leader_changes() background task polls queue Raft groups for leadership transitions. On leader gain it sends RecoverQueue (rebuild scheduler state); on leader loss, DropQueueConsumers (close consumer streams). A sketch of this loop follows the list.
  • Per-queue scheduler recovery: recover_queue() rebuilds DRR keys, pending index, and leased_msg_keys from RocksDB for a single queue without disrupting other queues.
  • Consumer stream leader-awareness: consume() handler rejects non-leader nodes with UNAVAILABLE status, directing clients to reconnect to the leader.
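
A minimal sketch of the watcher loop described above. Only watch_leader_changes, RecoverQueue, and DropQueueConsumers come from this PR; the leadership query, channel, queue-id type, and poll interval are illustrative assumptions.

```rust
use std::{collections::HashMap, sync::Arc, time::Duration};
use tokio::sync::mpsc;

async fn watch_leader_changes(
    multi_raft: Arc<MultiRaftManager>, // project type; queried shape is assumed
    cmd_tx: mpsc::Sender<Command>,     // scheduler command channel (assumed)
) {
    let mut leading: HashMap<QueueId, bool> = HashMap::new();
    loop {
        for (queue_id, is_leader) in multi_raft.leadership_snapshot() {
            match (leading.get(&queue_id).copied(), is_leader) {
                // Promotion, or first sight of a queue this node already leads:
                (Some(false), true) | (None, true) => {
                    // Record the new state only when the send succeeds, so a
                    // full channel is retried on the next poll.
                    if cmd_tx
                        .send(Command::RecoverQueue { queue_id: queue_id.clone() })
                        .await
                        .is_ok()
                    {
                        leading.insert(queue_id, true);
                    }
                }
                // Demotion: close consumer streams so clients reconnect.
                (Some(true), false) => {
                    if cmd_tx
                        .send(Command::DropQueueConsumers { queue_id: queue_id.clone() })
                        .await
                        .is_ok()
                    {
                        leading.insert(queue_id, false);
                    }
                }
                // No transition observed: just remember the current state.
                _ => {
                    leading.insert(queue_id, is_leader);
                }
            }
        }
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
}
```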

Acceptance Criteria Covered

  1. Raft leader replicates everything via log — quorum-committed writes
  2. Automatic failover with new leader election within 10 seconds
  3. Consumer disconnection and reconnection on node failure
  4. Node rejoin and catch-up via Raft log/snapshot
  5. Integration tests: kill → failover → zero message loss → rejoin
  6. Scheduler state rebuild on leader promotion
  7. Single-node mode behavior unchanged (316/316 tests pass)

Test Summary

  • 316 tests total (up from 313), 0 failures
  • 3 new integration tests (one sketched after this list):
    • test_cluster_failover_new_leader_elected — kill leader, verify new leader <10s
    • test_cluster_failover_zero_message_loss — enqueue 5, kill leader, consume all 5
    • test_cluster_node_rejoin_catchup — kill node, enqueue, restart, verify catch-up
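
An illustrative outline of the zero-loss test; the TestCluster harness and its methods are assumed names, not the real test API.

```rust
#[tokio::test]
async fn cluster_failover_zero_message_loss() {
    let mut cluster = TestCluster::start(3).await;        // 3-node cluster
    let queue = cluster.create_queue("orders").await;
    for i in 0..5 {
        queue.enqueue(format!("msg-{i}")).await.unwrap(); // quorum-committed
    }
    cluster.kill_leader(&queue).await;                    // hard-kill the queue leader
    cluster
        .wait_for_new_leader(&queue, std::time::Duration::from_secs(10))
        .await;
    let delivered = queue.consume_all().await;            // reconnects to the new leader
    assert_eq!(delivered.len(), 5, "no committed message may be lost");
}
```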

Files Changed

| File | Change |
| --- | --- |
| crates/fila-core/src/broker/command.rs | RecoverQueue, DropQueueConsumers commands |
| crates/fila-core/src/broker/scheduler/mod.rs | Command handlers |
| crates/fila-core/src/broker/scheduler/recovery.rs | recover_queue(), drop_queue_consumers() |
| crates/fila-core/src/cluster/mod.rs | watch_leader_changes() |
| crates/fila-core/src/cluster/multi_raft.rs | broker_storage, snapshot_groups() |
| crates/fila-core/src/cluster/store.rs | apply_to_broker_storage() |
| crates/fila-core/src/cluster/tests.rs | 3 failover/rejoin tests |
| crates/fila-server/src/service.rs | Leader check in consume() |
| crates/fila-server/src/main.rs | Wiring: leader watcher, broker storage |

🤖 Generated with Claude Code


Summary by cubic

Adds queue data replication and fast failover with zero‑loss recovery. Queue Raft groups apply committed enqueue/ack/nack to local storage on all nodes; on leader change the new leader rebuilds per‑queue scheduler state and consumers reconnect to the leader. Implements Story 14.4.

  • New Features

    • Queue Raft groups apply committed enqueue/ack/nack to broker RocksDB on all nodes; server wires broker storage into MultiRaft and passes it at construction to MultiRaftManager/FilaRaftStore::for_queue.
    • Leadership watcher: watch_leader_changes() sends RecoverQueue on promotion and DropQueueConsumers on loss; triggers recovery on first‑sight leader and retries on send failure. consume() rejects non‑leaders with UNAVAILABLE and returns NOT_FOUND if the group is missing.
    • Per‑queue recovery rebuilds DRR keys, pending index, and leased keys from RocksDB without touching other queues; delivers pending after recovery.
    • Hardening: propagate storage errors from apply_to_broker_storage(), delete orphaned lease‑expiry entries on ack/nack, warn if a group is created without broker storage, and use explicit match arms for exhaustiveness.
    • Integration tests cover leader election (<10s), zero‑loss failover, and node rejoin/catch‑up; single‑node mode unchanged.
  • Migration

    • Clients must handle UNAVAILABLE from consume() by reconnecting to the queue leader (see the sketch below).
    • No config changes; the leader watcher runs only in cluster mode.
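
A sketch of the reconnect loop a client might use, assuming a tonic-generated client; discover_leader and connect_to are hypothetical helpers, while tonic::Code::Unavailable is the real status code named above.

```rust
// Inside an async context with `queue_id` and `request` in scope.
let mut client = connect_to(initial_addr).await?;
let stream = loop {
    match client.consume(request.clone()).await {
        Ok(resp) => break resp.into_inner(),
        Err(status) if status.code() == tonic::Code::Unavailable => {
            // This node is not the queue leader: rediscover and reconnect.
            let leader_addr = discover_leader(&queue_id).await?;
            client = connect_to(leader_addr).await?;
        }
        Err(other) => return Err(other.into()),
    }
};
```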

Written for commit 149a325. Summary will update on new commits.

vieiralucas added a commit that referenced this pull request Mar 11, 2026

@cubic-dev-ai (Bot) left a comment

8 issues found across 12 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/fila-core/src/cluster/store.rs">

<violation number="1" location="crates/fila-core/src/cluster/store.rs:413">
P1: `apply_to_broker_storage` silently swallows storage errors (log-only). Since this runs in the Raft `apply_to_state_machine` path, a failed mutation means the local broker storage diverges from the committed log. On failover, the new leader could be missing data. Consider returning a `Result` and propagating the error as a `StorageError` from `apply_to_state_machine`, which would let Raft handle the failure appropriately.</violation>

<violation number="2" location="crates/fila-core/src/cluster/store.rs:607">
P1: Ack and Nack both do a linear scan of all messages in the queue (`list_messages`) to find a single message by ID. This runs in the Raft `apply_to_state_machine` path on every node for every committed entry. Consider adding a secondary index (msg_id → storage key) or including the full storage key in the `ClusterRequest::Ack`/`Nack` variants so followers can do a direct lookup.</violation>

<violation number="3" location="crates/fila-core/src/cluster/store.rs:614">
P1: Lease expiry entries are not cleaned up on Ack (or Nack). The comment says "Also clean up any lease/lease_expiry entries" but only `DeleteLease` is emitted — no `DeleteLeaseExpiry`. Parse the expiry timestamp from the lease value (via `parse_expiry_from_lease_value`) to construct the `lease_expiry_key` and add a `Mutation::DeleteLeaseExpiry` to the batch. Without this, orphaned expiry entries will trigger spurious expiration attempts.</violation>
</file>

<file name="crates/fila-core/src/broker/scheduler/recovery.rs">

<violation number="1" location="crates/fila-core/src/broker/scheduler/recovery.rs:337">
P1: No-op retain on `leased_msg_keys` leaves stale entries after per-queue recovery. Unlike `pending` and `pending_by_id` (which are properly filtered to remove this queue's entries), `retain(|_, _| true)` removes nothing. After the rebuild loop, messages that are no longer leased will still have ghost entries in `leased_msg_keys`, causing inconsistent scheduler state (e.g. wrong lease counts in metrics, stale lookups in `reclaim_expired_leases`).</violation>
</file>

<file name="crates/fila-core/src/cluster/multi_raft.rs">

<violation number="1" location="crates/fila-core/src/cluster/multi_raft.rs:62">
P1: `create_group()` should fail when broker storage is unset; currently it silently creates a queue Raft store that skips applying committed entries to broker storage.</violation>
</file>

<file name="crates/fila-core/src/cluster/mod.rs">

<violation number="1" location="crates/fila-core/src/cluster/mod.rs:347">
P1: `let _ =` silently discards the `send_command` result, then `leading` is unconditionally updated. If the command channel is full during failover load, recovery is lost and never retried because `was_leader` will be `true` on the next poll.

Only update `leading` on success; log and skip the update on failure so the next poll retries.</violation>

<violation number="2" location="crates/fila-core/src/cluster/mod.rs:354">
P2: Same silent-discard pattern: if `DropQueueConsumers` fails to send, `leading` is set to `false` and the drop is never retried. Consumers would remain connected to a non-leader, receiving stale state or errors.</violation>

<violation number="3" location="crates/fila-core/src/cluster/mod.rs:360">
P1: When the watcher first discovers a queue where this node is already leader, it records the state but does not trigger `RecoverQueue`. Any messages replicated to RocksDB between initial startup recovery and the first poll will be missing from the in-memory scheduler (DRR, pending index), so they won't be delivered.

Trigger recovery on first sight when `is_leader` is true, matching the `is_leader && !was_leader` branch.</violation>
</file>
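
For reference, a minimal sketch of the error-propagation fix suggested in store.rs violation 1. The function and error names follow the review; the signature and the build_mutations helper are assumptions.

```rust
fn apply_to_broker_storage(
    storage: &dyn StorageEngine,
    req: &ClusterRequest,
) -> Result<(), StorageError> {
    let mutations = build_mutations(storage, req)?; // was: log the error and continue
    storage.write_batch(mutations)?;                // a divergence now fails the apply
    Ok(())
}
```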

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread: crates/fila-core/src/cluster/store.rs (outdated)
        }
        super::types::ClusterRequest::Ack { queue_id, msg_id } => {
            // Find the message and its lease, then delete them.
            let msg_prefix = crate::storage::keys::message_prefix(queue_id);
@cubic-dev-ai (Bot) commented Mar 11, 2026

P1: Ack and Nack both do a linear scan of all messages in the queue (list_messages) to find a single message by ID. This runs in the Raft apply_to_state_machine path on every node for every committed entry. Consider adding a secondary index (msg_id → storage key) or including the full storage key in the ClusterRequest::Ack/Nack variants so followers can do a direct lookup.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/fila-core/src/cluster/store.rs, line 607:

<comment>Ack and Nack both do a linear scan of all messages in the queue (`list_messages`) to find a single message by ID. This runs in the Raft `apply_to_state_machine` path on every node for every committed entry. Consider adding a secondary index (msg_id → storage key) or including the full storage key in the `ClusterRequest::Ack`/`Nack` variants so followers can do a direct lookup.</comment>

<file context>
@@ -541,6 +569,108 @@ impl RaftStorage<TypeConfig> for FilaRaftStore {
+            }
+            super::types::ClusterRequest::Ack { queue_id, msg_id } => {
+                // Find the message and its lease, then delete them.
+                let msg_prefix = crate::storage::keys::message_prefix(queue_id);
+                match storage.list_messages(&msg_prefix) {
+                    Ok(messages) => {
</file context>

@vieiralucas (Member, Author) replied:

Dismissed — accepted for now.

The ClusterRequest::Ack only carries (queue_id, msg_id) but the storage key requires (queue_id, fairness_key, enqueued_at, msg_id). A proper fix requires either a secondary index or carrying the full key in ClusterRequest, which changes the Raft log format. For current workloads (single-digit to low-thousands messages per queue), this is acceptable. Deferring to a future optimization pass.

@cubic-dev-ai (Bot) replied:

Understood—thanks for the context. I’ll treat this as acceptable for current workloads and leave it as a future optimization.
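
For the record, one hypothetical shape the deferred fix could take is a msg_id to full-key secondary index maintained at enqueue time, so Ack/Nack becomes a point lookup. The key layout and helpers below are not from this PR.

```rust
// Build the index key: "idx/msg/<queue_id>/<msg_id>" -> full message key.
fn msg_id_index_key(queue_id: &[u8], msg_id: &[u8]) -> Vec<u8> {
    [b"idx/msg/".as_slice(), queue_id, b"/".as_slice(), msg_id].concat()
}
// enqueue: storage.put(&msg_id_index_key(q, m), &full_message_key)
// ack:     let full_key = storage.get(&msg_id_index_key(q, m))?; // no scan
```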

Comment thread: crates/fila-core/src/cluster/store.rs
Comment thread: crates/fila-core/src/broker/scheduler/recovery.rs (outdated)
Comment thread: crates/fila-core/src/cluster/multi_raft.rs (outdated)
Comment thread: crates/fila-core/src/cluster/mod.rs (outdated)
Comment thread: crates/fila-core/src/cluster/mod.rs (outdated)
Comment thread: crates/fila-core/src/cluster/mod.rs (outdated)
@github-actions (Contributor)

Benchmark Results (median of 3 runs)

Commit: 62d6311
Time: 2026-03-11T20:57:59Z

| Benchmark | Value | Unit |
| --- | --- | --- |
| compaction_active_p99 | 0.473112 | ms |
| compaction_idle_p99 | 0.469832 | ms |
| compaction_p99_delta | 0.001124000000000014 | ms |
| consumer_concurrency_100_throughput | 1890.3333333333333 | msg/s |
| consumer_concurrency_10_throughput | 1975.0 | msg/s |
| consumer_concurrency_1_throughput | 356.6666666666667 | msg/s |
| e2e_latency_p50_light | 0.404447 | ms |
| e2e_latency_p95_light | 0.455877 | ms |
| e2e_latency_p99_light | 0.6094040000000001 | ms |
| enqueue_throughput_1kb | 2696.7418220281847 | msg/s |
| enqueue_throughput_1kb_mbps | 2.633536935574399 | MB/s |
| fairness_accuracy_max_deviation | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-1 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-2 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-3 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-4 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-5 | 0.099999999999989 | % deviation |
| fairness_overhead_fair_throughput | 1424.5716548601458 | msg/s |
| fairness_overhead_fifo_throughput | 1464.9117913130938 | msg/s |
| fairness_overhead_pct | 2.844595961565055 | % |
| key_cardinality_10_throughput | 2447.183884629227 | msg/s |
| key_cardinality_10k_throughput | 663.230675843288 | msg/s |
| key_cardinality_1k_throughput | 1272.2347427741768 | msg/s |
| lua_on_enqueue_overhead_us | 10.181648252500167 | us |
| lua_throughput_with_hook | 1186.2266369922424 | msg/s |
| memory_per_message_overhead | 3133.0304 | bytes/msg |
| memory_rss_idle | 166.2421875 | MB |
| memory_rss_loaded_10k | 195.85546875 | MB |

@github-actions (Contributor)

Benchmark Results (median of 3 runs)

Commit: 867f1fd
Time: 2026-03-11T20:58:30Z

| Benchmark | Value | Unit |
| --- | --- | --- |
| compaction_active_p99 | 0.490282 | ms |
| compaction_idle_p99 | 0.510239 | ms |
| compaction_p99_delta | -0.01657700000000001 | ms |
| consumer_concurrency_100_throughput | 1793.0 | msg/s |
| consumer_concurrency_10_throughput | 1702.6666666666667 | msg/s |
| consumer_concurrency_1_throughput | 350.6666666666667 | msg/s |
| e2e_latency_p50_light | 0.416558 | ms |
| e2e_latency_p95_light | 0.5014620000000001 | ms |
| e2e_latency_p99_light | 0.631899 | ms |
| enqueue_throughput_1kb | 2615.1615512165185 | msg/s |
| enqueue_throughput_1kb_mbps | 2.5538687023598814 | MB/s |
| fairness_accuracy_max_deviation | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-1 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-2 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-3 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-4 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-5 | 0.099999999999989 | % deviation |
| fairness_overhead_fair_throughput | 1388.190565800054 | msg/s |
| fairness_overhead_fifo_throughput | 1428.9010057383225 | msg/s |
| fairness_overhead_pct | 3.0067873685114543 | % |
| key_cardinality_10_throughput | 2377.567703581095 | msg/s |
| key_cardinality_10k_throughput | 653.7513964310592 | msg/s |
| key_cardinality_1k_throughput | 1261.9387661039598 | msg/s |
| lua_on_enqueue_overhead_us | 17.47565649940202 | us |
| lua_throughput_with_hook | 1173.9290324682 | msg/s |
| memory_per_message_overhead | 2870.4768 | bytes/msg |
| memory_rss_idle | 166.78125 | MB |
| memory_rss_loaded_10k | 194.08984375 | MB |

@github-actions (Contributor)

Benchmark Results (median of 3 runs)

Commit: 2b35b79
Time: 2026-03-11T21:08:59Z

| Benchmark | Value | Unit |
| --- | --- | --- |
| compaction_active_p99 | 0.477944 | ms |
| compaction_idle_p99 | 0.4936400000000001 | ms |
| compaction_p99_delta | -0.0156960000000001 | ms |
| consumer_concurrency_100_throughput | 1889.0 | msg/s |
| consumer_concurrency_10_throughput | 1977.3333333333333 | msg/s |
| consumer_concurrency_1_throughput | 360.6666666666667 | msg/s |
| e2e_latency_p50_light | 0.409485 | ms |
| e2e_latency_p95_light | 0.465177 | ms |
| e2e_latency_p99_light | 0.570052 | ms |
| enqueue_throughput_1kb | 2640.9942127013487 | msg/s |
| enqueue_throughput_1kb_mbps | 2.579095910841161 | MB/s |
| fairness_accuracy_max_deviation | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-1 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-2 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-3 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-4 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-5 | 0.099999999999989 | % deviation |
| fairness_overhead_fair_throughput | 1398.1921196577605 | msg/s |
| fairness_overhead_fifo_throughput | 1424.9603837263985 | msg/s |
| fairness_overhead_pct | 1.9245231522195592 | % |
| key_cardinality_10_throughput | 2397.0841516416017 | msg/s |
| key_cardinality_10k_throughput | 660.806842855799 | msg/s |
| key_cardinality_1k_throughput | 1246.8257274276002 | msg/s |
| lua_on_enqueue_overhead_us | 16.625616853850147 | us |
| lua_throughput_with_hook | 1171.486309320429 | msg/s |
| memory_per_message_overhead | 2862.6944 | bytes/msg |
| memory_rss_idle | 166.609375 | MB |
| memory_rss_loaded_10k | 193.74609375 | MB |

@github-actions (Contributor)

Benchmark Results (median of 3 runs)

Commit: ffeb0c7
Time: 2026-03-11T21:09:39Z

| Benchmark | Value | Unit |
| --- | --- | --- |
| compaction_active_p99 | 0.503222 | ms |
| compaction_idle_p99 | 0.504818 | ms |
| compaction_p99_delta | 0.012414000000000036 | ms |
| consumer_concurrency_100_throughput | 1833.3333333333333 | msg/s |
| consumer_concurrency_10_throughput | 1956.6666666666667 | msg/s |
| consumer_concurrency_1_throughput | 361.6666666666667 | msg/s |
| e2e_latency_p50_light | 0.404147 | ms |
| e2e_latency_p95_light | 0.454002 | ms |
| e2e_latency_p99_light | 0.564263 | ms |
| enqueue_throughput_1kb | 2650.9465931462855 | msg/s |
| enqueue_throughput_1kb_mbps | 2.5888150323694195 | MB/s |
| fairness_accuracy_max_deviation | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-1 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-2 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-3 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-4 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-5 | 0.099999999999989 | % deviation |
| fairness_overhead_fair_throughput | 1294.1419318985088 | msg/s |
| fairness_overhead_fifo_throughput | 1336.0393881901125 | msg/s |
| fairness_overhead_pct | 3.294447499028408 | % |
| key_cardinality_10_throughput | 2398.7436926372793 | msg/s |
| key_cardinality_10k_throughput | 639.6102266447645 | msg/s |
| key_cardinality_1k_throughput | 1243.9461868879553 | msg/s |
| lua_on_enqueue_overhead_us | 18.80144959855579 | us |
| lua_throughput_with_hook | 1070.4383043541943 | msg/s |
| memory_per_message_overhead | 2818.4576 | bytes/msg |
| memory_rss_idle | 168.70703125 | MB |
| memory_rss_loaded_10k | 195.9296875 | MB |

@vieiralucas (Member, Author) left a comment

Cubic findings addressed

Fixed in commit cca7587:

  • #1 (P1): apply_to_broker_storage now returns Result<(), StorageError> and propagates errors instead of silently logging
  • #3 (P1): Added DeleteLeaseExpiry mutations in both Ack and Nack paths
  • #4 (P1): Fixed no-op leased_msg_keys.retain(|_, _| true) — now properly clears entries for the recovering queue (sketched after the Dismissed note)
  • #5 (P1): Added warning log when create_group() is called without broker_storage set
  • #6 (P1): send_command result is now checked; leading state only updated on success so next poll retries
  • #7 (P2): Same fix as #6 for DropQueueConsumers
  • #8 (P1): First-sight leader now triggers RecoverQueue to catch messages replicated between startup and first poll

Dismissed

  • #2 (P1): O(n) linear scan in Ack/Nack apply_to_broker_storage — accepted for now. The ClusterRequest::Ack only carries (queue_id, msg_id), but the storage key requires (queue_id, fairness_key, enqueued_at, msg_id). Adding a secondary index or carrying the full key in ClusterRequest would be the proper fix, but is a design change that affects the Raft log format (serialized ClusterRequest). Deferring to a future optimization pass. For current workloads (single-digit to low-thousands messages per queue), this is acceptable.
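
The fix for #4 amounts to replacing the no-op closure with a real filter; a sketch under assumed names (LeasedKey and its queue_id field are illustrative):

```rust
fn clear_queue_leases(
    leased_msg_keys: &mut std::collections::HashMap<MsgId, LeasedKey>,
    queue_id: &QueueId,
) {
    // The original closure `retain(|_, _| true)` kept everything.
    // Drop only this queue's entries; other queues' leases stay untouched.
    leased_msg_keys.retain(|_, key| &key.queue_id != queue_id);
    // recover_queue() then re-populates entries for `queue_id` from RocksDB.
}
```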

@vieiralucas vieiralucas force-pushed the feat/14.3-request-routing-transparent-delivery branch from 95de06f to 8c8bcc1 on March 18, 2026 at 11:43
vieiralucas added a commit that referenced this pull request Mar 18, 2026
@vieiralucas vieiralucas force-pushed the feat/14.4-replication-failover-recovery branch from b1a911e to 157b463 on March 18, 2026 at 11:43
@github-actions (Contributor)

Benchmark Results (median of 3 runs)

Commit: 74a854a
Time: 2026-03-18T11:51:15Z

| Benchmark | Value | Unit |
| --- | --- | --- |
| compaction_active_p99 | 0.500209 | ms |
| compaction_idle_p99 | 0.526993 | ms |
| compaction_p99_delta | -0.02678400000000003 | ms |
| consumer_concurrency_100_throughput | 1763.0 | msg/s |
| consumer_concurrency_10_throughput | 2004.6666666666667 | msg/s |
| consumer_concurrency_1_throughput | 339.6666666666667 | msg/s |
| e2e_latency_p50_light | 0.421875 | ms |
| e2e_latency_p95_light | 0.4987 | ms |
| e2e_latency_p99_light | 0.6490680000000001 | ms |
| enqueue_throughput_1kb | 2580.231132098345 | msg/s |
| enqueue_throughput_1kb_mbps | 2.51975696493979 | MB/s |
| fairness_accuracy_max_deviation | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-1 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-2 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-3 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-4 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-5 | 0.099999999999989 | % deviation |
| fairness_overhead_fair_throughput | 1376.3734917722604 | msg/s |
| fairness_overhead_fifo_throughput | 1411.4410003042544 | msg/s |
| fairness_overhead_pct | 2.48451819979969 | % |
| key_cardinality_10_throughput | 2300.6137326121643 | msg/s |
| key_cardinality_10k_throughput | 637.1643396817127 | msg/s |
| key_cardinality_1k_throughput | 1243.2467847084158 | msg/s |
| lua_on_enqueue_overhead_us | 19.055038721792357 | us |
| lua_throughput_with_hook | 1149.6036036281848 | msg/s |
| memory_per_message_overhead | 2747.1872 | bytes/msg |
| memory_rss_idle | 166.56640625 | MB |
| memory_rss_loaded_10k | 192.6171875 | MB |

@github-actions (Contributor)

Benchmark Results (median of 3 runs)

Commit: d0048e2
Time: 2026-03-18T11:57:50Z

| Benchmark | Value | Unit |
| --- | --- | --- |
| compaction_active_p99 | 0.507293 | ms |
| compaction_idle_p99 | 0.506066 | ms |
| compaction_p99_delta | 0.0029319999999999347 | ms |
| consumer_concurrency_100_throughput | 1305.3333333333333 | msg/s |
| consumer_concurrency_10_throughput | 1527.3333333333333 | msg/s |
| consumer_concurrency_1_throughput | 273.3333333333333 | msg/s |
| e2e_latency_p50_light | 0.418446 | ms |
| e2e_latency_p95_light | 0.49676 | ms |
| e2e_latency_p99_light | 0.607382 | ms |
| enqueue_throughput_1kb | 2588.402565050548 | msg/s |
| enqueue_throughput_1kb_mbps | 2.5277368799321756 | MB/s |
| fairness_accuracy_max_deviation | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-1 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-2 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-3 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-4 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-5 | 0.099999999999989 | % deviation |
| fairness_overhead_fair_throughput | 1373.7525812071967 | msg/s |
| fairness_overhead_fifo_throughput | 1417.8567803961366 | msg/s |
| fairness_overhead_pct | 3.110624415578666 | % |
| key_cardinality_10_throughput | 2343.9744545850695 | msg/s |
| key_cardinality_10k_throughput | 619.9422824207743 | msg/s |
| key_cardinality_1k_throughput | 1237.256305879917 | msg/s |
| lua_on_enqueue_overhead_us | 26.565044982001837 | us |
| lua_throughput_with_hook | 1136.8130211625937 | msg/s |
| memory_per_message_overhead | 2782.0032 | bytes/msg |
| memory_rss_idle | 167.1171875 | MB |
| memory_rss_loaded_10k | 193.0546875 | MB |

@vieiralucas vieiralucas force-pushed the feat/14.3-request-routing-transparent-delivery branch from 8c8bcc1 to d9a736b on March 18, 2026 at 14:40
Base automatically changed from feat/14.3-request-routing-transparent-delivery to main March 18, 2026 20:45
- Add broker storage replication: queue-level Raft state machines now apply
  committed entries (enqueue, ack, nack) to the broker's RocksDB on ALL nodes,
  not just the leader. Followers have full data for zero-loss failover.

- Add LeaderChangeWatcher: background task monitors queue Raft groups for
  leadership changes. On leader promotion, sends RecoverQueue to rebuild
  in-memory scheduler state. On leader loss, sends DropQueueConsumers to
  close consumer streams so clients reconnect to the new leader.

- Add per-queue scheduler recovery: RecoverQueue command rebuilds DRR keys,
  pending index, and leased_msg_keys for a single queue from RocksDB without
  disrupting other queues.

- Add consumer stream leader-awareness: consume() handler rejects requests
  on non-leader nodes with UNAVAILABLE status (see the sketch after this list).

- 3 new integration tests: failover leader election, zero message loss after
  failover, node rejoin and catchup.
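
The leader check in consume() might look roughly like this; tonic's Status::not_found and Status::unavailable are real constructors, while the cluster accessors are assumed names.

```rust
// At the top of the consume() handler, before opening the stream.
let group = self
    .cluster
    .raft_group(&queue_id)
    .ok_or_else(|| Status::not_found("no Raft group for queue"))?;
if !group.is_leader() {
    // Clients handle UNAVAILABLE by reconnecting to the leader.
    return Err(Status::unavailable("not the queue leader"));
}
```
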
- apply_to_broker_storage now returns Result and propagates StorageError
  instead of silently swallowing storage failures (cubic #1)
- add DeleteLeaseExpiry mutation in ack/nack replication paths to clean up
  orphaned lease expiry entries (cubic #3)
- fix no-op leased_msg_keys.retain in recovery — now properly clears
  entries for the recovering queue before rebuild (cubic #4)
- warn when create_group is called without broker_storage set (cubic #5)
- check send_command result in watch_leader_changes — only update leading
  state on success so next poll retries on failure (cubic #6, #7)
- trigger RecoverQueue on first-sight leader state to catch messages
  replicated between startup and first poll (cubic #8)
- replace catch-all _ => {} with explicit variant listing in
  apply_to_broker_storage for compiler-enforced exhaustiveness (illustrated below)
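
A sketch of the exhaustive match: with every variant named, adding a ClusterRequest variant is a compile error until its replication behavior is decided. Variants other than Enqueue/Ack/Nack are assumptions.

```rust
match req {
    ClusterRequest::Enqueue { .. } => apply_enqueue(storage, &req)?,
    ClusterRequest::Ack { .. } | ClusterRequest::Nack { .. } => apply_ack_nack(storage, &req)?,
    // Metadata-only variants are listed explicitly instead of `_ => {}`.
    ClusterRequest::CreateQueue { .. } | ClusterRequest::DeleteQueue { .. } => {}
}
```
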
…eLock

RocksDB exists before both Broker and ClusterManager, so there's no
chicken-and-egg problem. Pass Arc<dyn StorageEngine> directly to
MultiRaftManager::new and make FilaRaftStore::for_queue take it
non-optionally.
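
The wiring this commit describes, sketched with assumed signatures beyond the two names the message mentions (MultiRaftManager::new and FilaRaftStore::for_queue):

```rust
// RocksDB is opened first, so both consumers get the same handle.
let storage: Arc<dyn StorageEngine> = Arc::new(RocksDbEngine::open(&data_dir)?);
let multi_raft = MultiRaftManager::new(node_id, Arc::clone(&storage));
// for_queue now takes the storage handle non-optionally:
let store = FilaRaftStore::for_queue(queue_id, Arc::clone(&storage));
```
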
@github-actions (Contributor)

Benchmark Results (median of 3 runs)

Commit: 62241aa
Time: 2026-03-18T21:29:03Z

| Benchmark | Value | Unit |
| --- | --- | --- |
| compaction_active_p99 | 0.491021 | ms |
| compaction_idle_p99 | 0.482967 | ms |
| compaction_p99_delta | 0.015488000000000002 | ms |
| consumer_concurrency_100_throughput | 1792.6666666666667 | msg/s |
| consumer_concurrency_10_throughput | 1945.0 | msg/s |
| consumer_concurrency_1_throughput | 352.0 | msg/s |
| e2e_latency_p50_light | 0.416752 | ms |
| e2e_latency_p95_light | 0.483415 | ms |
| e2e_latency_p99_light | 0.6693680000000001 | ms |
| enqueue_throughput_1kb | 2636.0487771399385 | msg/s |
| enqueue_throughput_1kb_mbps | 2.574266383925721 | MB/s |
| fairness_accuracy_max_deviation | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-1 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-2 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-3 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-4 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-5 | 0.099999999999989 | % deviation |
| fairness_overhead_fair_throughput | 1395.6475839772509 | msg/s |
| fairness_overhead_fifo_throughput | 1430.5437157259312 | msg/s |
| fairness_overhead_pct | 2.324509129365193 | % |
| key_cardinality_10_throughput | 2402.2694041399504 | msg/s |
| key_cardinality_10k_throughput | 653.8055834278072 | msg/s |
| key_cardinality_1k_throughput | 1265.2077749572184 | msg/s |
| lua_on_enqueue_overhead_us | 18.82712041212744 | us |
| lua_throughput_with_hook | 1170.0446041564696 | msg/s |
| memory_per_message_overhead | 2960.1792 | bytes/msg |
| memory_rss_idle | 166.12109375 | MB |
| memory_rss_loaded_10k | 194.37890625 | MB |

@vieiralucas vieiralucas merged commit 8969ac9 into main Mar 18, 2026
9 checks passed
@vieiralucas vieiralucas deleted the feat/14.4-replication-failover-recovery branch March 18, 2026 21:41