fix: recover masternode sync from error state on peer reconnect by xdustinface · Pull Request #93 · xdustinface/rust-dashcore

xdustinface · 2026-04-07T12:35:13Z

Override handle_network_event in MasternodesManager to detect Error state when peers reconnect. Resets retry count, clears pending state, and immediately retries QRInfo request.

Previously, if QRInfo timed out during initial sync (e.g., wifi disconnect), the manager entered Error state permanently with no recovery path.

* chore: add Manki AI code review configuration * fix: pass `claude_code_oauth_token` to Manki action

Override handle_network_event in MasternodesManager to detect Error state when peers reconnect. Resets retry count, clears pending state, and immediately retries QRInfo request. Previously, if QRInfo timed out during initial sync (e.g., wifi disconnect), the manager entered Error state permanently with no recovery path.

manki-review · 2026-04-07T12:35:29Z

Manki — Review complete

Planner (18s)
bugfix · 178 lines · 3 agents
review effort: medium · judge effort: medium

Review — 7 findings
    ✓ Correctness & Logic — 2 (381s)
    ✓ Architecture & Design — 2 (112s)
    ✓ Testing & Coverage — 3 (55s)

Judge — 4 kept · 0 dropped (47s)
kept: 1 required · 2 suggestion · 1 nit

Review metadata

Config:

Models: reviewer=claude-sonnet-4-6, judge=claude-sonnet-4-6
Review level: small (auto, 178 lines)
Team: Correctness & Logic, Architecture & Design, Testing & Coverage
Memory: disabled
Nit handling: issues

Judge decisions:

✓ Kept: "Misleading test assertion: waiting_for_qrinfo check only passes due to missing headers" (suggestion, medium confidence) — "Impact: Low-Medium (test gives false confidence — the assertion passes for the wrong reason, not because recovery correctly cleared in-flight state). Likelihood: Certain (empty storage always triggers the early return). The test as written doesn't validate the intended recovery behavior, and a future code change that breaks real recovery might not be caught. One reviewer flagged this, but the description is technically accurate and actionable."
✓ Kept: "chainlock_retry_after not cleared in error recovery path" (required, high confidence) — "Impact: High (silent sync stall — after recovery, tick() short-circuits on stale chainlock_retry_after until the old deadline passes, with no visible indication). Likelihood: Possible (requires Error state to be entered while a chainlock retry was scheduled, which is a realistic timing). Three reviewers independently identified this issue (Findings 2, 3, and 5 merged — same root cause, same fix: replace the manual reset with self.clear_in_flight_state()). Consensus at this level overrides the reviewer's 'always None in current paths' caveat, especially since the fix is a one-liner."
✓ Kept: "Loose state assertion in recovery test allows unreachable branch" (nit, high confidence) — "Merged Findings 4 and 7 — both describe the same WaitForEvents || Syncing assertion where the Syncing branch is unreachable with empty storage. Tightening to assert_eq!(manager.state(), SyncState::WaitForEvents) makes intent explicit, but this is test clarity only with no production impact. Two reviewers flagged it, but the severity stays nit because the test still catches the primary regression (state is no longer Error)."
✓ Kept: "No dashd integration test for the disconnect/reconnect recovery scenario" (suggestion, medium confidence) — "Impact: Medium (unit tests can't verify the full path: timeout → Error state → reconnect → QRInfo retransmitted → sync completes). Likelihood: N/A (missing coverage). CLAUDE.md explicitly calls out that disconnect integration tests catch critical bugs unit tests miss, and the disconnect test module already exists. One reviewer flagged this; the project instructions lend it weight. Not required because the unit tests do cover the state machine transitions, but the gap is real."

Timing:

Parse: 1.2s
Review agents: 400.2s
Judge: 46.5s
Total: 448.0s

codecov-commenter · 2026-04-07T12:38:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.20%. Comparing base (181c0df) to head (309068a).

Additional details and impacted files

@@              Coverage Diff              @@
##           v0.42-dev      #93      +/-   ##
=============================================
+ Coverage      68.10%   68.20%   +0.09%     
=============================================
  Files            318      318              
  Lines          67976    68061      +85     
=============================================
+ Hits           46297    46422     +125     
+ Misses         21679    21639      -40

Flag	Coverage Δ
core	`75.21% <ø> (ø)`
ffi	`40.27% <ø> (+0.17%)`	⬆️
rpc	`19.92% <ø> (ø)`
spv	`85.52% <100.00%> (+0.24%)`	⬆️
wallet	`67.56% <ø> (ø)`

Files with missing lines	Coverage Δ
dash-spv/src/sync/masternodes/manager.rs	`69.90% <100.00%> (+28.86%)`	⬆️
dash-spv/src/sync/masternodes/sync_manager.rs	`17.80% <100.00%> (+9.41%)`	⬆️

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

manki-review

The clear_in_flight_state() oversight is the one real issue here — three reviewers independently caught that chainlock_retry_after survives error recovery, which can silently stall sync after a reconnect. The rest is test hygiene.

📊 4 findings (1 required, 2 suggestion, 1 nit) · 178 lines · 448s

Review stats

{
  "model": "claude-sonnet-4-6",
  "reviewTimeMs": 447976,
  "diffLines": 178,
  "diffAdditions": 176,
  "diffDeletions": 2,
  "filesReviewed": 2,
  "agents": [
    "Correctness & Logic",
    "Architecture & Design",
    "Testing & Coverage"
  ],
  "findingsRaw": 7,
  "findingsKept": 4,
  "findingsDropped": 3,
  "severity": {
    "required": 1,
    "suggestion": 2,
    "nit": 1
  },
  "verdict": "REQUEST_CHANGES",
  "prNumber": 93,
  "commitSha": "309068a264ff92d19116d28e96c2de1849577049",
  "agentMetrics": [
    {
      "name": "Correctness & Logic",
      "findingsRaw": 2,
      "findingsKept": 1,
      "responseLength": 2567
    },
    {
      "name": "Architecture & Design",
      "findingsRaw": 2,
      "findingsKept": 2,
      "responseLength": 1770
    },
    {
      "name": "Testing & Coverage",
      "findingsRaw": 3,
      "findingsKept": 3,
      "responseLength": 4035
    }
  ],
  "judgeMetrics": {
    "confidenceDistribution": {
      "high": 2,
      "medium": 2,
      "low": 0
    },
    "severityChanges": 4,
    "mergedDuplicates": 3
  },
  "fileMetrics": {
    "fileTypes": {
      ".rs": 2
    },
    "findingsPerFile": {
      "dash-spv/src/sync/masternodes/manager.rs": 3,
      "dash-spv/src/sync/masternodes/sync_manager.rs": 1
    }
  },
  "reviewerModel": "claude-sonnet-4-6",
  "judgeModel": "claude-sonnet-4-6"
}

manki-review · 2026-04-07T12:42:58Z

+        let event = peers_updated_event(2, Some(500));
+        let _ = manager.handle_network_event(&event, &requests).await;
+
+        // Retry count should be reset


💡 Suggestion _{[medium confidence]}: Misleading test assertion: waiting_for_qrinfo check only passes due to missing headers

The assertion assert!(!manager.sync_state.waiting_for_qrinfo) passes only because the test uses an empty storage (no tip header), causing send_qrinfo_for_tip to return early without calling start_waiting_for_qrinfo(). With real block headers present, send_qrinfo_for_tip would set waiting_for_qrinfo = true (correct behavior — a QRInfo was sent and we're waiting for a response). The test validates a degenerate no-headers path, not the intended recovery behavior, and would fail in a realistic scenario while the production code behaves correctly.

Suggested fix

Suggested change

// Retry count should be reset

Either seed block headers and assert `waiting_for_qrinfo == true` (QRInfo was dispatched), or restructure the assertion to explicitly document the no-headers edge case:

```rust

// With no headers in storage, send_qrinfo_for_tip returns early so waiting stays false.

// In production (with headers), waiting_for_qrinfo would be true after recovery.

assert_eq!(manager.sync_state.qrinfo_retry_count, 0);

// Verify clear_pending() cleared in-flight state:

assert!(manager.sync_state.qrinfo_wait_start.is_none());

```

AI context

{ "file": "dash-spv/src/sync/masternodes/manager.rs", "line": 322, "severity": "suggestion", "confidence": "medium", "flaggedBy": [ "Correctness & Logic" ], "title": "Misleading test assertion: `waiting_for_qrinfo` check only passes due to missing headers", "fix": "Either seed block headers and assert `waiting_for_qrinfo == true` (QRInfo was dispatched), or restructure the assertion to explicitly document the no-headers edge case:\n```rust\n// With no headers in s" }

manki-review · 2026-04-07T12:42:58Z

+            if *connected_count == 0 {
+                self.stop_sync();
+            } else if self.state() == SyncState::WaitingForConnections {
+                return self.start_sync(requests).await;


🚫 Required _{[high confidence]}: chainlock_retry_after not cleared in error recovery path

The new handle_network_event error recovery path calls self.sync_state.clear_pending() and resets qrinfo_retry_count, but clear_pending() does not reset chainlock_retry_after. If the manager entered Error state while a ChainLock retry was scheduled (or if one was already in progress), the stale chainlock_retry_after will persist into the recovery cycle. The tick() handler checks this field when state is Syncing and will short-circuit with return Ok(vec![]) until the old deadline passes, silently stalling the retry. No test covers this case. Fix by adding self.sync_state.chainlock_retry_after = None; in the error recovery branch, and add a test that sets chainlock_retry_after before triggering peer reconnect, then asserts it is None after recovery.

Suggested fix

Suggested change

return self.start_sync(requests).await;

// Replace:

self.sync_state.qrinfo_retry_count = 0;

self.sync_state.clear_pending();

// With:

self.clear_in_flight_state(); // resets retry_count, clear_pending, AND chainlock_retry_after

AI context

{ "file": "dash-spv/src/sync/masternodes/sync_manager.rs", "line": 522, "severity": "required", "confidence": "high", "flaggedBy": [ "Architecture & Design", "Testing & Coverage" ], "title": "chainlock_retry_after not cleared in error recovery path", "fix": "// Replace:\nself.sync_state.qrinfo_retry_count = 0;\nself.sync_state.clear_pending();\n\n// With:\nself.clear_in_flight_state(); // resets retry_count, clear_pending, AND chainlock_retry_after" }

manki-review · 2026-04-07T12:42:58Z

            assert_eq!(progress.diffs_processed(), 10);
            assert!(progress.last_activity().elapsed().as_secs() < 1);
        } else {
            panic!("Expected SyncManagerProgress::Masternodes");


💡 Suggestion _{[medium confidence]}: No dashd integration test for the disconnect/reconnect recovery scenario

CLAUDE.md explicitly states that integration tests running against a real dashd node catch critical bugs (restart, resync, disconnection) that unit tests miss, and notes that disconnect tests already exist in dash-spv/tests/dashd_sync/. This PR's primary fix is a new error-recovery path triggered by a real network event; the unit tests exercise state machine transitions in isolation but cannot verify the full path: QRInfo times out → Error state → peer reconnects → QRInfo is retransmitted over the wire → sync completes. A dashd integration test in the disconnect module would validate this end-to-end and guard against regressions.

Suggested fix

Suggested change

panic!("Expected SyncManagerProgress::Masternodes");

// In dash-spv/tests/dashd_sync/disconnect.rs (or similar), add:

// 1. Start SPV sync until masternode sync begins (wait for WaitForEvents or Syncing)

// 2. Block QRInfo at the dashd side or simulate timeout by manipulating the peer

// 3. Observe manager enters Error state

// 4. Reconnect peer

// 5. Assert masternode sync eventually completes (MasternodeStateUpdated event received)

AI context

{ "file": "dash-spv/src/sync/masternodes/manager.rs", "line": 282, "severity": "suggestion", "confidence": "medium", "flaggedBy": [ "Testing & Coverage" ], "title": "No dashd integration test for the disconnect/reconnect recovery scenario", "fix": "// In dash-spv/tests/dashd_sync/disconnect.rs (or similar), add:\n// 1. Start SPV sync until masternode sync begins (wait for WaitForEvents or Syncing)\n// 2. Block QRInfo at the dashd side or simulate " }

github-actions · 2026-04-27T01:05:05Z

This PR has merge conflicts with the base branch. Please rebase or merge the base branch into your branch to resolve them.

xdustinface and others added 3 commits April 7, 2026 14:29

ci: add Manki AI code review configuration (#73)

917f90e

* chore: add Manki AI code review configuration * fix: pass `claude_code_oauth_token` to Manki action

chore: update manki action reference to manki-review/manki (#88)

181c0df

manki-review Bot requested changes Apr 7, 2026

View reviewed changes

xdustinface force-pushed the v0.42-dev branch 3 times, most recently from 77dd87b to 853d3e3 Compare April 17, 2026 10:24

xdustinface force-pushed the v0.42-dev branch from bda8a95 to 1e2a317 Compare April 27, 2026 01:02

github-actions Bot added the merge-conflict label Apr 27, 2026

xdustinface force-pushed the v0.42-dev branch 2 times, most recently from 66e5614 to 05b14a8 Compare May 8, 2026 12:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: recover masternode sync from error state on peer reconnect#93

fix: recover masternode sync from error state on peer reconnect#93
xdustinface wants to merge 3 commits into
v0.42-devfrom
fix/masternode-sync-recovery

xdustinface commented Apr 7, 2026

Uh oh!

manki-review Bot commented Apr 7, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Apr 7, 2026

Uh oh!

manki-review Bot left a comment

Uh oh!

manki-review Bot Apr 7, 2026

Uh oh!

manki-review Bot Apr 7, 2026

Uh oh!

manki-review Bot Apr 7, 2026

Uh oh!

github-actions Bot commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-        // Retry count should be reset
+Either seed block headers and assert `waiting_for_qrinfo == true` (QRInfo was dispatched), or restructure the assertion to explicitly document the no-headers edge case:
+```rust
+// With no headers in storage, send_qrinfo_for_tip returns early so waiting stays false.
+// In production (with headers), waiting_for_qrinfo would be true after recovery.
+assert_eq!(manager.sync_state.qrinfo_retry_count, 0);
+// Verify clear_pending() cleared in-flight state:
+assert!(manager.sync_state.qrinfo_wait_start.is_none());
+```

-                return self.start_sync(requests).await;
+// Replace:
+self.sync_state.qrinfo_retry_count = 0;
+self.sync_state.clear_pending();
+// With:
+self.clear_in_flight_state(); // resets retry_count, clear_pending, AND chainlock_retry_after

-            panic!("Expected SyncManagerProgress::Masternodes");
+// In dash-spv/tests/dashd_sync/disconnect.rs (or similar), add:
+// 1. Start SPV sync until masternode sync begins (wait for WaitForEvents or Syncing)
+// 2. Block QRInfo at the dashd side or simulate timeout by manipulating the peer
+// 3. Observe manager enters Error state
+// 4. Reconnect peer
+// 5. Assert masternode sync eventually completes (MasternodeStateUpdated event received)

Conversation

xdustinface commented Apr 7, 2026

Uh oh!

manki-review Bot commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Apr 7, 2026

Codecov Report

Uh oh!

manki-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

manki-review Bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

manki-review Bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

manki-review Bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

manki-review Bot commented Apr 7, 2026 •

edited

Loading