Skip to content

fix: recover masternode sync from error state on peer reconnect#93

Open
xdustinface wants to merge 3 commits into
v0.42-devfrom
fix/masternode-sync-recovery
Open

fix: recover masternode sync from error state on peer reconnect#93
xdustinface wants to merge 3 commits into
v0.42-devfrom
fix/masternode-sync-recovery

Conversation

@xdustinface
Copy link
Copy Markdown
Owner

Override handle_network_event in MasternodesManager to detect Error state when peers reconnect. Resets retry count, clears pending state, and immediately retries QRInfo request.

Previously, if QRInfo timed out during initial sync (e.g., wifi disconnect), the manager entered Error state permanently with no recovery path.

xdustinface and others added 3 commits April 7, 2026 14:29
* chore: add Manki AI code review configuration

* fix: pass `claude_code_oauth_token` to Manki action
Override handle_network_event in MasternodesManager to detect Error
state when peers reconnect. Resets retry count, clears pending state,
and immediately retries QRInfo request.

Previously, if QRInfo timed out during initial sync (e.g., wifi
disconnect), the manager entered Error state permanently with no
recovery path.
@manki-review
Copy link
Copy Markdown

manki-review Bot commented Apr 7, 2026

Manki — Review complete

Planner (18s)
    bugfix · 178 lines · 3 agents
    review effort: medium · judge effort: medium

Review — 7 findings
    ✓ Correctness & Logic — 2 (381s)
    ✓ Architecture & Design — 2 (112s)
    ✓ Testing & Coverage — 3 (55s)

Judge — 4 kept · 0 dropped (47s)
    kept: 1 required · 2 suggestion · 1 nit

Review metadata

Config:

  • Models: reviewer=claude-sonnet-4-6, judge=claude-sonnet-4-6
  • Review level: small (auto, 178 lines)
  • Team: Correctness & Logic, Architecture & Design, Testing & Coverage
  • Memory: disabled
  • Nit handling: issues

Judge decisions:

  • ✓ Kept: "Misleading test assertion: waiting_for_qrinfo check only passes due to missing headers" (suggestion, medium confidence) — "Impact: Low-Medium (test gives false confidence — the assertion passes for the wrong reason, not because recovery correctly cleared in-flight state). Likelihood: Certain (empty storage always triggers the early return). The test as written doesn't validate the intended recovery behavior, and a future code change that breaks real recovery might not be caught. One reviewer flagged this, but the description is technically accurate and actionable."
  • ✓ Kept: "chainlock_retry_after not cleared in error recovery path" (required, high confidence) — "Impact: High (silent sync stall — after recovery, tick() short-circuits on stale chainlock_retry_after until the old deadline passes, with no visible indication). Likelihood: Possible (requires Error state to be entered while a chainlock retry was scheduled, which is a realistic timing). Three reviewers independently identified this issue (Findings 2, 3, and 5 merged — same root cause, same fix: replace the manual reset with self.clear_in_flight_state()). Consensus at this level overrides the reviewer's 'always None in current paths' caveat, especially since the fix is a one-liner."
  • ✓ Kept: "Loose state assertion in recovery test allows unreachable branch" (nit, high confidence) — "Merged Findings 4 and 7 — both describe the same WaitForEvents || Syncing assertion where the Syncing branch is unreachable with empty storage. Tightening to assert_eq!(manager.state(), SyncState::WaitForEvents) makes intent explicit, but this is test clarity only with no production impact. Two reviewers flagged it, but the severity stays nit because the test still catches the primary regression (state is no longer Error)."
  • ✓ Kept: "No dashd integration test for the disconnect/reconnect recovery scenario" (suggestion, medium confidence) — "Impact: Medium (unit tests can't verify the full path: timeout → Error state → reconnect → QRInfo retransmitted → sync completes). Likelihood: N/A (missing coverage). CLAUDE.md explicitly calls out that disconnect integration tests catch critical bugs unit tests miss, and the disconnect test module already exists. One reviewer flagged this; the project instructions lend it weight. Not required because the unit tests do cover the state machine transitions, but the gap is real."

Timing:

  • Parse: 1.2s
  • Review agents: 400.2s
  • Judge: 46.5s
  • Total: 448.0s

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.20%. Comparing base (181c0df) to head (309068a).

Additional details and impacted files
@@              Coverage Diff              @@
##           v0.42-dev      #93      +/-   ##
=============================================
+ Coverage      68.10%   68.20%   +0.09%     
=============================================
  Files            318      318              
  Lines          67976    68061      +85     
=============================================
+ Hits           46297    46422     +125     
+ Misses         21679    21639      -40     
Flag Coverage Δ
core 75.21% <ø> (ø)
ffi 40.27% <ø> (+0.17%) ⬆️
rpc 19.92% <ø> (ø)
spv 85.52% <100.00%> (+0.24%) ⬆️
wallet 67.56% <ø> (ø)
Files with missing lines Coverage Δ
dash-spv/src/sync/masternodes/manager.rs 69.90% <100.00%> (+28.86%) ⬆️
dash-spv/src/sync/masternodes/sync_manager.rs 17.80% <100.00%> (+9.41%) ⬆️

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@manki-review manki-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clear_in_flight_state() oversight is the one real issue here — three reviewers independently caught that chainlock_retry_after survives error recovery, which can silently stall sync after a reconnect. The rest is test hygiene.

📊 4 findings (1 required, 2 suggestion, 1 nit) · 178 lines · 448s

Review stats
{
  "model": "claude-sonnet-4-6",
  "reviewTimeMs": 447976,
  "diffLines": 178,
  "diffAdditions": 176,
  "diffDeletions": 2,
  "filesReviewed": 2,
  "agents": [
    "Correctness & Logic",
    "Architecture & Design",
    "Testing & Coverage"
  ],
  "findingsRaw": 7,
  "findingsKept": 4,
  "findingsDropped": 3,
  "severity": {
    "required": 1,
    "suggestion": 2,
    "nit": 1
  },
  "verdict": "REQUEST_CHANGES",
  "prNumber": 93,
  "commitSha": "309068a264ff92d19116d28e96c2de1849577049",
  "agentMetrics": [
    {
      "name": "Correctness & Logic",
      "findingsRaw": 2,
      "findingsKept": 1,
      "responseLength": 2567
    },
    {
      "name": "Architecture & Design",
      "findingsRaw": 2,
      "findingsKept": 2,
      "responseLength": 1770
    },
    {
      "name": "Testing & Coverage",
      "findingsRaw": 3,
      "findingsKept": 3,
      "responseLength": 4035
    }
  ],
  "judgeMetrics": {
    "confidenceDistribution": {
      "high": 2,
      "medium": 2,
      "low": 0
    },
    "severityChanges": 4,
    "mergedDuplicates": 3
  },
  "fileMetrics": {
    "fileTypes": {
      ".rs": 2
    },
    "findingsPerFile": {
      "dash-spv/src/sync/masternodes/manager.rs": 3,
      "dash-spv/src/sync/masternodes/sync_manager.rs": 1
    }
  },
  "reviewerModel": "claude-sonnet-4-6",
  "judgeModel": "claude-sonnet-4-6"
}

let event = peers_updated_event(2, Some(500));
let _ = manager.handle_network_event(&event, &requests).await;

// Retry count should be reset
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion [medium confidence]: Misleading test assertion: waiting_for_qrinfo check only passes due to missing headers

The assertion assert!(!manager.sync_state.waiting_for_qrinfo) passes only because the test uses an empty storage (no tip header), causing send_qrinfo_for_tip to return early without calling start_waiting_for_qrinfo(). With real block headers present, send_qrinfo_for_tip would set waiting_for_qrinfo = true (correct behavior — a QRInfo was sent and we're waiting for a response). The test validates a degenerate no-headers path, not the intended recovery behavior, and would fail in a realistic scenario while the production code behaves correctly.

Suggested fix
Suggested change
// Retry count should be reset
Either seed block headers and assert `waiting_for_qrinfo == true` (QRInfo was dispatched), or restructure the assertion to explicitly document the no-headers edge case:
```rust
// With no headers in storage, send_qrinfo_for_tip returns early so waiting stays false.
// In production (with headers), waiting_for_qrinfo would be true after recovery.
assert_eq!(manager.sync_state.qrinfo_retry_count, 0);
// Verify clear_pending() cleared in-flight state:
assert!(manager.sync_state.qrinfo_wait_start.is_none());
```
AI context
{
  "file": "dash-spv/src/sync/masternodes/manager.rs",
  "line": 322,
  "severity": "suggestion",
  "confidence": "medium",
  "flaggedBy": [
    "Correctness & Logic"
  ],
  "title": "Misleading test assertion: `waiting_for_qrinfo` check only passes due to missing headers",
  "fix": "Either seed block headers and assert `waiting_for_qrinfo == true` (QRInfo was dispatched), or restructure the assertion to explicitly document the no-headers edge case:\n```rust\n// With no headers in s"
}

if *connected_count == 0 {
self.stop_sync();
} else if self.state() == SyncState::WaitingForConnections {
return self.start_sync(requests).await;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 Required [high confidence]: chainlock_retry_after not cleared in error recovery path

The new handle_network_event error recovery path calls self.sync_state.clear_pending() and resets qrinfo_retry_count, but clear_pending() does not reset chainlock_retry_after. If the manager entered Error state while a ChainLock retry was scheduled (or if one was already in progress), the stale chainlock_retry_after will persist into the recovery cycle. The tick() handler checks this field when state is Syncing and will short-circuit with return Ok(vec![]) until the old deadline passes, silently stalling the retry. No test covers this case. Fix by adding self.sync_state.chainlock_retry_after = None; in the error recovery branch, and add a test that sets chainlock_retry_after before triggering peer reconnect, then asserts it is None after recovery.

Suggested fix
Suggested change
return self.start_sync(requests).await;
// Replace:
self.sync_state.qrinfo_retry_count = 0;
self.sync_state.clear_pending();
// With:
self.clear_in_flight_state(); // resets retry_count, clear_pending, AND chainlock_retry_after
AI context
{
  "file": "dash-spv/src/sync/masternodes/sync_manager.rs",
  "line": 522,
  "severity": "required",
  "confidence": "high",
  "flaggedBy": [
    "Architecture & Design",
    "Testing & Coverage"
  ],
  "title": "chainlock_retry_after not cleared in error recovery path",
  "fix": "// Replace:\nself.sync_state.qrinfo_retry_count = 0;\nself.sync_state.clear_pending();\n\n// With:\nself.clear_in_flight_state(); // resets retry_count, clear_pending, AND chainlock_retry_after"
}

assert_eq!(progress.diffs_processed(), 10);
assert!(progress.last_activity().elapsed().as_secs() < 1);
} else {
panic!("Expected SyncManagerProgress::Masternodes");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion [medium confidence]: No dashd integration test for the disconnect/reconnect recovery scenario

CLAUDE.md explicitly states that integration tests running against a real dashd node catch critical bugs (restart, resync, disconnection) that unit tests miss, and notes that disconnect tests already exist in dash-spv/tests/dashd_sync/. This PR's primary fix is a new error-recovery path triggered by a real network event; the unit tests exercise state machine transitions in isolation but cannot verify the full path: QRInfo times out → Error state → peer reconnects → QRInfo is retransmitted over the wire → sync completes. A dashd integration test in the disconnect module would validate this end-to-end and guard against regressions.

Suggested fix
Suggested change
panic!("Expected SyncManagerProgress::Masternodes");
// In dash-spv/tests/dashd_sync/disconnect.rs (or similar), add:
// 1. Start SPV sync until masternode sync begins (wait for WaitForEvents or Syncing)
// 2. Block QRInfo at the dashd side or simulate timeout by manipulating the peer
// 3. Observe manager enters Error state
// 4. Reconnect peer
// 5. Assert masternode sync eventually completes (MasternodeStateUpdated event received)
AI context
{
  "file": "dash-spv/src/sync/masternodes/manager.rs",
  "line": 282,
  "severity": "suggestion",
  "confidence": "medium",
  "flaggedBy": [
    "Testing & Coverage"
  ],
  "title": "No dashd integration test for the disconnect/reconnect recovery scenario",
  "fix": "// In dash-spv/tests/dashd_sync/disconnect.rs (or similar), add:\n// 1. Start SPV sync until masternode sync begins (wait for WaitForEvents or Syncing)\n// 2. Block QRInfo at the dashd side or simulate "
}

@github-actions
Copy link
Copy Markdown

This PR has merge conflicts with the base branch. Please rebase or merge the base branch into your branch to resolve them.

@xdustinface xdustinface force-pushed the v0.42-dev branch 2 times, most recently from 66e5614 to 05b14a8 Compare May 8, 2026 12:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants