fix: recover masternode sync from error state on peer reconnect#93
fix: recover masternode sync from error state on peer reconnect#93xdustinface wants to merge 3 commits into
Conversation
* chore: add Manki AI code review configuration * fix: pass `claude_code_oauth_token` to Manki action
Override handle_network_event in MasternodesManager to detect Error state when peers reconnect. Resets retry count, clears pending state, and immediately retries QRInfo request. Previously, if QRInfo timed out during initial sync (e.g., wifi disconnect), the manager entered Error state permanently with no recovery path.
|
Manki — Review complete Planner (18s) Review — 7 findings Judge — 4 kept · 0 dropped (47s) Review metadataConfig:
Judge decisions:
Timing:
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## v0.42-dev #93 +/- ##
=============================================
+ Coverage 68.10% 68.20% +0.09%
=============================================
Files 318 318
Lines 67976 68061 +85
=============================================
+ Hits 46297 46422 +125
+ Misses 21679 21639 -40
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
The clear_in_flight_state() oversight is the one real issue here — three reviewers independently caught that chainlock_retry_after survives error recovery, which can silently stall sync after a reconnect. The rest is test hygiene.
📊 4 findings (1 required, 2 suggestion, 1 nit) · 178 lines · 448s
Review stats
{
"model": "claude-sonnet-4-6",
"reviewTimeMs": 447976,
"diffLines": 178,
"diffAdditions": 176,
"diffDeletions": 2,
"filesReviewed": 2,
"agents": [
"Correctness & Logic",
"Architecture & Design",
"Testing & Coverage"
],
"findingsRaw": 7,
"findingsKept": 4,
"findingsDropped": 3,
"severity": {
"required": 1,
"suggestion": 2,
"nit": 1
},
"verdict": "REQUEST_CHANGES",
"prNumber": 93,
"commitSha": "309068a264ff92d19116d28e96c2de1849577049",
"agentMetrics": [
{
"name": "Correctness & Logic",
"findingsRaw": 2,
"findingsKept": 1,
"responseLength": 2567
},
{
"name": "Architecture & Design",
"findingsRaw": 2,
"findingsKept": 2,
"responseLength": 1770
},
{
"name": "Testing & Coverage",
"findingsRaw": 3,
"findingsKept": 3,
"responseLength": 4035
}
],
"judgeMetrics": {
"confidenceDistribution": {
"high": 2,
"medium": 2,
"low": 0
},
"severityChanges": 4,
"mergedDuplicates": 3
},
"fileMetrics": {
"fileTypes": {
".rs": 2
},
"findingsPerFile": {
"dash-spv/src/sync/masternodes/manager.rs": 3,
"dash-spv/src/sync/masternodes/sync_manager.rs": 1
}
},
"reviewerModel": "claude-sonnet-4-6",
"judgeModel": "claude-sonnet-4-6"
}| let event = peers_updated_event(2, Some(500)); | ||
| let _ = manager.handle_network_event(&event, &requests).await; | ||
|
|
||
| // Retry count should be reset |
There was a problem hiding this comment.
💡 Suggestion [medium confidence]: Misleading test assertion: waiting_for_qrinfo check only passes due to missing headers
The assertion assert!(!manager.sync_state.waiting_for_qrinfo) passes only because the test uses an empty storage (no tip header), causing send_qrinfo_for_tip to return early without calling start_waiting_for_qrinfo(). With real block headers present, send_qrinfo_for_tip would set waiting_for_qrinfo = true (correct behavior — a QRInfo was sent and we're waiting for a response). The test validates a degenerate no-headers path, not the intended recovery behavior, and would fail in a realistic scenario while the production code behaves correctly.
Suggested fix
| // Retry count should be reset | |
| Either seed block headers and assert `waiting_for_qrinfo == true` (QRInfo was dispatched), or restructure the assertion to explicitly document the no-headers edge case: | |
| ```rust | |
| // With no headers in storage, send_qrinfo_for_tip returns early so waiting stays false. | |
| // In production (with headers), waiting_for_qrinfo would be true after recovery. | |
| assert_eq!(manager.sync_state.qrinfo_retry_count, 0); | |
| // Verify clear_pending() cleared in-flight state: | |
| assert!(manager.sync_state.qrinfo_wait_start.is_none()); | |
| ``` |
AI context
{
"file": "dash-spv/src/sync/masternodes/manager.rs",
"line": 322,
"severity": "suggestion",
"confidence": "medium",
"flaggedBy": [
"Correctness & Logic"
],
"title": "Misleading test assertion: `waiting_for_qrinfo` check only passes due to missing headers",
"fix": "Either seed block headers and assert `waiting_for_qrinfo == true` (QRInfo was dispatched), or restructure the assertion to explicitly document the no-headers edge case:\n```rust\n// With no headers in s"
}| if *connected_count == 0 { | ||
| self.stop_sync(); | ||
| } else if self.state() == SyncState::WaitingForConnections { | ||
| return self.start_sync(requests).await; |
There was a problem hiding this comment.
🚫 Required [high confidence]: chainlock_retry_after not cleared in error recovery path
The new handle_network_event error recovery path calls self.sync_state.clear_pending() and resets qrinfo_retry_count, but clear_pending() does not reset chainlock_retry_after. If the manager entered Error state while a ChainLock retry was scheduled (or if one was already in progress), the stale chainlock_retry_after will persist into the recovery cycle. The tick() handler checks this field when state is Syncing and will short-circuit with return Ok(vec![]) until the old deadline passes, silently stalling the retry. No test covers this case. Fix by adding self.sync_state.chainlock_retry_after = None; in the error recovery branch, and add a test that sets chainlock_retry_after before triggering peer reconnect, then asserts it is None after recovery.
Suggested fix
| return self.start_sync(requests).await; | |
| // Replace: | |
| self.sync_state.qrinfo_retry_count = 0; | |
| self.sync_state.clear_pending(); | |
| // With: | |
| self.clear_in_flight_state(); // resets retry_count, clear_pending, AND chainlock_retry_after |
AI context
{
"file": "dash-spv/src/sync/masternodes/sync_manager.rs",
"line": 522,
"severity": "required",
"confidence": "high",
"flaggedBy": [
"Architecture & Design",
"Testing & Coverage"
],
"title": "chainlock_retry_after not cleared in error recovery path",
"fix": "// Replace:\nself.sync_state.qrinfo_retry_count = 0;\nself.sync_state.clear_pending();\n\n// With:\nself.clear_in_flight_state(); // resets retry_count, clear_pending, AND chainlock_retry_after"
}| assert_eq!(progress.diffs_processed(), 10); | ||
| assert!(progress.last_activity().elapsed().as_secs() < 1); | ||
| } else { | ||
| panic!("Expected SyncManagerProgress::Masternodes"); |
There was a problem hiding this comment.
💡 Suggestion [medium confidence]: No dashd integration test for the disconnect/reconnect recovery scenario
CLAUDE.md explicitly states that integration tests running against a real dashd node catch critical bugs (restart, resync, disconnection) that unit tests miss, and notes that disconnect tests already exist in dash-spv/tests/dashd_sync/. This PR's primary fix is a new error-recovery path triggered by a real network event; the unit tests exercise state machine transitions in isolation but cannot verify the full path: QRInfo times out → Error state → peer reconnects → QRInfo is retransmitted over the wire → sync completes. A dashd integration test in the disconnect module would validate this end-to-end and guard against regressions.
Suggested fix
| panic!("Expected SyncManagerProgress::Masternodes"); | |
| // In dash-spv/tests/dashd_sync/disconnect.rs (or similar), add: | |
| // 1. Start SPV sync until masternode sync begins (wait for WaitForEvents or Syncing) | |
| // 2. Block QRInfo at the dashd side or simulate timeout by manipulating the peer | |
| // 3. Observe manager enters Error state | |
| // 4. Reconnect peer | |
| // 5. Assert masternode sync eventually completes (MasternodeStateUpdated event received) |
AI context
{
"file": "dash-spv/src/sync/masternodes/manager.rs",
"line": 282,
"severity": "suggestion",
"confidence": "medium",
"flaggedBy": [
"Testing & Coverage"
],
"title": "No dashd integration test for the disconnect/reconnect recovery scenario",
"fix": "// In dash-spv/tests/dashd_sync/disconnect.rs (or similar), add:\n// 1. Start SPV sync until masternode sync begins (wait for WaitForEvents or Syncing)\n// 2. Block QRInfo at the dashd side or simulate "
}77dd87b to
853d3e3
Compare
|
This PR has merge conflicts with the base branch. Please rebase or merge the base branch into your branch to resolve them. |
66e5614 to
05b14a8
Compare
Override handle_network_event in MasternodesManager to detect Error state when peers reconnect. Resets retry count, clears pending state, and immediately retries QRInfo request.
Previously, if QRInfo timed out during initial sync (e.g., wifi disconnect), the manager entered Error state permanently with no recovery path.