WIP: Harden deferred notebook import/replay to avoid sync stalls and retry thrash #409
blakebyrnes wants to merge 1 commit into main
Conversation
Pull request overview
This PR hardens deferred notebook import and replay logic to prevent sync stalls and retry thrashing during transient notary/archive unavailability. The changes introduce bounded timeouts, header-only fallback behavior when queues are saturated, and separate import-safe vs. background notebook verification modes.
Changes:
- Adds import-safe notebook verification with bounded timeouts and no side effects
- Changes deferred queue saturation behavior from `MissingState` to header-only import
- Introduces bounded replay scanning to prevent long-running queue processing
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| node/consensus/src/test.rs | Adds comprehensive test coverage for deferred import, queue saturation, and replay recovery scenarios |
| node/consensus/src/pending_import_replay.rs | Implements pending import replay queue with persistence, bounded scanning, and header-only fallback logic |
| node/consensus/src/notary_client.rs | Refactors notebook audit verification to support bounded timeouts and import-safe mode without notary updates |
| node/consensus/src/mock_importer.rs | Provides test infrastructure including mock chain client and importer factory functions |
| node/consensus/src/lib.rs | Moves test module to separate file and adds pending_import_replay module |
| node/consensus/src/import_queue.rs | Refactors import logic with pre-import gates, deferred import handling, and background replay task |
| node/consensus/src/error.rs | Adds new error variants for deferred audits and pending import constraints, plus retryable error detection |
Force-pushed from 7afce34 to 968e9a5
Add a persisted pending-import replay queue and wire import-gating for deferred full-body execution.

- persist/recover pending full imports via aux storage
- replay deferred imports when parent state and notebook prerequisites are ready
- keep header-only imports bounded with queue capacity and explicit queue-full behavior
- split consensus test scaffolding into dedicated mock and test modules
- harden notary audit catch-up to defer on unavailable audits instead of misclassifying them as an invalid digest

fix(consensus): harden deferred notebook import and replay

Block import had multiple failure modes when notebook audits were temporarily unavailable: import-time verification could pull in notary update/connect work while the import lock was held, deferred queue saturation returned `MissingState` and could trigger retry churn, and deferred reimport short-circuits could skip justification/finality upgrades.

This change adds an import-safe notebook verification path with a bounded wait and no inline notary updates, treats deferred queue saturation as a header-only import (instead of `MissingState`), preserves justifications on deferred header split, and allows justified/finalized upgrades to reimport. Replay scanning is also bounded per pass to avoid long blocked scans.

Impact: lower risk of sync stalls and peer-thrashing during transient notary/archive outages, while preserving consensus safety checks and deterministic recovery behavior.

Validation: `cargo test -p argon-node-consensus --lib`, `cargo make fmt`, `cargo make lint`.
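The saturation behavior described in the commit message can be sketched as a small decision function. This is illustrative only, with hypothetical names (`ImportDecision`, `decide_import`, and the `MAX_PENDING_IMPORTS` value), not the PR's actual types: when the deferred queue is full, the importer falls back to a header-only import rather than surfacing `MissingState`, which peers would retry against.

```rust
/// Illustrative sketch of the queue-saturation fallback; names and the
/// capacity value are assumptions, not the PR's real API.
#[derive(Debug, PartialEq)]
enum ImportDecision {
    /// Queue the block for a later full-body replay.
    DeferFullBody,
    /// Queue is saturated: import the header only so sync keeps moving.
    HeaderOnly,
}

const MAX_PENDING_IMPORTS: usize = 4;

fn decide_import(queue_len: usize) -> ImportDecision {
    if queue_len >= MAX_PENDING_IMPORTS {
        // Previously this path returned a MissingState-style error,
        // which caused retry churn; header-only import avoids that.
        ImportDecision::HeaderOnly
    } else {
        ImportDecision::DeferFullBody
    }
}

fn main() {
    assert_eq!(decide_import(1), ImportDecision::DeferFullBody);
    assert_eq!(decide_import(4), ImportDecision::HeaderOnly);
    println!("saturation fallback behaves as described");
}
```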
Force-pushed from 968e9a5 to b335547
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
```rust
if pending_imports.len() >= MAX_PENDING_IMPORTS {
    warn!(
        block_hash = ?pending_import.hash,
        number = ?pending_import.block.header.number(),
        queue_len = pending_imports.len(),
        "Pending replay queue is full while requeueing deferred import; keeping deferred import"
    );
}
pending_imports.push(pending_import);
sort_pending_imports(&mut pending_imports);
self.persist_snapshot(&pending_imports);
```
`requeue_pending_import` logs when the queue is at/above `MAX_PENDING_IMPORTS`, but it still pushes the entry anyway. This defeats the queue bound and can grow the in-memory and persisted aux payload beyond the intended cap under repeated retries (the warning message also says "queue is full" but the behavior still enqueues). Consider returning early (dropping or deferring persistence) when full, or evicting/compacting before pushing, so `pending_imports.len()` never exceeds `MAX_PENDING_IMPORTS`.
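A minimal sketch of the "return early when full" option the reviewer suggests. The queue is simplified to a `Vec<u64>` of block numbers and the capacity is an assumed value; only the control flow mirrors the snippet above:

```rust
/// Simplified model of the requeue path: enforce the bound by
/// returning early instead of pushing past MAX_PENDING_IMPORTS.
const MAX_PENDING_IMPORTS: usize = 4;

fn requeue_pending_import(pending_imports: &mut Vec<u64>, pending_import: u64) -> bool {
    if pending_imports.len() >= MAX_PENDING_IMPORTS {
        // Queue is genuinely full: drop the entry so the in-memory and
        // persisted payload never exceed the cap.
        eprintln!("pending replay queue full; dropping deferred import {pending_import}");
        return false;
    }
    pending_imports.push(pending_import);
    pending_imports.sort_unstable(); // stand-in for sort_pending_imports
    true
}

fn main() {
    let mut queue = Vec::new();
    for block in 0..6u64 {
        requeue_pending_import(&mut queue, block);
    }
    // Two of the six requeues are rejected; the bound holds.
    assert_eq!(queue.len(), MAX_PENDING_IMPORTS);
    println!("queue bounded at {}", queue.len());
}
```

With this shape the persisted snapshot is only written on a successful push, which also matches the reviewer's "deferring persistence" variant.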
```rust
let mut replayed = 0usize;
while let Some((pending_import, replay_context)) = self
    .pending_full_import_queue
    .dequeue_ready_for_replay(&self.notary_client)
    .await
{
    replayed = replayed.saturating_add(1);
    let mut replay_retry_block =
```
`replayed` is incremented in `replay_pending_full_imports` but never used. If the project treats warnings as errors (e.g., via clippy/lint settings), this will fail the build. Either remove it or use it (e.g., for a debug/info log or metrics) to justify keeping the counter.
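One way to resolve this, sketched with a synchronous stand-in for the async queue (the function name matches the PR; `MAX_REPLAY_PER_PASS` and the `Vec` queue are assumptions): keep the counter, use it to bound the per-pass scan, and emit it in a log so it is no longer dead code.

```rust
/// Stand-in for the replay loop: the counter both bounds the pass and
/// feeds a log line, so dead-code lints stay quiet under deny(warnings).
const MAX_REPLAY_PER_PASS: usize = 8;

fn replay_pending_full_imports(queue: &mut Vec<u64>) -> usize {
    let mut replayed = 0usize;
    while replayed < MAX_REPLAY_PER_PASS {
        let Some(_block) = queue.pop() else { break };
        replayed = replayed.saturating_add(1);
    }
    // Using the counter in a log (or a metric) justifies keeping it.
    eprintln!("replayed {replayed} deferred imports this pass");
    replayed
}

fn main() {
    let mut queue: Vec<u64> = (0..20).collect();
    let n = replay_pending_full_imports(&mut queue);
    assert_eq!(n, MAX_REPLAY_PER_PASS);
    assert_eq!(queue.len(), 12); // the rest wait for the next pass
}
```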
```rust
sp_blockchain::Info {
    finalized_hash: fin.1,
    finalized_number: fin.0,
    finalized_state: None,
    best_hash: best.1,
    best_number: best.0,
    block_gap: None,
    genesis_hash: best.1,
    number_leaves: 0,
}
```
In the `MemChain` test client, `info().genesis_hash` is set to `best.1`, which changes as best changes. `genesis_hash` is expected to be stable and refer to the chain's genesis block; using the current best hash can break any logic that relies on `client.info().genesis_hash` (e.g., compute puzzle key block derivation). Store the genesis hash once in `MemChain` and return that here instead.
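A minimal sketch of the suggested fix, with simplified types (`Hash` here is a toy newtype, not the Substrate hash type): capture the genesis hash at construction and return it unchanged, regardless of how `best` advances.

```rust
/// Toy hash type standing in for the real block hash.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Hash(u64);

/// Simplified MemChain: genesis_hash is fixed at construction,
/// while best is free to advance.
struct MemChain {
    genesis_hash: Hash,
    best: (u64, Hash),
}

impl MemChain {
    fn new(genesis_hash: Hash) -> Self {
        Self { genesis_hash, best: (0, genesis_hash) }
    }

    fn set_best(&mut self, number: u64, hash: Hash) {
        self.best = (number, hash);
    }

    /// Stable: always the hash captured at construction, never best.1.
    fn genesis_hash(&self) -> Hash {
        self.genesis_hash
    }
}

fn main() {
    let mut chain = MemChain::new(Hash(0));
    chain.set_best(5, Hash(42));
    // genesis_hash no longer tracks the moving best hash.
    assert_eq!(chain.genesis_hash(), Hash(0));
    assert_ne!(chain.genesis_hash(), chain.best.1);
}
```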
Why

When notebook audits are temporarily unavailable, import was mixing safety checks with network recovery work in ways that could hurt liveness. In particular, the previous path could hold import on notary connection/update work and could emit `MissingState` under deferred-queue saturation, which can lead to retry churn.

What changed

- Import-safe notebook verification with a bounded wait and no inline notary updates
- Deferred queue saturation now produces a header-only import instead of `MissingState`
- Replay scanning is bounded per pass

Edge cases handled

- Justifications are preserved on deferred header split
- Justified/finalized upgrades are allowed to reimport

Impact

Lower risk of sync stalls and peer-thrashing during transient notary/archive outages, while preserving consensus safety checks and deterministic recovery behavior.

Validation

- `cargo test -p argon-node-consensus --lib`
- `cargo make fmt`
- `cargo make lint`