
WIP: Harden deferred notebook import/replay to avoid sync stalls and retry thrash#409

Open
blakebyrnes wants to merge 1 commit into main from pr/import-replay-security-flow-analysis

Conversation

@blakebyrnes
Contributor

Why

When notebook audits are temporarily unavailable, import was mixing safety checks with network recovery work in ways that could hurt liveness. In particular, the previous path could block import behind notary connection/update work, and could emit MissingState under deferred-queue saturation, which can lead to retry churn.

What changed

  • Added an import-safe notebook audit verification mode with bounded wait and no inline notary update/connect side effects.
  • Kept background notary update behavior for normal catch-up paths.
  • Changed deferred queue saturation to header-only import instead of MissingState.
  • Preserved justifications on deferred header split and allowed justified/finalized upgrades to reimport.
  • Bounded replay scan work per pass to avoid long blocked scans.
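The bounded replay scan above can be sketched as follows. This is an illustrative sketch, not the PR's actual implementation: `PendingImport`, `scan_replay_queue`, and `MAX_SCAN_PER_PASS` are hypothetical names. The idea is that each pass inspects at most a fixed number of queued entries, replays the ready ones, and requeues the rest, so no single pass can block on an unbounded queue walk.

```rust
// Hypothetical sketch of bounded per-pass replay scanning.
const MAX_SCAN_PER_PASS: usize = 16;

#[derive(Debug)]
struct PendingImport {
    block_number: u64,
    parent_state_ready: bool,
}

/// Scan at most `max_scan` entries; return the ones ready to replay and
/// keep unready entries queued in their original order.
fn scan_replay_queue(
    queue: &mut Vec<PendingImport>,
    max_scan: usize,
) -> Vec<PendingImport> {
    let scan_len = queue.len().min(max_scan);
    let mut ready = Vec::new();
    let mut deferred = Vec::new();
    for entry in queue.drain(..scan_len) {
        if entry.parent_state_ready {
            ready.push(entry);
        } else {
            deferred.push(entry);
        }
    }
    // Unready scanned entries go back to the front, ahead of the unscanned tail.
    let mut unscanned: Vec<PendingImport> = queue.drain(..).collect();
    *queue = deferred;
    queue.append(&mut unscanned);
    ready
}

fn main() {
    // 20 queued entries, every even-numbered block has its parent state ready.
    let mut queue: Vec<PendingImport> = (1..=20)
        .map(|n| PendingImport { block_number: n, parent_state_ready: n % 2 == 0 })
        .collect();
    let ready = scan_replay_queue(&mut queue, MAX_SCAN_PER_PASS);
    println!("replayed {} entries, {} still queued", ready.len(), queue.len());
}
```

Only the first 16 entries are touched per pass; the remaining 4 wait for the next pass, which is what keeps a saturated queue from turning into one long blocked scan.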

Edge cases handled

  • Missing parent state during execution paths.
  • Deferred queue saturation under sustained load.
  • Replay queues with mixed ready/unready entries.
  • Deferred blocks that later arrive with justification/finality upgrades.
  • Unsafe defer attempts with unresolved intermediates.

Impact

  • Lower risk of import stalls during transient notary/archive outages.
  • Lower risk of retry-thrash behavior from saturation paths.
  • Better deterministic replay recovery while preserving existing consensus safety checks.

Validation

  • cargo test -p argon-node-consensus --lib
  • cargo make fmt
  • cargo make lint

@blakebyrnes blakebyrnes changed the title Harden deferred notebook import/replay to avoid sync stalls and retry thrash WIP: Harden deferred notebook import/replay to avoid sync stalls and retry thrash Feb 23, 2026
@blakebyrnes blakebyrnes marked this pull request as draft February 23, 2026 15:33
@blakebyrnes blakebyrnes marked this pull request as ready for review February 26, 2026 18:15
Copilot AI review requested due to automatic review settings February 26, 2026 18:15
Contributor

Copilot AI left a comment


Pull request overview

This PR hardens deferred notebook import and replay logic to prevent sync stalls and retry thrashing during transient notary/archive unavailability. The changes introduce bounded timeouts, header-only fallback behavior when queues are saturated, and separate import-safe vs. background notebook verification modes.

Changes:

  • Adds import-safe notebook verification with bounded timeouts and no side effects
  • Changes deferred queue saturation behavior from MissingState to header-only import
  • Introduces bounded replay scanning to prevent long-running queue processing

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
node/consensus/src/test.rs Adds comprehensive test coverage for deferred import, queue saturation, and replay recovery scenarios
node/consensus/src/pending_import_replay.rs Implements pending import replay queue with persistence, bounded scanning, and header-only fallback logic
node/consensus/src/notary_client.rs Refactors notebook audit verification to support bounded timeouts and import-safe mode without notary updates
node/consensus/src/mock_importer.rs Provides test infrastructure including mock chain client and importer factory functions
node/consensus/src/lib.rs Moves test module to separate file and adds pending_import_replay module
node/consensus/src/import_queue.rs Refactors import logic with pre-import gates, deferred import handling, and background replay task
node/consensus/src/error.rs Adds new error variants for deferred audits and pending import constraints, plus retryable error detection


Add a persisted pending-import replay queue and wire import-gating for deferred full-body execution.

- persist/recover pending full imports via aux storage
- replay deferred imports when parent state and notebook prerequisites are ready
- keep header-only imports bounded with queue capacity and explicit queue-full behavior
- split consensus test scaffolding into dedicated mock and test modules
- harden notary audit catch-up to defer on unavailable audits instead of misclassifying as invalid digest

fix(consensus): harden deferred notebook import and replay

Block import had multiple failure modes when notebook audits were temporarily unavailable: import-time verification could pull in notary update/connect work while the import lock was held, deferred queue saturation returned MissingState and could trigger retry churn, and deferred reimport short-circuits could skip justification/finality upgrades.

This change adds an import-safe notebook verification path with bounded wait and no inline notary updates, treats deferred queue saturation as header-only import (instead of MissingState), preserves justifications on deferred header split, and allows justified/finalized upgrades to reimport. Replay scanning is also bounded per pass to avoid long blocked scans.

Impact: lower risk of sync stalls and peer-thrashing during transient notary/archive outages while preserving consensus safety checks and deterministic recovery behavior.

Validation: cargo test -p argon-node-consensus --lib, cargo make fmt, cargo make lint.
Copilot AI review requested due to automatic review settings February 26, 2026 19:49
@blakebyrnes blakebyrnes force-pushed the pr/import-replay-security-flow-analysis branch from 968e9a5 to b335547 February 26, 2026 19:49
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.



Comment on lines +289 to +299
if pending_imports.len() >= MAX_PENDING_IMPORTS {
    warn!(
        block_hash = ?pending_import.hash,
        number = ?pending_import.block.header.number(),
        queue_len = pending_imports.len(),
        "Pending replay queue is full while requeueing deferred import; keeping deferred import"
    );
}
pending_imports.push(pending_import);
sort_pending_imports(&mut pending_imports);
self.persist_snapshot(&pending_imports);

Copilot AI Feb 26, 2026


requeue_pending_import logs when the queue is at/above MAX_PENDING_IMPORTS, but it still pushes the entry anyway. This defeats the queue bound and can grow the in-memory and persisted aux payload beyond the intended cap under repeated retries (the warning message also says "queue is full" but behavior still enqueues). Consider returning early (dropping or deferring persistence) when full, or evicting/compacting before pushing so pending_imports.len() never exceeds MAX_PENDING_IMPORTS.
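One way to enforce the bound the comment suggests is an early return before pushing. This is a minimal sketch, not the PR's code: `MAX_PENDING_IMPORTS` follows the snippet above, but `requeue_bounded` is a hypothetical helper, and real code might prefer evicting a lower-priority entry over dropping the new one.

```rust
// Hypothetical sketch: reject (rather than enqueue) when the queue is at capacity,
// so neither the in-memory queue nor the persisted snapshot grows past the cap.
const MAX_PENDING_IMPORTS: usize = 4;

/// Returns true if the entry was enqueued, false if the queue was full.
fn requeue_bounded<T>(pending_imports: &mut Vec<T>, entry: T) -> bool {
    if pending_imports.len() >= MAX_PENDING_IMPORTS {
        // Queue is full: drop the entry instead of growing past the cap.
        return false;
    }
    pending_imports.push(entry);
    true
}

fn main() {
    let mut queue = vec![1, 2, 3, 4];
    let accepted = requeue_bounded(&mut queue, 5);
    println!("accepted={accepted}, len={}", queue.len());
}
```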

Comment on lines +530 to +537
let mut replayed = 0usize;
while let Some((pending_import, replay_context)) = self
    .pending_full_import_queue
    .dequeue_ready_for_replay(&self.notary_client)
    .await
{
    replayed = replayed.saturating_add(1);
    let mut replay_retry_block =

Copilot AI Feb 26, 2026


replayed is incremented in replay_pending_full_imports but never used. If the project treats warnings as errors (e.g., via clippy/lint settings), this will fail the build. Either remove it or use it (e.g., for a debug/info log or metrics) to justify keeping the counter.

Comment on lines +119 to +128
sp_blockchain::Info {
    finalized_hash: fin.1,
    finalized_number: fin.0,
    finalized_state: None,
    best_hash: best.1,
    best_number: best.0,
    block_gap: None,
    genesis_hash: best.1,
    number_leaves: 0,
}

Copilot AI Feb 26, 2026


In the MemChain test client, info().genesis_hash is set to best.1, which changes as best changes. genesis_hash is expected to be stable and refer to the chain's genesis block; using the current best hash can break any logic that relies on client.info().genesis_hash (e.g., compute puzzle key block derivation). Store the genesis hash once in MemChain and return that here instead.
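The fix the comment proposes, pinning the genesis hash at construction so it never tracks the moving best hash, can be sketched like this. The struct and field names are illustrative placeholders, not the PR's actual `MemChain` definition.

```rust
// Hypothetical sketch: store the genesis hash once and return the pinned value,
// rather than deriving it from the current best hash.
struct MemChain {
    genesis_hash: [u8; 32],
    best: (u64, [u8; 32]),
}

impl MemChain {
    fn new(genesis_hash: [u8; 32]) -> Self {
        // Genesis starts as best, but only `best` is allowed to move afterward.
        Self { genesis_hash, best: (0, genesis_hash) }
    }

    fn set_best(&mut self, number: u64, hash: [u8; 32]) {
        self.best = (number, hash);
    }

    fn genesis_hash(&self) -> [u8; 32] {
        // Always the pinned value, never the moving best hash.
        self.genesis_hash
    }
}

fn main() {
    let genesis = [0u8; 32];
    let mut chain = MemChain::new(genesis);
    chain.set_best(10, [1u8; 32]);
    println!("genesis stable: {}", chain.genesis_hash() == genesis);
}
```

Any caller that derives keys from `client.info().genesis_hash` then sees a stable value regardless of how the best block advances.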
