Skip to content

feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip#210

Open
sieniven wants to merge 3 commits intomainfrom
niven/flashblocks-no-reorg
Open

feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip#210
sieniven wants to merge 3 commits intomainfrom
niven/flashblocks-no-reorg

Conversation

@sieniven
Copy link
Contributor

Summary

Fixes the websocket publishing logic on the flashblocks builder into the p2p node service instead. This ensures zero re-org on subscribed flashblocks RPC nodes by ensuring p2p gossip to follower sequencers is a blocking successful operation first before gossiping the flashblocks to RPC nodes.

This allows for atomicity on flashblocks replay during sequencer switches / failures of the leader sequencer, to replay consistent gossiped flashblocks across RPC nodes. This ensures flashblocks re-org is guaranteed protected on RPC nodes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sieniven
Copy link
Contributor Author

@claude review this PR

@github-actions
Copy link
Contributor

github-actions bot commented Mar 22, 2026

Claude finished @sieniven's task in 3m 31s —— View job


PR Review: feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip

  • Gather context and understand the PR
  • Read changed files
  • Analyze logic correctness and atomicity guarantees
  • Check for bugs, edge cases, and error handling
  • Provide review feedback

Summary

Good architectural direction — removing ws_pub from the builder and consolidating publish responsibility into the p2p Node simplifies the builder and keeps WS publishing coordinated with p2p gossip. The Message trait → concrete Message enum refactor also reduces unnecessary generics. However, there are several issues worth addressing:


🔴 Critical: WS publishes before p2p gossip in Node::run() — opposite of the stated guarantee

crates/builder/src/p2p/mod.rs, lines ~157–162:

if let Message::OpFlashblockPayload(ref fb_payload) = message {
    let flashblock_byte_size = ws_pub.publish(fb_payload)?;   // ← WS first
    metrics.flashblock_byte_size_histogram.record(flashblock_byte_size as f64);
}
if let Err(e) = outgoing_streams_handler.broadcast_message(message).await {  // ← p2p second
    warn!(target: "flashblocks-p2p", "...");
}

The PR description states: "p2p gossip to follower sequencers is a blocking successful operation first before gossiping the flashblocks to RPC nodes."

But the code publishes to WebSocket subscribers before broadcast_message() to peers. This means the leader's own WS subscribers (RPC nodes) receive the flashblock before follower sequencers do, which is the exact re-org risk the PR claims to fix. The broadcast_message() call should come first, and ws_pub.publish() only after it returns (or at least returns without error).

Fix this →


🟡 Medium: ws_pub.publish()? kills the p2p node if broadcast channel has no receivers

crates/builder/src/p2p/mod.rs:

let flashblock_byte_size = ws_pub.publish(fb_payload)?;

WebSocketPublisher::publish() calls self.pipe.send(utf8_bytes) (a broadcast::Sender), which returns Err when there are no active receivers. The initial receiver created in WebSocketPublisher::new() is held by the listener_loop task, so in normal operation this is unlikely to fire. However, if the listener_loop task ever panics or exits unexpectedly, subsequent calls to ws_pub.publish() will kill the p2p node entirely via ?.

The previous code in builder.rs handled this with map_err(PayloadBuilderError::other)?, which only terminated the current build job (restartable), not the p2p node service. The handler's follower path (in handler.rs) correctly uses if let Err(e) = ws_pub.publish(...) { warn!(...) }. The Node path should follow the same pattern.

Fix this →


🟡 Medium: flashblock_byte_size_histogram metric silently drops when p2p is disabled

crates/builder/src/flashblocks/builder.rs (removed) vs crates/builder/src/p2p/mod.rs (added):

Previously, ctx.metrics.flashblock_byte_size_histogram.record(...) was called directly in the builder — always, regardless of p2p state. Now it's only recorded inside Node::run(), which only runs when p2p_enabled = true. Deployments with p2p_enabled = false will silently lose this metric after this change.


🟢 Good: behavior fix for no_tx_pool path

crates/builder/src/flashblocks/builder.rs:

Previously, built_fb_payload_tx.try_send(fb_payload) for the initial flashblock was guarded only by !rebuild_external_payload, meaning syncing FCUs (where no_tx_pool=true) would still trigger p2p gossip. The new code correctly gates it with !no_tx_pool && !rebuild_external_payload, consistent with the comment "not emitting flashblock if no_tx_pool in FCU, it's just syncing". This is a correct fix.


🟢 Good: Message trait → enum refactor

Removing the Message generic trait in favor of a concrete Message enum in types.rs is the right call for an internal protocol — reduces boilerplate, removes the generic parameter from Node<M>, and makes deserialization straightforward (serde_json::from_str::<Message>). The types.rs module is well-organized.


Minor: Default removed for NodeBuilder

Since NodeBuilder::new() now requires ws_pub and metrics, removing Default is correct. Worth verifying no external call sites use NodeBuilder::default() (doesn't appear to be any in the diff, but worth a quick grep if not already done).


Test coverage note

The test in p2p/mod.rs now uses OpFlashblockPayload::default() as the round-trip message and binds WebSocketPublisher to 127.0.0.1:0. Consider adding a test case that verifies the WS publish is called (or not) at the right point relative to p2p gossip, once the ordering fix is applied.


TL;DR: The main concern is the WS-before-p2p ordering in Node::run(), which inverts the core atomicity guarantee the PR is trying to establish. Suggest swapping ws_pub.publish() to run after broadcast_message(), and converting the ? to a warn! to avoid killing the p2p node on publish failure.

sieniven and others added 2 commits March 23, 2026 15:22
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… flag

Rename p2p/ module to broadcast/ to better reflect its dual role (p2p
gossip + WS publishing). Move wspub.rs into broadcast/, remove the
p2p_enabled config flag so the broadcast node always runs, fix WS
publish ordering to run after p2p gossip, and improve error handling
so ws_pub failures warn instead of crashing the node. Update stale
comments referencing the removed p2p_enabled flag.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Contributor Author

@sieniven sieniven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude please help to review again, check that all your previous mentioned points have been resolved

@sieniven
Copy link
Contributor Author

@claude please help to review again, check that all your previous mentioned points have been resolved

@github-actions
Copy link
Contributor

github-actions bot commented Mar 23, 2026

Claude finished @sieniven's task in 2m 13s —— View job


PR Re-Review: feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip

  • Gather context and read changed files
  • Check: WS publish ordering vs p2p gossip (🔴 Critical)
  • Check: ws_pub.publish()? error propagation (🟡 Medium)
  • Check: flashblock_byte_size_histogram metric with p2p disabled (🟡 Medium)
  • Check: Minor notes (NodeBuilder Default, test coverage)
  • Provide updated review feedback

All three previously raised issues have been resolved. Details below.

✅ Fixed: WS publish ordering (was 🔴 Critical)

crates/builder/src/broadcast/mod.rs, lines 169–181:

if let Err(e) = outgoing_streams_handler.broadcast_message(message.clone()).await {
    warn!(...)  // p2p first
}
if let Message::OpFlashblockPayload(ref fb_payload) = message {
    match ws_pub.publish(fb_payload) { ... }  // WS second
}

broadcast_message (p2p gossip) now runs before ws_pub.publish (WS). The ordering now matches the stated atomicity guarantee. ✅

✅ Fixed: ws_pub.publish()? error propagation (was 🟡 Medium)

The ? is replaced with a match that logs a warn! on failure — consistent with the follower path in handler.rs. The p2p node will no longer terminate if the WS broadcast channel has no receivers. ✅

✅ Fixed: Metric coverage when p2p disabled (was 🟡 Medium)

The p2p_enabled flag has been removed from args/op.rs. The broadcast node is now always spawned unconditionally in service.rs, so flashblock_byte_size_histogram is always recorded via Node::run(). ✅

✅ Resolved: NodeBuilder::default() call sites

No usages of NodeBuilder::default() exist in the codebase. service.rs correctly uses NodeBuilder::new(ws_pub, metrics). ✅

Data flow looks correct end-to-end

  • Leader path: builder → built_fb_payload_tx → handler built_fb_payload_rxp2p_txNode::run() outgoing_message_rx → p2p gossip first → WS publish second
  • Follower path: Node::run() incoming stream → handler p2p_rx → cache update → WS publish

No double-publishing between the two paths (outgoing vs. incoming are separate channels). ✅

Open (non-blocking): Test coverage

The test in broadcast/mod.rs validates node connectivity and message round-tripping. A dedicated test asserting WS publish is called after p2p gossip would make the ordering guarantee explicit and regression-proof, but this is not a blocker.


Summary: All three critical/medium issues from the previous review are fully addressed. The PR achieves its stated goal — p2p gossip to follower sequencers runs as a blocking operation before WS publish to RPC subscribers. LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant