
Async handling of Onion Messages with BackpressureQueue #10219

Merged
ziggie1984 merged 3 commits into lightningnetwork:master from gijswijs:onion-messaging-2
Mar 30, 2026

Conversation

@gijswijs
Collaborator

@gijswijs gijswijs commented Sep 15, 2025

This pull request refactors the onion message handling to be asynchronous, addressing potential performance bottlenecks in the synchronous design from #10089.

Before this PR: ChannelMailbox (bounded Go channel)

Each onion peer actor was created with the default ChannelMailbox — a simple buffered Go channel. Looking at actor.go:82-86, when no MailboxSize was configured, it defaulted to 1:

mailboxCapacity := cfg.MailboxSize
if mailboxCapacity <= 0 {
    mailboxCapacity = 1
}

When a sender called Tell or Ask, the ChannelMailbox.Send() (mailbox.go:70-89) would do a select on:

  1. m.ch <- env — try to enqueue
  2. <-ctx.Done() — caller's context cancelled
  3. <-m.actorCtx.Done() — actor shutting down

This is a blocking risk. With a buffer of only 1, a second inbound onion message from the same peer would block the caller until the actor finished processing the first message. The caller here is peer/brontide.go — the peer's message reading goroutine. While blocked waiting to enqueue into the actor's mailbox:

  • The peer's entire read loop would stall — no other wire messages (HTLCs, channel updates, pings, etc.) from that peer could be processed.
  • If the onion message processing was slow (e.g., processOnionMessage involves sphinx decryption, node ID resolution, forwarding to another peer), the tiny buffer would fill immediately.

There was no message shedding — every message either got queued (eventually) or blocked. A malicious or chatty peer could effectively slow down their entire connection by flooding onion messages.

After this PR: BackpressureMailbox with RED

The PR replaces this with:

  1. Larger buffer: DefaultOnionMailboxSize = 50 instead of 1
  2. Random Early Detection (RED): Starting at depth 40 (DefaultMinREDThreshold), messages are probabilistically dropped. At depth 50 (full), all messages are dropped.
  3. Per-peer customization: The OnionActorOpts callback allows different peers to get different drop policies in the future.

This means the peer's read loop will almost never block on onion message delivery — messages are either quickly enqueued (plenty of buffer) or dropped under load rather than causing backpressure into the connection handler.


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @gijswijs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Lightning Network node's capability to handle onion messages by introducing an asynchronous processing model. By integrating a BackpressureQueue with Random Early Drop, the system can now gracefully manage high volumes of onion messages, preventing performance degradation and ensuring the stability of peer communication. This makes the node a more robust and efficient router for onion message traffic, laying the groundwork for future privacy-preserving communication features.

Highlights

  • Asynchronous Onion Message Handling: The core onion message processing has been refactored to operate asynchronously, utilizing a new BackpressureQueue to prevent synchronous processing from blocking critical peer communication. This ensures that incoming onion messages are enqueued and processed in a background goroutine, improving node stability and throughput.
  • Introduction of BackpressureQueue: A new generic, bounded queue named BackpressureQueue has been implemented. This queue incorporates a 'Random Early Drop' (RED) predicate, which intelligently discards incoming messages as the queue approaches its capacity. This mechanism provides backpressure to the network under heavy load, preventing resource exhaustion without completely blocking new messages.
  • Onion Message Protocol Support: New lnwire.OnionMessage types, corresponding feature bits (OnionMessagesOptional, OnionMessagesRequired), and RPC endpoints (SendOnionMessage, SubscribeOnionMessages) have been added. This enables nodes to advertise their capability to handle onion messages and provides an interface for sending and receiving them.
  • Blinded Path Enhancements for Onion Messages: The htlcswitch/hop package has been updated to specifically handle onion messages within blinded paths. This includes adding a NextNodeID field to ForwardingInfo for routing messages that don't rely on channel IDs, and adapting payload parsing and validation logic to differentiate between regular payments and onion messages.
  • Integration Tests for Onion Messages: New integration tests have been introduced to validate the functionality of onion messages, covering both basic sending/receiving between two nodes and multi-hop forwarding through a network of three nodes. These tests ensure the robustness and correctness of the new asynchronous handling and routing logic.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-designed refactoring to handle onion messages asynchronously. The introduction of a generic BackpressureQueue with Random Early Drop (RED) is a great approach to improve node stability and throughput under heavy message load. The code is well-structured, and the inclusion of new integration tests is commendable. My review focuses on a couple of minor areas for code cleanup and simplification to enhance maintainability.

@saubyk saubyk added this to v0.21 Jan 6, 2026
@saubyk saubyk moved this to In review in v0.21 Jan 6, 2026
@gijswijs gijswijs force-pushed the onion-messaging-2 branch from 27d4f31 to 2768035 Compare March 2, 2026 21:33
@lightninglabs-deploy lightninglabs-deploy added the severity-critical Requires expert review - security/consensus critical label Mar 2, 2026
@gijswijs
Collaborator Author

gijswijs commented Mar 2, 2026

Pushed a new version of this PR. It's based on the latest version of #10089. The BackpressureMailbox now uses queue.BackpressureQueue[envelope[M,R]] as its core buffer. It implements all Mailbox interface methods directly, using BackpressureQueue.Enqueue/TryEnqueue for sends, ReceiveChan() for the receive iterator, and Close() for lifecycle. The Send method merges the caller context with the actor context via context.AfterFunc so the blocking Enqueue respects both cancellation signals.

Would love to hear your thoughts @Roasbeef

@gijswijs gijswijs requested a review from Roasbeef March 2, 2026 21:37
@gijswijs gijswijs force-pushed the onion-messaging-2 branch from 2768035 to 349bd0c Compare March 3, 2026 14:24
@lightninglabs-deploy lightninglabs-deploy added severity-critical Requires expert review - security/consensus critical and removed severity-critical Requires expert review - security/consensus critical labels Mar 3, 2026
@saubyk saubyk requested review from bitromortac and ziggie1984 and removed request for ziggie1984 March 3, 2026 16:20
@gijswijs gijswijs force-pushed the onion-messaging-2 branch from 349bd0c to f9f0e14 Compare March 9, 2026 11:44
@gijswijs gijswijs force-pushed the onion-messaging-2 branch from f9f0e14 to 2337f42 Compare March 9, 2026 12:51
@gijswijs
Collaborator Author

gijswijs commented Mar 9, 2026

Rebased the PR after the merge of #10089. Also updated the description to better explain what the BackpressureQueue is doing in the context of the onion message actor.

Contributor

@Abdulkbk Abdulkbk left a comment


Did a first pass and left some comments. Will test using Docker.

Collaborator

@bitromortac bitromortac left a comment


Really cool work 💯. It already looks quite good to me, which is why I also included some nits.

@gijswijs gijswijs force-pushed the onion-messaging-2 branch from 1ec24b0 to b296d74 Compare March 12, 2026 19:36
@gijswijs gijswijs requested a review from Abdulkbk March 12, 2026 19:40
Collaborator Author

@gijswijs gijswijs left a comment


I've addressed all comments by @Abdulkbk and @bitromortac. Quite a few were in relation to the first commit (6b806f2), which was actually cherry-picked from #9838, so we should carry those changes over to that PR and rebase this one before merging. Or we might forgo that PR altogether (since it's a single-commit PR) and merge this one instead.

@gijswijs gijswijs requested a review from bitromortac March 13, 2026 09:52
Collaborator

@bitromortac bitromortac left a comment


Thanks for the updates, it looks very close, but I have some remarks.

I think we should also be more defensive in Brontide's readHandler. We use Tell there, which in principle could block (if there's a bug), but we pass in a context.TODO(). So it's probably better to set a context with a timeout to prevent the handler from accidentally blocking?

Contributor

@Abdulkbk Abdulkbk left a comment


LGTM, pending my comment and bitromortac's.

@lightninglabs-deploy
Collaborator

@Roasbeef: review reminder
@gijswijs, remember to re-request review from reviewers when ready

Collaborator

@erickcestari erickcestari left a comment


Nice work! I ran some load tests comparing the current master branch with this PR, and it significantly improves how the lnd node handles bursts of onion messages.

Member

@Roasbeef Roasbeef left a comment


LGTM 🫜

@gijswijs gijswijs force-pushed the onion-messaging-2 branch from b296d74 to ca957e5 Compare March 27, 2026 15:39
Collaborator

@erickcestari erickcestari left a comment


LGTM 🧅

Collaborator

@bitromortac bitromortac left a comment


LGTM 🎉

Add a generic BackpressureQueue that uses a DropPredicate to proactively
shed load before the queue is completely full.

Two predicate types are provided:
- DropCheckFunc: length-only drop decision (func(queueLen int) bool)
- DropPredicate[T]: item-aware drop decision

RandomEarlyDrop returns a DropCheckFunc since RED only considers queue
depth. The AsDropPredicate helper adapts it to DropPredicate[T] for use
with BackpressureQueue.

In addition to the blocking Enqueue/Dequeue methods, the queue exposes
TryEnqueue (non-blocking send with drop check), Len, ReceiveChan, and
Close. These are needed by the actor package's BackpressureMailbox which
uses BackpressureQueue as its core buffer while implementing the Mailbox
interface's select-based iteration and lifecycle methods.

Property-based tests using pgregory.net/rapid verify queue invariants
(capacity bounds, FIFO ordering, model consistency) across randomized
enqueue/dequeue sequences with RED enabled.

Add BackpressureMailbox, a Mailbox implementation backed by
queue.BackpressureQueue that consults a queue.DropCheckFunc on every
Send/TrySend to enable RED-style load shedding before the mailbox is
full.

Add MailboxFactory type and ActorOption functional options
(WithMailboxFactory, WithMailboxSize) so callers can inject custom
mailbox implementations when spawning actors via RegisterWithSystem
or ServiceKey.Spawn.

This commit adds per-peer backpressure control to the onion message
actor system by introducing a BackpressureMailbox that uses Random
Early Detection (RED) to probabilistically drop messages when the
per-peer queue depth exceeds a configurable threshold.

The OnionActorFactory type now accepts variadic ActorOptions, allowing
the spawn call site (brontide) to provide per-peer mailbox configuration.
A DefaultOnionActorOpts helper provides the standard RED thresholds so
callers don't need to wire up the BackpressureMailbox manually.

Key changes:
- OnionActorFactory signature extended with ...ActorOption[*Request,
  *Response] so backpressure policy is no longer baked into the factory.
- NewOnionActorFactory drops its shouldDrop parameter; it forwards opts
  through to serviceKey.Spawn.
- DefaultOnionActorOpts(shouldDrop) returns the default backpressure
  options (BackpressureMailbox + DefaultOnionMailboxSize).
- peer.Config gains OnionActorOpts callback for per-peer customization.
- server.go creates default opts once and returns them for every peer.
@gijswijs gijswijs force-pushed the onion-messaging-2 branch from ca957e5 to e87f4bf Compare March 28, 2026 11:43
@ziggie1984 ziggie1984 merged commit d91e19c into lightningnetwork:master Mar 30, 2026
38 of 44 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in v0.21 Mar 30, 2026

Labels

no-changelog, onion routing, severity-critical (Requires expert review - security/consensus critical)

Projects

Status: Done


8 participants