
fix(net): fix RejectedExecutionException during shutdown trxHandlePool#6692

Open
0xbigapple wants to merge 1 commit into tronprotocol:develop from 0xbigapple:fix/threadpool-submit-issue

Conversation

@0xbigapple
Collaborator

What does this PR do?

Fix RejectedExecutionException during node shutdown in TransactionsMsgHandler.

  • Correct shutdown order (core): stop producer smartContractExecutor first, then consumer trxHandlePool, guaranteeing trxHandlePool is alive while the scheduler runs.
  • isClosed flag: early exit in processMessage() at entry and during iteration. Covers in-flight messages in the narrow window between PeerManager.close() and handler close.
  • RejectedExecutionException catch: handles TOCTOU race between isClosed check and submit().
  • Queue cleanup: clear smartContractQueue and queue after both pools terminate.

Why are these changes required?

The original close() shut down the consumer pool (trxHandlePool) before the producer scheduler (smartContractExecutor). During this window, handleSmartContract() was still draining smartContractQueue and calling submit(trxHandlePool, ...) on an already-terminated pool, throwing RejectedExecutionException and polluting shutdown logs.
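The core failure mode is generic JDK behavior: a thread pool created via the standard `Executors` factories uses the default `AbortPolicy`, which rejects any `submit()` after `shutdown()`. A minimal standalone reproduction of the race described above (class and method names here are illustrative, not java-tron code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Minimal reproduction: submitting to a pool that was shut down first,
// as the consumer trxHandlePool was in the original close() ordering.
public class ShutdownRace {
  static String submitAfterShutdown() {
    ExecutorService trxHandlePool = Executors.newFixedThreadPool(1);
    trxHandlePool.shutdown(); // consumer stopped first (old close() order)
    try {
      trxHandlePool.submit(() -> { }); // producer still draining its queue
      return "submitted";
    } catch (RejectedExecutionException e) {
      return "rejected"; // default AbortPolicy rejects work after shutdown()
    }
  }

  public static void main(String[] args) {
    System.out.println(submitAfterShutdown());
  }
}
```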

Thread model

P2P threads ── processMessage() ──┬── submit(trxHandlePool)        [normal txs]
                                  └── smartContractQueue.offer()    [smart contracts]

smartContractExecutor (single-thread, 20ms delay)
    └── handleSmartContract() → take() → submit(trxHandlePool)

close()
    1. isClosed = true
    2. shutdownAndAwaitTermination(smartContractExecutor)  ← producer first
    3. shutdownAndAwaitTermination(trxHandlePool)          ← consumer second
    4. smartContractQueue.clear(); queue.clear()
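The close() sequence above might be sketched in plain Java as follows. This is a simplified stand-in, not the actual handler: shutdownAndAwait is a local substitute for java-tron's shutdownAndAwaitTermination helper, and the fields mirror the names in the PR description.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the corrected shutdown ordering.
public class CloseOrderSketch {
  volatile boolean isClosed = false;
  final ScheduledExecutorService smartContractExecutor =
      Executors.newSingleThreadScheduledExecutor();
  final ExecutorService trxHandlePool = Executors.newFixedThreadPool(4);
  final BlockingQueue<Object> smartContractQueue = new LinkedBlockingQueue<>();

  void close() throws InterruptedException {
    isClosed = true;                         // 1. stop accepting new work
    shutdownAndAwait(smartContractExecutor); // 2. producer first
    shutdownAndAwait(trxHandlePool);         // 3. consumer second
    smartContractQueue.clear();              // 4. drop pending events
  }

  // Stand-in for the project's shutdownAndAwaitTermination helper.
  private static void shutdownAndAwait(ExecutorService es) throws InterruptedException {
    es.shutdown();
    if (!es.awaitTermination(5, TimeUnit.SECONDS)) {
      es.shutdownNow();
    }
  }
}
```

Because the producer scheduler is fully terminated before the consumer pool shuts down, no task can reach trxHandlePool after it stops accepting work.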

This PR has been tested by:

  • Unit Tests
  • Manual Testing: restarted the node repeatedly and verified via log grep:
    • No RejectedExecutionException
    • Correct shutdown order (contract-msg-handler shutdown done before trx-msg-handler shutdown done)

Follow up

N/A

Extra details

@github-actions github-actions Bot requested review from 317787106 and xxo1shine April 17, 2026 04:22
} catch (RejectedExecutionException e) {
  logger.warn("Submit task to {} failed", trxEsName);
  break;
}
Collaborator

[SHOULD] The exception handling in handleTransaction is not the same as in handleSmartContract(); the latter lacks a RejectedExecutionException catch. They should handle the same exceptions.

Collaborator Author
@0xbigapple Apr 17, 2026

Thanks for the review. The two submit() sites are asymmetric on purpose:

  • processMessage() is invoked by the p2p dispatch path. During close(), an already in-flight or concurrently delivered message can still enter processMessage() before peer removal is fully observed, while the pool is already shutting down. That's the real REE window, so this path needs the catch.
  • handleSmartContract() runs inside smartContractExecutor. In close() we deliberately shut the scheduler down before trxHandlePool, so by the time the pool is closed the scheduler has already terminated — no race window. The existing catch (Exception e) already absorbs anything unexpected.

Adding an REE catch here would document a case that cannot happen given the shutdown ordering, and invite future readers to ask "when can this fire?". So, I'd like to keep handleSmartContract() as-is.

Comment on lines +69 to +70
smartContractQueue.clear();
queue.clear();
Collaborator
@317787106 Apr 17, 2026

[NIT] These two clear() calls may be unnecessary. After shutdownAndAwaitTermination(trxHandlePool) returns, all tasks have completed, and queue is the backing queue of the thread pool.

These two comments may also be unnecessary; the code is simple and clear.

Collaborator Author

Thanks for the review; one nuance though: smartContractQueue is not a backing queue. It's an independent LinkedBlockingQueue<TrxEvent>: processMessage() offers into it, and smartContractExecutor takes from it only when queue.size() < MAX_SMART_CONTRACT_SUBMIT_SIZE (100). So under backpressure (when trxHandlePool's queue is saturated), the scheduler stops draining smartContractQueue, and it can hold up to MAX_TRX_SIZE (50_000) pending TrxEvents at shutdown. Explicit cleanup of state that may contain data matches the convention elsewhere in java-tron, so I'd like to keep smartContractQueue.clear().

queue is the backing queue as you said — queue.clear() is technically redundant after shutdownAndAwaitTermination. I added it purely for symmetry with smartContractQueue.clear(). If you think the symmetry isn't worth the redundancy, happy to drop queue.clear() alone.

I'll remove the latter two comments (// Then shutdown the worker pool ... and // Discard any remaining items ...).
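The two-queue interaction described in this thread can be sketched as a small standalone drain step. This is a hypothetical illustration of the backpressure condition, not the actual handleSmartContract() body; the constant name and the "stands in for submit()" hand-off follow the discussion above.

```java
import java.util.concurrent.BlockingQueue;

// Sketch: the scheduler only moves events from smartContractQueue toward
// trxHandlePool while the pool's backing queue has headroom. Once the
// backing queue is saturated, pending events pile up in smartContractQueue.
public class DrainSketch {
  static final int MAX_SMART_CONTRACT_SUBMIT_SIZE = 100;

  // Returns how many events were handed off in this scheduler tick.
  static int drainOnce(BlockingQueue<Object> smartContractQueue,
                       BlockingQueue<Object> poolBackingQueue) {
    int submitted = 0;
    while (poolBackingQueue.size() < MAX_SMART_CONTRACT_SUBMIT_SIZE) {
      Object event = smartContractQueue.poll();
      if (event == null) {
        break; // nothing pending
      }
      poolBackingQueue.offer(event); // stands in for submit(trxHandlePool, ...)
      submitted++;
    }
    return submitted;
  }
}
```

Under saturation drainOnce() hands off nothing, which is why smartContractQueue can still hold data at shutdown and why clearing it is not redundant.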

int dropSmartContractCount = 0;
for (Transaction trx : transactionsMessage.getTransactions().getTransactionsList()) {
  if (isClosed) {
    logger.warn("TransactionsMsgHandler is closed during processing, stop submit");
Collaborator

[NIT] Some transactions are processed, but some are dropped. Do we need to log the number of unprocessed txs?

Collaborator Author

Thanks. I think it's fine to leave as-is — the two break sites only fire during shutdown, so a numeric drop count there can't really drive any operational action. The existing simple warn lines are enough as "this path was exercised" signals.

The existing dropSmartContractCount at L100-104 is a different case — that's runtime backpressure (queue saturated), which is an actionable ops signal, so it's already counted.

@Override
public void processMessage(PeerConnection peer, TronMessage msg) throws P2pException {
  if (isClosed) {
    logger.warn("TransactionsMsgHandler is closed, drop message");

[QUESTION] Both isClosed guards log at WARN level. During a normal node shutdown this will fire once per in-flight message, potentially producing a large burst of WARN entries in the log. Is there a reason to prefer WARN over DEBUG or INFO here?

Collaborator Author

Good point — I wasn't confident about the right level here. I checked the rest of the codebase and java-tron has two conventions that both have a claim on these lines:

  • "Drop" runtime events → WARN (AdvService.java:200, P2pEventHandlerImpl.java:158, InventoryMsgHandler.java:54/60/66, and the other two Drop lines in this same file at L117/L155).
  • Lifecycle close/shutdown normal path → INFO (TronNetService.java:126 "Net service closed successfully", HistoryEventService.java:59, ConsensusService.java:89/91, ExecutorServiceManager.java:67/84, AbstractService.java:36/41, BackupServer.java:107).

L80/L94 sit on the boundary. The existing WARN followed the local "Drop" convention in this file. But your actionability argument is fair — during shutdown ops can't act on these, whereas the other Drop WARNs are actual runtime anomalies worth investigating.

Happy to flip both to INFO to match Convention B (closing-window expected event) if you agree, or keep WARN to stay consistent with the other Drop lines in this file. DEBUG feels too quiet since a one-time record of "messages were dropped during shutdown" is still useful for post-mortem. What do you think?


Agreed. Changing L80, L94, and L109 to INFO is the better choice. During normal node shutdown these are expected events, so WARN is too noisy and risks drowning out real runtime issues. The other Drop WARNs in this file (runtime backpressure) should stay as-is.


@Override
public void processMessage(PeerConnection peer, TronMessage msg) throws P2pException {
  if (isClosed) {
Collaborator

[SHOULD] Based on your problem description, the exception occurs in handleSmartContract(). You don't need to modify processMessage() much; this check can be removed.

Collaborator Author
@0xbigapple Apr 20, 2026

Good point. The PR description's "Why" section only covers the handleSmartContract() REE and doesn't spell out the second race; that's my oversight. The other scenario is the processMessage()-side race, which I described in the reply to @317787106 in #6692 (comment): during shutdown, an already in-flight or concurrently delivered TransactionsMessage can still enter processMessage() before peer removal is fully observed, while trxHandlePool is already shutting down. That's the real REE window, and the isClosed checks guard it as defense-in-depth.

int trxHandlePoolQueueSize = 0;
int dropSmartContractCount = 0;
for (Transaction trx : transactionsMessage.getTransactions().getTransactionsList()) {
  if (isClosed) {
Collaborator

[SHOULD] Duplicate checks can be deleted.

@halibobo1205 halibobo1205 added the topic:net p2p net work, synchronization label Apr 18, 2026
@halibobo1205 halibobo1205 added this to the GreatVoyage-v4.8.2 milestone Apr 18, 2026
