KAFKA-9796; Broker shutdown could be stuck forever under certain conditions#8448
rajinisivaram merged 6 commits into apache:trunk
I would have thought we want to stop accepting before we stop processing.
Indeed, it is a bit counterintuitive. The reason is that the Acceptor can be blocked by the Processor and thus can't be shut down when that happens. We could probably keep a more intuitive ordering by decoupling the shutdown signal from the awaiting of the shutdown. Let me check this.
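A minimal sketch of that decoupling, assuming a hypothetical `Worker` class rather than Kafka's actual Acceptor and Processor. The method names `initiateShutdown` and `awaitShutdown` are illustrative choices for this sketch: one method only signals, the other only waits, so every thread can be signaled before anyone blocks on any of them.

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical worker thread body; not Kafka's Acceptor or Processor.
class Worker extends Runnable {
  private val running = new AtomicBoolean(true)
  private val finished = new CountDownLatch(1)

  override def run(): Unit =
    try {
      while (running.get()) Thread.sleep(1) // stand-in for real work
    } finally finished.countDown()

  // Signal only: flips the flag and returns immediately.
  def initiateShutdown(): Unit = running.set(false)

  // Blocks until run() has exited.
  def awaitShutdown(): Unit = finished.await()
}

object ShutdownSketch {
  // Signal every worker first, then wait, so no wait can delay a later signal.
  def stopAll(workers: Seq[Worker]): Unit = {
    workers.foreach(_.initiateShutdown())
    workers.foreach(_.awaitShutdown())
  }
}
```

With a single combined `shutdown()` that signals and waits, stopping the first worker could block while the second one, which it depends on, has not even been told to stop. Note that `awaitShutdown()` in this sketch only returns if the worker's `run()` was actually started.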
@rajinisivaram The PR is ready to be reviewed.
ok to test
rajinisivaram left a comment
@dajac Thanks for the PR, looks good. Left some minor comments.
 * is used to delay processing client connections until server is fully initialized, e.g.
 * to ensure that all credentials have been loaded before authentications are performed.
 * Acceptors are always started during `startup` so that the bound port is known when this
 * method completes even when ephemeral ports are used. Incoming connections on this server
These two lines are still true, but removed from the comment?
Partially. The acceptors are not started, but they do start to listen. Let me rework the comment to include the part about the bound port though.
 * This method is used for delayed starting of data-plane processors if [[kafka.network.SocketServer#startup]]
 * was invoked with `startupProcessors=false`.
 * Start processing requests and new connections. This method is used for delayed starting of
 * data-plane processors if [[kafka.network.SocketServer#startup]] was invoked with
This is not just for data-plane processors?
Correct. Let me rework the comment.
 * listener before other listeners. This allows authorization metadata for other listeners to be
 * stored in Kafka topics in this cluster.
 *
 * @param authorizerFutures
 */
private def closeAll(): Unit = {
// Clear to unblock blocked acceptors
newConnections.asScala.foreach(_.close())
The blocked acceptor would then add another connection to this list, right? Do we close that one?
No, we don't close that one. Let me rework this.
externalReadyFuture.complete(null)
TestUtils.waitUntilTrue(() => listenerStarted(externalListener), "External listener not started")
} finally {
externalReadyFuture.complete(null)
Why? If it is for the failure case, then perhaps it should be in a catch block?
This is not needed. It is a leftover from my debugging. Let me remove it.
connect(testableServer, new ListenerName("EXTERNAL"), localAddr = InetAddress.getLocalHost)

// Wait to let the acceptor accepts the connections
Thread.sleep(100)
Can we replace sleep with some condition?
I have reworked this test. It still consistently fails without this patch and it does not have the sleep any more.
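As an illustration of the general idea of waiting on a condition instead of a fixed sleep, here is a minimal sketch. `WaitSketch.waitUntil` and its parameters are hypothetical names for this example; it is a stand-in in the spirit of the `TestUtils.waitUntilTrue` call quoted above, not that helper's implementation and not the reworked test itself.

```scala
object WaitSketch {
  // Polls `condition` every `pauseMs` until it holds or `timeoutMs` elapses.
  // Returns true if the condition held, false on timeout.
  def waitUntil(condition: () => Boolean, timeoutMs: Long = 5000L, pauseMs: Long = 10L): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    var ok = condition()
    while (!ok && System.nanoTime() < deadline) {
      Thread.sleep(pauseMs)
      ok = condition()
    }
    ok
  }
}
```

A test would then assert on the result, for example `assert(WaitSketch.waitUntil(() => queue.size == expected))`, where `queue` and `expected` are placeholders. The test proceeds as soon as the state is reached and fails with a clear timeout rather than depending on a guessed delay.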
retest this please
@rajinisivaram Thanks for the review! I have addressed your comments. Could you please have another look at it?
rajinisivaram left a comment
@dajac Thanks for the updates, looks good. Left just a couple of minor comments.
 * Initiates a graceful shutdown by signaling to stop and waiting for the shutdown to complete
 * Initiates a graceful shutdown by signaling to stop
 */
def shutdown(): Unit = {
Should we rename this method to be initiateShutdown() to be consistent with kafka.utils.ShutdownableThread?
while (!newConnections.isEmpty) {
newConnections.poll().close()
}
newConnections.clear()
clear() is unnecessary since we would expect the loop to clear (i.e. we shouldn't have code that clears without closing).
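The drain-and-close pattern under discussion can be sketched in isolation as follows. `DrainSketch.closeAll` is a hypothetical generic helper, not the method in `SocketServer`, and it polls until `poll()` returns null instead of checking `isEmpty` first; that is a variation chosen for this sketch.

```scala
import java.io.Closeable
import java.util.Queue

object DrainSketch {
  // Polls until the queue is empty, closing each item; returns how many were closed.
  def closeAll[T <: Closeable](queue: Queue[T]): Int = {
    var closed = 0
    var next = queue.poll()
    while (next != null) {
      next.close()
      closed += 1
      next = queue.poll()
    }
    closed
  }
}
```

Because every element leaves the queue through `poll()` and is closed right away, nothing is left for a trailing `clear()` to remove. On a bounded queue each `poll()` also frees a slot, which lets a producer blocked in `put()` proceed; anything it adds while the loop is still running is picked up by a later iteration.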
retest this please
@rajinisivaram Thanks. I have addressed your comments.
rajinisivaram left a comment
@dajac Thanks for the updates, LGTM.
ok to test
retest this please
Test failure not related, merging to trunk.
* 'trunk' of github.com:apache/kafka: (28 commits)
  MINOR: cleanup RocksDBStore tests (apache#8510)
  KAFKA-9818: Fix flaky test in RecordCollectorTest (apache#8507)
  MINOR: reduce impact of trace logging in replica hot path (apache#8468)
  KAFKA-6145: KIP-441: Add test scenarios to ensure rebalance convergence (apache#8475)
  KAFKA-9881: Convert integration test to verify measurements from RocksDB to unit test (apache#8501)
  MINOR: improve test coverage for dynamic LogConfig(s) (apache#7616)
  MINOR: Switch order of sections on tumbling and hopping windows in streams doc. Tumbling windows are defined as "special case of hopping time windows" - but hopping windows currently only explained later in the docs. (apache#8505)
  KAFKA-9819: Fix flaky test in StoreChangelogReaderTest (apache#8488)
  HOTFIX: fix active task process ratio metric recording
  KAFKA-9796; Ensure broker shutdown is not stuck when Acceptor is waiting on connection queue (apache#8448)
  MINOR: Use streaming iterator with decompression buffer when building offset map (apache#8494)
  Add log message in release.py (apache#8461)
  KAFKA-9854 Re-authenticating causes mismatched parse of response (apache#8471)
  KAFKA-9838; Add log concurrency test and fix minor race condition (apache#8476)
  KAFKA-9703; Free up compression buffer after splitting a large batch
  KAFKA-9779: Add Stream system test for 2.5 release (apache#8378)
  KAFKA-7885: TopologyDescription violates equals-hashCode contract. (apache#6210)
  MINOR: KafkaApis#handleOffsetDeleteRequest does not group result correctly (apache#8485)
  HOTFIX: don't close or wipe out someone else's state (apache#8478)
  MINOR: add process(Test)Messages to the README (apache#8480)
  ...
This patch reworks the SocketServer to always start the acceptor threads after the processor threads and to always stop the acceptor threads before the processor threads. It ensures that the acceptor shutdown is not blocked waiting for the processors to be fully shut down, by decoupling the shutdown signal from the awaiting of the shutdown. It also ensures that the processor threads drain their newConnections queue to unblock acceptors that may be waiting. However, the acceptors still bind during startup; only the processing of new connections and requests is further delayed.
The flow looks like this now: