[fix][broker] Fix unloadNamespaceBundlesGracefully can be stuck with extensible load manager #23349
Conversation
lhotari
left a comment
LGTM
This fix might change some behaviors; I will try to fix the root cause.
Thank you for fixing these issues. Regarding the last-broker shutdown issue: since there is no broker available to transfer ownerships to, we could simply shut down the last broker without waiting too long (after trying to clean up the ownerships). After the load balancer is shut down, no new assignment will happen during shutdown either. Also, even if there are some orphan ownerships in the channel, when the first broker (leader) starts, it will fix any orphan ones immediately. Regarding the skip-message issue, I think the current skip logic can return lookups too soon, and I don't see a good reason to keep it. For example, when there are concurrent Assign events, the skip-message logic could return deferred lookups too soon, before the Own event. I think it can just wait for the final Own event. Ideally, the channel logic shouldn't rely on skipped messages for its state changes.
I reran the |
It seems there are some failed tests in |
…extensible load manager (apache#23349) (cherry picked from commit e91574a)
…extensible load manager (apache#23349)

Motivation
I observed that the broker was stuck at close for a long time. It was stuck at BrokerService#unloadNamespaceBundlesGracefully, which calls disableBroker once and unloadNamespaceBundleAsync for all owned namespace bundles synchronously. Most of the issues happen when the broker is the last broker.

Issue 1: Free events won't be sent in overrideOwnership
In overrideOwnership, if no broker is available, a Free event will be created.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java
Line 1363 in 4ce0c75
However, since the dstBroker and sourceBroker fields are null in the Free event, an exception is thrown, so the Free event is never created or sent.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateData.java
Line 37 in 4ce0c75
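To make Issue 1 concrete, here is a minimal sketch of a validation rule that tolerates null brokers for Free events only. The names SketchState and SketchStateData are hypothetical stand-ins invented for illustration; this is not the actual ServiceUnitStateData code, and the real validation may check more than this.

```java
import java.util.Objects;

// Hypothetical stand-in for the state enum; only the two states discussed above.
enum SketchState { Owned, Free }

// Hypothetical, simplified data holder; not the actual ServiceUnitStateData.
final class SketchStateData {
    final SketchState state;
    final String dstBroker;     // may be null when state is Free
    final String sourceBroker;  // may be null
    final long versionId;

    SketchStateData(SketchState state, String dstBroker, String sourceBroker, long versionId) {
        this.state = Objects.requireNonNull(state, "state");
        // A Free event releases ownership, so there is no destination broker to require.
        if (state != SketchState.Free && (dstBroker == null || dstBroker.isEmpty())) {
            throw new IllegalArgumentException("dstBroker is required for state " + state);
        }
        this.dstBroker = dstBroker;
        this.sourceBroker = sourceBroker;
        this.versionId = versionId;
    }
}
```

With a rule like this, constructing a Free event with both brokers null succeeds, while an Owned event without a destination broker is still rejected.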
Issue 2: Free events could be skipped due to the same version id
The Free event is created in overrideOwnership based on the previous event on the same bundle from the table view. However, there might be in-flight events that are not in the table view yet. In ServiceUnitStateDataConflictResolver#shouldKeepLeft, if the version id is the same, the Free event will be skipped. Then, if the last event is an Owned event whose target broker is the broker being closed, waitForCleanups will wait until the timeout expires.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java
Lines 1394 to 1396 in 4ce0c75
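The version-id race in Issue 2 can be sketched as follows. This is a simplified model that assumes version ids grow by one per accepted update and that an incoming update is dropped unless its id is strictly newer; VersionConflictSketch and both of its methods are hypothetical, and the real ServiceUnitStateDataConflictResolver applies additional checks.

```java
// Hypothetical, simplified model of version-based conflict resolution.
final class VersionConflictSketch {

    // Returns true when the incoming update should be dropped and the
    // existing ("left") entry kept. Assumption: only strictly newer ids win.
    static boolean shouldKeepLeft(long existingVersionId, long incomingVersionId) {
        return incomingVersionId <= existingVersionId;
    }

    // Version id for a new event derived from the last event the writer has seen.
    static long nextVersionId(long lastSeenVersionId) {
        return lastSeenVersionId + 1;
    }
}
```

Suppose the table view still shows version 5 while an Owned event with version 6 is in flight. A Free event derived from the stale view also gets version 6, so it is dropped once the Owned event lands. If the view is refreshed first, the Free event is derived from version 6, gets version 7, and is accepted.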
Issue 3: __change_events topic prevents waitForCleanups from finishing
The __change_events topic's reader, which is managed by the system topic based topic policies service, will try to acquire the ownership. As a result, in waitForCleanups there will always be an Owned event for this topic's bundle. If the target broker is the broker itself, waitForCleanups cannot exit until the timeout expires.

Issue 4: unloadNamespaceBundleAsync will be stuck at getOwnershipAsync if there is no available broker
The broker unregisters itself in disableBroker. If it is the last broker, no brokers will be available after that. However, unloadNamespaceBundleAsync needs to publish an Unloaded event in unloadAsync.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/ExtensibleLoadManagerImpl.java
Lines 700 to 701 in 4ce0c75
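The following sketch illustrates why a bundle that keeps being re-acquired by the closing broker makes a cleanup wait run to its full timeout, as described in Issue 3. CleanupWaitSketch, its waitForCleanups signature, and the 5 ms polling interval are all assumptions made for illustration; the real waitForCleanups inspects the service unit state channel rather than a plain collection of owner names.

```java
import java.util.Collection;
import java.util.function.Supplier;

// Hypothetical polling loop; not the actual ServiceUnitStateChannelImpl logic.
final class CleanupWaitSketch {

    // Returns true once no bundle is owned by `self`, or false if the timeout
    // expires first (or the thread is interrupted while waiting).
    static boolean waitForCleanups(Supplier<Collection<String>> ownerSnapshot,
                                   String self, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (!ownerSnapshot.get().contains(self)) {
                return true; // nothing left to clean up
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // still owned when the timeout expired
            }
            try {
                Thread.sleep(5); // assumed polling interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

If the snapshot always contains the closing broker, for example because a reader keeps re-acquiring the bundle, the loop can only end through the timeout branch.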
Modifications
- ServiceUnitStateData to allow null dstBroker and sourceBroker for Free events.
- TableView#refreshAsync to refresh the entry set in ServiceUnitStateTableViewImpl#flush and call it before overrideOwnership.
- PulsarService#closeAsync.
- unregister

In addition, in disableBroker, cancel the load data report tasks and shut down the LoadDataStore objects to avoid being affected by the producers and readers on these two non-persistent topics.

Since LoadDataStore#get is still used in LeastResourceUsageWithWeight#select, don't throw an exception in get; return an empty result instead. Also handle the case where select might throw an exception in ExtensibleLoadManagerImpl#selectAsync.

To handle the specific case when the broker is the last broker, close the broker in advance if there is no available broker in the metadata store. Then any namespace bundle's unload will also be skipped because the state was restored to INIT.
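A minimal sketch of the two defensive changes around the load data store is shown below. ClosedStoreSketch is a hypothetical in-memory stand-in, not the real LoadDataStore, and turning a thrown exception into a failed future is only one possible way for selectAsync to handle it; the description above does not say which approach the PR takes.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical in-memory store used only to illustrate the behavior.
final class ClosedStoreSketch {
    private final Map<String, Double> data = new ConcurrentHashMap<>();
    private volatile boolean closed;

    void put(String broker, double load) {
        data.put(broker, load);
    }

    void shutdown() {
        closed = true;
        data.clear();
    }

    // After shutdown, report "no data" instead of throwing.
    Optional<Double> get(String broker) {
        if (closed) {
            return Optional.empty();
        }
        return Optional.ofNullable(data.get(broker));
    }

    // Assumed handling: surface a synchronous failure of `select` as a failed
    // future so callers of the async API never see a raw exception.
    static CompletableFuture<Optional<String>> selectAsync(Supplier<Optional<String>> select) {
        CompletableFuture<Optional<String>> result = new CompletableFuture<>();
        try {
            result.complete(select.get());
        } catch (RuntimeException e) {
            result.completeExceptionally(e);
        }
        return result;
    }
}
```

Callers that still read from the store during shutdown then see an empty result, and a selection strategy that fails mid-shutdown produces a failed future rather than an exception on the calling thread.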
Add testLookup to cover the changes above.

Documentation
doc
doc-required
doc-not-needed
doc-complete

Matching PR in forked repository
PR in forked repository: