Fix rack awareness cache expiration data race #16825
Conversation
@merlimat - one of the core assumptions for this PR is that ZK notifications won't get missed. Initially, I thought that was a valid assumption, but now I am thinking that might not be true in the event that a ZK connection is dropped and an update to the …
Given the documentation here, https://zookeeper.apache.org/doc/r3.8.0/zookeeperProgrammers.html, I think we're likely fine, though I'm not sure how a crashed zk will affect watches and changes made while disconnected. Here is the relevant section from the above docs:
lhotari left a comment
LGTM. Good catch @michaeljmarshall
eolivelli left a comment
LGTM
(cherry picked from commit e451806)
@BewareMyPower - yes, I will take care of cherry-picking this PR.
Hi @michaeljmarshall
@michaeljmarshall hi, I moved this PR to release/2.9.5; if you have any questions, please ping me. Thanks.
This issue was introduced by #12841, which was only released in branch-2.10+. Does this PR only need to be cherry-picked to branch-2.10? @michaeljmarshall
@hangc0276 - I haven't verified the other releases, but I'm not able to cherry-pick it right now. We can probably just drop the older release lines, since the conditions that lead to this bug are unlikely in any production system.
Motivation
The `BookieRackAffinityMapping` class relies on a metadata cache that expires entries after 10 minutes. When an entry expires, the next call to `BookieRackAffinityMapping#getRack` returns an incomplete future (because the entry expired), and the `TopologyAwareEnsemblePlacementPolicy` (a BookKeeper class) stores the bookie's network location as `default-rack`.

It is trivial to reproduce the issue: start a Pulsar cluster, define a rack topology, wait at least 10 minutes, kill one of the bookies that is not in the `default-rack`, and observe the broker logs as the bookie comes back. The broker will log that the bookie is a member of the `default-rack`. When `bookkeeperClientEnforceMinNumRacksPerWriteQuorum` is enabled in the broker, this bug becomes a blocking issue where the only way to resolve the bad state is to restart the broker (or to restart the bookie, assuming the broker still has the right mapping in the cache).

This PR changes the design of `BookieRackAffinityMapping` by removing cache expiration. When the broker starts up, it discovers the mapping from ZooKeeper and stores that mapping until the broker observes an update from a ZK watch.

Modifications
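At a high level, the new design amounts to the pattern sketched below: keep no expiring cache, register the change listener before the initial read, and guard mutable state with `synchronized`. This is a simplified, hypothetical sketch under those assumptions; all class and method names here (`FakeMetadataStore`, `RackMapping`, etc.) are stand-ins, not the actual Pulsar code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for the metadata store: a key/value map with a change listener.
class FakeMetadataStore {
    private final Map<String, String> data = new HashMap<>();
    private Consumer<String> listener;

    synchronized void put(String bookie, String rack) {
        data.put(bookie, rack);
        if (listener != null) {
            listener.accept(bookie); // deliver the "watch" notification
        }
    }

    synchronized Map<String, String> snapshot() {
        return new HashMap<>(data);
    }

    synchronized void registerListener(Consumer<String> l) {
        this.listener = l;
    }
}

// Sketch of the fixed design: no expiring cache; the mapping lives until a
// watch event replaces it.
class RackMapping {
    private final FakeMetadataStore store;
    private volatile Map<String, String> bookieToRack = new HashMap<>();

    RackMapping(FakeMetadataStore store) {
        this.store = store;
        // Register the listener BEFORE the initial read, so an update cannot be
        // missed in the window between reading the value and starting to watch.
        store.registerListener(this::handleUpdate);
        handleUpdate(null);
    }

    private synchronized void handleUpdate(String changedKey) {
        // Rebuild the whole mapping; synchronized prevents racing reloads.
        bookieToRack = store.snapshot(); // volatile publish of the new snapshot
    }

    String getRack(String bookie) {
        // Never expires: returns the last observed mapping, or the default rack.
        return bookieToRack.getOrDefault(bookie, "/default-rack");
    }
}
```

A reader of `RackMapping.getRack` always sees the last observed mapping rather than an incomplete lookup, which is the behavioral change the bullets below implement in the real class.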
- Store the rack mapping directly in `BookieRackAffinityMapping`, instead of relying on a metadata cache, which is defined to have an entry expiration.
- Add the `synchronized` keyword to all relevant methods that modify mutable state from multiple threads. Based on my reading of the code, there is no risk of deadlock with this change. Making these methods synchronized also prevents certain races that could negatively affect bookie network location resolution. The only potential problem is that this synchronization could briefly block a ZK callback thread. Because the operations in these methods do not contain any blocking IO (other than on initialization), I view blocking a ZK thread as unlikely.
- Add the `volatile` keyword for two maps that are now only updated within `synchronized` blocks.
- Move the `registerListener` call to before getting the value from ZooKeeper. This ensures that an update is not missed in the very short window between getting the value and registering the listener. Because the method is synchronized, the event will properly be observed after the original initialization.
- Move the `rackawarePolicy` null check later in the sequence to make tests pass. Note that we always use a `rackawarePolicy`, so this is a trivial change.

Verifying this change
This change is covered by existing tests. Note that the original bug is challenging to reproduce in a unit test because it relies on cache expiration, which is hard-coded at 10 minutes in `MetadataCacheImpl`. By removing any chance of cache expiration, we remove the possibility of this bug.

Additional Context
Here are sample logs from a reproduction of the issue:
Alternative Solution
An alternative solution is to add a callback to the metadata store's result when the future is not complete. The callback would trigger the logic in `BookieRackAffinityMapping#handleUpdates`. While this change would be smaller in terms of lines of code touched, I view it as suboptimal because it necessarily leads to misclassification of bookies as members of the `default-rack`, which is both confusing to users and could lead to temporary errors.

Does this pull request potentially affect one of the following parts:
This PR does not introduce any breaking changes. It might not be easily cherry-picked to older release branches.
Documentation
doc-not-needed
Docs are not needed because this is just an internal bug fix.
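For contrast with the Alternative Solution above, here is a rough, hypothetical sketch of the rejected callback approach. The class and variable names are invented for illustration; only the shape of the idea comes from the PR description. When the cached future is incomplete, the caller is answered with `default-rack` immediately, and the real rack is only recorded once the callback fires — exactly the transient misclassification the chosen design avoids.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the rejected alternative: keep the expiring cache,
// but attach a callback that repairs the mapping once the reload completes.
class CallbackRackMapping {
    private final Map<String, String> resolved = new ConcurrentHashMap<>();

    String getRack(String bookie, CompletableFuture<String> cacheLookup) {
        if (cacheLookup.isDone()) {
            return cacheLookup.getNow("/default-rack");
        }
        String repaired = resolved.get(bookie);
        if (repaired != null) {
            return repaired; // a previous callback already recorded the real rack
        }
        // The entry expired and the reload has not finished: answer with the
        // default rack NOW, and repair the mapping later when the future completes.
        cacheLookup.thenAccept(rack -> resolved.put(bookie, rack));
        return "/default-rack"; // transient misclassification
    }
}
```

Every expiration produces at least one `/default-rack` answer before the callback catches up, which is why the PR removes expiration instead.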