
Conversation

@michaeljmarshall (Member) commented on Jul 27, 2022:

Motivation

The BookieRackAffinityMapping class relies on a metadata cache that expires entries after 10 minutes. When an entry expires, the next call to BookieRackAffinityMapping#getRack returns an incomplete future, and the TopologyAwareEnsemblePlacementPolicy (a BookKeeper class) records the bookie's network location as default-rack.
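
For illustration, here is a minimal sketch of the failure mode; the names and the cache shape are hypothetical, not the actual Pulsar or BookKeeper code:

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only; the names are hypothetical, not the actual Pulsar/BookKeeper code.
class RackLookupSketch {
    String getRack(CompletableFuture<Optional<String>> cachedRack) {
        // If the cache entry expired, the future handed back on the next read is not yet
        // complete, so getNow() yields the fallback and the bookie lands in /default-rack.
        return cachedRack.getNow(Optional.empty()).orElse("/default-rack");
    }
}
```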

It is trivial to reproduce the issue. Start a Pulsar cluster, define a rack topology, wait at least 10 minutes, kill one of the bookies that is not in the default-rack, and observe the broker logs as the bookie comes back. The broker will log that the bookie is a member of the default-rack. When bookkeeperClientEnforceMinNumRacksPerWriteQuorum is enabled in the broker, this bug becomes a blocking issue: the only way to resolve the bad state is to restart the broker (or to restart the bookie, assuming the broker still has the right mapping in its cache).

This PR changes the design of the BookieRackAffinityMapping by removing cache expiration. When the broker starts up, it discovers the mapping from ZooKeeper and stores that mapping until the broker observes an update from a ZooKeeper watch.

Modifications

  • Rely on an indefinitely cached rack mapping in BookieRackAffinityMapping instead of a metadata cache, which is defined to expire entries.
  • Eagerly resolve the bookie mapping. This was removed in #12097 (Bugfix: Fix rackaware placement policy init error), but now that bookkeeper#2788 (fix-npe-when-pulsar-ZkBookieRackAffinityMapping-getBookieAddressResolver) is merged and available in the BookKeeper client, we can safely resolve the addresses early.
  • Add the synchronized keyword to all relevant methods that modify mutable state from multiple threads. Based on my reading of the code, there is no risk of deadlock with this change. Making these methods synchronized also prevents certain races that could negatively affect bookie network location resolution. The only potential problem is that this synchronization could briefly block a ZooKeeper callback thread. Because the operations in these methods do not contain any blocking I/O (other than on initialization), I view meaningful blocking of a ZooKeeper thread as unlikely.
  • Remove the volatile keyword from two maps that are now only updated within synchronized blocks.
  • Move the registerListener call to before getting the value from ZooKeeper. This ensures that an update is not missed in the very short window between getting the value and registering the listener. Because the method is synchronized, the event will be properly observed after the original initialization (see the sketch after this list).
  • Update a test to use Awaitility to account for the asynchronous nature of metadata store notifications.
  • Move the rackawarePolicy null check to later in the sequence to make tests pass. Note that we always use a rackawarePolicy, so this is a trivial change.
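
As a minimal illustration of the listener-before-read ordering and the synchronized update path described above, here is a sketch; the Store interface and all class and method names are illustrative stand-ins, not the actual MetadataStore or BookieRackAffinityMapping API:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Minimal sketch of the ordering and locking described above. The Store interface and all
// names are illustrative stand-ins, not the actual MetadataStore/BookieRackAffinityMapping API.
class RackMappingSketch {
    interface Store {
        void registerListener(Consumer<String> onPathChanged);
        CompletableFuture<Optional<Map<String, String>>> get(String path);
    }

    private Map<String, String> bookieToRack = Map.of(); // long-lived, never expires

    synchronized void start(Store store, String path) {
        // Register the listener before the initial read so no update can be missed in the
        // window between reading the value and registering the listener. handleUpdate is
        // synchronized, so a notification arriving during start() is applied only after
        // initialization completes.
        store.registerListener(changedPath -> {
            if (changedPath.equals(path)) {
                store.get(path).thenAccept(v -> handleUpdate(v.orElse(Map.of())));
            }
        });
        // Eagerly resolve the mapping; the only blocking call happens here, at initialization.
        handleUpdate(store.get(path).join().orElse(Map.of()));
    }

    private synchronized void handleUpdate(Map<String, String> mapping) {
        bookieToRack = mapping; // mutated only under the lock, so volatile is unnecessary
    }

    synchronized String getRack(String bookie) {
        return bookieToRack.getOrDefault(bookie, "/default-rack");
    }
}
```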

Verifying this change

This change is covered by existing tests. Note that the original bug is challenging to reproduce in a unit test because it relies on cache expiration, which is hard-coded at 10 minutes in the MetadataCacheImpl. By removing any chance of cache expiration, we eliminate the possibility of this bug.
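
As a small illustration of the Awaitility-based test update mentioned in the modifications, a hypothetical assertion (not the actual test code) could poll until the asynchronous notification has been applied:

```java
import static org.awaitility.Awaitility.await;
import static org.testng.Assert.assertEquals;

import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical test fragment: poll until the asynchronous notification has been applied,
// rather than asserting immediately after the rack mapping is written to the metadata store.
class RackMappingTestSketch {
    void assertRackEventuallyUpdated(Supplier<String> rackLookup, String expectedRack) {
        await().atMost(Duration.ofSeconds(10))
               .untilAsserted(() -> assertEquals(rackLookup.get(), expectedRack));
    }
}
```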

Additional Context

Here are sample logs from a reproduction of the issue:

```
2022-07-27T15:20:55,352+0000 [main-EventThread] INFO  org.apache.bookkeeper.net.NetworkTopologyImpl - Removing a node: /az1/pulsar-bookkeeper-3.pulsar-bookkeeper.michael-test.svc.cluster.local:3181
2022-07-27T15:20:55,353+0000 [main-EventThread] INFO  org.apache.bookkeeper.net.NetworkTopologyImpl - Removing a node: /az1/pulsar-bookkeeper-3.pulsar-bookkeeper.michael-test.svc.cluster.local:3181
2022-07-27T15:20:59,310+0000 [main-EventThread] WARN  org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy - Failed to resolve network location for pulsar-bookkeeper-3.pulsar-bookkeeper.michael-test.svc.cluster.local, using default rack for it : /default-rack.
2022-07-27T15:20:59,310+0000 [main-EventThread] INFO  org.apache.bookkeeper.net.NetworkTopologyImpl - Adding a new node: /default-rack/pulsar-bookkeeper-3.pulsar-bookkeeper.michael-test.svc.cluster.local:3181
2022-07-27T15:20:59,311+0000 [main-EventThread] WARN  org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy - Failed to resolve network location for pulsar-bookkeeper-3.pulsar-bookkeeper.michael-test.svc.cluster.local, using default rack for it : /default-rack.
2022-07-27T15:20:59,311+0000 [main-EventThread] INFO  org.apache.bookkeeper.net.NetworkTopologyImpl - Adding a new node: /default-rack/pulsar-bookkeeper-3.pulsar-bookkeeper.michael-test.svc.cluster.local:3181
```

Alternative Solution

An alternative solution is to add a callback to the metadata store's result when the future is not complete. The callback would trigger the logic in BookieRackAffinityMapping#handleUpdates. While this change would touch fewer lines of code, I view it as suboptimal because it necessarily leads to misclassifying bookies as members of the default-rack, which is both confusing to users and a potential source of temporary errors.
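
For illustration only, here is a rough sketch of that rejected alternative, assuming a cached future and a hypothetical handleUpdates stand-in for BookieRackAffinityMapping#handleUpdates:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the rejected alternative; names and signatures are illustrative,
// and handleUpdates here only stands in for BookieRackAffinityMapping#handleUpdates.
class AlternativeSketch {
    String getRack(CompletableFuture<Optional<Map<String, String>>> cached, String bookie) {
        if (!cached.isDone()) {
            // The bookie is misclassified as /default-rack until the callback fires and
            // handleUpdates re-resolves its network location.
            cached.thenAccept(v -> handleUpdates(v.orElse(Map.of())));
            return "/default-rack";
        }
        return cached.join().orElse(Map.of()).getOrDefault(bookie, "/default-rack");
    }

    private void handleUpdates(Map<String, String> mapping) {
        // re-resolve bookie racks with the refreshed mapping (details omitted)
    }
}
```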

Does this pull request potentially affect one of the following parts:

This PR does not introduce any breaking changes. It might not easily get cherry-picked to older release branches.

Documentation

  • doc-not-needed

Docs are not needed because this is just an internal bug fix.

@michaeljmarshall added the type/bug, area/broker, doc-not-needed, release/2.8.4, release/2.10.2, and release/2.9.4 labels on Jul 27, 2022
@michaeljmarshall added this to the 2.11.0 milestone on Jul 27, 2022
@michaeljmarshall self-assigned this on Jul 27, 2022
@michaeljmarshall changed the title from "Fix rack awareness cache expiration race condition" to "Fix rack awareness cache expiration data race" on Jul 27, 2022
@michaeljmarshall (Member, Author) commented:

@merlimat - one of the core assumptions for this PR is that zk notifications won't get missed. Initially, I thought that was a valid assumption, but now I am thinking that might not be true in the event that a zk connection is dropped and an update to the /bookies zk node is performed while the broker is disconnected. Will you take a look and let me know if my design is valid? Thanks.

@michaeljmarshall michaeljmarshall requested a review from lhotari July 27, 2022 16:34
@michaeljmarshall (Member, Author) commented:

Given the documentation here, https://zookeeper.apache.org/doc/r3.8.0/zookeeperProgrammers.html, I think we're likely fine, though I'm not sure how a crashed zk will affect watches and changes made while disconnected.

Here is the relevant section from the above docs:

Watches are maintained locally at the ZooKeeper server to which the client is connected. This allows watches to be lightweight to set, maintain, and dispatch. When a client connects to a new server, the watch will be triggered for any session events. Watches will not be received while disconnected from a server. When a client reconnects, any previously registered watches will be reregistered and triggered if needed. In general this all occurs transparently. There is one case where a watch may be missed: a watch for the existence of a znode not yet created will be missed if the znode is created and deleted while disconnected.

@lhotari (Member) left a comment:
LGTM. Good catch @michaeljmarshall

@eolivelli (Contributor) left a comment:
LGTM

@eolivelli eolivelli merged commit e451806 into apache:master Jul 29, 2022
eolivelli pushed a commit to datastax/pulsar that referenced this pull request Jul 29, 2022
@michaeljmarshall michaeljmarshall deleted the refactor-bookie-rack-awareness-and-fix-cache-bug branch July 29, 2022 14:35
@BewareMyPower (Contributor) commented:

Could you cherry-pick this PR (or open an independent PR) to branch-2.8? It relies on #14708 but when I cherry-picked #14708, there were still many conflicts.

@michaeljmarshall (Member, Author) commented:

@BewareMyPower - yes, I will take care of cherry picking this PR.

@mattisonchao (Member) commented:

Hi @michaeljmarshall, could you help cherry-pick this PR to branch-2.9? Thanks a lot!

@congbobo184 (Contributor) commented:

Hi @michaeljmarshall, could you help cherry-pick this PR to branch-2.9? Thanks.

@congbobo184 (Contributor) commented:

Hi @michaeljmarshall, I moved this PR to release/2.9.5. If you have any questions, please ping me. Thanks.

@hangc0276 (Contributor) commented:

This issue was introduced by #12841, which was only released in branch-2.10+. Does this PR only need to be cherry-picked to branch-2.10? @michaeljmarshall

@michaeljmarshall (Member, Author) commented:

@hangc0276 - I haven't verified the other releases, but I'm not able to cherry-pick it right now. We can probably just drop the older release lines, since the conditions that lead to this bug are unlikely in any production system.


Labels

area/broker, cherry-picked/branch-2.10, doc-not-needed, release/2.10.2, type/bug
