
@BewareMyPower
Contributor

@BewareMyPower commented Sep 26, 2024

Motivation

#23298 introduced a regression: once this broker's metadata node is deleted (e.g. by a session timeout), the broker registry never gets a chance to recover. As a result, clients of topics owned by this broker can no longer produce or consume.

Modifications

Add a default listener to BrokerRegistryImpl that asynchronously registers the broker again if the deleted node belongs to the current broker. Add a new state, Unregistering, to prevent the broker from registering itself again after unregister() is called.

Add BrokerRegistryIntegrationTest to verify this fix and the behavior introduced in #23298.
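The modifications above can be modeled as a small state machine. The following is a minimal, self-contained sketch, not the actual BrokerRegistryImpl code: `FakeMetadataStore`, `Notification`, the node path and every `State` value except `Unregistering` are hypothetical. It shows a default listener that re-registers asynchronously when this broker's own node is deleted, and an `Unregistering` state that suppresses re-registration once `unregister()` has been called.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

public class BrokerRegistrySketch {
    // Hypothetical notification model; this is not Pulsar's metadata API.
    enum NotificationType { CREATED, DELETED }
    record Notification(NotificationType type, String path) {}
    // Hypothetical in-memory store that notifies listeners synchronously.
    static class FakeMetadataStore {
        final Map<String, String> nodes = new ConcurrentHashMap<>();
        private final List<Consumer<Notification>> listeners = new CopyOnWriteArrayList<>();
        void registerListener(Consumer<Notification> listener) { listeners.add(listener); }
        void put(String path, String value) {
            nodes.put(path, value);
            listeners.forEach(l -> l.accept(new Notification(NotificationType.CREATED, path)));
        }
        // Simulates a node disappearing, e.g. an ephemeral node after a session timeout.
        void delete(String path) {
            if (nodes.remove(path) != null) {
                listeners.forEach(l -> l.accept(new Notification(NotificationType.DELETED, path)));
            }
        }
    }

    // Only Unregistering is named in the PR description; the other states are assumptions.
    enum State { Init, Registered, Unregistering, Closed }

    static class BrokerRegistry {
        final String brokerPath;
        final AtomicReference<State> state = new AtomicReference<>(State.Init);
        private final FakeMetadataStore store;
        // One thread runs register/unregister tasks in FIFO order, so they never interleave.
        private final ExecutorService executor = Executors.newSingleThreadExecutor();

        BrokerRegistry(FakeMetadataStore store, String brokerId) {
            this.store = store;
            this.brokerPath = "/brokers/" + brokerId; // hypothetical path layout
            store.registerListener(this::onNotification); // the default listener
        }

        CompletableFuture<Void> register() {
            return CompletableFuture.runAsync(() -> {
                State s = state.get();
                if (s == State.Unregistering || s == State.Closed) {
                    return; // unregister() was called: do not register again
                }
                store.put(brokerPath, "broker-data");
                state.set(State.Registered);
            }, executor);
        }

        CompletableFuture<Void> unregister() {
            return CompletableFuture.runAsync(() -> {
                state.set(State.Unregistering); // set before the node is deleted
                store.delete(brokerPath);
            }, executor);
        }

        private void onNotification(Notification n) {
            if (n.type() == NotificationType.DELETED && n.path().equals(brokerPath)
                    && state.get() == State.Registered) {
                register(); // asynchronous: never block the notification thread
            }
        }

        // Waits for every task submitted before this call (the executor is FIFO).
        void awaitPendingTasks() {
            CompletableFuture.runAsync(() -> { }, executor).join();
        }

        void close() {
            state.set(State.Closed);
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        FakeMetadataStore store = new FakeMetadataStore();
        BrokerRegistry registry = new BrokerRegistry(store, "broker-1");
        registry.register().join();
        store.delete(registry.brokerPath); // e.g. removed by a session timeout
        registry.awaitPendingTasks();
        System.out.println("re-registered: " + store.nodes.containsKey(registry.brokerPath));
        // prints "re-registered: true"
        registry.unregister().join();
        registry.awaitPendingTasks();
        System.out.println("after unregister: " + store.nodes.containsKey(registry.brokerPath));
        // prints "after unregister: false"
        registry.close();
    }
}
```

Running both `register()` and `unregister()` on one single-threaded executor is a simplifying assumption of this sketch: it guarantees that a re-registration queued by the listener can never run between setting `Unregistering` and deleting the node.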

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@BewareMyPower added the type/bug, area/broker and release/3.3.2 labels Sep 26, 2024
@BewareMyPower added this to the 4.0.0 milestone Sep 26, 2024
@BewareMyPower self-assigned this Sep 26, 2024
@github-actions bot added the doc-not-needed label Sep 26, 2024
@BewareMyPower changed the title "[fix][broker] Fix the broker register cannot recover from the metadata node deletion" to "[fix][broker] Fix the broker registery cannot recover from the metadata node deletion" Sep 26, 2024
@heesung-sohn
Contributor

I wonder if we need to periodically check this broker registration lock in the monitor thread as well for better fault tolerance.

@BewareMyPower
Contributor Author

> periodically check this broker registration lock in the monitor thread as well for better fault tolerance.

Yes, it could handle the case where the metadata store client has a bug. But even then, we can easily detect the problem from alerts and fix it manually by restarting the broker or recreating the metadata node. Therefore, I neither support nor oppose this idea.
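The periodic check suggested in this exchange might look roughly like the following. It is a hypothetical sketch, not code from this PR or from Pulsar's load manager: a `ConcurrentHashMap` stands in for the metadata store, and `checkRegistration()`, `startMonitor()` and the node path are illustrative names. A scheduled task recreates the broker's node whenever it is missing while the broker is still supposed to be registered, which would also cover a lost or missed deletion notification.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RegistrationMonitorSketch {
    // Stand-in for the metadata store: path -> data (hypothetical).
    final Map<String, String> store = new ConcurrentHashMap<>();
    final String brokerPath = "/brokers/broker-1"; // hypothetical path
    final AtomicInteger repairs = new AtomicInteger();
    private boolean shouldBeRegistered; // guarded by "this"
    private final ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();

    synchronized void register() {
        shouldBeRegistered = true;
        store.put(brokerPath, "broker-data");
    }

    synchronized void unregister() {
        shouldBeRegistered = false; // keeps the monitor from recreating the node
        store.remove(brokerPath);
    }

    // One monitor pass: recreate the node if it is missing but should exist.
    synchronized void checkRegistration() {
        try {
            if (shouldBeRegistered && store.putIfAbsent(brokerPath, "broker-data") == null) {
                repairs.incrementAndGet();
            }
        } catch (RuntimeException e) {
            // Swallow so that one failed pass does not cancel all future passes.
            System.err.println("registration check failed: " + e);
        }
    }

    void startMonitor(long periodMillis) {
        monitor.scheduleWithFixedDelay(this::checkRegistration, periodMillis, periodMillis,
                TimeUnit.MILLISECONDS);
    }

    void close() {
        monitor.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        RegistrationMonitorSketch s = new RegistrationMonitorSketch();
        s.register();
        s.startMonitor(50);
        s.store.remove(s.brokerPath); // node lost, and no deletion notification arrived
        long deadline = System.currentTimeMillis() + 2_000;
        while (!s.store.containsKey(s.brokerPath) && System.currentTimeMillis() < deadline) {
            Thread.sleep(10);
        }
        System.out.println("restored by monitor: " + s.store.containsKey(s.brokerPath));
        s.close();
    }
}
```

The try/catch inside the scheduled task matters: if any execution of a task scheduled with `scheduleWithFixedDelay` throws, `ScheduledExecutorService` suppresses all subsequent executions of it.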

@BewareMyPower force-pushed the bewaremypower/broker-registry-session-timeout branch from 33d630b to 5c720d8 September 27, 2024 05:51
@codecov-commenter

codecov-commenter commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 8 lines in your changes missing coverage. Please review.

Project coverage is 74.56%. Comparing base (bbc6224) to head (5c720d8).
Report is 606 commits behind head on master.

Files with missing lines                                  Patch %   Lines
...ker/loadbalance/extensions/BrokerRegistryImpl.java     71.42%    5 Missing and 3 partials ⚠️
Additional details and impacted files


@@             Coverage Diff              @@
##             master   #23359      +/-   ##
============================================
+ Coverage     73.57%   74.56%   +0.99%     
- Complexity    32624    33942    +1318     
============================================
  Files          1877     1934      +57     
  Lines        139502   145035    +5533     
  Branches      15299    15848     +549     
============================================
+ Hits         102638   108150    +5512     
+ Misses        28908    28605     -303     
- Partials       7956     8280     +324     
Flag        Coverage Δ
inttests    27.75% <46.42%> (+3.16%) ⬆️
systests    24.56% <0.00%> (+0.24%) ⬆️
unittests   73.91% <71.42%> (+1.07%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines                                  Coverage Δ
...ker/loadbalance/extensions/BrokerRegistryImpl.java     81.41% <71.42%> (-3.35%) ⬇️

... and 600 files with indirect coverage changes

@BewareMyPower merged commit 95bd1d1 into apache:master Sep 27, 2024
@BewareMyPower deleted the bewaremypower/broker-registry-session-timeout branch September 27, 2024 11:49
@lhotari
Member

lhotari commented Sep 28, 2024

This PR introduced a flaky test (#23365). @BewareMyPower, do you have a chance to fix it?

@lhotari
Member

lhotari commented Feb 28, 2025

@BewareMyPower Is the current solution, which registers a listener for the deletion event, sufficient?

In LockManagerImpl, there's handling for session events:
https://github.com/lhotari/pulsar/blob/83b86abcb74595d7e8aa31b238a7dbb19a04dde2/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/coordination/impl/LockManagerImpl.java#L118-L141

Before the #23984 and #23988 changes, ZooKeeper always created persistent nodes, even when ephemeral node creation was requested, for the API used in this case. That's why session disconnection might not have been a visible issue before.
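One way to also react to session events, in the spirit of the LockManagerImpl handling linked above, is sketched below. It is an illustration under stated assumptions, not Pulsar code: the `SessionEvent` enum here is a local stand-in whose constant names are assumed, and a map plays the role of ZooKeeper. Because ZooKeeper removes an ephemeral node when the session that created it expires, the sketch recreates the node on `SessionReestablished` if the broker is still meant to be registered, instead of relying only on a per-node deletion notification.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SessionAwareRegistrationSketch {
    // Local stand-in for session-event kinds; constant names are assumptions.
    enum SessionEvent { ConnectionLost, Reconnected, SessionLost, SessionReestablished }

    // Stand-in for ZooKeeper: path -> data (hypothetical).
    final Map<String, String> store = new ConcurrentHashMap<>();
    final String brokerPath = "/brokers/broker-1"; // hypothetical path
    private boolean shouldBeRegistered; // guarded by "this"

    synchronized void register() {
        shouldBeRegistered = true;
        store.put(brokerPath, "broker-data");
    }

    synchronized void unregister() {
        shouldBeRegistered = false;
        store.remove(brokerPath);
    }

    // React to session-level events, not only to per-node deletion notifications:
    // an ephemeral node is removed by the server when its session expires.
    synchronized void onSessionEvent(SessionEvent event) {
        switch (event) {
            case SessionReestablished -> {
                // Revalidate: recreate the node if this broker should still be registered.
                if (shouldBeRegistered) {
                    store.putIfAbsent(brokerPath, "broker-data");
                }
            }
            case SessionLost -> { } // node may already be gone; wait for a new session
            default -> { } // ConnectionLost/Reconnected: the session itself may still be alive
        }
    }

    public static void main(String[] args) {
        SessionAwareRegistrationSketch r = new SessionAwareRegistrationSketch();
        r.register();
        r.store.remove(r.brokerPath); // simulate server-side expiry of the ephemeral node
        r.onSessionEvent(SessionEvent.SessionLost);
        r.onSessionEvent(SessionEvent.SessionReestablished);
        System.out.println("restored: " + r.store.containsKey(r.brokerPath)); // prints "restored: true"
    }
}
```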

@lhotari
Member

lhotari commented Mar 5, 2025

I created #24057 which is a similar issue.


Labels

area/broker, cherry-picked/branch-3.3, doc-not-needed, release/3.3.2, type/bug


5 participants