-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[fix][meta] Fix deadlock in AutoRecovery. #21010
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][meta] Fix deadlock in AutoRecovery. #21010
Conversation
| @@ -978,11 +978,13 @@ public long getReplicasCheckCTime() throws ReplicationException.UnavailableExcep | |||
| @Override | |||
| public void notifyUnderReplicationLedgerChanged(BookkeeperInternalCallbacks.GenericCallback<Void> cb) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the metrics update task runs in another thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unnecessary, the callback is heavy for the zookeeper, and we plan to remove UnderReplicatedLedgersChangedCb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. I think sending an email to notify others why we did this change is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I draft a discussion in bk dev. https://lists.apache.org/thread/1xl3hr2cpyd5xh9kozbx5xlfsjsg3f4h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks explain this to me.
| @@ -978,11 +978,13 @@ public long getReplicasCheckCTime() throws ReplicationException.UnavailableExcep | |||
| @Override | |||
| public void notifyUnderReplicationLedgerChanged(BookkeeperInternalCallbacks.GenericCallback<Void> cb) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. I think sending an email to notify others why we did this change is better.
| @@ -978,11 +978,13 @@ public long getReplicasCheckCTime() throws ReplicationException.UnavailableExcep | |||
| @Override | |||
| public void notifyUnderReplicationLedgerChanged(BookkeeperInternalCallbacks.GenericCallback<Void> cb) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should consider deprecating this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interface is defined in BookKeeper and we will deprecate it in next BookKeeper major version.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #21010 +/- ##
=============================================
+ Coverage 33.56% 73.01% +39.45%
- Complexity 12198 32322 +20124
=============================================
Files 1621 1875 +254
Lines 126970 140742 +13772
Branches 13857 15662 +1805
=============================================
+ Hits 42618 102769 +60151
+ Misses 78748 29817 -48931
- Partials 5604 8156 +2552
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
| //The store listener callback executor is metadata-store executor, | ||
| //in cb.operationComplete(0, null), it will get all underreplication ledgers from metadata-store, it's sync | ||
| //operation. So it's a deadlock. | ||
| // store.registerListener(e -> { | ||
| // if (e.getType() == NotificationType.Deleted && ID_EXTRACTION_PATTERN.matcher(e.getPath()).find()) { | ||
| // cb.operationComplete(0, null); | ||
| // } | ||
| // }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not remove these lines and add an explanation of why we should keep the method empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want the user to know there may be a deadlock when they see this code. I agree with you that we should left the comments for user to let them know that PulsarLedgerUnderreplicationManager#notifyUnderReplicationLedgerChanged will be removed at next bk major release. But it's already merged, so if user confused, they could see this pr link to get the details.
(cherry picked from commit deeb8a2)
Motivation
When the user config
-Dbookkeeper.metadata.client.drivers=org.apache.pulsar.metadata.bookkeeper.PulsarMetadataClientDriver, the AutoRecovery will use PulsarLedgerUnderreplicationManager.https://github.com/apache/bookkeeper/blob/f30ff4f2ad4778f1f73b29872e2a95adb22ca116/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java#L396-L397.
In Auditor start, it will register a UnderReplicatedLedgersChangedCb to PulsarLedgerUnderreplicationManager,
PulsarLedgerUnderreplicationManager will register a watcher to watch the zk event.
When the event path matches
underreplication, callback UnderReplicatedLedgersChangedCb, the callback executor is metadata-store executor.pulsar/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/impl/AbstractMetadataStore.java
Lines 307 to 323 in 0cb1c78
In UnderReplicatedLedgersChangedCb, it will get all the
underreplicationledgers from the metadata store, and this is a sync operation.https://github.com/apache/bookkeeper/blob/f30ff4f2ad4778f1f73b29872e2a95adb22ca116/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java#L561-L569
pulsar/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/bookkeeper/PulsarLedgerUnderreplicationManager.java
Lines 427 to 479 in 0cb1c78
line_448, use future.get(), it's sync.
There is the stack file:
jastack.txt
Motivation
The UnderReplicatedLedgersChangedCb is aim to record metrics, it is unnecessary and brings heavy pressure for the zookeeper, so here we cancel the UnderReplicatedLedgersChangedCb.
And we discuss to revert
UnderReplicatedLedgersChangedCb,.apache/bookkeeper#2805 (review)
Documentation
docdoc-requireddoc-not-neededdoc-complete