@horizonzy horizonzy (Member) commented Aug 17, 2023

Motivation

When the user configures -Dbookkeeper.metadata.client.drivers=org.apache.pulsar.metadata.bookkeeper.PulsarMetadataClientDriver, AutoRecovery uses PulsarLedgerUnderreplicationManager.

https://github.com/apache/bookkeeper/blob/f30ff4f2ad4778f1f73b29872e2a95adb22ca116/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java#L396-L397.

When the Auditor starts, it registers an UnderReplicatedLedgersChangedCb with PulsarLedgerUnderreplicationManager, which in turn registers a watcher for ZooKeeper events.

When an event path matches underreplication, UnderReplicatedLedgersChangedCb is invoked. The callback runs on the metadata-store executor.

protected CompletableFuture<Void> receivedNotification(Notification notification) {
    try {
        return CompletableFuture.supplyAsync(() -> {
            listeners.forEach(listener -> {
                try {
                    listener.accept(notification);
                } catch (Throwable t) {
                    log.error("Failed to process metadata store notification", t);
                }
            });
            return null;
        }, executor);
    } catch (RejectedExecutionException e) {
        return FutureUtil.failedFuture(e);
    }
}

UnderReplicatedLedgersChangedCb fetches all under-replicated ledgers from the metadata store, and it does so synchronously.

https://github.com/apache/bookkeeper/blob/f30ff4f2ad4778f1f73b29872e2a95adb22ca116/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java#L561-L569

public Iterator<UnderreplicatedLedger> listLedgersToRereplicate(final Predicate<List<String>> predicate) {
    final Queue<String> queue = new LinkedList<>();
    queue.add(urLedgerPath);
    return new Iterator<UnderreplicatedLedger>() {
        final Queue<UnderreplicatedLedger> curBatch = new LinkedList<>();

        @Override
        public void remove() {
            throw new UnsupportedOperationException();
        }

        @Override
        public boolean hasNext() {
            if (curBatch.size() > 0) {
                return true;
            }
            while (queue.size() > 0 && curBatch.size() == 0) {
                String parent = queue.remove();
                try {
                    for (String c : store.getChildren(parent).get()) {
                        String child = parent + "/" + c;
                        if (c.startsWith("urL")) {
                            long ledgerId = getLedgerId(child);
                            UnderreplicatedLedger underreplicatedLedger = getLedgerUnreplicationInfo(ledgerId);
                            if (underreplicatedLedger != null) {
                                List<String> replicaList = underreplicatedLedger.getReplicaList();
                                if ((predicate == null) || predicate.test(replicaList)) {
                                    curBatch.add(underreplicatedLedger);
                                }
                            }
                        } else {
                            queue.add(child);
                        }
                    }
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                } catch (Exception e) {
                    throw new RuntimeException("Error reading list", e);
                }
            }
            return curBatch.size() > 0;
        }

        @Override
        public UnderreplicatedLedger next() {
            assert curBatch.size() > 0;
            return curBatch.remove();
        }
    };
}

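Line 448 uses future.get(), as in store.getChildren(parent).get() above, which blocks the calling thread until the result arrives. Since the callback runs on the metadata-store executor, this blocking call on that executor is what leads to the deadlock.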
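
The blocking pattern described above can be reproduced in isolation. The following is a minimal sketch, not Pulsar code: it assumes the notification executor is single-threaded and that the result of the store read is delivered on that same executor. The class name, method name, and the "urL0000000001" value are hypothetical, and a timeout replaces the unbounded get() so the demo terminates.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SelfBlockingExecutorSketch {

    /**
     * Returns true if the listener timed out while waiting for a result
     * that only its own (busy) thread could have produced.
     */
    public static boolean listenerTimesOut() throws Exception {
        // Assumption: stands in for a single-threaded metadata-store executor.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Boolean> listener = CompletableFuture.supplyAsync(() -> {
                // Stand-in for store.getChildren(parent): the work is queued
                // on the same executor that is currently running this listener.
                CompletableFuture<String> children =
                        CompletableFuture.supplyAsync(() -> "urL0000000001", executor);
                try {
                    // The original code calls get() with no timeout and would wait forever.
                    children.get(500, TimeUnit.MILLISECONDS);
                    return false;
                } catch (TimeoutException e) {
                    return true;
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }, executor);
            return listener.get(5, TimeUnit.SECONDS);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("listener timed out: " + listenerTimesOut());
    }
}
```

The inner task cannot start until the listener returns, and the listener does not return until the inner task finishes, so without the timeout neither would make progress.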

Here is the stack dump:
jastack.txt

Modifications

The UnderReplicatedLedgersChangedCb exists only to record metrics. It is unnecessary and puts heavy pressure on ZooKeeper, so this change disables it.
Reverting UnderReplicatedLedgersChangedCb has also been discussed in apache/bookkeeper#2805 (review).

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@github-actions github-actions bot added the doc-not-needed label (Your PR changes do not impact docs) Aug 17, 2023
@hangc0276 hangc0276 added this to the 3.2.0 milestone Aug 17, 2023
@@ -978,11 +978,13 @@ public long getReplicasCheckCTime() throws ReplicationException.UnavailableExcep
@Override
public void notifyUnderReplicationLedgerChanged(BookkeeperInternalCallbacks.GenericCallback<Void> cb)
Contributor

Can we run the metrics update task in another thread?

Member Author

That is unnecessary: the callback puts heavy load on ZooKeeper, and we plan to remove UnderReplicatedLedgersChangedCb.

Contributor

Makes sense. I think it would be better to send an email to notify others why we made this change.

Member Author

Contributor

Thanks for explaining this to me.
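
For reference, the alternative raised in this thread (running the callback on another thread) could look roughly like the sketch below. It is an illustration only, under the same single-threaded-executor assumption. The executor names, class name, and "urL0000000001" value are hypothetical and none of this is taken from the Pulsar code base.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class OffloadCallbackSketch {

    /** Runs the blocking callback on its own executor and returns what it read. */
    public static String runOnSeparateExecutor() throws Exception {
        // Assumption: stands in for the single-threaded metadata-store executor.
        ExecutorService notificationExecutor = Executors.newSingleThreadExecutor();
        // Hypothetical dedicated executor for the metrics callback.
        ExecutorService callbackExecutor = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<String> result = new CompletableFuture<>();
            // The notification thread only hands the work off and returns at once.
            notificationExecutor.execute(() -> callbackExecutor.execute(() -> {
                try {
                    // Stand-in for store.getChildren(parent).get(): the read is served
                    // by the notification executor, which is free again by now.
                    String children = CompletableFuture
                            .supplyAsync(() -> "urL0000000001", notificationExecutor)
                            .get(5, TimeUnit.SECONDS);
                    result.complete(children);
                } catch (Exception e) {
                    result.completeExceptionally(e);
                }
            }));
            return result.get(10, TimeUnit.SECONDS);
        } finally {
            notificationExecutor.shutdownNow();
            callbackExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("children read: " + runOnSeparateExecutor());
    }
}
```

This avoids the self-blocking wait, but it does not reduce the ZooKeeper load that the author gives as the reason for dropping the callback altogether.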

@@ -978,11 +978,13 @@ public long getReplicasCheckCTime() throws ReplicationException.UnavailableExcep
@Override
public void notifyUnderReplicationLedgerChanged(BookkeeperInternalCallbacks.GenericCallback<Void> cb)
Member

Do you think we should consider deprecating this method?

Contributor

@hangc0276 hangc0276 Aug 17, 2023

The interface is defined in BookKeeper, and we will deprecate it in the next BookKeeper major version.

@Technoboy- Technoboy- changed the title from "[fix] [auto-recovery] Fix deadlock in AutoRecovery." to "[fix][auto-recovery] Fix deadlock in AutoRecovery." Aug 19, 2023
@Technoboy- Technoboy- closed this Aug 20, 2023
@Technoboy- Technoboy- reopened this Aug 20, 2023
codecov-commenter commented Aug 20, 2023

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.01%. Comparing base (0cb1c78) to head (9c140cc).
⚠️ Report is 1708 commits behind head on master.

Additional details and impacted files

@@              Coverage Diff              @@
##             master   #21010       +/-   ##
=============================================
+ Coverage     33.56%   73.01%   +39.45%     
- Complexity    12198    32322    +20124     
=============================================
  Files          1621     1875      +254     
  Lines        126970   140742    +13772     
  Branches      13857    15662     +1805     
=============================================
+ Hits          42618   102769    +60151     
+ Misses        78748    29817    -48931     
- Partials       5604     8156     +2552     
Flag Coverage Δ
inttests 24.54% <ø> (+0.30%) ⬆️
systests 25.23% <ø> (?)
unittests 72.27% <ø> (+40.17%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...ookkeeper/PulsarLedgerUnderreplicationManager.java 51.05% <ø> (+51.05%) ⬆️

... and 1520 files with indirect coverage changes

@Technoboy- Technoboy- changed the title from "[fix][auto-recovery] Fix deadlock in AutoRecovery." to "[fix][meta] Fix deadlock in AutoRecovery." Aug 21, 2023
@Technoboy- Technoboy- merged commit deeb8a2 into apache:master Aug 21, 2023
Comment on lines +981 to +988
//The store listener callback executor is metadata-store executor,
//in cb.operationComplete(0, null), it will get all underreplication ledgers from metadata-store, it's sync
//operation. So it's a deadlock.
// store.registerListener(e -> {
// if (e.getType() == NotificationType.Deleted && ID_EXTRACTION_PATTERN.matcher(e.getPath()).find()) {
// cb.operationComplete(0, null);
// }
// });
Contributor

Why not remove these lines and add an explanation of why the method should stay empty?

Member Author

I want users to know there may be a deadlock when they see this code. I agree that we should leave a comment telling users that PulsarLedgerUnderreplicationManager#notifyUnderReplicationLedgerChanged will be removed in the next BookKeeper major release. But the PR is already merged, so anyone who is confused can follow this PR link for the details.
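
The reviewer's suggestion (an empty method plus an explanation) might be sketched as follows. This is not the merged code: the enclosing class and the GenericCallback interface here are hypothetical stand-ins so the snippet compiles on its own, and the Javadoc wording is only one possible phrasing of the reasoning given in this thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class NoOpNotifySketch {

    /** Hypothetical stand-in for BookkeeperInternalCallbacks.GenericCallback. */
    interface GenericCallback<T> {
        void operationComplete(int rc, T result);
    }

    /**
     * Intentionally a no-op. Registering a store listener here would run the
     * callback on the metadata-store executor, where its synchronous listing of
     * under-replicated ledgers can deadlock. Per the review discussion, the
     * method is expected to be removed in the next BookKeeper major version.
     */
    static void notifyUnderReplicationLedgerChanged(GenericCallback<Void> cb) {
        // No listener is registered, so cb is never invoked.
    }

    /** Returns whether the callback was invoked; with the no-op it never is. */
    public static boolean callbackInvoked() {
        AtomicBoolean invoked = new AtomicBoolean(false);
        notifyUnderReplicationLedgerChanged((rc, result) -> invoked.set(true));
        return invoked.get();
    }

    public static void main(String[] args) {
        System.out.println("callback invoked: " + callbackInvoked());
    }
}
```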

@hangc0276 hangc0276 added the category/reliability label (The function does not work properly in certain specific environments or failures. e.g. data lost) Aug 24, 2023
liangyepianzhou added a commit that referenced this pull request Sep 4, 2023
Labels

  • area/metadata
  • category/reliability: The function does not work properly in certain specific environments or failures. e.g. data lost
  • cherry-picked/branch-2.11
  • cherry-picked/branch-3.0
  • cherry-picked/branch-3.1
  • doc-not-needed: Your PR changes do not impact docs
  • ready-to-test
  • release/2.11.3
  • release/3.0.2
  • release/3.1.1
  • type/bug: The PR fixed a bug or issue reported a bug

9 participants