[Branch-2.7] Fixed deadlock on metadata cache missing while doing che…#12484

Merged
Jason918 merged 1 commit into apache:branch-2.7 from merlimat:fix-deadlock-check-replication
Jul 28, 2022

Conversation

@merlimat
Contributor

Motivation

After the changes in #12340, there were still a couple of places making blocking calls. These calls occupy all the ordered scheduler threads, preventing the callbacks from completing until the 30-second timeout expires.

"bookkeeper-ml-scheduler-OrderedScheduler-7-0" #50 prio=5 os_prio=0 tid=0x00007f2d40050000 nid=0xe5 waiting on condition [0x00007f2d998d0000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00007f38940080e0> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
	at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
	at org.apache.pulsar.broker.service.persistent.PersistentTopic.checkReplication(PersistentTopic.java:1152)
	at org.apache.pulsar.broker.service.BrokerService$3.openLedgerComplete(BrokerService.java:1107)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.lambda$asyncOpen$8(ManagedLedgerFactoryImpl.java:425)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$$Lambda$581/978469035.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
	at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$2.initializeComplete(ManagedLedgerFactoryImpl.java:397)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl$3$1.operationComplete(ManagedLedgerImpl.java:498)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$1.operationComplete(ManagedCursorImpl.java:316)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$1.operationComplete(ManagedCursorImpl.java:289)
	at org.apache.bookkeeper.mledger.impl.MetaStoreImpl.lambda$asyncGetCursorInfo$11(MetaStoreImpl.java:170)
	at org.apache.bookkeeper.mledger.impl.MetaStoreImpl$$Lambda$679/542144696.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
	at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

"pulsar-ordered-OrderedExecutor-0-0" #13 prio=5 os_prio=0 tid=0x00007f3f73dac800 nid=0xc1 waiting on condition [0x00007f2de07e1000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00007f38940388f8> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
	at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
	at org.apache.pulsar.broker.service.BrokerService.lambda$getManagedLedgerConfig$43(BrokerService.java:1199)
	at org.apache.pulsar.broker.service.BrokerService$$Lambda$455/163843091.run(Unknown Source)
	at org.apache.bookkeeper.mledger.util.SafeRun$2.safeRun(SafeRun.java:49)
	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

This change converts those call sites to use getAsync() instead.
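
As a rough illustration of why chaining on a future avoids the stall seen in the thread dumps above, here is a minimal sketch. It does not use Pulsar's classes: `PolicyCache`, `checkReplicationAsync`, the path string, and the single-thread executor standing in for an ordered scheduler thread are all assumptions made for the example.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class AsyncLookupSketch {
    // Hypothetical stand-in for a metadata cache; NOT Pulsar's ZooKeeperDataCache API.
    interface PolicyCache {
        CompletableFuture<Optional<String>> getAsync(String path);
    }

    // Chains on the lookup future rather than blocking the calling thread with get().
    static CompletableFuture<String> checkReplicationAsync(PolicyCache cache, String path) {
        return cache.getAsync(path)
                .thenApply(p -> p.map(v -> "replicate:" + v).orElse("no-policies"));
    }

    static String runDemo() throws Exception {
        // A single worker thread plays the role of an ordered scheduler thread.
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Optional<String>> pending = new CompletableFuture<>();
            PolicyCache cache = path -> pending;

            // The task returns right away with an incomplete future, leaving the worker free...
            CompletableFuture<String> result = CompletableFuture
                    .supplyAsync(() -> checkReplicationAsync(cache, "/hypothetical/path"), worker)
                    .thenCompose(f -> f);

            // ...so the callback that completes the lookup can run on that same thread.
            worker.execute(() -> pending.complete(Optional.of("cluster-b")));

            // Only the demo's caller thread waits here, never the worker.
            return result.get(5, TimeUnit.SECONDS);
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runDemo());
    }
}
```

Had the first task called `pending.get(...)` on the worker thread, the second task could never have been scheduled on that thread, and the lookup would only have ended when the timeout fired. That is the same shape as the traces above, where the blocking call sits inside `ZooKeeperDataCache.get`.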

merlimat added the type/bug, doc-not-needed, and release/2.7.4 labels Oct 25, 2021
merlimat self-assigned this Oct 25, 2021
@315157973
Contributor

/pulsarbot run-failure-checks

@codelipenghui
Contributor

/pulsarbot run-failure-checks

@hangc0276
Contributor

/pulsarbot run-failure-checks

@codelipenghui
Contributor

/pulsarbot run-failure-checks

lhotari force-pushed the fix-deadlock-check-replication branch from 531a596 to f53ddc1 February 10, 2022 13:53
@lhotari
Member

lhotari commented Feb 10, 2022

I rebased the changes. Let's see what the test failures are.

@lhotari
Member

lhotari commented Feb 11, 2022

There are so many failures that I'm not confident about picking this into the 2.7.5 release.

@github-actions

The PR had no activity for 30 days; marking it with the Stale label.

Jason918 merged commit 32fe228 into apache:branch-2.7 Jul 28, 2022
Jason918 pushed a commit to Jason918/pulsar that referenced this pull request Jul 30, 2022
Jason918 added a commit that referenced this pull request Jul 31, 2022
* Revert "[fix][proxy] Fix client service url (#16834)"

This reverts commit 10b4e99.

* Revert "[Build] Use grpc-bom to align grpc library versions (#15234)"

This reverts commit 99c93d2.

* Revert "upgrade aircompressor to 0.20 (#11790)"

This reverts commit 5ad16b6.

* Revert "[Branch-2.7] Fixed deadlock on metadata cache missing while doing checkReplication (#12484)"

This reverts commit 32fe228.

* Revert changes of PersistentTopic#getMessageTTL in #12339.

Co-authored-by: JiangHaiting <janghaiting@apache.org>
Jason918 pushed a commit to Jason918/pulsar that referenced this pull request Jul 31, 2022
@Jason918
Contributor

This PR broke branch-2.7 and has been reverted.
I opened a new PR to fix this; see #16889.
@merlimat @codelipenghui @lhotari

Labels

doc-not-needed, lifecycle/stale, release/2.7.5, type/bug

6 participants