Member

@lhotari lhotari commented Jan 25, 2021

Motivation

The main motivation for shutting down the port listeners synchronously in BrokerService.close is to reduce test flakiness.

While investigating the flaky test MessageIdTest, exceptions of this type were seen in the logs:

Caused by: java.util.concurrent.RejectedExecutionException: Task org.apache.pulsar.metadata.impl.AbstractMetadataStore$$Lambda$669/1529307342@2f790c2a rejected from java.util.concurrent.ThreadPoolExecutor@ad9a9ac[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 9]
        at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) ~[?:1.8.0_275]
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) ~[?:1.8.0_275]
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) ~[?:1.8.0_275]
        at java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668) ~[?:1.8.0_275]
        at org.apache.pulsar.metadata.impl.AbstractMetadataStore.receivedNotification(AbstractMetadataStore.java:128) ~[pulsar-metadata-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
        at org.apache.pulsar.metadata.impl.ZKMetadataStore.process(ZKMetadataStore.java:320) ~[pulsar-metadata-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
        at org.apache.zookeeper.MockZooKeeper.lambda$setData$16(MockZooKeeper.java:728) ~[testmocks-2.8.0-SNAPSHOT.jar:3.5.7]
        at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:321) ~[guava-30.1-jre.jar:?]
        at org.apache.zookeeper.MockZooKeeper.setData(MockZooKeeper.java:683) ~[testmocks-2.8.0-SNAPSHOT.jar:3.5.7]
        at org.apache.pulsar.metadata.impl.ZKMetadataStore.put(ZKMetadataStore.java:207) ~[pulsar-metadata-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
        at org.apache.bookkeeper.mledger.impl.MetaStoreImpl.asyncUpdateLedgerIds(MetaStoreImpl.java:106) ~[managed-ledger-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.lambda$null$1(ManagedLedgerImpl.java:472) ~[managed-ledger-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
        at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[managed-ledger-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
        ... 6 more

This problem appears to occur when test code can still reach a broker instance that has already been terminated.

Modifications

Background for understanding the modifications:
The original solution was to wait up to 10 seconds until the port listeners were closed.
A change was requested to add a new setting for this timeout. After that was added, another change was requested: combine these operations so that the existing brokerShutdownTimeoutMs setting also applies to the closing of the port listeners.
The resulting solution adds BrokerService.closeAsync and PulsarService.closeAsync methods, so that closing the ports becomes part of the close operation that is governed by the brokerShutdownTimeoutMs setting.
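
The idea of folding several asynchronous close operations under a single timeout can be sketched as follows. This is only an illustration under stated assumptions, not the code in this PR: the class AsyncCloseSketch, the method addListenerCloser, and the field shutdownTimeoutMs (standing in for brokerShutdownTimeoutMs) are hypothetical names, and only JDK classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical service; not the actual BrokerService or PulsarService code.
class AsyncCloseSketch {
    // Each supplier starts closing one port listener and returns its future.
    private final List<Supplier<CompletableFuture<Void>>> listenerClosers = new ArrayList<>();
    // Plays the role of the brokerShutdownTimeoutMs setting (assumption).
    private final long shutdownTimeoutMs;

    AsyncCloseSketch(long shutdownTimeoutMs) {
        this.shutdownTimeoutMs = shutdownTimeoutMs;
    }

    void addListenerCloser(Supplier<CompletableFuture<Void>> closer) {
        listenerClosers.add(closer);
    }

    // Starts every close operation; the result completes once all of them have.
    CompletableFuture<Void> closeAsync() {
        CompletableFuture<?>[] pending = listenerClosers.stream()
                .map(Supplier::get)
                .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(pending);
    }

    // Blocking variant: a single timeout covers all listener close operations.
    // Returns true if everything closed in time, false otherwise.
    boolean close() {
        try {
            closeAsync().get(shutdownTimeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        AsyncCloseSketch fast = new AsyncCloseSketch(1000);
        fast.addListenerCloser(() -> CompletableFuture.completedFuture(null));
        System.out.println("closed in time: " + fast.close());

        AsyncCloseSketch stuck = new AsyncCloseSketch(50);
        stuck.addListenerCloser(CompletableFuture::new); // never completes
        System.out.println("closed in time: " + stuck.close());
    }
}
```

Keeping closeAsync free of checked exceptions lets callers compose it with other futures, while the blocking close stays a thin wrapper that applies the timeout in one place.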

@sijie sijie added this to the 2.8.0 milestone Jan 25, 2021
@lhotari lhotari requested a review from sijie January 26, 2021 13:41
Contributor

@eolivelli eolivelli left a comment

LGTM

Member Author

lhotari commented Jan 26, 2021

/pulsarbot run-failure-checks

@lhotari lhotari force-pushed the lh-fix-broker-shutdown branch from f43e286 to 0714b29 Compare March 22, 2021 05:49
Member Author

lhotari commented Mar 22, 2021

@merlimat I have revisited the solution by adding closeAsync methods so that it's possible to wait for shutdown using the existing brokerShutdownTimeoutMs setting. PTAL

@lhotari lhotari changed the title from "Wait for broker port listeners to shutdown in BrokerService.close" to "Wait for the async broker port listener close operations to complete at shutdown" Mar 22, 2021
@lhotari lhotari requested review from eolivelli and merlimat March 22, 2021 06:00
Member Author

lhotari commented Mar 22, 2021

/pulsarbot run-failure-checks

@lhotari lhotari force-pushed the lh-fix-broker-shutdown branch from 0714b29 to 2faceb6 Compare March 25, 2021 08:38
Member Author

lhotari commented Mar 25, 2021

@eolivelli I have made the change to closeAsync so that it doesn't throw checked exceptions.

I took another look at the MessagingServiceShutdownHook logic, and unfortunately I didn't find a way to simplify it. The intention there is to run the shutdown in a background thread so that the actual shutdown can continue if the shutdown takes longer than the timeout. Although the code might look bad, there isn't a real alternative. Catching Exception is necessary so that possible runtime exceptions at shutdown are caught as well. Feel free to suggest a better way; I tried and simply didn't find one. :)
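
The pattern described in this comment (run the shutdown on a background thread, wait for a bounded time, and catch Exception broadly) might look roughly like the sketch below. It is not the actual MessagingServiceShutdownHook code: the class ShutdownWithTimeoutSketch, the method shutdown, and the choice of a daemon thread are assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not the actual MessagingServiceShutdownHook code.
class ShutdownWithTimeoutSketch {

    // Runs the shutdown action on a background thread and waits at most timeoutMs.
    // Returns true only if the action finished without error within the timeout.
    static boolean shutdown(Runnable shutdownAction, long timeoutMs) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        Thread worker = new Thread(() -> {
            try {
                shutdownAction.run();
                done.complete(null);
            } catch (Exception e) {
                // Catch Exception rather than RuntimeException so that any
                // exception thrown during shutdown is reported via the future.
                done.completeExceptionally(e);
            }
        }, "shutdown-sketch");
        // Assumption: a daemon thread, so a stuck shutdown cannot keep the JVM alive.
        worker.setDaemon(true);
        worker.start();
        try {
            done.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            // Timed out or failed: the caller carries on either way.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Runnable slow = () -> {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println("quick: " + shutdown(() -> { }, 1000));
        System.out.println("slow: " + shutdown(slow, 50));
        System.out.println("failing: "
                + shutdown(() -> { throw new IllegalStateException("boom"); }, 1000));
    }
}
```

The caller only ever blocks for the timeout; a shutdown action that hangs or throws is reported as a false result instead of stalling or aborting the rest of the shutdown sequence.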

Member Author

lhotari commented Mar 25, 2021

/pulsarbot run-failure-checks

lhotari added 2 commits March 26, 2021 08:42
- add BrokerService.closeAsync and PulsarService.closeAsync
  so that shutdown can handle asynchronous closing operations
@lhotari lhotari force-pushed the lh-fix-broker-shutdown branch from 2faceb6 to 81b5945 Compare March 26, 2021 06:56
@lhotari lhotari requested a review from eolivelli March 26, 2021 07:12
Contributor

@eolivelli eolivelli left a comment

looks great to me

Member Author

lhotari commented Mar 26, 2021

/pulsarbot run-failure-checks

Member Author

lhotari commented Mar 26, 2021

@merlimat Please review this PR since it has changed since you approved it.

Contributor

@merlimat merlimat left a comment

Looks good. Just a couple of comments

lhotari added 3 commits March 27, 2021 14:12
Catching RuntimeExceptions isn't sufficient since checked exceptions
can be thrown in the JVM with solutions like "sneaky throws"

This reverts commit 01b5f6f.
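
The "sneaky throws" point in the commit message above can be illustrated with a small self-contained sketch. The class SneakyThrowSketch and its methods are hypothetical and exist only to show why a catch block for RuntimeException alone can miss a checked exception that was thrown without being declared.

```java
import java.io.IOException;

// Hypothetical demonstration; none of these names come from the PR.
class SneakyThrowSketch {

    // The cast to E is unchecked and erased at runtime, so the compiler
    // accepts throwing any Throwable without a matching throws clause.
    @SuppressWarnings("unchecked")
    static <E extends Throwable> void sneakyThrow(Throwable t) throws E {
        throw (E) t;
    }

    // Declares no checked exceptions, yet throws an IOException at runtime.
    static void failsSneakily() {
        SneakyThrowSketch.<RuntimeException>sneakyThrow(new IOException("disk gone"));
    }

    // Reports which catch block handled the failure.
    static String classify(Runnable task) {
        try {
            task.run();
            return "ok";
        } catch (RuntimeException e) {
            return "caught as RuntimeException";
        } catch (Exception e) {
            return "caught as Exception: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // prints "caught as Exception: IOException"
        System.out.println(classify(SneakyThrowSketch::failsSneakily));
    }
}
```

Without the second catch block the IOException would propagate out of classify, which is the situation the commit message describes.
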
Contributor

@eolivelli eolivelli left a comment

Still good for me

Member Author

lhotari commented Mar 27, 2021

/pulsarbot run-failure-checks

Contributor

@merlimat merlimat left a comment

👍

Member Author

lhotari commented Apr 15, 2021

@sijie @merlimat I have continued the work started in this PR in another PR. Please review #10199, since it will also help improve CI stability by making the asynchronous tasks of broker shutdown controllable in tests. That prevents problems caused by too many brokers being active at the same time, which can currently happen while previous brokers are still shutting down asynchronously.
