Skip to content

Conversation

@eolivelli
Copy link
Contributor

Fix a StackOverFlowException that we have seen on Pulsar in some non-OSS user application.

Problem
We are calling readMoreEntries() recursively.

Modifications
The fix is to schedule the read in another thread, like we did in #10696

Details
The stacktrace is taken from the DataStax fork of Pulsar version 2.8.0_1.1.16, but the problem still applies to master, 2.9 and 2.8 (I didn't check 2.7)

> 2022-02-02_17:56:18.396 [pulsar] STDOUT: MultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:286) ~[com.datastax.oss-pulsar-broker-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.396 [pulsar] STDOUT: 	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:473) ~[com.datastax.oss-pulsar-broker-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.396 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$11.readEntryComplete(ManagedCursorImpl.java:1251) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.396 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.EntryCacheImpl.asyncReadEntry0(EntryCacheImpl.java:211) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.396 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.EntryCacheImpl.asyncReadEntry(EntryCacheImpl.java:190) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.396 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncReadEntry(ManagedLedgerImpl.java:1916) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.396 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncReadEntry(ManagedLedgerImpl.java:1837) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.396 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.lambda$asyncReplayEntries$9(ManagedCursorImpl.java:1276) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.396 [pulsar] STDOUT: 	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
2022-02-02_17:56:18.396 [pulsar] STDOUT: 	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.TreeMap$KeySpliterator.forEachRemaining(TreeMap.java:2739) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.asyncReplayEntries(ManagedCursorImpl.java:1270) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.asyncReplayEntriesInOrder(PersistentDispatcherMultipleConsumers.java:365) ~[com.datastax.oss-pulsar-broker-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readMoreEntries(PersistentDispatcherMultipleConsumers.java:247) ~[com.datastax.oss-pulsar-broker-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:286) ~[com.datastax.oss-pulsar-broker-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:473) ~[com.datastax.oss-pulsar-broker-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$11.readEntryComplete(ManagedCursorImpl.java:1251) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.EntryCacheImpl.asyncReadEntry0(EntryCacheImpl.java:211) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.EntryCacheImpl.asyncReadEntry(EntryCacheImpl.java:190) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncReadEntry(ManagedLedgerImpl.java:1916) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncReadEntry(ManagedLedgerImpl.java:1837) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.lambda$asyncReplayEntries$9(ManagedCursorImpl.java:1276) ~[com.datastax.oss-managed-ledger-2.8.0.1.1.16.jar:2.8.0.1.1.16]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.TreeMap$KeySpliterator.forEachRemaining(TreeMap.java:2739) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]
2022-02-02_17:56:18.397 [pulsar] STDOUT: 	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
2022-02-02_17:56:18.398 [pulsar] STDOUT: 	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) ~[?:?]

@eolivelli eolivelli added area/broker doc-not-needed Your PR changes do not impact docs labels Feb 4, 2022
@eolivelli eolivelli self-assigned this Feb 4, 2022
@eolivelli eolivelli requested review from Jason918, massakam and merlimat and removed request for Jason918 February 4, 2022 14:18
@eolivelli
Copy link
Contributor Author

@massakam you did a similar fix, please take a look.

Copy link
Contributor

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@Jason918 Jason918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix LGTM.

However I wonder if it's better just make readMoreEntries run in async executors, like

    public void readMoreEntries() {
        topic.getBrokerService().executor().execute(this::doReadMoreEntries);
    }

    public synchronized void doReadMoreEntries() {
        // previous codes in readMoreEntries
    }

As there are other 5 places call readMoreEntries() directly in PersistentStickyKeyDispatcherMultipleConsumers#sendMessagesToConsumers and 2 places in PersistentDispatcherMultipleConsumers#sendMessagesToConsumers and 1 place in PersistentDispatcherMultipleConsumers#readEntriesComplete. They may (or may not) cause the same issue like this.

@eolivelli
Copy link
Contributor Author

@Jason918 I fixed the only case I have seen.
You are right, but I am not sure that we will fall into that case in all of those points.
Probably we should rework readMoreEntries() in general, but this would be outside the scope of this patch.
You know the motto "don't fix things that are not broken"

I would commit this patch as I found a way (in an integration test of a non OSS client application) to trigger the problem systematically, and I am sure that it must be fixed in this point. (But I am not able to contribute easily a reproducer)

@eolivelli eolivelli merged commit f97a76e into apache:master Feb 6, 2022
@eolivelli eolivelli deleted the fix/keyshared-stackoverflow branch February 6, 2022 19:56
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Feb 28, 2022
Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022
dragonls pushed a commit to dragonls/pulsar that referenced this pull request Oct 21, 2022
dragonls pushed a commit to dragonls/pulsar that referenced this pull request Oct 21, 2022
…request !68)

Prevent StackOverFlowException in KEY_SHARED subscription (apache#14121)
mattisonchao pushed a commit that referenced this pull request Nov 1, 2022
@mattisonchao mattisonchao added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Nov 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/broker cherry-picked/branch-2.9 Archived: 2.9 is end of life doc-not-needed Your PR changes do not impact docs release/2.8.5 release/2.9.4

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants