merlimat commented Nov 8, 2018

Motivation

There is a race condition when producers and consumers connect to a new partitioned topic concurrently and each tries to initialize the schema.

As a result, consumers get a subscribe error (they succeed once the application retries).

The exception looks like this:

1541537636157/test-pythonpartitiontopictest-output-edot-partition-1][test-subs-edot] Failed to create consumer: No such ledger exists
java.util.concurrent.CompletionException: org.apache.bookkeeper.client.BKException$BKNoSuchLedgerExistsException: No such ledger exists
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_181]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_181]
	at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943) ~[?:1.8.0_181]
	at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926) ~[?:1.8.0_181]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_181]
	at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.lambda$15(BookkeeperSchemaStorage.java:441) ~[org.apache.pulsar-pulsar-broker-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]

The main issue is that getOrCreateSchemaLocator() first creates the z-node with a dummy marker (ledgerId=-1), then creates a new ledger, and finally updates the z-node with the real ledger id.
Between the first and last steps, consumers can read the z-node while it still points to ledger -1, which produces the error above.
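
To make the window concrete, here is a minimal in-memory sketch of the old three-step sequence. It is not the Pulsar code: the class `PlaceholderRaceSketch`, its methods, and the map standing in for the z-node store are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in for the z-node store; not Pulsar's actual code.
final class PlaceholderRaceSketch {
    static final long PLACEHOLDER = -1L;

    // schema path -> ledger id recorded in the z-node
    static final ConcurrentMap<String, Long> znodes = new ConcurrentHashMap<>();

    // Step 1 of the old flow: publish the z-node before any ledger exists.
    static void createLocatorWithPlaceholder(String path) {
        znodes.putIfAbsent(path, PLACEHOLDER);
    }

    // Step 3 of the old flow: overwrite the placeholder once the ledger is created.
    static void updateLocator(String path, long ledgerId) {
        znodes.put(path, ledgerId);
    }

    // What a concurrent reader does: read whatever ledger id the z-node holds.
    static Long readLocator(String path) {
        return znodes.get(path);
    }

    public static void main(String[] args) {
        String path = "/schemas/example-topic"; // illustrative path only
        createLocatorWithPlaceholder(path);
        // A consumer connecting here reads the placeholder and would try to open ledger -1.
        System.out.println("reader sees ledger " + readLocator(path));
        updateLocator(path, 42L); // 42 is an arbitrary example ledger id
        System.out.println("reader sees ledger " + readLocator(path));
    }
}
```

Any reader that arrives between the placeholder write and the update observes a ledger id that was never created, which matches the "No such ledger exists" failure in the trace.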

Modifications

  • Added more information to the BK exception reporting (e.g. which operation was being attempted and the ledger id).
  • Removed getOrCreateSchemaLocator(). Instead, we do a get(), then create the ledger, and only then try to create the z-node with the real ledger id, so no incomplete state is ever visible.
  • Handle concurrent create conflicts (e.g. across multiple brokers) by retrying from the get operation.
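
The get, create-ledger, create-z-node, retry-on-conflict sequence can be sketched roughly as follows. This is an in-memory model under stated assumptions, not the code from this PR: `SchemaLocatorSketch`, `createLedger()` and `getOrCreateLedgerId()` are hypothetical names, a `ConcurrentHashMap` stands in for the z-node store, and `putIfAbsent` stands in for a create that fails when the node already exists.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory model of the get / create-ledger / create-z-node flow.
final class SchemaLocatorSketch {
    // schema path -> ledger id recorded in the z-node
    static final ConcurrentMap<String, Long> znodes = new ConcurrentHashMap<>();
    static final AtomicLong nextLedgerId = new AtomicLong(0);

    // Stand-in for creating a ledger; returns a fresh non-negative id.
    static long createLedger() {
        return nextLedgerId.getAndIncrement();
    }

    static long getOrCreateLedgerId(String path) {
        while (true) {
            // 1. get: if a locator already exists, use it.
            Long existing = znodes.get(path);
            if (existing != null) {
                return existing;
            }
            // 2. create the ledger before anything is published.
            long ledgerId = createLedger();
            // 3. create the z-node with the real ledger id; this only succeeds
            //    if nobody else created it in the meantime.
            if (znodes.putIfAbsent(path, ledgerId) == null) {
                return ledgerId;
            }
            // 4. conflict: another writer won, so retry from the get.
            //    (Cleanup of the unused ledger is not modeled in this sketch.)
        }
    }

    public static void main(String[] args) {
        String path = "/schemas/example-topic"; // illustrative path only
        long first = getOrCreateLedgerId(path);
        long second = getOrCreateLedgerId(path);
        System.out.println("same ledger on both calls: " + (first == second));
    }
}
```

Because the z-node is only ever written with an id returned by `createLedger()`, a reader in this model either finds no locator or finds one that points at a ledger that exists.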

merlimat added the type/bug label Nov 8, 2018
merlimat added this to the 2.2.1 milestone Nov 8, 2018
merlimat self-assigned this Nov 8, 2018

merlimat commented Nov 8, 2018

run java8 tests

merlimat merged commit b708b49 into apache:master Nov 8, 2018
merlimat added a commit that referenced this pull request Dec 13, 2018
…2959)

* Fixed race condition in schema initialization in partitioned topics

* Removed lombok log

* Fixed tests