KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends by hachikuji · Pull Request #8782 · apache/kafka

hachikuji · 2020-06-02T16:54:28Z

The method maybeWriteTxnCompletion is unsafe for concurrent calls. This can cause duplicate attempts to write the completion record to the log, which can ultimately lead to illegal state errors and possibly to correctness violations if another transaction had been started before the duplicate was written. This patch fixes the problem by ensuring only one thread can successfully remove the pending completion from the map.
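The idea of letting only one thread win the removal can be illustrated with a small standalone sketch. It is not the code from this patch: it assumes the pending completions live in a java.util.concurrent.ConcurrentHashMap, and the names PendingCompleteTxn, maybeComplete, and completionsWritten are hypothetical stand-ins. The two-argument remove(key, value) returns true for at most one caller per mapping, so the completion write is guarded by its result.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object AtomicRemoveSketch {
  // Hypothetical stand-in for the state tracked while markers are pending.
  final case class PendingCompleteTxn(transactionalId: String)

  val pending = new ConcurrentHashMap[String, PendingCompleteTxn]()
  // Counts how often the (simulated) completion record would be written.
  val completionsWritten = new AtomicInteger(0)

  def maybeComplete(transactionalId: String): Unit = {
    val txn = pending.get(transactionalId)
    // Several threads may observe the same non-null value here, but
    // remove(key, value) is atomic and succeeds for only one of them.
    if (txn != null && pending.remove(transactionalId, txn))
      completionsWritten.incrementAndGet()
  }

  // Races `threads` callers against a single pending transaction and
  // returns the number of completion writes that happened.
  def run(threads: Int): Int = {
    pending.put("txn-1", PendingCompleteTxn("txn-1"))
    completionsWritten.set(0)
    val pool = Executors.newFixedThreadPool(threads)
    val start = new CountDownLatch(1)
    (1 to threads).foreach { _ =>
      pool.execute(() => { start.await(); maybeComplete("txn-1") })
    }
    start.countDown() // release all callers at once
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    completionsWritten.get()
  }
}
```

A plain get followed by an unconditional write is the check-then-act pattern that allows duplicates; tying the write to the return value of the removal is what makes it happen at most once.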

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

hachikuji · 2020-06-02T18:39:41Z

  }

-  private def maybeWriteTxnCompletion(transactionalId: String): Unit = {
-    Option(transactionsWithPendingMarkers.get(transactionalId)).foreach { pendingCommitTxn =>


Multiple threads may see the transaction still as pending and attempt completion.

hachikuji · 2020-06-02T20:21:49Z

    EasyMock.replay(metadataCache)

    channelManager.addTxnMarkersToSend(coordinatorEpoch, txnResult, txnMetadata1, txnMetadata1.prepareComplete(time.milliseconds()))
-    channelManager.addTxnMarkersToSend(coordinatorEpoch, txnResult, txnMetadata2, txnMetadata2.prepareComplete(time.milliseconds()))


The change to use mock instead of niceMock led to a failure because of an unexpected append to the log. It seemed like this call was not necessary to test the behavior we were interested in here, so I removed it rather than adding the expected call to append.

guozhangwang

LGTM. Just a question on the unit test.

guozhangwang · 2020-06-02T21:01:36Z

  }

+  @Test
+  def shouldOnlyWriteTxnCompletionOnce(): Unit = {


Does this test cover concurrent calls to maybeWriteTxnCompletion?

It does. I was trying to set up this test to match how we're likely hitting this in practice. In the call to addTxnMarkersToSend, we have to acquire the lock before calling maybeWriteTxnCompletion. It is possible that the caller fails to acquire the lock before the markers finish getting written and the transaction gets completed in the request completion handler.

Got it, so if the bug still existed, this test would probably not fail consistently but would be flaky, right?

It fails deterministically without the fix.

Ah yes, I missed the txnMetadata2.lock.lock() before starting the scheduler, thanks.
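The interleaving discussed in this thread can be forced rather than left to chance by holding the lock before the competing caller starts. The following is a simplified sketch of that idea, not the test added in this PR: txnLock stands in for the transaction metadata lock, and callerPath, maybeComplete, and writes are hypothetical names. The test thread holds the lock, completes the transaction as the completion handler would, and only then lets the blocked caller proceed.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.locks.ReentrantLock

object LockOrderingSketch {
  val pending = new ConcurrentHashMap[String, String]()
  val writes = new AtomicInteger(0)
  // Stand-in for the per-transaction metadata lock.
  val txnLock = new ReentrantLock()

  def maybeComplete(id: String): Unit = {
    val state = pending.get(id)
    // Only the caller whose remove succeeds performs the write.
    if (state != null && pending.remove(id, state)) writes.incrementAndGet()
  }

  // Caller path: must acquire the lock before attempting completion.
  def callerPath(id: String): Unit = {
    txnLock.lock()
    try maybeComplete(id) finally txnLock.unlock()
  }

  def run(): Int = {
    pending.put("txn-2", "PrepareCommit")
    writes.set(0)
    txnLock.lock() // taken before the competing caller is started
    val started = new CountDownLatch(1)
    val caller = new Thread(() => { started.countDown(); callerPath("txn-2") })
    try {
      caller.start()
      started.await(5, TimeUnit.SECONDS)
      // Handler path completes the transaction while the caller waits on the lock.
      maybeComplete("txn-2")
    } finally txnLock.unlock()
    caller.join(5000) // the caller now runs and finds nothing left to complete
    writes.get()
  }
}
```

Because the caller cannot get past the lock until the handler path has finished, the late second attempt always happens after the first, and the sketch checks that it results in no additional write.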

guozhangwang · 2020-06-02T21:02:03Z

    val response = new WriteTxnMarkersResponse(createPidErrorMap(Errors.NONE))
    for (requestAndHandler <- requestAndHandlers) {
-      requestAndHandler.handler.onComplete(new ClientResponse(new RequestHeader(ApiKeys.PRODUCE, 0, "client", 1),
+      requestAndHandler.handler.onComplete(new ClientResponse(new RequestHeader(ApiKeys.WRITE_TXN_MARKERS, 0, "client", 1),


Nice catch. Not sure why they did not fail before :)

hachikuji · 2020-06-03T04:52:56Z

retest this please

hachikuji · 2020-06-03T04:54:21Z

retest this please

hachikuji · 2020-06-03T04:54:34Z

retest this please

chia7712 · 2020-06-03T14:18:08Z

retest this please

…licate appends (#8782)

The method `maybeWriteTxnCompletion` is unsafe for concurrent calls. This can cause duplicate attempts to write the completion record to the log, which can ultimately lead to illegal state errors and possibly to correctness violations if another transaction had been started before the duplicate was written. This patch fixes the problem by ensuring only one thread can successfully remove the pending completion from the map.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Guozhang Wang <wangguoz@gmail.com>

…use_all_dns_ips-as-default

* apache-github/trunk:
  KAFKA-9788; Use distinct names for transaction and group load time sensors (#8784)
  KAFKA-9514; The protocol generator generated useless condition when a field is made nullable and flexible version is used (#8793)
  MINOR: Update to Gradle 6.5 and tweak build jvm config (#8751)
  MINOR: Upgrade spotbugs and spotbugsPlugin (#8790)
  KAFKA-10089 The stale ssl engine factory is not closed after reconfigure (#8792)
  KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends (#8782)
  KAFKA-10084: Fix EosTestDriver end offset (#8785)
  KAFKA-10083: fix failed testReassignmentWithRandomSubscriptionsAndChanges tests (#8786)

KAFKA-10080; Fix race condition on txn completion which can cause dup…

44d2815

…licate appends

chia7712 reviewed Jun 2, 2020

Comment thread core/src/main/scala/kafka/coordinator/transaction/TransactionMarkerChannelManager.scala Outdated

Rename pendingCommitTxn to pendingCompleteTxn

9d1d560

hachikuji commented Jun 2, 2020

Fix failing test due to nice mock change

c52dc75

hachikuji commented Jun 2, 2020

guozhangwang approved these changes Jun 2, 2020

chia7712 approved these changes Jun 3, 2020

hachikuji merged commit 0ffbc6e into apache:trunk Jun 3, 2020
