KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends #8782

Merged
hachikuji merged 3 commits into apache:trunk from hachikuji:KAFKA-10080
Jun 3, 2020

@hachikuji
Contributor

The method `maybeWriteTxnCompletion` is unsafe for concurrent calls. This can cause duplicate attempts to write the completion record to the log, which can ultimately lead to illegal state errors and possibly to correctness violations if another transaction had been started before the duplicate was written. This patch fixes the problem by ensuring only one thread can successfully remove the pending completion from the map.
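
For illustration, here is a minimal sketch of the "only one thread can successfully remove the pending completion" idea, assuming the pending completions are kept in a `ConcurrentHashMap`. The names `AtomicRemoveSketch`, `pending`, and `completionsWritten`, the simplified `PendingCompleteTxn` case class, and the counter that stands in for the log append are all assumptions made for this example; this is not the coordinator's actual code.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Simplified, hypothetical stand-in for the coordinator's pending-completion state.
final case class PendingCompleteTxn(transactionalId: String, coordinatorEpoch: Int)

object AtomicRemoveSketch {
  val pending = new ConcurrentHashMap[String, PendingCompleteTxn]()
  val completionsWritten = new AtomicInteger(0)

  // Many threads may observe the entry, but remove(key, value) returns true
  // for exactly one of them, so only that thread performs the write.
  def maybeWriteCompletion(transactionalId: String): Unit = {
    val entry = pending.get(transactionalId)
    if (entry != null && pending.remove(transactionalId, entry)) {
      completionsWritten.incrementAndGet() // stands in for the log append
    }
  }

  // Races `threads` callers against a single pending entry and
  // returns how many of them performed the write.
  def run(threads: Int): Int = {
    pending.clear()
    completionsWritten.set(0)
    pending.put("txn-1", PendingCompleteTxn("txn-1", 0))
    val pool = Executors.newFixedThreadPool(threads)
    val start = new CountDownLatch(1)
    (1 to threads).foreach { _ =>
      pool.execute(() => { start.await(); maybeWriteCompletion("txn-1") })
    }
    start.countDown()
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    completionsWritten.get()
  }
}
```

Under these assumptions, `AtomicRemoveSketch.run(8)` should report a single write regardless of how the threads interleave, because the conditional `remove` acts as the claim on the entry.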

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

}

private def maybeWriteTxnCompletion(transactionalId: String): Unit = {
Option(transactionsWithPendingMarkers.get(transactionalId)).foreach { pendingCommitTxn =>
Contributor Author

Multiple threads may see the transaction still as pending and attempt completion.
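
To make the comment above concrete, here is a single-threaded sketch that replays one problematic interleaving by hand: both callers read the map before either one acts. `CheckThenActSketch` and the string values are hypothetical; the point is only to show why a plain "get, then act" sequence allows two writes.

```scala
import java.util.concurrent.ConcurrentHashMap

object CheckThenActSketch {
  // Replays the interleaving "A checks, B checks, A acts, B acts"
  // and returns the number of completion writes that result.
  def interleavedWrites(): Int = {
    val pending = new ConcurrentHashMap[String, String]()
    pending.put("txn-1", "pending-commit")
    var writes = 0

    // Step 1: both callers observe the transaction as still pending.
    val seenByA = Option(pending.get("txn-1"))
    val seenByB = Option(pending.get("txn-1"))

    // Step 2: each caller acts on its own, now possibly stale, observation.
    seenByA.foreach { _ => writes += 1; pending.remove("txn-1") }
    seenByB.foreach { _ => writes += 1; pending.remove("txn-1") }

    writes
  }
}
```

In this replay both callers perform the write, which is the duplicate append the patch is meant to rule out.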

EasyMock.replay(metadataCache)

channelManager.addTxnMarkersToSend(coordinatorEpoch, txnResult, txnMetadata1, txnMetadata1.prepareComplete(time.milliseconds()))
channelManager.addTxnMarkersToSend(coordinatorEpoch, txnResult, txnMetadata2, txnMetadata2.prepareComplete(time.milliseconds()))
Contributor Author

The change to use mock instead of niceMock led to a failure because of an unexpected append to the log. It seemed like this call was not necessary to test the behavior we were interested in here, so I removed it rather than adding the expected call to append.

@guozhangwang (Contributor) left a comment

LGTM. Just a question on the unit test.

}

@Test
def shouldOnlyWriteTxnCompletionOnce(): Unit = {
Contributor

Does this test cover concurrent calls to maybeWriteTxnCompletion?

Contributor Author

It does. I was trying to set up this test to match how we are likely hitting this in practice. In the call to addTxnMarkersToSend, we have to acquire the lock before calling maybeWriteTxnCompletion. The caller may not acquire the lock until after the markers have been written and the transaction has been completed in the request completion handler.

Contributor

Got it, so while the bug still exists this test would probably not fail consistently, but would be flaky, right?

Contributor Author

It fails deterministically without the fix.

Contributor

Ah yes, I missed the txnMetadata2.lock.lock() before starting the scheduler, thanks.
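
The lock-holding setup discussed in this thread can be sketched roughly as follows. The test thread takes the lock first, so the other caller is guaranteed to block until the completion path has already run, which fixes the interleaving instead of leaving it to timing. Everything here is an assumption for illustration: `LockOrderingSketch`, the `ReentrantLock` standing in for the per-transaction lock, and in particular the `atomicRemove = false` branch, which is a deliberately simplified unsafe variant and not the pre-fix Kafka code.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.locks.ReentrantLock

object LockOrderingSketch {
  // Returns the number of completion writes for the forced interleaving.
  def run(atomicRemove: Boolean): Int = {
    val pending = new ConcurrentHashMap[String, String]()
    val writes = new AtomicInteger(0)
    val txnLock = new ReentrantLock() // stands in for the per-transaction lock

    def maybeWriteCompletion(id: String): Unit = {
      val entry = pending.get(id)
      if (entry != null) {
        if (atomicRemove) {
          // Safe variant: only the thread that wins the removal writes.
          if (pending.remove(id, entry)) writes.incrementAndGet()
        } else {
          // Hypothetical unsafe variant: writes while the entry stays visible.
          writes.incrementAndGet()
        }
      }
    }

    pending.put("txn-1", "pending-commit")

    txnLock.lock() // held by the test thread, so the caller below must wait
    val caller = new Thread(() => {
      txnLock.lock()
      try maybeWriteCompletion("txn-1") finally txnLock.unlock()
    })
    caller.start()

    // The completion-handler path finishes the transaction while the caller is blocked.
    maybeWriteCompletion("txn-1")
    txnLock.unlock()
    caller.join()
    writes.get()
  }
}
```

With this ordering the outcome does not depend on scheduling: the safe variant yields one write and the simplified unsafe variant yields two, which is the kind of deterministic failure described above.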

val response = new WriteTxnMarkersResponse(createPidErrorMap(Errors.NONE))
for (requestAndHandler <- requestAndHandlers) {
- requestAndHandler.handler.onComplete(new ClientResponse(new RequestHeader(ApiKeys.PRODUCE, 0, "client", 1),
+ requestAndHandler.handler.onComplete(new ClientResponse(new RequestHeader(ApiKeys.WRITE_TXN_MARKERS, 0, "client", 1),
Contributor

Nice catch. Not sure why they did not fail before :)

@hachikuji
Contributor Author

retest this please

@chia7712
Member

chia7712 commented Jun 3, 2020

retest this please

@hachikuji hachikuji merged commit 0ffbc6e into apache:trunk Jun 3, 2020
hachikuji added a commit that referenced this pull request Jun 3, 2020
…licate appends (#8782)

The method `maybeWriteTxnCompletion` is unsafe for concurrent calls. This can cause duplicate attempts to write the completion record to the log, which can ultimately lead to illegal state errors and possibly to correctness violations if another transaction had been started before the duplicate was written. This patch fixes the problem by ensuring only one thread can successfully remove the pending completion from the map.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
ijuma added a commit that referenced this pull request Jun 4, 2020
…use_all_dns_ips-as-default

* apache-github/trunk:
  KAFKA-9788; Use distinct names for transaction and group load time sensors (#8784)
  KAFKA-9514; The protocol generator generated useless condition when a field is made nullable and flexible version is used (#8793)
  MINOR: Update to Gradle 6.5 and tweak build jvm config (#8751)
  MINOR: Upgrade spotbugs and spotbugsPlugin (#8790)
  KAFKA-10089 The stale ssl engine factory is not closed after reconfigure (#8792)
  KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends (#8782)
  KAFKA-10084: Fix EosTestDriver end offset (#8785)
  KAFKA-10083: fix failed testReassignmentWithRandomSubscriptionsAndChanges tests (#8786)