Skip to content

KAFKA-9582: Do not abort transaction in unclean close#8143

Merged
guozhangwang merged 4 commits intoapache:2.5from
abbccdda:no_abort
Feb 21, 2020
Merged

KAFKA-9582: Do not abort transaction in unclean close#8143
guozhangwang merged 4 commits intoapache:2.5from
abbccdda:no_abort

Conversation

@abbccdda
Copy link
Copy Markdown

@abbccdda abbccdda commented Feb 20, 2020

In order to avoid hitting the fatal exception during unclean close, we should avoid calling the abortTransaction() call.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@abbccdda abbccdda force-pushed the no_abort branch 2 times, most recently from 0985f56 to f5fa343 Compare February 20, 2020 23:21

assertTrue(producer.transactionAborted());
assertFalse(producer.transactionInFlight());
assertTrue(producer.transactionInFlight());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one in particular might be worth a comment that we're specifically checking because we don't want to call methods on the producer during an unclean close.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

@vvcephei
Copy link
Copy Markdown
Contributor

Thanks @abbccdda ! It LGTM.

@guozhangwang , do you mind taking a look to confirm that this is what you had in mind?

@vvcephei
Copy link
Copy Markdown
Contributor

Ok to test.

@abbccdda
Copy link
Copy Markdown
Author

abbccdda commented Feb 21, 2020

The two test failures are just flaky:
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/815/
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/815/

Both are failing the following 2 tests:
kafka.admin.DescribeConsumerGroupTest.testDescribeGroupMembersWithShortInitializationTimeout
kafka.admin.DescribeConsumerGroupTest.testDescribeGroupWithShortInitializationTimeout

and a fixing PR is in review stage: #8094

@vvcephei
Copy link
Copy Markdown
Contributor

cc @mumrah

@vvcephei
Copy link
Copy Markdown
Contributor

ok to test

Copy link
Copy Markdown
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a pass on the PR and left a question about simply removing isZombie and always skip calling abort txn.

}

private void maybeAbortTransactionAndCloseRecordCollector(final boolean isZombie) {
if (!isZombie) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are some callers of suspend that would set clean to false while isZombie to false as well, e.g. suspendRunningTasks in this case should we still call abortTxn?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That call is inside the catch block of task.suspend. For simplicity, I think calling abortTxn is not very tempting at this point as well.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After another thought, I'm pretty sure the purpose of this PR is not to maintain the caller of abortTxn in either mode to simplify our error handling. The other txn mechanism like InitPid will make sure to cleanup the pending transactions or through the txn timeout. However the risk of calling abortTxn during close is much higher than a normal processing loop, at least for today's producer. If the producer is in FATAL_ERROR, it becomes a bomb for any caller touching on its APIs except close().

We could think of doing some safe abort operations for producer internally instead of externally, which could be done by either getting a new API or just change the default behavior of producer.close to do the transaction abortion when time allowed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool I think I am convinced, let's just ignore the isZombie flag then.

@guozhangwang guozhangwang merged commit 121df46 into apache:2.5 Feb 21, 2020
@guozhangwang
Copy link
Copy Markdown
Contributor

Merged to 2.5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants