Skip to content

KAFKA-7605; Retry async commit failures in integration test cases to fix flaky tests#5890

Merged
ijuma merged 6 commits intoapache:trunkfrom
hachikuji:KAFKA-7605
Nov 13, 2018
Merged

KAFKA-7605; Retry async commit failures in integration test cases to fix flaky tests#5890
ijuma merged 6 commits intoapache:trunkfrom
hachikuji:KAFKA-7605

Conversation

@hachikuji
Copy link
Copy Markdown
Contributor

@hachikuji hachikuji commented Nov 8, 2018

We are seeing some timeouts in tests which depend on the awaitCommitCallback (e.g. SaslMultiMechanismConsumerTest.testCoordinatorFailover). After some investigation, we found that it is caused by a disconnect when attempting the async commit. To fix the problem, we have added simple retry logic to the test utility.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@hachikuji
Copy link
Copy Markdown
Contributor Author

Well, the one failure in the jdk8 build was the test I'm trying to fix, so just increasing the timeout is not enough 😞. Let me try again to reproduce.

val numFailuresBeforePoll = commitCallback.failCount
TestUtils.pollUntilTrue(consumer, () => commitCallback.successCount >= count || commitCallback.failCount > numFailuresBeforePoll,
"Failed to observe commit callback before timeout", waitTimeMs = 10000)
assertEquals("Unexpected async commit failure", numFailuresBeforePoll, commitCallback.failCount)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flakiness comes from the async commit failing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what is timing out. I was wondering if the commit is actually failing.

@hachikuji
Copy link
Copy Markdown
Contributor Author

retest this please

@hachikuji
Copy link
Copy Markdown
Contributor Author

The latest failure confirmed my suspicion:

java.lang.AssertionError: Unexpected async commit failure expected:<0> but was:<1>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:834)
	at org.junit.Assert.assertEquals(Assert.java:645)
	at kafka.api.BaseConsumerTest.awaitCommitCallback(BaseConsumerTest.scala:200)
	at kafka.api.BaseConsumerTest.ensureNoRebalance(BaseConsumerTest.scala:216)
	at kafka.api.BaseConsumerTest.testCoordinatorFailover(BaseConsumerTest.scala:117)

That likely means that an unexpected rebalance has taken place. I will try to enable some additional logging to get to the cause of the rebalance.

@hachikuji
Copy link
Copy Markdown
Contributor Author

retest this please

2 similar comments
@hachikuji
Copy link
Copy Markdown
Contributor Author

retest this please

@hachikuji
Copy link
Copy Markdown
Contributor Author

retest this please

@hachikuji
Copy link
Copy Markdown
Contributor Author

retest this please

3 similar comments
@hachikuji
Copy link
Copy Markdown
Contributor Author

retest this please

@hachikuji
Copy link
Copy Markdown
Contributor Author

retest this please

@hachikuji
Copy link
Copy Markdown
Contributor Author

retest this please

@hachikuji
Copy link
Copy Markdown
Contributor Author

At long last, the cause is clear:

[2018-11-10 21:29:20,414] ERROR Commit callback failed unexpectedly (kafka.api.SaslMultiMechanismConsumerTest:76)
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit failed with a retriable exception. You should retry committing the latest consumed offsets.
Caused by: org.apache.kafka.common.errors.DisconnectException

I will add some logic to retry when we get RetriableCommitFailedException.

@hachikuji
Copy link
Copy Markdown
Contributor Author

retest this please

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Nov 12, 2018

So this was just a test bug?

@hachikuji
Copy link
Copy Markdown
Contributor Author

@ijuma Yep, the test seemed to have regressed in a recent refactor. We need not name the guilty party though 😝

@hachikuji
Copy link
Copy Markdown
Contributor Author

retest this please

@hachikuji
Copy link
Copy Markdown
Contributor Author

Three straight successful jdk8 builds. Before it was failing nearly every time with this particular test case failing every couple times.

@hachikuji hachikuji changed the title KAFKA-7605; Increase timeout for commit callback in testing KAFKA-7605; Retry async commit failures in integration test cases to fix flaky tests Nov 12, 2018
override def onComplete(offsets: util.Map[TopicPartition, OffsetAndMetadata], exception: Exception): Unit = {
exception match {
case null =>
isComplete = true
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: would it be slightly better if we removed this and in the case e clause, changed the last line to error = Option(e)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sounds good.

Copy link
Copy Markdown
Member

@ijuma ijuma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, LGTM.

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Nov 12, 2018

Java 11 job failure is unrelated: kafka.server.RequestQuotaTest.testResponseThrottleTimeWhenBothFetchAndRequestQuotasViolated

Java 8 job passed.

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Nov 12, 2018

@hachikuji does this need to be cherry-picked to 2.1 or any other branch?

@ijuma ijuma merged commit e08d721 into apache:trunk Nov 13, 2018
@ijuma
Copy link
Copy Markdown
Member

ijuma commented Nov 13, 2018

I merged to trunk, please cherry-pick if needed @hachikuji.

pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
…fix flaky tests (apache#5890)

We are seeing some timeouts in tests which depend on the awaitCommitCallback (e.g.
SaslMultiMechanismConsumerTest.testCoordinatorFailover). After some investigation,
we found that it is caused by a disconnect when attempting the async commit.
To fix the problem, we have added simple retry logic to the test utility.

Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Ismael Juma <ismael@juma.me.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants