Skip to content

KAFKA-8164: Improve test passing rate by rerunning flaky tests#8019

Merged
ijuma merged 8 commits intoapache:trunkfrom
viktorsomogyi:gradle-flaky-retry
Feb 6, 2020
Merged

KAFKA-8164: Improve test passing rate by rerunning flaky tests#8019
ijuma merged 8 commits intoapache:trunkfrom
viktorsomogyi:gradle-flaky-retry

Conversation

@viktorsomogyi
Copy link
Copy Markdown
Contributor

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@viktorsomogyi viktorsomogyi requested a review from ijuma January 29, 2020 13:53
@viktorsomogyi
Copy link
Copy Markdown
Contributor Author

viktorsomogyi commented Jan 29, 2020

@ijuma I've created the new PR about flaky tests. Currently it lets 3 retries for a test and fails the build after 50 test failure. Both numbers are a bit arbitrary (especially the 50 :) ), but if you know a good constant for them, I'd be happy to change them.

Furthermore I ran into a few cases where the WhateverClass.classMethod failed which usually happens if there were unexpected running threads after the test. These obviously can't be retried and the plugin will fail as well (and subsequently fails the build too):

* What went wrong:
Execution failed for task ':core:test'.
> org.gradle.test-retry was unable to retry the following test methods, which is unexpected. Please file a bug report at https://github.com/gradle/test-retry-gradle-plugin/issues
     kafka.api.MetricsTest#classMethod
     kafka.api.PlaintextAdminIntegrationTest#classMethod
     kafka.api.SslAdminIntegrationTest#classMethod
     kafka.api.ConsumerTopicCreationTest#classMethod
     kafka.api.SaslSslAdminIntegrationTest#classMethod

Not sure if this is a big problem but I wanted to let you know. The old custom flaky retry didn't handle these either.

Comment thread build.gradle Outdated

retry {
maxRetries = 3
maxFailures = 50
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would go with something conservative: 1 retry and max 5 failures.We want to fix flaky tests, but if something just fails once, it's not worth failing the build given Jenkins flakiness.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can adjust after we see how it works in practice, but this would be a good starting point IMO.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I figured we can also make these configurable. Updated the PR.

Comment thread build.gradle Outdated
userMaxForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : null

userFlakyRetries = project.hasProperty('flakyRetries') ? flakyRetries.toInteger() : 1
userMaxTestFailures = project.hasProperty('maxTestFailures') ? maxTestFailures.toInteger() : 5
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about maxTestRetries and maxTestRetryFailures? Also, we need to update the Readme to mention these.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@viktorsomogyi
Copy link
Copy Markdown
Contributor Author

Updated the common configs section

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Feb 5, 2020

ok to test

Comment thread gradle/dependencies.gradle Outdated
spotbugs: "3.1.12",
spotbugsPlugin: "3.0.0",
spotlessPlugin: "3.27.1",
testRetryPlugin: "1.0.2",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using latest version (1.1.0)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed it was released shortly after this PR was submitted. Updated.

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Feb 5, 2020

ok to test

Copy link
Copy Markdown
Member

@ijuma ijuma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Feb 5, 2020

@viktorsomogyi I made some minor changes, please check that you're happy with them.

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Feb 5, 2020

ok to test

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Feb 5, 2020

retest this please

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Feb 5, 2020

Thinking about this some more and after some discussion, I think we should probably default to 0 retries and tweak the PR builder to pass 1 and 5. Will update the PR if @viktorsomogyi doesn't beat me to it.

@viktorsomogyi
Copy link
Copy Markdown
Contributor Author

viktorsomogyi commented Feb 5, 2020

@ijuma I was behind you 10 minutes or so, waited for the local build to run and I saw it only after that you already pushed. :)
I'm happy with the changes.

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Feb 5, 2020

retest this please

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Feb 5, 2020

Updated Jenkins PR jobs to be:

./jenkins.sh -PmaxParallelForks=2 -PscalaVersion=2.12 -PmaxTestRetries=1 -PmaxTestRetryFailures=5

Branch builds have not been changed.

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Feb 6, 2020

One flaky failure with retries (2.12 job):

kafka.admin.ResetConsumerGroupOffsetTest.testResetOffsetsExportImportPlan

Three flaky failures without retries (2.13 job):

kafka.admin.DescribeConsumerGroupTest.testDescribeGroupWithShortInitializationTimeout
org.apache.kafka.streams.kstream.internals.KStreamImplTest.shouldSupportTriggerMaterializedWithKTableFromKStream
org.apache.kafka.streams.processor.internals.StoreChangelogReaderTest.shouldRequestCommittedOffsetsAndHandleTimeoutException[1]

I'll go ahead and merge this so that we can evaluate how well it works in practice.

@ijuma ijuma merged commit 987f0ee into apache:trunk Feb 6, 2020
@ijuma
Copy link
Copy Markdown
Member

ijuma commented Feb 6, 2020

Thanks for the contribution!

@viktorsomogyi viktorsomogyi deleted the gradle-flaky-retry branch February 6, 2020 11:07
@viktorsomogyi
Copy link
Copy Markdown
Contributor Author

@ijuma thank you for reviewing this change!

ijuma added a commit to confluentinc/kafka that referenced this pull request Feb 7, 2020
Conflicts:
* build.gradle: moved avro plugin definition below
newly added test retry plugin.

* apache-github/trunk:
  MINOR: further InternalTopologyBuilder cleanup  (apache#8046)
  MINOR: Add timer for update limit offsets (apache#8047)
  HOTFIX: Fix spotsbug failure in Kafka examples (apache#8051)
  KAFKA-9447: Add new customized EOS model example (apache#8031)
  KAFKA-8164: Add support for retrying failed (apache#8019)
  HOTFIX: checkstyle for newly added unit test
  KAFKA-9261; Client should handle unavailable leader metadata (apache#7770)
  MINOR: Fix typos introduced in KIP-559 (apache#8042)
  MINOR: Fixing null handilg in ValueAndTimestampSerializer (apache#7679)
  KAFKA-9113: Clean up task management and state management (apache#7997)
  MINOR: fix checkstyle issue in ConsumerConfig.java (apache#8038)
  KAFKA-9491; Increment high watermark after full log truncation (apache#8037)
  KAFKA-9477 Document RoundRobinAssignor as an option for partition.assignment.strategy (apache#8007)
  KAFKA-9074: Correct Connect’s `Values.parseString` to properly parse a time and timestamp literal (apache#7568)
  KAFKA-9492; Ignore record errors in ProduceResponse for older versions (apache#8030)
ijuma added a commit to ijuma/kafka that referenced this pull request Apr 28, 2020
…t-for-generated-requests

* apache-github/trunk: (410 commits)
  KAFKA-8843: KIP-515: Zookeeper TLS support
  MINOR: Add missing quote for malformed line content (apache#8070)
  MINOR: Simplify KafkaProducerTest (apache#8044)
  KAFKA-9507; AdminClient should check for missing committed offsets (apache#8057)
  KAFKA-9519: Deprecate the --zookeeper flag in ConfigCommand (apache#8056)
  KAFKA-9509; Fixing flakiness of MirrorConnectorsIntegrationTest.testReplication (apache#8048)
  HOTFIX: Fix two test failures in JDK11 (apache#8063)
  DOCS - clarify transactionalID and idempotent behavior (apache#7821)
  MINOR: further InternalTopologyBuilder cleanup  (apache#8046)
  MINOR: Add timer for update limit offsets (apache#8047)
  HOTFIX: Fix spotsbug failure in Kafka examples (apache#8051)
  KAFKA-9447: Add new customized EOS model example (apache#8031)
  KAFKA-8164: Add support for retrying failed (apache#8019)
  HOTFIX: checkstyle for newly added unit test
  KAFKA-9261; Client should handle unavailable leader metadata (apache#7770)
  MINOR: Fix typos introduced in KIP-559 (apache#8042)
  MINOR: Fixing null handilg in ValueAndTimestampSerializer (apache#7679)
  KAFKA-9113: Clean up task management and state management (apache#7997)
  MINOR: fix checkstyle issue in ConsumerConfig.java (apache#8038)
  KAFKA-9491; Increment high watermark after full log truncation (apache#8037)
  ...
Comment thread README.md
./gradlew clients:test --tests RequestResponseTest

### Specifying test retries ###
By default, each failed test is retried once up to a maximum of five retries per test run. Tests are retried at the end of the test task. Adjust these parameters in the following way:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should update this line that by default it is not retrying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants