Skip to content

KAFKA-9509: Fixing flakiness of MirrorConnectorsIntegrationTest.testReplication#8048

Merged
hachikuji merged 8 commits intoapache:trunkfrom
skaundinya15:KAFKA-9509
Feb 7, 2020
Merged

KAFKA-9509: Fixing flakiness of MirrorConnectorsIntegrationTest.testReplication#8048
hachikuji merged 8 commits intoapache:trunkfrom
skaundinya15:KAFKA-9509

Conversation

@skaundinya15
Copy link
Copy Markdown
Contributor

@skaundinya15 skaundinya15 commented Feb 5, 2020

JIRA: https://issues.apache.org/jira/browse/KAFKA-9509

As the JIRA indicates, org.apache.kafka.connect.mirror.MirrorConnectorsIntegrationTest.testReplication has shown to be an increasingly flaky test recently. This PR aims to make this test more deterministic. Specifically, the flakiness was due to a timing issue between the tasks not starting up in time for the test to start running. This PR remediates that by introducing a status check after every connector is started up. These status checks include that the connector is found on the connect cluster as well as there are tasks created and up and running for that connector. These checks are introduced before the test starts running so that there is a confidence that the connectors and tasks are started up correctly before the test runs.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@hachikuji
Copy link
Copy Markdown
Contributor

ok to test

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, left a few comments.

Copy link
Copy Markdown
Contributor

@kkonstantine kkonstantine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @skaundinya15
This test is indeed flaky.

After a first pass on the PR I have a suggestion regarding your assertion code. We have a similar assertion elsewhere.

@skaundinya15
Copy link
Copy Markdown
Contributor Author

Thanks for the initial reviews @hachikuji and @kkonstantine! I've updated as per your comments by using the method posted by @kkonstantine and checking to see if the connector and its tasks were up after each connector is configured.

@hachikuji Now that the logic has changed a bit, do you think it's okay to do the connector + task check after each connector is configured or do you think we should refactor it out so we can do it per EmbeddedConnectCluster that's configured?

Copy link
Copy Markdown
Contributor

@kkonstantine kkonstantine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @skaundinya15
Almost there I think. One more comment regarding how the assertion is validated.

Additionally, I'd also change the jenkins.sh file locally and would run this test repeatedly (like 100 times at least) to see if it ever fails again (locally and in a jenkins branch builder).

Copy link
Copy Markdown
Contributor

@ryannedolan ryannedolan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, big improvement :)

Copy link
Copy Markdown
Contributor

@kkonstantine kkonstantine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this. LGTM

@skaundinya15
Copy link
Copy Markdown
Contributor Author

Thanks for the reviews @kkonstantine and @hachikuji!

@hachikuji I just updated the PR with your suggested changes, so ready for another round of reviews whenever you are.

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few small comments.

@hachikuji
Copy link
Copy Markdown
Contributor

ok to test

@hachikuji
Copy link
Copy Markdown
Contributor

retest this please

@hachikuji
Copy link
Copy Markdown
Contributor

ok to test

@hachikuji
Copy link
Copy Markdown
Contributor

Checkstyle failed:

22:31:45 [ant:checkstyle] [ERROR] /home/jenkins/jenkins-slave/workspace/kafka-pr-jdk11-scala2.13/connect/mirror/src/test/java/org/apache/kafka/connect/mirror/MirrorConnectorsIntegrationTest.java:194:40: ')' is preceded with whitespace. [ParenPad]

@skaundinya15
Copy link
Copy Markdown
Contributor Author

@hachikuji Checkstyle issues should be fixed

@hachikuji
Copy link
Copy Markdown
Contributor

ok to test

@hachikuji
Copy link
Copy Markdown
Contributor

retest this please

2 similar comments
@hachikuji
Copy link
Copy Markdown
Contributor

retest this please

@hachikuji
Copy link
Copy Markdown
Contributor

retest this please

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the fix!

@hachikuji hachikuji merged commit 4d86b19 into apache:trunk Feb 7, 2020
@skaundinya15 skaundinya15 deleted the KAFKA-9509 branch February 7, 2020 22:02
hachikuji pushed a commit that referenced this pull request Feb 7, 2020
…eplication (#8048)

The test case `org.apache.kafka.connect.mirror.MirrorConnectorsIntegrationTest.testReplication` has shown to be increasingly flaky recently. This PR aims to make this test more deterministic. Specifically, the flakiness was due to a timing issue between the tasks not starting up in time for the test to start running. This PR remediates that by introducing a status check after every connector is started up. These status checks include that the connector is found on the connect cluster as well as there are tasks created and up and running for that connector. These checks are introduced before the test starts running so that there is a confidence that the connectors and tasks are started up correctly before the test runs.

Reviewers: Konstantine Karantasis <konstantine@confluent.io>, Jason Gustafson <jason@confluent.io>
stanislavkozlovski pushed a commit to stanislavkozlovski/kafka that referenced this pull request Feb 18, 2020
…eplication (apache#8048)

The test case `org.apache.kafka.connect.mirror.MirrorConnectorsIntegrationTest.testReplication` has shown to be increasingly flaky recently. This PR aims to make this test more deterministic. Specifically, the flakiness was due to a timing issue between the tasks not starting up in time for the test to start running. This PR remediates that by introducing a status check after every connector is started up. These status checks include that the connector is found on the connect cluster as well as there are tasks created and up and running for that connector. These checks are introduced before the test starts running so that there is a confidence that the connectors and tasks are started up correctly before the test runs.

Reviewers: Konstantine Karantasis <konstantine@confluent.io>, Jason Gustafson <jason@confluent.io>
ijuma added a commit to ijuma/kafka that referenced this pull request Apr 28, 2020
…t-for-generated-requests

* apache-github/trunk: (410 commits)
  KAFKA-8843: KIP-515: Zookeeper TLS support
  MINOR: Add missing quote for malformed line content (apache#8070)
  MINOR: Simplify KafkaProducerTest (apache#8044)
  KAFKA-9507; AdminClient should check for missing committed offsets (apache#8057)
  KAFKA-9519: Deprecate the --zookeeper flag in ConfigCommand (apache#8056)
  KAFKA-9509; Fixing flakiness of MirrorConnectorsIntegrationTest.testReplication (apache#8048)
  HOTFIX: Fix two test failures in JDK11 (apache#8063)
  DOCS - clarify transactionalID and idempotent behavior (apache#7821)
  MINOR: further InternalTopologyBuilder cleanup  (apache#8046)
  MINOR: Add timer for update limit offsets (apache#8047)
  HOTFIX: Fix spotsbug failure in Kafka examples (apache#8051)
  KAFKA-9447: Add new customized EOS model example (apache#8031)
  KAFKA-8164: Add support for retrying failed (apache#8019)
  HOTFIX: checkstyle for newly added unit test
  KAFKA-9261; Client should handle unavailable leader metadata (apache#7770)
  MINOR: Fix typos introduced in KIP-559 (apache#8042)
  MINOR: Fixing null handilg in ValueAndTimestampSerializer (apache#7679)
  KAFKA-9113: Clean up task management and state management (apache#7997)
  MINOR: fix checkstyle issue in ConsumerConfig.java (apache#8038)
  KAFKA-9491; Increment high watermark after full log truncation (apache#8037)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants