
KAFKA-12455: OffsetValidationTest.test_broker_rolling_bounce fail: Raft#10322

Merged
ijuma merged 3 commits into apache:trunk from rondagostino:KAFKA-12455
Mar 16, 2021

Conversation

@rondagostino
Contributor

@rondagostino rondagostino commented Mar 15, 2021

OffsetValidationTest.test_broker_rolling_bounce was failing when used with a Raft-based metadata quorum but succeeding with a ZooKeeper-based quorum. This patch increases the consumers' session timeout to 30 seconds, which fixes the Raft case and also eliminates flakiness that has historically existed in the ZooKeeper case. It also fixes a minor logging bug in RaftReplicaManager.endMetadataChangeDeferral() that was discovered while debugging this issue, and it adds a logging statement in RaftReplicaManager.handleMetadataRecords() when a single metadata batch is applied, mirroring the statement that is emitted when deferred metadata changes are applied.

In the Raft system test case the consumer sometimes received a METADATA response listing just one alive broker, and when that broker rolled the consumer no longer knew about any alive nodes. It had to wait for the broker to return before it could reconnect, and by that time the group coordinator on the second broker had timed out the client and initiated a group rebalance. The test explicitly checks that no rebalances occur, so the test would fail.

It turns out that the ZooKeeper configuration was avoiding rebalances by plain luck: the brokers' metadata caches in the ZooKeeper configuration show one alive broker even more frequently than the Raft configuration does. Tweaking the metadata.max.age.ms value on the consumers makes the ZooKeeper test fail easily, and in fact this system test has historically been flaky for the ZooKeeper configuration.

We can make the test pass either by setting session.timeout.ms=30000 (which is longer than the roll time of any broker) or by increasing the broker count so that the client never sees a METADATA response with a single alive broker and therefore never loses contact with the cluster for an extended period. We already have plenty of system tests with 3+ brokers, so we keep this test at 2 brokers and increase the session timeout.
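The timing argument above can be sketched roughly as follows. The 30-second timeout comes from this patch; the 10-second baseline and the ~20-second roll duration are illustrative assumptions, not values measured in the test:

```python
def rebalance_occurs(unreachable_ms: int, session_timeout_ms: int) -> bool:
    """A group rebalance is triggered once the coordinator has not heard
    from the consumer for longer than its session timeout."""
    return unreachable_ms > session_timeout_ms

# Assumed scenario: the consumer knows only one alive broker, and that
# broker is down for roughly 20 seconds during the rolling bounce.
BROKER_ROLL_MS = 20_000

# With a shorter session timeout (10 s here, for illustration) the
# coordinator on the surviving broker times the client out, so the
# test's no-rebalance assertion fails.
assert rebalance_occurs(BROKER_ROLL_MS, 10_000)

# With this patch's session.timeout.ms=30000 the consumer outlasts the roll.
assert not rebalance_occurs(BROKER_ROLL_MS, 30_000)
```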

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@rondagostino
Contributor Author

This patch needs to be cherry-picked to 2.8.

@rondagostino
Contributor Author

As per an offline conversation, since 2 brokers is a supported cluster size, we would prefer that this system test keep 2 brokers instead of bumping it to 3 -- we have lots of tests that run with 3 brokers already. So I will change the test to use the session.timeout.ms=30000 solution instead.
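As a minimal sketch of the chosen solution (the property name and the 30000 ms value come from this PR; how the system test actually wires the override into the consumer is not shown here, so the dict below is only illustrative):

```python
# The single consumer override this fix requires: a session timeout
# longer than the time any one broker is down during the rolling bounce.
consumer_overrides = {"session.timeout.ms": 30000}

# Sanity check: the override comfortably exceeds a 10-second timeout
# (used here as an illustrative baseline that a broker roll can outlast).
assert consumer_overrides["session.timeout.ms"] > 10_000
```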

Comment thread on tests/kafkatest/tests/client/consumer_test.py
Member

@ijuma ijuma left a comment


LGTM, thanks. Can we please update the PR description to summarize things first and include the details later?

@rondagostino
Contributor Author

@ijuma Thanks for the suggestions -- all set I think.

@ijuma ijuma merged commit b96fc78 into apache:trunk Mar 16, 2021
ijuma pushed a commit that referenced this pull request Mar 16, 2021
…ure with Raft (#10322)

Reviewers: Ismael Juma <ismael@juma.me.uk>
ijuma added a commit to confluentinc/kafka that referenced this pull request Mar 17, 2021
Conflicts:
* Jenkinsfile: `install` -> `publishToMavenLocal`, drop ARM build and
other changes that don't make sense for Confluent's version of
`Jenkinsfile`.
* build.gradle: keep Confluent changes for automatic skipping signing
for specific version patterns (upstream only does it if the version ends
with `SNAPSHOT`).

Commits:
* apache-github/trunk: (59 commits)
  MINOR: Remove redundant allows in import-control.xml (apache#10339)
  MINOR: remove some specifying types in tool command (apache#10329)
  KAFKA-12455: Fix OffsetValidationTest.test_broker_rolling_bounce failure with Raft (apache#10322)
  MINOR: Add toString to various Kafka Metrics classes (apache#10330)
  KAFKA-12330; FetchSessionCache may cause starvation for partitions when FetchResponse is full (apache#10318)
  KAFKA-12427: Don't update connection idle time for muted connections (apache#10267)
  MINOR; Various code cleanups (apache#10319)
  HOTFIX: timeout issue in removeStreamThread() (apache#10321)
  revert stream logging level back to ERROR (apache#10320)
  KAFKA-12352: Make sure all rejoin group and reset state has a reason (apache#10232)
  KAFKA-10348: Share client channel between forwarding and auto creation manager (apache#10135)
  MINOR: Update year in NOTICE (apache#10308)
  KAFKA-12398: Fix flaky test `ConsumerBounceTest.testClose` (apache#10243)
  MINOR: Remove redundant inheritance from FilteringJmxReporter #onMetricRemoved (apache#10303)
  KAFKA-12462: proceed with task revocation in case of thread in PENDING_SHUTDOWN (apache#10311)
  KAFKA-12460; Do not allow raft truncation below high watermark (apache#10310)
  MINOR: Log project, gradle, java and scala versions at the start of the build (apache#10307)
  KAFKA-10357: Add missing repartition topic validation (apache#10305)
  MINOR: Improve error message in MirrorConnectorsIntegrationBaseTest (apache#10268)
  MINOR: Add missing unit tests for Mirror Connect (apache#10192)
  ...