
KAFKA-10286: Connect system tests should wait for workers to join group#9040

Merged
rhauch merged 1 commit into apache:trunk from gharris1727:wait-for-join on Jul 20, 2020
Conversation

@gharris1727
Contributor

Signed-off-by: Greg Harris gregh@confluent.io

Currently, the system tests connect_distributed_test and connect_rest_test only wait for the REST API to come up.
The startup of the worker includes an asynchronous process for joining the worker group and syncing with other workers.
In some situations this sync takes an unusually long time, and the test continues without all workers up.
This leads to flaky test failures, as worker joins are not given sufficient time to time out and retry unless the test waits explicitly.

This changes the ConnectDistributedTest to wait for the `Joined group` message to be printed to the logs before continuing with tests. I've activated this behavior by default, as it's a superset of the checks that were performed by default before.

This log message is present in every version of DistributedHerder that I could find, in slightly different forms, but always with `Joined group` at the beginning of the log message. This change should be safe to backport to any branch.
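The wait this PR adds can be sketched as a simple poll loop. This is a minimal stand-in rather than the actual test code: `read_log` is a hypothetical callable returning the worker's current log text (the real ducktape test greps the log file on the worker node):

```python
import time

def wait_for_join(read_log, timeout_sec=60.0, poll_sec=1.0):
    """Poll a worker's log until the herder reports 'Joined group'.

    read_log is any zero-argument callable returning the current log text;
    in the real system test this would grep the log file on the worker node.
    """
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if "Joined group" in read_log():
            return True
        time.sleep(poll_sec)
    raise TimeoutError("worker did not log 'Joined group' within %.0fs" % timeout_sec)
```

Because every DistributedHerder version logs a line beginning with `Joined group`, matching on that prefix keeps the check valid across branches.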

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Signed-off-by: Greg Harris <gregh@confluent.io>
@gharris1727
Contributor Author

@aakashnshah can you please take a look at this?

@gharris1727
Contributor Author

@rhauch cc for review, thanks!

@rhauch
Contributor

rhauch commented Jul 20, 2020

@gharris1727 Thanks for the proposed fix. I've started a system test build for this branch: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/4047/

@rhauch rhauch left a comment
Contributor

LGTM. Thanks, @gharris1727!

@rhauch rhauch merged commit 5a2a7c6 into apache:trunk Jul 20, 2020
rhauch pushed a commit that referenced this pull request Jul 20, 2020
…up (#9040)

Currently, the system tests `connect_distributed_test` and `connect_rest_test` only wait for the REST API to come up.
The startup of the worker includes an asynchronous process for joining the worker group and syncing with other workers.
In some situations this sync takes an unusually long time, and the test continues without all workers up.
This leads to flaky test failures, as worker joins are not given sufficient time to time out and retry unless the test waits explicitly.

This changes the `ConnectDistributedTest` to wait for the `Joined group` message to be printed to the logs before continuing with tests. I've activated this behavior by default, as it's a superset of the checks that were performed by default before.

This log message is present in every version of DistributedHerder that I could find, in slightly different forms, but always with `Joined group` at the beginning of the log message. This change should be safe to backport to any branch.

Signed-off-by: Greg Harris <gregh@confluent.io>
Author: Greg Harris <gregh@confluent.io>
Reviewer: Randall Hauch <rhauch@gmail.com>
@kkonstantine
Contributor

Thanks for the fix @gharris1727. It's good to stabilize these tests; they've been relatively flaky for some time.

W/r/t the actual fix, I have to note that we've been moving away from assertions that depend on log messages for some time. Yet if that's the only way to stabilize these tests, maybe we can maintain an exception here.

Having said that, I'm curious why a stable group with all the members joined is required for this test to succeed. I'd imagine that this would not be a requirement (a cluster with 1 worker is as good as a cluster with 3 workers, modulo capacity requirements, which these tests should not assert on). There might be value in retaining our expectations while the worker group is forming, because assuming things work only with a stable full group might make us miss bugs during phases of failure and recovery.

If your investigation revealed some more specific reasons why these tests were flaky while the group was forming, it'd be good to know.

@gharris1727
Contributor Author

@kkonstantine In my investigation for this fix, I noticed that the long join was happening in the first worker to join the group, as it was creating the connect-offsets topic. The broker logs indicated that the topic was created in a timely manner, and that it was visible to the other workers that joined afterwards. The first worker remained waiting for the topic creation result after the other two workers had been started, causing the test to fail.

I could only pick out one suspicious thing about the create topic operation on the broker, as I am not very familiar with broker logs. For the successful create topic operations, these unblocked messages appeared:

```
[2020-07-05 09:47:32,423] DEBUG Request key TopicKey(connect-status) unblocked 1 topic operations (kafka.server.DelayedOperationPurgatory)
[2020-07-05 09:47:32,423] DEBUG [Admin Manager on Broker 1]: Request key connect-status unblocked 1 topic requests. (kafka.server.AdminManager)
```

These did not appear for the excessively long create topic operation. Reading the log message literally, it's possible that the operation is either never entering the DelayedOperationPurgatory, or never released from it, and thus the operation times out on the client side without the worker finding out that the request was filled.

I think this is benign in our case, and a retry will be able to recover the test by discovering the topic has already been created.
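The recovery path described here boils down to: if a create-topic request times out on the client side, a retry that discovers the topic already exists should count as success. A minimal sketch of that idea, with hypothetical `create_topic` and `topic_exists` callables standing in for the worker's admin-client calls:

```python
def ensure_topic(create_topic, topic_exists, name, retries=3):
    """Create a topic, tolerating a client-side timeout when the broker
    actually completed the creation.

    create_topic and topic_exists are caller-supplied callables; a worker
    would back these with its admin client.
    """
    for _ in range(retries):
        try:
            create_topic(name)
            return True
        except TimeoutError:
            # The create may have succeeded server-side even though the
            # client timed out; checking again discovers the topic.
            if topic_exists(name):
                return True
    return False
```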

Logs for that run with the long join

@gharris1727 gharris1727 deleted the wait-for-join branch July 20, 2020 19:45
splett2 added a commit to confluentinc/kafka that referenced this pull request Jul 20, 2020
* MINOR: Filter out quota configs for ConfigCommand using --bootstrap-server (apache#9030)

Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, David Jacot <djacot@confluent.io>, Ron Dagostino <rdagostino@confluent.io>

* MINOR: Fix flaky system test assertion after static member fencing (apache#9033)

The test case `OffsetValidationTest.test_fencing_static_consumer` fails periodically due to this error:
```
Traceback (most recent call last):
  File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 134, in run
    data = self.run_test()
  File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 192, in run_test
    return self.test_context.function(self.test)
  File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/mark/_mark.py", line 429, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/tests/kafkatest/tests/client/consumer_test.py", line 257, in test_fencing_static_consumer
    assert len(consumer.dead_nodes()) == num_conflict_consumers
AssertionError
```
When a consumer stops, there is some latency between when the shutdown is observed by the service and when the node is added to the dead nodes. This patch fixes the problem by giving some time for the assertion to be satisfied.

Reviewers: Boyang Chen <boyang@confluent.io>
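The pattern behind that fix (and behind this PR) is replacing a one-shot assertion with a polled condition. A minimal stand-in for ducktape's `wait_until` helper, shown here only to illustrate the shape of the fix:

```python
import time

def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg="condition not met"):
    """Retry a zero-argument predicate until it returns truthy or the
    timeout elapses; raise if it never does."""
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if condition():
            return
        time.sleep(backoff_sec)
    raise AssertionError(err_msg)

# The one-shot assertion from the traceback above,
#     assert len(consumer.dead_nodes()) == num_conflict_consumers
# becomes a polled check (consumer and num_conflict_consumers come
# from the surrounding test):
#     wait_until(lambda: len(consumer.dead_nodes()) == num_conflict_consumers,
#                timeout_sec=60, backoff_sec=1,
#                err_msg="dead nodes never matched fenced consumers")
```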

* MINOR: Improve log4j for per-consumer assignment (apache#8997)

Add log4j entry summarizing the assignment (previous owned and assigned) at the consumer level.

Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Boyang Chen <boyang@confluent.io>

* KAFKA-10223; Use NOT_LEADER_OR_FOLLOWER instead of non-retriable REPLICA_NOT_AVAILABLE for consumers (apache#8979)

Brokers currently return NOT_LEADER_FOR_PARTITION to producers and REPLICA_NOT_AVAILABLE to consumers if a replica is not available on the broker during reassignments. Non-Java clients treat REPLICA_NOT_AVAILABLE as a non-retriable exception, while Java consumers handle this error by explicitly matching the error code even though it is not an InvalidMetadataException. This PR renames NOT_LEADER_FOR_PARTITION to NOT_LEADER_OR_FOLLOWER and uses the same error for producers and consumers. This is compatible with both Java and non-Java clients, since all clients handle this error code (6) as a retriable exception. The PR also makes ReplicaNotAvailableException a subclass of InvalidMetadataException.
- ALTER_REPLICA_LOG_DIRS continues to return REPLICA_NOT_AVAILABLE. Retained this for compatibility since this request never returned NOT_LEADER_FOR_PARTITION earlier.
- MetadataRequest version 0 also returns REPLICA_NOT_AVAILABLE as the topic-level error code for compatibility. Newer versions filter these out and return Errors.NONE, so didn't change this.
- Partition responses in MetadataRequest return REPLICA_NOT_AVAILABLE to indicate that one of the replicas is not available. Did not change this since NOT_LEADER_FOR_PARTITION is not suitable in this case.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>, Bob Barrett <bob.barrett@confluent.io>

* MINOR: Improved code quality for various files. (apache#9037)

Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>

* MINOR: Fixed some resource leaks. (apache#8922)

Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>

* MINOR: Enable broker/client compatibility tests for 2.5.0 release

- Add missing broker/client compatibility tests for 2.5.0 release

Author: Manikumar Reddy <manikumar.reddy@gmail.com>

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>

Closes apache#9041 from omkreddy/compat

* KAFKA-10286: Connect system tests should wait for workers to join group (apache#9040)

* KAFKA-10295: Wait for connector recovery in test_bounce (apache#9043)

Signed-off-by: Greg Harris <gregh@confluent.io>

Co-authored-by: Rajini Sivaram <rajinisivaram@googlemail.com>
Co-authored-by: Jason Gustafson <jason@confluent.io>
Co-authored-by: Guozhang Wang <wangguoz@gmail.com>
Co-authored-by: Leonard Ge <62600326+leonardge@users.noreply.github.com>
Co-authored-by: Manikumar Reddy <manikumar.reddy@gmail.com>
Co-authored-by: Greg Harris <gregh@confluent.io>
@kkonstantine
Contributor

Given this situation during startup, it makes sense. Thanks for the additional details @gharris1727
