
KAFKA-10295: Wait for connector recovery in test_bounce #9043

Merged
rhauch merged 1 commit into apache:trunk from gharris1727:bounce-recovery on Jul 20, 2020

Conversation

gharris1727 (Contributor) commented Jul 20, 2020

Signed-off-by: Greg Harris <gregh@confluent.io>

This additional waiting is for:

  • The scheduled.rebalance.max.time.ms timeout to expire so that tasks are reassigned
  • The source connector to start and commit offsets at least once beyond its last commit
  • The sink connector to have adequate time to read the messages written by the source

This prevents flaky test failures with messages like "Found sink sequence number greater than any generated sink sequence number for task %d: %d > %d" when test_bounce runs with hard bounces and Incremental Cooperative Rebalancing enabled.
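For illustration, here is a minimal ducktape-style sketch of that waiting logic, not the actual patch; the `committed_count()` and `consumed_count()` accessors are hypothetical stand-ins for the test's real bookkeeping:

```python
import time

from ducktape.utils.util import wait_until

SCHEDULED_REBALANCE_MAX_DELAY_MS = 60000  # assumed value of scheduled.rebalance.max.time.ms

def wait_for_connector_recovery(source, sink):
    # 1. Let scheduled.rebalance.max.time.ms expire so lost tasks are reassigned.
    time.sleep(SCHEDULED_REBALANCE_MAX_DELAY_MS / 1000.0)

    # 2. Wait for the source connector to commit offsets at least once
    #    beyond its last commit.
    last_commit = source.committed_count()  # hypothetical accessor
    wait_until(lambda: source.committed_count() > last_commit,
               timeout_sec=30,
               err_msg="Source connector never committed new offsets after the bounce")

    # 3. Give the sink connector adequate time to read what the source wrote.
    wait_until(lambda: sink.consumed_count() >= source.committed_count(),
               timeout_sec=30,
               err_msg="Sink connector did not catch up to the source after the bounce")
```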

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Signed-off-by: Greg Harris <gregh@confluent.io>
gharris1727 (Contributor, Author)

cc @rhauch for review

rhauch (Contributor) commented Jul 20, 2020

@gharris1727 Thanks for the proposed fix. I've started a system test build for this branch: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/4046/

rhauch (Contributor) left a comment Jul 20, 2020

LGTM. Thanks, @gharris1727!

@rhauch rhauch merged commit f4944ee into apache:trunk Jul 20, 2020
rhauch pushed a commit that referenced this pull request Jul 20, 2020
Signed-off-by: Greg Harris <gregh@confluent.io>
@gharris1727 gharris1727 deleted the bounce-recovery branch July 20, 2020 19:46
splett2 added a commit to confluentinc/kafka that referenced this pull request Jul 20, 2020
* MINOR: Filter out quota configs for ConfigCommand using --bootstrap-server (apache#9030)

Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, David Jacot <djacot@confluent.io>, Ron Dagostino <rdagostino@confluent.io>

* MINOR: Fix flaky system test assertion after static member fencing (apache#9033)

The test case `OffsetValidationTest.test_fencing_static_consumer` fails periodically due to this error:
```
Traceback (most recent call last):
  File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 134, in run
    data = self.run_test()
  File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 192, in run_test
    return self.test_context.function(self.test)
  File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/mark/_mark.py", line 429, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/tests/kafkatest/tests/client/consumer_test.py", line 257, in test_fencing_static_consumer
    assert len(consumer.dead_nodes()) == num_conflict_consumers
AssertionError
```
When a consumer stops, there is some latency between when the shutdown is observed by the service and when the node is added to its set of dead nodes. This patch fixes the problem by allowing some time for the assertion to be satisfied.

Reviewers: Boyang Chen <boyang@confluent.io>
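The shape of such a fix, sketched with ducktape's `wait_until`; the helper name and timeout are illustrative, while `consumer.dead_nodes()` and `num_conflict_consumers` come from the test above:

```python
from ducktape.utils.util import wait_until

def await_fenced_consumers(consumer, num_conflict_consumers, timeout_sec=60):
    # Instead of asserting immediately, poll until the expected number of
    # dead nodes is reported or the (illustrative) timeout elapses.
    wait_until(lambda: len(consumer.dead_nodes()) == num_conflict_consumers,
               timeout_sec=timeout_sec,
               err_msg="Timed out waiting for fenced consumers to be reported as dead")
```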

* MINOR: Improve log4j for per-consumer assignment (apache#8997)

Add a log4j entry summarizing the assignment (previously owned and newly assigned) at the consumer level.

Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Boyang Chen <boyang@confluent.io>

* KAFKA-10223; Use NOT_LEADER_OR_FOLLOWER instead of non-retriable REPLICA_NOT_AVAILABLE for consumers (apache#8979)

Brokers currently return NOT_LEADER_FOR_PARTITION to producers and REPLICA_NOT_AVAILABLE to consumers if a replica is not available on the broker during reassignments. Non-Java clients treat REPLICA_NOT_AVAILABLE as a non-retriable exception, while Java consumers handle this error by explicitly matching the error code even though it is not an InvalidMetadataException. This PR renames NOT_LEADER_FOR_PARTITION to NOT_LEADER_OR_FOLLOWER and uses the same error for producers and consumers. This is compatible with both Java and non-Java clients, since all clients handle this error code (6) as a retriable exception. The PR also makes ReplicaNotAvailableException a subclass of InvalidMetadataException.
  - ALTER_REPLICA_LOG_DIRS continues to return REPLICA_NOT_AVAILABLE. This is retained for compatibility, since the request never returned NOT_LEADER_FOR_PARTITION before.
  - MetadataRequest version 0 also returns REPLICA_NOT_AVAILABLE as the topic-level error code for compatibility. Newer versions filter these out and return Errors.NONE, so this was not changed.
  - Partition responses in MetadataRequest return REPLICA_NOT_AVAILABLE to indicate that one of the replicas is not available. This was not changed, since NOT_LEADER_FOR_PARTITION is not suitable in this case.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>, Bob Barrett <bob.barrett@confluent.io>
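To make the compatibility argument concrete, a toy sketch of the client-side classification; the numeric codes are from the Kafka protocol, but the function is illustrative, not an actual client API:

```python
# Error code 6 was NOT_LEADER_FOR_PARTITION and is reused for
# NOT_LEADER_OR_FOLLOWER, so existing clients already retry on it.
NOT_LEADER_OR_FOLLOWER = 6   # formerly NOT_LEADER_FOR_PARTITION
REPLICA_NOT_AVAILABLE = 9    # treated as non-retriable by some non-Java clients

def is_retriable_metadata_error(error_code):
    # Illustrative classification: after this change, consumers see code 6
    # (retriable everywhere) instead of code 9 during reassignments.
    return error_code == NOT_LEADER_OR_FOLLOWER
```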

* MINOR: Improved code quality for various files. (apache#9037)

Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>

* MINOR: Fixed some resource leaks. (apache#8922)

Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>

* MINOR: Enable broker/client compatibility tests for 2.5.0 release

- Add missing broker/client compatibility tests for 2.5.0 release

Author: Manikumar Reddy <manikumar.reddy@gmail.com>

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>

Closes apache#9041 from omkreddy/compat

* KAFKA-10286: Connect system tests should wait for workers to join group (apache#9040)

Currently, the system tests `connect_distributed_test` and `connect_rest_test` only wait for the REST API to come up.
The startup of the worker includes an asynchronous process for joining the worker group and syncing with other workers.
There are some situations in which this sync takes an unusually long time, and the test continues without all workers up.
This leads to flaky test failures, as worker joins are not given sufficient time to time out and retry unless the test waits explicitly.

This changes the `ConnectDistributedTest` to wait for the Joined group message to be printed to the logs before continuing with tests. I've activated this behavior by default, as it's a superset of the checks that were performed by default before.

This log message is present in every version of DistributedHerder that I could find, in slightly different forms, but always with `Joined group` at the beginning of the log message. This change should be safe to backport to any branch.

Signed-off-by: Greg Harris <gregh@confluent.io>
Author: Greg Harris <gregh@confluent.io>
Reviewer: Randall Hauch <rhauch@gmail.com>
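For illustration, a hedged sketch of that wait, assuming a grep-over-SSH helper; the log path and helper names are assumptions, not the real service internals:

```python
from ducktape.utils.util import wait_until

def log_contains(node, pattern, log_file="/mnt/connect/connect.log"):
    # Grep the worker log over SSH; exit status 0 means the pattern was found.
    # The log path is an assumption, not the actual service configuration.
    return node.account.ssh("grep -q -e '%s' %s" % (pattern, log_file),
                            allow_fail=True) == 0

def wait_for_group_join(worker_nodes, timeout_sec=60):
    # Block until every worker has logged that it joined the Connect group.
    for node in worker_nodes:
        wait_until(lambda: log_contains(node, "Joined group"),
                   timeout_sec=timeout_sec,
                   err_msg="Worker on %s never joined the group" % node.account.hostname)
```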

* KAFKA-10295: Wait for connector recovery in test_bounce (apache#9043)

Signed-off-by: Greg Harris <gregh@confluent.io>

Co-authored-by: Rajini Sivaram <rajinisivaram@googlemail.com>
Co-authored-by: Jason Gustafson <jason@confluent.io>
Co-authored-by: Guozhang Wang <wangguoz@gmail.com>
Co-authored-by: Leonard Ge <62600326+leonardge@users.noreply.github.com>
Co-authored-by: Manikumar Reddy <manikumar.reddy@gmail.com>
Co-authored-by: Greg Harris <gregh@confluent.io>
