KAFKA-8262, KAFKA-8263: Fix flaky test `MetricsIntegrationTest` by cadonna · Pull Request #6922 · apache/kafka

cadonna · 2019-06-12T11:48:45Z

Timeout occurred due to initial slow rebalancing.
Added code to wait until KafkaStreams instance is in state RUNNING to check registration of
metrics and in state NOT_RUNNING to check deregistration of metrics.
I removed all other wait conditions, because they are not needed if KafkaStreams instance is in
the right state.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

- Timeout occurred due to initial slow rebalancing. - Added code to wait until `KafkaStreams` instance is in state RUNNING to check registration of metrics and in state NOT_RUNNING to check deregistration of metrics. - I removed all other wait conditions, because they are not needed if `KafkaStreams` instance is in the right state.

cadonna · 2019-06-12T11:50:21Z

Call for quick review: @mjsax @abbccdda @ableegoldman @vvcephei @bbejeck @guozhangwang

guozhangwang

The refactoring lgtm. I just have one question if there are still any lazily registered metrics left, and if yes it's probably not safe to always assume all metrics are there after transit to RUNNING.

guozhangwang · 2019-06-12T18:21:01Z

-            return listMetricAfterClosingApp.size() == 0;
-        }, 10000, "de-registration of metrics");
+    private void checkMetricDeregistration() {
+        final List<Metric> listMetricAfterClosingApp = new ArrayList<Metric>(kafkaStreams.metrics().values()).stream()


I remember some of the metrics were lazily registered, i.e. they would only be registered if the corresponding action is called for the first time. Have we refactored it to always register all metrics up starting the task / process-node etc? Otherwise waiting for the stream state to transit to RUNNING may not guarantee all metrics should be already registered.

Good to know! Will check that.

@guozhangwang, I think it is fine just to wait until the Kafka Streams client is in state RUNNING. The test -- now and before the changes in this PR -- does not produce any records to the input topic. Hence, there is no other later event than the change to state RUNNING that could trigger a lazy metric registration. Does this make sense or am I missing something?

I also checked where the metrics are registered and they are registered either in constructors or init-methods. As far as I saw in the code, both types of methods are called during initialisation of the Kafka client, i.e., before the state change to RUNNING.

Thanks for double checking this! Then it lgtm.

guozhangwang · 2019-06-12T21:10:46Z

Merged to trunk.

…he#6922) - Timeout occurred due to initial slow rebalancing. - Added code to wait until `KafkaStreams` instance is in state RUNNING to check registration of metrics and in state NOT_RUNNING to check deregistration of metrics. - I removed all other wait conditions, because they are not needed if `KafkaStreams` instance is in the right state. Reviewers: Guozhang Wang <wangguoz@gmail.com>

… (#7457) - Timeout occurred due to initial slow rebalancing. - Added code to wait until `KafkaStreams` instance is in state RUNNING to check registration of metrics and in state NOT_RUNNING to check deregistration of metrics. - I removed all other wait conditions, because they are not needed if `KafkaStreams` instance is in the right state. Reviewers: Guozhang Wang <wangguoz@gmail.com>

* KAFKA-8649: send latest commonly supported version in assignment (apache#7425) Reviewers: Matthias J. Sax <matthias@confluent.io> * KAFKA-8262, KAFKA-8263: Fix flaky test `MetricsIntegrationTest` (apache#6922) (apache#7457) - Timeout occurred due to initial slow rebalancing. - Added code to wait until `KafkaStreams` instance is in state RUNNING to check registration of metrics and in state NOT_RUNNING to check deregistration of metrics. - I removed all other wait conditions, because they are not needed if `KafkaStreams` instance is in the right state. Reviewers: Guozhang Wang <wangguoz@gmail.com> * MINOR: clarify wording around fault-tolerant state stores (apache#7510) Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <guozhang@confluent.io> * MINOR: Port changes from trunk for test stability to 2.3 branch (apache#7424) Reviewer: Matthias J. Sax <matthias@confluent.io> * KAFKA-9014: Fix AssertionError when SourceTask.poll returns an empty list (apache#7491) Author: Konstantine Karantasis <konstantine@confluent.io> Reviewer: Randall Hauch <rhauch@gmail.com> * KAFKA-8945/KAFKA-8947: Fix bugs in Connect REST extension API (apache#7392) Fix bug in Connect REST extension API caused by invalid constructor parameter validation, and update integration test to play nicely with Jenkins Fix instantiation of TaskState objects by Connect framework. Author: Chris Egerton <chrise@confluent.io> Reviewers: Magesh Nandakumar <mageshn@confluent.io>, Randall Hauch <rhauch@gmail.com> * KAFKA-8340, KAFKA-8819: Use PluginClassLoader while statically initializing plugins (apache#7315) Added plugin isolation unit tests for various scenarios, with a `TestPlugins` class that compiles and builds multiple test plugins without them being on the classpath and verifies that the Plugins and DelegatingClassLoader behave properly. These initially failed for several cases, but now pass since the issues have been fixed. KAFKA-8340 and KAFKA-8819 are closely related, and this fix corrects the problems reported in both issues. Author: Greg Harris <gregh@confluent.io> Reviewers: Chris Egerton <chrise@confluent.io>, Magesh Nandakumar <mageshn@confluent.io>, Konstantine Karantasis <konstantine@confluent.io>, Randall Hauch <rhauch@gmail.com> * MINOR: log reason for fatal error in locking state dir (apache#7534) Reviewers: Bill Bejeck <bill@confluent.io>, Matthias J. Sax <matthias@confluent.io> * KAKFA-8950: Fix KafkaConsumer Fetcher breaking on concurrent disconnect (apache#7511) The KafkaConsumer Fetcher can sometimes get into an invalid state where it believes that there are ongoing fetch requests, but in fact there are none. This may be caused by the heartbeat thread concurrently handling a disconnection event just after the fetcher thread submits a request which would cause the Fetcher to enter an invalid state where it believes it has ongoing requests to the disconnected node but in fact it does not. This is due to a thread safety issue in the Fetcher where it was possible for the ordering of the modifications to the nodesWithPendingFetchRequests to be incorrect - the Fetcher was adding it after the listener had already been invoked, which would mean that pending node never gets removed again. This PR addresses that thread safety issue by ensuring that the pending node is added to the nodesWithPendingFetchRequests before the listener is added to the future, ensuring the finally block is called after the node is added. Reviewers: Tom Lee, Jason Gustafson <jason@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com> * Fix bug in AssignmentInfo#encode and add additional logging (apache#7545) Same as apache#7537 but targeted at 2.3 for cherry-pick Reviewers: Bill Bejeck <bbejeck@gmail.com> * Bump version to 2.3.1 * Update versions to 2.3.2-SNAPSHOT * HOTFIX: fix bug in VP test where it greps for the wrong log message (apache#7643) Reviewers: Guozhang Wang <wangguoz@gmail.com>

guozhangwang reviewed Jun 12, 2019

View reviewed changes

cadonna mentioned this pull request Jun 12, 2019

MINOR: Refactor MetricsIntegrationTest #6924

Closed

3 tasks

guozhangwang merged commit df9ea61 into apache:trunk Jun 12, 2019

cadonna deleted the AK8262-Flaky_MetricsIntegrationTest branch October 21, 2019 10:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KAFKA-8262, KAFKA-8263: Fix flaky test `MetricsIntegrationTest`#6922

KAFKA-8262, KAFKA-8263: Fix flaky test `MetricsIntegrationTest`#6922
guozhangwang merged 1 commit intoapache:trunkfrom
cadonna:AK8262-Flaky_MetricsIntegrationTest

cadonna commented Jun 12, 2019 •

edited

Loading

Uh oh!

cadonna commented Jun 12, 2019

Uh oh!

guozhangwang left a comment

Uh oh!

guozhangwang Jun 12, 2019

Uh oh!

cadonna Jun 12, 2019

Uh oh!

cadonna Jun 12, 2019

Uh oh!

guozhangwang Jun 12, 2019

Uh oh!

guozhangwang commented Jun 12, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cadonna commented Jun 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Committer Checklist (excluded from commit message)

Uh oh!

cadonna commented Jun 12, 2019

Uh oh!

guozhangwang left a comment

Choose a reason for hiding this comment

Uh oh!

guozhangwang Jun 12, 2019

Choose a reason for hiding this comment

Uh oh!

cadonna Jun 12, 2019

Choose a reason for hiding this comment

Uh oh!

cadonna Jun 12, 2019

Choose a reason for hiding this comment

Uh oh!

guozhangwang Jun 12, 2019

Choose a reason for hiding this comment

Uh oh!

guozhangwang commented Jun 12, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cadonna commented Jun 12, 2019 •

edited

Loading