KAFKA-7243: Add unit integration tests to validate metrics in Kafka Streams#6080

Merged
guozhangwang merged 28 commits into apache:trunk from khaireddine120:KAFKA-7243
Mar 22, 2019

Conversation

@khaireddine120
Contributor

@khaireddine120 khaireddine120 commented Dec 31, 2018

The goal of this task is to implement an integration test for the Kafka Streams metrics.
We have to check 2 things:
1. After the streams application is started, all metrics from the different levels (thread, task, processor, store, cache) are correctly created and display recorded values.
2. When the streams application is shut down, all metrics are correctly de-registered and removed.

@khaireddine120
Contributor Author

khaireddine120 commented Jan 1, 2019

Hi @mjsax @guozhangwang, can you check the pull request?

@guozhangwang guozhangwang changed the title Kafka-7243: Add unit integration tests to validate metrics in Kafka Streams KAFKA-7243: Add unit integration tests to validate metrics in Kafka Streams Jan 3, 2019
@guozhangwang
Contributor

@vvcephei could you take a look when you have time?

Contributor

@vvcephei vvcephei left a comment


Hi @khaireddine120 ,

Thanks for this PR.

When you get the chance, please fill out the PR description, as it will become the commit message when it's merged.

Looking at the ticket, I think the intent was actually to enumerate the expected metrics and make sure they are all actually registered.

I think your strategy to make sure that we de-register all metrics on shutdown works. But after startup, we need to verify each of the metrics we have documented: https://kafka.apache.org/documentation/#kafka_streams_monitoring

Thanks,
-John

@khaireddine120
Contributor Author

Hi @vvcephei, I have 3 questions regarding this test:

  1. Do we have to execute a producer?
  2. Do we have to check the 5 store metric types? I tested only 3 (inMemory, inMemoryLru and persistent key/value); there are also window and session stores, but I didn't figure out how to integrate them in the test.
  3. I notice that there are also metrics other than thread, task, processor, store and cache.
    Do we have to test them as well?
    Thanks

@khaireddine120
Contributor Author

Hi @vvcephei, any update?

@vvcephei
Contributor

vvcephei commented Feb 18, 2019

Hi @khaireddine120 ,

Sorry for the delay. Here are my thoughts...

  1. If you're asking about executing a producer so that we can verify producer metrics, I don't think Streams needs to make guarantees about which producer metrics are present. On the other hand, if we need to process some data in order to verify Streams metrics, then we should do so.
  2. Yes, but we can add the additional metrics coverage by adding more test methods; we don't have to wedge everything into one test. Testing the metrics for window and session stores should be as simple as having some materialized windowed aggregations in the topology under test.
  3. To answer, I'd have to know exactly which metrics you have in mind. Do note, however, that some metrics are already verified in existing tests. Of course, it might be nice to make sure that every documented metric is at least covered by this test.

I looked at your updates to the test, and it looks good to me.

Thanks!
-John

@khaireddine120
Contributor Author

khaireddine120 commented Mar 5, 2019

Hi @vvcephei @guozhangwang,
Can you recheck the pull request?
Thanks

Contributor

@guozhangwang guozhangwang left a comment


Thanks for the PR @khaireddine120, I've made a pass on it. LGTM overall.

Thread.sleep(10000);

final List<Metric> listMetricAfterStartingApp =
    new ArrayList<Metric>(kafkaStreams.metrics().values()).stream()
        .filter(m -> m.metricName().group().contains("stream"))
        .collect(Collectors.toList());
Assert.assertTrue(listMetricAfterStartingApp.size() > 0);
Contributor


Just checking that there are some metrics whose group contains "stream" is very loose. I'd suggest we:

  1. have a simple topology that reads from topic1, accesses store1, and then writes to a sink topic2.
  2. get the list of metrics from kafkaStreams.metrics(), which contains producer, consumer, admin and Streams' own metrics (thread, task, processor node, store, cache), and check that they all exist with the exact number of metrics (once the PR for KIP-414 is in, it should be easy to get the corresponding client id for the different modules).
  3. close the app, and wait for the metrics to all be removed, with a timeout (see comment below).
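The suggested topology might be sketched roughly like this (the names topic1, store1 and topic2 come from the list above; the serdes and the use of a count aggregation are placeholder assumptions, not the PR's actual code):

```java
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("topic1", Consumed.with(Serdes.String(), Serdes.String()))
       .groupByKey()
       .count(Materialized.as("store1")) // materialized store => store-level metrics get registered
       .toStream()
       .to("topic2", Produced.with(Serdes.String(), Serdes.Long()));
```

With such a topology running, thread, task, processor-node, store and cache metrics should all be present in kafkaStreams.metrics().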


closeApplication();

Thread.sleep(10000);
Contributor


We avoid time-based operations in integration tests, since they usually lead to flaky tests.

Consider using TestUtils.waitForCondition() with a timeout.
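For illustration, the polling pattern behaves roughly like this (a simplified, dependency-free stand-in for Kafka's TestUtils.waitForCondition, not the real implementation):

```java
import java.util.function.BooleanSupplier;

public class WaitForCondition {
    // Re-check the condition periodically until it holds or the timeout expires,
    // instead of sleeping a fixed 10 seconds and hoping the metrics are registered.
    public static void waitForCondition(final BooleanSupplier condition,
                                        final long timeoutMs,
                                        final String failureMessage) throws InterruptedException {
        final long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError(failureMessage);
            }
            Thread.sleep(Math.min(100, timeoutMs)); // brief backoff between checks
        }
    }

    public static void main(final String[] args) throws InterruptedException {
        final long start = System.currentTimeMillis();
        // The condition becomes true after ~300 ms; the wait returns as soon as it does.
        waitForCondition(() -> System.currentTimeMillis() - start >= 300, 5000,
                "condition never became true");
        System.out.println("condition met");
    }
}
```

The test passes as soon as the condition holds, and fails with a clear message only after the full timeout is exhausted.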

}

@Test
public void testStreamMetricOfWindowStore() throws Exception {
Contributor


I was originally thinking about getting the list of all metrics from kafkaStreams.metrics(), which contains producer, consumer, admin and Streams' own metrics (thread, task, processor node, store, cache), and checking that they all exist with the exact number of metrics (once the PR for KIP-414 is in, it should be easy to get the corresponding client id for the different modules).

But after reading @vvcephei's comment, I think I'm convinced that we can save getting the non-Streams embedded clients' metrics for later, and also that the actual metric-name validation may be better placed outside the Streams metrics test, so we'd probably only check that the corresponding groups with clientIds exist rather than checking that each metric exists. So I'm fine with the current scope of this PR.

kafkaStreams = new KafkaStreams(builder.build(), streamsConfiguration);
kafkaStreams.start();

Thread.sleep(10000);
Contributor


We avoid time-based operations in integration tests, since they usually lead to flaky tests.

Consider using TestUtils.waitForCondition() with a timeout.

Contributor Author

@khaireddine120 khaireddine120 Mar 7, 2019


@guozhangwang the issue is that I have no condition to be sure that the metrics have been registered before beginning the test. I tested with 5 seconds (not all metrics were registered), then with 10 seconds, and the test never fails. Any suggestions?

Contributor

@guozhangwang guozhangwang Mar 7, 2019


What I was suggesting is that we can 1) remove the sleep in the startApplication function, and 2) in the test itself, use TestUtils.waitForCondition() after startApplication for each testXX check, such that if it is not yet satisfied, instead of failing the test immediately, it will back off a bit and re-execute the check to see if it is good now -- so we run the check continuously rather than as a one-time shot.

Of course, this requires you to replace the assertions in testMetricByName and instead return booleans, plus propagate the error message bottom-up, so that it shows exactly which metric name was not found when we've exhausted the wait time and finally decide to fail (otherwise it will just say "failed to meet condition", and the internal information, like which test function's which metric name caused it, will be lost).

Does that make sense?
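The refactoring described above might be sketched like this (checkMetricByName and the metric names here are simplified placeholders, not the PR's actual code):

```java
import java.util.Arrays;
import java.util.List;

public class MetricCheckSketch {
    // Instead of asserting inside the check, return a boolean and record what
    // failed, so a retry loop can keep polling and still report the exact
    // missing metric once it finally gives up.
    static boolean checkMetricByName(final List<String> registered,
                                     final String name,
                                     final int expectedCount,
                                     final StringBuilder errorMessage) {
        final long count = registered.stream().filter(name::equals).count();
        if (count != expectedCount) {
            errorMessage.append("metric '").append(name).append("': expected ")
                        .append(expectedCount).append(" but found ").append(count).append("; ");
            return false;
        }
        return true;
    }

    public static void main(final String[] args) {
        final List<String> metrics = Arrays.asList("put-latency-avg", "put-latency-avg");
        final StringBuilder errorMessage = new StringBuilder();
        final boolean ok = checkMetricByName(metrics, "put-latency-avg", 2, errorMessage)
                && checkMetricByName(metrics, "fetch-latency-avg", 1, errorMessage);
        System.out.println(ok ? "all checks passed" : errorMessage.toString().trim());
    }
}
```

Each failed check leaves a precise message behind, which the surrounding waitForCondition can surface when the timeout expires.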

Contributor Author


Yes, thank you.

@khaireddine120
Contributor Author

Hi @guozhangwang, can you confirm the fix?
BTW, I forgot the merge process; will someone from the PMC take it in charge?
Thanks

@guozhangwang
Contributor

Don't worry, I will squash / merge when needed.

Contributor

@guozhangwang guozhangwang left a comment


Made another pass, and left a couple minor comments.

Assert.assertNotNull("Metric:'" + m.metricName() + "' must be not null", m.metricValue());
}
} catch (final Throwable e) {
throw e;
Contributor


Since in the caller we will capture the throwable and swallow it by returning false, wouldn't the actual error message be lost?

Contributor Author


I put a logger, but checkstyle refused it. Should I find a way to log the error, or ignore it?

Contributor


@khaireddine120 I thought about it again, and now I realize that adding a logger is not ideal: the check may fail a couple of times within the timeout before it succeeds, and each assert failure would still log a record before the eventual success.

So how about doing this?

  1. Create an error string variable (initialized as null) which is passed to the testXX function, which in turn passes it to testMetricByName as an additional parameter. Inside testMetricByName, replace the assertions with the size / not-null checks directly; if a check fails, modify the passed-in string reference with the necessary information, like which metric name has an unexpected count, or which name is null.

  2. This string can then be passed as the second parameter of waitForCondition, because that parameter is actually a Supplier<String> which is evaluated lazily. So if the condition is indeed not met, the evaluation will take whatever the current string reference holds to construct the final message.

WDYT?
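A small sketch of the lazy-evaluation point (the waitForCondition here is a simplified stand-in, not Kafka's TestUtils; it retries a fixed number of times instead of against a clock):

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class LazyMessageSketch {
    static void waitForCondition(final BooleanSupplier condition,
                                 final int maxAttempts,
                                 final Supplier<String> failureMessage) {
        for (int i = 0; i < maxAttempts; i++) {
            if (condition.getAsBoolean()) {
                return; // success path: failureMessage.get() is never called
            }
        }
        throw new AssertionError(failureMessage.get()); // only evaluated on final failure
    }

    public static void main(final String[] args) {
        // Success case: the message supplier is never evaluated.
        waitForCondition(() -> true, 3,
                () -> { throw new IllegalStateException("should not run"); });

        // Failure case: the supplier sees whatever the checks last wrote.
        final StringBuilder errorMessage = new StringBuilder();
        errorMessage.append("put-latency-avg not found");
        try {
            waitForCondition(() -> false, 3, errorMessage::toString);
        } catch (final AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the supplier is only invoked on the final failure, the reported message reflects the latest state of the shared error buffer.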

Contributor Author


It makes sense, thanks.

for (final Metric m : metrics) {
Assert.assertNotNull("Metric:'" + m.metricName() + "' must be not null", m.metricValue());
}
} catch (final Throwable e) {
Contributor


Do we need to capture-and-rethrow here? Since the caller will always capture and swallow, this seems unnecessary to me.

@guozhangwang
Contributor

@khaireddine120 Please ping me whenever it is ready for review again.

@khaireddine120
Contributor Author

khaireddine120 commented Mar 13, 2019 via email

@khaireddine120
Contributor Author

khaireddine120 commented Mar 13, 2019

Hi @guozhangwang, can you take a look at the last update?
Thanks.

@guozhangwang
Contributor

Hmm.. the current approach will still print multiple error messages from previous runs, right? For example, say that in waitForCondition you first fail on testing metricA, then back off a bit, retry, and fail on testing metricB, then back off, retry, and fail on testing metricC. Your error message will print as

error message for metricA; error message for metricB; error message for metricC;

while what we really want is just

error message for metricC;

Does that make sense?

@khaireddine120
Contributor Author

khaireddine120 commented Mar 13, 2019

I will recheck this case

@guozhangwang
Contributor

Ah, I see. Yeah, errorMessage.setLength(0); should work.
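A minimal sketch of how the reset keeps only the last attempt's message (the retry loop is simulated; this is not the PR's actual code):

```java
public class ResetBetweenRetries {
    public static void main(final String[] args) {
        final StringBuilder errorMessage = new StringBuilder();
        // Simulated retries: each attempt clears the buffer first, so when the
        // wait finally gives up, only the *last* attempt's failure (metricC) is
        // reported, not the stale messages from metricA and metricB.
        final String[] failingMetricPerAttempt = {"metricA", "metricB", "metricC"};
        for (final String metric : failingMetricPerAttempt) {
            errorMessage.setLength(0); // discard messages from the previous attempt
            errorMessage.append("error message for ").append(metric).append(";");
        }
        System.out.println(errorMessage);
    }
}
```

Calling setLength(0) at the start of each check is what prevents the concatenated "metricA; metricB; metricC" output described above.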

@guozhangwang
Contributor

retest this please

@khaireddine120
Contributor Author

retest this please

@guozhangwang
Contributor

LGTM. @vvcephei could you take another look as well?

@guozhangwang
Contributor

retest this please

Contributor

@vvcephei vvcephei left a comment


Hey @khaireddine120 , I just have a couple of small remarks...

streamsConfiguration.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
}

@Before
Contributor


Should this be @After?

Contributor Author


Yes, good catch :)

.collect(Collectors.toList());
testMetricByName(listMetricStore, PUT_LATENCY_AVG, 2);
testMetricByName(listMetricStore, PUT_LATENCY_MAX, 2);
testMetricByName(listMetricStore, PUT_IF_ABSENT_LATENCY_AVG, 0);
Contributor


Do I read this correctly: it's verifying that we have 0 metrics registered for PUT_IF_ABSENT_LATENCY_AVG?

Contributor Author


That's what I found after starting the app. I don't know the required setup to get this type of metric :)

Contributor


For window / session stores, there's no putIfAbsent function and hence no metrics would be registered.

Contributor Author


Should I remove it, or leave the test expecting 0?

}
}

private boolean testStoreMetricByType(final String storeType, final StringBuilder errorMessage) {
Contributor


nit: rename to testStoreMetricKeyValueByType

Contributor

@guozhangwang guozhangwang left a comment


@khaireddine120 could you address the comments and ping me again?

@khaireddine120
Contributor Author

retest this please

@khaireddine120
Contributor Author

Hi @guozhangwang, can you recheck the pull request?

Contributor

@guozhangwang guozhangwang left a comment


Thanks @khaireddine120 !

@guozhangwang guozhangwang merged commit 3124b07 into apache:trunk Mar 22, 2019
@khaireddine120
Contributor Author

@guozhangwang @vvcephei, thanks for your help, guys.

jarekr pushed a commit to confluentinc/kafka that referenced this pull request Apr 18, 2019
* apache/trunk: (23 commits)
  KAFKA-7986: Distinguish logging from different ZooKeeperClient instances (apache#6493)
  KAFKA-8102: Add an interval-based Trogdor transaction generator (apache#6444)
  MINOR: Fix misspelling in protocol documentation
  KAFKA-8150: Fix bugs in handling null arrays in generated RPC code (apache#6489)
  KAFKA-8014: Extend Connect integration tests to add and remove workers dynamically (apache#6342)
  MINOR: Remove line for testing repartition topic name (apache#6488)
  MINOR: add MacOS requirement to Streams docs
  MINOR: fix message protocol help text for ElectPreferredLeadersResult (apache#6479)
  MINOR: list-topics should not require topic param
  MINOR: Clean up ThreadCacheTest (apache#6485)
  MINOR: Avoid unnecessary collection copy in MetadataCache (apache#6397)
  KAFKA-8142: Fix NPE for nulls in Headers (apache#6484)
  KAFKA-7243: Add unit integration tests to validate metrics in Kafka Streams (apache#6080)
  MINOR: Add verification step for Streams archetype to Jenkins build (apache#6431)
  KAFKA-7819: Improve RoundTripWorker (apache#6187)
  KAFKA-7989: RequestQuotaTest should wait for quota config change before running tests (apache#6482)
  KAFKA-8098: Fix Flaky Test testConsumerGroups
  KAFKA-6958: Add new NamedOperation interface to enforce consistency in naming operations (apache#6409)
  MINOR: capture result timestamps in Kafka Streams DSL tests (apache#6447)
  MINOR: updated names for deprecated streams constants (apache#6466)
  ...
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
…treams (apache#6080)

The goal of this task is to implement an integration test for the Kafka Streams metrics.

We have to check 2 things:
1. After the streams application is started, all metrics from the different levels (thread, task, processor, store, cache) are correctly created and display recorded values.
2. When the streams application is shut down, all metrics are correctly de-registered and removed.

Reviewers: John Roesler <john@confluent.io>, Guozhang Wang <wangguoz@gmail.com>