KAFKA-7243: Add unit integration tests to validate metrics in Kafka Streams#6080

Merged
guozhangwang merged 28 commits into apache:trunk from khaireddine120:KAFKA-7243
Mar 22, 2019

Conversation

@khaireddine120
Contributor

@khaireddine120 khaireddine120 commented Dec 31, 2018

The goal of this task is to implement an integration test for the Kafka Streams metrics.
We have to check 2 things:
1. After the streams application is started, all metrics from the different levels (thread, task, processor, store, cache) are correctly created and display recorded values.
2. When the streams application is shut down, all metrics are correctly de-registered and removed.

@khaireddine120
Contributor Author

khaireddine120 commented Jan 1, 2019

Hi @mjsax @guozhangwang, can you check the pull request?

@guozhangwang guozhangwang changed the title Kafka-7243: Add unit integration tests to validate metrics in Kafka Streams KAFKA-7243: Add unit integration tests to validate metrics in Kafka Streams Jan 3, 2019
@guozhangwang
Contributor

@vvcephei could you take a look when you have time?

Contributor

@vvcephei vvcephei left a comment


Hi @khaireddine120 ,

Thanks for this PR.

When you get the chance, please fill out the PR description, as it will become the commit message when it's merged.

Looking at the ticket, I think the intent was actually to enumerate the expected metrics and make sure they are all actually registered.

I think your strategy to make sure that we de-register all metrics on shutdown works. But after startup, we need to verify each of the metrics we have documented: https://kafka.apache.org/documentation/#kafka_streams_monitoring

Thanks,
-John

@khaireddine120
Contributor Author

Hi @vvcephei, I have 3 questions regarding this test:

  1. Do we have to execute a producer?
  2. Do we have to check the 5 store metric types? I tested only 3 (inMemory, inMemoryLru and persistent key/value); there are also window and session stores, but I didn't figure out how to integrate them in the test.
  3. I notice that there are also metrics other than thread, task, processor, store and cache.
    Do we have to test them as well?
    Thanks

@khaireddine120
Contributor Author

Hi @vvcephei, any update?

@vvcephei
Contributor

vvcephei commented Feb 18, 2019

Hi @khaireddine120 ,

Sorry for the delay. Here are my thoughts...

  1. If you're asking about executing a producer so that we can verify producer metrics, I don't think Streams needs to make guarantees about which producer metrics are present. On the other hand, if we need to process some data in order to verify Streams metrics, then we should do so.
  2. Yes, but we can add the additional metrics coverage by adding more test methods; we don't have to wedge everything into one test. Testing the metrics for window and session stores should be as simple as having some materialized windowed aggregations in the topology under test.
  3. To answer, I'd have to know exactly which metrics you have in mind. Do note, however, that some metrics are already verified in existing tests. Of course, it might be nice to make sure that every documented metric is at least covered by this test.

I looked at your updates to the test, and it looks good to me.

Thanks!
-John

@khaireddine120
Contributor Author

khaireddine120 commented Mar 5, 2019

Hi @vvcephei @guozhangwang,
Can you recheck the pull request?
Thanks

Contributor

@guozhangwang guozhangwang left a comment


Thanks for the PR @khaireddine120, I've made a pass on it. LGTM overall.

Thread.sleep(10000);

final List<Metric> listMetricAfterStartingApp =
    new ArrayList<Metric>(kafkaStreams.metrics().values()).stream()
        .filter(m -> m.metricName().group().contains("stream"))
        .collect(Collectors.toList());
Assert.assertTrue(listMetricAfterStartingApp.size() > 0);
Contributor


Just checking that there are some metrics whose group contains "stream" is very loose. I'd suggest we:

  1. have a simple topology that reads from topic1, accesses store1, and then writes to a sink topic2.
  2. get the list of metrics from kafkaStreams.metrics(), which contains producer, consumer, admin and Streams' own metrics (thread, task, processor node, store, cache), and check that they all exist with the exact number of metrics (once the PR for KIP-414 is in, it should be easy to get the corresponding client id for the different modules).
  3. close the app, and wait for the metrics to all be removed, with a timeout (see comment below).
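The suggested topology might be sketched roughly like this (the names topic1, store1 and topic2 come from the list above; the serdes and the use of a count aggregation are placeholder assumptions, not the PR's actual code):

```java
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("topic1", Consumed.with(Serdes.String(), Serdes.String()))
       .groupByKey()
       .count(Materialized.as("store1")) // materialized store => store-level metrics get registered
       .toStream()
       .to("topic2", Produced.with(Serdes.String(), Serdes.Long()));
```

With such a topology running, thread, task, processor-node, store and cache metrics should all be present in kafkaStreams.metrics().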


closeApplication();

Thread.sleep(10000);
Contributor


We avoid time-based operations in integration tests, since they usually lead to flaky tests.

Consider using TestUtils.waitForCondition() with a timeout.
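For illustration, the polling pattern behaves roughly like this (a simplified, dependency-free stand-in for Kafka's TestUtils.waitForCondition, not the real implementation):

```java
import java.util.function.BooleanSupplier;

public class WaitForCondition {
    // Re-check the condition periodically until it holds or the timeout expires,
    // instead of sleeping a fixed 10 seconds and hoping the metrics are registered.
    public static void waitForCondition(final BooleanSupplier condition,
                                        final long timeoutMs,
                                        final String failureMessage) throws InterruptedException {
        final long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError(failureMessage);
            }
            Thread.sleep(Math.min(100, timeoutMs)); // brief backoff between checks
        }
    }

    public static void main(final String[] args) throws InterruptedException {
        final long start = System.currentTimeMillis();
        // The condition becomes true after ~300 ms; the wait returns as soon as it does.
        waitForCondition(() -> System.currentTimeMillis() - start >= 300, 5000,
                "condition never became true");
        System.out.println("condition met");
    }
}
```

The test passes as soon as the condition holds, and fails with a clear message only after the full timeout is exhausted.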

}

@Test
public void testStreamMetricOfWindowStore() throws Exception {
Contributor


I was originally thinking about getting the list of all metrics from kafkaStreams.metrics(), which contains producer, consumer, admin and Streams' own metrics (thread, task, processor node, store, cache), and checking that they all exist with the exact number of metrics (once the PR for KIP-414 is in, it should be easy to get the corresponding client id for the different modules).

But after reading @vvcephei's comment, I think I'm convinced that we can save getting the non-Streams embedded clients' metrics for later, and also that the actual metric-name validation may be better placed outside the Streams metrics test, so we'd probably only check that the corresponding groups with clientIds exist rather than checking that each metric exists. So I'm fine with the current scope of this PR.

kafkaStreams = new KafkaStreams(builder.build(), streamsConfiguration);
kafkaStreams.start();

Thread.sleep(10000);
Contributor


We avoid time-based operations in integration tests, since they usually lead to flaky tests.

Consider using TestUtils.waitForCondition() with a timeout.

Contributor Author

@khaireddine120 khaireddine120 Mar 7, 2019


@guozhangwang the issue is that I have no condition to be sure that the metrics have been registered before beginning the test. I tested with 5 seconds (not all metrics were registered), then with 10 seconds, and the test never fails. Any suggestions?

Contributor

@guozhangwang guozhangwang Mar 7, 2019


What I was suggesting is that we can 1) remove the sleep in the startApplication function, and 2) in the test itself, use TestUtils.waitForCondition() after startApplication for each testXX check, such that if it is not yet satisfied, instead of failing the test immediately, it will back off a bit and re-execute the check to see if it is good now -- so we run the check continuously rather than as a one-time shot.

Of course, this requires you to replace the assertions in testMetricByName and instead return booleans, plus propagate the error message bottom-up, so that it shows exactly which metric name was not found when we've exhausted the wait time and finally decide to fail (otherwise it will just say "failed to meet condition", and the internal information, like which test function's which metric name caused it, will be lost).

Does that make sense?
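The refactoring described above might be sketched like this (checkMetricByName and the metric names here are simplified placeholders, not the PR's actual code):

```java
import java.util.Arrays;
import java.util.List;

public class MetricCheckSketch {
    // Instead of asserting inside the check, return a boolean and record what
    // failed, so a retry loop can keep polling and still report the exact
    // missing metric once it finally gives up.
    static boolean checkMetricByName(final List<String> registered,
                                     final String name,
                                     final int expectedCount,
                                     final StringBuilder errorMessage) {
        final long count = registered.stream().filter(name::equals).count();
        if (count != expectedCount) {
            errorMessage.append("metric '").append(name).append("': expected ")
                        .append(expectedCount).append(" but found ").append(count).append("; ");
            return false;
        }
        return true;
    }

    public static void main(final String[] args) {
        final List<String> metrics = Arrays.asList("put-latency-avg", "put-latency-avg");
        final StringBuilder errorMessage = new StringBuilder();
        final boolean ok = checkMetricByName(metrics, "put-latency-avg", 2, errorMessage)
                && checkMetricByName(metrics, "fetch-latency-avg", 1, errorMessage);
        System.out.println(ok ? "all checks passed" : errorMessage.toString().trim());
    }
}
```

Each failed check leaves a precise message behind, which the surrounding waitForCondition can surface when the timeout expires.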

Contributor Author


Yes, thank you.

@khaireddine120
Contributor Author

Hi @guozhangwang, can you confirm the fix?
BTW, I forgot the merge process; will someone from the PMC take it in charge?
Thanks

@guozhangwang
Contributor

Don't worry, I will squash / merge when needed.

Contributor

@guozhangwang guozhangwang left a comment


Made another pass, and left a couple minor comments.

Assert.assertNotNull("Metric:'" + m.metricName() + "' must be not null", m.metricValue());
}
} catch (final Throwable e) {
throw e;
Contributor


Since in the caller we will capture the throwable and swallow it by returning false, wouldn't the actual error message be lost?

Contributor Author


I put a logger, but checkstyle refused it. Should I find a way to log the error, or ignore it?

Contributor


@khaireddine120 I thought about it again, and now I realize that adding a logger is not ideal: the check may fail a couple of times within the timeout before it succeeds, and each assert failure would still log a record before the eventual success.

So how about doing this?

  1. Create an error string variable (initialized as null) which is passed to the testXX function, which in turn passes it to testMetricByName as an additional parameter. Inside testMetricByName, replace the assertions with the size / not-null checks directly; if a check fails, modify the passed-in string reference with the necessary information, like which metric name has an unexpected count, or which name is null.

  2. This string can then be passed as the second parameter of waitForCondition, because that parameter is actually a Supplier<String> which is evaluated lazily. So if the condition is indeed not met, the evaluation will take whatever the current string reference holds to construct the final message.

WDYT?
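A small sketch of the lazy-evaluation point (the waitForCondition here is a simplified stand-in, not Kafka's TestUtils; it retries a fixed number of times instead of against a clock):

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class LazyMessageSketch {
    static void waitForCondition(final BooleanSupplier condition,
                                 final int maxAttempts,
                                 final Supplier<String> failureMessage) {
        for (int i = 0; i < maxAttempts; i++) {
            if (condition.getAsBoolean()) {
                return; // success path: failureMessage.get() is never called
            }
        }
        throw new AssertionError(failureMessage.get()); // only evaluated on final failure
    }

    public static void main(final String[] args) {
        // Success case: the message supplier is never evaluated.
        waitForCondition(() -> true, 3,
                () -> { throw new IllegalStateException("should not run"); });

        // Failure case: the supplier sees whatever the checks last wrote.
        final StringBuilder errorMessage = new StringBuilder();
        errorMessage.append("put-latency-avg not found");
        try {
            waitForCondition(() -> false, 3, errorMessage::toString);
        } catch (final AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the supplier is only invoked on the final failure, the reported message reflects the latest state of the shared error buffer.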

Contributor Author


It makes sense, thanks.

for (final Metric m : metrics) {
Assert.assertNotNull("Metric:'" + m.metricName() + "' must be not null", m.metricValue());
}
} catch (final Throwable e) {
Contributor


Do we need to capture-and-rethrow here? Since the caller will always capture and swallow, this seems unnecessary to me.

@guozhangwang
Contributor

@khaireddine120 Please ping me whenever it is ready for review again.

@khaireddine120
Contributor Author

khaireddine120 commented Mar 13, 2019 via email

@khaireddine120
Contributor Author

khaireddine120 commented Mar 13, 2019

Hi @guozhangwang, can you take a look at the last update?
Thanks.

@guozhangwang
Contributor

Hmm.. the current approach will still print multiple error messages from previous runs, right? For example, say that in waitForCondition you first fail on testing metricA, then back off a bit, retry, and fail on testing metricB, then back off, retry, and fail on testing metricC. Your error message will print as

error message for metricA; error message for metricB; error message for metricC;

while what we really want is just

error message for metricC;

Does that make sense?

@khaireddine120
Contributor Author

khaireddine120 commented Mar 13, 2019

I will recheck this case

@guozhangwang
Contributor

Ah, I see. Yeah, errorMessage.setLength(0); should work.
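A minimal sketch of how the reset keeps only the last attempt's message (the retry loop is simulated; this is not the PR's actual code):

```java
public class ResetBetweenRetries {
    public static void main(final String[] args) {
        final StringBuilder errorMessage = new StringBuilder();
        // Simulated retries: each attempt clears the buffer first, so when the
        // wait finally gives up, only the *last* attempt's failure (metricC) is
        // reported, not the stale messages from metricA and metricB.
        final String[] failingMetricPerAttempt = {"metricA", "metricB", "metricC"};
        for (final String metric : failingMetricPerAttempt) {
            errorMessage.setLength(0); // discard messages from the previous attempt
            errorMessage.append("error message for ").append(metric).append(";");
        }
        System.out.println(errorMessage);
    }
}
```

Calling setLength(0) at the start of each check is what prevents the concatenated "metricA; metricB; metricC" output described above.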

@guozhangwang
Contributor

retest this please

@khaireddine120
Contributor Author

retest this please

@guozhangwang
Contributor

LGTM. @vvcephei could you take another look as well?

@guozhangwang
Contributor

retest this please

Contributor

@vvcephei vvcephei left a comment


Hey @khaireddine120 , I just have a couple of small remarks...

streamsConfiguration.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
}

@Before
Contributor


Should this be @After?

Contributor Author


Yes, good catch :)

.collect(Collectors.toList());
testMetricByName(listMetricStore, PUT_LATENCY_AVG, 2);
testMetricByName(listMetricStore, PUT_LATENCY_MAX, 2);
testMetricByName(listMetricStore, PUT_IF_ABSENT_LATENCY_AVG, 0);
Contributor


Do I read this correctly: it's verifying that we have 0 metrics registered for PUT_IF_ABSENT_LATENCY_AVG?

Contributor Author


That's what I found after starting the app. I don't know the required setup to get this type of metric :)

Contributor


For window / session stores, there's no putIfAbsent function and hence no metrics would be registered.

Contributor Author


Should I remove it, or leave the test expecting 0?

}
}

private boolean testStoreMetricByType(final String storeType, final StringBuilder errorMessage) {
Contributor


nit: rename to testStoreMetricKeyValueByType

Contributor

@guozhangwang guozhangwang left a comment


@khaireddine120 could you address the comments and ping me again?

@khaireddine120
Contributor Author

retest this please

@khaireddine120
Contributor Author

Hi @guozhangwang, can you recheck the pull request?

Contributor

@guozhangwang guozhangwang left a comment


Thanks @khaireddine120 !

@guozhangwang guozhangwang merged commit 3124b07 into apache:trunk Mar 22, 2019
@khaireddine120
Contributor Author

@guozhangwang @vvcephei, thanks for your help, guys.

jarekr pushed a commit to confluentinc/kafka that referenced this pull request Apr 18, 2019
* apache/trunk: (23 commits)
  KAFKA-7986: Distinguish logging from different ZooKeeperClient instances (apache#6493)
  KAFKA-8102: Add an interval-based Trogdor transaction generator (apache#6444)
  MINOR: Fix misspelling in protocol documentation
  KAFKA-8150: Fix bugs in handling null arrays in generated RPC code (apache#6489)
  KAFKA-8014: Extend Connect integration tests to add and remove workers dynamically (apache#6342)
  MINOR: Remove line for testing repartition topic name (apache#6488)
  MINOR: add MacOS requirement to Streams docs
  MINOR: fix message protocol help text for ElectPreferredLeadersResult (apache#6479)
  MINOR: list-topics should not require topic param
  MINOR: Clean up ThreadCacheTest (apache#6485)
  MINOR: Avoid unnecessary collection copy in MetadataCache (apache#6397)
  KAFKA-8142: Fix NPE for nulls in Headers (apache#6484)
  KAFKA-7243: Add unit integration tests to validate metrics in Kafka Streams (apache#6080)
  MINOR: Add verification step for Streams archetype to Jenkins build (apache#6431)
  KAFKA-7819: Improve RoundTripWorker (apache#6187)
  KAFKA-7989: RequestQuotaTest should wait for quota config change before running tests (apache#6482)
  KAFKA-8098: Fix Flaky Test testConsumerGroups
  KAFKA-6958: Add new NamedOperation interface to enforce consistency in naming operations (apache#6409)
  MINOR: capture result timestamps in Kafka Streams DSL tests (apache#6447)
  MINOR: updated names for deprecated streams constants (apache#6466)
  ...
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
…treams (apache#6080)

The goal of this task is to implement an integration test for the Kafka Streams metrics.

We have to check 2 things:
1. After the streams application is started, all metrics from the different levels (thread, task, processor, store, cache) are correctly created and display recorded values.
2. When the streams application is shut down, all metrics are correctly de-registered and removed.

Reviewers: John Roesler <john@confluent.io>, Guozhang Wang <wangguoz@gmail.com>