KAFKA-7223: Suppression Buffer Metrics#5795
guozhangwang merged 5 commits into apache:trunk from vvcephei:suppress-buffer-metrics
Conversation
vvcephei
left a comment
@guozhangwang @mjsax @bbejeck ,
Do you mind taking a look at this last PR for the KIP?
I did the metrics in a separate file to keep related tests together for readability.
I'm aware there are a lot of repeated strings in the metric names, but it is on purpose: it's the only place in the code where you can visually confirm that the metrics we're measuring conform to the ones we document.
might as well consolidate this repeated string.
I am always confused by metrics. What is the group name used for (i.e., semantic meaning)?
AFAICT, I don't think it's "used for" anything in particular. It's part of the metric name, but so is the description.
I assume it's intended to be like a namespace.
group name is used as a first "tag" of the metric name in JMX reporter: xxx-metrics:type=[group-name],[tag1]=[value1],...; for other reporters they can use the group name however they like.
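To make the JMX naming concrete, here is a minimal stdlib-only sketch (the class name and `jmxName` helper are hypothetical, not Kafka code) showing how a group name ends up as the `type` key of a JMX `ObjectName`, with tags as further key properties:

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class JmxGroupNameDemo {
    // Builds a JMX-style name following the pattern described above:
    // xxx-metrics:type=[group-name],[tag1]=[value1],...
    static ObjectName jmxName(String domain, String group, String clientId)
            throws MalformedObjectNameException {
        return new ObjectName(domain + ":type=" + group + ",client-id=" + clientId);
    }

    public static void main(String[] args) throws Exception {
        ObjectName name = jmxName("kafka.streams", "stream-buffer-metrics", "app-1");
        // The group name is recoverable as the "type" key property.
        System.out.println(name.getKeyProperty("type"));      // stream-buffer-metrics
        System.out.println(name.getKeyProperty("client-id")); // app-1
    }
}
```

Other (non-JMX) reporters are free to render the same group name however they like; only the JMX reporter maps it to `type=`.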
Sorry for my confusion. Let me rephrase my current understanding:
A "metric" is a single value that we track and report as a key-value pair. A Sensor groups multiple metrics together for ease of use -- each metric gets a globally unique name that is assembled from many parts.
- We prefix metric names (the `xxx-metrics:` part) with the `clientId`, which is different for each `StreamThread`.
- What is the purpose of `group-name` in the metric name, i.e., which metrics should use the same or a different group name?
- Each sensor has a name that is added to all metric names within the sensor (i.e., the sensor name groups all its contained metrics).
- We also put additional tags to add more meta information (task-id, processor-id) if appropriate, to name each metric uniquely within a sensor.
To clarify: by "grouping" I meant using the same string for a specific part of the metric name. I.e., the prefix groups all metrics based on the stream-thread, and the sensor name groups all its contained metrics (as a sub-group within the stream-thread group of metrics). Does this make sense?
To clarify: there are two entities: the metrics registry, which organizes the metrics, and the metrics reporter, which regularly pulls from the registry to report the metric values.
Inside the metrics registry there are sensors, which, as you understand, are just a way of grouping metrics into meaningful clusters. The sensor name is just an id for distinguishing sensors in the metrics registry (i.e., you will see logic like: if this sensor has already been created in the registry, skip this step). A metric name, represented as a MetricName containing groupName, tags, etc., is just a logical entity in the registry. How to represent the metric names is up to the metrics reporter (different reporters can definitely represent them differently). As for the sensor names, they should never be seen outside the registry, as the metrics reporter never exposes them.
Thanks @guozhangwang
I was just checking, and we also define "stream-processor-node-metrics" in ProcessorNode -- should we unify both, to have one constant only?
I think we can also simplify ProcessorNode#createTaskAndNodeLatencyAndThroughputSensors and remove the group parameter.
Actually, someone filed a Jira last week reporting that a heap analysis identified this exact group name as responsible for megabytes of heap space, so perhaps we should consolidate it into one constant!
If this is an issue, maybe we should do a fix first (in its own PR, so we can cherry-pick), and rebase/merge this PR later? Thoughts? Are you planning to do a PR for the reported issue?
oh. I've already added a commit to this PR.
I am planning a separate PR for a couple of other issues he identified. It's https://issues.apache.org/jira/browse/KAFKA-7660, by the way.
vvcephei
left a comment
I've updated this PR, and it's ready for review, now that 2.1 is complete.
These metrics would be included in the 2.2 release.
This sensor is different from the one proposed in the KIP. During implementation, I noticed a weird asymmetry in which the processor node would measure the suppression, but the buffer would measure the "eviction", aka "emission", aka "forwarding".
I'm proposing to update the KIP to make these two measurements symmetrical.
I didn't think of that before, but now that you mention it, the change makes sense to me.
Avoid recording metrics if they haven't changed.
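A minimal sketch of that guard (the `ChangedOnlyRecorder` class and its fields are hypothetical stand-ins for a Sensor, not the PR's actual code): skip the `record` call entirely when the observed value has not changed since the last recording.

```java
import java.util.ArrayList;
import java.util.List;

public class ChangedOnlyRecorder {
    private final List<Double> recorded = new ArrayList<>(); // stand-in for Sensor.record calls
    private double lastValue = Double.NaN;

    // Only record when the value actually changed since the last record.
    public void maybeRecord(double value) {
        if (Double.compare(value, lastValue) != 0) {
            recorded.add(value);
            lastValue = value;
        }
    }

    public int recordCount() {
        return recorded.size();
    }

    public static void main(String[] args) {
        ChangedOnlyRecorder recorder = new ChangedOnlyRecorder();
        recorder.maybeRecord(10.0);
        recorder.maybeRecord(10.0); // unchanged: skipped
        recorder.maybeRecord(12.0);
        System.out.println(recorder.recordCount()); // 2
    }
}
```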
In the KIP, I erroneously proposed to make this a node-level sensor. It should be a store-level sensor instead. I'm proposing to update the KIP if you all agree.
Yeah, that makes sense to me. +1
This is not newly introduced in this PR, but: although we call it stream-buffer-metrics, where the buffer-id is the store name, the store name could be an auto-generated name. Note that for the named cache we call it stream-record-cache-metrics, but the cache name is the store name, and hence we have the same issue.
But here we use the store.name(), which would be KTABLE-SUPPRESS-XXXX-store if auto-generated; maybe it's less confusing to use the processor name instead, which would be KTABLE-SUPPRESS-XXX if auto-generated. Note that even if Suppressed.name() is specified as XYZ, via store.name() we still set the buffer-id as XYZ-store instead.
I think it's good to be aware of this, but it seems like reporting the store name is the right thing to do here.
Note that the buffer in this case is the store for the suppression processor. So, using the processor name instead of the full store name would be more analogous to doing the same thing for a K/V Store.
As written, the metrics that are specific to the buffer (aka store) will be tagged with a name that matches the name of that component when you describe the topology.
Does that seem right, or have I missed your point?
just cleaning up deprecated usages.
need this for the node-level metrics
Inlined an unnecessary method. I must have created it when I thought the cast would cause a compiler warning, but it does not.
I think it should be fine to set the recording level to debug for all unit tests, rather than create a way to configure it.
java 11 had six unrelated failures:
bbejeck
left a comment
I made a pass over the PR and left some comments, overall looks good.
nit: Should the string say The rate of...?
Actually, strange as the sentence is, this is what the other rate metrics say, so I think we should just keep it for consistency.
Seems like the contents of the two sensor#add methods are the same as above; maybe refactor into two methods?
Yeah, that makes sense to me. +1
I feel like I should know the answer to this, but how do we arrive at 51 for the answer here and below?
There was a problem hiding this comment.
Empirically :/
It's just the result of summing all the pieces of the test data.
retest this please
Hi @bbejeck, Thanks for the review! As explained above, that description string for the rate metrics is common across the other rate metrics, so I've left it as-is. Is that ok? Incidentally, I was surprised that the tests didn't fail when I changed the description, since I specifically tested for them. It turns out that MetricName equality is defined to ignore the description. I've refactored the test to verify the description as well, and also fixed a problem with one of the description strings I found.
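The pitfall can be illustrated with a hypothetical stand-in class (not Kafka's actual MetricName, but the same equality contract): when `equals`/`hashCode` deliberately exclude the description, a test that only compares names cannot catch a changed or wrong description string.

```java
import java.util.Map;
import java.util.Objects;

public class SimpleMetricName {
    final String name;
    final String group;
    final String description; // intentionally excluded from equals/hashCode
    final Map<String, String> tags;

    SimpleMetricName(String name, String group, String description, Map<String, String> tags) {
        this.name = name;
        this.group = group;
        this.description = description;
        this.tags = tags;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SimpleMetricName)) return false;
        SimpleMetricName other = (SimpleMetricName) o;
        return name.equals(other.name) && group.equals(other.group) && tags.equals(other.tags);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, group, tags); // description omitted, matching equals
    }

    public static void main(String[] args) {
        Map<String, String> tags = Map.of("task-id", "0_0");
        SimpleMetricName a = new SimpleMetricName("suppression-emit-rate", "g", "The rate of ...", tags);
        SimpleMetricName b = new SimpleMetricName("suppression-emit-rate", "g", "something else", tags);
        System.out.println(a.equals(b)); // true: the description is not compared
    }
}
```

Hence a test that wants to pin down descriptions has to assert on the description field explicitly rather than relying on name equality.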
retest this please
rebased |
|
Java 8 failure unrelated.
Java 11 failures unrelated:
Retest this, please.
hmm. It looks like the java 11 build hung and then timed out. Here's the end of the output: I reported this as https://issues.apache.org/jira/browse/KAFKA-7553 |
|
Retest this, please.
The tests failed again, but this time I didn't catch what failed. I'll take the opportunity to rebase, which will run the tests again.
Java 8 passed, and Java 11 hung again. Retest this, please.
Aha! The tests finally passed. @mjsax or @guozhangwang , do you mind taking a look? |
This seems to effectively count the number of input records (total and average) -- so it's basically the input data rate? Is it intended like this? intermediate-result-suppression sounds more like how often something is suppressed, but I don't see how this would be reflected here.
I struggled a bit with this question as well. Technically, everything that goes into the buffer is suppressed, and then some stuff is emitted later. We have one metric for each of these things.
I agree it seems more like we should be measuring the number of records that are discarded. I guess we could do this by recording the metric when we buffer a record for which there was a pre-existing record in the buffer. This would be straightforward for the in-memory buffer, but I'm concerned about doing an unnecessary get with an on-disk store to maintain the metric.
You should be able to arrive at the same (or a similar) number by subtracting the emit metric from the suppress metric. I was thinking this is what people would probably do, but on second thought, I could put together a composite metric that does it internally.
WDYT?
I'd prefer only providing the non-composite metrics out-of-the-box while letting users compose them whenever they want; this is in line with the same argument @vvcephei had about "always recording at the lowest level of granularity only, and letting users do the roll-up themselves if they like".
Following this line, I'd suggest we keep only one metric, aka the suppressionEmitSensor, and users can then calculate the suppression rate as "processing-rate" minus "suppression-emit-rate".
Ok, I can get behind this. It should be measuring the same thing as the process-rate metric for the suppression ProcessorNode.
I'll remove this metric.
Should we call this in init() and/or close(), too?
Sure; I don't think they're strictly necessary, but it doesn't hurt, and it'll make the system more resilient to future changes.
Does it make sense to use the same group name as above?
This string can effectively never change (it would mean we've renamed all our processor metrics), so I don't think there's any risk from the duplication.
In the wiki there is another metric, suppression-mem-buffer-evict; is that not in the scope of this PR?
Ah, now that you mention it, when I originally created this PR, I noticed there were some shortcomings with the metrics I proposed in the KIP, so I wanted to get the reviewers' feedback on the form it has taken in this PR before proposing an update to the KIP.
In particular, I proposed an asymmetric pair of metrics where the processor node would measure the number of incoming events (intermediate-result-suppression), but the buffer would measure the outgoing (emitted==evicted) events. It seemed better to offer a symmetric pair of in/out metrics on either the processor or the buffer. In this PR, I chose to put them both on the node.
In other words, I'm proposing to replace suppression-mem-buffer-evict with suppression-emit.
Of course, in this same review, you have proposed to drop intermediate-result-suppression in favor of the existing process-rate and process-total node-level metrics. This reduces the argument, but I think it still makes sense to make this a node metric instead of a buffer metric, since (IMHO) it more closely matches the intent of using a suppression node to control the emission rate.
Thanks for the clarification, that sounds good to me.
Is this rate value time-dependent (I remember we saw some issues with it if the test runs too slowly), i.e., could it be non-zero in some edge cases?
IIRC, rate metrics are time-dependent, but I think that this particular comparison is pretty safe for a mock processor unit test.
Hi @mjsax @guozhangwang @bbejeck, Thank you all for your reviews! I have updated this PR to address the outstanding comments. If it looks good to you now, I will go ahead and update the KIP (and send out a notice to the vote thread), and we can merge the PR.
LGTM! @mjsax please feel free to merge if you think all your comments are addressed.
KIP is updated and message sent to the vote thread.
mjsax
left a comment
Two more follow-up questions. Overall LGTM.
dirtyKeys.clear();
memBufferSize = 0;
minTimestamp = Long.MAX_VALUE;
open = false;
so that the buffer will report as "closed" immediately after this method is called, rather than only after the close operation has completed.
In retrospect, it seemed safer for:
- open() to set open = true when opening has completed
- close() to set open = false when closing has started
This way, observers who first check that a store is open before querying would never be fooled into querying a partially closed store.
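That ordering can be sketched with a minimal hypothetical store (this `GuardedStore` class is an illustration of the invariant, not the PR's buffer code): the flag flips on as the last step of opening and off as the first step of closing, so an observer that checks `isOpen()` first never queries a partially closed store.

```java
public class GuardedStore {
    private volatile boolean open = false;

    public void open() {
        // ... allocate buffers, restore state ...
        open = true; // last step: only report open once setup is complete
    }

    public void close() {
        open = false; // first step: report closed before teardown starts
        // ... flush, release resources ...
    }

    public boolean isOpen() {
        return open;
    }

    public static void main(String[] args) {
        GuardedStore store = new GuardedStore();
        store.open();
        System.out.println(store.isOpen()); // true
        store.close();
        System.out.println(store.isOpen()); // false
    }
}
```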
Unrelated test failure: Retest this, please
@mjsax, do you have any more feedback on this? Thanks,
I think there is none, as he said "otherwise LGTM" :) Will merge to trunk now.
There are a bunch of those constants that appear in different classes, and I remember @vvcephei mentioned that the reason for not doing it in this PR is to not introduce mutual dependencies between those classes. If we want to do so, I'd suggest we just extract all these constants into a separate class and let all the other sensor classes refer to this single one, which we can do in another minor PR.
My bad, I did not refresh the file changes and hence was looking at an older version. The above comment is actually already addressed.
Document the new metrics added in #5795 Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Add the final batch of metrics from KIP-328 Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>