KAFKA-2668; Add a metric that records the total number of metrics #328
lindong28 wants to merge 16 commits into apache:trunk from
Conversation
I think this may cause problems when there are multiple clients. I ran a sample application that makes two KafkaConsumers. Here's what jconsole shows me: Basically, metrics-stats isn't scoped to a client-id while the others were. I think what ends up happening is that the two clients write their own metrics count to the same metrics-total mbean. So if one consumer had x metrics while another had y metrics, metrics-total could, for instance, jump back and forth between x and y. Below is the sample application:
/**
* Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE
* file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file
* to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
* License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
* an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
* specific language governing permissions and limitations under the License.
*/
package org.apache.kafka.clients.consumer;
import java.util.Arrays;
import java.util.Properties;
public class MetricOfMetricsConsumer {
    public static void main(String[] args) {
        KafkaConsumer<String, String> consumer1 = makeConsumer();
        KafkaConsumer<String, String> consumer2 = makeConsumer();
        consumer1.subscribe(Arrays.asList("t"));
        consumer2.subscribe(Arrays.asList("t"));
        while (true) {
            consumer1.poll(1000);
            consumer2.poll(1000);
        }
    }

    public static KafkaConsumer<String, String> makeConsumer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", System.currentTimeMillis() + "");
        props.put("partition.assignment.strategy", "roundrobin");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return new KafkaConsumer<>(props);
    }
}
The one mbean ObjectName is:
So yeah I think the original intent was to just have this metric for a broker, which made me think it was odd putting this in the Metrics constructor. But it kind of makes sense to have this for consumers and producers too. There are some ways to address this bug:
@onurkaraman Thanks much for taking the time to test it! I have updated the patch to address your comments. The updated patch uses the client-id in metricsTags in the metricName. And I have tested the patch with the broker, consumer, and producer. Please let me know if the updated patch looks good.
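The scoping fix can be illustrated with plain JMX. This is only a sketch, not the actual patch; the domain and key names below are hypothetical. Two ObjectNames that differ only in a client-id key are distinct MBeans, so each client publishes its own count instead of the two overwriting each other:

```java
import java.util.Hashtable;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class ScopedNames {
    // Build an ObjectName that includes a client-id key, so each client
    // gets its own MBean instead of sharing one.
    public static ObjectName scopedName(String clientId) {
        try {
            Hashtable<String, String> keys = new Hashtable<>();
            keys.put("type", "kafka-metrics-count"); // hypothetical type
            keys.put("client-id", clientId);
            return new ObjectName("kafka.consumer", keys); // hypothetical domain
        } catch (MalformedObjectNameException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // The two names differ only in client-id, yet name distinct MBeans.
        System.out.println(scopedName("consumer-1"));
        System.out.println(scopedName("consumer-2"));
    }
}
```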
LGTM. I think your revised patch did the trick. jconsole now shows: The two mbean ObjectNames now are: If we want this also for MirrorMaker (once it switches to the new consumer and producer), I'm not really sure how that'd work, but I think this is probably good for now.

Great. Thanks for confirming that the patch works.
Actually, while things look good from the perspective of clients that fully depend on Kafka's metrics system, it doesn't quite seem right from the broker's perspective. It only reports the number of org.apache.kafka.common.metrics metrics and doesn't factor in the yammer metrics, which I think is misleading. This is okay in the long run since we will migrate all of the sensors away from yammer, but it's unclear what the right thing to do now is.

@onurkaraman I have updated the patch such that the new metric shows the total number of attributes of all MBeans in the MBean server. This may include some MBeans that are not directly registered via Kafka, e.g. the java.lang:type=Memory MBean. But the semantics of this metric are clear and it should solve the problem that motivated this ticket.
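The "count every attribute of every MBean in the MBean server" approach described above can be sketched with the standard javax.management API (an illustration, not the actual patch):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class MBeanAttributeCounter {
    // Sum the attribute counts of every MBean currently registered in the
    // platform MBeanServer (this includes JVM-provided MBeans, not just Kafka's).
    public static int countAttributes() {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        int total = 0;
        for (ObjectName name : server.queryNames(null, null)) {
            try {
                total += server.getMBeanInfo(name).getAttributes().length;
            } catch (Exception e) {
                // The MBean may have been unregistered between query and lookup.
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("total attributes: " + countAttributes());
    }
}
```

As the thread notes, this counts JVM MBeans such as java.lang:type=Memory alongside Kafka's own.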
Let me think about this for a bit more - originally I thought we should only count the metrics under the kafka-metrics package. I think we should eventually move away from yammer metrics, but the latest approach of just reporting the total number of mbeans may subsume that and may be useful by itself.

I think recording the total number of mbeans is probably not right. In terms of debugging, we care about how many metrics are coming from Kafka. Recording the total number of mbeans doesn't really help us if there's a ton of mbeans being made outside of Kafka. For instance, let's say an application has two clients: a KafkaConsumer and a client for some other KV store. If the KV store client's mbeans start going through the roof, it can incorrectly get classified as a Kafka problem.
@onurkaraman I think the total number of mbeans is generally useful even in your use case -- it means something is going wrong if the number goes through the roof. Importantly, it addresses the use case that motivated this patch, i.e. the number of mbeans registered on the server side. If you think we need to specifically record the number of mbeans registered by Kafka threads, we can add a filter in the query to only record those mbeans named kafka. But I think it is actually useful to report the total number of mbeans from the user's perspective. P.S. the semantics of this metric are the total number of mbeans registered by the program, not the total number of mbeans registered by Kafka threads. Thus it should not simply get classified as a Kafka problem in your example.
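The filter mentioned here -- restricting the count to mbeans named kafka -- could be done with a JMX query pattern on the domain. A sketch, not the actual patch:

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.ObjectName;

public class KafkaMBeanFilter {
    // Query only MBeans whose domain starts with "kafka", leaving out
    // JVM-provided MBeans and MBeans registered by other libraries.
    public static Set<ObjectName> kafkaMBeans() {
        try {
            return ManagementFactory.getPlatformMBeanServer()
                    .queryNames(new ObjectName("kafka*:*"), null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("kafka mbeans: " + kafkaMBeans().size());
    }
}
```

In a plain JVM with no Kafka client running, this query returns an empty set.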
I agree that it may be useful to record and report the total number of attributes. The main concern I have with the current approach is that each lookup of this metric (from some metric-pulling system) will involve a traversal of the mbean registry, which is a bit much if the lookup happens frequently. There are a couple of ways to avoid/improve this:
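The specific alternatives were not captured in this thread, but one generic way to avoid a registry traversal on every read is to serve the value from a cache refreshed on a fixed schedule. A minimal sketch, not necessarily what was proposed here:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

public class CachedGauge {
    private final AtomicLong cached = new AtomicLong();

    // Prime the cache once, then refresh it on a fixed schedule so reads
    // never pay for the expensive computation (e.g. a registry traversal).
    public CachedGauge(LongSupplier expensiveCount, long periodSeconds) {
        cached.set(expensiveCount.getAsLong());
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(runnable -> {
                    Thread t = new Thread(runnable, "metrics-count-refresher");
                    t.setDaemon(true); // do not keep the JVM alive
                    return t;
                });
        scheduler.scheduleAtFixedRate(() -> cached.set(expensiveCount.getAsLong()),
                periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    public long value() {
        return cached.get();
    }
}
```

The trade-off is staleness of up to one refresh period, which is usually acceptable for a metric that is itself polled periodically.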
@jjkoshy I think the updated patch should address the concerns raised by you and Onur. Thanks!
My bad. I was again going to say LGTM, but I noticed that your change makes kafka-clients depend on yammer metrics, which I think we wanted to avoid.
Given the dependency constraint, I think the best approach would be to have a kafka-metrics-total and a yammer-metrics-total. kafka-metrics-total can get created within Metrics. kafka-metrics-total could use
@onurkaraman @jjkoshy @becketqin Thanks much for taking the time to review the patch! I have updated the patch so that we don't introduce a yammer dependency to the new producer and new consumer. With the latest patch, the server will have two additional JMX metrics named kafka-metrics-total and yammer-metrics-total. The new producer and new consumer will have one additional JMX metric named kafka-metrics-total. The old consumer will have one additional metric named yammer-metrics-total. I have tested the patch with the server, new producer, and old consumer. The patch works as expected. Can you please have another look?

Test failure is unrelated to this patch.
force-pushed from 60c923c to d83f19c
@guozhangwang Would you have time to review this patch? @onurkaraman has reviewed this patch, and @jjkoshy will be on leave soon.

Went through the code and it looks good to me overall, but I would like to borrow another pair of eyes from @junrao to take a look, as my knowledge of this module is shallow.
groupedMetrics creates a new TreeMap on every invocation. Also, the conversion makes yet another copy of these in scala. Can you instead just do defaultRegistry().allMetrics().size()? That does not create any unnecessary intermediate copies.
Thank you @guozhangwang @jjkoshy for your review. I have updated the patch as suggested.
Currently, we add the default tags (e.g., client-id) explicitly for each metric. With the default tags, I am wondering if it's simpler to add the following method in Metrics that automatically adds the default tags to the metric name. Then, we can use that method to create the metric name and remove the code that explicitly sets the default tags.
MetricName metricName(...)
Yes, it is a good point and should remove quite a few lines of code. But this will touch a lot of files and potentially cause conflicts with other commits. Let me give it a shot.
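For illustration, the tag handling such a Metrics.metricName(...) helper might perform can be sketched with plain maps. The class and method names here are hypothetical and the real MetricName construction is not shown; only the merge semantics (default tags first, per-metric tags winning on conflict) are the point:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DefaultTagMerger {
    // Start from the registry's default tags (e.g. client-id) and let
    // per-metric tags override them on key conflicts.
    public static Map<String, String> mergeTags(Map<String, String> defaultTags,
                                                Map<String, String> metricTags) {
        Map<String, String> combined = new LinkedHashMap<>(defaultTags);
        combined.putAll(metricTags);
        return combined;
    }
}
```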
force-pushed from 75cb495 to b1acb1a
@junrao My last commit removes the code that explicitly sets the default tags and instead uses the new method. I still kept the explicit usage in a few places. Could you have a look and let me know if this is what you originally suggested? Thanks!
Instead of using MeasurableStat(), it seems that we can just use Measurable like the following.
new Measurable() {
    public double measure(MetricConfig config, long now) {
        return metrics.size();
    }
};
Ah I misunderstood your comment on this earlier. Fixed now.
@lindong28 It seems some of the indentation does not match the checkstyle rules in the Jenkins job; could you verify?

@guozhangwang I was wondering why the Jenkins test failed without providing failure information. I have fixed the indentation and removed unused imports. Could you have a look? Thanks!
@junrao Great! Thanks so much for your time and review. I have removed a bunch of unnecessary spaces and rebased the patch against trunk.
Thanks for the patch. LGTM.
@lindong28 I got some warnings when compiling with this patch; could you take a look? It seems "<>" cannot be directly used in a comment link:
@guozhangwang Thanks for the catch. I didn't notice the warning message. I have created a minor pull request at #651. Can you take a look?
…e#328) TICKET = KAFKA-13797 LI_DESCRIPTION = In the past, we found that some brokers in the venice cluster went through heavy load due to excessive Metadata requests. This PR adds a metric to show the rate of metadata outgoing bytes per second for easier troubleshooting of such issues. EXIT_CRITERIA = When KAFKA-13797 is resolved and the changes are pulled into this repo.
@onurkaraman @becketqin Do you have time to review this patch? It addresses the ticket that @jjkoshy filed in KAFKA-2668.