
Track Kafka client side metrics via Kamon #4481

Merged
chetanmeh merged 5 commits into apache:master from chetanmeh:kafka-metrics
May 29, 2019
Conversation

@chetanmeh
Member

Tracks Kafka client metrics via Kamon for monitoring

Description

Currently Kafka metrics are not tracked via Kamon, so we gain no insight into Kafka interactions. Out of the box, Kafka tracks quite a few client-side metrics, which are exposed via JMX:

(screenshot: Kafka client-side metrics as seen in a JMX console)

Kafka also supports a custom MetricsReporter to listen to such metrics. This PR uses that reporter support to publish the metrics to Kamon (based on the approach taken in kamon-metrics-reporter).

Usage

KamonMetricsReporter needs to be enabled via config and provided with a set of metric names to track:

whisk {
  kafka {
    common {
      metric-reporters = "org.apache.openwhisk.connector.kafka.KamonMetricsReporter"
    }
    metrics {
      // Name of metrics which should be tracked via Kamon
      names = [
        // consumer-fetch-manager-metrics
        "records-lag-max", // The maximum lag in terms of number of records for any partition in this window
        "records-consumed-total" // The total number of records consumed
      ]

      report-interval = 10 seconds
    }
  }
}
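Under the hood, the whisk.kafka.common metric-reporters setting maps onto Kafka's standard metric.reporters client property, which is how any custom MetricsReporter gets registered with a Kafka client. A minimal sketch, assuming plain client properties rather than the OpenWhisk config layer (the class name below is the one from this PR; the broker address is a placeholder):

```java
import java.util.Properties;

public class ReporterConfigSketch {
    public static Properties consumerProps() {
        Properties props = new Properties();
        // Placeholder broker address for illustration only.
        props.put("bootstrap.servers", "localhost:9092");
        // Kafka instantiates each listed reporter class and notifies it of
        // every client-side metric, which is what KamonMetricsReporter hooks into.
        props.put("metric.reporters",
                  "org.apache.openwhisk.connector.kafka.KamonMetricsReporter");
        return props;
    }
}
```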

Once enabled, these metrics are pushed to Kamon. For the above config, the following metrics can be seen in Prometheus:

# TYPE consumer_fetch_manager_metrics_records_consumed_total counter
consumer_fetch_manager_metrics_records_consumed_total{client_id="consumer-completed0"} 2.0
consumer_fetch_manager_metrics_records_consumed_total{client_id="consumer-cacheInvalidation"} 0.0
consumer_fetch_manager_metrics_records_consumed_total{client_id="consumer-health"} 1007.0
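The Prometheus names above are derived from the Kafka metric group and metric name. A hedged sketch of that flattening (the actual mapping is performed by Kamon's Prometheus reporter, so treat this helper as illustrative, not the real implementation):

```java
public class MetricNameSketch {
    // Illustrative only: joins a Kafka metric group and metric name, then
    // replaces '-' with '_' to yield a Prometheus-compatible metric name.
    public static String prometheusName(String group, String name) {
        return (group + "_" + name).replace('-', '_');
    }
}
```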

Implementation

This PR takes a whitelist approach and does not publish all metrics by default, as Kafka tracks more than 300 metrics across the Producer and Consumer.

For counters, Kafka records two types of metrics: total and rate (see KIP-187 for details). We should therefore ignore metrics ending with rate and prefer the total metrics for Kamon tracking.
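The whitelist plus the rate/total rule can be sketched as follows (the set below mirrors the "names" list in the config example and is hypothetical, not the actual KamonMetricsReporter code):

```java
import java.util.Set;

public class MetricFilterSketch {
    // Hypothetical whitelist mirroring the "names" list in the config above.
    private static final Set<String> TRACKED =
        Set.of("records-lag-max", "records-consumed-total");

    // Track only whitelisted metrics, skipping "-rate" variants: Kafka
    // records counters as both total and rate pairs (KIP-187), and only
    // the total variant should be forwarded to Kamon.
    public static boolean shouldTrack(String metricName) {
        return TRACKED.contains(metricName) && !metricName.endsWith("-rate");
    }
}
```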

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@style95
Member

style95 commented May 15, 2019

@chetanmeh IMHO, we can collect these metrics out of OpenWhisk.
Is there any reason to collect them via OpenWhisk?

Similarly, I am not sure we would collect some metrics in CouchDB via OpenWhisk even if we want to see some CouchDB metrics as well.

@codecov-io

Codecov Report

Merging #4481 into master will decrease coverage by 3.54%.
The diff coverage is 2.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4481      +/-   ##
==========================================
- Coverage   83.99%   80.45%   -3.55%     
==========================================
  Files         170      171       +1     
  Lines        7940     7980      +40     
  Branches      536      532       -4     
==========================================
- Hits         6669     6420     -249     
- Misses       1271     1560     +289
Impacted Files Coverage Δ
...enwhisk/connector/kafka/KamonMetricsReporter.scala 0% <0%> (ø)
...whisk/connector/kafka/KafkaConsumerConnector.scala 60.29% <100%> (+5.07%) ⬆️
...core/database/cosmosdb/RxObservableImplicits.scala 0% <0%> (-100%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0% <0%> (-95.46%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0% <0%> (-92.67%) ⬇️
...whisk/core/database/cosmosdb/CosmosDBSupport.scala 0% <0%> (-84.62%) ⬇️
...abase/cosmosdb/CosmosDBArtifactStoreProvider.scala 4% <0%> (-52%) ⬇️
...in/scala/org/apache/openwhisk/common/Counter.scala 40% <0%> (-20%) ⬇️
...penwhisk/core/database/cosmosdb/CosmosDBUtil.scala 81.81% <0%> (-15.16%) ⬇️
...nwhisk/core/database/cosmosdb/CosmosDBConfig.scala 94.11% <0%> (-5.89%) ⬇️
... and 27 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4df3d09...fb18207. Read the comment docs.

@chetanmeh
Member Author

IMHO, we can collect these metrics out of OpenWhisk

@style95 Any pointers on that? So far I did not find an easy way to get a handle on client-side metrics in OpenWhisk. The only approach I saw was using some form of JMX exporter, which is tricky to set up. Hence this approach.

Note that this PR only enables client-side metrics, not broker-side ones.

@style95
Member

style95 commented May 15, 2019

@chetanmeh
I am using an in-house JMX metric collector to collect major Kafka metrics, including client metrics. It is a standalone Java application of about 300 lines of code.
I think it is one of the common and natural ways, as we can find a similar approach in many places, such as Prometheus:
https://github.com/prometheus/jmx_exporter

I meant no offense. I just want to discuss this with you.
I feel like we can take advantage of external components for this kind of case.
For example, we can delegate log collection to a logback or slf4j appender, or to standalone collection agents such as logstash.
I believe, the more we delegate such logic to external components, the more concise we can keep the core logic.

I have not looked into it deeply yet, but there is also a similar interesting approach.
https://github.com/Segence/kamon-jmx-collector

@selfxp
Contributor

selfxp commented May 16, 2019

@chetanmeh I personally think that this extension is very useful. We already have integration with Kamon and Kafka has been a black box for any type of deployment for a long time. Collecting client side metrics is definitely a step in the right direction.
But collecting the JMX Kafka metrics would also make a lot of sense, and that could be the "external" and "done in house" part that I think @style95 is referring to.

@chetanmeh
Member Author

@style95 Agreed that JMX exporters can be used. However, I find them tricky to set up, and each monitoring system has its own way of exporting (Datadog, Prometheus, etc.). In most cases it involves running another agent, which needs a custom Docker build, and for Prometheus it results in two HTTP servers exposing the metrics.

As OpenWhisk uses Kamon, which abstracts away all such integrations, I was looking for a way to route these metrics to Kamon so they can be sent to various monitoring systems in a uniform way.

Further, note that the proposed KamonMetricsReporter is optional (disabled by default) and introduces minimal overhead in terms of metrics collected and reported.

@rabbah
Member

rabbah commented May 16, 2019

Can someone mention this PR on the dev list?

@chetanmeh
Member Author

Did not get any feedback on the dev thread. Would it be OK to merge this with the feature disabled by default?

@rabbah
Member

rabbah commented May 28, 2019

@chetanmeh LGTM - is this redundant though (when turned on) with the other kafka metric that's emitted by the scheduled actor?

@chetanmeh
Member Author

is this redundant though (when turned on) with the other kafka metric that's emitted by the scheduled actor?

This exposes more metrics in addition to lag, so it complements what we have in the scheduled actor. Going forward we may even consider removing the logic in the scheduled actor and using an approach like the one here:

import java.lang.management.ManagementFactory
import javax.management.ObjectName

// Read the consumer's records-lag-max directly from its JMX MBean.
val server = ManagementFactory.getPlatformMBeanServer
val name = new ObjectName(s"kafka.consumer:type=consumer-fetch-manager-metrics,client-id=$id")
def consumerLag: Long = server.getAttribute(name, "records-lag-max").asInstanceOf[Double].toLong.max(0)

@rabbah
Member

rabbah commented May 28, 2019

i like it 👍

@chetanmeh chetanmeh merged commit 658516e into apache:master May 29, 2019
@chetanmeh chetanmeh deleted the kafka-metrics branch May 29, 2019 04:15
BillZong pushed a commit to BillZong/openwhisk that referenced this pull request Nov 18, 2019
Adds a configurable MetricsReporter to route Kafka metrics to Kamon once enabled. The set of metric names to capture needs to be explicitly configured.