
Track Kafka client side metrics via Kamon #4481

Merged
chetanmeh merged 5 commits into apache:master from chetanmeh:kafka-metrics
May 29, 2019
Conversation

@chetanmeh
Member

Tracks Kafka client metrics via Kamon for monitoring

Description

Currently Kafka metrics are not tracked via Kamon, so we gain no insight into Kafka interactions. Out of the box, Kafka tracks quite a few client-side metrics, which are exposed via JMX:

(screenshot: Kafka client-side metrics as seen in a JMX console)

Kafka also supports a custom MetricsReporter to listen to such metrics. This PR uses that reporter support to publish the metrics to Kamon (based on the approach taken in kamon-metrics-reporter).

Usage

KamonMetricsReporter needs to be enabled via config and provided with a set of metric names to track:

whisk {
  kafka {
    common {
      metric-reporters = "org.apache.openwhisk.connector.kafka.KamonMetricsReporter"
    }
    metrics {
      // Name of metrics which should be tracked via Kamon
      names = [
        // consumer-fetch-manager-metrics
        "records-lag-max", // The maximum lag in terms of number of records for any partition in this window
        "records-consumed-total" // The total number of records consumed
      ]

      report-interval = 10 seconds
    }
  }
}
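Under the hood, the whisk.kafka.common metric-reporters setting maps onto Kafka's standard metric.reporters client property, which is how any custom MetricsReporter gets registered with a Kafka client. A minimal sketch, assuming plain client properties rather than the OpenWhisk config layer (the class name below is the one from this PR; the broker address is a placeholder):

```java
import java.util.Properties;

public class ReporterConfigSketch {
    public static Properties consumerProps() {
        Properties props = new Properties();
        // Placeholder broker address for illustration only.
        props.put("bootstrap.servers", "localhost:9092");
        // Kafka instantiates each listed reporter class and notifies it of
        // every client-side metric, which is what KamonMetricsReporter hooks into.
        props.put("metric.reporters",
                  "org.apache.openwhisk.connector.kafka.KamonMetricsReporter");
        return props;
    }
}
```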

Once enabled, these metrics are pushed to Kamon. For the above config, the following metrics can be seen in Prometheus:

# TYPE consumer_fetch_manager_metrics_records_consumed_total counter
consumer_fetch_manager_metrics_records_consumed_total{client_id="consumer-completed0"} 2.0
consumer_fetch_manager_metrics_records_consumed_total{client_id="consumer-cacheInvalidation"} 0.0
consumer_fetch_manager_metrics_records_consumed_total{client_id="consumer-health"} 1007.0
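The Prometheus names above are derived from the Kafka metric group and metric name. A hedged sketch of that flattening (the actual mapping is performed by Kamon's Prometheus reporter, so treat this helper as illustrative, not the real implementation):

```java
public class MetricNameSketch {
    // Illustrative only: joins a Kafka metric group and metric name, then
    // replaces '-' with '_' to yield a Prometheus-compatible metric name.
    public static String prometheusName(String group, String name) {
        return (group + "_" + name).replace('-', '_');
    }
}
```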

Implementation

This PR takes a whitelist approach and does not publish all metrics by default, as Kafka tracks more than 300 metrics across the Producer and Consumer.

For counters, Kafka records two types of metrics: total and rate (see KIP-187 for details). We should therefore ignore metrics ending with rate and prefer the total metrics for Kamon tracking.
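The whitelist plus the rate/total rule can be sketched as follows (the set below mirrors the "names" list in the config example and is hypothetical, not the actual KamonMetricsReporter code):

```java
import java.util.Set;

public class MetricFilterSketch {
    // Hypothetical whitelist mirroring the "names" list in the config above.
    private static final Set<String> TRACKED =
        Set.of("records-lag-max", "records-consumed-total");

    // Track only whitelisted metrics, skipping "-rate" variants: Kafka
    // records counters as both total and rate pairs (KIP-187), and only
    // the total variant should be forwarded to Kamon.
    public static boolean shouldTrack(String metricName) {
        return TRACKED.contains(metricName) && !metricName.endsWith("-rate");
    }
}
```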

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@style95
Member

style95 commented May 15, 2019

@chetanmeh IMHO, we can collect these metrics out of OpenWhisk.
Is there any reason to collect them via OpenWhisk?

Similarly, I am not sure we would collect some metrics in CouchDB via OpenWhisk even if we want to see some CouchDB metrics as well.

@codecov-io

Codecov Report

Merging #4481 into master will decrease coverage by 3.54%.
The diff coverage is 2.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4481      +/-   ##
==========================================
- Coverage   83.99%   80.45%   -3.55%     
==========================================
  Files         170      171       +1     
  Lines        7940     7980      +40     
  Branches      536      532       -4     
==========================================
- Hits         6669     6420     -249     
- Misses       1271     1560     +289
Impacted Files Coverage Δ
...enwhisk/connector/kafka/KamonMetricsReporter.scala 0% <0%> (ø)
...whisk/connector/kafka/KafkaConsumerConnector.scala 60.29% <100%> (+5.07%) ⬆️
...core/database/cosmosdb/RxObservableImplicits.scala 0% <0%> (-100%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0% <0%> (-95.46%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0% <0%> (-92.67%) ⬇️
...whisk/core/database/cosmosdb/CosmosDBSupport.scala 0% <0%> (-84.62%) ⬇️
...abase/cosmosdb/CosmosDBArtifactStoreProvider.scala 4% <0%> (-52%) ⬇️
...in/scala/org/apache/openwhisk/common/Counter.scala 40% <0%> (-20%) ⬇️
...penwhisk/core/database/cosmosdb/CosmosDBUtil.scala 81.81% <0%> (-15.16%) ⬇️
...nwhisk/core/database/cosmosdb/CosmosDBConfig.scala 94.11% <0%> (-5.89%) ⬇️
... and 27 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4df3d09...fb18207. Read the comment docs.

@chetanmeh
Member Author

IMHO, we can collect these metrics out of OpenWhisk

@style95 Any pointers on that? So far I did not find an easy way to get a handle on client-side metrics in OpenWhisk. The only approach I saw was using some form of JMX exporter, which is tricky to set up. Hence this approach.

Note that this PR only enables client-side metrics, not broker-side ones.

@style95
Member

style95 commented May 15, 2019

@chetanmeh
I am using an in-house JMX metric collector to collect major Kafka metrics, including client metrics. It is a standalone Java application of about 300 lines of code.
I think it is one of the common and natural ways, as we can find a similar approach in many places, such as Prometheus:
https://github.com/prometheus/jmx_exporter

I meant no offense. I just want to discuss this with you.
I feel like we can take advantage of external components for this kind of case.
For example, we can delegate log collection to a logback or slf4j appender, or to standalone collection agents such as logstash.
I believe, the more we delegate such logic to external components, the more concise we can keep the core logic.

I have not looked into it deeply yet, but there is also a similar interesting approach.
https://github.com/Segence/kamon-jmx-collector

@selfxp
Contributor

selfxp commented May 16, 2019

@chetanmeh I personally think that this extension is very useful. We already have integration with Kamon and Kafka has been a black box for any type of deployment for a long time. Collecting client side metrics is definitely a step in the right direction.
But collecting the JMX Kafka metrics would also make a lot of sense, and that could be the "external" and "done in house" part that I think @style95 is referring to.

@chetanmeh
Member Author

@style95 Agreed that JMX exporters can be used. However, I find them tricky to set up, and each monitoring system has its own way of exporting (Datadog, Prometheus, etc.). In most cases it involves running another agent, which needs a custom Docker build, and for Prometheus it results in two HTTP servers exposing the metrics.

As OpenWhisk uses Kamon, which abstracts away all such integrations, I was looking for a way to route these metrics to Kamon so they can be sent to various monitoring systems in a uniform way.

Further, note that the proposed KamonMetricsReporter is optional (disabled by default) and introduces minimal overhead in terms of metrics collected and reported.

@rabbah
Member

rabbah commented May 16, 2019

Can someone mention this PR on the dev list?

@chetanmeh
Member Author

Did not get any feedback on the dev thread. Would it be OK to merge this with the feature disabled by default?

@rabbah
Member

rabbah commented May 28, 2019

@chetanmeh LGTM - is this redundant though (when turned on) with the other kafka metric that's emitted by the scheduled actor?

@chetanmeh
Member Author

is this redundant though (when turned on) with the other kafka metric that's emitted by the scheduled actor?

This exposes more metrics in addition to lag, so it complements what we have in the scheduled actor. Going forward we may even consider removing the logic in the scheduled actor and using an approach like the one here:

import java.lang.management.ManagementFactory
import javax.management.ObjectName

// Read the consumer's records-lag-max directly from its JMX MBean.
val server = ManagementFactory.getPlatformMBeanServer
val name = new ObjectName(s"kafka.consumer:type=consumer-fetch-manager-metrics,client-id=$id")
def consumerLag: Long = server.getAttribute(name, "records-lag-max").asInstanceOf[Double].toLong.max(0)

@rabbah
Member

rabbah commented May 28, 2019

i like it 👍

@chetanmeh chetanmeh merged commit 658516e into apache:master May 29, 2019
@chetanmeh chetanmeh deleted the kafka-metrics branch May 29, 2019 04:15
BillZong pushed a commit to BillZong/openwhisk that referenced this pull request Nov 18, 2019
Adds a configurable MetricsReporter to route Kafka metrics to Kamon once enabled. The set of metric names to capture needs to be explicitly configured.