/area monitoring
Low cardinality is a standard principle when designing metrics for Prometheus, and a key concept in monitoring in general. It is a key design principle in the latest standards too.
Although Prometheus has taken steps in the past to make things more flexible, current versions do not let you configure memory directly: you have to limit the number of ingested time series to tune memory implicitly. The memory consumption that Knative metrics impose is not low. One way to estimate the memory is to use this calculator .
Right now many of our metrics use the revision name, config name, pod name and namespace name as labels.
For example, to mention a few:
activator (request_latencies) and autoscaler (reconciler) time series have a complexity of: #histogram_buckets*#revision*#ns
webhook emits similar histogram metrics, and its series count depends on the number of kinds and namespaces.
To understand the scale: with 30 buckets (aggregated from several histograms), 100 services and 50 namespaces, this means 150K time series from one pod.
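The arithmetic behind that figure can be sketched as follows. This is only an illustration of the #histogram_buckets*#revision*#ns formula with the numbers quoted above; the function name is hypothetical, and it assumes one revision per service and that every revision count applies in every namespace.

```python
def series_per_pod(histogram_buckets: int, revisions: int, namespaces: int) -> int:
    """Worst-case series count for one pod: #histogram_buckets * #revision * #ns.

    Hypothetical helper for illustration only, not part of any Knative API.
    """
    return histogram_buckets * revisions * namespaces


# Numbers from the example above: 30 buckets, 100 services, 50 namespaces.
per_pod = series_per_pod(histogram_buckets=30, revisions=100, namespaces=50)
print(per_pod)  # 150000
```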
We have several pods, and this does not even include Eventing, where cardinality is high due to event_type, filter_type etc.
In the calculator above, 1M time series under specific assumptions need around 4GB of memory. Given the number of pods we run, we can easily reach that number. We already face this downstream.
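As a rough back-of-the-envelope check, the two figures above (150K series per pod, about 4GB per 1M series) can be combined. The sketch below assumes memory scales linearly with the number of series, which is a simplification and not something the calculator guarantees; all names are hypothetical.

```python
import math

SERIES_PER_POD = 150_000      # from the example above
GB_PER_MILLION_SERIES = 4.0   # calculator figure under its specific assumptions


def estimated_memory_gb(pods: int) -> float:
    """Estimate Prometheus memory, assuming linear scaling with series count."""
    total_series = pods * SERIES_PER_POD
    return total_series / 1_000_000 * GB_PER_MILLION_SERIES


# Number of pods at which the total series count reaches 1M.
pods_to_reach_1m = math.ceil(1_000_000 / SERIES_PER_POD)
print(pods_to_reach_1m)                       # 7
print(estimated_memory_gb(pods_to_reach_1m))  # about 4.2
```

Under these assumptions, seven pods emitting at this rate are enough to cross the 1M mark.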
Here is a sample status report for the top series on Prometheus when using 100 services:

Also note that we haven't taken into consideration the scenario where a pod name changes due to a restart (it can happen easily), which creates new series. A Prometheus instance is not meant to serve only Knative, so in general we should tune our metrics API. I propose we limit our labels to the namespace level, not per revision.
Logging, not metrics, should be used to understand the behavior of individual services. We also need to reconsider histograms for the webhook and controller cases, since buckets make cardinality explode.
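To give a sense of what the proposal would save, here is a small sketch comparing per-revision and per-namespace labelling with the same example numbers (30 buckets, 100 services, 50 namespaces). The helper is hypothetical, and the result is an upper bound that assumes every label combination actually occurs.

```python
def series_count(buckets: int, *label_cardinalities: int) -> int:
    """Upper bound on series: buckets times the product of label cardinalities."""
    total = buckets
    for cardinality in label_cardinalities:
        total *= cardinality
    return total


# Current labelling: revision and namespace labels on each histogram.
per_revision = series_count(30, 100, 50)
# Proposed labelling: namespace label only.
per_namespace = series_count(30, 50)

print(per_revision)                   # 150000
print(per_namespace)                  # 1500
print(per_revision // per_namespace)  # 100
```

Dropping the revision label shrinks the bound by a factor equal to the number of revisions, here 100x.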
What version of Knative?
All versions
Expected Behavior
Metrics should have low cardinality.
Actual Behavior
An excessive number of time series is created.
Steps to Reproduce the Problem
Create a moderate number of namespaces and ksvcs.
/cc @evankanderson @mattmoor @markusthoemmes