From 9fb473a7d66761374f1fe4670eed0b094c295f3b Mon Sep 17 00:00:00 2001 From: Stavros Kontopoulos Date: Mon, 19 Apr 2021 18:02:29 +0300 Subject: [PATCH 1/5] add metrics apid docs --- docs/_index.md | 6 +++ docs/eventing/_index.md | 4 ++ docs/eventing/metrics.md | 52 ++++++++++++++++++ docs/serving/_index.md | 4 ++ docs/serving/metrics.md | 110 +++++++++++++++++++++++++++++++++++++++ 5 files changed, 176 insertions(+) create mode 100644 docs/eventing/metrics.md create mode 100644 docs/serving/metrics.md diff --git a/docs/_index.md b/docs/_index.md index 825daa11f8c..4d44a76039c 100755 --- a/docs/_index.md +++ b/docs/_index.md @@ -51,6 +51,12 @@ These components are delivered as Kubernetes custom resource definitions (CRDs), - [All samples for serving](./serving/samples/) - [All samples for eventing](./eventing/samples/) +### Observability + +- [Serving Metrics API](./serving/metrics/) +- [Eventing Metrics API](./eventing/metrics/) +- [Collecting metrics](./install/collecting-metrics) + ### Debugging - [Debugging application issues](./serving/debugging-application-issues/) diff --git a/docs/eventing/_index.md b/docs/eventing/_index.md index 268cc8fb783..1cb62cfb055 100644 --- a/docs/eventing/_index.md +++ b/docs/eventing/_index.md @@ -98,3 +98,7 @@ resources: 1. **[Sequence](./flows/sequence)** provides a way to define an in-order list of functions. 1. **[Parallel](./flows/parallel)** provides a way to define a list of branches for events. + +## Observability + +- [Eventing Metrics API](./metrics.md) diff --git a/docs/eventing/metrics.md b/docs/eventing/metrics.md new file mode 100644 index 00000000000..82ed44fd4eb --- /dev/null +++ b/docs/eventing/metrics.md @@ -0,0 +1,52 @@ +--- +title: "Metrics API" +weight: 99 +type: "docs" +--- + +## Eventing sources + +Every source exposes by default a number of metrics to help user monitor events dispatched. 
Use the following metrics +to verify that events have been delivered from the source side, thus verifying that the source and any connection with the source work as expected. + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| event_count | Number of events sent by the source | Counter | event_source
event_type
name
namespace_name
resource_group
response_code
response_code_class
response_error
response_timeout | Dimensionless | Stable | +| retry_event_count | Number of events sent by the source in retries | Counter | event_source
event_type
name
namespace_name
resource_group
response_code
response_code_class
response_error
response_timeout | Dimensionless | Stable + +## Broker + +### Ingress + +Use the following metrics to debug how broker ingress performs and what events are dispatched via the ingress component. +By aggregating the metrics over the http code, events can be separated into two classes, successful (2xx) and failed events (5xx). + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| event_count | Number of events received by a Broker | Counter | broker_name
event_type
namespace_name
response_code
response_code_class
unique_name | Dimensionless | Stable +| event_dispatch_latencies | The time spent dispatching an event to a Channel | Histogram | broker_name
event_type
namespace_name
response_code
response_code_class
unique_name | Milliseconds | Stable + +### Filter + +Use the following metrics to debug how broker filter performs and what events are dispatched via the filter component. +Users can also measure the latency of the actual filtering action on an event. +By aggregating the metrics over the http code, events can be separated into two classes, successful (2xx) and failed events (5xx). + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| event_count | Number of events received by a Broker | Counter | broker_name
container_name=
filter_type
namespace_name
response_code
response_code_class
trigger_name
unique_name | Dimensionless | Stable +| event_dispatch_latencies | The time spent dispatching an event to a Channel | Histogram | broker_name
container_name
filter_type
namespace_name
response_code
response_code_class
trigger_name
unique_name | Milliseconds | Stable +| event_processing_latencies | The time spent processing an event before it is dispatched to a Trigger subscriber | Histogram | broker_name
container_name
filter_type
namespace_name
trigger_name
unique_name | Milliseconds | Stable + +## In-memory Dispatcher + +In-memory channel can be evaluated via the following metrics. +By aggregating the metrics over the http code, events can be separated into two classes, successful (2xx) and failed events (5xx). + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| event_count | Number of events dispatched by the in-memory channel | Counter | container_name
event_type=
namespace_name=
response_code
response_code_class
unique_name | Dimensionless | Stable +| event_dispatch_latencies | The time spent dispatching an event from a in-memory Channel | Histogram | container_name
event_type
namespace_name=
response_code
response_code_class
unique_name | Milliseconds | Stable + + +Note: A number of metrics eg. controller, Go runtime and others are omitted here as they are common across most components. For more about these metrics check the [Serving metrics API section](../serving/metrics/). diff --git a/docs/serving/_index.md b/docs/serving/_index.md index 00b98539143..14fd9ca5830 100644 --- a/docs/serving/_index.md +++ b/docs/serving/_index.md @@ -77,6 +77,10 @@ in the Knative Serving repository. - [Assigning a static IP address for Knative on Google Kubernetes Engine](./gke-assigning-static-ip-address.md) - [Using subroutes](./using-subroutes.md) +## Observability + +- [Serving Metrics API](./metrics.md) + ## Known Issues See the [Knative Serving Issues](https://github.com/knative/serving/issues) page diff --git a/docs/serving/metrics.md b/docs/serving/metrics.md new file mode 100644 index 00000000000..62a627cf725 --- /dev/null +++ b/docs/serving/metrics.md @@ -0,0 +1,110 @@ +--- +title: "Metrics API" +weight: 99 +type: "docs" +--- + +### Activator + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| request_concurrency | Concurrent requests that are routed to Activator
These are requests reported by the concurrency reporter which may not be done yet.
This is the average concurrency over a reporting period | Gauge | configuration_name
container_name
namespace_name
pod_name
revision_name
service_name | Dimensionless | Stable | +| request_count | The number of requests that are routed to Activator.
These are requests that have been fulfilled from the activator handler. | Counter | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Dimensionless | Stable | +| request_latencies | The response time in milliseconds for the fulfilled routed requests | Histogram | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Milliseconds | Stable | + +### Autoscaler + +Generic + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| desired_pods | Number of pods autoscaler wants to allocate | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | +| excess_burst_capacity | Excess burst capacity observed over the stable window | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | +| stable_request_concurrency | Average of requests count per observed pod over the stable window | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | +| panic_request_concurrency | Average of requests count per observed pod over the panic window | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | +| target_concurrency_per_pod | The desired number of concurrent requests for each pod | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | +| stable_requests_per_second | Average requests-per-second per observed pod over the stable window | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | +| panic_requests_per_second | Average requests-per-second per observed pod over the panic window | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | +| target_requests_per_second | The desired requests-per-second for each pod | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | +| panic_mode | 1 if autoscaler is in panic mode, 0 otherwise | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | + +KPA + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| requested_pods | Number of pods autoscaler requested from Kubernetes | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | +| actual_pods | Number of pods that are allocated currently | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | +| not_ready_pods | Number of pods that are not ready currently | Gauge | configuration_name=
namespace_name=
revision_name
service_name | Dimensionless | Stable | +| pending_pods | Number of pods that are pending currently | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | +| terminating_pods | Number of pods that are terminating currently | Gauge | configuration_name
namespace_name
revision_name
service_name
| Dimensionless | Stable | + +### QUEUE proxy + +Requests endpoint + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| revision_request_count | The number of requests that are routed to queue-proxy | Counter | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Dimensionless | Stable | +| revision_request_latencies | The response time in milliseconds | Histogram | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Milliseconds | Stable | +| revision_app_request_count | The number of requests that are routed to user-container | Counter | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Dimensionless | Stable | +| revision_app_request_latencies | The response time in milliseconds | Histogram | configuration_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Milliseconds | Stable | +| revision_queue_depth | The current number of items in the serving and waiting queue, or not reported if unlimited concurrency | Gauge | configuration_name
event-display
container_name
namespace_name
pod_name
response_code_class
revision_name
service_name | Dimensionless | Stable | + + +### Controller + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| work_queue_depth | Depth of the work queue | Gauge | reconciler | Dimensionless | Stable | +| reconcile_count | Number of reconcile operations | Counter | reconciler
success
| Dimensionless | Stable | +| reconcile_latency | Latency of reconcile operations | Histogram | reconciler
success
| Milliseconds | Stable | +| workqueue_adds_total | Total number of adds handled by workqueue | Counter | name | Dimensionless | Stable | +| workqueue_depth | Current depth of workqueue | Gauge | reconciler | Dimensionless | Stable | +| workqueue_queue_latency_seconds | How long in seconds an item stays in workqueue before being requested | Histogram | name | Seconds | Stable | +| workqueue_retries_total | Total number of retries handled by workqueue | Counter | name | Dimensionless | Stable | +| workqueue_work_duration_seconds | How long in seconds processing an item from a workqueue takes. | Histogram | name | Seconds| Stable | +| workqueue_unfinished_work_seconds | How long in seconds the outstanding workqueue items have been in flight (total). | Histogram | name | Seconds | Stable | +| workqueue_longest_running_processor_seconds | How long in seconds the longest outstanding workqueue item has been in flight | Histogram | name | Seconds | Stable | + +### Webhook + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| request_count | The number of requests that are routed to webhook | Counter | admission_allowed
kind_group
kind_kind
kind_version
request_operation
resource_group
resource_namespace
resource_resource
resource_version | Dimensionless | Stable | +| request_latencies | The response time in milliseconds | Histogram | admission_allowed
kind_group
kind_kind
kind_version
request_operation
resource_group
resource_namespace
resource_resource
resource_version | Milliseconds | Stable | + +### Go Runtime - memstats + +Each process emits a number of memory statistics from the go runtime. + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| go_alloc | The number of bytes of allocated heap objects | Gauge | name | Dimensionless | Stable | +| go_total_alloc | The cumulative bytes allocated for heap objects | Gauge | name | Dimensionless | Stable | +| go_sys | The total bytes of memory obtained from the OS | Gauge | name | Dimensionless | Stable | +| go_lookups | The number of pointer lookups performed by the runtime | Gauge | name | Dimensionless | Stable | +| go_mallocs | The cumulative count of heap objects allocated | Gauge | name | Dimensionless | Stable | +| go_frees | The cumulative count of heap objects freed | Gauge | name | Dimensionless | Stable | +| go_heap_alloc | The number of bytes of allocated heap objects | Gauge | name | Dimensionless | Stable | +| go_heap_sys | The number of bytes of heap memory obtained from the OS | Gauge | name | Dimensionless | Stable | +| go_heap_idle | The number of bytes in idle (unused) spans | Gauge | name | Dimensionless | Stable | +| go_heap_in_use | The number of bytes in in-use spans | Gauge | name | Dimensionless | Stable | +| go_heap_released | The number of bytes of physical memory returned to the OS | Gauge | name | Dimensionless | Stable | +| go_heap_objects | The number of allocated heap objects | Gauge | name | Dimensionless | Stable | +| go_stack_in_use | The number of bytes in stack spans | Gauge | name | Dimensionless | Stable | +| go_stack_sys | The number of bytes of stack memory obtained from the OS | Gauge | name | Dimensionless | Stable | +| go_mspan_in_use | The number of bytes of allocated mspan structures | Gauge | name | Dimensionless | Stable | +| go_mspan_sys | The number of bytes of memory obtained from the OS for mspan structures | Gauge | name | Dimensionless | Stable | +| go_mcache_in_use | The number of 
bytes of allocated mcache structures | Gauge | name | Dimensionless | Stable | +| go_mcache_sys | The number of bytes of memory obtained from the OS for mcache structures | Gauge | name | Dimensionless | Stable | +| go_bucket_hash_sys | The number of bytes of memory in profiling bucket hash tables. | Gauge | name | Dimensionless | Stable | +| go_gc_sys | The number of bytes of memory in garbage collection metadata | Gauge | name | Dimensionless | Stable | +| go_other_sys | The number of bytes of memory in miscellaneous off-heap runtime allocations | Gauge | name | Dimensionless | Stable | +| go_next_gc | The target heap size of the next GC cycle | Gauge | name | Dimensionless | Stable | +| go_last_gc | The time the last garbage collection finished, as nanoseconds since 1970 (the UNIX epoch) | Gauge | name | Nanoseconds | Stable | +| go_total_gc_pause_ns | The cumulative nanoseconds in GC stop-the-world pauses since the program started | Gauge | name | Nanoseconds | Stable | +| go_num_gc | The number of completed GC cycles. | Gauge | name | Dimensionless | Stable | +| go_num_forced_gc | The number of GC cycles that were forced by the application calling the GC function. | Gauge | name | Dimensionless | Stable | +| go_gc_cpu_fraction | The fraction of this program's available CPU time used by the GC since the program started | Gauge | name | Dimensionless | Stable | + +Note: name tag is empty. From a9f6c4a16dc41d8efb94107bb4098dc19a4d54f6 Mon Sep 17 00:00:00 2001 From: Stavros Kontopoulos Date: Mon, 19 Apr 2021 18:10:07 +0300 Subject: [PATCH 2/5] remove whitespace --- docs/serving/metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/serving/metrics.md b/docs/serving/metrics.md index 62a627cf725..8a055df9bd7 100644 --- a/docs/serving/metrics.md +++ b/docs/serving/metrics.md @@ -12,7 +12,7 @@ type: "docs" | request_count | The number of requests that are routed to Activator.
These are requests that have been fulfilled from the activator handler. | Counter | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Dimensionless | Stable | | request_latencies | The response time in milliseconds for the fulfilled routed requests | Histogram | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Milliseconds | Stable | -### Autoscaler +### Autoscaler Generic From b211d366e33f74ca0601435b0c9c2cfc765e0ec7 Mon Sep 17 00:00:00 2001 From: Stavros Kontopoulos Date: Tue, 20 Apr 2021 20:41:09 +0300 Subject: [PATCH 3/5] add notes --- docs/eventing/metrics.md | 7 ++++++- docs/serving/metrics.md | 19 ++++++++++++------- 2 files changed, 18 insertions(+), 8 deletions(-) diff --git a/docs/eventing/metrics.md b/docs/eventing/metrics.md index 82ed44fd4eb..2f2b789495d 100644 --- a/docs/eventing/metrics.md +++ b/docs/eventing/metrics.md @@ -4,6 +4,11 @@ weight: 99 type: "docs" --- +
+ +**NOTE:** The metrics API may change in the future, this serves as a snapshot of the current metrics. + + ## Eventing sources Every source exposes by default a number of metrics to help user monitor events dispatched. Use the following metrics @@ -49,4 +54,4 @@ By aggregating the metrics over the http code, events can be separated into two | event_dispatch_latencies | The time spent dispatching an event from a in-memory Channel | Histogram | container_name
event_type
namespace_name=
response_code
response_code_class
unique_name | Milliseconds | Stable -Note: A number of metrics eg. controller, Go runtime and others are omitted here as they are common across most components. For more about these metrics check the [Serving metrics API section](../serving/metrics/). +**NOTE:** A number of metrics eg. controller, Go runtime and others are omitted here as they are common across most components. For more about these metrics check the [Serving metrics API section](../serving/metrics.md#controller). diff --git a/docs/serving/metrics.md b/docs/serving/metrics.md index 8a055df9bd7..26276c86881 100644 --- a/docs/serving/metrics.md +++ b/docs/serving/metrics.md @@ -4,7 +4,12 @@ weight: 99 type: "docs" --- -### Activator +
+ +**NOTE:** The metrics API may change in the future, this serves as a snapshot of the current metrics. + + +## Activator | Metric Name | Description | Type | Tags | Unit | Status | |:-|:-|:-|:-|:-|:-| @@ -12,7 +17,7 @@ type: "docs" | request_count | The number of requests that are routed to Activator.
These are requests that have been fulfilled from the activator handler. | Counter | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Dimensionless | Stable | | request_latencies | The response time in milliseconds for the fulfilled routed requests | Histogram | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Milliseconds | Stable | -### Autoscaler +## Autoscaler Generic @@ -38,7 +43,7 @@ KPA | pending_pods | Number of pods that are pending currently | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | | terminating_pods | Number of pods that are terminating currently | Gauge | configuration_name
namespace_name
revision_name
service_name
| Dimensionless | Stable | -### QUEUE proxy +## Queue proxy Requests endpoint @@ -51,7 +56,7 @@ Requests endpoint | revision_queue_depth | The current number of items in the serving and waiting queue, or not reported if unlimited concurrency | Gauge | configuration_name
event-display
container_name
namespace_name
pod_name
response_code_class
revision_name
service_name | Dimensionless | Stable | -### Controller +## Controller | Metric Name | Description | Type | Tags | Unit | Status | |:-|:-|:-|:-|:-|:-| @@ -66,14 +71,14 @@ Requests endpoint | workqueue_unfinished_work_seconds | How long in seconds the outstanding workqueue items have been in flight (total). | Histogram | name | Seconds | Stable | | workqueue_longest_running_processor_seconds | How long in seconds the longest outstanding workqueue item has been in flight | Histogram | name | Seconds | Stable | -### Webhook +## Webhook | Metric Name | Description | Type | Tags | Unit | Status | |:-|:-|:-|:-|:-|:-| | request_count | The number of requests that are routed to webhook | Counter | admission_allowed
kind_group
kind_kind
kind_version
request_operation
resource_group
resource_namespace
resource_resource
resource_version | Dimensionless | Stable | | request_latencies | The response time in milliseconds | Histogram | admission_allowed
kind_group
kind_kind
kind_version
request_operation
resource_group
resource_namespace
resource_resource
resource_version | Milliseconds | Stable | -### Go Runtime - memstats +## Go Runtime - memstats Each process emits a number of memory statistics from the go runtime. @@ -107,4 +112,4 @@ Each process emits a number of memory statistics from the go runtime. | go_num_forced_gc | The number of GC cycles that were forced by the application calling the GC function. | Gauge | name | Dimensionless | Stable | | go_gc_cpu_fraction | The fraction of this program's available CPU time used by the GC since the program started | Gauge | name | Dimensionless | Stable | -Note: name tag is empty. +**NOTE:** name tag is empty. From 0cdaf682d9a227494209e9ce4ee3e2c960014063 Mon Sep 17 00:00:00 2001 From: Stavros Kontopoulos Date: Wed, 21 Apr 2021 19:17:24 +0300 Subject: [PATCH 4/5] separate to admin-dev --- docs/eventing/metrics.md | 34 ++++++++++-------- docs/serving/metrics.md | 74 +++++++++++++++++++++++++--------------- 2 files changed, 66 insertions(+), 42 deletions(-) diff --git a/docs/eventing/metrics.md b/docs/eventing/metrics.md index 2f2b789495d..94886b43ede 100644 --- a/docs/eventing/metrics.md +++ b/docs/eventing/metrics.md @@ -8,20 +8,12 @@ type: "docs" **NOTE:** The metrics API may change in the future, this serves as a snapshot of the current metrics. +## Admin -## Eventing sources +Administrators can monitor Eventing based on the metrics exposed by each Eventing component. +Metrics are listed next. -Every source exposes by default a number of metrics to help user monitor events dispatched. Use the following metrics -to verify that events have been delivered from the source side, thus verifying that the source and any connection with the source work as expected. - -| Metric Name | Description | Type | Tags | Unit | Status | -|:-|:-|:-|:-|:-|:-| -| event_count | Number of events sent by the source | Counter | event_source
event_type
name
namespace_name
resource_group
response_code
response_code_class
response_error
response_timeout | Dimensionless | Stable | -| retry_event_count | Number of events sent by the source in retries | Counter | event_source
event_type
name
namespace_name
resource_group
response_code
response_code_class
response_error
response_timeout | Dimensionless | Stable - -## Broker - -### Ingress +### Broker - Ingress Use the following metrics to debug how broker ingress performs and what events are dispatched via the ingress component. By aggregating the metrics over the http code, events can be separated into two classes, successful (2xx) and failed events (5xx). @@ -31,7 +23,7 @@ By aggregating the metrics over the http code, events can be separated into two | event_count | Number of events received by a Broker | Counter | broker_name
event_type
namespace_name
response_code
response_code_class
unique_name | Dimensionless | Stable | event_dispatch_latencies | The time spent dispatching an event to a Channel | Histogram | broker_name
event_type
namespace_name
response_code
response_code_class
unique_name | Milliseconds | Stable -### Filter +### Broker - Filter Use the following metrics to debug how broker filter performs and what events are dispatched via the filter component. Users can also measure the latency of the actual filtering action on an event. @@ -43,7 +35,7 @@ By aggregating the metrics over the http code, events can be separated into two | event_dispatch_latencies | The time spent dispatching an event to a Channel | Histogram | broker_name
container_name
filter_type
namespace_name
response_code
response_code_class
trigger_name
unique_name | Milliseconds | Stable | event_processing_latencies | The time spent processing an event before it is dispatched to a Trigger subscriber | Histogram | broker_name
container_name
filter_type
namespace_name
trigger_name
unique_name | Milliseconds | Stable -## In-memory Dispatcher +### In-memory Dispatcher In-memory channel can be evaluated via the following metrics. By aggregating the metrics over the http code, events can be separated into two classes, successful (2xx) and failed events (5xx). @@ -55,3 +47,17 @@ By aggregating the metrics over the http code, events can be separated into two **NOTE:** A number of metrics eg. controller, Go runtime and others are omitted here as they are common across most components. For more about these metrics check the [Serving metrics API section](../serving/metrics.md#controller). + + +## Developer metrics + +### Eventing sources + +Eventing sources are created by users so they can trigger their applications with events. +Every source exposes by default a number of metrics to help user monitor events dispatched. Use the following metrics +to verify that events have been delivered from the source side, thus verifying that the source and any connection with the source work as expected. + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| event_count | Number of events sent by the source | Counter | event_source
event_type
name
namespace_name
resource_group
response_code
response_code_class
response_error
response_timeout | Dimensionless | Stable | +| retry_event_count | Number of events sent by the source in retries | Counter | event_source
event_type
name
namespace_name
resource_group
response_code
response_code_class
response_error
response_timeout | Dimensionless | Stable diff --git a/docs/serving/metrics.md b/docs/serving/metrics.md index 26276c86881..b70b82dcf55 100644 --- a/docs/serving/metrics.md +++ b/docs/serving/metrics.md @@ -7,20 +7,30 @@ type: "docs"
**NOTE:** The metrics API may change in the future, this serves as a snapshot of the current metrics. +
+ +## Admin +Administrators can monitor the Serving control plane based on the metrics exposed by each Serving component. +Metrics are listed next. -## Activator +### Activator +The following metrics allow the user to understand how an application responds when traffic goes through the activator, e.g. when scaling from zero. For example, high request latency means that requests are taking too much time to be fulfilled. +
| Metric Name | Description | Type | Tags | Unit | Status | |:-|:-|:-|:-|:-|:-| | request_concurrency | Concurrent requests that are routed to Activator
These are requests reported by the concurrency reporter which may not be done yet.
This is the average concurrency over a reporting period | Gauge | configuration_name
container_name
namespace_name
pod_name
revision_name
service_name | Dimensionless | Stable | | request_count | The number of requests that are routed to Activator.
These are requests that have been fulfilled from the activator handler. | Counter | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Dimensionless | Stable | | request_latencies | The response time in milliseconds for the fulfilled routed requests | Histogram | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Milliseconds | Stable | -## Autoscaler - -Generic - +### Autoscaler +The Autoscaler component exposes a number of metrics related to its decisions per revision. +For example, at any given time a user can monitor the desired pods the Autoscaler wants to allocate for +a service, the average number of requests per second during the stable window, whether the autoscaler is in panic mode (KPA), etc. +For more about how the autoscaler works, see the [scaling system documentation](https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md). +
| Metric Name | Description | Type | Tags | Unit | Status | |:-|:-|:-|:-|:-|:-| | desired_pods | Number of pods autoscaler wants to allocate | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | @@ -32,31 +42,17 @@ Generic | panic_requests_per_second | Average requests-per-second per observed pod over the panic window | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | | target_requests_per_second | The desired requests-per-second for each pod | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | | panic_mode | 1 if autoscaler is in panic mode, 0 otherwise | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | - -KPA - -| Metric Name | Description | Type | Tags | Unit | Status | -|:-|:-|:-|:-|:-|:-| | requested_pods | Number of pods autoscaler requested from Kubernetes | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | -| actual_pods | Number of pods that are allocated currently | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | +| actual_pods | Number of pods that are allocated currently in ready state | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | | not_ready_pods | Number of pods that are not ready currently | Gauge | configuration_name=
namespace_name=
revision_name
service_name | Dimensionless | Stable | | pending_pods | Number of pods that are pending currently | Gauge | configuration_name
namespace_name
revision_name
service_name | Dimensionless | Stable | | terminating_pods | Number of pods that are terminating currently | Gauge | configuration_name
namespace_name
revision_name
service_name
| Dimensionless | Stable | -## Queue proxy - -Requests endpoint - -| Metric Name | Description | Type | Tags | Unit | Status | -|:-|:-|:-|:-|:-|:-| -| revision_request_count | The number of requests that are routed to queue-proxy | Counter | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Dimensionless | Stable | -| revision_request_latencies | The response time in milliseconds | Histogram | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Milliseconds | Stable | -| revision_app_request_count | The number of requests that are routed to user-container | Counter | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Dimensionless | Stable | -| revision_app_request_latencies | The response time in milliseconds | Histogram | configuration_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Milliseconds | Stable | -| revision_queue_depth | The current number of items in the serving and waiting queue, or not reported if unlimited concurrency | Gauge | configuration_name
event-display
container_name
namespace_name
pod_name
response_code_class
revision_name
service_name | Dimensionless | Stable | - +### Controller -## Controller +The following metrics are emitted by any component that implements controller logic. +The metrics show details about reconciliation operations and the behavior of the workqueue on which +reconciliation requests are enqueued. | Metric Name | Description | Type | Tags | Unit | Status | |:-|:-|:-|:-|:-|:-| @@ -71,20 +67,24 @@ Requests endpoint | workqueue_unfinished_work_seconds | How long in seconds the outstanding workqueue items have been in flight (total). | Histogram | name | Seconds | Stable | | workqueue_longest_running_processor_seconds | How long in seconds the longest outstanding workqueue item has been in flight | Histogram | name | Seconds | Stable | -## Webhook +### Webhook +Webhook metrics report useful information about operations (e.g. CREATE) on Serving resources and whether admission was allowed. +For example, if a large number of operations fail, this could indicate an issue with the submitted user resource. +
| Metric Name | Description | Type | Tags | Unit | Status | |:-|:-|:-|:-|:-|:-| | request_count | The number of requests that are routed to webhook | Counter | admission_allowed
kind_group
kind_kind
kind_version
request_operation
resource_group
resource_namespace
resource_resource
resource_version | Dimensionless | Stable | | request_latencies | The response time in milliseconds | Histogram | admission_allowed
kind_group
kind_kind
kind_version
request_operation
resource_group
resource_namespace
resource_resource
resource_version | Milliseconds | Stable | -## Go Runtime - memstats - -Each process emits a number of memory statistics from the go runtime. +### Go Runtime - memstats +Each Knative Serving control plane process emits a number of Go runtime [memory statistics](https://golang.org/pkg/runtime/#MemStats) (shown next). +As a baseline for monitoring purposes, users could start with a subset of the metrics: current allocations (go_alloc), total allocations (go_total_alloc), system memory (go_sys), mallocs (go_mallocs), frees (go_frees), total garbage collection pause time (total_gc_pause_ns), next GC target heap size (go_next_gc), and number of garbage collection cycles (num_gc). +
| Metric Name | Description | Type | Tags | Unit | Status | |:-|:-|:-|:-|:-|:-| -| go_alloc | The number of bytes of allocated heap objects | Gauge | name | Dimensionless | Stable | +| go_alloc | The number of bytes of allocated heap objects (same as heap_alloc) | Gauge | name | Dimensionless | Stable | | go_total_alloc | The cumulative bytes allocated for heap objects | Gauge | name | Dimensionless | Stable | | go_sys | The total bytes of memory obtained from the OS | Gauge | name | Dimensionless | Stable | | go_lookups | The number of pointer lookups performed by the runtime | Gauge | name | Dimensionless | Stable | @@ -113,3 +113,21 @@ Each process emits a number of memory statistics from the go runtime. | go_gc_cpu_fraction | The fraction of this program's available CPU time used by the GC since the program started | Gauge | name | Dimensionless | Stable | **NOTE:** name tag is empty. + +## Developer - User Services + +Every Knative service has a proxy container that proxies the connections to the application container. +A number of metrics are reported for queue proxy performance. Using the following metrics, application +developers, DevOps engineers, and others can measure whether requests are queued at the proxy side (indicating a need for backpressure) and what the actual delay is in serving requests at the application side. + +### Queue proxy + +Requests endpoint + +| Metric Name | Description | Type | Tags | Unit | Status | +|:-|:-|:-|:-|:-|:-| +| revision_request_count | The number of requests that are routed to queue-proxy | Counter | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Dimensionless | Stable | +| revision_request_latencies | The response time in milliseconds | Histogram | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Milliseconds | Stable | +| revision_app_request_count | The number of requests that are routed to user-container | Counter | configuration_name
container_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Dimensionless | Stable | +| revision_app_request_latencies | The response time in milliseconds | Histogram | configuration_name
namespace_name
pod_name
response_code
response_code_class
revision_name
service_name | Milliseconds | Stable | +| revision_queue_depth | The current number of items in the serving and waiting queue, or not reported if unlimited concurrency | Gauge | configuration_name
event-display
container_name
namespace_name
pod_name
response_code_class
revision_name
service_name | Dimensionless | Stable | From cf758cec811d2e193f7e25f64d667cba9c34ad44 Mon Sep 17 00:00:00 2001 From: Stavros Kontopoulos Date: Wed, 21 Apr 2021 19:24:53 +0300 Subject: [PATCH 5/5] fix eventing --- docs/eventing/metrics.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/eventing/metrics.md b/docs/eventing/metrics.md index 94886b43ede..24bd99ed2be 100644 --- a/docs/eventing/metrics.md +++ b/docs/eventing/metrics.md @@ -49,11 +49,9 @@ By aggregating the metrics over the http code, events can be separated into two **NOTE:** A number of metrics eg. controller, Go runtime and others are omitted here as they are common across most components. For more about these metrics check the [Serving metrics API section](../serving/metrics.md#controller). -## Developer metrics - ### Eventing sources -Eventing sources are created by users so they can trigger their applications with events. +Eventing sources are created by users who own the related system, so they can trigger applications with events. Every source exposes by default a number of metrics to help user monitor events dispatched. Use the following metrics to verify that events have been delivered from the source side, thus verifying that the source and any connection with the source work as expected.