Add metrics api docs #3478
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| --- | ||
| title: "Metrics API" | ||
| weight: 99 | ||
| type: "docs" | ||
| --- | ||
|
|
||
| <br> | ||
|
|
||
| **NOTE:** The metrics API may change in the future; this page is a snapshot of the current metrics. | ||
|
|
||
| ## Admin | ||
|
|
||
| Administrators can monitor Eventing based on the metrics exposed by each Eventing component. | ||
| Metrics are listed next. | ||
|
|
||
| ### Broker - Ingress | ||
|
|
||
| Use the following metrics to debug how the broker ingress performs and which events are dispatched via the ingress component. | ||
| By aggregating the metrics over the http code, events can be separated into two classes, successful (2xx) and failed events (5xx). | ||
|
|
||
| | Metric Name | Description | Type | Tags | Unit | Status | | ||
| |:-|:-|:-|:-|:-|:-| | ||
| | event_count | Number of events received by a Broker | Counter | broker_name<br>event_type<br>namespace_name<br>response_code<br>response_code_class<br>unique_name | Dimensionless | Stable | ||
| | event_dispatch_latencies | The time spent dispatching an event to a Channel | Histogram | broker_name<br>event_type<br>namespace_name<br>response_code<br>response_code_class<br>unique_name | Milliseconds | Stable | ||
|
|
||
| ### Broker - Filter | ||
|
|
||
| Use the following metrics to debug how broker filter performs and what events are dispatched via the filter component. | ||
| Users can also measure the latency of the actual filtering action on an event. | ||
| By aggregating the metrics over the http code, events can be separated into two classes, successful (2xx) and failed events (5xx). | ||
|
|
||
| | Metric Name | Description | Type | Tags | Unit | Status | | ||
| |:-|:-|:-|:-|:-|:-| | ||
| | event_count | Number of events received by a Broker | Counter | broker_name<br>container_name<br>filter_type<br>namespace_name<br>response_code<br>response_code_class<br>trigger_name<br>unique_name | Dimensionless | Stable | ||
| | event_dispatch_latencies | The time spent dispatching an event to a Channel | Histogram | broker_name<br>container_name<br>filter_type<br>namespace_name<br>response_code<br>response_code_class<br>trigger_name<br>unique_name | Milliseconds | Stable | ||
| | event_processing_latencies | The time spent processing an event before it is dispatched to a Trigger subscriber | Histogram | broker_name<br>container_name<br>filter_type<br>namespace_name<br>trigger_name<br>unique_name | Milliseconds | Stable | ||
|
|
||
| ### In-memory Dispatcher | ||
|
|
||
| The in-memory channel can be evaluated using the following metrics. | ||
| By aggregating the metrics over the http code, events can be separated into two classes, successful (2xx) and failed events (5xx). | ||
|
|
||
| | Metric Name | Description | Type | Tags | Unit | Status | | ||
| |:-|:-|:-|:-|:-|:-| | ||
| | event_count | Number of events dispatched by the in-memory channel | Counter | container_name<br>event_type<br>namespace_name<br>response_code<br>response_code_class<br>unique_name | Dimensionless | Stable | ||
| | event_dispatch_latencies | The time spent dispatching an event from an in-memory Channel | Histogram | container_name<br>event_type<br>namespace_name<br>response_code<br>response_code_class<br>unique_name | Milliseconds | Stable | ||
|
|
||
|
|
||
| **NOTE:** A number of metrics, e.g. controller and Go runtime metrics, are omitted here because they are common across most components. For more about these metrics, see the [Serving metrics API section](../serving/metrics.md#controller). | ||
|
|
||
|
|
||
| ### Eventing sources | ||
|
|
||
| Eventing sources are created by users who own the related system, so that they can trigger applications with events. | ||
| By default, every source exposes a number of metrics to help users monitor the events it dispatches. Use the following metrics | ||
| to verify that events have been delivered from the source side, and therefore that the source and any connection to it work as expected. | ||
|
|
||
| | Metric Name | Description | Type | Tags | Unit | Status | | ||
| |:-|:-|:-|:-|:-|:-| | ||
| | event_count | Number of events sent by the source | Counter | event_source<br>event_type<br>name<br>namespace_name<br>resource_group<br>response_code<br>response_code_class<br>response_error<br>response_timeout | Dimensionless | Stable | | ||
| | retry_event_count | Number of events sent by the source in retries | Counter | event_source<br>event_type<br>name<br>namespace_name<br>resource_group<br>response_code<br>response_code_class<br>response_error<br>response_timeout | Dimensionless | Stable | ||
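The Broker and in-memory channel sections above note that aggregating over the HTTP code separates events into successful (2xx) and failed (5xx) classes. A minimal Go sketch of that aggregation follows. The `Sample` type, the helper names and the sample values are hypothetical, and the `2xx`-style class string is assumed to match the values of the `response_code_class` tag; in practice this grouping would normally be done in the monitoring backend rather than in application code.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a hypothetical, simplified view of one event_count time series:
// the value of its response_code tag and the current counter value.
type Sample struct {
	ResponseCode int
	Count        float64
}

// codeClass maps an HTTP status code to a class string such as "2xx".
// The exact format of the response_code_class tag is an assumption here.
func codeClass(code int) string {
	return fmt.Sprintf("%dxx", code/100)
}

// byClass sums the counter values of all samples per response code class.
func byClass(samples []Sample) map[string]float64 {
	totals := make(map[string]float64)
	for _, s := range samples {
		totals[codeClass(s.ResponseCode)] += s.Count
	}
	return totals
}

func main() {
	// Made-up values, for illustration only.
	samples := []Sample{{202, 120}, {200, 30}, {500, 4}, {503, 1}}
	totals := byClass(samples)

	// Sort the class names so the output order is deterministic.
	classes := make([]string, 0, len(totals))
	for c := range totals {
		classes = append(classes, c)
	}
	sort.Strings(classes)
	for _, c := range classes {
		fmt.Printf("%s: %.0f\n", c, totals[c])
	}
}
```

Grouping by class rather than by individual code keeps the number of series small while still separating delivered events from failed ones.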
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,133 @@ | ||
| --- | ||
| title: "Metrics API" | ||
| weight: 99 | ||
| type: "docs" | ||
| --- | ||
|
|
||
|
Member
And here? |
||
| <br> | ||
|
|
||
| **NOTE:** The metrics API may change in the future; this page is a snapshot of the current metrics. | ||
| <br> | ||
|
|
||
| ## Admin | ||
|
|
||
| Administrators can monitor the Serving control plane based on the metrics exposed by each Serving component. | ||
| Metrics are listed next. | ||
|
|
||
| ### Activator | ||
|
|
||
| The following metrics allow the user to understand how an application responds when traffic goes through the activator, e.g. when scaling from zero. For example, high request latency means that requests are taking too much time to be fulfilled. | ||
| <br> | ||
| | Metric Name | Description | Type | Tags | Unit | Status | | ||
| |:-|:-|:-|:-|:-|:-| | ||
| | request_concurrency | Concurrent requests that are routed to Activator<br>These are requests reported by the concurrency reporter which may not be done yet.<br> This is the average concurrency over a reporting period | Gauge | configuration_name<br>container_name<br>namespace_name<br>pod_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | request_count | The number of requests that are routed to Activator.<br>These are requests that have been fulfilled from the activator handler. | Counter | configuration_name<br>container_name<br>namespace_name<br>pod_name<br>response_code<br>response_code_class<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | request_latencies | The response time in milliseconds for the fulfilled routed requests | Histogram | configuration_name<br>container_name<br>namespace_name<br>pod_name<br>response_code<br>response_code_class<br>revision_name<br>service_name | Milliseconds | Stable | | ||
|
|
||
| ### Autoscaler | ||
|
|
||
| The Autoscaler component exposes a number of metrics related to its decisions per revision. | ||
| For example, at any given time a user can monitor the desired pods the Autoscaler wants to allocate for | ||
| a service, the average number of requests per second during the stable window, whether the autoscaler is in panic mode (KPA), etc. | ||
| To read more about how the autoscaler works, see the [scaling system documentation](https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md). | ||
| <br> | ||
| | Metric Name | Description | Type | Tags | Unit | Status | | ||
| |:-|:-|:-|:-|:-|:-| | ||
| | desired_pods | Number of pods autoscaler wants to allocate | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | excess_burst_capacity | Excess burst capacity observed over the stable window | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | stable_request_concurrency | Average of requests count per observed pod over the stable window | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | panic_request_concurrency | Average of requests count per observed pod over the panic window | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | target_concurrency_per_pod | The desired number of concurrent requests for each pod | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | stable_requests_per_second | Average requests-per-second per observed pod over the stable window | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | panic_requests_per_second | Average requests-per-second per observed pod over the panic window | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | target_requests_per_second | The desired requests-per-second for each pod | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | panic_mode | 1 if autoscaler is in panic mode, 0 otherwise | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | requested_pods | Number of pods autoscaler requested from Kubernetes | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | actual_pods | Number of pods that are allocated currently in ready state | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | not_ready_pods | Number of pods that are not ready currently | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | pending_pods | Number of pods that are pending currently | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | terminating_pods | Number of pods that are terminating currently | Gauge | configuration_name<br>namespace_name<br>revision_name<br>service_name | Dimensionless | Stable | | ||
|
|
||
| ### Controller | ||
|
|
||
| The following metrics are emitted by any component that implements controller logic. | ||
| The metrics show details about the reconciliation operations and the behavior of the workqueue on which | ||
| reconciliation requests are enqueued. | ||
|
|
||
| | Metric Name | Description | Type | Tags | Unit | Status | | ||
| |:-|:-|:-|:-|:-|:-| | ||
| | work_queue_depth | Depth of the work queue | Gauge | reconciler | Dimensionless | Stable | | ||
| | reconcile_count | Number of reconcile operations | Counter | reconciler<br>success | Dimensionless | Stable | | ||
| | reconcile_latency | Latency of reconcile operations | Histogram | reconciler<br>success | Milliseconds | Stable | | ||
| | workqueue_adds_total | Total number of adds handled by workqueue | Counter | name | Dimensionless | Stable | | ||
| | workqueue_depth | Current depth of workqueue | Gauge | reconciler | Dimensionless | Stable | | ||
| | workqueue_queue_latency_seconds | How long in seconds an item stays in workqueue before being requested | Histogram | name | Seconds | Stable | | ||
| | workqueue_retries_total | Total number of retries handled by workqueue | Counter | name | Dimensionless | Stable | | ||
| | workqueue_work_duration_seconds | How long in seconds processing an item from a workqueue takes. | Histogram | name | Seconds | Stable | | ||
| | workqueue_unfinished_work_seconds | How long in seconds the outstanding workqueue items have been in flight (total). | Histogram | name | Seconds | Stable | | ||
| | workqueue_longest_running_processor_seconds | How long in seconds the longest outstanding workqueue item has been in flight | Histogram | name | Seconds | Stable | | ||
|
|
||
| ### Webhook | ||
|
|
||
| Webhook metrics report useful information about operations, e.g. CREATE, on Serving resources and whether admission was allowed. | ||
| For example, if a large number of operations fail, this could indicate an issue with the user-submitted resource. | ||
| <br> | ||
| | Metric Name | Description | Type | Tags | Unit | Status | | ||
| |:-|:-|:-|:-|:-|:-| | ||
| | request_count | The number of requests that are routed to webhook | Counter | admission_allowed<br>kind_group<br>kind_kind<br>kind_version<br>request_operation<br>resource_group<br>resource_namespace<br>resource_resource<br>resource_version | Dimensionless | Stable | | ||
| | request_latencies | The response time in milliseconds | Histogram | admission_allowed<br>kind_group<br>kind_kind<br>kind_version<br>request_operation<br>resource_group<br>resource_namespace<br>resource_resource<br>resource_version | Milliseconds | Stable | | ||
|
|
||
| ### Go Runtime - memstats | ||
|
|
||
| Each Knative Serving control plane process emits a number of Go runtime [memory statistics](https://golang.org/pkg/runtime/#MemStats) (shown next). | ||
| As a baseline for monitoring purposes, users could start with a subset of the metrics: current allocations (go_alloc), total allocations (go_total_alloc), system memory (go_sys), mallocs (go_mallocs), frees (go_frees), total garbage collection pause time (go_total_gc_pause_ns), next GC target heap size (go_next_gc) and number of garbage collection cycles (go_num_gc). | ||
| <br> | ||
| | Metric Name | Description | Type | Tags | Unit | Status | | ||
| |:-|:-|:-|:-|:-|:-| | ||
| | go_alloc | The number of bytes of allocated heap objects (same as heap_alloc) | Gauge | name | Dimensionless | Stable | | ||
| | go_total_alloc | The cumulative bytes allocated for heap objects | Gauge | name | Dimensionless | Stable | | ||
| | go_sys | The total bytes of memory obtained from the OS | Gauge | name | Dimensionless | Stable | | ||
| | go_lookups | The number of pointer lookups performed by the runtime | Gauge | name | Dimensionless | Stable | | ||
| | go_mallocs | The cumulative count of heap objects allocated | Gauge | name | Dimensionless | Stable | | ||
| | go_frees | The cumulative count of heap objects freed | Gauge | name | Dimensionless | Stable | | ||
| | go_heap_alloc | The number of bytes of allocated heap objects | Gauge | name | Dimensionless | Stable | | ||
| | go_heap_sys | The number of bytes of heap memory obtained from the OS | Gauge | name | Dimensionless | Stable | | ||
| | go_heap_idle | The number of bytes in idle (unused) spans | Gauge | name | Dimensionless | Stable | | ||
| | go_heap_in_use | The number of bytes in in-use spans | Gauge | name | Dimensionless | Stable | | ||
| | go_heap_released | The number of bytes of physical memory returned to the OS | Gauge | name | Dimensionless | Stable | | ||
| | go_heap_objects | The number of allocated heap objects | Gauge | name | Dimensionless | Stable | | ||
| | go_stack_in_use | The number of bytes in stack spans | Gauge | name | Dimensionless | Stable | | ||
| | go_stack_sys | The number of bytes of stack memory obtained from the OS | Gauge | name | Dimensionless | Stable | | ||
| | go_mspan_in_use | The number of bytes of allocated mspan structures | Gauge | name | Dimensionless | Stable | | ||
| | go_mspan_sys | The number of bytes of memory obtained from the OS for mspan structures | Gauge | name | Dimensionless | Stable | | ||
| | go_mcache_in_use | The number of bytes of allocated mcache structures | Gauge | name | Dimensionless | Stable | | ||
| | go_mcache_sys | The number of bytes of memory obtained from the OS for mcache structures | Gauge | name | Dimensionless | Stable | | ||
| | go_bucket_hash_sys | The number of bytes of memory in profiling bucket hash tables. | Gauge | name | Dimensionless | Stable | | ||
| | go_gc_sys | The number of bytes of memory in garbage collection metadata | Gauge | name | Dimensionless | Stable | | ||
| | go_other_sys | The number of bytes of memory in miscellaneous off-heap runtime allocations | Gauge | name | Dimensionless | Stable | | ||
| | go_next_gc | The target heap size of the next GC cycle | Gauge | name | Dimensionless | Stable | | ||
| | go_last_gc | The time the last garbage collection finished, as nanoseconds since 1970 (the UNIX epoch) | Gauge | name | Nanoseconds | Stable | | ||
| | go_total_gc_pause_ns | The cumulative nanoseconds in GC stop-the-world pauses since the program started | Gauge | name | Nanoseconds | Stable | | ||
| | go_num_gc | The number of completed GC cycles. | Gauge | name | Dimensionless | Stable | | ||
| | go_num_forced_gc | The number of GC cycles that were forced by the application calling the GC function. | Gauge | name | Dimensionless | Stable | | ||
| | go_gc_cpu_fraction | The fraction of this program's available CPU time used by the GC since the program started | Gauge | name | Dimensionless | Stable | | ||
|
|
||
| **NOTE:** The name tag is empty. | ||
|
|
||
| ## Developer - User Services | ||
|
|
||
| Every Knative service has a proxy container that proxies the connections to the application container. | ||
| A number of metrics are reported for the queue proxy performance. Using the following metrics, application | ||
| developers, devops and others can measure whether requests are queued at the proxy side (need for backpressure) and what the actual delay is in serving requests at the application side. | ||
|
|
||
| ### Queue proxy | ||
|
|
||
| Requests endpoint | ||
|
|
||
| | Metric Name | Description | Type | Tags | Unit | Status | | ||
| |:-|:-|:-|:-|:-|:-| | ||
| | revision_request_count | The number of requests that are routed to queue-proxy | Counter | configuration_name<br>container_name<br>namespace_name<br>pod_name<br>response_code<br>response_code_class<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | revision_request_latencies | The response time in milliseconds | Histogram | configuration_name<br>container_name<br>namespace_name<br>pod_name<br>response_code<br>response_code_class<br>revision_name<br>service_name | Milliseconds | Stable | | ||
| | revision_app_request_count | The number of requests that are routed to user-container | Counter | configuration_name<br>container_name<br>namespace_name<br>pod_name<br>response_code<br>response_code_class<br>revision_name<br>service_name | Dimensionless | Stable | | ||
| | revision_app_request_latencies | The response time in milliseconds | Histogram | configuration_name<br>namespace_name<br>pod_name<br>response_code<br>response_code_class<br>revision_name<br>service_name | Milliseconds | Stable | | ||
| | revision_queue_depth | The current number of items in the serving and waiting queue, or not reported if unlimited concurrency | Gauge | configuration_name<br>container_name<br>namespace_name<br>pod_name<br>response_code_class<br>revision_name<br>service_name | Dimensionless | Stable | | ||
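The queue proxy section above says these metrics can show whether requests are queued at the proxy side and what the delay is at the application side. One rough way to read that, sketched below in Go, is to compare the mean of `revision_request_latencies` with the mean of `revision_app_request_latencies`. This is a heuristic under stated assumptions, not something the docs prescribe: the `Latency` type and `proxyOverheadMs` helper are hypothetical, the input values are made up, and it assumes the sum and count of each histogram are available from the monitoring backend.

```go
package main

import "fmt"

// Latency is a hypothetical summary of one histogram metric: the sum of all
// observed values in milliseconds and the number of observations.
type Latency struct {
	SumMs float64
	Count float64
}

// Mean returns the average latency in milliseconds, or 0 if nothing was observed.
func (l Latency) Mean() float64 {
	if l.Count == 0 {
		return 0
	}
	return l.SumMs / l.Count
}

// proxyOverheadMs estimates the time a request spends at the proxy side as the
// difference between the mean total latency (revision_request_latencies) and
// the mean application latency (revision_app_request_latencies).
// Negative differences are clamped to 0.
func proxyOverheadMs(total, app Latency) float64 {
	d := total.Mean() - app.Mean()
	if d < 0 {
		return 0
	}
	return d
}

func main() {
	// Made-up values, for illustration only.
	total := Latency{SumMs: 5200, Count: 100}
	app := Latency{SumMs: 4000, Count: 100}
	fmt.Printf("approximate proxy-side delay: %.1f ms\n", proxyOverheadMs(total, app))
}
```

A persistently large difference, together with a growing `revision_queue_depth`, would suggest that requests are waiting in the proxy rather than being slow in the application itself.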
Add something here indicating that these are a snapshot of current metrics and may change in the future?