docs: Add comprehensive metrics documentation #647
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,341 @@ | ||
| # Metrics Catalog for VTEX IO Node Apps | ||
|
|
||
| This document provides a comprehensive catalog of all metrics available in the `@vtex/api` library, organized by their implementation (diagnostics-based vs legacy). | ||
|
|
||
| > **Looking for migration guidance?** See [METRICS_OVERVIEW.md](./METRICS_OVERVIEW.md) for migration patterns and best practices. | ||
|
|
||
| ## Table of Contents | ||
|
|
||
| - [Metrics Architecture Overview](#metrics-architecture-overview) | ||
| - [Complete Metrics Visual Summary](#complete-metrics-visual-summary) | ||
| - [Diagnostics-Related Metrics](#diagnostics-related-metrics) | ||
| - [Legacy Metrics (Non-Diagnostics)](#legacy-metrics-non-diagnostics) | ||
|
|
||
| --- | ||
|
|
||
| ## Metrics Architecture Overview | ||
|
|
||
| The `@vtex/api` library has two coexisting metrics systems during the migration period: | ||
|
|
||
| 1. **Diagnostics-Based Metrics** (New) - Uses `@vtex/diagnostics-nodejs` with OpenTelemetry | ||
| 2. **Legacy Metrics** (Existing) - Uses `prom-client`, `MetricsAccumulator`, and `console.log` exports | ||
|
|
||
| Both systems operate independently and can coexist. The goal is to gradually migrate to diagnostics-based metrics while maintaining backward compatibility. | ||
|
|
||
| ### Two Categories of Metrics | ||
|
|
||
| | Category | Description | Initialization | Customization | | ||
| |----------|-------------|----------------|---------------| | ||
| | **Runtime/Infrastructure** | System-wide metrics for capacity planning and SLOs | Once at startup | Limited (configured at startup) | | ||
| | **App/Middleware** | Operation-specific metrics for debugging and optimization | Per-request/operation | Rich (can add custom attributes) | | ||
|
|
||
| --- | ||
|
|
||
| ## Complete Metrics Visual Summary | ||
|
|
||
| ``` | ||
| All Metrics in node-vtex-api | ||
| │ | ||
| ├── 🆕 Diagnostics-Related Metrics (OpenTelemetry-based) | ||
| │ │ | ||
| │ ├── 🏗️ Runtime/Infrastructure Metrics | ||
| │ │ │ | ||
| │ │ ├── OTel Request Instruments (service/metrics/metrics.ts) | ||
| │ │ │ ├── io_http_requests_current (Gauge) | ||
| │ │ │ ├── runtime_http_requests_duration_milliseconds (Histogram) | ||
| │ │ │ ├── runtime_http_requests_total (Counter) | ||
| │ │ │ ├── runtime_http_response_size_bytes (Histogram) | ||
| │ │ │ └── runtime_http_aborted_requests_total (Counter) | ||
| │ │ │ | ||
| │ │ ├── Auto-instrumentation (telemetry/client.ts) | ||
| │ │ │ ├── http.server.duration (Histogram - HttpInstrumentation) | ||
| │ │ │ ├── http.server.request.size (Histogram) | ||
| │ │ │ ├── http.server.response.size (Histogram) | ||
| │ │ │ ├── http.client.duration (Histogram - HttpInstrumentation) | ||
| │ │ │ ├── http.client.request.size (Histogram) | ||
| │ │ │ ├── http.client.response.size (Histogram) | ||
| │ │ │ └── Koa-enhanced HTTP metrics (KoaInstrumentation) | ||
| │ │ │ | ||
| │ │ └── Host Metrics (HostMetricsInstrumentation) | ||
| │ │ ├── process.runtime.nodejs.memory.heap.used (Gauge) | ||
| │ │ ├── process.runtime.nodejs.memory.heap.total (Gauge) | ||
| │ │ ├── process.runtime.nodejs.memory.rss (Gauge) | ||
| │ │ ├── process.runtime.nodejs.memory.external (Gauge) | ||
| │ │ ├── process.runtime.nodejs.memory.arrayBuffers (Gauge) | ||
| │ │ ├── process.runtime.nodejs.event_loop.lag.max (Gauge) | ||
| │ │ ├── process.runtime.nodejs.event_loop.lag.min (Gauge) | ||
| │ │ ├── process.cpu.utilization (Gauge) | ||
| │ │ ├── system.cpu.utilization (Gauge) | ||
| │ │ ├── system.memory.usage (Gauge) | ||
| │ │ ├── system.memory.utilization (Gauge) | ||
| │ │ ├── system.network.io (Counter) | ||
| │ │ └── system.network.errors (Counter) | ||
| │ │ | ||
| │ └── 📱 App/Middleware Metrics | ||
| │ │ | ||
| │ ├── HTTP Client (HttpClient/middlewares/metrics.ts) | ||
| │ │ ├── latency histogram (via recordLatency) | ||
| │ │ ├── http_client_requests_total (Counter) | ||
| │ │ ├── http_client_cache_total (Counter) | ||
| │ │ └── http_client_requests_retried_total (Counter) | ||
| │ │ | ||
| │ ├── HTTP Handler (worker/runtime/http/middlewares/*) | ||
| │ │ ├── latency histogram (via recordLatency) | ||
| │ │ ├── http_handler_requests_total (Counter) | ||
| │ │ ├── http_server_requests_total (Counter) | ||
| │ │ ├── http_server_requests_closed_total (Counter) | ||
| │ │ └── http_server_requests_aborted_total (Counter) | ||
| │ │ | ||
| │ ├── GraphQL (worker/runtime/graphql/schema/schemaDirectives/Metric.ts) | ||
| │ │ ├── latency histogram (via recordLatency) | ||
| │ │ └── graphql_field_requests_total (Counter) | ||
| │ │ | ||
| │ └── HTTP Agent (HttpClient/middlewares/request/HttpAgentSingleton.ts) | ||
| │ ├── http_agent_sockets_current (Gauge) | ||
| │ ├── http_agent_free_sockets_current (Gauge) | ||
| │ └── http_agent_pending_requests_current (Gauge) | ||
| │ | ||
| └── 🏛️ Legacy Metrics (Non-Diagnostics) | ||
| │ | ||
| ├── 📊 Prometheus Metrics (prom-client, exposed on /metrics) | ||
| │ │ | ||
| │ ├── Request Metrics (service/tracing/metrics/*) | ||
| │ │ ├── runtime_http_requests_total (Counter) - labels: status_code, handler | ||
| │ │ ├── runtime_http_aborted_requests_total (Counter) - labels: handler | ||
| │ │ ├── runtime_http_requests_duration_milliseconds (Histogram) | ||
| │ │ ├── runtime_http_response_size_bytes (Histogram) | ||
| │ │ └── io_http_requests_current (Gauge) | ||
| │ │ | ||
| │ ├── Event Loop Metrics (service/tracing/metrics/measurers/*) | ||
| │ │ ├── runtime_event_loop_lag_max_between_scrapes_seconds (Gauge) | ||
| │ │ └── runtime_event_loop_lag_percentiles_between_scrapes_seconds (Gauge) | ||
| │ │ | ||
| │ └── Default Node.js Metrics (collectDefaultMetrics) | ||
| │ ├── nodejs_gc_duration_seconds (Histogram) | ||
| │ ├── nodejs_active_handles_total (Gauge) | ||
| │ ├── nodejs_active_requests_total (Gauge) | ||
| │ ├── nodejs_heap_size_total_bytes (Gauge) | ||
| │ ├── nodejs_heap_size_used_bytes (Gauge) | ||
| │ ├── nodejs_external_memory_bytes (Gauge) | ||
| │ ├── nodejs_version_info (Gauge) | ||
| │ ├── process_cpu_user_seconds_total (Counter) | ||
| │ ├── process_cpu_system_seconds_total (Counter) | ||
| │ ├── process_resident_memory_bytes (Gauge) | ||
| │ └── process_start_time_seconds (Gauge) | ||
| │ | ||
| ├── 📝 MetricsAccumulator (console.log exports via trackStatus) | ||
| │ │ | ||
| │ ├── HTTP Handler Metrics (worker/runtime/http/middlewares/timings.ts) | ||
| │ │ └── http-handler-{route_id} | ||
| │ │ ├── Aggregates: count, mean, median, percentile95, percentile99, max | ||
| │ │ └── Extensions: success, error, timeout, aborted, cancelled | ||
| │ │ | ||
| │ ├── HTTP Client Metrics (HttpClient/middlewares/metrics.ts) | ||
| │ │ └── http-client-{metric_name} | ||
| │ │ ├── Aggregates: count, mean, median, percentile95, percentile99, max | ||
| │ │ └── Extensions: | ||
| │ │ ├── Status: success, error, timeout, aborted, cancelled | ||
| │ │ ├── Cache: success-hit, success-miss, success-inflight, success-memoized | ||
| │ │ └── Retry: retry-{status}-{count} | ||
| │ │ | ||
| │ ├── GraphQL Metrics (worker/runtime/graphql/schema/schemaDirectives/Metric.ts) | ||
| │ │ └── graphql-metric-{field_name} | ||
| │ │ ├── Aggregates: count, mean, median, percentile95, percentile99, max | ||
| │ │ └── Extensions: success, error | ||
| │ │ | ||
| │ ├── System Metrics (metrics/MetricsAccumulator.ts) | ||
| │ │ ├── cpu - user (μs), system (μs) | ||
| │ │ ├── memory - rss, heapTotal, heapUsed, external, arrayBuffers | ||
| │ │ ├── httpAgent - sockets, freeSockets, pendingRequests | ||
| │ │ └── incomingRequest - total, closed, aborted | ||
| │ │ | ||
| │ └── Cache Metrics (via trackCache) | ||
| │ └── {cache_name}-cache | ||
| │ ├── LRU: itemCount, length, disposedItems, hitRate, hits, max, total | ||
| │ ├── Disk: hits, total | ||
| │ └── Multilayer: hitRate, hits, total | ||
| │ | ||
| └── 💰 Billing Metrics (console.log with __VTEX_IO_BILLING) | ||
| └── Process time per handler | ||
| ├── account, app, handler | ||
| ├── production, routeType (public_route/private_route) | ||
| ├── timestamp, value (milliseconds) | ||
| └── vendor, workspace | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Diagnostics-Related Metrics | ||
|
|
||
| ### Runtime/Infrastructure Metrics | ||
|
|
||
| These are system-wide metrics declared once, at service initialization. | ||
|
|
||
| #### OTel Request Instruments | ||
|
|
||
| **Source:** `service/metrics/metrics.ts` | ||
|
|
||
| | Metric Name | Type | Description | | ||
| |-------------|------|-------------| | ||
| | `io_http_requests_current` | Gauge | Current number of requests in progress | | ||
| | `runtime_http_requests_duration_milliseconds` | Histogram | Incoming HTTP request duration | | ||
| | `runtime_http_requests_total` | Counter | Total number of HTTP requests | | ||
| | `runtime_http_response_size_bytes` | Histogram | Outgoing response sizes | | ||
| | `runtime_http_aborted_requests_total` | Counter | Total aborted HTTP requests | | ||
|
|
||
| #### Auto-instrumentation Metrics | ||
|
|
||
| **Source:** `telemetry/client.ts` (via OpenTelemetry instrumentations) | ||
|
|
||
| | Metric Name | Type | Source | Description | | ||
| |-------------|------|--------|-------------| | ||
| | `http.server.duration` | Histogram | HttpInstrumentation | HTTP server request duration | | ||
| | `http.client.duration` | Histogram | HttpInstrumentation | HTTP client request duration | | ||
| | `process.runtime.nodejs.memory.*` | Gauge | HostMetrics | Node.js memory metrics | | ||
| | `process.cpu.utilization` | Gauge | HostMetrics | Process CPU utilization | | ||
| | `system.cpu.utilization` | Gauge | HostMetrics | System CPU utilization | | ||
| | `system.memory.usage` | Gauge | HostMetrics | System memory usage | | ||
|
|
||
| ### App/Middleware Metrics | ||
|
|
||
| These are operation-specific metrics recorded in middleware components. | ||
|
|
||
| #### HTTP Client Metrics | ||
|
|
||
| **Source:** `HttpClient/middlewares/metrics.ts` | ||
|
|
||
| | Metric Name | Type | Attributes | | ||
| |-------------|------|------------| | ||
| | Latency histogram | Histogram | `component`, `client_metric`, `status_code`, `status`, `cache_state` | | ||
| | `http_client_requests_total` | Counter | `component`, `client_metric`, `status_code`, `status` | | ||
| | `http_client_cache_total` | Counter | `component`, `client_metric`, `status_code`, `status`, `cache_state` | | ||
| | `http_client_requests_retried_total` | Counter | `component`, `client_metric`, `status_code`, `status`, `retry_count` | | ||
|
|
||
| #### HTTP Handler Metrics | ||
|
|
||
| **Source:** `worker/runtime/http/middlewares/timings.ts`, `requestStats.ts` | ||
|
|
||
| | Metric Name | Type | Attributes | | ||
| |-------------|------|------------| | ||
| | Latency histogram | Histogram | `component`, `route_id`, `route_type`, `status_code`, `status` | | ||
| | `http_handler_requests_total` | Counter | `component`, `route_id`, `route_type`, `status_code`, `status` | | ||
| | `http_server_requests_total` | Counter | `route_id`, `route_type`, `status_code` | | ||
| | `http_server_requests_closed_total` | Counter | `route_id`, `route_type`, `status_code` | | ||
| | `http_server_requests_aborted_total` | Counter | `route_id`, `route_type`, `status_code` | | ||
|
|
||
| #### GraphQL Metrics | ||
|
|
||
| **Source:** `worker/runtime/graphql/schema/schemaDirectives/Metric.ts` | ||
|
|
||
| | Metric Name | Type | Attributes | | ||
| |-------------|------|------------| | ||
| | Latency histogram | Histogram | `component`, `field_name`, `status` | | ||
| | `graphql_field_requests_total` | Counter | `component`, `field_name`, `status` | | ||
|
|
||
| #### HTTP Agent Metrics | ||
|
|
||
| **Source:** `HttpClient/middlewares/request/HttpAgentSingleton.ts` | ||
|
|
||
| | Metric Name | Type | Description | | ||
| |-------------|------|-------------| | ||
| | `http_agent_sockets_current` | Gauge | Active sockets | | ||
| | `http_agent_free_sockets_current` | Gauge | Free sockets in pool | | ||
| | `http_agent_pending_requests_current` | Gauge | Pending requests waiting for socket | | ||
|
|
||
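The three HTTP Agent gauges above map naturally onto the per-host maps that a Node.js `http.Agent` exposes (`sockets`, `freeSockets`, `requests`). The following is a minimal sketch of how such counts could be derived, not the code in `HttpAgentSingleton.ts`; the `AgentLike` type and the `agentStats` and `countEntries` helpers are hypothetical names introduced for illustration.

```typescript
// Hypothetical structural view of the per-host maps found on a Node.js http.Agent.
// Each map goes from a host key to a list of sockets or queued requests.
type PerHost = Readonly<Record<string, readonly unknown[] | undefined>>;

interface AgentLike {
  sockets: PerHost;
  freeSockets: PerHost;
  requests: PerHost;
}

interface AgentStats {
  sockets: number;          // candidate value for http_agent_sockets_current
  freeSockets: number;      // candidate value for http_agent_free_sockets_current
  pendingRequests: number;  // candidate value for http_agent_pending_requests_current
}

// Sum the list lengths across all host keys.
function countEntries(perHost: PerHost): number {
  let total = 0;
  for (const key of Object.keys(perHost)) {
    const list = perHost[key];
    total += list ? list.length : 0;
  }
  return total;
}

function agentStats(agent: AgentLike): AgentStats {
  return {
    sockets: countEntries(agent.sockets),
    freeSockets: countEntries(agent.freeSockets),
    pendingRequests: countEntries(agent.requests),
  };
}
```

Because gauges report a current value rather than an accumulated one, a function like this would be called at observation time instead of incrementing on each socket event.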
| --- | ||
|
|
||
| ## Legacy Metrics (Non-Diagnostics) | ||
|
|
||
| ### Prometheus Metrics | ||
|
|
||
| Exposed on the `/metrics` endpoint via `prom-client`. | ||
|
|
||
| #### Request Metrics | ||
|
|
||
| **Source:** `service/tracing/metrics/MetricNames.ts` | ||
|
|
||
| | Metric Name | Type | Labels | Description | | ||
| |-------------|------|--------|-------------| | ||
| | `runtime_http_requests_total` | Counter | `status_code`, `handler` | Total HTTP requests | | ||
| | `runtime_http_aborted_requests_total` | Counter | `handler` | Aborted HTTP requests | | ||
| | `runtime_http_requests_duration_milliseconds` | Histogram | `handler` | Request duration (buckets: 10-5120ms) | | ||
| | `runtime_http_response_size_bytes` | Histogram | `handler` | Response sizes (buckets: 500B-4MB) | | ||
| | `io_http_requests_current` | Gauge | - | Concurrent requests | | ||
|
|
||
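The duration histogram above documents only the bucket range (10-5120 ms), not the individual boundaries. As an illustration, the sketch below assumes the boundaries double at each step, which happens to span exactly 10 to 5120 in ten buckets; the actual list in the library may differ. `exponentialBuckets` and `cumulativeCounts` are hypothetical helpers, not `prom-client` functions.

```typescript
// Assumption: boundaries grow by a constant factor. The document only states the range.
function exponentialBuckets(start: number, factor: number, count: number): number[] {
  const buckets: number[] = [];
  let bound = start;
  for (let i = 0; i < count; i++) {
    buckets.push(bound);
    bound *= factor;
  }
  return buckets;
}

// Prometheus-style cumulative counts: each bucket counts observations <= its upper bound.
// Observations above the last bound would only appear in the implicit +Inf bucket.
function cumulativeCounts(buckets: number[], observations: number[]): number[] {
  return buckets.map((upperBound) => observations.filter((v) => v <= upperBound).length);
}

// Ten doubling buckets starting at 10 ms: 10, 20, 40, ..., 5120.
const durationBuckets = exponentialBuckets(10, 2, 10);

// Four sample request durations in milliseconds; 6000 exceeds every finite bucket.
const durationCounts = cumulativeCounts(durationBuckets, [7, 35, 150, 6000]);
```

With cumulative buckets, a slow request raises the count of every bucket at or above its duration, which is what lets a backend estimate percentiles from bucket counts alone.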
| #### Event Loop Metrics | ||
|
|
||
| **Source:** `service/tracing/metrics/measurers/EventLoopLagMeasurer.ts` | ||
|
|
||
| | Metric Name | Type | Labels | Description | | ||
| |-------------|------|--------|-------------| | ||
| | `runtime_event_loop_lag_max_between_scrapes_seconds` | Gauge | - | Max event loop lag | | ||
| | `runtime_event_loop_lag_percentiles_between_scrapes_seconds` | Gauge | `percentile` | Event loop lag percentiles (95, 99) | | ||
|
|
||
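The "between scrapes" wording in these metric names suggests a read-then-reset pattern: each scrape reports the lag observed since the previous one. The sketch below shows that pattern under stated assumptions and is not the code in `EventLoopLagMeasurer.ts`. It takes any histogram with `max`, `percentile()`, and `reset()`, which is the shape of the object returned by Node's `perf_hooks.monitorEventLoopDelay()` (values in nanoseconds); `snapshotLagSinceLastScrape` is a hypothetical name.

```typescript
// Structural subset of a delay histogram. Values are assumed to be in nanoseconds.
interface DelayHistogram {
  max: number;
  percentile(p: number): number;
  reset(): void;
}

interface LagSnapshot {
  maxSeconds: number;
  percentilesSeconds: Record<string, number>;
}

const NANOS_PER_SECOND = 1e9;

// Read the lag accumulated since the previous call, then reset the histogram
// so the next scrape only sees new samples.
function snapshotLagSinceLastScrape(
  histogram: DelayHistogram,
  percentiles: number[] = [95, 99],
): LagSnapshot {
  const snapshot: LagSnapshot = {
    maxSeconds: histogram.max / NANOS_PER_SECOND,
    percentilesSeconds: {},
  };
  for (const p of percentiles) {
    snapshot.percentilesSeconds[String(p)] = histogram.percentile(p) / NANOS_PER_SECOND;
  }
  histogram.reset();
  return snapshot;
}
```

Resetting on read trades history for freshness: a spike is reported in exactly one scrape interval rather than inflating the maximum for the life of the process.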
| #### Default Node.js Metrics | ||
|
|
||
| Via `collectDefaultMetrics()` from `prom-client`: | ||
|
|
||
| - `nodejs_gc_duration_seconds` - GC duration histogram | ||
| - `nodejs_active_handles_total` - Active handles | ||
| - `nodejs_active_requests_total` - Active requests | ||
| - `nodejs_heap_size_*_bytes` - Heap metrics | ||
| - `nodejs_external_memory_bytes` - External memory | ||
| - `nodejs_version_info` - Node.js version | ||
| - `process_cpu_*_seconds_total` - CPU counters | ||
| - `process_resident_memory_bytes` - RSS memory | ||
| - `process_start_time_seconds` - Process start time | ||
|
|
||
| ### MetricsAccumulator | ||
|
|
||
| Exported via `console.log` as JSON and collected by Splunk. | ||
|
|
||
| **Source:** `metrics/MetricsAccumulator.ts` | ||
|
|
||
| #### Aggregated Metrics Format | ||
|
|
||
| Each metric includes: | ||
| - `name` - Metric identifier | ||
| - `count` - Number of samples | ||
| - `mean`, `median` - Average and middle values | ||
| - `percentile95`, `percentile99` - Tail latencies | ||
| - `max` - Maximum value | ||
| - `production` - Environment flag | ||
| - Plus any custom extensions | ||
|
|
||
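To make the aggregated format concrete, here is a minimal sketch that computes the listed fields from a batch of samples. It is an illustration only: the percentile rule used by `MetricsAccumulator` is not stated in this document, so the nearest-rank method below is an assumption, and `aggregateSamples` and `nearestRankPercentile` are hypothetical names.

```typescript
interface AggregatedMetric {
  name: string;
  count: number;
  mean: number;
  median: number;
  percentile95: number;
  percentile99: number;
  max: number;
  production: boolean;
}

// Nearest-rank percentile over an ascending-sorted, non-empty array.
// Assumption: the library may interpolate differently.
function nearestRankPercentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function aggregateSamples(name: string, samples: number[], production: boolean): AggregatedMetric {
  if (samples.length === 0) {
    throw new Error("cannot aggregate an empty sample set");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, v) => acc + v, 0);
  return {
    name,
    count: sorted.length,
    mean: sum / sorted.length,
    median: nearestRankPercentile(sorted, 50),
    percentile95: nearestRankPercentile(sorted, 95),
    percentile99: nearestRankPercentile(sorted, 99),
    max: sorted[sorted.length - 1],
    production,
  };
}
```

Custom extensions such as `success` or `error` counts would be merged into the returned object alongside these fields.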
| #### System Metrics | ||
|
|
||
| | Metric Name | Properties | | ||
| |-------------|------------| | ||
| | `cpu` | `user` (μs), `system` (μs) | | ||
| | `memory` | `rss`, `heapTotal`, `heapUsed`, `external`, `arrayBuffers` | | ||
| | `httpAgent` | `sockets`, `freeSockets`, `pendingRequests` | | ||
| | `incomingRequest` | `total`, `closed`, `aborted` | | ||
|
|
||
| ### Billing Metrics | ||
|
|
||
| **Source:** `worker/runtime/http/middlewares/timings.ts` | ||
|
|
||
| Exported with the `__VTEX_IO_BILLING` flag for usage tracking: | ||
|
|
||
| ```json | ||
| { | ||
| "__VTEX_IO_BILLING": "true", | ||
| "account": "...", | ||
| "app": "...", | ||
| "handler": "...", | ||
| "production": true, | ||
| "routeType": "public_route", | ||
| "timestamp": 1234567890, | ||
| "type": "process-time", | ||
| "value": 150, | ||
| "vendor": "vtex", | ||
| "workspace": "master" | ||
| } | ||
| ``` | ||
|
|
||
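The JSON payload above can be described with a small type, which makes the fixed fields (`__VTEX_IO_BILLING`, `type`) explicit. This is a sketch derived only from the example payload, not from `timings.ts`; `BillingContext`, `buildBillingMetric`, and `toLogLine` are hypothetical names, and because the document does not state the unit of `timestamp`, the caller supplies it.

```typescript
type RouteType = "public_route" | "private_route";

// Request-scoped fields taken from the example payload.
interface BillingContext {
  account: string;
  app: string;
  handler: string;
  production: boolean;
  routeType: RouteType;
  vendor: string;
  workspace: string;
}

interface BillingMetric extends BillingContext {
  __VTEX_IO_BILLING: "true"; // a string, as in the example payload
  timestamp: number;         // unit not stated in the document
  type: "process-time";
  value: number;             // process time in milliseconds
}

function buildBillingMetric(
  ctx: BillingContext,
  processTimeMs: number,
  timestamp: number,
): BillingMetric {
  return {
    __VTEX_IO_BILLING: "true",
    ...ctx,
    timestamp,
    type: "process-time",
    value: processTimeMs,
  };
}

// One JSON object per line, suitable for a console.log-based export.
function toLogLine(metric: BillingMetric): string {
  return JSON.stringify(metric);
}
```

Typing the flag as the literal `"true"` keeps a boolean from being emitted by accident, which matters if a downstream collector matches on the string value.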
| --- | ||
|
|
||
| ## Related Documentation | ||
|
|
||
| - [Migration Guide](./METRICS_OVERVIEW.md) - Patterns and best practices for migrating to diagnostics-based metrics | ||
|
|
||
Do you know where this billing 💰 metric is used?
I'm not sure if that metric is still used, or where... Maybe @arturpimentel has more context about it