diff --git a/docs/METRICS_CATALOG.md b/docs/METRICS_CATALOG.md new file mode 100644 index 000000000..148d0dad7 --- /dev/null +++ b/docs/METRICS_CATALOG.md @@ -0,0 +1,341 @@ +# Metrics Catalog for VTEX IO Node Apps + +This document provides a comprehensive catalog of all metrics available in the `@vtex/api` library, organized by their implementation (diagnostics-based vs legacy). + +> **Looking for migration guidance?** See [METRICS_OVERVIEW.md](./METRICS_OVERVIEW.md) for migration patterns and best practices. + +## Table of Contents + +- [Metrics Architecture Overview](#metrics-architecture-overview) +- [Complete Metrics Visual Summary](#complete-metrics-visual-summary) +- [Diagnostics-Related Metrics](#diagnostics-related-metrics) +- [Legacy Metrics (Non-Diagnostics)](#legacy-metrics-non-diagnostics) + +--- + +## Metrics Architecture Overview + +The `@vtex/api` library has two coexisting metrics systems during the migration period: + +1. **Diagnostics-Based Metrics** (New) - Uses `@vtex/diagnostics-nodejs` with OpenTelemetry +2. **Legacy Metrics** (Existing) - Uses `prom-client`, `MetricsAccumulator`, and console.log exports + +Both systems operate independently and can coexist. The goal is to gradually migrate to diagnostics-based metrics while maintaining backward compatibility. 
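To make the coexistence concrete, here is a minimal, self-contained sketch of a handler emitting the same measurement to both systems during migration. The sinks are stubbed with illustrative shapes (in a real service they are `global.metrics` and `global.diagnosticsMetrics`), and the operation name is hypothetical.

```typescript
// Illustrative stubs standing in for global.metrics (legacy
// MetricsAccumulator) and global.diagnosticsMetrics (OTel-based).
type HrTime = [number, number]

const recorded: string[] = []

const legacyMetrics = {
  // Mirrors the MetricsAccumulator.batch(name, elapsed, extensions) shape
  batch(name: string, _elapsed: HrTime, _extensions?: Record<string, number>): void {
    recorded.push(`legacy:${name}`)
  },
}

const diagnosticsMetrics = {
  // Mirrors the DiagnosticsMetrics.recordLatency(elapsed, attributes) shape
  recordLatency(_elapsed: HrTime, attrs: Record<string, string>): void {
    recorded.push(`otel:${attrs.operation}`)
  },
}

function handleRequest(): void {
  const start = process.hrtime()
  // ... handler work ...
  const elapsed = process.hrtime(start)
  // During the migration period, both sinks receive the same measurement
  legacyMetrics.batch('my-operation', elapsed, { success: 1 })
  diagnosticsMetrics.recordLatency(elapsed, { operation: 'my-operation', status: 'success' })
}
```

Because the two systems are independent, dropping the legacy call later requires no changes to the diagnostics path.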
+ +### Two Categories of Metrics + +| Category | Description | Initialization | Customization | +|----------|-------------|----------------|---------------| +| **Runtime/Infrastructure** | System-wide metrics for capacity planning and SLOs | Once at startup | Limited (configured at startup) | +| **App/Middleware** | Operation-specific metrics for debugging and optimization | Per-request/operation | Rich (can add custom attributes) | + +--- + +## Complete Metrics Visual Summary + +``` +All Metrics in node-vtex-api +│ +├── 🆕 Diagnostics-Related Metrics (OpenTelemetry-based) +│ │ +│ ├── 🏗 Runtime/Infrastructure Metrics +│ │ │ +│ │ ├── OTel Request Instruments (service/metrics/metrics.ts) +│ │ │ ├── io_http_requests_current (Gauge) +│ │ │ ├── runtime_http_requests_duration_milliseconds (Histogram) +│ │ │ ├── runtime_http_requests_total (Counter) +│ │ │ ├── runtime_http_response_size_bytes (Histogram) +│ │ │ └── runtime_http_aborted_requests_total (Counter) +│ │ │ +│ │ ├── Auto-instrumentation (telemetry/client.ts) +│ │ │ ├── http.server.duration (Histogram - HttpInstrumentation) +│ │ │ ├── http.server.request.size (Histogram) +│ │ │ ├── http.server.response.size (Histogram) +│ │ │ ├── http.client.duration (Histogram - HttpInstrumentation) +│ │ │ ├── http.client.request.size (Histogram) +│ │ │ ├── http.client.response.size (Histogram) +│ │ │ └── Koa-enhanced HTTP metrics (KoaInstrumentation) +│ │ │ +│ │ └── Host Metrics (HostMetricsInstrumentation) +│ │ ├── process.runtime.nodejs.memory.heap.used (Gauge) +│ │ ├── process.runtime.nodejs.memory.heap.total (Gauge) +│ │ ├── process.runtime.nodejs.memory.rss (Gauge) +│ │ ├── process.runtime.nodejs.memory.external (Gauge) +│ │ ├── process.runtime.nodejs.memory.arrayBuffers (Gauge) +│ │ ├── process.runtime.nodejs.event_loop.lag.max (Gauge) +│ │ ├── process.runtime.nodejs.event_loop.lag.min (Gauge) +│ │ ├── process.cpu.utilization (Gauge) +│ │ ├── system.cpu.utilization (Gauge) +│ │ ├── system.memory.usage (Gauge) +│ │ ├── 
system.memory.utilization (Gauge) +│ │ ├── system.network.io (Counter) +│ │ └── system.network.errors (Counter) +│ │ +│ └── 📱 App/Middleware Metrics +│ │ +│ ├── HTTP Client (HttpClient/middlewares/metrics.ts) +│ │ ├── latency histogram (via recordLatency) +│ │ ├── http_client_requests_total (Counter) +│ │ ├── http_client_cache_total (Counter) +│ │ └── http_client_requests_retried_total (Counter) +│ │ +│ ├── HTTP Handler (worker/runtime/http/middlewares/*) +│ │ ├── latency histogram (via recordLatency) +│ │ ├── http_handler_requests_total (Counter) +│ │ ├── http_server_requests_total (Counter) +│ │ ├── http_server_requests_closed_total (Counter) +│ │ └── http_server_requests_aborted_total (Counter) +│ │ +│ ├── GraphQL (worker/runtime/graphql/schema/schemaDirectives/Metric.ts) +│ │ ├── latency histogram (via recordLatency) +│ │ └── graphql_field_requests_total (Counter) +│ │ +│ └── HTTP Agent (HttpClient/middlewares/request/HttpAgentSingleton.ts) +│ ├── http_agent_sockets_current (Gauge) +│ ├── http_agent_free_sockets_current (Gauge) +│ └── http_agent_pending_requests_current (Gauge) +│ +└── 🏛 Legacy Metrics (Non-Diagnostics) + │ + ├── 📊 Prometheus Metrics (prom-client, exposed on /metrics) + │ │ + │ ├── Request Metrics (service/tracing/metrics/*) + │ │ ├── runtime_http_requests_total (Counter) - labels: status_code, handler + │ │ ├── runtime_http_aborted_requests_total (Counter) - labels: handler + │ │ ├── runtime_http_requests_duration_milliseconds (Histogram) + │ │ ├── runtime_http_response_size_bytes (Histogram) + │ │ └── io_http_requests_current (Gauge) + │ │ + │ ├── Event Loop Metrics (service/tracing/metrics/measurers/*) + │ │ ├── runtime_event_loop_lag_max_between_scrapes_seconds (Gauge) + │ │ └── runtime_event_loop_lag_percentiles_between_scrapes_seconds (Gauge) + │ │ + │ └── Default Node.js Metrics (collectDefaultMetrics) + │ ├── nodejs_gc_duration_seconds (Histogram) + │ ├── nodejs_active_handles_total (Gauge) + │ ├── nodejs_active_requests_total 
(Gauge) + │ ├── nodejs_heap_size_total_bytes (Gauge) + │ ├── nodejs_heap_size_used_bytes (Gauge) + │ ├── nodejs_external_memory_bytes (Gauge) + │ ├── nodejs_version_info (Gauge) + │ ├── process_cpu_user_seconds_total (Counter) + │ ├── process_cpu_system_seconds_total (Counter) + │ ├── process_resident_memory_bytes (Gauge) + │ └── process_start_time_seconds (Gauge) + │ + ├── 📝 MetricsAccumulator (console.log exports via trackStatus) + │ │ + │ ├── HTTP Handler Metrics (worker/runtime/http/middlewares/timings.ts) + │ │ └── http-handler-{route_id} + │ │ ├── Aggregates: count, mean, median, percentile95, percentile99, max + │ │ └── Extensions: success, error, timeout, aborted, cancelled + │ │ + │ ├── HTTP Client Metrics (HttpClient/middlewares/metrics.ts) + │ │ └── http-client-{metric_name} + │ │ ├── Aggregates: count, mean, median, percentile95, percentile99, max + │ │ └── Extensions: + │ │ ├── Status: success, error, timeout, aborted, cancelled + │ │ ├── Cache: success-hit, success-miss, success-inflight, success-memoized + │ │ └── Retry: retry-{status}-{count} + │ │ + │ ├── GraphQL Metrics (worker/runtime/graphql/schema/schemaDirectives/Metric.ts) + │ │ └── graphql-metric-{field_name} + │ │ ├── Aggregates: count, mean, median, percentile95, percentile99, max + │ │ └── Extensions: success, error + │ │ + │ ├── System Metrics (metrics/MetricsAccumulator.ts) + │ │ ├── cpu - user (Ξs), system (Ξs) + │ │ ├── memory - rss, heapTotal, heapUsed, external, arrayBuffers + │ │ ├── httpAgent - sockets, freeSockets, pendingRequests + │ │ └── incomingRequest - total, closed, aborted + │ │ + │ └── Cache Metrics (via trackCache) + │ └── {cache_name}-cache + │ ├── LRU: itemCount, length, disposedItems, hitRate, hits, max, total + │ ├── Disk: hits, total + │ └── Multilayer: hitRate, hits, total + │ + └── 💰 Billing Metrics (console.log with __VTEX_IO_BILLING) + └── Process time per handler + ├── account, app, handler + ├── production, routeType (public_route/private_route) + ├── 
timestamp, value (milliseconds) + └── vendor, workspace +``` + +--- + +## Diagnostics-Related Metrics + +### Runtime/Infrastructure Metrics + +These are system-wide metrics declared at service initialization level. + +#### OTel Request Instruments + +**Source:** `service/metrics/metrics.ts` + +| Metric Name | Type | Description | +|-------------|------|-------------| +| `io_http_requests_current` | Gauge | Current number of requests in progress | +| `runtime_http_requests_duration_milliseconds` | Histogram | Incoming HTTP request duration | +| `runtime_http_requests_total` | Counter | Total number of HTTP requests | +| `runtime_http_response_size_bytes` | Histogram | Outgoing response sizes | +| `runtime_http_aborted_requests_total` | Counter | Total aborted HTTP requests | + +#### Auto-instrumentation Metrics + +**Source:** `telemetry/client.ts` (via OpenTelemetry instrumentations) + +| Metric Name | Type | Source | Description | +|-------------|------|--------|-------------| +| `http.server.duration` | Histogram | HttpInstrumentation | HTTP server request duration | +| `http.client.duration` | Histogram | HttpInstrumentation | HTTP client request duration | +| `process.runtime.nodejs.memory.*` | Gauge | HostMetrics | Node.js memory metrics | +| `process.cpu.utilization` | Gauge | HostMetrics | Process CPU utilization | +| `system.cpu.utilization` | Gauge | HostMetrics | System CPU utilization | +| `system.memory.usage` | Gauge | HostMetrics | System memory usage | + +### App/Middleware Metrics + +These are operation-specific metrics recorded in middleware components. 
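As an illustration of how these per-operation metrics get produced, here is a minimal sketch of a Koa-style timing middleware. The sink is a local stub standing in for `global.diagnosticsMetrics`, and the `Ctx` shape is simplified; the actual call sites in `@vtex/api` live in the source files cited in the tables that follow.

```typescript
// Sketch of a timing middleware recording a latency histogram and a
// request counter with handler-level attributes. The sink below is a
// self-contained stub, not the real DiagnosticsMetrics implementation.
type HrTime = [number, number]
type Attributes = Record<string, string>

const counters = new Map<string, number>()
const diagnosticsMetrics = {
  recordLatency(_elapsed: HrTime, _attrs: Attributes): void {},
  incrementCounter(name: string, value: number, _attrs: Attributes): void {
    counters.set(name, (counters.get(name) ?? 0) + value)
  },
}

interface Ctx { routeId: string; status: number }

async function metricsMiddleware(ctx: Ctx, next: () => Promise<void>): Promise<void> {
  const start = process.hrtime()
  let status = 'success'
  try {
    await next()
  } catch (err) {
    status = 'error'
    throw err
  } finally {
    // Metrics are emitted even when the downstream handler throws
    diagnosticsMetrics.recordLatency(process.hrtime(start), {
      component: 'http_handler',
      route_id: ctx.routeId,
      status,
    })
    diagnosticsMetrics.incrementCounter('http_handler_requests_total', 1, {
      route_id: ctx.routeId,
      status_code: String(ctx.status),
      status,
    })
  }
}
```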
+ +#### HTTP Client Metrics + +**Source:** `HttpClient/middlewares/metrics.ts` + +| Metric Name | Type | Attributes | +|-------------|------|------------| +| Latency histogram | Histogram | `component`, `client_metric`, `status_code`, `status`, `cache_state` | +| `http_client_requests_total` | Counter | `component`, `client_metric`, `status_code`, `status` | +| `http_client_cache_total` | Counter | `component`, `client_metric`, `status_code`, `status`, `cache_state` | +| `http_client_requests_retried_total` | Counter | `component`, `client_metric`, `status_code`, `status`, `retry_count` | + +#### HTTP Handler Metrics + +**Source:** `worker/runtime/http/middlewares/timings.ts`, `requestStats.ts` + +| Metric Name | Type | Attributes | +|-------------|------|------------| +| Latency histogram | Histogram | `component`, `route_id`, `route_type`, `status_code`, `status` | +| `http_handler_requests_total` | Counter | `component`, `route_id`, `route_type`, `status_code`, `status` | +| `http_server_requests_total` | Counter | `route_id`, `route_type`, `status_code` | +| `http_server_requests_closed_total` | Counter | `route_id`, `route_type`, `status_code` | +| `http_server_requests_aborted_total` | Counter | `route_id`, `route_type`, `status_code` | + +#### GraphQL Metrics + +**Source:** `worker/runtime/graphql/schema/schemaDirectives/Metric.ts` + +| Metric Name | Type | Attributes | +|-------------|------|------------| +| Latency histogram | Histogram | `component`, `field_name`, `status` | +| `graphql_field_requests_total` | Counter | `component`, `field_name`, `status` | + +#### HTTP Agent Metrics + +**Source:** `HttpClient/middlewares/request/HttpAgentSingleton.ts` + +| Metric Name | Type | Description | +|-------------|------|-------------| +| `http_agent_sockets_current` | Gauge | Active sockets | +| `http_agent_free_sockets_current` | Gauge | Free sockets in pool | +| `http_agent_pending_requests_current` | Gauge | Pending requests waiting for socket | + +--- + +## 
Legacy Metrics (Non-Diagnostics) + +### Prometheus Metrics + +Exposed on the `/metrics` endpoint via `prom-client`. + +#### Request Metrics + +**Source:** `service/tracing/metrics/MetricNames.ts` + +| Metric Name | Type | Labels | Description | +|-------------|------|--------|-------------| +| `runtime_http_requests_total` | Counter | `status_code`, `handler` | Total HTTP requests | +| `runtime_http_aborted_requests_total` | Counter | `handler` | Aborted HTTP requests | +| `runtime_http_requests_duration_milliseconds` | Histogram | `handler` | Request duration (buckets: 10-5120ms) | +| `runtime_http_response_size_bytes` | Histogram | `handler` | Response sizes (buckets: 500B-4MB) | +| `io_http_requests_current` | Gauge | - | Concurrent requests | + +#### Event Loop Metrics + +**Source:** `service/tracing/metrics/measurers/EventLoopLagMeasurer.ts` + +| Metric Name | Type | Labels | Description | +|-------------|------|--------|-------------| +| `runtime_event_loop_lag_max_between_scrapes_seconds` | Gauge | - | Max event loop lag | +| `runtime_event_loop_lag_percentiles_between_scrapes_seconds` | Gauge | `percentile` | Event loop lag percentiles (95, 99) | + +#### Default Node.js Metrics + +Via `collectDefaultMetrics()` from `prom-client`: + +- `nodejs_gc_duration_seconds` - GC duration histogram +- `nodejs_active_handles_total` - Active handles +- `nodejs_active_requests_total` - Active requests +- `nodejs_heap_size_*_bytes` - Heap metrics +- `nodejs_external_memory_bytes` - External memory +- `nodejs_version_info` - Node.js version +- `process_cpu_*_seconds_total` - CPU counters +- `process_resident_memory_bytes` - RSS memory +- `process_start_time_seconds` - Process start time + +### MetricsAccumulator + +Exported via `console.log` as JSON and collected by Splunk. 
+ +**Source:** `metrics/MetricsAccumulator.ts` + +#### Aggregated Metrics Format + +Each metric includes: +- `name` - Metric identifier +- `count` - Number of samples +- `mean`, `median` - Average and middle values +- `percentile95`, `percentile99` - Tail latencies +- `max` - Maximum value +- `production` - Environment flag +- Plus any custom extensions + +#### System Metrics + +| Metric Name | Properties | +|-------------|------------| +| `cpu` | `user` (Ξs), `system` (Ξs) | +| `memory` | `rss`, `heapTotal`, `heapUsed`, `external`, `arrayBuffers` | +| `httpAgent` | `sockets`, `freeSockets`, `pendingRequests` | +| `incomingRequest` | `total`, `closed`, `aborted` | + +### Billing Metrics + +**Source:** `worker/runtime/http/middlewares/timings.ts` + +Exported with `__VTEX_IO_BILLING` flag for usage tracking: + +```json +{ + "__VTEX_IO_BILLING": "true", + "account": "...", + "app": "...", + "handler": "...", + "production": true, + "routeType": "public_route", + "timestamp": 1234567890, + "type": "process-time", + "value": 150, + "vendor": "vtex", + "workspace": "master" +} +``` + +--- + +## Related Documentation + +- [Migration Guide](./METRICS_OVERVIEW.md) - Patterns and best practices for migrating to diagnostics-based metrics + diff --git a/docs/METRICS_OVERVIEW.md b/docs/METRICS_OVERVIEW.md new file mode 100644 index 000000000..c6aa1a535 --- /dev/null +++ b/docs/METRICS_OVERVIEW.md @@ -0,0 +1,573 @@ +# Metrics Migration Guide for VTEX IO Apps + +This document provides comprehensive guidance for migrating from the legacy `MetricsAccumulator` API to the new `DiagnosticsMetrics` API, including patterns, best practices, and production-validated examples. + +> **Looking for the complete metrics catalog?** See [METRICS_CATALOG.md](./METRICS_CATALOG.md) for a comprehensive list of all available metrics. 
+ +## Table of Contents + +- [Why Migrate?](#why-migrate) +- [Quick Start](#quick-start) +- [Common Migration Patterns](#common-migration-patterns) +- [What Doesn't Need Migration](#what-doesnt-need-migration) +- [Additional Examples from Production Apps](#additional-examples-from-production-apps) +- [Best Practices for Metrics Design](#best-practices-for-metrics-design) +- [Troubleshooting](#troubleshooting) +- [FAQ](#faq) + +--- + +## Why Migrate? + +The new `DiagnosticsMetrics` API provides: + +✅ **Better Performance**: No in-memory aggregation, lower memory overhead +✅ **Modern Observability**: OpenTelemetry-based metrics exported to backend +✅ **Better Dashboards**: Attribute-based metrics for flexible querying +✅ **Cardinality Control**: Built-in limits to prevent metric explosion +✅ **Type Safety**: Full TypeScript support with clear APIs + +--- + +## Quick Start + +### Before (Legacy API) + +```typescript +import { MetricsAccumulator } from '@vtex/api' + +const metrics = new MetricsAccumulator() + +const start = process.hrtime() +const result = await fetchData() +metrics.batch('fetch-data', process.hrtime(start), { success: 1 }) +``` + +### After (New API) + +```typescript +// DiagnosticsMetrics is available globally +const { diagnosticsMetrics } = global + +const start = process.hrtime() +const result = await fetchData() +diagnosticsMetrics.recordLatency(process.hrtime(start), { + operation: 'fetch-data', + status: 'success' +}) +diagnosticsMetrics.incrementCounter('fetch_data_total', 1, { + status: 'success' +}) +``` + +--- + +## Common Migration Patterns + +### Pattern 1: Simple Latency Recording + +**Before:** +```typescript +const start = process.hrtime() +const result = await apiCall() +metrics.batch('api-call', process.hrtime(start)) +``` + +**After:** +```typescript +const start = process.hrtime() +const result = await apiCall() +global.diagnosticsMetrics.recordLatency(process.hrtime(start), { + operation: 'api-call', + status: 'success' +}) +``` + 
+> 📌 **Production Example:** See this pattern in action in [render-to-string's `trackOperation()` utility](https://github.com/vtex/render-to-string/blob/master/node/utils/metrics.ts), which wraps operations with `recordLatency()`. + +### Pattern 2: Latency with Success/Error Tracking + +**Before:** +```typescript +const start = process.hrtime() +try { + const result = await apiCall() + metrics.batch('api-call', process.hrtime(start), { success: 1 }) + return result +} catch (error) { + metrics.batch('api-call', process.hrtime(start), { error: 1 }) + throw error +} +``` + +**After:** +```typescript +const start = process.hrtime() +try { + const result = await apiCall() + global.diagnosticsMetrics.recordLatency(process.hrtime(start), { + operation: 'api-call', + status: 'success' + }) + return result +} catch (error) { + global.diagnosticsMetrics.recordLatency(process.hrtime(start), { + operation: 'api-call', + status: 'error' + }) + throw error +} +``` + +> 📌 **Production Example:** The [render-to-string's `emitMetrics()` function](https://github.com/vtex/render-to-string/blob/master/node/utils/metrics.ts) records latency with `status: 'success'` or `status: 'error'` based on whether the operation succeeded or failed. + +### Pattern 3: Mixed Extensions (Numbers and Strings) + +**Before:** +```typescript +metrics.batch('http-request', elapsed, { + success: 1, // Counter + '2xx': 1, // Counter + 'cache-hit': 1, // Counter + region: 'us' // Attribute +}) +``` + +**After:** +```typescript +// Record latency with attributes +global.diagnosticsMetrics.recordLatency(elapsed, { + operation: 'http-request', + status: '2xx', + cache: 'hit', + region: 'us' +}) +``` + +> 📌 **Production Example:** In [render-to-string's render middleware](https://github.com/vtex/render-to-string/blob/master/node/middleware/render.ts), extra attributes like `{ template: templateName }` are passed alongside latency recordings. 
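When many call sites need the same legacy-extensions-to-attributes translation shown in Pattern 3, a small helper can do it mechanically. This is a hypothetical sketch, not part of `@vtex/api`; the classification rules simply encode the conventions used in the example above.

```typescript
// Hypothetical helper: fold a legacy extensions object (numeric flags
// like `success: 1` or `'cache-hit': 1`, plus string attributes) into a
// flat attribute map suitable for recordLatency().
type Extensions = Record<string, string | number>
type Attributes = Record<string, string>

function extensionsToAttributes(extensions: Extensions): Attributes {
  const attrs: Attributes = {}
  for (const [key, value] of Object.entries(extensions)) {
    if (typeof value === 'string') {
      attrs[key] = value                        // already an attribute, e.g. region: 'us'
    } else if (/^\dxx$/.test(key)) {
      attrs.status = key                        // status-class flags, e.g. '2xx': 1
    } else if (key.startsWith('cache-')) {
      attrs.cache = key.slice('cache-'.length)  // cache flags, e.g. 'cache-hit': 1
    }
    // plain success/error flags are implied by `status` and dropped here
  }
  return attrs
}
```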
+ +### Pattern 4: Cache Tracking + +**Before:** +```typescript +// Manual cache stats collection +const stats = cache.getStats() +console.log(`Cache hits: ${stats.hits}, misses: ${stats.misses}`) +``` + +**After:** +```typescript +// Direct API calls for cache metrics +const stats = cache.getStats() +global.diagnosticsMetrics.incrementCounter('cache_hits_total', stats.hits, { + cache: 'my-cache' +}) +global.diagnosticsMetrics.incrementCounter('cache_misses_total', stats.misses, { + cache: 'my-cache' +}) +global.diagnosticsMetrics.setGauge('cache_items_current', stats.size, { + cache: 'my-cache' +}) +``` + +> 📌 **Production Example:** The [render-to-string's `recordCacheMetric()` function](https://github.com/vtex/render-to-string/blob/master/node/utils/metrics.ts) uses `incrementCounter('cache_operations_total', 1, { cache, cache_state })` for unified cache tracking. + +--- + +## What Doesn't Need Migration + +### HTTP Client with `metric:` Config Option + +**No changes needed!** The HTTP client middleware was already migrated internally. + +```typescript +// This already uses DiagnosticsMetrics internally +this.http.get(`/user/${email}/isAdmin`, { + metric: 'sphinx-is-admin' // ✅ Works automatically +}) +``` + +> 📌 **Production Example:** See this in [render-to-string's Assets client](https://github.com/vtex/render-to-string/blob/master/node/clients/assets.ts) using `metric: 'assets-fetch'`. + +### GraphQL `@metric` Directive + +**No changes needed!** The directive was already migrated internally. It is applied in your schema (SDL), not in resolver code: + +```graphql +# ✅ Already using DiagnosticsMetrics internally +type Query { + products: [Product] @metric +} +``` + +--- + +## Additional Examples from Production Apps + +Beyond the basic migration patterns (1-4), apps with complex instrumentation needs can benefit from additional patterns. 
The **[render-to-string](https://github.com/vtex/render-to-string)** app demonstrates these advanced techniques in production. + +### Centralized Metrics Utility + +When your app has many operations that need instrumentation, reduce boilerplate by creating a centralized utility that combines metrics, logging, and backward compatibility. + +**render-to-string implementation:** [`node/utils/metrics.ts`](https://github.com/vtex/render-to-string/blob/master/node/utils/metrics.ts) + +```typescript +// The trackOperation() utility wraps any operation with standardized instrumentation +import { hrToMillis } from '@vtex/api' + +export interface OperationContext { + account: string + workspace: string + operationId: string + logger: { + info: (data: Record<string, unknown>) => void + error: (data: Record<string, unknown>) => void + } +} + +export interface TrackOperationOptions { + name: string + ctx: OperationContext + extraAttributes?: Record<string, string> +} + +function emitMetrics( + options: TrackOperationOptions, + elapsed: [number, number], + status: 'success' | 'error' +): void { + const { name, ctx, extraAttributes = {} } = options + const { account, workspace, operationId, logger } = ctx + const timeMs = hrToMillis(elapsed) + + // Legacy metrics API (backward compatibility) + metrics.batchMetric(name, timeMs, { account, workspace, status }) + + // New diagnostics metrics API (OTel-compliant) + global.diagnosticsMetrics?.recordLatency(elapsed, { + operation: name, + status, + ...extraAttributes, + }) + + // Structured logging + const logData = { message: name, operationId, timeMs, ...extraAttributes } + status === 'success' ? 
logger.info(logData) : logger.error({ ...logData, error: true }) +} + +export async function trackOperation<T>( + options: TrackOperationOptions, + fn: () => T | Promise<T> +): Promise<T> { + const start = process.hrtime() + try { + const result = await fn() + emitMetrics(options, process.hrtime(start), 'success') + return result + } catch (error) { + emitMetrics(options, process.hrtime(start), 'error') + throw error + } +} +``` + +**Why this works:** Reduces ~250 lines of boilerplate to ~30 lines while maintaining dual API support. + +### Tracking Multiple Operations in a Flow + +For complex middleware with multiple sub-operations, use the utility for each logical step. + +**render-to-string implementation:** [`node/middleware/render.ts`](https://github.com/vtex/render-to-string/blob/master/node/middleware/render.ts) + +```typescript +export async function render(ctx: Context, next: () => Promise<void>) { + const { vtex: { account, workspace, operationId, logger } } = ctx + const metricsCtx = { account, workspace, operationId, logger } + + // Each operation is tracked independently + const compiledScripts = await trackOperation( + { name: 'vm-script', ctx: metricsCtx }, + () => getCompiledScripts(assetsClient, assets) + ) + + await trackOperation( + { name: 'vm-run-in-context', ctx: metricsCtx }, + () => compiledScripts.forEach(script => vm.run(script)) + ) + + const rendered = await trackOperation( + { name: 'vm-global-rendered', ctx: metricsCtx }, + () => vm.run('global.rendered') + ) + + await next() +} +``` + +### Recording Pre-Measured Metrics + +When timing data comes from external sources (e.g., VM sandbox, third-party libraries), record it separately. 
+ +**render-to-string implementation:** [`node/middleware/render.ts`](https://github.com/vtex/render-to-string/blob/master/node/middleware/render.ts) + +```typescript +// Timings captured inside VM are recorded after extraction +const { getDataFromTree, renderToString } = renderMetrics[templateName] +if (getDataFromTree) { + recordExternalMetrics( + { name: 'data-from-tree-ssr', ctx: metricsCtx, extraAttributes: { template: templateName } }, + getDataFromTree + ) +} +if (renderToString) { + recordExternalMetrics( + { name: 'to-string-ssr', ctx: metricsCtx, extraAttributes: { template: templateName } }, + renderToString + ) +} +``` + +### Error Classification for Debugging + +Classify error types as attributes for better debugging and alerting. + +**render-to-string implementation:** [`node/middleware/render.ts`](https://github.com/vtex/render-to-string/blob/master/node/middleware/render.ts) + +```typescript +try { + // ... operation +} catch (e) { + // Classify error type for metrics + let errorType = 'unknown' + if (e instanceof SSRFailError) { + errorType = 'ssr-fail' + } else if (e.code === 'ETIMEDOUT') { + errorType = 'timeout' + } else if (e.code === 'ECONNREFUSED') { + errorType = 'connection-refused' + } else if (e.name === 'TimeoutError') { + errorType = 'vm-timeout' + } + + global.diagnosticsMetrics?.incrementCounter('render_errors_total', 1, { + error_type: errorType, + }) + + throw e +} +``` + +### Unified Cache Metrics + +Create a reusable function for consistent cache tracking across the app. 
+ +**render-to-string implementation:** [`node/utils/metrics.ts`](https://github.com/vtex/render-to-string/blob/master/node/utils/metrics.ts) + +```typescript +export function recordCacheMetric( + cacheName: string, + state: 'hit' | 'miss' | 'bypass' +): void { + global.diagnosticsMetrics?.incrementCounter('cache_operations_total', 1, { + cache: cacheName, + cache_state: state, + }) +} +``` + +--- + +## Best Practices for Metrics Design + +These best practices are based on OpenTelemetry guidelines and the DiagnosticsMetrics API design. They have been validated in production through apps like render-to-string. + +### 1. Use a Single Histogram with Attributes + +**Don't:** Create separate histograms for each operation +```typescript +// ❌ Bad - creates cardinality explosion +global.diagnosticsMetrics.recordLatency(elapsed, { operation: 'fetch-user-123' }) +``` + +**Do:** Use consistent operation names with attributes +```typescript +// ✅ Good - low cardinality +global.diagnosticsMetrics.recordLatency(elapsed, { + operation: 'fetch-user', + status: 'success' +}) +``` + +### 2. Maintain Backward Compatibility + +**Do:** Emit to both APIs during migration +```typescript +// ✅ Emit to both during migration +metrics.batchMetric(name, timeMs, { account, workspace, status }) +global.diagnosticsMetrics?.recordLatency(elapsed, { operation: name, status }) +``` + +### 3. Use Optional Chaining for Safety + +**Do:** Handle uninitialized state gracefully +```typescript +// ✅ Safe - won't crash if not initialized +global.diagnosticsMetrics?.recordLatency(elapsed, attributes) +``` + +### 4. Keep Attributes Low Cardinality + +**Don't:** +```typescript +// ❌ Millions of unique values +{ user_id: '12345', request_id: 'abc-123' } +``` + +**Do:** +```typescript +// ✅ Limited set of values +{ endpoint: '/users', status: 'success', region: 'us-east' } +``` + +### 5. 
Limit to 5 Attributes Maximum + +```typescript +// ✅ Good - 5 attributes +global.diagnosticsMetrics.recordLatency(elapsed, { + operation: 'api-call', + status: 'success', + endpoint: '/users', + region: 'us-east', + cache: 'hit' +}) + +// ❌ Too many - extra will be dropped +global.diagnosticsMetrics.recordLatency(elapsed, { + attr1: 'val1', attr2: 'val2', attr3: 'val3', + attr4: 'val4', attr5: 'val5', attr6: 'val6', // Dropped! + attr7: 'val7' // Dropped! +}) +``` + +### 6. Follow Naming Conventions + +| Metric Type | Pattern | Example | +|-------------|---------|---------| +| Histogram | `{component}_{measurement}_duration_ms` | `http_client_request_duration_ms` | +| Counter | `{component}_{event}_total` | `http_requests_total` | +| Gauge | `{component}_{measurement}_current` | `cache_items_current` | + +### 7. Include Context in Logs + +```typescript +// ✅ Good - traceable logs +logger.info({ + message: 'operation-name', + operationId, // Correlation ID + timeMs, // Duration + ...extraAttributes +}) +``` + +--- + +## Troubleshooting + +### Problem: `diagnosticsMetrics is undefined` + +**Cause:** Accessing `global.diagnosticsMetrics` before service initialization. + +**Solution:** Add a check: +```typescript +if (!global.diagnosticsMetrics) { + console.warn('DiagnosticsMetrics not initialized') + return +} + +global.diagnosticsMetrics.recordLatency(...) +``` + +Or use optional chaining: +```typescript +global.diagnosticsMetrics?.recordLatency(...) +``` + +### Problem: Metrics not appearing in dashboards + +**Checklist:** +1. ✅ Verify `@vtex/api` version is up to date +2. ✅ Check `DIAGNOSTICS_TELEMETRY_ENABLED` environment variable is set +3. ✅ Ensure operation names are consistent (no typos) +4. ✅ Verify attributes have low cardinality (avoid unique IDs) +5. ✅ Check observability backend for metric ingestion + +### Problem: High cardinality warnings + +**Cause:** Too many unique attribute combinations. 
+ +**Solution:** Normalize attribute values to a limited set: +```typescript +// ❌ Bad +{ user_id: userId } + +// ✅ Good +{ user_type: 'premium' } // or 'standard', 'guest' +``` + +### Problem: More than 5 attributes warning + +**Solution:** Reduce to most important attributes. The library will truncate to 5. + +--- + +## FAQ + +### Q: Do I need to migrate immediately? + +**A:** No. The legacy `MetricsAccumulator` API continues to work. Both APIs coexist independently. Migrate at your own pace. + +### Q: Can I use both APIs in the same app? + +**A:** Yes! Both `global.metrics` (legacy) and `global.diagnosticsMetrics` (new) are available. You can migrate gradually. + +### Q: What happens to my existing dashboards? + +**A:** Legacy metrics continue to be exported. New metrics have different names and use attributes. You'll need to update dashboards when you migrate. + +### Q: How do I know which metric names to use? + +**A:** Follow these conventions: +- Histograms: `{component}_{measurement}_duration_ms` (e.g., `http_client_request_duration_ms`) +- Counters: `{component}_{event}_total` (e.g., `http_requests_total`) +- Gauges: `{component}_{measurement}_current` (e.g., `cache_items_current`) + +### Q: What about the `metric:` parameter in HTTP client config? + +**A:** It continues to work! The HTTP client was updated internally to use `DiagnosticsMetrics` while maintaining backward compatibility. + +### Q: Should I remove `MetricsAccumulator` imports? + +**A:** Not required, but recommended for new code. For existing code, migrate when you touch that code. + +### Q: What's the performance impact? + +**A:** The new API has lower overhead (<500ns per recording) and uses less memory (no in-memory aggregation). 
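The cardinality and attribute-count issues described in Troubleshooting can also be prevented mechanically before recording. The guard below is a hypothetical sketch (its name and heuristics are assumptions, not part of `@vtex/api`): it keeps at most five attributes and drops values that look like unbounded identifiers.

```typescript
// Hypothetical pre-recording guard enforcing the attribute rules above.
type Attributes = Record<string, string>

const MAX_ATTRIBUTES = 5
// Heuristic: long digit runs or UUID-ish fragments suggest unbounded cardinality
const HIGH_CARDINALITY = /\d{4,}|[0-9a-f]{8}-[0-9a-f]{4}/i

function sanitizeAttributes(attrs: Attributes): Attributes {
  const out: Attributes = {}
  for (const [key, value] of Object.entries(attrs)) {
    if (Object.keys(out).length >= MAX_ATTRIBUTES) break // limit reached: drop the rest
    if (HIGH_CARDINALITY.test(value)) continue           // skip likely-unique values
    out[key] = value
  }
  return out
}
```

A call site would then pass `sanitizeAttributes(attrs)` to `recordLatency()` instead of raw attributes.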
+ +--- + +## Additional Resources + +- [Metrics Catalog](./METRICS_CATALOG.md) - Complete list of all available metrics +- [VTEX IO Documentation](https://developers.vtex.com/docs/guides/vtex-io-documentation-what-is-vtex-io) + +## Support + +If you need help with migration: +- Check the [Troubleshooting](#troubleshooting) section +- Review the [production examples](#additional-examples-from-production-apps) +- Open an issue in the node-vtex-api repository