feat(api): add Prometheus metrics endpoint #63

Merged: pescn merged 8 commits into main from feat/prometheus-metrics-api on Jan 28, 2026

pescn (Contributor) commented Jan 25, 2026

Summary

  • Implements #28 (Add metrics API to NexusGate): Prometheus Metrics API
  • Adds GET /metrics endpoint returning Prometheus exposition format
  • Includes optional Prometheus + Grafana monitoring stack
  • Updates quick-start.sh with optional monitoring installation

Metrics Exposed

| Metric | Type | Labels |
|---|---|---|
| nexusgate_completions_total | Counter | model, status, api_format, api_key |
| nexusgate_embeddings_total | Counter | model, status, api_key |
| nexusgate_tokens_prompt_total | Counter | model |
| nexusgate_tokens_completion_total | Counter | model |
| nexusgate_tokens_embedding_total | Counter | model |
| nexusgate_completion_duration_seconds | Histogram | model |
| nexusgate_completion_ttft_seconds | Histogram | model |
| nexusgate_embedding_duration_seconds | Histogram | model |
| nexusgate_active_api_keys | Gauge | - |
| nexusgate_active_providers | Gauge | - |
| nexusgate_active_models | Gauge | type |
| nexusgate_info | Gauge | version |

Files Added/Modified

  • backend/src/api/metrics.ts - Route definition
  • backend/src/services/prometheus.ts - Prometheus format serialization
  • backend/src/db/index.ts - Added 6 query functions for metrics aggregation
  • backend/src/index.ts - Registered /metrics route
  • docker-compose.monitoring.yaml - Docker Compose override for Prometheus + Grafana
  • grafana/provisioning/ - Grafana datasource and dashboard provisioning
  • prometheus/prometheus.yml - Prometheus scrape configuration
  • scripts/quick-start.sh - Added optional monitoring installation
  • python_test_code/test_metrics.py - Integration tests

Test plan

  • Build succeeds (bun run build)
  • Type check passes (bun run check)
  • Lint passes (bun run lint)
  • Python integration tests pass (7/7 tests)
  • Manual curl test: curl http://localhost:3000/metrics
  • Test with Prometheus scraper
  • Test Grafana dashboard visualization

Usage

# Fetch metrics
curl http://localhost:3000/metrics

# Start with monitoring stack
docker compose -f docker-compose.yaml -f docker-compose.monitoring.yaml up -d

Closes #28

🤖 Generated with Claude Code

Summary by CodeRabbit

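  • New features

    • Added a /metrics endpoint that exports runtime metrics in Prometheus format
    • Built-in metrics generator supporting counters, gauges, and histograms, with output caching and fallback
  • Docs/configuration

    • Optional installation of a Prometheus + Grafana monitoring stack, with a preconfigured datasource, dashboard, and scrape configuration
    • Provides startup and environment configuration hints for monitoring
  • Tests

    • Added a standalone /metrics validation test script
  • Chores

    • Rate-limit rejection counts are now recorded and included in monitoring
    • Updated ignore rules to track the newly added test file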


Implements #28: Prometheus Metrics API

New /metrics endpoint exposing:
- nexusgate_completions_total (counter by model, status, api_format, api_key)
- nexusgate_embeddings_total (counter by model, status, api_key)
- nexusgate_tokens_prompt_total, nexusgate_tokens_completion_total (counters)
- nexusgate_completion_duration_seconds, nexusgate_completion_ttft_seconds (histograms)
- nexusgate_embedding_duration_seconds (histogram)
- nexusgate_active_api_keys, nexusgate_active_providers, nexusgate_active_models (gauges)
- nexusgate_info (build info gauge)

Also includes:
- Optional Prometheus + Grafana monitoring stack via docker-compose.monitoring.yaml
- Pre-configured Grafana dashboard with request rates, latency, tokens, errors
- Updated quick-start.sh with optional monitoring installation
- Integration tests for metrics endpoint

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello @pescn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the observability of the LLM gateway by integrating a comprehensive Prometheus and Grafana monitoring solution. It provides a dedicated API endpoint to expose key operational metrics, backed by new database queries for data aggregation. The addition of an optional monitoring stack and an updated quick-start script simplifies deployment and allows users to gain immediate insights into the system's performance, usage, and health through pre-built dashboards.

Highlights

  • Prometheus Metrics Endpoint: Introduced a new /metrics endpoint that exposes operational metrics in Prometheus exposition format, allowing for detailed monitoring of the LLM gateway.
  • Integrated Monitoring Stack: Added an optional Prometheus and Grafana monitoring stack via a docker-compose.monitoring.yaml override, complete with pre-configured Grafana dashboards for NexusGate.
  • Database Metrics Aggregation: Implemented new database query functions to aggregate various metrics, including completion and embedding counts, token usage, and latency histograms (duration and time-to-first-token).
  • Enhanced Quick-Start Script: Updated the quick-start.sh script to offer an optional installation of the monitoring stack, guiding users through its setup and configuration.
  • Metrics Integration Tests: Added a new Python integration test suite (test_metrics.py) to validate the functionality and format of the /metrics endpoint.


coderabbitai Bot commented Jan 25, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

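📝 Walkthrough

Adds the /metrics Prometheus endpoint, a metrics generation service (with Redis caching), database aggregation and histogram queries, rate-limit rejection counting, monitoring stack (Prometheus + Grafana) configuration, install-script support, and a standalone Python test script.

Changes

| Cohort | File(s) | Change summary |
|---|---|---|
| API endpoint | backend/src/api/metrics.ts | Adds the GET /metrics route and exports metricsApi, which returns Prometheus text (text/plain). |
| Metrics generation service | backend/src/services/prometheus.ts | Adds generatePrometheusMetrics(): fetches DB/Redis data in parallel, builds counters/gauges/histograms, and emits Prometheus exposition text; includes label escaping, histogram rendering, Redis caching, and fallback logic. |
| Database aggregation | backend/src/db/index.ts | Adds aggregation functions and a constant (completions/embeddings totals by model/status, the various histograms, API key rate-limit configuration, and active entity counts): getCompletionMetricsByModelAndStatus, getEmbeddingMetricsByModelAndStatus, getCompletionDurationHistogram, getCompletionTTFTHistogram, getEmbeddingDurationHistogram, getApiKeyRateLimitConfig, getActiveEntityCounts, LATENCY_BUCKETS_MS. |
| App integration | backend/src/index.ts | Mounts metricsApi on the Elysia app; updates the SPA route skip logic to include /metrics. |
| Rate-limit monitoring | backend/src/plugins/apiKeyRateLimitPlugin.ts, backend/src/utils/redisClient.ts | Adds trackRateLimitRejection to the rate-limit plugin and exports getRateLimitRejections(); RedisClient gains hincrby and hgetall for hash-based counting and reads, and the rejection path now calls the tracker to record each rejection. |
| Configuration | backend/src/utils/config.ts | Adds the METRICS_CACHE_TTL_SECONDS environment setting (default 30 seconds). |
| Monitoring deployment | docker-compose.monitoring.yaml, prometheus/prometheus.yml | Adds a docker-compose overlay (Prometheus, Grafana) and a Prometheus scrape configuration (scrapes nexusgate:3000/metrics). |
| Grafana configuration | grafana/provisioning/datasources/prometheus.yml, grafana/provisioning/dashboards/dashboards.yml, grafana/provisioning/dashboards/json/nexusgate-dashboard.json | Adds the Prometheus datasource, the dashboard provisioning config, and a large dashboard JSON (multiple panels: latency, request rate, histograms, and so on, aggregated by model/status). |
| Install script | scripts/quick-start.sh | Adds an interactive ask_monitoring prompt, monitoring-related variables, and download/startup logic; the monitoring config is downloaded and started conditionally. |
| Tests | python_test_code/test_metrics.py | Adds a standalone Python test suite with a Prometheus text parser and several /metrics checks (HTTP status, Content-Type, format, histograms, and sample output). |
| Version control | .gitignore | Adds an exception so that python_test_code/test_metrics.py is tracked (and therefore committed). |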

Sequence Diagram(s)

sequenceDiagram
    participant Client as HTTP Client
    participant API as /metrics Endpoint
    participant Service as Metrics Service
    participant Redis as Redis Cache
    participant DB as Database
    participant Formatter as Formatter

    rect rgba(200,200,255,0.5)
    Client->>API: GET /metrics
    end
    API->>Service: call generatePrometheusMetrics()
    Service->>Redis: GET cached metrics
    alt cache hit
        Redis-->>Service: cached metrics string
    else cache miss
        par Parallel DB queries
            Service->>DB: getCompletionMetricsByModelAndStatus()
            Service->>DB: getEmbeddingMetricsByModelAndStatus()
            Service->>DB: getCompletionDurationHistogram()
            Service->>DB: getCompletionTTFTHistogram()
            Service->>DB: getEmbeddingDurationHistogram()
            Service->>DB: getActiveEntityCounts()
            Service->>DB: getApiKeyRateLimitConfig()
        end
        DB-->>Service: aggregated rows
        Service->>Formatter: render counters/gauges/histograms (ms→s, escape labels)
        Formatter-->>Service: metrics text
        Service->>Redis: SET cache (METRICS_CACHE_TTL_SECONDS)
    end
    Service-->>API: metrics text
    API-->>Client: 200 + text/plain
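
The cache hit/miss branch in the diagram above is a cache-aside pattern. A minimal TypeScript sketch of that flow follows; it is not the code from backend/src/services/prometheus.ts. The `KeyValueCache` interface, the `InMemoryCache` class, the `CACHE_KEY` name, and `getMetricsText` are all hypothetical stand-ins (the PR uses Redis and METRICS_CACHE_TTL_SECONDS), chosen so the example runs without external services.

```typescript
// Hypothetical stand-in for the Redis client used in the PR.
interface KeyValueCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// In-memory implementation, only here to make the sketch runnable.
class InMemoryCache implements KeyValueCache {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= Date.now()) return null; // missing or expired
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}

const CACHE_KEY = "metrics:prometheus"; // hypothetical key name

// Cache-aside: return the cached text if present, otherwise build and store it.
async function getMetricsText(
  cache: KeyValueCache,
  ttlSeconds: number,
  build: () => Promise<string>,
): Promise<string> {
  const cached = await cache.get(CACHE_KEY);
  if (cached !== null) return cached; // cache hit: no DB work at all
  const fresh = await build(); // cache miss: run the expensive aggregation
  await cache.set(CACHE_KEY, fresh, ttlSeconds);
  return fresh;
}
```

With a TTL of 30 seconds, repeated scrapes inside that window would reuse one rendered string, so the database is queried at most once per window regardless of how many scrapers hit the endpoint.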

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

Poem

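🐰 I sniff for carrots in the code field,
metrics line up row by row like carrots,
Prometheus lights the lamps at night, Grafana lays out the feast,
I hop lightly between the cache and the histograms,
celebrating the launch with a bite of sweet carrot! 🥕✨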

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
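| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'feat(api): add Prometheus metrics endpoint' clearly and accurately summarizes the main change, adding a Prometheus metrics endpoint, and is fully relevant to the whole changeset. |
| Linked Issues check | ✅ Passed | The code changes fully satisfy the requirements of Issue #28: they implement the /metrics endpoint, return Prometheus-format data, expose the specified counters, histograms, and gauges with the required label dimensions, and provide Prometheus and Grafana integration configuration. |
| Out of Scope Changes check | ✅ Passed | All changes are closely tied to the goals of Issue #28. The added and modified files directly support the implementation, testing, and monitoring-stack integration of the Prometheus metrics endpoint; no unrelated changes were found. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 94.74% which is sufficient. The required threshold is 80.00%. |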


🧹 Recent nitpick comments
backend/src/db/index.ts (1)

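1509-1527: Well implemented; consider adding defensive validation.

The single query with subqueries efficiently reduces the number of database round trips. The Number() conversion combined with nullish coalescing handles the edge cases.

Optional minor improvement: consider adding more explicit error handling or logging when row is undefined, although in theory this cannot happen for an aggregate query.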

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cef8036 and 91bce92.

📒 Files selected for processing (1)
  • backend/src/db/index.ts
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-24T18:23:42.635Z
Learnt from: pescn
Repo: EM-GeekLab/NexusGate PR: 59
File: backend/src/api/v1/responses.ts:566-577
Timestamp: 2026-01-24T18:23:42.635Z
Learning: When using Elysia with the apiKeyPlugin, if a route option has checkApiKey: true, the apiKeyRecord parameter is guaranteed to be non-null. Do not add explicit null checks or non-null assertions for apiKeyRecord in such routes; rely on this contract and avoid unnecessary guards to improve readability. If there is any doubt about the guarantee in a specific place, add a runtime assertion at the boundary where the plugin is configured, but otherwise omit null checks for apiKeyRecord in backend code under backend/.

Applied to files:

  • backend/src/db/index.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Code Quality Check
🔇 Additional comments (6)
backend/src/db/index.ts (6)

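1360-1385: LGTM!

The function is well structured: the LEFT JOIN correctly handles api_key records that may be missing, and COALESCE ensures api_key_comment always has a value. The type assertion is necessary for Drizzle raw SQL queries.


1392-1413: LGTM!

Follows the same implementation pattern as getCompletionMetricsByModelAndStatus.


1415-1425: LGTM!

The bucket boundary constants have been moved to module scope (per earlier review feedback). The range (100 ms to 2 minutes) is reasonable for LLM latency monitoring.


1435-1448: LGTM!

The use of sql.raw() is safe because DURATION_BUCKET_CASES is derived from numeric constants. The comment clearly explains why SUM is used rather than AVG (the Prometheus histogram format requires it). Per earlier review feedback, the duplicate COUNT(*) has been removed.


1454-1467: LGTM!

Correctly adds the status = 'completed' filter, since TTFT (time to first token) is only meaningful for successfully completed requests.


1473-1486: LGTM!

Follows the same implementation pattern as getCompletionDurationHistogram.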


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new Prometheus metrics endpoint and integrates it with an optional Grafana monitoring stack. The changes include new database queries for metrics aggregation, a service to format metrics in Prometheus exposition format, updates to the main application to expose the endpoint, and a comprehensive update to the quick-start.sh script for optional monitoring setup. The Grafana dashboard configuration is also included, providing good visualization for the new metrics. The Python integration tests for the metrics endpoint are a valuable addition, ensuring the endpoint's correctness and adherence to the Prometheus format. Some minor code redundancies were identified, and a potential security concern regarding API key exposure in metrics was noted, suggesting anonymization.

Comment thread backend/src/db/index.ts Outdated
Comment thread backend/src/services/prometheus.ts Outdated
Comment thread backend/src/services/prometheus.ts Outdated
Comment thread python_test_code/test_metrics.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@backend/src/api/metrics.ts`:
- Around line 4-18: The /metrics endpoint (metricsApi) currently exposes
Prometheus metrics including an api_key/api_key_id label from
generatePrometheusMetrics, which leaks key identifiers; change the endpoint to
enforce an optional Bearer token: read a METRICS_BEARER_TOKEN (or similar) env
var and if it is set, validate the incoming Authorization header ("Bearer
<token>") and return 401 when missing/invalid; if the env var is unset keep
current public behavior. Additionally modify generatePrometheusMetrics (or add a
parameter like redactApiKeyLabels) to strip or omit any api_key/api_key_id
labels from the output so key identifiers are never emitted even when metrics
are public, and ensure metricsApi uses the redaction option when calling
generatePrometheusMetrics.

In `@backend/src/index.ts`:
- Around line 155-157: The current guard only checks path === "/metrics" so a
request to "/metrics/" falls through to the SPA; update the conditional that
checks the request path (the expression using path.startsWith("/api") ||
path.startsWith("/v1") || path === "/metrics") to also match the trailing-slash
variant (e.g., add path === "/metrics/" or equivalent) so requests to both
"/metrics" and "/metrics/" return the 404 via status(404).
🧹 Nitpick comments (3)
scripts/quick-start.sh (1)

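456-459: Possible edge-case problems when reading configuration from .env

When the ENABLE_MONITORING value in the .env file contains special characters or spaces, the grep | cut approach may fail to parse it correctly.

♻️ Suggested: use a more robust parsing approach
         # Read the monitoring config from the existing .env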
         if [ -f ".env" ]; then
-            ENABLE_MONITORING=$(grep "ENABLE_MONITORING=" .env 2>/dev/null | cut -d '=' -f2 | tr -d ' ' || echo "false")
+            ENABLE_MONITORING=$(grep "^ENABLE_MONITORING=" .env 2>/dev/null | cut -d '=' -f2- | tr -d ' "'"'"'' || echo "false")
         fi
python_test_code/test_metrics.py (1)

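238-241: Remove unnecessary f-string prefixes

These strings contain no placeholders, so they do not need the f-string prefix. A static analysis tool flagged this.

♻️ Suggested change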
-            print(f"    Has _bucket: Yes")
-            print(f"    Has _sum: Yes")
-            print(f"    Has _count: Yes")
-            print(f"    Has +Inf bucket: Yes")
+            print("    Has _bucket: Yes")
+            print("    Has _sum: Yes")
+            print("    Has _count: Yes")
+            print("    Has +Inf bucket: Yes")
docker-compose.monitoring.yaml (1)

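6-6: Use pinned image version tags instead of :latest

Using the :latest tag can lead to unpredictable behavior changes in production. Pinned version tags are recommended to ensure reproducible deployments.

♻️ Suggested change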
-    image: "prom/prometheus:latest"
+    image: "prom/prometheus:v3.0.1"
-    image: "grafana/grafana:latest"
+    image: "grafana/grafana:11.4.0"

Also applies to: line 23

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50b8b8e and 82eb65a.

📒 Files selected for processing (12)
  • .gitignore
  • backend/src/api/metrics.ts
  • backend/src/db/index.ts
  • backend/src/index.ts
  • backend/src/services/prometheus.ts
  • docker-compose.monitoring.yaml
  • grafana/provisioning/dashboards/dashboards.yml
  • grafana/provisioning/dashboards/json/nexusgate-dashboard.json
  • grafana/provisioning/datasources/prometheus.yml
  • prometheus/prometheus.yml
  • python_test_code/test_metrics.py
  • scripts/quick-start.sh
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-24T18:23:42.635Z
Learnt from: pescn
Repo: EM-GeekLab/NexusGate PR: 59
File: backend/src/api/v1/responses.ts:566-577
Timestamp: 2026-01-24T18:23:42.635Z
Learning: When using Elysia with the apiKeyPlugin, if a route option has checkApiKey: true, the apiKeyRecord parameter is guaranteed to be non-null. Do not add explicit null checks or non-null assertions for apiKeyRecord in such routes; rely on this contract and avoid unnecessary guards to improve readability. If there is any doubt about the guarantee in a specific place, add a runtime assertion at the boundary where the plugin is configured, but otherwise omit null checks for apiKeyRecord in backend code under backend/.

Applied to files:

  • backend/src/api/metrics.ts
  • backend/src/services/prometheus.ts
  • backend/src/db/index.ts
  • backend/src/index.ts
🧬 Code graph analysis (3)
backend/src/api/metrics.ts (1)
backend/src/services/prometheus.ts (1)
  • generatePrometheusMetrics (105-299)
backend/src/services/prometheus.ts (2)
backend/src/db/index.ts (7)
  • LATENCY_BUCKETS_MS (1412-1412)
  • getCompletionMetricsByModelAndStatus (1359-1383)
  • getEmbeddingMetricsByModelAndStatus (1389-1409)
  • getCompletionDurationHistogram (1418-1436)
  • getCompletionTTFTHistogram (1442-1460)
  • getEmbeddingDurationHistogram (1466-1484)
  • getActiveEntityCounts (1489-1528)
backend/src/utils/config.ts (1)
  • COMMIT_SHA (104-104)
backend/src/index.ts (1)
backend/src/api/metrics.ts (1)
  • metricsApi (9-26)
🪛 Ruff (0.14.13)
python_test_code/test_metrics.py

238-238: f-string without any placeholders. Remove extraneous f prefix (F541)
239-239: f-string without any placeholders. Remove extraneous f prefix (F541)
240-240: f-string without any placeholders. Remove extraneous f prefix (F541)
241-241: f-string without any placeholders. Remove extraneous f prefix (F541)
290-290: Do not catch blind exception: Exception (BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Code Quality Check
🔇 Additional comments (22)
.gitignore (1)

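32-32: The metrics test file is now under version control
This ensures the new integration tests land in the repository with the PR and are covered by CI.

backend/src/index.ts (1)

19-19: The metrics route is correctly mounted on the main app
The import and registration are placed sensibly, and the route-level integration is clear.

Also applies to: 209-209

grafana/provisioning/dashboards/dashboards.yml (1)

1-16: The dashboard auto-loading configuration is clear
The directory path matches the location of the JSON dashboards, and the settings are reasonable.

grafana/provisioning/datasources/prometheus.yml (1)

1-15: The Prometheus datasource provisioning is as expected
The default datasource and sampling interval settings are reasonable.

prometheus/prometheus.yml (1)

1-19: The Prometheus scrape configuration matches what the monitoring stack expects
Both the NexusGate job and the self-monitoring job are covered. OK.

docker-compose.monitoring.yaml (1)

4-41: The monitoring stack configuration LGTM overall!

The Docker Compose overlay is well structured:

  • Prometheus is configured with a suitable 15-day data retention period
  • The lifecycle API is enabled so configuration can be hot-reloaded
  • Grafana correctly mounts the provisioning directory for automatic configuration
  • Ports are configured through environment variables, which provides flexibility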
backend/src/services/prometheus.ts (5)

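19-24: Prometheus label escaping is implemented correctly!

The escapeLabelValue function correctly escapes the three special characters required by the Prometheus format: backslash, double quote, and newline.


29-37: LGTM!

The formatLabels function handles label formatting correctly, including the logic that skips null/undefined/empty-string values.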
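
The two helpers reviewed above can be sketched as follows. This is a minimal reconstruction from the review's description, not the PR's code: the function names match the review, but the signatures and the `LabelValue` type are assumptions. The backslash must be escaped first, otherwise the backslashes introduced for quotes and newlines would themselves be doubled.

```typescript
// Escape the three characters the Prometheus text format treats specially
// inside label values: backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value
    .replace(/\\/g, "\\\\") // backslash first, so later escapes are not doubled
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Assumed input type: labels may be missing or empty.
type LabelValue = string | null | undefined;

// Render {name="value",...}, skipping null, undefined, and empty-string values.
// Returns an empty string when no label survives the filter.
function formatLabels(labels: Record<string, LabelValue>): string {
  const parts: string[] = [];
  for (const [name, value] of Object.entries(labels)) {
    if (value === null || value === undefined || value === "") continue;
    parts.push(`${name}="${escapeLabelValue(value)}"`);
  }
  return parts.length > 0 ? `{${parts.join(",")}}` : "";
}
```

A sample line would then be assembled as the metric name, the result of `formatLabels`, a space, and the value.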


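82-100: Histogram formatting is implemented correctly!

The function correctly emits every component a Prometheus histogram requires:

  • Cumulative counts for each bucket (_bucket)
  • The +Inf bucket (total count)
  • _sum and _count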
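
The three components listed above can be illustrated with a short TypeScript sketch. It assumes a hypothetical `HistogramData` shape and `formatHistogram` function, and it skips label escaping for brevity, so it only suits label values known to be safe. It is not the PR's implementation.

```typescript
// Hypothetical input shape for this sketch.
interface HistogramData {
  // Upper bounds in seconds, ascending, each with its cumulative count.
  buckets: { le: number; cumulativeCount: number }[];
  sum: number; // sum of all observed values, in seconds
  count: number; // total number of observations
}

// Render one histogram series as Prometheus exposition lines.
function formatHistogram(
  name: string,
  labels: Record<string, string>,
  data: HistogramData,
): string[] {
  const base = Object.entries(labels).map(([k, v]) => `${k}="${v}"`);
  const withLe = (le: string): string => `{${[...base, `le="${le}"`].join(",")}}`;
  const plain = base.length > 0 ? `{${base.join(",")}}` : "";

  const lines: string[] = [];
  for (const bucket of data.buckets) {
    lines.push(`${name}_bucket${withLe(String(bucket.le))} ${bucket.cumulativeCount}`);
  }
  // The +Inf bucket always equals the total observation count.
  lines.push(`${name}_bucket${withLe("+Inf")} ${data.count}`);
  lines.push(`${name}_sum${plain} ${data.sum}`);
  lines.push(`${name}_count${plain} ${data.count}`);
  return lines;
}
```

Because bucket counts are cumulative, each `_bucket` value must be greater than or equal to the previous one, and the `+Inf` value must match `_count`.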

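105-121: Metrics data is fetched in parallel, which is good for performance!

Using Promise.all to fetch all metrics data in parallel is the right approach; it avoids the latency that accumulates when requests run serially.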
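
A small TypeScript sketch of the parallel-fetch pattern follows. `fakeQuery`, `CountRow`, and `fetchAllMetrics` are hypothetical names standing in for the real database functions. The point shown is that every query is issued before the first `await`, so the total wait is roughly that of the slowest query rather than the sum of all of them.

```typescript
// Records the order in which queries are issued (for illustration only).
const issued: string[] = [];

// Hypothetical stand-in for a database query function.
function fakeQuery<T>(name: string, result: T): Promise<T> {
  issued.push(name);
  return Promise.resolve(result);
}

interface CountRow {
  model: string;
  status: string;
  count: number;
}

// All three calls start immediately; Promise.all waits for every result
// and preserves the order of the input array in the output tuple.
async function fetchAllMetrics(): Promise<{
  completions: CountRow[];
  embeddings: CountRow[];
  activeApiKeys: number;
}> {
  const [completions, embeddings, activeApiKeys] = await Promise.all([
    fakeQuery<CountRow[]>("completions", [{ model: "model-a", status: "completed", count: 3 }]),
    fakeQuery<CountRow[]>("embeddings", []),
    fakeQuery<number>("activeApiKeys", 2),
  ]);
  return { completions, embeddings, activeApiKeys };
}
```

One design consequence: Promise.all rejects as soon as any input rejects, so a caller that wants partial results or fallback metrics needs its own error handling around the call.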


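304-335: The histogram data parsing logic is correct!

The function correctly handles:

  • Converting bucket boundaries from milliseconds to seconds
  • Converting the sum from milliseconds to seconds
  • Using bucket_inf as the fallback value for the count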
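
The three behaviours above can be sketched in TypeScript as follows. The `bucket_inf` column and the `${sumField}_count` lookup come from the discussion in this PR; the `bucket_<ms>` and `${sumField}_sum` column names, the `HistogramRow` type, and `parseHistogramRow` itself are assumptions made for illustration.

```typescript
// Raw SQL rows often come back with numeric columns as strings.
type HistogramRow = Record<string, number | string | null | undefined>;

function parseHistogramRow(
  row: HistogramRow,
  bucketsMs: readonly number[],
  sumField: string,
): { buckets: { le: number; cumulativeCount: number }[]; sum: number; count: number } {
  const toNumber = (value: unknown): number => Number(value ?? 0);

  // Bucket boundaries are stored in milliseconds; Prometheus expects seconds.
  const buckets = bucketsMs.map((ms) => ({
    le: ms / 1000,
    cumulativeCount: toNumber(row[`bucket_${ms}`]), // assumed column naming
  }));

  // The sum is also stored in milliseconds.
  const sum = toNumber(row[`${sumField}_sum`]) / 1000; // assumed column naming

  // Prefer an explicit count column, fall back to bucket_inf, then to 0.
  const count = toNumber(row[`${sumField}_count`] ?? row["bucket_inf"]);

  return { buckets, sum, count };
}
```

Converting at this boundary keeps the database layer in milliseconds while the exposition output uses seconds, the base unit that the `_seconds` metric names advertise.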
grafana/provisioning/dashboards/json/nexusgate-dashboard.json (1)

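1-1887: The Grafana dashboard configuration is comprehensive and well structured!

The dashboard includes the key panels needed to monitor an LLM gateway:

  • Overview stats (request count, success rate, active resources)
  • Request rate and throughput (grouped by model and status)
  • Latency distribution (P50/P95/P99 and TTFT)
  • Token usage
  • Error rate and cache hit rate
  • API format distribution

The Prometheus queries match the metric names exposed by the backend.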

scripts/quick-start.sh (3)

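70-105: The monitoring prompt flow is clear and friendly!

The ask_monitoring function gives a clear description that helps users understand what the monitoring components do and what resources they consume, and it offers a sensible default.


157-205: The monitoring config download logic is complete!

It correctly creates the required directory structure and downloads all necessary config files:

  • docker-compose.monitoring.yaml
  • prometheus/prometheus.yml
  • Grafana provisioning files

It also keeps the idempotent "skip if already present" behavior.


479-485: The multi-file Docker Compose startup command is correct!

Using -f docker-compose.yaml -f docker-compose.monitoring.yaml correctly layers the monitoring stack on top of the base deployment.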

backend/src/db/index.ts (4)

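1359-1383: The completion metrics query is implemented correctly!

The SQL query correctly aggregates data grouped by model, status, api_format, and api_key_id, and uses COALESCE to handle NULL values.


1411-1412: The latency bucket boundaries are sensibly defined!

LATENCY_BUCKETS_MS covers the range from 100 ms to 120 s, which suits the typical latency distribution of LLM requests.


1418-1436: The histogram query uses sql.raw to build dynamic SQL

Although sql.raw generally needs care to prevent SQL injection, LATENCY_BUCKETS_MS here is a hard-coded constant array of numbers, so it is safe.

If LATENCY_BUCKETS_MS becomes configurable in the future, input validation will need to be added.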
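
To make the validation point concrete, here is a TypeScript sketch of generating bucket expressions from a numeric array with the checks that would matter if the array ever became configurable. Everything here is an assumption: the bucket values are illustrative (only the 100 ms to 120 s range is stated in the review), and `buildBucketCases`, the `SUM(CASE ...)` shape, and the `bucket_<ms>` aliases are hypothetical, not the PR's DURATION_BUCKET_CASES.

```typescript
// Illustrative boundaries only; the PR's actual values are not shown here.
const EXAMPLE_LATENCY_BUCKETS_MS: readonly number[] = [100, 500, 1000, 5000, 30000, 120000];

// Build one cumulative-count expression per bucket boundary.
// Both the column name and every boundary are validated before being
// interpolated, because the result is meant to be passed to a raw SQL helper.
function buildBucketCases(column: string, bucketsMs: readonly number[]): string {
  if (!/^[a-z_][a-z0-9_]*$/.test(column)) {
    throw new Error(`unsafe column name: ${column}`);
  }
  return bucketsMs
    .map((ms) => {
      if (!Number.isInteger(ms) || ms <= 0) {
        throw new Error(`invalid bucket boundary: ${ms}`);
      }
      return `SUM(CASE WHEN ${column} <= ${ms} THEN 1 ELSE 0 END) AS bucket_${ms}`;
    })
    .join(",\n");
}
```

Computing the fragment once at module load, as a later commit in this PR describes for the bucket cases, avoids rebuilding the same string on every scrape.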


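1489-1528: The active entity count query is well implemented!

Running four independent queries in parallel with Promise.all is efficient. Each query correctly filters out deleted/revoked entities.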

python_test_code/test_metrics.py (3)

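30-80: The Prometheus metrics parser is implemented correctly!

The parse_prometheus_metrics function correctly parses metric lines with and without labels, and handles the # HELP and # TYPE comment lines.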
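
The test suite's parser is written in Python; the same idea is sketched below in TypeScript to match the backend's language. `Sample` and `parsePrometheusText` are hypothetical names, and the sketch is deliberately limited: it skips lines that carry an optional timestamp, keeps label values in their escaped form, and does not special-case values such as +Inf.

```typescript
interface Sample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// Parse Prometheus exposition text into samples, ignoring comment lines.
function parsePrometheusText(text: string): Sample[] {
  const samples: Sample[] = [];
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // blanks, # HELP, # TYPE

    // name, optional {labels}, then a single value token
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$/.exec(line);
    if (!match) continue; // unsupported shape (for example, a trailing timestamp)

    const labels: Record<string, string> = {};
    const labelPattern = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
    let labelMatch: RegExpExecArray | null;
    while ((labelMatch = labelPattern.exec(match[2] ?? "")) !== null) {
      labels[labelMatch[1]] = labelMatch[2];
    }

    samples.push({ name: match[1], labels, value: Number(match[3]) });
  }
  return samples;
}
```

A parser like this is enough for structural checks, such as confirming that every histogram has `_bucket`, `_sum`, and `_count` series.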


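108-149: The tests cover both required and optional metrics!

The tests correctly distinguish metrics that must be present (such as nexusgate_info) from optional metrics (such as nexusgate_completions_total), which may be absent when there is no data.


282-293: Exception handling in the test runner

Although catching a generic Exception is usually discouraged, it is reasonable in a test scenario: it ensures every test runs and reports its result, instead of one failing test aborting the whole suite.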


Comment thread backend/src/api/metrics.ts
Comment thread backend/src/index.ts Outdated
…d caching

- Add METRICS_CACHE_TTL_SECONDS config for Redis-based metrics caching
- Replace api_key_id with api_key_comment label for meaningful aggregation
- Add API key rate limit metrics (rpm/tpm usage and limits)
- Add rate limit rejection counter (429 errors by api_key and limit_type)
- Add Redis hash operations (hincrby, hgetall) for rejection tracking
- Add error handling with fallback metrics on failure
- Add security documentation for public metrics endpoint
- Update Grafana dashboard with:
  - Error Type Distribution pie chart
  - Rate Limit Rejections timeseries panel
  - API Key Rate Limit Usage gauge panel

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

pescn commented Jan 25, 2026

/gemini review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive Prometheus metrics endpoint, which is a fantastic addition for observability. The implementation is well-structured, including caching, fallback metrics on error, and a full monitoring stack with Docker Compose, Prometheus, and a detailed Grafana dashboard. The code is generally of high quality. I've identified one bug in the metrics generation logic and a potential performance bottleneck that could be addressed to make it even more robust. Overall, great work on this feature.

Comment thread backend/src/services/prometheus.ts Outdated
Comment thread backend/src/services/prometheus.ts Outdated
pescn and others added 2 commits January 26, 2026 03:49
Address CodeRabbit review: requests to /metrics/ were falling through
to SPA routing, returning HTML instead of 404. This caused Prometheus
scraping errors when using trailing slash URLs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix colon parsing in apiKeyComment by using pop() for limit type
- Optimize rate limit status fetching with Promise.all for parallel execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

pescn commented Jan 25, 2026

Addressing Review Comments

Fixed in commit 9eae9aa:

  1. [HIGH] Colon parsing in apiKeyComment (prometheus.ts:433)

    • Fixed by using parts.pop() to extract limit type and parts.join(":") for the comment
    • This correctly handles comments containing colons like "my:team:key"
  2. [MEDIUM] Sequential Redis calls (prometheus.ts:391)

    • Refactored to use Promise.all() for parallel Redis fetching
    • This reduces total latency from about 2N sequential Redis round trips to roughly the duration of the slowest single call, since all calls are now issued concurrently
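
The colon-parsing fix in item 1 can be sketched as follows. The field layout `<apiKeyComment>:<limitType>` and the function name `parseRejectionField` are assumptions for illustration; only the use of `parts.pop()` for the limit type and `parts.join(":")` for the comment comes from the comment above.

```typescript
// Split a hash field of the assumed form "<apiKeyComment>:<limitType>".
// The comment itself may contain colons, so only the LAST segment is
// treated as the limit type; everything before it is rejoined.
function parseRejectionField(field: string): { apiKeyComment: string; limitType: string } {
  const parts = field.split(":");
  const limitType = parts.pop() ?? ""; // last segment is the limit type
  const apiKeyComment = parts.join(":"); // restore any colons inside the comment
  return { apiKeyComment, limitType };
}
```

Splitting from the right works because the limit type is a fixed token without colons, while the comment is free-form user text.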

Already Addressed:

  1. [MEDIUM] Security - api_key_id exposure (prometheus.ts, metrics.ts)
    • Already addressed in commit af35f8e by using api_key_comment label instead of api_key_id
    • This provides meaningful aggregation without exposing sensitive key identifiers
    • Added security documentation to metrics.ts explaining the public endpoint rationale

No Change Needed:

  1. [MEDIUM] Redundant count logic (prometheus.ts:472)

    • The suggested change (row.total_count) would break the code as there's no total_count field in the SQL query
    • The current logic row[`${sumField}_count`] ?? row.bucket_inf ?? 0 correctly handles the fallback for different histogram types (duration vs ttft) which may have different field names
    • The "redundancy" in SQL is intentional to support different sum field naming patterns
  2. [MEDIUM] Duplicate COUNT(*) in SQL (db/index.ts:1434)

    • Valid observation but functionally correct and low priority
    • Both bucket_inf and duration_count serve different semantic purposes in the code even though they have the same value
    • Will consider consolidating in a future cleanup PR
  3. [MEDIUM] Optional bearer token for /metrics (metrics.ts)

    • Valid suggestion for high-security environments
    • Current approach uses api_key_comment (not the actual key) which mitigates the main concern
    • Network-level security (firewall, VPC) is the recommended approach per industry practice
    • Adding optional auth would add complexity; will consider for a future enhancement if there's user demand
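
For reference, the optional bearer-token check that was deferred in item 3 could look roughly like the TypeScript sketch below. It was not implemented in this PR. `isMetricsRequestAuthorized` is a hypothetical helper, and the token is assumed to come from an environment setting such as the METRICS_BEARER_TOKEN name suggested in the review. A production version would probably want a constant-time comparison rather than `===`.

```typescript
// Decide whether a request may read /metrics.
// expectedToken: value of the optional token setting (undefined or empty = public).
// authorizationHeader: raw value of the incoming Authorization header, if any.
function isMetricsRequestAuthorized(
  authorizationHeader: string | null | undefined,
  expectedToken: string | undefined,
): boolean {
  if (!expectedToken) return true; // token not configured: keep public behavior
  if (!authorizationHeader) return false; // token required but header missing

  const prefix = "Bearer ";
  if (!authorizationHeader.startsWith(prefix)) return false; // wrong scheme

  // Plain comparison for brevity; see the note above about constant-time checks.
  return authorizationHeader.slice(prefix.length) === expectedToken;
}
```

A route handler would return 401 when this function yields false and otherwise serve the metrics text as it does today, which keeps the current public behavior as the default.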

Comment thread backend/src/db/index.ts Outdated
Comment thread backend/src/db/index.ts Outdated
Comment thread backend/src/db/index.ts Outdated
pescn and others added 2 commits January 26, 2026 22:08
- Remove duplicate COUNT(*) in histogram queries (use single total_count)
- Move bucketCases SQL fragments to module scope (computed once at load)
- Add comment explaining why SUM is used instead of AVG for Prometheus histograms

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prometheus scrapers may access /metrics/ with trailing slash, which would
previously fall through to SPA routing and return HTML instead of metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@EM-GeekLab EM-GeekLab deleted a comment from coderabbitai Bot Jan 26, 2026
@EM-GeekLab EM-GeekLab deleted a comment from gemini-code-assist Bot Jan 26, 2026
pescn and others added 2 commits January 26, 2026 22:37
Replace 4 parallel queries with a single query using subqueries.
This reduces database round-trips from 4 to 1.

Addresses review comment from @koitococo

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@pescn pescn merged commit 57900ef into main Jan 28, 2026
2 checks passed
@pescn pescn deleted the feat/prometheus-metrics-api branch January 28, 2026 03:21