feat: add Grafana integration and alert system #67

Merged
koitococo merged 5 commits into main from feat/grafana-alert-integration
Feb 1, 2026
Conversation

Contributor

@pescn pescn commented Jan 31, 2026

Summary

  • Add a complete alert system with notification channels (webhook, email, Feishu), configurable alert rules (budget, error rate, latency, quota), and alert history tracking
  • Add Grafana integration for syncing alert rules as Prometheus-based Grafana alerts via the Provisioning API — when Grafana is connected, the built-in alert engine defers evaluation to Grafana
  • Restructure frontend navigation: model configuration at /models (Providers + Registry), system settings at /settings (Alerts + Grafana) as separate sidebar items

Backend changes

  • New alert API (/api/admin/alerts) with channels, rules, history, and toggle endpoints
  • New Grafana API (/api/admin/grafana) with connection management, test, and sync endpoints
  • Grafana sync service with PromQL mapping for all alert rule types
  • Alert engine with periodic evaluation and dispatcher for webhook/email/Feishu
  • DB migrations adding alert tables and Grafana sync tracking columns

Frontend changes

  • Alerts settings page with channel/rule management and Grafana sync badges
  • Grafana settings page with API connection config and dashboard embed management
  • Moved model config routes from /settings to /models with sub-nav (Providers, Registry)
  • Moved system settings routes to /settings with sub-nav (Alerts, Grafana)
  • Updated sidebar, i18n (en-US + zh-CN), and route tree

Test plan

  • Navigate to /settings/grafana, configure Grafana API URL + token, test connection
  • Navigate to /settings/alerts, create channels and alert rules
  • With Grafana connected: verify sync badges and "Sync to Grafana" buttons work
  • With Grafana disconnected: verify built-in alert evaluation runs
  • Navigate to /models/providers and /models/registry to verify model config pages
  • Verify sidebar navigation links to correct routes
  • bun run check && bun run lint pass with 0 errors

🤖 Generated with Claude Code

Summary by CodeRabbit

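  • New features

    • Complete alert management: channels (Webhook/Email/Feishu), rules (budget/error rate/latency/quota), and an alert history view
    • Grafana integration: connection configuration, channel and rule sync, sync status, and dashboard support
    • New "Alerts" and "Grafana" settings pages and routes in the frontend; new "Models" entry in the sidebar
  • Improvements

    • New backend alert engine and dispatcher with periodic evaluation, cooldown, test notifications, and email delivery
    • Frontend form validation, sync status badges, localized copy, and an enhanced Alert widget
  • Other

    • Database schema and snapshot updates; added the nodemailer email library as a dependency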

Add a complete alert system with notification channels (webhook, email,
Feishu), configurable alert rules (budget, error rate, latency, quota),
and alert history tracking.

Add Grafana integration for syncing alert rules as Prometheus-based
Grafana alerts via the Provisioning API. When Grafana is connected,
the built-in alert engine defers to Grafana for evaluation.

Restructure frontend navigation: model configuration moves to /models
(Providers + Registry sub-nav), system settings moves to /settings
(Alerts + Grafana sub-nav) as separate sidebar items.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai Bot commented Jan 31, 2026

📝 Walkthrough

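Adds a complete alerting subsystem: database enums and tables, the Drizzle schema and snapshots, backend CRUD/API, an alert engine with multi-channel dispatch (Webhook/Email/Feishu), a Grafana sync client and service, and the corresponding frontend pages, routes, hooks, and i18n entries.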

Changes

Cohort / File(s) Summary
Database migrations and snapshots
backend/drizzle/0013_flowery_maria_hill.sql, backend/drizzle/0014_opposite_dragon_man.sql, backend/drizzle/0015_green_kate_bishop.sql, backend/drizzle/meta/0013_snapshot.json, backend/drizzle/meta/0014_snapshot.json, backend/drizzle/meta/0015_snapshot.json, backend/drizzle/meta/_journal.json
Adds the alert enums and three tables (alert_channels/alert_rules/alert_history), adds Grafana sync fields to channels/rules, changes the foreign key to cascade, and commits the schema snapshots and journal entries.
Backend schema and DB interface
backend/src/db/schema.ts, backend/src/db/index.ts
Adds alert type definitions (channel config/conditions/payload) and Drizzle table declarations, and exports the Alert types along with CRUD, history query, statistics, and Grafana sync update functions.
Admin API: Alerts & Grafana
backend/src/api/admin/alerts.ts, backend/src/api/admin/grafana.ts, backend/src/api/admin/index.ts
Adds the /alerts admin routes (CRUD for channels/rules/history, test send) and the /grafana admin routes (connection management, test, Prometheus discovery, sync), and wires them into the admin route chain.
Alert engine and dispatch
backend/src/services/alertEngine.ts, backend/src/services/alertDispatcher.ts
Adds a periodic alert engine (60s loop, Redis cooldown, evaluation for each rule type), a multi-channel dispatcher (Webhook HMAC, nodemailer email, Feishu cards), test notifications, and history record writes.
Grafana sync and client
backend/src/services/grafanaSync.ts, backend/src/utils/grafanaClient.ts
Adds GrafanaClient (folder, alert rule, contact point, and datasource operations) and the sync service (syncRulesToGrafana, syncChannelsToGrafana, syncAllToGrafana), plus PromQL/payload construction and DB sync status updates.
Backend integration and dependencies
backend/src/index.ts, backend/package.json, backend/src/adapters/upstream/anthropic.ts
Starts the alert engine at application startup and stops it on exit; adds the nodemailer dependency and its type declarations; minor formatting adjustments.
Frontend pages, routes, and navigation
frontend/src/pages/settings/alerts-settings-page.tsx, frontend/src/pages/settings/grafana-settings-page.tsx, frontend/src/routes/settings/*, frontend/src/routes/models/*, frontend/src/routeTree.gen.ts, frontend/src/components/app/app-sidebar.tsx
Adds the Alerts and Grafana settings pages and routes, extends the route tree with a top-level Models route, and adjusts the sidebar and navigation items (Models, Settings, Grafana, Alerts).
Frontend hooks / UI / i18n
frontend/src/hooks/use-settings.ts, frontend/src/hooks/use-copy.tsx, frontend/src/components/ui/alert.tsx, frontend/src/i18n/locales/en-US.json, frontend/src/i18n/locales/zh-CN.json
Adds Grafana-related hooks and the Alert UI component, fixes the use-copy dependency list, and extends the Chinese/English translations to cover the alert and Grafana page text and labels.
Frontend page integration
frontend/src/pages/settings/*
Implements AlertsSettingsPage and GrafanaSettingsPage (form validation, create/delete/test/sync, history display, Grafana status badges, and react-query integration).

Sequence Diagram(s)

sequenceDiagram
    participant Engine as AlertEngine
    participant DB as Database
    participant Redis as Redis
    participant Dispatcher as AlertDispatcher
    participant GrafanaSvc as GrafanaSync
    participant HTTP as External HTTP / SMTP

    loop Every 60s
        Engine->>DB: fetch enabled alert rules
        DB-->>Engine: rules list
        loop per rule
            Engine->>Redis: check/set cooldown(ruleId)
            Redis-->>Engine: cooldown status
            Engine->>DB: query metrics / evaluate condition
            DB-->>Engine: metrics/current value
            alt triggered & not cooled down
                Engine->>DB: fetch associated channels
                DB-->>Engine: channel configs
                loop per channel
                    Engine->>Dispatcher: dispatchToChannel(type, config, payload)
                    alt Grafana-managed
                        Dispatcher->>GrafanaSvc: skip local dispatch (Grafana manages)
                        GrafanaSvc-->>Dispatcher: acknowledged
                    else Local dispatch
                        Dispatcher->>HTTP: send webhook/email/feishu
                        HTTP-->>Dispatcher: response
                    end
                    Dispatcher-->>Engine: result(success/failure)
                end
                Engine->>DB: insert alert_history(record)
            end
        end
    end
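
The cooldown gate in the diagram above can be sketched roughly as follows. This is only an illustration, not the PR's implementation: it swaps Redis for an in-memory Map, and `CooldownStore`, `RuleState`, and `rulesToDispatch` are hypothetical names introduced for the example.

```typescript
// Hypothetical sketch: an in-memory stand-in for the Redis cooldown check.
class CooldownStore {
  private readonly lastFired = new Map<number, number>();

  /** Returns true (and records the timestamp) if the rule may fire at `now`. */
  tryAcquire(ruleId: number, cooldownMs: number, now: number): boolean {
    const last = this.lastFired.get(ruleId);
    if (last !== undefined && now - last < cooldownMs) {
      return false; // still cooling down
    }
    this.lastFired.set(ruleId, now);
    return true;
  }
}

// Assumed shape: `triggered` is the result of evaluating the rule's condition.
interface RuleState {
  id: number;
  triggered: boolean;
  cooldownMs: number;
}

/** Returns the ids of rules that are triggered and not cooling down. */
function rulesToDispatch(
  rules: RuleState[],
  store: CooldownStore,
  now: number,
): number[] {
  return rules
    // The cooldown slot is only claimed for rules that actually triggered.
    .filter((r) => r.triggered && store.tryAcquire(r.id, r.cooldownMs, now))
    .map((r) => r.id);
}
```

Passing `now` explicitly keeps the gate deterministic under test; a real engine would presumably read the clock and use an atomic set-if-absent with expiry in Redis instead.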
sequenceDiagram
    participant AdminAPI as Admin API
    participant Sync as GrafanaSync
    participant DB as Database
    participant Client as GrafanaClient
    participant Grafana as Grafana API

    AdminAPI->>Sync: syncAllToGrafana()
    Sync->>DB: fetch enabled rules & channels
    DB-->>Sync: items
    loop per item
        Sync->>Client: build payload
        Client->>Grafana: create/update resource
        Grafana-->>Client: response(uid/status)
        Client-->>Sync: result(uid)
        Sync->>DB: update grafana sync fields(uid, timestamp, error?)
    end
    Sync-->>AdminAPI: SyncResult

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • koitococo

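"
I am a little rabbit in the code field,
digging up alert tables and enum leaves,
Grafana sync leads the rules along,
Webhook, email, and Feishu linked together,
polling, cooldown, and history sweetly stored. 🐇✨
"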

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 58.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title 'feat: add Grafana integration and alert system' clearly and concisely summarizes the main changes in the changeset.


@pescn pescn requested review from Copilot and koitococo January 31, 2026 15:30
@pescn pescn linked an issue Jan 31, 2026 that may be closed by this pull request
@gemini-code-assist

Summary of Changes

Hello @pescn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the platform's operational monitoring capabilities by introducing a complete alert system and deep integration with Grafana. It empowers administrators to proactively manage system health and performance through configurable alerts and leverages Grafana's robust features for advanced visualization and alerting, streamlining the overall management experience.

Highlights

  • New Alert System: A comprehensive alert management system has been introduced, allowing users to define alert channels (webhook, email, Feishu) and create various alert rules based on metrics such as budget, error rate, latency, and quota. The system also tracks alert history.
  • Grafana Integration: The new system integrates with Grafana, enabling the synchronization of alert rules as Prometheus-based Grafana alerts via the Provisioning API. When Grafana is connected and verified, the built-in alert engine defers evaluation to Grafana, providing a centralized monitoring solution.
  • Dynamic Alert Evaluation Engine: A dedicated alert engine runs periodically to evaluate configured alert rules. It includes logic for managing alert cooldowns and intelligently switches between its own evaluation and delegating to Grafana based on connection status.
  • Frontend Navigation Restructure: The frontend navigation has been reorganized to improve user experience. Model configurations (Providers + Registry) are now located under a new '/models' section, while system settings (Alerts + Grafana) are consolidated under '/settings' as separate sidebar items.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive alert system and Grafana integration, which is a significant and well-executed feature addition. The backend implementation is robust, with new database schemas, APIs, and services for alerting and Grafana synchronization. The frontend is also updated with new pages and restructured navigation to accommodate these features. My review focuses on improving data integrity in the database schema, enhancing API validation for better type safety, optimizing performance in the alert evaluation engine, and improving TypeScript type usage on the frontend. Overall, this is a great contribution.

Comment thread backend/src/api/admin/alerts.ts
Comment thread backend/src/db/schema.ts Outdated
Comment thread backend/src/services/alertEngine.ts
Comment thread frontend/src/routes/settings/alerts.tsx

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 11

🤖 Fix all issues with AI agents
In `@backend/drizzle/0013_flowery_maria_hill.sql`:
- Around line 14-35: The foreign key constraint
alert_history_rule_id_alert_rules_id_fk on table alert_history currently uses ON
DELETE no action which blocks deleting alert_rules rows; update the ALTER TABLE
statement that defines alert_history_rule_id_alert_rules_id_fk to either use ON
DELETE CASCADE to cascade deletes to alert_history, or alter
alert_history.rule_id to be nullable and change the foreign key to ON DELETE SET
NULL so histories are retained but dissociated from deleted alert_rules.

In `@backend/package.json`:
- Around line 28-34: When creating the nodemailer transport
(nodemailer.createTransport / transporter), explicitly set tls.servername to the
SMTP host to enable SNI under Bun (e.g., set tls.servername = the same host used
in host field) so Bun's tls.connect will present the correct SNI; if STARTTLS
(port 587) still fails, try using SMTPS (port 465) as an alternative.
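
As a rough sketch of the suggestion above, the transport options could be built by a small helper before being handed to nodemailer.createTransport. The helper name `buildSmtpOptions` and the `SmtpTransportOptions` type are hypothetical; the option names (`host`, `port`, `secure`, `tls.servername`) are the standard nodemailer SMTP and Node TLS options.

```typescript
// Hypothetical helper: builds SMTP transport options with explicit SNI.
interface SmtpTransportOptions {
  host: string;
  port: number;
  secure: boolean;
  tls: { servername: string };
}

function buildSmtpOptions(host: string, port: number): SmtpTransportOptions {
  return {
    host,
    port,
    // Port 465 uses implicit TLS (SMTPS); other ports such as 587 upgrade via STARTTLS.
    secure: port === 465,
    // Set the SNI name explicitly so the TLS handshake presents the SMTP host.
    tls: { servername: host },
  };
}
```

The result would be passed as `nodemailer.createTransport(buildSmtpOptions(host, port))`, with credentials added as the project requires.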

In `@backend/src/api/admin/grafana.ts`:
- Around line 96-167: The Grafana connection test lacks request timeouts and
when it fails the old datasourceUid from config can be retained; update the
logic in the POST "/connection/test" handler (referencing getGrafanaConnection,
the fetch calls, and upsertSetting with GRAFANA_CONNECTION_KEY) to use
AbortController-based timeouts for the health and datasources fetches (abort
after a reasonable timeout) and ensure in the failure branch you explicitly
clear datasourceUid (e.g., set to undefined/null) when upserting verified:
false/verifiedAt: null so stale datasourceUid is not preserved.

In `@backend/src/services/alertDispatcher.ts`:
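- Around line 160-172: dispatchToChannel currently has no default branch, so an
unknown channelType returns silently and makes failures harder to diagnose; add a
default branch at the end of the switch (channelType) (or after the switch) that
throws an explicit error, e.g. throw new Error(`Unsupported channelType: ${channelType}`),
so that callers of dispatchToChannel notice and locate the problem immediately;
referenced symbols: dispatchToChannel, channelType, AlertChannelTypeEnumType.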
- Around line 42-66: Replace the raw fetch calls with the project's
timeout-aware helper: import fetchWithTimeout from backend/src/services/failover
and in both dispatchWebhook and dispatchFeishu use fetchWithTimeout instead of
fetch, passing the same request options (method, headers, body) plus a timeout
(use a configured value if present on the channel config, otherwise a sensible
default like 10000 ms), keep the same response.ok/error handling and HMAC header
logic in dispatchWebhook.
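
A minimal sketch of the throwing default branch, assuming the three channel types named in this PR. `ChannelType`, `dispatcherNameFor`, and the name `dispatchEmail` are hypothetical stand-ins; the real dispatchToChannel signature is not shown in this thread.

```typescript
// Assumed union of channel types; the project uses AlertChannelTypeEnumType.
type ChannelType = "webhook" | "email" | "feishu";

/** Maps a channel type to the dispatcher that would handle it. */
function dispatcherNameFor(channelType: ChannelType): string {
  switch (channelType) {
    case "webhook":
      return "dispatchWebhook";
    case "email":
      return "dispatchEmail"; // hypothetical name
    case "feishu":
      return "dispatchFeishu";
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a new
      // member is added to ChannelType without a matching case.
      const unreachable: never = channelType;
      // Runtime guard for values that bypass the type system (e.g. DB rows).
      throw new Error(`Unsupported channelType: ${String(unreachable)}`);
    }
  }
}
```

The `never` assignment catches missing cases at compile time, while the throw covers unexpected values at runtime.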

In `@backend/src/services/alertEngine.ts`:
- Around line 104-152: In evaluateQuota, avoid calling listApiKeys() twice by
fetching the API keys once and reusing them: call listApiKeys() at the start of
evaluateQuota (store result in a local variable) and use that variable both when
condition.apiKeyId is present (to find the single apiKey) and in the "check all
active API keys" branch; update references to apiKeys and remove the second
listApiKeys() call so the function uses the cached result throughout.

In `@backend/src/services/grafanaSync.ts`:
- Around line 91-101: The PromQL in the "latency" case produces invalid
selectors like `{,model="xxx"}` because modelFilter is prefixed with a comma;
update the modelFilter construction in the latency branch (where `const c =
rule.condition as LatencyCondition` and the returned `expr` is built) to follow
the same trailing-comma style used in the `error_rate` implementation (e.g.,
`model="${c.model}",` when c.model exists) so the selector becomes
`{${modelFilter}}` and avoids a leading comma; ensure threshold/forDuration
logic remains unchanged.
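
One way to avoid stray commas altogether is to build the selector from a list of matchers and join them, instead of concatenating pre-punctuated fragments. The sketch below is an assumption-laden illustration: `buildSelector` and `escapeLabelValue` are hypothetical helpers, not functions from grafanaSync.ts.

```typescript
/** Escapes a value for use inside a double-quoted PromQL label matcher. */
function escapeLabelValue(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

/**
 * Builds a PromQL label selector such as {model="gpt-4"}.
 * Labels whose value is undefined or empty are skipped, and an empty
 * string is returned when no matcher remains, so no "{,...}" can occur.
 */
function buildSelector(labels: Record<string, string | undefined>): string {
  const matchers = Object.entries(labels)
    .filter(
      (entry): entry is [string, string] =>
        entry[1] !== undefined && entry[1] !== "",
    )
    .map(([name, value]) => `${name}="${escapeLabelValue(value)}"`);
  return matchers.length > 0 ? `{${matchers.join(",")}}` : "";
}
```

A latency expression would then append `buildSelector({ model: c.model })` to the metric name, whatever that metric is in the project.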

In `@backend/src/utils/grafanaClient.ts`:
- Around line 61-88: The request<T> method currently uses fetch without a
cancel/timeout mechanism; add an AbortController-based timeout and allow callers
to provide their own signal: create a default timeout (e.g. DEFAULT_TIMEOUT_MS),
inside request<T> create an AbortController, if options.signal is provided wire
it so that caller signal aborts the controller, set a timer that calls
controller.abort() after the timeout, pass controller.signal to fetch, and clear
the timer after fetch completes; ensure the Authorization/headers merge remains
and surface fetch abort errors as normal.
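
The timeout wiring described above might look roughly like this. It is a sketch under assumptions: `createTimeoutSignal` and `fetchWithDeadline` are hypothetical names (the project already has its own fetchWithTimeout helper, which is not reproduced here), and the 10 second default is illustrative.

```typescript
const DEFAULT_TIMEOUT_MS = 10_000; // assumed default

interface TimeoutHandle {
  signal: AbortSignal;
  cancel: () => void;
}

/** Creates a signal that aborts on timeout or when the caller's signal aborts. */
function createTimeoutSignal(
  timeoutMs: number = DEFAULT_TIMEOUT_MS,
  callerSignal?: AbortSignal,
): TimeoutHandle {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const onCallerAbort = () => controller.abort();

  if (callerSignal?.aborted) {
    controller.abort(); // caller already cancelled
  } else {
    callerSignal?.addEventListener("abort", onCallerAbort, { once: true });
  }

  return {
    signal: controller.signal,
    // Always call cancel() once the request settles to release the timer.
    cancel: () => {
      clearTimeout(timer);
      callerSignal?.removeEventListener("abort", onCallerAbort);
    },
  };
}

/** fetch wrapper that aborts after timeoutMs; abort errors propagate as usual. */
async function fetchWithDeadline(
  url: string,
  init: RequestInit = {},
  timeoutMs: number = DEFAULT_TIMEOUT_MS,
): Promise<Response> {
  const { signal, cancel } = createTimeoutSignal(timeoutMs, init.signal ?? undefined);
  try {
    return await fetch(url, { ...init, signal });
  } finally {
    cancel();
  }
}
```

On runtimes that support it, `AbortSignal.timeout(ms)` is a shorter alternative when the caller does not supply its own signal.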

In `@frontend/src/pages/settings/alerts-settings-page.tsx`:
- Around line 1005-1028: The UI for the FormField named "channelIds" currently
renders a single-select Select (in alerts-settings-page.tsx) while the
backend/schema expects a comma-separated string (z.string().min(1)) that is
later split in createMutation; either update the UI to allow multiple selections
(replace the single Select with a multi-select component or checkbox list that
stores a comma-separated string in form.control.field.value) or simplify the
schema/type to a single selection (change the schema from z.string() to a
single-valued type and remove the split logic in createMutation); locate and
modify the FormField render block for "channelIds" and correspondingly update
the schema definition and createMutation handling so the UI and data model
match.

In `@frontend/src/pages/settings/grafana-settings-page.tsx`:
- Around line 199-207: The Test button uses the wrong i18n key
(t('pages.settings.alerts.grafana.Syncing')) causing the raw key to render; in
the GrafanaSettingsPage component update the translation call used in the Button
with isTesting (the onClick that calls testMutation.mutate()) to the consistent
namespace t('pages.settings.grafana.Syncing') so it matches the other keys
(e.g., t('pages.settings.grafana.TestConnection')) and renders the correct
localized label.

In `@frontend/src/routes/settings/grafana.tsx`:
- Around line 12-20: The local grafanaConnectionQueryOptions duplicate should be
removed and the shared implementation from use-settings should be imported and
used instead; replace the local definition of grafanaConnectionQueryOptions in
grafana.tsx with an import of the same-named export from the use-settings
module, ensure the import name matches (grafanaConnectionQueryOptions) and that
any surrounding references (e.g., query usage or types like
GrafanaConnectionResponse) remain unchanged so behavior and typing are
preserved.
🧹 Nitpick comments (13)
frontend/src/hooks/use-settings.ts (1)

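70-79: Add retry: false for consistency with the other Grafana queries in this file.

grafanaSyncStatusQueryOptions currently has no explicit retry configuration, so it falls back to React Query's default of 3 retries. Consider disabling retries to avoid unnecessary extra requests when the Grafana backend is failing, consistent with the configuration of grafanaConnectionQueryOptions and dashboardsQueryOptions.

♻️ Suggested change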
 export const grafanaSyncStatusQueryOptions = () =>
   queryOptions({
     queryKey: ['grafanaSyncStatus'],
     queryFn: async () => {
       const { data, error } = await api.admin.grafana.sync.status.get()
       if (error) return null
       return data
     },
     staleTime: 30 * 1000,
+    retry: false,
   })
frontend/src/routes/models/route.tsx (1)

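74-80: Screen reader text should be internationalized via i18n

The strings in the sr-only element and in SheetHeader are hard-coded English; they should use i18n translations for consistency.

♻️ Suggested fix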
                   <Button variant="ghost" size="icon" className="h-7 w-7">
                     <MenuIcon className="size-4" />
-                    <span className="sr-only">Toggle Models Menu</span>
+                    <span className="sr-only">{t('routes.models.ToggleMenu')}</span>
                   </Button>
                 </SheetTrigger>
                 <SheetContent side="left" className="w-56 p-0">
                   <SheetHeader className="sr-only">
-                    <SheetTitle>Models Navigation</SheetTitle>
-                    <SheetDescription>Navigate between model pages</SheetDescription>
+                    <SheetTitle>{t('routes.models.Navigation')}</SheetTitle>
+                    <SheetDescription>{t('routes.models.NavigationDescription')}</SheetDescription>
                   </SheetHeader>
backend/drizzle/meta/0013_snapshot.json (1)

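77-152: Consider adding indexes to the alert_history table

The alert_history table currently defines no indexes. For the following common query patterns, consider adding indexes in the schema definition:

  • An index on rule_id, for querying history by rule
  • An index on triggered_at, for querying history by time range

Because this is an auto-generated snapshot file, the actual change should be made in backend/src/db/schema.ts.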

frontend/src/routes/settings/alerts.tsx (1)

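64-74: Avoid the any[] cast

Disabling @typescript-eslint/no-explicit-any with eslint-disable and casting to any[] sacrifices type safety. Consider defining proper types for the API responses, or inferring them from the API client.

♻️ Suggested direction
// Define or import the correct types
import type { AlertChannel, AlertRule, AlertHistory } from '@/types/alerts'

// Use the correct types in RouteComponent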
return (
  <AlertsSettingsPage
    channels={channels as AlertChannel[]}
    rules={rules as AlertRule[]}
    history={history as AlertHistory[]}
    grafanaConnected={grafanaConnected}
    grafanaApiUrl={grafanaApiUrl}
  />
)
backend/src/services/alertEngine.ts (2)

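155-187: Sequential awaits in the loop may hurt performance

When checking quota for all API keys, each key awaits getRateLimitStatus sequentially. With a large number of API keys, this can delay evaluation.

Consider using Promise.all or Promise.allSettled to process them in parallel.

♻️ Suggested parallel approach
// Fetch the status of all keys in parallel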
const statuses = await Promise.all(
  apiKeys.map(async (key) => ({
    key,
    status: await getRateLimitStatus(key.id, {
      rpmLimit: key.rpmLimit,
      tpmLimit: key.tpmLimit,
    }),
  }))
);

let maxUsagePercent = 0;
for (const { status } of statuses) {
  // compute usagePercent...
}

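363-365: void evaluateAlerts() may lead to an unhandled Promise rejection

When void discards the returned Promise, an exception that escapes evaluateAlerts() (even though it currently has a try-catch) may not be handled properly.

Consider adding a .catch() handler, or ensure that every exception is caught inside the function.

♻️ Suggested improvement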
   intervalId = setInterval(() => {
-    void evaluateAlerts();
+    evaluateAlerts().catch((error) => {
+      logger.error("Uncaught error in alert evaluation", {
+        error: error instanceof Error ? error.message : String(error),
+      });
+    });
   }, ALERT_CHECK_INTERVAL_MS);
backend/src/api/admin/alerts.ts (2)

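25-159: Consider type-checked validation of config/condition to avoid persisting invalid structures.
These fields are currently stored as Unknown without validation, so later dispatch/sync stages are more likely to fail in ways that are hard to diagnose; validate the structure according to the channel/rule type, or validate explicitly before writing.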

Also applies to: 163-287


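291-309: Consider bounding offset/limit in the history query.
A negative or excessively large limit poses a performance risk; clamp the values to lower/upper bounds when handling the request.

🧩 Example bounds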
-      const offset = query.offset ?? 0;
-      const limit = query.limit ?? 50;
+      const offset = Math.max(0, query.offset ?? 0);
+      const limit = Math.min(200, Math.max(1, query.limit ?? 50));
backend/drizzle/0013_flowery_maria_hill.sql (1)

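14-33: Consider adding indexes for history queries to support pagination/filtering.
History queries that filter or sort by rule_id and trigger time would benefit from indexes.

📈 Example indexes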
CREATE INDEX alert_history_rule_id_idx ON alert_history (rule_id);
CREATE INDEX alert_history_triggered_at_idx ON alert_history (triggered_at DESC);
backend/src/services/grafanaSync.ts (1)

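242-291: Suggestion: add cleanup logic for deleted Grafana rules

syncRulesToGrafana currently syncs only enabled rules (enabledRules); if a rule is disabled or deleted, its corresponding Grafana rule is not removed, which can leave orphaned alert rules in Grafana.

Would you like help implementing cleanup logic that deletes orphaned alerts in Grafana that no longer correspond to a local rule?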

frontend/src/pages/settings/alerts-settings-page.tsx (2)

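399-402: The type assertion can be improved

Using as any for the type assertion loses type safety. Consider defining an explicit interface type for syncStatus.

💡 Suggested explicit SyncStatus response type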
interface GrafanaSyncStatusResponse {
  channels: SyncStatusItem[]
  rules: SyncStatusItem[]
}

const getChannelSyncStatus = (id: number): SyncStatusItem | undefined => {
  return (syncStatus as GrafanaSyncStatusResponse | undefined)?.channels?.find(
    (c) => c.id === id
  )
}

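97-113: The form validation schema does not conditionally validate required fields

channelSchema currently marks all type-specific fields (webhook, email, feishu) as optional(), which means a user can submit a webhook channel with no configuration at all.

Consider using zod's discriminatedUnion or superRefine to add conditional validation, so that the required fields are checked according to the selected type.

♻️ Example conditional validation with superRefine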
const channelSchema = z.object({
  name: z.string().min(1).max(100),
  type: z.enum(CHANNEL_TYPES),
  webhookUrl: z.string().optional(),
  webhookSecret: z.string().optional(),
  // ... other fields
}).superRefine((data, ctx) => {
  if (data.type === 'webhook' && !data.webhookUrl) {
    ctx.addIssue({
      code: z.ZodIssueCode.custom,
      message: 'Webhook URL is required',
      path: ['webhookUrl'],
    })
  }
  if (data.type === 'email' && !data.emailTo) {
    ctx.addIssue({
      code: z.ZodIssueCode.custom,
      message: 'Email recipients are required',
      path: ['emailTo'],
    })
  }
  // ... feishu validation
})
backend/src/db/index.ts (1)

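1901-1927: Grafana sync helper functions do not confirm the result

updateAlertRuleGrafanaSync and updateAlertChannelGrafanaSync return Promise<void> rather than the updated record or the number of affected rows, so callers cannot confirm whether the update succeeded.

This causes no problem in the current usage (grafanaSync.ts), but returning the update result would be more robust.

💡 Suggested: return the updated record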
 export async function updateAlertRuleGrafanaSync(
   id: number,
   fields: {
     grafanaUid?: string | null;
     grafanaSyncedAt?: Date | null;
     grafanaSyncError?: string | null;
   },
-): Promise<void> {
-  await db
+): Promise<AlertRule | null> {
+  const r = await db
     .update(schema.AlertRulesTable)
     .set(fields)
-    .where(eq(schema.AlertRulesTable.id, id));
+    .where(eq(schema.AlertRulesTable.id, id))
+    .returning();
+  const [first] = r;
+  return first ?? null;
 }

Comment thread backend/drizzle/0013_flowery_maria_hill.sql
Comment thread backend/package.json
Comment thread backend/src/api/admin/grafana.ts
Comment thread backend/src/services/alertDispatcher.ts
Comment thread backend/src/services/alertDispatcher.ts
Comment thread backend/src/services/grafanaSync.ts
Comment thread backend/src/utils/grafanaClient.ts
Comment thread frontend/src/pages/settings/alerts-settings-page.tsx
Comment thread frontend/src/pages/settings/grafana-settings-page.tsx
Comment thread frontend/src/routes/settings/grafana.tsx Outdated

Copilot AI left a comment

Pull request overview

Adds an end-to-end alerting system (channels/rules/history + engine) and integrates with Grafana (connection + syncing alerts/contact points), while restructuring the frontend navigation to separate Models from Settings.

Changes:

  • Introduces alert channel/rule/history APIs, DB schema/migrations, and a periodic alert evaluation/dispatch engine.
  • Adds Grafana connection management plus sync services/APIs to provision Grafana alert rules/contact points.
  • Restructures frontend routes/navigation and adds new Settings pages for Alerts and Grafana.

Reviewed changes

Copilot reviewed 34 out of 35 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
frontend/src/routes/settings/route.tsx Updates Settings sub-nav to Alerts/Grafana.
frontend/src/routes/settings/index.tsx Redirects /settings/ to /settings/alerts.
frontend/src/routes/settings/grafana.tsx Adds Grafana settings route with data preloading.
frontend/src/routes/settings/alerts.tsx Adds Alerts settings route with data preloading.
frontend/src/routes/models/route.tsx Adds /models layout route with Providers/Registry sub-nav.
frontend/src/routes/models/index.tsx Redirects /models/ to /models/providers.
frontend/src/routes/models/providers.tsx Moves providers route from /settings/providers to /models/providers.
frontend/src/routes/models/registry.tsx Moves registry route from /settings/models to /models/registry.
frontend/src/routeTree.gen.ts Regenerates route tree for new/relocated routes.
frontend/src/pages/settings/grafana-settings-page.tsx Adds Grafana connection + dashboard embed management UI.
frontend/src/pages/settings/alerts-settings-page.tsx Adds Alerts UI (channels/rules/history + Grafana sync status).
frontend/src/i18n/locales/en-US.json Adds/updates strings for new navigation and pages.
frontend/src/i18n/locales/zh-CN.json Adds/updates strings for new navigation and pages.
frontend/src/hooks/use-settings.ts Adds Grafana connection + sync status query helpers/types.
frontend/src/hooks/use-copy.tsx Fixes hook dependency list to include t.
frontend/src/components/ui/alert.tsx Adds new Alert UI primitive.
frontend/src/components/app/app-sidebar.tsx Splits sidebar into separate Models and Settings entries.
bun.lock Adds nodemailer + types to lockfile.
backend/src/utils/grafanaClient.ts Adds Grafana Provisioning API client wrapper.
backend/src/services/grafanaSync.ts Implements Grafana sync logic + PromQL mapping for rule types.
backend/src/services/alertEngine.ts Adds periodic alert evaluation engine + cooldown + history.
backend/src/services/alertDispatcher.ts Adds dispatchers for webhook/email/Feishu (nodemailer).
backend/src/index.ts Starts the alert engine at server startup.
backend/src/db/schema.ts Adds alert-related enums/types/tables + Grafana sync columns.
backend/src/db/index.ts Adds alert CRUD, history queries, and aggregation helpers.
backend/src/api/admin/index.ts Wires new admin routes for alerts and Grafana.
backend/src/api/admin/grafana.ts Adds Grafana connection/test/sync/status endpoints.
backend/src/api/admin/alerts.ts Adds alert channels/rules/history endpoints.
backend/src/adapters/upstream/anthropic.ts Modifies Anthropic request build logic (tool_choice).
backend/package.json Adds nodemailer + types dependencies.
backend/drizzle/meta/_journal.json Records new drizzle migrations.
backend/drizzle/meta/0013_snapshot.json Adds snapshot for initial alert tables/enums migration.
backend/drizzle/meta/0014_snapshot.json Adds snapshot for Grafana sync columns migration.
backend/drizzle/0013_flowery_maria_hill.sql Creates alert enums/tables + FK.
backend/drizzle/0014_opposite_dragon_man.sql Adds Grafana sync columns to alert tables.


Comment thread backend/src/db/index.ts
Comment thread backend/src/api/admin/alerts.ts
Comment thread frontend/src/components/ui/alert.tsx
Comment thread frontend/src/routes/settings/alerts.tsx
Comment thread backend/src/services/grafanaSync.ts Outdated
Comment thread backend/src/api/admin/grafana.ts
Comment thread frontend/src/pages/settings/alerts-settings-page.tsx
Comment thread frontend/src/pages/settings/grafana-settings-page.tsx Outdated
Comment thread backend/src/db/index.ts
Comment thread backend/src/db/index.ts
- Fix PromQL label selector syntax (no leading/trailing commas)
- Add fetch timeouts (AbortSignal.timeout) to all external HTTP calls
- Add ON DELETE CASCADE to alert_history FK referencing alert_rules
- Deduplicate grafanaConnectionQueryOptions (reuse from use-settings hook)
- Add default case to dispatchToChannel switch statement
- Fix duplicate listApiKeys() call in evaluateQuota
- Add missing i18n key pages.settings.grafana.Testing
- Fix wrong i18n key reference in grafana-settings-page.tsx
- Clear datasourceUid on connection test failure

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@backend/src/services/alertDispatcher.ts`:
- Around line 86-95: The HTML email template in alertDispatcher.ts directly
injects unescaped payload fields (subject, html using payload.ruleName,
payload.ruleType, payload.message, payload.currentValue, payload.threshold, and
payload.details) creating an XSS risk; fix by creating and using an
HTML-escaping helper (e.g., escapeHtml) and apply it to every dynamic insertion
before building the html string (also escape the result of
JSON.stringify(payload.details) if present), or import a vetted escaper from a
utils module, ensuring all payload.* values are escaped in the template and the
subject is sanitized as well.
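A minimal sketch of the kind of helper this comment asks for. The names `escapeHtml`, `renderAlertHtml`, and the `AlertPayload` shape are illustrative assumptions, not the PR's actual code; the point is only that every dynamic value, including the stringified `details`, passes through the escaper before it reaches the template.

```typescript
// Characters that must be escaped when text is placed into HTML content
// or a quoted attribute value.
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Hypothetical helper: converts any value to a string and escapes it.
function escapeHtml(value: unknown): string {
  return String(value).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}

// Hypothetical payload shape, reduced to three fields for illustration.
interface AlertPayload {
  ruleName: string;
  message: string;
  details?: Record<string, unknown>;
}

// Builds the HTML body; no payload field is interpolated unescaped.
function renderAlertHtml(payload: AlertPayload): string {
  const details = payload.details
    ? `<pre>${escapeHtml(JSON.stringify(payload.details, null, 2))}</pre>`
    : "";
  return `<h2>${escapeHtml(payload.ruleName)}</h2><p>${escapeHtml(payload.message)}</p>${details}`;
}
```

With this shape, a rule named `<b>x</b>` is rendered as literal text rather than as markup.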

In `@backend/src/services/grafanaSync.ts`:
- Around line 110-125: The quota PromQL built in the quota branch of buildPromQL
(in grafanaSync.ts) doesn't filter by apiKeyId; modify the case "quota" code so
when rule.condition.apiKeyId is present you append a label matcher like
{apiKeyId="<value>"} to each metric identifier used
(nexusgate_api_key_rpm_usage, nexusgate_api_key_rpm_limit,
nexusgate_api_key_tpm_usage, nexusgate_api_key_tpm_limit), otherwise leave the
raw metric names unchanged; ensure the apiKeyId value is properly quoted/escaped
when inserted into the label matcher and keep the rest of the expression and
returned object (expr, threshold, forDuration) unchanged.
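The label-matcher suggestion could be sketched roughly as follows. The helper names and the ratio expression in `buildRpmQuotaExpr` are assumptions for illustration; the real `buildPromQL` may combine the metrics differently. Only the metric names and the `apiKeyId` label come from the comment above.

```typescript
// Escape a value for use inside a double-quoted PromQL label matcher:
// backslashes, double quotes, and newlines need escaping.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Hypothetical helper: appends {apiKeyId="..."} only when a key is configured,
// otherwise returns the raw metric name unchanged.
function withApiKeySelector(metric: string, apiKeyId?: string | number): string {
  if (apiKeyId === undefined || apiKeyId === "") return metric;
  return `${metric}{apiKeyId="${escapeLabelValue(String(apiKeyId))}"}`;
}

// Assumed expression shape (usage as a percentage of the limit).
function buildRpmQuotaExpr(apiKeyId?: string | number): string {
  const usage = withApiKeySelector("nexusgate_api_key_rpm_usage", apiKeyId);
  const limit = withApiKeySelector("nexusgate_api_key_rpm_limit", apiKeyId);
  return `(${usage} / ${limit}) * 100`;
}
```

Because the selector is built in one place, no code path can produce an empty `{}` or a stray comma inside the braces.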
🧹 Nitpick comments (7)
backend/src/db/schema.ts (1)

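35-38: The AlertChannelConfig union type lacks a discriminant field.

The AlertChannelConfig union currently has no discriminator, so it may be hard to determine the concrete type at runtime when deserializing from JSON. Consider adding a type field to each config type as a discriminator, or relying on the external AlertChannelTypeEnum to determine the type.

The current implementation relies on the AlertChannelsTable.type field for discrimination, which is a workable design choice.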

backend/src/services/alertDispatcher.ts (1)

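77-84: Creating a new SMTP transport on every call may hurt performance.

dispatchEmail creates a new transport via nodemailer.createTransport on every call. For high-frequency alerting, consider reusing the transport instance or using a connection pool. At the alert system's current usage frequency, though, this is probably not urgent.

♻️ Optional optimization: transport cache
// A simple cache (or a WeakMap keyed by the config object) can be used to reuse the transport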
const transportCache = new Map<string, nodemailer.Transporter>();

function getOrCreateTransport(config: EmailChannelConfig) {
  const key = `${config.host}:${config.port}:${config.user}`;
  let transport = transportCache.get(key);
  if (!transport) {
    transport = createTransport({ ... });
    transportCache.set(key, transport);
  }
  return transport;
}
backend/src/utils/grafanaClient.ts (1)

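118-122: The list methods do not handle pagination, so large Grafana instances may return incomplete data.

listAlertRules() and listContactPoints() return the API response directly without handling Grafana API pagination. On Grafana instances with many alert rules and contact points, the full list may not be retrieved.

Consider adding pagination support in a later iteration, or documenting this limitation.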

Also applies to: 160-164

backend/src/services/grafanaSync.ts (1)

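198-237: buildContactPoint is missing a default branch.

Although TypeScript's type checking can ensure that channel.type covers every enum value, adding a default branch catches unexpected cases at runtime and stays consistent with dispatchToChannel.

♻️ Suggested: add a default branch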
     case "feishu": {
       const c = channel.config as FeishuChannelConfig;
       return {
         name: `[NexusGate] ${channel.name}`,
         type: "webhook",
         settings: {
           url: c.webhookUrl,
           httpMethod: "POST",
         },
       };
     }
+    default: {
+      const _exhaustive: never = channel.type;
+      throw new Error(`Unsupported channel type: ${channel.type}`);
+    }
   }
backend/src/api/admin/grafana.ts (1)

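20-26: The GrafanaConnection interface is defined in multiple places.

This interface is defined in both backend/src/api/admin/grafana.ts and backend/src/services/grafanaSync.ts. Consider extracting it to a shared location (such as @/db/schema or a new types file) to keep the definitions consistent.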

frontend/src/routes/settings/grafana.tsx (1)

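12-20: The query option definitions follow inconsistent patterns.

dashboardsQueryOptions is defined as a function, () => queryOptions({...}), while the imported grafanaConnectionQueryOptions is a plain queryOptions({...}) object. This leads to inconsistent usage:

queryClient.ensureQueryData(grafanaConnectionQueryOptions)      // no parentheses
queryClient.ensureQueryData(dashboardsQueryOptions())           // with parentheses

Consider unifying on one pattern to improve readability.

♻️ Suggested: unify on the plain-object pattern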
-const dashboardsQueryOptions = () =>
-  queryOptions({
+const dashboardsQueryOptions = queryOptions({
     queryKey: ['dashboards'],
     queryFn: async () => {
       const { data, error } = await api.admin.dashboards.get()
       if (error) throw formatError(error, i18n.t('pages.settings.grafana.FetchError'))
       return data as DashboardsResponse
     },
   })

// Then at the call site:
-queryClient.ensureQueryData(dashboardsQueryOptions())
+queryClient.ensureQueryData(dashboardsQueryOptions)

Also applies to: 24-26, 34-35

frontend/src/i18n/locales/en-US.json (1)

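448-449: The key name and its value disagree on singular versus plural.

The key NotifyChannels is plural, but the value "Notify Channel" is singular. If the UI supports selecting multiple notification channels, consider changing it to "Notify Channels" for consistency.

Suggested change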
-  "pages.settings.alerts.NotifyChannels": "Notify Channel",
+  "pages.settings.alerts.NotifyChannels": "Notify Channels",

Comment thread backend/src/services/alertDispatcher.ts
Comment thread backend/src/services/grafanaSync.ts
- Add escapeHtml() helper and apply to all dynamic fields in email template
- Add apiKeyId label selector to quota PromQL when a specific key is configured

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@backend/src/services/alertDispatcher.ts`:
- Around line 150-162: The Feishu webhook signature is computed incorrectly:
change the call in alertDispatcher where signature is built (currently const
signature = await computeHmacSha256(config.secret, stringToSign)) to pass the
HMAC key as stringToSign (timestamp + "\n" + secret) and the message as an empty
string, and update computeHmacSha256 to return a Base64-encoded HMAC (not hex)
so the returned value is the Base64 signature Feishu expects; ensure the symbol
names referenced are computeHmacSha256, stringToSign, timestamp, config.secret
and the assignment that adds { timestamp, sign: signature } to body remain
consistent.

In `@backend/src/services/grafanaSync.ts`:
- Around line 372-378: The isGrafanaConnected function currently returns true
when config.verified is true even if datasourceUid is missing, which can cause
the alert engine to skip evaluation while Grafana is unusable; change
isGrafanaConnected to mirror getGrafanaClient’s requirements by checking both
that config.verified is true and that config.datasourceUid (or the exact
property used in getGrafanaClient) is present/non-empty before returning true so
the connectivity check only succeeds when datasourceUid exists and Grafana is
truly reachable.
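A rough sketch of the stricter check described above. The `GrafanaConnection` shape here is an assumption: `verified` and `datasourceUid` are named in the comment, while `apiUrl` and `authToken` are placeholder field names. The function name `isGrafanaUsable` is also hypothetical, chosen to avoid implying this is the PR's `isGrafanaConnected`.

```typescript
// Assumed config shape; only `verified` and `datasourceUid` come from the review.
interface GrafanaConnection {
  apiUrl: string;
  authToken: string;
  verified: boolean;
  datasourceUid?: string | null;
}

// Returns true only when the connection was verified AND a non-empty
// datasource UID is stored, so the built-in engine is never skipped
// while Grafana cannot actually evaluate the rules.
function isGrafanaUsable(config: GrafanaConnection | null | undefined): boolean {
  if (!config || !config.verified) return false;
  return typeof config.datasourceUid === "string" && config.datasourceUid.trim() !== "";
}
```

Sharing one predicate between the connectivity check and the client factory keeps the two conditions from drifting apart.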

Comment thread backend/src/services/alertDispatcher.ts
Comment thread backend/src/services/grafanaSync.ts
Per Feishu docs, the signature must use `timestamp\nsecret` as the HMAC key
and sign an empty string, then Base64-encode the result. The previous
implementation incorrectly used `secret` as key, `timestamp\nsecret` as
data, and hex-encoded the output.
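
The corrected scheme can be sketched as below. This version uses `createHmac` from Node's built-in `node:crypto` module as a stand-in for the PR's `crypto.subtle`-based implementation, and `signFeishuBody` with its `nowMs` parameter is a hypothetical wrapper added for illustration.

```typescript
import { createHmac } from "node:crypto";

// HMAC-SHA256 with `timestamp + "\n" + secret` as the KEY, an empty string
// as the message, and Base64 (not hex) as the output encoding.
function computeFeishuSignature(timestamp: string, secret: string): string {
  const key = `${timestamp}\n${secret}`;
  return createHmac("sha256", key).update("").digest("base64");
}

// Hypothetical wrapper: adds { timestamp, sign } to an outgoing webhook body.
// The timestamp is in seconds; `nowMs` is injectable to keep the sketch testable.
function signFeishuBody<T extends object>(body: T, secret: string, nowMs: number = Date.now()) {
  const timestamp = Math.floor(nowMs / 1000).toString();
  return { ...body, timestamp, sign: computeFeishuSignature(timestamp, secret) };
}
```

Swapping the key and message roles, as the previous implementation did, yields a different digest, so the webhook rejects the request.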

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@backend/src/services/alertDispatcher.ts`:
- Around line 102-135: The email subject assembled in dispatchEmail (variable
subject using payload.ruleName and payload.ruleType) is vulnerable to header
injection because it may contain CR/LF; sanitize or validate those fields before
building subject by stripping or replacing CR and LF (e.g., remove \r and \n
from payload.ruleName and payload.ruleType or run the full subject through a
validation/sanitizer) and then use the sanitized values when assigning subject
and calling transport.sendMail; ensure the sanitization logic is applied to any
other user-provided pieces used in the subject.
- Around line 50-69: In computeFeishuSignature replace the browser-only btoa
usage with Node-compatible Base64 encoding: take the ArrayBuffer result
(signature), convert to a Uint8Array and call
Buffer.from(uint8Array).toString('base64') so the function works on Node.js
^12.22.0 and ^14.17.0; keep the existing TextEncoder/crypto.subtle importKey and
sign steps and only change the final encoding step to use
Buffer.from(...).toString('base64').
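
For the first item, a minimal sketch of CR/LF stripping before the subject is assembled. The helper names and the `[NexusGate Alert]` subject format are assumptions, not the PR's actual strings.

```typescript
// Collapse any run of CR/LF characters into a single space so a value
// cannot terminate the Subject header and start a new one.
function sanitizeHeaderValue(value: string): string {
  return value.replace(/[\r\n]+/g, " ").trim();
}

// Hypothetical subject builder; every user-provided piece is sanitized first.
function buildSubject(ruleName: string, ruleType: string): string {
  return `[NexusGate Alert] ${sanitizeHeaderValue(ruleName)} (${sanitizeHeaderValue(ruleType)})`;
}
```

A rule name such as `cpu\r\nBcc: x@y.z` then ends up as ordinary text on a single subject line instead of an injected header.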

Comment thread backend/src/services/alertDispatcher.ts
Comment thread backend/src/services/alertDispatcher.ts
Comment thread backend/src/services/alertDispatcher.ts
Comment thread backend/src/services/alertDispatcher.ts Outdated
Comment thread backend/src/services/alertDispatcher.ts
Comment thread backend/src/index.ts
- Replace manual Array.from hex encoding with Buffer.from().toString("hex")
- Add SIGINT/SIGTERM handlers to stop alert engine and server on shutdown

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@koitococo koitococo merged commit fc9db19 into main Feb 1, 2026
2 checks passed
@pescn pescn deleted the feat/grafana-alert-integration branch February 1, 2026 09:47

Development

Successfully merging this pull request may close these issues.

feat: Alert System

3 participants