docs(cerebro): add production alerting guidance #745
Deploying corvus with
| Latest commit: | 5b55915 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://039b258b.corvus-42x.pages.dev |
| Branch Preview URL: | https://docs-cerebro-alerting-guidan.corvus-42x.pages.dev |
📝 Walkthrough
The PR expands production observability guidance across documentation and specifications by introducing Prometheus metrics documentation, defining concrete alerting thresholds for readiness probe failures, authentication anomalies, error rate elevation, storage errors, and latency spikes, and establishing alerting requirements for exposed Cerebro deployments.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
clients/cerebro/README.md (1)
118-127: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Fix p95 latency PromQL in README.md to include histogram aggregation (`sum by (le, tool)`).
Line 124 latency spike expression is missing histogram aggregation:
- Current: `histogram_quantile(0.95, rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m]))` by `tool`
- But your operations.md example correctly aggregates buckets: `sum by (le, tool) (rate(..._bucket...))`.
Update README.md to match the working example to keep it copy-paste runnable.
✅ Proposed doc-only fix
-| Latency spike | `histogram_quantile(0.95, rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m]))` by `tool` | warn above p95 `1s`; page above p95 `2s` for 10 minutes |
+| Latency spike | `histogram_quantile(0.95, sum by (le, tool) (rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])))` | warn above p95 `1s`; page above p95 `2s` for 10 minutes |
As per coding guidelines, **/*: Security first, performance second. Validate input boundaries...** Here that means validating alert-rule "inputs" (PromQL) so production operators don't deploy broken queries.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clients/cerebro/README.md` around lines 118 - 127, The p95 PromQL in the README's "Latency spike" row uses rate(...) on buckets without the required histogram aggregation; update the expression used in the latency spike (the line containing histogram_quantile(0.95, rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) by tool) to match the working example by wrapping rate(...) with sum by (le, tool) — i.e., use histogram_quantile(0.95, sum by (le, tool) (rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) ) so the bucket aggregation is correct and the query is copy-paste runnable.
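To make the aggregation point concrete, here is a minimal Python sketch of what `sum by (le, tool)` followed by `histogram_quantile(0.95, ...)` computes. It is a simplified re-implementation for illustration, not Prometheus's code. The `instance` label, the tool name `search`, and all bucket values are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical per-instance bucket rates for one tool (cumulative counts per `le`).
SERIES = [
    ({"instance": "a", "tool": "search", "le": 0.5}, 40.0),
    ({"instance": "a", "tool": "search", "le": 1.0}, 45.0),
    ({"instance": "a", "tool": "search", "le": 2.0}, 49.0),
    ({"instance": "a", "tool": "search", "le": math.inf}, 50.0),
    ({"instance": "b", "tool": "search", "le": 0.5}, 20.0),
    ({"instance": "b", "tool": "search", "le": 1.0}, 35.0),
    ({"instance": "b", "tool": "search", "le": 2.0}, 49.0),
    ({"instance": "b", "tool": "search", "le": math.inf}, 50.0),
]


def sum_by_le_tool(series):
    """Mimic `sum by (le, tool)`: add values across every other label."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[(labels["tool"], labels["le"])] += value
    return totals


def quantile_from_buckets(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets
    using linear interpolation inside the bucket that contains the rank."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


def p95_by_tool(series):
    """Aggregate buckets per tool, then compute one p95 per tool."""
    per_tool = defaultdict(list)
    for (tool, le), value in sum_by_le_tool(series).items():
        per_tool[tool].append((le, value))
    return {tool: quantile_from_buckets(0.95, b) for tool, b in per_tool.items()}


P95 = p95_by_tool(SERIES)
```

With these made-up numbers the summed buckets are 60, 80, 98, and 100, so the p95 rank of 95 lands in the `le="2"` bucket and the result is one value per tool rather than one per instance.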
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@clients/web/apps/docs/src/content/docs/cerebro/operations.md`:
- Around line 199-210: The table has two broken PromQL snippets: for "Elevated
MCP error rate" replace the status matcher
status=~"storage_error/internal_error" with the proper regex alternation
status=~"storage_error|internal_error" (matching the example block), and for
"Latency spike" use the required histogram aggregation like in the example block
by applying sum by (le, tool) to the per-bucket rates, e.g.
histogram_quantile(0.95, sum by (le,
tool)(rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m]))); update the
table entries for "Elevated MCP error rate" and "Latency spike" accordingly to
match the working expressions shown later.
- Around line 211-230: The server-side MCP error ratio PromQL example uses a
division across two sum(rate(...)) expressions which is correct but ambiguous;
update the example around the expression using cerebro_requests_total (the
numerator
sum(rate(cerebro_requests_total{status=~"storage_error|internal_error"}[5m]))
and the denominator sum(rate(cerebro_requests_total[5m]))) by wrapping the
entire numerator/denominator division in parentheses before applying > 0.02 so
it reads ( numerator / denominator ) > 0.02 for clarity.
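The two findings above can be sketched in Python. Prometheus regex matchers are fully anchored, which `re.fullmatch` approximates here; Python's `re` is not RE2, but the two agree for these literal patterns. The per-status request rates are hypothetical, and `error_ratio` is an illustrative helper, not part of any Cerebro or Prometheus API.

```python
import re

BROKEN_MATCHER = "storage_error/internal_error"  # slash is a literal character
FIXED_MATCHER = "storage_error|internal_error"   # pipe is regex alternation

# Hypothetical per-status request rates (requests per second).
RATES = {"ok": 95.0, "storage_error": 3.0, "internal_error": 2.0}


def label_matches(pattern, value):
    """Approximate a fully anchored `=~` label matcher."""
    return re.fullmatch(pattern, value) is not None


def error_ratio(rates, pattern):
    """Ratio of matching-status traffic to all traffic, i.e. (numerator / denominator)."""
    errors = sum(rate for status, rate in rates.items() if label_matches(pattern, status))
    total = sum(rates.values())
    return errors / total if total else 0.0


BROKEN_RATIO = error_ratio(RATES, BROKEN_MATCHER)  # no status matches the slash form
FIXED_RATIO = error_ratio(RATES, FIXED_MATCHER)    # both error statuses match

# Explicit parentheses mirror the suggested `( numerator / denominator ) > 0.02`.
SHOULD_ALERT = (FIXED_RATIO) > 0.02
```

The slash form silently selects nothing, so the alert would never fire; the alternation form counts both error statuses before the threshold comparison.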
In `@openspec/specs/client-surfaces/gateway-api.md`:
- Around line 39-50: The spec’s alert descriptions must match the corrected
PromQL in the docs: update any operator/implementation guidance so the
server-side error selector uses the alternation form
`storage_error|internal_error` (referencing the cerebro_requests_total outcome
selector) and ensure the p95 latency guidance explicitly calls out the histogram
aggregation pattern used in operations.md (e.g., sum by (le, tool) over
cerebro_tool_latency_seconds buckets to compute the 95th percentile for
successful tool calls); also confirm the other metric names in the text
(cerebro_readiness_failures_total, cerebro_auth_failures_total,
cerebro_storage_errors_total) and their threshold expressions map exactly to the
PromQL in operations.md/README.md so the spec and docs stay consistent.
---
Outside diff comments:
In `@clients/cerebro/README.md`:
- Around line 118-127: The p95 PromQL in the README's "Latency spike" row uses
rate(...) on buckets without the required histogram aggregation; update the
expression used in the latency spike (the line containing
histogram_quantile(0.95,
rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) by tool) to match
the working example by wrapping rate(...) with sum by (le, tool) — i.e., use
histogram_quantile(0.95, sum by (le, tool)
(rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) ) so the bucket
aggregation is correct and the query is copy-paste runnable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6cfbe037-f3e9-4812-9144-f935902c11a9
📒 Files selected for processing (3)
- clients/cerebro/README.md
- clients/web/apps/docs/src/content/docs/cerebro/operations.md
- openspec/specs/client-surfaces/gateway-api.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: pr-checks
- GitHub Check: sonar
- GitHub Check: semgrep-cloud-platform/scan
- GitHub Check: Cloudflare Pages
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: submit-gradle
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{md,mdx}
⚙️ CodeRabbit configuration file
**/*.{md,mdx}: Verify technical accuracy and that docs stay aligned with code changes.
For user-facing docs, check EN/ES parity or explicitly note pending translation gaps.
Files:
- clients/web/apps/docs/src/content/docs/cerebro/operations.md
- openspec/specs/client-surfaces/gateway-api.md
- clients/cerebro/README.md
**/*
⚙️ CodeRabbit configuration file
**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.
Files:
- clients/web/apps/docs/src/content/docs/cerebro/operations.md
- openspec/specs/client-surfaces/gateway-api.md
- clients/cerebro/README.md



Related Issues
Fixes #701
Summary
`/metrics`, `/readyz`, and structured log signals.
Tested Information
`git diff --check` successfully.
Documentation Impact
- clients/web/apps/docs/src/content/docs/cerebro/operations.md
- clients/cerebro/README.md
- openspec/specs/client-surfaces/gateway-api.md
Breaking Changes
None.
Checklist