
docs(cerebro): add production alerting guidance #745

Merged
yacosta738 merged 3 commits into main from docs/cerebro-alerting-guidance-701 on May 1, 2026

Conversation

@yacosta738
Contributor

Related Issues

Fixes #701


Summary

  • Adds Cerebro production alerting guidance for readiness degradation, auth anomalies, server-side error rates, storage error spikes, and latency spikes.
  • Ties recommended alerts to existing /metrics, /readyz, and structured log signals.
  • Records the gateway/client-surface requirement in OpenSpec with internal production threshold examples.

Tested Information

  • Ran git diff --check successfully.
  • Commit hooks ran the staged docs/config lychee offline check successfully.

Documentation Impact

  • Docs updated in:
    • clients/web/apps/docs/src/content/docs/cerebro/operations.md
    • clients/cerebro/README.md
    • openspec/specs/client-surfaces/gateway-api.md
  • I verified the documentation matches the current behavior.

Breaking Changes

None.


Checklist

  • I have checked that there isn’t already a PR solving the same problem.
  • I have read the Contributing Guidelines.
  • I ensured my code follows the project's style guidelines.
  • I have added or updated tests that prove my fix is effective or that my feature works.
  • I have updated the documentation, or I explained above why no documentation update is needed.
  • I verified the documentation matches the current behavior.
  • I have documented any breaking changes in the Breaking Changes section.
  • I have linked the related issue (if any).

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented May 1, 2026

Deploying corvus with Cloudflare Pages

Latest commit: 5b55915
Status: ✅  Deploy successful!
Preview URL: https://039b258b.corvus-42x.pages.dev
Branch Preview URL: https://docs-cerebro-alerting-guidan.corvus-42x.pages.dev


@coderabbitai
Contributor

coderabbitai Bot commented May 1, 2026

Warning

Rate limit exceeded

@yacosta738 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 46 minutes and 14 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 187f6e62-c94e-48d9-9997-4ef593a1d2be

📥 Commits

Reviewing files that changed from the base of the PR and between bcdaf79 and 5b55915.

📒 Files selected for processing (3)
  • clients/cerebro/README.md
  • clients/web/apps/docs/src/content/docs/cerebro/operations.md
  • openspec/specs/client-surfaces/gateway-api.md
📝 Walkthrough


The PR expands production observability guidance across documentation and specifications by introducing Prometheus metrics documentation, defining concrete alerting thresholds for readiness probe failures, authentication anomalies, error rate elevation, storage errors, and latency spikes, and establishing alerting requirements for exposed Cerebro deployments.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Cerebro Documentation: clients/cerebro/README.md, clients/web/apps/docs/src/content/docs/cerebro/operations.md | Adds Prometheus metrics definitions (cerebro_requests_total, cerebro_tool_latency_seconds, cerebro_auth_failures_total, cerebro_readiness_failures_total, cerebro_storage_errors_total) with label sets and example PromQL-based alerting rules for readiness degradation, auth anomalies, MCP error rates, storage operation errors, and tool latency p95. |
| Specification Update: openspec/specs/client-surfaces/gateway-api.md | Mandates alerting guidance requirements for exposed Cerebro deployments, including readiness monitoring, auth failure detection, server-side error rate calculation, storage error spikes, and tool latency thresholds with example alert conditions. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

area:docs, area:web

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title follows Conventional Commits style with 'docs' prefix and clearly describes adding production alerting guidance for Cerebro, matching the main changeset purpose. |
| Description check | ✅ Passed | The PR description is comprehensive, including related issues, summary, tested information, documentation impact, breaking changes, and a completed checklist matching the template structure. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #701 requirements: defines alerts for readiness failures, auth anomalies, error rates, storage spikes, and latency spikes; ties guidance to /metrics, /readyz, and logs; provides threshold examples and records requirements in OpenSpec. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to documentation updates directly addressing issue #701 requirements; no unrelated code or configuration changes are present. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@github-actions github-actions Bot added the size/s label (denotes a small change size) May 1, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
clients/cerebro/README.md (1)

118-127: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix p95 latency PromQL in README.md to include histogram aggregation (sum by (le, tool)).

Line [124] latency spike expression is missing histogram aggregation:

  • Current: histogram_quantile(0.95, rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) by tool
  • But your operations.md example correctly aggregates buckets: sum by (le, tool) (rate(..._bucket...)).

Update README.md to match the working example to keep it copy-paste runnable.

✅ Proposed doc-only fix
-| Latency spike | `histogram_quantile(0.95, rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m]))` by `tool` | warn above p95 `1s`; page above p95 `2s` for 10 minutes |
+| Latency spike | `histogram_quantile(0.95, sum by (le, tool) (rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])))` | warn above p95 `1s`; page above p95 `2s` for 10 minutes |

As per the coding guidelines for **/* ("Security first, performance second. Validate input boundaries..."), here that means validating alert-rule “inputs” (PromQL) so production operators don’t deploy broken queries.
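To make the aggregation point concrete, here is a minimal Python sketch of what `sum by (le, tool)` followed by `histogram_quantile(0.95, ...)` computes. It is a simplified model of the linear interpolation Prometheus performs over cumulative buckets, not Prometheus's own implementation. The tool name `search`, the `instance` label values, and all bucket rates are made-up illustration data, not values from this PR.

```python
import math
from collections import defaultdict


def sum_by_le_tool(series):
    """Sum per-bucket rates by (tool, le), dropping other labels such as instance."""
    grouped = defaultdict(lambda: defaultdict(float))
    for labels, rate in series:
        grouped[labels["tool"]][float(labels["le"])] += rate
    # One sorted list of cumulative (upper_bound, rate) buckets per tool.
    return {tool: sorted(buckets.items()) for tool, buckets in grouped.items()}


def histogram_quantile(q, buckets):
    """Estimate quantile q by linear interpolation inside the bucket holding the target rank."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    # A usable histogram needs at least two buckets, the last one being le="+Inf".
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]) or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical per-bucket rates for one tool served by two instances.
series = [
    ({"tool": "search", "instance": "a", "le": "0.5"}, 6.0),
    ({"tool": "search", "instance": "a", "le": "1"}, 9.0),
    ({"tool": "search", "instance": "a", "le": "2"}, 10.0),
    ({"tool": "search", "instance": "a", "le": "+Inf"}, 10.0),
    ({"tool": "search", "instance": "b", "le": "0.5"}, 4.0),
    ({"tool": "search", "instance": "b", "le": "1"}, 7.0),
    ({"tool": "search", "instance": "b", "le": "2"}, 10.0),
    ({"tool": "search", "instance": "b", "le": "+Inf"}, 10.0),
]

p95_by_tool = {
    tool: histogram_quantile(0.95, buckets)
    for tool, buckets in sum_by_le_tool(series).items()
}
print(p95_by_tool)
```

With this sample data the aggregated p95 for `search` comes out at about 1.75 seconds, which would sit between the `1s` warn and `2s` page thresholds quoted in the table row. Keeping `le` in the grouping is what preserves the bucket boundaries the interpolation needs; keeping `tool` yields one estimate per tool rather than one per instance.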

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clients/cerebro/README.md` around lines 118 - 127, The p95 PromQL in the
README's "Latency spike" row uses rate(...) on buckets without the required
histogram aggregation; update the expression used in the latency spike (the line
containing histogram_quantile(0.95,
rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) by tool) to match
the working example by wrapping rate(...) with sum by (le, tool) — i.e., use
histogram_quantile(0.95, sum by (le, tool)
(rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) ) so the bucket
aggregation is correct and the query is copy-paste runnable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@clients/web/apps/docs/src/content/docs/cerebro/operations.md`:
- Around line 199-210: The table has two broken PromQL snippets: for "Elevated
MCP error rate" replace the status matcher
status=~"storage_error/internal_error" with the proper regex alternation
status=~"storage_error|internal_error" (matching the example block), and for
"Latency spike" use the required histogram aggregation like in the example block
by applying sum by (le, tool) to the per-bucket rate(...) results, e.g.
histogram_quantile(0.95, sum by (le, tool) (rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])));
update the table entries for "Elevated MCP error rate" and "Latency spike"
accordingly to match the working expressions shown later.
- Around line 211-230: The server-side MCP error ratio PromQL example uses a
division across two sum(rate(...)) expressions which is correct but ambiguous;
update the example around the expression using cerebro_requests_total (the
numerator
sum(rate(cerebro_requests_total{status=~"storage_error|internal_error"}[5m]))
and the denominator sum(rate(cerebro_requests_total[5m]))) by wrapping the
entire numerator/denominator division in parentheses before applying > 0.02 so
it reads ( numerator / denominator ) > 0.02 for clarity.

In `@openspec/specs/client-surfaces/gateway-api.md`:
- Around line 39-50: The spec’s alert descriptions must match the corrected
PromQL in the docs: update any operator/implementation guidance so the
server-side error selector uses the alternation form
`storage_error|internal_error` (referencing the cerebro_requests_total outcome
selector) and ensure the p95 latency guidance explicitly calls out the histogram
aggregation pattern used in operations.md (e.g., sum by (le, tool) over
cerebro_tool_latency_seconds buckets to compute the 95th percentile for
successful tool calls); also confirm the other metric names in the text
(cerebro_readiness_failures_total, cerebro_auth_failures_total,
cerebro_storage_errors_total) and their threshold expressions map exactly to the
PromQL in operations.md/README.md so the spec and docs stay consistent.

---

Outside diff comments:
In `@clients/cerebro/README.md`:
- Around line 118-127: The p95 PromQL in the README's "Latency spike" row uses
rate(...) on buckets without the required histogram aggregation; update the
expression used in the latency spike (the line containing
histogram_quantile(0.95,
rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) by tool) to match
the working example by wrapping rate(...) with sum by (le, tool) — i.e., use
histogram_quantile(0.95, sum by (le, tool)
(rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) ) so the bucket
aggregation is correct and the query is copy-paste runnable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6cfbe037-f3e9-4812-9144-f935902c11a9

📥 Commits

Reviewing files that changed from the base of the PR and between d3d8402 and bcdaf79.

📒 Files selected for processing (3)
  • clients/cerebro/README.md
  • clients/web/apps/docs/src/content/docs/cerebro/operations.md
  • openspec/specs/client-surfaces/gateway-api.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: pr-checks
  • GitHub Check: sonar
  • GitHub Check: semgrep-cloud-platform/scan
  • GitHub Check: Cloudflare Pages
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: submit-gradle
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{md,mdx}

⚙️ CodeRabbit configuration file

**/*.{md,mdx}: Verify technical accuracy and that docs stay aligned with code changes.
For user-facing docs, check EN/ES parity or explicitly note pending translation gaps.

Files:

  • clients/web/apps/docs/src/content/docs/cerebro/operations.md
  • openspec/specs/client-surfaces/gateway-api.md
  • clients/cerebro/README.md
**/*

⚙️ CodeRabbit configuration file

**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.

Files:

  • clients/web/apps/docs/src/content/docs/cerebro/operations.md
  • openspec/specs/client-surfaces/gateway-api.md
  • clients/cerebro/README.md

Comment thread clients/web/apps/docs/src/content/docs/cerebro/operations.md
Comment thread clients/web/apps/docs/src/content/docs/cerebro/operations.md
Comment thread openspec/specs/client-surfaces/gateway-api.md
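
The inline comments on the error-rate expression turn on two details: PromQL regex matchers are fully anchored, so `storage_error/internal_error` can only match that literal string, and the ratio is compared against `0.02` after the division. The following Python sketch models both under stated assumptions. The per-status rates and the `ok` status value for `cerebro_requests_total` are hypothetical, and `re.fullmatch` stands in for the anchored `=~` matcher.

```python
import re


def matches(pattern, value):
    """Model a PromQL =~ matcher, which must match the whole label value."""
    return re.fullmatch(pattern, value) is not None


def error_ratio(pattern, rates):
    """Share of the total request rate whose status matches the pattern."""
    errors = sum(rate for status, rate in rates.items() if matches(pattern, status))
    total = sum(rates.values())
    # PromQL would return no sample for an empty numerator; this sketch uses 0.0 instead.
    return errors / total if total else 0.0


# Hypothetical requests-per-second by status label (assumed values, not from the PR).
rates = {"ok": 97.0, "storage_error": 2.0, "internal_error": 1.0}

broken = error_ratio("storage_error/internal_error", rates)  # matches no status
fixed = error_ratio("storage_error|internal_error", rates)   # matches both error statuses

# ( numerator / denominator ) > 0.02, written with explicit parentheses.
alert_fires = (fixed) > 0.02
print(broken, fixed, alert_fires)
```

In this model the slash-separated pattern never selects an error series, so an alert built on it could not fire, while the alternation form gives a 3% ratio that crosses the 2% example threshold.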
@sonarqubecloud

sonarqubecloud Bot commented May 1, 2026

@yacosta738 yacosta738 merged commit 116a4b4 into main May 1, 2026
18 checks passed
@yacosta738 yacosta738 deleted the docs/cerebro-alerting-guidance-701 branch May 1, 2026 22:19
@coderabbitai coderabbitai Bot mentioned this pull request May 2, 2026
8 tasks

Labels

area:docs, area:web, size/s (denotes a small change size)


Development

Successfully merging this pull request may close these issues.

cerebro: add production alerting guidance for readiness, auth, and error spikes

1 participant