docs(cerebro): add production alerting guidance by yacosta738 · Pull Request #745 · dallay/corvus

yacosta738 · 2026-05-01T21:29:36Z

Related Issues

Fixes #701

Summary

Adds Cerebro production alerting guidance for readiness degradation, auth anomalies, server-side error rates, storage error spikes, and latency spikes.
Ties recommended alerts to existing /metrics, /readyz, and structured log signals.
Records the gateway/client-surface requirement in OpenSpec with internal production threshold examples.

Tested Information

Ran git diff --check successfully.
Commit hooks ran the staged docs/config lychee offline check successfully.

Documentation Impact

Docs updated in:
- clients/web/apps/docs/src/content/docs/cerebro/operations.md
- clients/cerebro/README.md
- openspec/specs/client-surfaces/gateway-api.md
I verified the documentation matches the current behavior.

Breaking Changes

None.

Checklist

I have checked that there isn’t already a PR solving the same problem.
I have read the Contributing Guidelines.
I ensured my code follows the project's style guidelines.
I have added or updated tests that prove my fix is effective or that my feature works.
I have updated the documentation, or I explained above why no documentation update is needed.
I verified the documentation matches the current behavior.
I have documented any breaking changes in the Breaking Changes section.
I have linked the related issue (if any).

cloudflare-workers-and-pages · 2026-05-01T21:29:40Z

Deploying corvus with Cloudflare Pages

Latest commit:	`5b55915`
Status:	✅ Deploy successful!
Preview URL:	https://039b258b.corvus-42x.pages.dev
Branch Preview URL:	https://docs-cerebro-alerting-guidan.corvus-42x.pages.dev


coderabbitai · 2026-05-01T21:29:50Z

Warning

Rate limit exceeded

@yacosta738 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 46 minutes and 14 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 187f6e62-c94e-48d9-9997-4ef593a1d2be

📥 Commits

Reviewing files that changed from the base of the PR and between bcdaf79 and 5b55915.

📒 Files selected for processing (3)

clients/cerebro/README.md
clients/web/apps/docs/src/content/docs/cerebro/operations.md
openspec/specs/client-surfaces/gateway-api.md

📝 Walkthrough

Walkthrough

The PR expands production observability guidance across documentation and specifications by introducing Prometheus metrics documentation, defining concrete alerting thresholds for readiness probe failures, authentication anomalies, error rate elevation, storage errors, and latency spikes, and establishing alerting requirements for exposed Cerebro deployments.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Cerebro Documentation: `clients/cerebro/README.md`, `clients/web/apps/docs/src/content/docs/cerebro/operations.md` | Adds Prometheus metrics definitions (`cerebro_requests_total`, `cerebro_tool_latency_seconds`, `cerebro_auth_failures_total`, `cerebro_readiness_failures_total`, `cerebro_storage_errors_total`) with label sets and example PromQL-based alerting rules for readiness degradation, auth anomalies, MCP error rates, storage operation errors, and tool latency p95. |
| Specification Update: `openspec/specs/client-surfaces/gateway-api.md` | Mandates alerting guidance requirements for exposed Cerebro deployments, including readiness monitoring, auth failure detection, server-side error rate calculation, storage error spikes, and tool latency thresholds with example alert conditions. |
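
To make the server-side error-rate alert concrete, here is a minimal Python sketch of the arithmetic such a rule performs. It assumes that `storage_error` and `internal_error` are the failing `status` values of `cerebro_requests_total` and that 2% is the alert threshold, both taken from the review discussion in this PR. The function names, the `ok` status, and the sample rates are hypothetical and not part of Cerebro.

```python
from typing import Mapping

# Status values treated as server-side failures (assumption based on the
# selector discussed in this PR's review comments).
SERVER_SIDE_STATUSES = frozenset({"storage_error", "internal_error"})


def server_error_ratio(rates_by_status: Mapping[str, float]) -> float:
    """Return the share of the request rate caused by server-side failures.

    `rates_by_status` maps a status label value to its per-second request
    rate, roughly what sum by (status) (rate(cerebro_requests_total[5m]))
    would yield. An idle service (total rate 0) reports a ratio of 0.0.
    """
    total = sum(rates_by_status.values())
    if total == 0:
        return 0.0
    failed = sum(
        rate
        for status, rate in rates_by_status.items()
        if status in SERVER_SIDE_STATUSES
    )
    return failed / total


def should_alert(rates_by_status: Mapping[str, float], threshold: float = 0.02) -> bool:
    """Compare the whole ratio against the threshold: (failed / total) > threshold."""
    return server_error_ratio(rates_by_status) > threshold


# Hypothetical per-second rates, for illustration only.
sample_rates = {"ok": 9.7, "storage_error": 0.2, "internal_error": 0.1}
alerting = should_alert(sample_rates)
```

The comparison is applied to the finished ratio, which mirrors wrapping the PromQL division in parentheses before `> 0.02`.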

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

area:docs, area:web

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title follows Conventional Commits style with 'docs' prefix and clearly describes adding production alerting guidance for Cerebro, matching the main changeset purpose. |
| Description check | ✅ Passed | The PR description is comprehensive, including related issues, summary, tested information, documentation impact, breaking changes, and a completed checklist matching the template structure. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue `#701` requirements: defines alerts for readiness failures, auth anomalies, error rates, storage spikes, and latency spikes; ties guidance to /metrics, /readyz, and logs; provides threshold examples and records requirements in OpenSpec. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to documentation updates directly addressing issue `#701` requirements; no unrelated code or configuration changes are present. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

clients/cerebro/README.md (1)
118-127: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix p95 latency PromQL in README.md to include histogram aggregation (sum by (le, tool)).

Line [124] latency spike expression is missing histogram aggregation:

Current: histogram_quantile(0.95, rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) by tool

But your operations.md example correctly aggregates buckets: sum by (le, tool) (rate(..._bucket...)).

Update README.md to match the working example to keep it copy-paste runnable.
✅ Proposed doc-only fix
-| Latency spike | `histogram_quantile(0.95, rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m]))` by `tool` | warn above p95 `1s`; page above p95 `2s` for 10 minutes |
+| Latency spike | `histogram_quantile(0.95, sum by (le, tool) (rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])))` | warn above p95 `1s`; page above p95 `2s` for 10 minutes |
As per the coding guidelines for `**/*` ("Security first, performance second. Validate input boundaries..."), here that means validating alert-rule “inputs” (PromQL) so production operators don’t deploy broken queries.
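
To illustrate what the `sum by (le, tool)` aggregation contributes, the following Python sketch adds up cumulative bucket rates per `le` bound and then estimates a quantile by linear interpolation. It is a simplified model of the idea, not Prometheus's implementation; `sum_by_le`, `histogram_quantile`, and the bucket values for the two instances are hypothetical.

```python
import math
from collections import defaultdict


def sum_by_le(series):
    """Add up cumulative bucket rates that share the same `le` bound.

    `series` is a list of dicts mapping an upper bound (`le`) to the
    cumulative per-second rate of that bucket, one dict per scraped series.
    """
    totals = defaultdict(float)
    for buckets in series:
        for le, value in buckets.items():
            totals[le] += value
    return sorted(totals.items())


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from sorted cumulative (le, value) pairs.

    Simplified: linear interpolation inside the bucket that contains the
    target rank, with 0 as the lower bound of the first bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_value = 0.0, 0.0
    for le, value in buckets:
        if value >= rank:
            if le == math.inf:
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_le
            return prev_le + (le - prev_le) * (rank - prev_value) / (value - prev_value)
        prev_le, prev_value = le, value
    return prev_le


# Hypothetical bucket rates for one tool, scraped from two instances.
instance_a = {0.5: 80.0, 1.0: 95.0, 2.0: 100.0, math.inf: 100.0}
instance_b = {0.5: 20.0, 1.0: 60.0, 2.0: 100.0, math.inf: 100.0}

# One p95 per tool: aggregate the buckets first, then take the quantile.
p95 = histogram_quantile(0.95, sum_by_le([instance_a, instance_b]))
```

Summing the buckets first yields a single p95 for the tool across both instances. Calling `histogram_quantile` on each instance separately would instead give one value per instance.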
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clients/cerebro/README.md` around lines 118 - 127, The p95 PromQL in the
README's "Latency spike" row uses rate(...) on buckets without the required
histogram aggregation; update the expression used in the latency spike (the line
containing histogram_quantile(0.95,
rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) by tool) to match
the working example by wrapping rate(...) with sum by (le, tool) — i.e., use
histogram_quantile(0.95, sum by (le, tool)
(rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) ) so the bucket
aggregation is correct and the query is copy-paste runnable.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@clients/web/apps/docs/src/content/docs/cerebro/operations.md`:
- Around line 199-210: The table has two broken PromQL snippets: for "Elevated
MCP error rate" replace the status matcher
status=~"storage_error/internal_error" with the proper regex alternation
status=~"storage_error|internal_error" (matching the example block), and for
"Latency spike" use the required histogram aggregation like in the example block
by applying sum by (le, tool) to the per-bucket rate(...) results, e.g.
histogram_quantile(0.95, sum by (le,
tool)(rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m]))); update the
table entries for "Elevated MCP error rate" and "Latency spike" accordingly to
match the working expressions shown later.
- Around line 211-230: The server-side MCP error ratio PromQL example uses a
division across two sum(rate(...)) expressions which is correct but ambiguous;
update the example around the expression using cerebro_requests_total (the
numerator
sum(rate(cerebro_requests_total{status=~"storage_error|internal_error"}[5m]))
and the denominator sum(rate(cerebro_requests_total[5m]))) by wrapping the
entire numerator/denominator division in parentheses before applying > 0.02 so
it reads ( numerator / denominator ) > 0.02 for clarity.

In `@openspec/specs/client-surfaces/gateway-api.md`:
- Around line 39-50: The spec’s alert descriptions must match the corrected
PromQL in the docs: update any operator/implementation guidance so the
server-side error selector uses the alternation form
`storage_error|internal_error` (referencing the cerebro_requests_total outcome
selector) and ensure the p95 latency guidance explicitly calls out the histogram
aggregation pattern used in operations.md (e.g., sum by (le, tool) over
cerebro_tool_latency_seconds buckets to compute the 95th percentile for
successful tool calls); also confirm the other metric names in the text
(cerebro_readiness_failures_total, cerebro_auth_failures_total,
cerebro_storage_errors_total) and their threshold expressions map exactly to the
PromQL in operations.md/README.md so the spec and docs stay consistent.

---

Outside diff comments:
In `@clients/cerebro/README.md`:
- Around line 118-127: The p95 PromQL in the README's "Latency spike" row uses
rate(...) on buckets without the required histogram aggregation; update the
expression used in the latency spike (the line containing
histogram_quantile(0.95,
rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) by tool) to match
the working example by wrapping rate(...) with sum by (le, tool) — i.e., use
histogram_quantile(0.95, sum by (le, tool)
(rate(cerebro_tool_latency_seconds_bucket{status="ok"}[10m])) ) so the bucket
aggregation is correct and the query is copy-paste runnable.
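
The status-matcher fix requested above for operations.md follows from how PromQL regex matchers behave: they are fully anchored, so the pattern must match the entire label value. The Python sketch below mimics that with `re.fullmatch`. Prometheus uses RE2 rather than Python's `re`, but for these simple patterns the behavior is the same. The helper name and the `ok` status value are assumptions for illustration.

```python
import re


def label_matches(pattern: str, value: str) -> bool:
    """Mimic a PromQL `=~` matcher, which must match the whole label value."""
    return re.fullmatch(pattern, value) is not None


# Status values named in this PR; "ok" stands in for a successful request.
STATUSES = ["ok", "storage_error", "internal_error"]

BROKEN = "storage_error/internal_error"  # "/" is a literal character here
FIXED = "storage_error|internal_error"   # "|" is regex alternation

broken_hits = [s for s in STATUSES if label_matches(BROKEN, s)]
fixed_hits = [s for s in STATUSES if label_matches(FIXED, s)]

print(broken_hits)  # []
print(fixed_hits)   # ['storage_error', 'internal_error']
```

With the slash, the selector only matches a status whose literal value is `storage_error/internal_error`, so the error-rate numerator would stay empty and the alert would never fire.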


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6cfbe037-f3e9-4812-9144-f935902c11a9

📥 Commits

Reviewing files that changed from the base of the PR and between d3d8402 and bcdaf79.

📒 Files selected for processing (3)

clients/cerebro/README.md
clients/web/apps/docs/src/content/docs/cerebro/operations.md
openspec/specs/client-surfaces/gateway-api.md

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: pr-checks
GitHub Check: sonar
GitHub Check: semgrep-cloud-platform/scan
GitHub Check: Cloudflare Pages
GitHub Check: Analyze (python)
GitHub Check: Analyze (javascript-typescript)
GitHub Check: submit-gradle

🧰 Additional context used

📓 Path-based instructions (2)

**/*.{md,mdx}

⚙️ CodeRabbit configuration file

**/*.{md,mdx}: Verify technical accuracy and that docs stay aligned with code changes.
For user-facing docs, check EN/ES parity or explicitly note pending translation gaps.

Files:

clients/web/apps/docs/src/content/docs/cerebro/operations.md
openspec/specs/client-surfaces/gateway-api.md
clients/cerebro/README.md

**/*

⚙️ CodeRabbit configuration file

**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.

Files:

clients/web/apps/docs/src/content/docs/cerebro/operations.md
openspec/specs/client-surfaces/gateway-api.md
clients/cerebro/README.md

sonarqubecloud · 2026-05-01T22:05:09Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

docs(cerebro): add production alerting guidance

bcdaf79

github-actions Bot added the size/s label (denotes a small change size) on May 1, 2026

coderabbitai Bot added the area:docs and area:web labels on May 1, 2026

coderabbitai Bot reviewed May 1, 2026


Comment thread clients/web/apps/docs/src/content/docs/cerebro/operations.md

Comment thread clients/web/apps/docs/src/content/docs/cerebro/operations.md

Comment thread openspec/specs/client-surfaces/gateway-api.md

yacosta738 added 2 commits May 1, 2026 23:32

docs(cerebro): use supported alert query code fence

f9a871a

docs(cerebro): correct alert query examples

5b55915

yacosta738 merged commit 116a4b4 into main May 1, 2026
18 checks passed

yacosta738 deleted the docs/cerebro-alerting-guidance-701 branch May 1, 2026 22:19

coderabbitai Bot mentioned this pull request May 2, 2026

docs(cerebro): add deployment runbook #757

Merged

8 tasks


yacosta738 merged 3 commits into main from docs/cerebro-alerting-guidance-701
