Skip to content

feat(rook): add production observability metrics#713

Merged
yacosta738 merged 5 commits into
mainfrom
rook
Apr 28, 2026
Merged

feat(rook): add production observability metrics#713
yacosta738 merged 5 commits into
mainfrom
rook

Conversation

@yacosta738
Copy link
Copy Markdown
Contributor

Related Issues

closes #680


Summary

  • add bounded Prometheus metrics for Rook request volume, latency, upstream failures, rate-limit outcomes, and idempotency behavior
  • cover both /api/* and /v1/* while preserving /api/metrics as the single scrape endpoint
  • harden metric label normalization and expand verification coverage, including doctor-path clippy cleanup required for a clean verification pass

Tested Information

  • cargo fmt --all -- --check
  • cargo clippy --all-targets -- -D warnings
  • cargo test

Documentation Impact

  • Docs updated in:
  • openspec/changes/rook-680-production-observability-and-metrics/proposal.md
  • openspec/changes/rook-680-production-observability-and-metrics/specs/gateway/spec.md
  • openspec/changes/rook-680-production-observability-and-metrics/design.md
  • openspec/changes/rook-680-production-observability-and-metrics/tasks.md
  • openspec/changes/rook-680-production-observability-and-metrics/verify-report.md
  • No docs update required because:
  • I verified the documentation matches the current behavior.

Breaking Changes

  • None.

Checklist

  • I have checked that there isn’t already a PR solving the same problem.
  • I have read the Contributing Guidelines.
  • I ensured my code follows the project's style guidelines.
  • I have added or updated tests that prove my fix is effective or that my feature works.
  • I have updated the documentation, or I explained above why no documentation update is needed.
  • I verified the documentation matches the current behavior.
  • I have documented any breaking changes in the Breaking Changes section.
  • I have linked the related issue (if any).

Add bounded Prometheus metrics for Rook request volume, latency, upstream failures, rate-limit outcomes, and idempotency behavior across /api/* and /v1/*. Tighten label safety, preserve /api/metrics as the scrape endpoint, and add verification coverage for production observability paths.
@github-actions github-actions Bot added the size/xl Denotes an extra large change size label Apr 28, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 28, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (2)
  • wip
  • do-not-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 114de0fc-c42e-452f-be4b-165ca0cd8445

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds production-grade observability and metrics across Rook’s gateway: normalized metric labels, outcome-based counters for rate-limit/upstream/idempotency, routing logical_model exposure, DB test serialization and tempdir-backed helpers, dashboard override serialization, and comprehensive docs/specs/tasks/verification artifacts.

Changes

Cohort / File(s) Summary
Observability Registry & Normalizers
clients/rook/src/observability.rs
Replaces prior metric families with outcome-labeled counters, adds normalization helpers for surfaces/endpoints/status/vendor/account/model, renames metric handles, and updates tests to new names/labels.
Gateway Handlers & Upstream Metrics
clients/rook/src/gateway/handlers.rs
Introduces UpstreamMetricContext, records upstream failure-focused metrics with vendor/account/model labels, and removes vendor-only success counting.
Routing
clients/rook/src/routing/mod.rs
Adds pub logical_model: String to RoutingDecision and sets it during route resolution (including fallbacks).
Transport & Rate-Limit
clients/rook/src/transport/middleware.rs, clients/rook/src/transport/rate_limit.rs
Refactors to call shared normalizers for surface/endpoint/status; consolidates rate-limit metrics into outcome-labeled rook_rate_limit_outcomes_total.
Idempotency Middleware
clients/rook/src/idempotency/middleware.rs
Emits "unavailable" outcome when storage/reservation/serialization fails; moves "pass" emission to finalize-success path.
Dashboard Sync Guard
clients/rook/src/dashboard/mod.rs
AssetsReadyOverrideGuard now holds a process-wide MutexGuard to serialize override lifetime.
DB Test Helpers & Tests
clients/rook/src/doctor.rs, clients/rook/tests/doctor_operational_diagnostics.rs
Serializes DB-backed tests with an async mutex; helper now returns InitializedDbEnv owning a TempDir and env map; tests use per-test tempdir + fixed DB filename.
Server Tests / Metrics Coverage
clients/rook/src/server/mod.rs
Expands metrics tests to exercise gateway surfaces, updated metric names/labels, secret-safety checks in /api/metrics, adds seed_route_with_display_name, and adds idempotency failure simulation.
Docs / Spec / Design / Tasks / Verify
openspec/changes/rook-680-production-observability-and-metrics/*
Adds proposal, design, spec, tasks, state manifest, and verification report documenting metrics contract, normalization rules, and rollout/testing guidance.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant TransportMW as Transport<br/>Middleware
    participant RateLimitMW as Rate-Limit<br/>Middleware
    participant GatewayHandler as Gateway<br/>Handler
    participant RoutingEngine as Routing<br/>Engine
    participant UpstreamService as Upstream<br/>Service
    participant Observability as Observability<br/>Registry

    Client->>TransportMW: HTTP Request
    TransportMW->>RateLimitMW: Forward Request

    alt Allow
        RateLimitMW->>Observability: rate_limit_outcomes_total<br/>(surface, endpoint, outcome=allow)
        RateLimitMW->>GatewayHandler: Forward Request
    else Reject
        RateLimitMW->>Observability: rate_limit_outcomes_total<br/>(surface, endpoint, outcome=reject)
        RateLimitMW->>Client: 429 Response
    end

    GatewayHandler->>RoutingEngine: Resolve Route
    alt Resolved
        RoutingEngine-->>GatewayHandler: RoutingDecision (includes logical_model)
        GatewayHandler->>UpstreamService: Forward to Provider
        alt Upstream Success
            UpstreamService-->>GatewayHandler: Success
            GatewayHandler->>TransportMW: Response
        else Upstream Failure
            UpstreamService-->>GatewayHandler: Error
            GatewayHandler->>Observability: upstream_failures_total<br/>(vendor, account, model, outcome)
            GatewayHandler->>TransportMW: Error Response
        end
    else Route Rejected
        RoutingEngine-->>GatewayHandler: Route Rejected
        GatewayHandler->>Observability: upstream_failures_total<br/>(vendor=unrouted, outcome=route_rejected)
        GatewayHandler->>TransportMW: Error Response
    end

    TransportMW->>Observability: request_total / duration (surface, endpoint, status_class)
    TransportMW->>Client: Final Response
Loading
sequenceDiagram
    participant Client
    participant GatewayHandler as Gateway Handler<br/>(POST /v1/chat/completions)
    participant IdempotencyMW as Idempotency<br/>Middleware
    participant Storage as Storage<br/>Backend
    participant Observability as Observability<br/>Registry
    participant UpstreamService as Upstream<br/>Service

    Client->>GatewayHandler: POST with Idempotency-Key
    GatewayHandler->>IdempotencyMW: Check Idempotency Key

    alt Not Cached
        IdempotencyMW->>Storage: Reserve Entry
        alt Reservation Success
            Storage-->>IdempotencyMW: Reserved
            IdempotencyMW->>UpstreamService: Forward Request
            UpstreamService-->>IdempotencyMW: Response
            IdempotencyMW->>Storage: Store Response
            alt Store Success
                Storage-->>IdempotencyMW: Stored
                IdempotencyMW->>Observability: idempotency_outcomes_total(outcome=pass)
                IdempotencyMW->>Client: Response
            else Store Failure
                Storage-->>IdempotencyMW: Error
                IdempotencyMW->>Observability: idempotency_outcomes_total(outcome=unavailable)
                IdempotencyMW->>Client: Response + Error
            end
        else Reservation Failure
            Storage-->>IdempotencyMW: Error
            IdempotencyMW->>Observability: idempotency_outcomes_total(outcome=unavailable)
            IdempotencyMW->>Client: Error Response
        end
    else Cached
        IdempotencyMW->>Storage: Fetch Cached Response
        Storage-->>IdempotencyMW: Cached Response
        IdempotencyMW->>Observability: idempotency_outcomes_total(outcome=replay)
        IdempotencyMW->>Client: Cached Response
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

area:rust, area:docs, risk:high, risk:security

Suggested reviewers

  • yuniel-acosta
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.99% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title follows Conventional Commit style with 'feat' prefix and clearly describes the main change: adding production observability metrics to Rook.
Description check ✅ Passed The description includes all required sections: Related Issues, Summary, Tested Information, Documentation Impact, Breaking Changes, and a completed Checklist matching the template structure.
Linked Issues check ✅ Passed The PR implementation comprehensively addresses all coding requirements from issue #680: request/error metrics for /api/* and /v1/*, upstream failure metrics with vendor/account/model labels, rate-limit outcomes, idempotency behavior tracking, and extensive test coverage.
Out of Scope Changes check ✅ Passed All changes directly support production observability objectives: metric instrumentation across gateway handlers, idempotency middleware, rate limiting, transport, and normalization helpers; test enhancements; and specification/design documentation. No extraneous changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch rook

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 28, 2026

✅ Contributor Report

User: @yacosta738
Status: Passed (12/13 metrics passed)

Metric Description Value Threshold Status
PR Merge Rate PRs merged vs closed 92% >= 30%
Repo Quality Repos with ≥100 stars 0 >= 0
Positive Reactions Positive reactions received 9 >= 1
Negative Reactions Negative reactions received 0 <= 5
Account Age GitHub account age 3104 days >= 30 days
Activity Consistency Regular activity over time 108% >= 0%
Issue Engagement Issues with community engagement 0 >= 0
Code Reviews Code reviews given to others 604 >= 0
Merger Diversity Unique maintainers who merged PRs 2 >= 0
Repo History Merge Rate Merge rate in this repo 93% >= 0%
Repo History Min PRs Previous PRs in this repo 293 >= 0
Profile Completeness Profile richness (bio, followers) 90 >= 0
Suspicious Patterns Spam-like activity detection 1 N/A

Contributor Report evaluates based on public GitHub activity. Analysis period: 2025-04-28 to 2026-04-28

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
clients/rook/src/idempotency/middleware.rs (1)

175-183: ⚠️ Potential issue | 🟠 Major

Keep idempotency outcomes mutually exclusive.

ReservedNew still records "pass" before finalize_response() persists the response. With the new "unavailable" increments below, any store/body-read failure now emits both "pass" and "unavailable" for the same request, which makes the counter inaccurate.

Suggested fix
         ReserveResult::ReservedNew => {
-            state
-                .observability
-                .idempotency_outcomes_total()
-                .inc(IDEMPOTENCY_SURFACE, "pass");
             parts.extensions.insert(raw_body.clone());
             let request = Request::from_parts(parts, Body::from(raw_body.clone()));
             let response = next.run(request).await;
             finalize_response(state, scope, request_hash, response).await
         }
@@
     if state
         .service
         .inner()
         .complete_chat_completion(
@@
         return gateway_idempotency_error_response(
             StatusCode::SERVICE_UNAVAILABLE,
             "idempotency storage unavailable",
             "idempotency_unavailable",
         );
     }
 
+    state
+        .observability
+        .idempotency_outcomes_total()
+        .inc(IDEMPOTENCY_SURFACE, "pass");
     Response::from_parts(parts, Body::from(body_bytes))
 }

Also applies to: 195-244

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clients/rook/src/idempotency/middleware.rs` around lines 175 - 183,
ReserveResult::ReservedNew currently increments
observability.idempotency_outcomes_total("pass") before calling
finalize_response, which can lead to duplicate/inaccurate labels when
finalize_response later fails and increments "unavailable"; move or remove the
pre-finalize "pass" increment so that outcomes remain mutually exclusive and
only finalized results increment the counter. Specifically, in the
ReserveResult::ReservedNew branch (and the similar sections around lines
195-244), avoid calling state.observability.idempotency_outcomes_total().inc(...
"pass") before calling finalize_response(request_hash, ...); instead let
finalize_response (or its success path) be responsible for incrementing "pass"
and ensure all failure paths (store/body-read errors) increment "unavailable"
(or the appropriate label) so each request produces exactly one outcome
increment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@clients/rook/src/observability.rs`:
- Around line 206-215: The secret-detection function looks_secret_like currently
checks lowercase.contains("api_key") and contains("api-key") but misses the
hyphenless form; update looks_secret_like to also check for
lowercase.contains("apikey") (and any other common concatenated variants you
deem necessary) so the function will catch tokens like "apikey" in addition to
the existing patterns (bearer, sk-, token, api_key, api-key).

In `@clients/rook/src/routing/mod.rs`:
- Around line 213-217: Add a regression test that asserts the fallback
logical_model is set on RoutingDecision when the upstream-failure path is taken:
simulate the upstream-failure route (the same test harness that currently
asserts account/pool/route) and add an assertion that decision.logical_model
equals the expected fallback value; target the code paths constructing
RoutingDecision (check where RoutingDecision is returned and where
decision.logical_model is read) so the test will fail if future changes silently
relabel metrics for the fallback case.

In `@openspec/changes/rook-680-production-observability-and-metrics/state.yaml`:
- Around line 4-19: The top-level status is "completed" but the
"phases.explore.status" is still "pending", creating an inconsistent state;
update the "explore" phase under "phases" so its status value is "completed"
(i.e., change phases.explore.status from pending to completed) to make all phase
statuses consistent with the overall status.

---

Outside diff comments:
In `@clients/rook/src/idempotency/middleware.rs`:
- Around line 175-183: ReserveResult::ReservedNew currently increments
observability.idempotency_outcomes_total("pass") before calling
finalize_response, which can lead to duplicate/inaccurate labels when
finalize_response later fails and increments "unavailable"; move or remove the
pre-finalize "pass" increment so that outcomes remain mutually exclusive and
only finalized results increment the counter. Specifically, in the
ReserveResult::ReservedNew branch (and the similar sections around lines
195-244), avoid calling state.observability.idempotency_outcomes_total().inc(...
"pass") before calling finalize_response(request_hash, ...); instead let
finalize_response (or its success path) be responsible for incrementing "pass"
and ensure all failure paths (store/body-read errors) increment "unavailable"
(or the appropriate label) so each request produces exactly one outcome
increment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 787801d1-87f1-4fe3-bd62-33492b51d76a

📥 Commits

Reviewing files that changed from the base of the PR and between 2da2bcc and 0583815.

📒 Files selected for processing (16)
  • clients/rook/src/dashboard/mod.rs
  • clients/rook/src/doctor.rs
  • clients/rook/src/gateway/handlers.rs
  • clients/rook/src/idempotency/middleware.rs
  • clients/rook/src/observability.rs
  • clients/rook/src/routing/mod.rs
  • clients/rook/src/server/mod.rs
  • clients/rook/src/transport/middleware.rs
  • clients/rook/src/transport/rate_limit.rs
  • clients/rook/tests/doctor_operational_diagnostics.rs
  • openspec/changes/rook-680-production-observability-and-metrics/design.md
  • openspec/changes/rook-680-production-observability-and-metrics/proposal.md
  • openspec/changes/rook-680-production-observability-and-metrics/specs/gateway/spec.md
  • openspec/changes/rook-680-production-observability-and-metrics/state.yaml
  • openspec/changes/rook-680-production-observability-and-metrics/tasks.md
  • openspec/changes/rook-680-production-observability-and-metrics/verify-report.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: report / Contributor Quality Report
  • GitHub Check: pr-checks
  • GitHub Check: sonar
  • GitHub Check: submit-gradle
  • GitHub Check: Cloudflare Pages
🧰 Additional context used
📓 Path-based instructions (3)
**/*

⚙️ CodeRabbit configuration file

**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.

Files:

  • openspec/changes/rook-680-production-observability-and-metrics/state.yaml
  • clients/rook/src/routing/mod.rs
  • openspec/changes/rook-680-production-observability-and-metrics/tasks.md
  • openspec/changes/rook-680-production-observability-and-metrics/proposal.md
  • openspec/changes/rook-680-production-observability-and-metrics/design.md
  • openspec/changes/rook-680-production-observability-and-metrics/specs/gateway/spec.md
  • openspec/changes/rook-680-production-observability-and-metrics/verify-report.md
  • clients/rook/src/transport/rate_limit.rs
  • clients/rook/src/dashboard/mod.rs
  • clients/rook/src/transport/middleware.rs
  • clients/rook/src/idempotency/middleware.rs
  • clients/rook/tests/doctor_operational_diagnostics.rs
  • clients/rook/src/doctor.rs
  • clients/rook/src/gateway/handlers.rs
  • clients/rook/src/server/mod.rs
  • clients/rook/src/observability.rs
**/*.rs

⚙️ CodeRabbit configuration file

**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.

Files:

  • clients/rook/src/routing/mod.rs
  • clients/rook/src/transport/rate_limit.rs
  • clients/rook/src/dashboard/mod.rs
  • clients/rook/src/transport/middleware.rs
  • clients/rook/src/idempotency/middleware.rs
  • clients/rook/tests/doctor_operational_diagnostics.rs
  • clients/rook/src/doctor.rs
  • clients/rook/src/gateway/handlers.rs
  • clients/rook/src/server/mod.rs
  • clients/rook/src/observability.rs
**/*.{md,mdx}

⚙️ CodeRabbit configuration file

**/*.{md,mdx}: Verify technical accuracy and that docs stay aligned with code changes.
For user-facing docs, check EN/ES parity or explicitly note pending translation gaps.

Files:

  • openspec/changes/rook-680-production-observability-and-metrics/tasks.md
  • openspec/changes/rook-680-production-observability-and-metrics/proposal.md
  • openspec/changes/rook-680-production-observability-and-metrics/design.md
  • openspec/changes/rook-680-production-observability-and-metrics/specs/gateway/spec.md
  • openspec/changes/rook-680-production-observability-and-metrics/verify-report.md
🧠 Learnings (6)
📓 Common learnings
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Include threat/risk notes and rollback strategy for security, runtime, and gateway changes; add or update tests for boundary checks and failure modes
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Include threat/risk notes and rollback strategy for security, runtime, and gateway changes; add or update tests for boundary checks and failure modes

Applied to files:

  • openspec/changes/rook-680-production-observability-and-metrics/specs/gateway/spec.md
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/**/*.rs : Run `cargo fmt --all -- --check`, `cargo clippy --all-targets -- -D warnings`, and `cargo test` for code validation, or document which checks were skipped and why

Applied to files:

  • openspec/changes/rook-680-production-observability-and-metrics/verify-report.md
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/{security,gateway,tools,config}/**/*.rs : Do not silently weaken security policy or access constraints; keep default behavior secure-by-default with deny-by-default where applicable

Applied to files:

  • clients/rook/src/dashboard/mod.rs
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/main.rs : Preserve CLI contract unless change is intentional and documented; prefer explicit errors over silent fallback for unsupported critical paths

Applied to files:

  • clients/rook/src/idempotency/middleware.rs
  • clients/rook/src/gateway/handlers.rs
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/channels/**/*.rs : Implement `Channel` trait in `src/channels/` with consistent `send`, `listen`, and `health_check` semantics and cover auth/allowlist/health behavior with tests

Applied to files:

  • clients/rook/src/server/mod.rs
🪛 LanguageTool
openspec/changes/rook-680-production-observability-and-metrics/proposal.md

[style] ~15-~15: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ather than only individual log lines. - Expose upstream failure metrics for routed pro...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~16-~16: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e already available and safe to emit. - Expose counters for rate-limit outcomes and id...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

openspec/changes/rook-680-production-observability-and-metrics/design.md

[style] ~253-~253: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ams or request bodies. - status_class MUST stay coarse (2xx, 4xx, etc.). ### ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.22.1)
openspec/changes/rook-680-production-observability-and-metrics/proposal.md

[warning] 11-11: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 20-20: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

openspec/changes/rook-680-production-observability-and-metrics/design.md

[warning] 1-1: First line in a file should be a top-level heading

(MD041, first-line-heading, first-line-h1)


[warning] 8-8: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 17-17: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 32-32: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 46-46: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 60-60: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 71-71: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 81-81: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 98-98: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 112-112: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 124-124: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🔇 Additional comments (20)
clients/rook/src/dashboard/mod.rs (3)

30-33: LGTM — serial mutex for test isolation.

The dual-mutex approach cleanly separates concerns: ASSETS_READY_OVERRIDE protects the value, ASSETS_READY_OVERRIDE_SERIAL serializes test execution. Pattern aligns with DOCTOR_TEST_SERIAL in doctor.rs.


42-47: Poison recovery is appropriate here.

For test-only infrastructure, recovering from poisoned mutexes via into_inner() prevents cascading test failures. This matches the existing pattern at line 39.


58-69: Guard design ensures test serialization.

The _serial_guard is held for the struct's lifetime, ensuring exclusive access during the test. Lock ordering is safe: serial acquired first in new(), override lock is only held transiently (lines 65, 74). Drop order correctly clears the override before releasing serialization.

openspec/changes/rook-680-production-observability-and-metrics/proposal.md (1)

1-77: LGTM — Proposal is well-structured with appropriate risk mitigations.

The proposal correctly scopes bounded metrics, addresses label cardinality risks, and includes a rollback strategy for the additive observability changes. The explicit exclusion of secrets/PII from metric labels aligns with the coding guidelines' security-first principle.

clients/rook/src/doctor.rs (2)

316-347: LGTM — Clean test serialization pattern.

The InitializedDbEnv wrapper correctly owns the TempDir to ensure the database file remains valid throughout test execution. The OnceLock<Mutex<()>> pattern for lazy-init serialization is idiomatic and appropriate for preventing test interference.


349-366: LGTM — Tests correctly acquire serialization guard.

The tests now properly await the serial guard before accessing shared state, preventing race conditions in DB-backed doctor tests.

clients/rook/src/server/mod.rs (5)

508-569: LGTM — Clean test helper extraction.

The seed_route_with_display_name helper provides configurable account naming for metrics label assertions without duplicating seeding logic.


724-764: LGTM — Comprehensive endpoint label coverage with secret leak prevention.

Good coverage of both /v1/models and /v1/chat/completions metrics with stable surface labels. The assertions at lines 762-763 explicitly verify that request body content and model names don't leak into metric labels.


800-811: LGTM — Rate-limit outcome metrics properly tested.

Assertions correctly validate the new outcome-based metrics (allow/reject) for both admin_api and gateway_v1 surfaces.


1515-1673: LGTM — Thorough upstream failure metrics coverage.

Excellent coverage of failure outcomes (http_error, account_misconfigured, network_error, route_rejected) with explicit assertions that secret-like account names ("Bearer sk-secret") are normalized to unlabeled. The test at line 1669-1670 properly verifies safe bounded labels.


2148-2176: LGTM — Idempotency unavailable outcome now observable.

The test correctly validates that storage failures emit the unavailable outcome metric, closing the observability gap for idempotency service failures.

openspec/changes/rook-680-production-observability-and-metrics/design.md (1)

1-351: LGTM — Comprehensive design with clear safety boundaries.

The design correctly documents the bounded label contract, instrumentation placement decisions, and explicitly addresses secret-safe normalization for account and model labels (lines 82-97). The Rust struct examples match the actual implementation in observability.rs.

openspec/changes/rook-680-production-observability-and-metrics/verify-report.md (1)

1-128: LGTM — Verification report is thorough.

The report documents all required validation commands (cargo fmt, cargo clippy, cargo test) and provides specific test evidence for each metrics category. The evidence references match actual test names in the codebase.

openspec/changes/rook-680-production-observability-and-metrics/specs/gateway/spec.md (1)

1-192: LGTM — Spec defines clear, testable requirements with strong safety boundaries.

The requirement at lines 161-175 explicitly forbids secret-bearing values and high-cardinality identifiers as metric labels. The scenarios provide testable criteria that align with the test coverage in server/mod.rs.

clients/rook/src/observability.rs (6)

1-26: LGTM — Clean metric family renames and new normalization infrastructure.

The transition from rejection-only to outcome-based rate-limit metrics and the addition of bounded upstream failure metrics align with the design spec.


121-145: LGTM — Surface normalization helpers are well-structured.

The normalization functions correctly map internal RouteSurface and RateLimitedSurface enums to stable, bounded string labels for metrics.


147-185: LGTM — Endpoint normalization with safe fallbacks.

The normalize_surface_endpoint function correctly:

  • Returns UNMATCHED_ENDPOINT for missing or malformed paths
  • Prefixes surface-relative paths with /api or /v1 as appropriate
  • Preserves already-prefixed paths unchanged

217-263: LGTM — Bounded label normalization with length and character safety.

The normalize_bounded_label function correctly:

  • Trims and rejects empty/whitespace-only input
  • Rejects secret-like values early
  • Normalizes characters to a safe alphanumeric subset
  • Enforces MAX_LABEL_LEN to prevent cardinality explosion
  • Removes leading/trailing separators

The early return on length overflow (line 253) is a good defensive measure against unbounded input.


302-354: LGTM — Handle implementations match new label contracts.

The RateLimitOutcomesHandle and UpstreamFailuresHandle correctly accept the new label dimensions (outcome for rate-limit, vendor/account/model/outcome for upstream failures).


513-562: LGTM — Comprehensive normalization tests with secret-safety validation.

Tests cover:

  • Surface/endpoint normalization
  • Status class bucketing
  • Secret-like value detection (Bearer sk-secret, sk-secret-value)
  • Fallback to bounded defaults for invalid input
  • Length overflow handling

Comment thread clients/rook/src/observability.rs
Comment thread clients/rook/src/routing/mod.rs
Tighten secret-like label detection, lock fallback logical-model coverage, and keep idempotency outcome metrics mutually exclusive on finalize failure. Also align the OpenSpec phase state with the completed change status.
@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented Apr 28, 2026

Deploying corvus with  Cloudflare Pages  Cloudflare Pages

Latest commit: 6455235
Status: ✅  Deploy successful!
Preview URL: https://e1e5c6cb.corvus-42x.pages.dev
Branch Preview URL: https://rook.corvus-42x.pages.dev

View logs

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@clients/rook/src/server/mod.rs`:
- Around line 2170-2178: Add a new test case that exercises a finalize-time
idempotency failure: simulate a successful reserve phase but make
complete_chat_completion return an error, then call request_text(app,
"/api/metrics") and assert StatusCode::OK and that the metrics body contains the
"rook_idempotency_outcomes_total{surface=\"gateway_chat_completions\",outcome=\"unavailable\"}
1" entry while still not containing the "…outcome=\"pass\"} 1" entry; to
implement this, reuse the existing test scaffold around request_text and the
metrics assertions, arrange the test double or mock so reserve succeeds but
complete_chat_completion (or the function you call to finish completions) fails,
and then verify the two metric strings in the response body.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f4a4ed60-f542-4589-b7e9-1fa6efd93b76

📥 Commits

Reviewing files that changed from the base of the PR and between 0583815 and cfdeff2.

📒 Files selected for processing (5)
  • clients/rook/src/idempotency/middleware.rs
  • clients/rook/src/observability.rs
  • clients/rook/src/routing/mod.rs
  • clients/rook/src/server/mod.rs
  • openspec/changes/rook-680-production-observability-and-metrics/state.yaml
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pr-checks
  • GitHub Check: sonar
  • GitHub Check: submit-gradle
  • GitHub Check: Cloudflare Pages
🧰 Additional context used
📓 Path-based instructions (2)
**/*

⚙️ CodeRabbit configuration file

**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.

Files:

  • openspec/changes/rook-680-production-observability-and-metrics/state.yaml
  • clients/rook/src/routing/mod.rs
  • clients/rook/src/idempotency/middleware.rs
  • clients/rook/src/observability.rs
  • clients/rook/src/server/mod.rs
**/*.rs

⚙️ CodeRabbit configuration file

**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.

Files:

  • clients/rook/src/routing/mod.rs
  • clients/rook/src/idempotency/middleware.rs
  • clients/rook/src/observability.rs
  • clients/rook/src/server/mod.rs
🧠 Learnings (6)
📓 Common learnings
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Include threat/risk notes and rollback strategy for security, runtime, and gateway changes; add or update tests for boundary checks and failure modes
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Include threat/risk notes and rollback strategy for security, runtime, and gateway changes; add or update tests for boundary checks and failure modes

Applied to files:

  • clients/rook/src/routing/mod.rs
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/main.rs : Preserve CLI contract unless change is intentional and documented; prefer explicit errors over silent fallback for unsupported critical paths

Applied to files:

  • clients/rook/src/routing/mod.rs
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/channels/**/*.rs : Implement `Channel` trait in `src/channels/` with consistent `send`, `listen`, and `health_check` semantics and cover auth/allowlist/health behavior with tests

Applied to files:

  • clients/rook/src/server/mod.rs
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/{security,gateway,tools,config}/**/*.rs : Do not silently weaken security policy or access constraints; keep default behavior secure-by-default with deny-by-default where applicable

Applied to files:

  • clients/rook/src/server/mod.rs
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/{security,gateway,tools}/**/*.rs : Treat `src/security/`, `src/gateway/`, `src/tools/` as high-risk surfaces and never broaden filesystem/network execution scope without explicit policy checks

Applied to files:

  • clients/rook/src/server/mod.rs
🔇 Additional comments (5)
openspec/changes/rook-680-production-observability-and-metrics/state.yaml (1)

4-19: State lifecycle is now internally consistent.

status and all phase statuses are aligned to completed.

clients/rook/src/routing/mod.rs (1)

66-67: logical_model propagation and fallback regression coverage look correct.

The decision now carries route model context, and the fallback assertion locks expected labeling behavior.

Also applies to: 213-216, 931-934

clients/rook/src/idempotency/middleware.rs (1)

133-137: Idempotency outcome metrics are now correctly failure-aware and mutually exclusive.

"pass" is emitted only after successful completion, while reserve/finalize failures emit "unavailable".

Also applies to: 194-198, 229-243

clients/rook/src/observability.rs (1)

49-67: Metrics contract migration and secret-label hardening look solid.

The renamed families/label schema and normalization tests are consistent, including the added apikey secret-like detection path.

Also applies to: 102-118, 206-216, 443-567

clients/rook/src/server/mod.rs (1)

1515-1523: Great coverage for upstream failure-label hardening and secret leakage prevention.

The added scenarios for misconfigured/network/rejected routes plus secret-like account-name redaction materially strengthen the observability contract tests.

Also applies to: 1655-1673

Comment thread clients/rook/src/server/mod.rs
Add a regression test for the reserved-then-complete-fails path so metrics continue to emit only the unavailable outcome and never a pass counter when finalize-time idempotency storage fails.
@sonarqubecloud
Copy link
Copy Markdown

@yacosta738 yacosta738 merged commit ef845e1 into main Apr 28, 2026
16 of 19 checks passed
@yacosta738 yacosta738 deleted the rook branch April 28, 2026 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add production observability and metrics for Rook

1 participant