feat(rook): add production readiness operations by yacosta738 · Pull Request #768 · dallay/corvus

yacosta738 · 2026-05-04T12:51:15Z

Related Issues

Related to #677

Summary

Adds production-readiness operator surfaces for Rook: machine-readable rook doctor --json, /api/status, provider health metrics, retry outcome metrics, and a local-first rook usage report CLI.
Wires upstream resilience policy through config/env/CLI into server startup, with doctor visibility and production posture advisories.
Expands Rook operations documentation with deployment triage guidance, retry exhaustion alerting, provider health metrics, and usage report runbook notes.

Tested Information

cargo fmt --manifest-path clients/rook/Cargo.toml
cargo test --manifest-path clients/rook/Cargo.toml
- 391 lib tests passed
- 23 main tests passed
- 2 distribution contract tests passed
- 4 doctor operational diagnostics tests passed
- 1 doc-test passed
cargo check --manifest-path clients/rook/Cargo.toml
Pre-commit and pre-push doc link checks passed.

Documentation Impact

Docs updated in:
- clients/rook/README.md
I verified the documentation matches the current behavior.

Breaking Changes

None.

Checklist

I have checked that there isn’t already a PR solving the same problem.
I have read the Contributing Guidelines.
I ensured my code follows the project's style guidelines.
I have added or updated tests that prove my fix is effective or that my feature works.
I have updated the documentation, or I explained above why no documentation update is needed.
I verified the documentation matches the current behavior.
I have documented any breaking changes in the Breaking Changes section.
I have linked the related issue (if any).

coderabbitai · 2026-05-04T12:51:29Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added /api/status operator status endpoint and expanded /api/metrics with provider health and cooldown gauges
- Introduced upstream resilience settings (buffered retries, cooldown, backoff, concurrency) and CLI flags
- Added structured diagnostics (--json) and a usage report CLI (JSON output option)
- New retry outcome metrics for upstream retry decisions
Documentation
- Expanded operational runbook, observability guidance, probes, and troubleshooting (retry exhaustion, provider cooldowns, empty usage reports)

Walkthrough

Adds a configurable upstream resilience feature (buffered retries, cooldown, backoff, concurrency) plus observability, operator status endpoint, CLI usage/reporting and doctor JSON output; wires config → server → gateway retry behavior, emits Prometheus metrics for retry outcomes and per-account health/cooldown, and documents operational runbook changes.

Changes

Upstream resilience, observability, admin & CLI

Layer / File(s)	Summary
Config Shape & Validation `clients/rook/src/config/mod.rs`	Adds `UpstreamResilienceConfig`, `PartialUpstreamResilienceConfig`, validation, env overlay parsing (`ROOK_UPSTREAM_RESILIENCE_*`), conversion to `gateway::UpstreamResiliencePolicy`, and export view.
Server wiring `clients/rook/src/server/mod.rs`	`ServerConfig` gains `upstream_resilience`; app builder uses configured policy when creating `UpstreamConcurrency`/`GatewayState`.
Observability metric registration `clients/rook/src/observability.rs`	Registers `rook_upstream_retry_outcomes` counter family, adds `UpstreamRetryLabels`/`UpstreamRetryOutcomesHandle` and accessor to record labeled retry outcomes.
Gateway retry instrumentation `clients/rook/src/gateway/handlers.rs`	Buffered retry loop caches last error/context, records `rook_upstream_retry_outcomes_total` with outcomes (`retry_scheduled`, `retry_exhausted`, `not_retryable`) at decision points.
Admin API types & endpoint `clients/rook/src/admin/types.rs`, `clients/rook/src/admin/handlers.rs`, `clients/rook/src/admin/mod.rs`	Adds `OperatorStatusView`/`OperatorRuntimeView`, `handle_operator_status` wired to `GET /status`; `handle_get_metrics` now appends per-account `rook_provider_account_health` and `rook_provider_account_cooldown_active` gauge lines. Tests added for `/status`.
CLI & Doctor `clients/rook/src/main.rs`, `clients/rook/src/doctor.rs`	`serve` gains upstream resilience CLI flags; `ServeOverrides` overlays support resilience; new `usage report` command (`--period/--limit/--format json`); `doctor --json` and `render_json_report`; advisory checks for production posture and aggressive resilience settings.
Documentation & Runbook `clients/rook/README.md`	Documents `[upstream_resilience]` example/defaults, env overrides, tuning notes (no streaming retries after bytes committed), new `/api/status` and `/api/metrics` endpoints, usage-report CLI, and updated runbook/troubleshooting steps.
Tests / Examples `clients/rook/src/{admin,config,doctor,gateway,main,observability,server}/*.rs`, `clients/rook/README.md`	Adds/updates unit and integration tests covering config env overlays, validation, metrics presence and samples, retry-outcome metric cases, `/status` behaviors, usage report DB-backed rendering, and JSON doctor output.

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Gateway as Gateway (Buffered Proxy)
  participant Provider as Upstream Provider
  participant Registry as Registry/HealthService
  participant Observ as Observability

  Client->>Gateway: POST /v1/chat/completions (request)
  Gateway->>Provider: Forward attempt (account A)
  alt Provider responds error (retryable)
    Provider-->>Gateway: Error
    Gateway->>Observ: inc(upstream_retry_outcomes_total{outcome="retry_scheduled", vendor..., account..., model...})
    Gateway->>Registry: record tentative account health/cooldown?
    Gateway->>Gateway: schedule retry (backoff)
    Gateway->>Provider: Forward retry attempt (account B or A)
    alt Exhausted attempts
      Gateway->>Observ: inc(...{outcome="retry_exhausted"})
      Gateway-->>Client: Fail response
    else Non-retryable error
      Gateway->>Observ: inc(...{outcome="not_retryable"})
      Gateway-->>Client: Fail response
    end
  else Provider responds success
    Provider-->>Gateway: Stream/response
    Gateway-->>Client: Stream/response (no further retries)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

feat(rook): add admin API for operator management #629: Related admin API additions and router/handler wiring that this PR extends with operator status and metrics rendering.
feat(rook): add production observability metrics #713: Overlaps in observability and gateway-level metric instrumentation for upstream outcomes.
feat: Persist Rook provider account health state #759: Related provider account health/cooldown persistence and health-state wiring that this PR surfaces via metrics and /status.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 48.82% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title follows Conventional Commit style with 'feat' prefix, includes a clear scope 'rook', and is concise (47 characters), accurately reflecting the PR's focus on production readiness operations.
Description check	✅ Passed	The PR description covers all required template sections: Related Issues (`#677`), Summary, Tested Information (with detailed test results), Documentation Impact (README.md updated), Breaking Changes (none), and a complete checklist with all items checked.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/rook-production-readiness-677

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Review rate limit: 0/1 reviews remaining, refill in 60 minutes.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@clients/rook/README.md`:
- Around line 331-332: Update the README description for
rook_provider_account_health to remove the “one-hot” claim and instead state
that the status label (values: healthy, degraded, unhealthy, unknown) indicates
persisted provider health but is not guaranteed to be unique per vendor/account
in a single scrape (multiple status values may appear, as asserted in server
logic such as the checks around rook_provider_account_health in the server
module). Adjust the wording to avoid implying exclusivity and make it clear that
consumers should not assume only one status label value per vendor/account
during a scrape.

In `@clients/rook/src/admin/handlers.rs`:
- Around line 389-395: The metric currently uses account.display_name via
normalize_account_label to populate the account label (see account.display_name,
normalize_account_label, account_label, and the rook_provider_account_health
string in body.push_str), which can leak tenant identifiers; change this to use
an opaque identifier instead (prefer account.id if available) or replace the
display name with a stable irreversible hash of the ID (or ID+salt) before
passing to normalize_account_label/provider_health_status_label so the metric
exposes no raw customer/displayable identifiers. Ensure you update the label
value construction and any tests to expect the opaque value rather than
display_name.

In `@clients/rook/src/config/mod.rs`:
- Around line 170-183: The validate method currently checks
max_buffered_attempts and max_concurrent_upstream_requests but misses validating
failure_cooldown_seconds and retry_backoff_milliseconds; update the
validate(&self) -> Result<(), RookError> function to also return
Err(RookError::Config(...)) when self.failure_cooldown_seconds == 0 or
self.retry_backoff_milliseconds == 0, using the same error pattern and
descriptive messages (e.g., "upstream_resilience.failure_cooldown_seconds must
be at least 1" and "upstream_resilience.retry_backoff_milliseconds must be at
least 1") so that zero values are rejected.

In `@clients/rook/src/doctor.rs`:
- Around line 232-235: The current is_externally_reachable_bind only treats
wildcards as external and misses concrete non-loopback IPs/hostnames; update
is_externally_reachable_bind to parse the host (trim brackets) and treat it as
external unless it is an explicit loopback (e.g., "127.0.0.1", "::1", or
IpAddr::is_loopback()), or an unspecified address should be considered external
(or explicitly treat "0.0.0.0"/"::" and "*" as external). Concretely: in
is_externally_reachable_bind, try parsing host into std::net::IpAddr and return
true if parsing succeeds and !ip.is_loopback(); if parsing fails (hostname
case), treat anything other than "localhost" as external; keep existing wildcard
checks for "*", "0.0.0.0", "::". This ensures non-loopback IPs/hostnames like
"203.0.113.10" trigger the advisory.

In `@clients/rook/src/gateway/handlers.rs`:
- Around line 336-346: When the route resolution path (resolve()) fails and the
loop exits, terminal retry outcomes aren't recorded because last_error is
drained and returns early; update the error-handling after the resolve() failure
path to check last_error and call record_upstream_retry_outcome(state,
&metric_context, ...) using the same logic as the retry branch (choose
"retry_exhausted" if retryable else "not_retryable") before returning the
upstream error. Refer to last_error, resolve(), metric_context,
record_upstream_retry_outcome, and state to locate where to insert this check
and ensure the terminal outcome is emitted for the resolve-failure path.

In `@clients/rook/src/server/mod.rs`:
- Around line 667-700: The test currently triggers a missing-model 503 and never
exercises the upstream resilience policy; change it to register a model/upstream
that resolves to a routed upstream which fails so the resilience logic runs:
using RookRegistry::open_in_memory(), add a fake model/route that points to an
upstream handler which returns 500 (or sleeps/fails) and then call
build_app_with_registry(config, registry) with ServerConfig.upstream_resilience
set as in the diff; send requests to /v1/chat/completions using that registered
model so retries/concurrency/cooldown are exercised, and assert a
policy-dependent observable (e.g., number of attempts, that a second concurrent
request is rejected due to max_concurrent_upstream_requests, or that a
subsequent request within failure_cooldown is short-circuited) rather than
relying on missing-model 503.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 97fe643e-cd74-48b7-8c84-de016d352137

📥 Commits

Reviewing files that changed from the base of the PR and between f52d91b and 4ea27bd.

📒 Files selected for processing (11)

clients/rook/README.md
clients/rook/src/admin/handlers.rs
clients/rook/src/admin/mod.rs
clients/rook/src/admin/types.rs
clients/rook/src/config/mod.rs
clients/rook/src/doctor.rs
clients/rook/src/gateway/handlers.rs
clients/rook/src/gateway/mod.rs
clients/rook/src/main.rs
clients/rook/src/observability.rs
clients/rook/src/server/mod.rs

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)

GitHub Check: sonar
GitHub Check: pr-checks
GitHub Check: semgrep-cloud-platform/scan
GitHub Check: submit-gradle
GitHub Check: Cloudflare Pages

🧰 Additional context used

📓 Path-based instructions (3)

**/*.rs

⚙️ CodeRabbit configuration file

**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.

Files:

clients/rook/src/gateway/mod.rs
clients/rook/src/admin/types.rs
clients/rook/src/admin/handlers.rs
clients/rook/src/server/mod.rs
clients/rook/src/observability.rs
clients/rook/src/config/mod.rs
clients/rook/src/main.rs
clients/rook/src/gateway/handlers.rs
clients/rook/src/doctor.rs
clients/rook/src/admin/mod.rs

**/*

⚙️ CodeRabbit configuration file

**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.

Files:

clients/rook/src/gateway/mod.rs
clients/rook/src/admin/types.rs
clients/rook/src/admin/handlers.rs
clients/rook/README.md
clients/rook/src/server/mod.rs
clients/rook/src/observability.rs
clients/rook/src/config/mod.rs
clients/rook/src/main.rs
clients/rook/src/gateway/handlers.rs
clients/rook/src/doctor.rs
clients/rook/src/admin/mod.rs

**/*.{md,mdx}

⚙️ CodeRabbit configuration file

**/*.{md,mdx}: Verify technical accuracy and that docs stay aligned with code changes.
For user-facing docs, check EN/ES parity or explicitly note pending translation gaps.

Files:

clients/rook/README.md

🔇 Additional comments (5)

clients/rook/src/gateway/mod.rs (1)

32-32: Debug derive on UpstreamResiliencePolicy looks good.

This improves operational introspection without changing behavior.

clients/rook/src/admin/types.rs (1)

148-160: Operator status DTO additions are clean and consistent.

The new response model matches handler/test expectations and keeps sensitive fields out of the payload.

clients/rook/src/observability.rs (2)

71-76: New upstream retry outcome metric plumbing looks solid.

Registration, handle exposure, and label-set modeling are consistent with the existing observability pattern.

Also applies to: 129-133, 429-446, 472-494

512-574: Test coverage for rook_upstream_retry_outcomes_total is strong.

The added assertions verify both registration and concrete sample emission.

clients/rook/src/admin/mod.rs (1)

32-32: /status route integration and coverage look good.

The tests validate the status contract across readiness states and include redaction checks.

Also applies to: 265-331

# Conflicts: # clients/rook/src/config/mod.rs

cloudflare-workers-and-pages · 2026-05-04T13:56:50Z

Deploying corvus with Cloudflare Pages

Latest commit:	`b152a04`
Status:	✅ Deploy successful!
Preview URL:	https://c82e6e71.corvus-42x.pages.dev
Branch Preview URL:	https://feat-rook-production-readine.corvus-42x.pages.dev

View logs

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

clients/rook/src/config/mod.rs (1)
252-259: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce validation in from_toml_str() to prevent boundary-check bypass.

UpstreamResilienceConfig::validate() is wired into validate_non_auth(), but from_toml_str() returns without calling validate(). Since load_from_file() uses from_toml_str(), direct file-load callers can still construct invalid configs (e.g., zero cooldown/backoff/concurrency) without error.
Suggested fix
 pub fn from_toml_str(input: &str) -> Result<Self, RookError> {
     let partial: PartialRookConfig = toml::from_str(input)
         .map_err(|error| RookError::Config(format!("invalid rook config TOML: {error}")))?;

     let mut config = Self::default();
     partial.apply_to(&mut config);
-    Ok(config)
+    config.validate()?;
+    Ok(config)
 }
As per coding guidelines, "Security first, performance second. Validate input boundaries, auth/authz implications, and secret management."

Also applies to: 294-311
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clients/rook/src/config/mod.rs` around lines 252 - 259, from_toml_str
currently parses TOML into PartialRookConfig, applies it to a default
RookConfig, and returns without running validation, allowing invalid
UpstreamResilienceConfig values to slip through; update from_toml_str to call
the appropriate validate() method on the built RookConfig (or
RookConfig::validate_non_auth()/UpstreamResilienceConfig::validate as
appropriate) after partial.apply_to(&mut config) and return an Err(RookError) if
validation fails so callers like load_from_file cannot construct configs with
out-of-bounds cooldown/backoff/concurrency; apply the same fix to the other
similar loader (the block referenced around lines 294-311).

♻️ Duplicate comments (1)

clients/rook/src/gateway/handlers.rs (1)
74-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use opaque account labels for retry-outcome metrics.

The new rook_upstream_retry_outcomes_total emission uses context.account, which is currently derived from decision.account.display_name (via UpstreamMetricContext::from_decision). This exposes tenant/customer identifiers in /api/metrics.
Suggested fix
 fn from_decision(decision: &crate::routing::RoutingDecision) -> Self {
     Self {
         vendor: Cow::Borrowed(normalize_vendor_label(&decision.account.vendor)),
-        account: normalize_account_label(Some(decision.account.display_name.as_str())),
+        account: normalize_account_label(Some(&format!("acct_{}", decision.account.id))),
         model: normalize_model_label(Some(decision.logical_model.as_str())),
     }
 }
As per coding guidelines, "Security first, performance second. Validate input boundaries, auth/authz implications, and secret management."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clients/rook/src/gateway/handlers.rs` around lines 74 - 85, The metric
currently emits raw tenant identifiers via context.account in
record_upstream_retry_outcome; change this to an opaque account label (e.g., a
hashed or internal id) by producing that opaque value in
UpstreamMetricContext::from_decision (or add a new field like account_opaque)
and use that field in record_upstream_retry_outcome instead of context.account;
ensure the opaque value is deterministic but non-identifying
(hash/salted-id/bucket) and is computed from decision.account so the metric no
longer exposes display_name.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@clients/rook/src/server/mod.rs`:
- Around line 175-178: The code wires config.upstream_resilience directly into
runtime state without validating its fields; add explicit validation of
config.upstream_resilience (e.g. check max_concurrent_upstream_requests is >= 1
and within allowed upper bound and any other numeric/semantic constraints)
before cloning into resilience_policy and before calling
UpstreamConcurrency::new; if validation fails, return an initialization error
(or log and exit) so invalid values are rejected instead of being silently
coerced downstream. Ensure you reference and validate upstream_resilience,
resilience_policy, and the field max_concurrent_upstream_requests prior to
calling crate::gateway::UpstreamConcurrency::new.

---

Outside diff comments:
In `@clients/rook/src/config/mod.rs`:
- Around line 252-259: from_toml_str currently parses TOML into
PartialRookConfig, applies it to a default RookConfig, and returns without
running validation, allowing invalid UpstreamResilienceConfig values to slip
through; update from_toml_str to call the appropriate validate() method on the
built RookConfig (or
RookConfig::validate_non_auth()/UpstreamResilienceConfig::validate as
appropriate) after partial.apply_to(&mut config) and return an Err(RookError) if
validation fails so callers like load_from_file cannot construct configs with
out-of-bounds cooldown/backoff/concurrency; apply the same fix to the other
similar loader (the block referenced around lines 294-311).

---

Duplicate comments:
In `@clients/rook/src/gateway/handlers.rs`:
- Around line 74-85: The metric currently emits raw tenant identifiers via
context.account in record_upstream_retry_outcome; change this to an opaque
account label (e.g., a hashed or internal id) by producing that opaque value in
UpstreamMetricContext::from_decision (or add a new field like account_opaque)
and use that field in record_upstream_retry_outcome instead of context.account;
ensure the opaque value is deterministic but non-identifying
(hash/salted-id/bucket) and is computed from decision.account so the metric no
longer exposes display_name.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6d87198e-9c96-4946-988b-27699341141e

📥 Commits

Reviewing files that changed from the base of the PR and between 4ea27bd and b152a04.

📒 Files selected for processing (6)

clients/rook/README.md
clients/rook/src/admin/handlers.rs
clients/rook/src/config/mod.rs
clients/rook/src/doctor.rs
clients/rook/src/gateway/handlers.rs
clients/rook/src/server/mod.rs

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)

GitHub Check: pr-checks
GitHub Check: sonar
GitHub Check: Cloudflare Pages
GitHub Check: submit-gradle
GitHub Check: semgrep-cloud-platform/scan

🧰 Additional context used

📓 Path-based instructions (3)

**/*.rs

⚙️ CodeRabbit configuration file

**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.

Files:

clients/rook/src/admin/handlers.rs
clients/rook/src/doctor.rs
clients/rook/src/server/mod.rs
clients/rook/src/config/mod.rs
clients/rook/src/gateway/handlers.rs

**/*

⚙️ CodeRabbit configuration file

**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.

Files:

clients/rook/src/admin/handlers.rs
clients/rook/src/doctor.rs
clients/rook/src/server/mod.rs
clients/rook/README.md
clients/rook/src/config/mod.rs
clients/rook/src/gateway/handlers.rs

**/*.{md,mdx}

⚙️ CodeRabbit configuration file

**/*.{md,mdx}: Verify technical accuracy and that docs stay aligned with code changes.
For user-facing docs, check EN/ES parity or explicitly note pending translation gaps.

Files:

clients/rook/README.md

🔇 Additional comments (4)

clients/rook/src/config/mod.rs (1)

169-193: Good hardening on upstream resilience bounds.

The new non-zero checks for attempts, cooldown, backoff, and concurrency close a real misconfiguration risk and align with production-safety intent.

clients/rook/src/admin/handlers.rs (1)

380-410: Good hardening on provider health metric labels.

Using provider_health_account_label(account.id) keeps /api/metrics labels opaque and avoids exposing human-readable account names in this new metric surface.

clients/rook/src/doctor.rs (1)

232-245: External bind posture detection is now much safer.

This correctly flags non-loopback IP/host binds as externally reachable while preserving loopback/localhost as local-only.

clients/rook/README.md (1)

331-332: Nice correction on rook_provider_account_health semantics.

The updated wording avoids implying one-hot status per vendor/account in a scrape and better matches the emitted metric behavior.

coderabbitai · 2026-05-04T14:05:39Z

+    let resilience_policy = config.upstream_resilience.clone();
    let upstream_concurrency = crate::gateway::UpstreamConcurrency::new(
        resilience_policy.max_concurrent_upstream_requests,
    );


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate upstream_resilience before wiring it into runtime state.

config.upstream_resilience is now authoritative for gateway behavior, but startup validation does not check it. Invalid values can be silently coerced downstream (for example via .max(1)), which breaks config intent and fail-closed guarantees.

As per coding guidelines, "Security first, performance second. Validate input boundaries, auth/authz implications, and secret management."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@clients/rook/src/server/mod.rs` around lines 175 - 178, The code wires config.upstream_resilience directly into runtime state without validating its fields; add explicit validation of config.upstream_resilience (e.g. check max_concurrent_upstream_requests is >= 1 and within allowed upper bound and any other numeric/semantic constraints) before cloning into resilience_policy and before calling UpstreamConcurrency::new; if validation fails, return an initialization error (or log and exit) so invalid values are rejected instead of being silently coerced downstream. Ensure you reference and validate upstream_resilience, resilience_policy, and the field max_concurrent_upstream_requests prior to calling crate::gateway::UpstreamConcurrency::new.

sonarqubecloud · 2026-05-04T14:18:47Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.9% Duplication on New Code

See analysis details on SonarQube Cloud

feat(rook): add production readiness operations

4ea27bd

github-actions Bot added the size/xl Denotes an extra large change size label May 4, 2026

coderabbitai Bot added area:rust area:docs risk:high labels May 4, 2026

coderabbitai Bot reviewed May 4, 2026

View reviewed changes

Comment thread clients/rook/README.md Outdated

Comment thread clients/rook/src/admin/handlers.rs Outdated

Comment thread clients/rook/src/config/mod.rs

Comment thread clients/rook/src/doctor.rs

Comment thread clients/rook/src/gateway/handlers.rs

Comment thread clients/rook/src/server/mod.rs

yacosta738 added 2 commits May 4, 2026 15:12

Merge branch 'main' into feat/rook-production-readiness-677

b6d6f0c

# Conflicts: # clients/rook/src/config/mod.rs

fix(rook): address production readiness review

b152a04

coderabbitai Bot removed area:rust area:docs risk:high labels May 4, 2026

coderabbitai Bot reviewed May 4, 2026

View reviewed changes

yacosta738 merged commit 2ae73ba into main May 4, 2026
17 checks passed

yacosta738 deleted the feat/rook-production-readiness-677 branch May 4, 2026 15:03

yacosta738 linked an issue May 4, 2026 that may be closed by this pull request

Rook production readiness umbrella #677

Closed

9 tasks

coderabbitai Bot mentioned this pull request May 6, 2026

feat(rook): operational parity – undercover mode, debug diagnostics, multi-provider routing controls #792

Merged

8 tasks

Conversation

yacosta738 commented May 4, 2026

Related Issues

Summary

Tested Information

Documentation Impact

Breaking Changes

Checklist

Uh oh!

coderabbitai Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cloudflare-workers-and-pages Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying corvus with Cloudflare Pages

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud Bot commented May 4, 2026

Quality Gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented May 4, 2026 •

edited

Loading

cloudflare-workers-and-pages Bot commented May 4, 2026 •

edited

Loading