feat(rook): add production readiness operations#768

Merged
yacosta738 merged 3 commits into main from feat/rook-production-readiness-677 on May 4, 2026
Conversation

@yacosta738

Related Issues

Related to #677


Summary

  • Adds production-readiness operator surfaces for Rook: machine-readable rook doctor --json, /api/status, provider health metrics, retry outcome metrics, and a local-first rook usage report CLI.
  • Wires upstream resilience policy through config/env/CLI into server startup, with doctor visibility and production posture advisories.
  • Expands Rook operations documentation with deployment triage guidance, retry exhaustion alerting, provider health metrics, and usage report runbook notes.

Tested Information

  • cargo fmt --manifest-path clients/rook/Cargo.toml
  • cargo test --manifest-path clients/rook/Cargo.toml
    • 391 lib tests passed
    • 23 main tests passed
    • 2 distribution contract tests passed
    • 4 doctor operational diagnostics tests passed
    • 1 doc-test passed
  • cargo check --manifest-path clients/rook/Cargo.toml
  • Pre-commit and pre-push doc link checks passed.

Documentation Impact

  • Docs updated in:
    • clients/rook/README.md
  • I verified the documentation matches the current behavior.

Breaking Changes

None.


Checklist

  • I have checked that there isn’t already a PR solving the same problem.
  • I have read the Contributing Guidelines.
  • I ensured my code follows the project's style guidelines.
  • I have added or updated tests that prove my fix is effective or that my feature works.
  • I have updated the documentation, or I explained above why no documentation update is needed.
  • I verified the documentation matches the current behavior.
  • I have documented any breaking changes in the Breaking Changes section.
  • I have linked the related issue (if any).

@coderabbitai

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added /api/status operator status endpoint and expanded /api/metrics with provider health and cooldown gauges
    • Introduced upstream resilience settings (buffered retries, cooldown, backoff, concurrency) and CLI flags
    • Added structured diagnostics (--json) and a usage report CLI (JSON output option)
    • New retry outcome metrics for upstream retry decisions
  • Documentation

    • Expanded operational runbook, observability guidance, probes, and troubleshooting (retry exhaustion, provider cooldowns, empty usage reports)

Walkthrough

Adds a configurable upstream resilience feature (buffered retries, cooldown, backoff, concurrency) plus observability, operator status endpoint, CLI usage/reporting and doctor JSON output; wires config → server → gateway retry behavior, emits Prometheus metrics for retry outcomes and per-account health/cooldown, and documents operational runbook changes.

Changes

Upstream resilience, observability, admin & CLI

| Layer / File(s) | Summary |
|---|---|
| Config Shape & Validation: clients/rook/src/config/mod.rs | Adds UpstreamResilienceConfig, PartialUpstreamResilienceConfig, validation, env overlay parsing (ROOK_UPSTREAM_RESILIENCE_*), conversion to gateway::UpstreamResiliencePolicy, and export view. |
| Server wiring: clients/rook/src/server/mod.rs | ServerConfig gains upstream_resilience; app builder uses configured policy when creating UpstreamConcurrency/GatewayState. |
| Observability metric registration: clients/rook/src/observability.rs | Registers rook_upstream_retry_outcomes counter family, adds UpstreamRetryLabels/UpstreamRetryOutcomesHandle and accessor to record labeled retry outcomes. |
| Gateway retry instrumentation: clients/rook/src/gateway/handlers.rs | Buffered retry loop caches last error/context, records rook_upstream_retry_outcomes_total with outcomes (retry_scheduled, retry_exhausted, not_retryable) at decision points. |
| Admin API types & endpoint: clients/rook/src/admin/types.rs, clients/rook/src/admin/handlers.rs, clients/rook/src/admin/mod.rs | Adds OperatorStatusView/OperatorRuntimeView, handle_operator_status wired to GET /status; handle_get_metrics now appends per-account rook_provider_account_health and rook_provider_account_cooldown_active gauge lines. Tests added for /status. |
| CLI & Doctor: clients/rook/src/main.rs, clients/rook/src/doctor.rs | serve gains upstream resilience CLI flags; ServeOverrides overlays support resilience; new usage report command (--period/--limit/--format json); doctor --json and render_json_report; advisory checks for production posture and aggressive resilience settings. |
| Documentation & Runbook: clients/rook/README.md | Documents [upstream_resilience] example/defaults, env overrides, tuning notes (no streaming retries after bytes committed), new /api/status and /api/metrics endpoints, usage-report CLI, and updated runbook/troubleshooting steps. |
| Tests / Examples: clients/rook/src/{admin,config,doctor,gateway,main,observability,server}/*.rs, clients/rook/README.md | Adds/updates unit and integration tests covering config env overlays, validation, metrics presence and samples, retry-outcome metric cases, /status behaviors, usage report DB-backed rendering, and JSON doctor output. |

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Gateway as Gateway (Buffered Proxy)
  participant Provider as Upstream Provider
  participant Registry as Registry/HealthService
  participant Observ as Observability

  Client->>Gateway: POST /v1/chat/completions (request)
  Gateway->>Provider: Forward attempt (account A)
  alt Provider responds error (retryable)
    Provider-->>Gateway: Error
    Gateway->>Observ: inc(upstream_retry_outcomes_total{outcome="retry_scheduled", vendor..., account..., model...})
    Gateway->>Registry: record tentative account health/cooldown?
    Gateway->>Gateway: schedule retry (backoff)
    Gateway->>Provider: Forward retry attempt (account B or A)
    alt Exhausted attempts
      Gateway->>Observ: inc(...{outcome="retry_exhausted"})
      Gateway-->>Client: Fail response
    else Non-retryable error
      Gateway->>Observ: inc(...{outcome="not_retryable"})
      Gateway-->>Client: Fail response
    end
  else Provider responds success
    Provider-->>Gateway: Stream/response
    Gateway-->>Client: Stream/response (no further retries)
  end
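
The decision points in the diagram can be sketched as a small loop. This is an illustration only, not the code in clients/rook/src/gateway/handlers.rs: `RetryOutcome`, `UpstreamError`, and `run_buffered_attempts` are hypothetical names, backoff sleeping and account selection are omitted, and only the three outcome label values come from the walkthrough above.

```rust
/// Outcome labels named in the walkthrough; the enum itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RetryOutcome {
    RetryScheduled,
    RetryExhausted,
    NotRetryable,
}

impl RetryOutcome {
    /// Value that would go into the `outcome` metric label.
    fn as_label(self) -> &'static str {
        match self {
            RetryOutcome::RetryScheduled => "retry_scheduled",
            RetryOutcome::RetryExhausted => "retry_exhausted",
            RetryOutcome::NotRetryable => "not_retryable",
        }
    }
}

/// Hypothetical upstream failure; only retryability matters for this sketch.
#[derive(Debug, PartialEq, Eq)]
struct UpstreamError {
    retryable: bool,
}

/// Runs `attempt` up to `max_attempts` times and reports every retry decision
/// through `record`. A successful attempt records nothing.
fn run_buffered_attempts<T>(
    max_attempts: u32,
    mut attempt: impl FnMut(u32) -> Result<T, UpstreamError>,
    mut record: impl FnMut(RetryOutcome),
) -> Result<T, UpstreamError> {
    let mut attempt_number = 1;
    loop {
        match attempt(attempt_number) {
            Ok(value) => return Ok(value),
            // Terminal: the error class can never succeed on retry.
            Err(error) if !error.retryable => {
                record(RetryOutcome::NotRetryable);
                return Err(error);
            }
            // Terminal: retryable, but the attempt budget is spent.
            Err(error) if attempt_number >= max_attempts => {
                record(RetryOutcome::RetryExhausted);
                return Err(error);
            }
            // Non-terminal: a real gateway would sleep for the backoff here.
            Err(_) => {
                record(RetryOutcome::RetryScheduled);
                attempt_number += 1;
            }
        }
    }
}
```

Passing the recorder as a closure keeps the sketch independent of any metrics library; in a real handler the closure would increment the labeled counter.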

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 48.82% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title follows Conventional Commit style with 'feat' prefix, includes a clear scope 'rook', and is concise (47 characters), accurately reflecting the PR's focus on production readiness operations. |
| Description check | ✅ Passed | The PR description covers all required template sections: Related Issues (#677), Summary, Tested Information (with detailed test results), Documentation Impact (README.md updated), Breaking Changes (none), and a complete checklist with all items checked. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions github-actions Bot added the size/xl Denotes an extra large change size label May 4, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@clients/rook/README.md`:
- Around line 331-332: Update the README description for
rook_provider_account_health to remove the “one-hot” claim and instead state
that the status label (values: healthy, degraded, unhealthy, unknown) indicates
persisted provider health but is not guaranteed to be unique per vendor/account
in a single scrape (multiple status values may appear, as asserted in server
logic such as the checks around rook_provider_account_health in the server
module). Adjust the wording to avoid implying exclusivity and make it clear that
consumers should not assume only one status label value per vendor/account
during a scrape.

In `@clients/rook/src/admin/handlers.rs`:
- Around line 389-395: The metric currently uses account.display_name via
normalize_account_label to populate the account label (see account.display_name,
normalize_account_label, account_label, and the rook_provider_account_health
string in body.push_str), which can leak tenant identifiers; change this to use
an opaque identifier instead (prefer account.id if available) or replace the
display name with a stable irreversible hash of the ID (or ID+salt) before
passing to normalize_account_label/provider_health_status_label so the metric
exposes no raw customer/displayable identifiers. Ensure you update the label
value construction and any tests to expect the opaque value rather than
display_name.

In `@clients/rook/src/config/mod.rs`:
- Around line 170-183: The validate method currently checks
max_buffered_attempts and max_concurrent_upstream_requests but misses validating
failure_cooldown_seconds and retry_backoff_milliseconds; update the
validate(&self) -> Result<(), RookError> function to also return
Err(RookError::Config(...)) when self.failure_cooldown_seconds == 0 or
self.retry_backoff_milliseconds == 0, using the same error pattern and
descriptive messages (e.g., "upstream_resilience.failure_cooldown_seconds must
be at least 1" and "upstream_resilience.retry_backoff_milliseconds must be at
least 1") so that zero values are rejected.

In `@clients/rook/src/doctor.rs`:
- Around line 232-235: The current is_externally_reachable_bind only treats
wildcards as external and misses concrete non-loopback IPs/hostnames; update
is_externally_reachable_bind to parse the host (trim brackets) and treat it as
external unless it is an explicit loopback (e.g., "127.0.0.1", "::1", or
IpAddr::is_loopback()), or an unspecified address should be considered external
(or explicitly treat "0.0.0.0"/"::" and "*" as external). Concretely: in
is_externally_reachable_bind, try parsing host into std::net::IpAddr and return
true if parsing succeeds and !ip.is_loopback(); if parsing fails (hostname
case), treat anything other than "localhost" as external; keep existing wildcard
checks for "*", "0.0.0.0", "::". This ensures non-loopback IPs/hostnames like
"203.0.113.10" trigger the advisory.

In `@clients/rook/src/gateway/handlers.rs`:
- Around line 336-346: When the route resolution path (resolve()) fails and the
loop exits, terminal retry outcomes aren't recorded because last_error is
drained and returns early; update the error-handling after the resolve() failure
path to check last_error and call record_upstream_retry_outcome(state,
&metric_context, ...) using the same logic as the retry branch (choose
"retry_exhausted" if retryable else "not_retryable") before returning the
upstream error. Refer to last_error, resolve(), metric_context,
record_upstream_retry_outcome, and state to locate where to insert this check
and ensure the terminal outcome is emitted for the resolve-failure path.

In `@clients/rook/src/server/mod.rs`:
- Around line 667-700: The test currently triggers a missing-model 503 and never
exercises the upstream resilience policy; change it to register a model/upstream
that resolves to a routed upstream which fails so the resilience logic runs:
using RookRegistry::open_in_memory(), add a fake model/route that points to an
upstream handler which returns 500 (or sleeps/fails) and then call
build_app_with_registry(config, registry) with ServerConfig.upstream_resilience
set as in the diff; send requests to /v1/chat/completions using that registered
model so retries/concurrency/cooldown are exercised, and assert a
policy-dependent observable (e.g., number of attempts, that a second concurrent
request is rejected due to max_concurrent_upstream_requests, or that a
subsequent request within failure_cooldown is short-circuited) rather than
relying on missing-model 503.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 97fe643e-cd74-48b7-8c84-de016d352137

📥 Commits

Reviewing files that changed from the base of the PR and between f52d91b and 4ea27bd.

📒 Files selected for processing (11)
  • clients/rook/README.md
  • clients/rook/src/admin/handlers.rs
  • clients/rook/src/admin/mod.rs
  • clients/rook/src/admin/types.rs
  • clients/rook/src/config/mod.rs
  • clients/rook/src/doctor.rs
  • clients/rook/src/gateway/handlers.rs
  • clients/rook/src/gateway/mod.rs
  • clients/rook/src/main.rs
  • clients/rook/src/observability.rs
  • clients/rook/src/server/mod.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: sonar
  • GitHub Check: pr-checks
  • GitHub Check: semgrep-cloud-platform/scan
  • GitHub Check: submit-gradle
  • GitHub Check: Cloudflare Pages
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs

⚙️ CodeRabbit configuration file

**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.

Files:

  • clients/rook/src/gateway/mod.rs
  • clients/rook/src/admin/types.rs
  • clients/rook/src/admin/handlers.rs
  • clients/rook/src/server/mod.rs
  • clients/rook/src/observability.rs
  • clients/rook/src/config/mod.rs
  • clients/rook/src/main.rs
  • clients/rook/src/gateway/handlers.rs
  • clients/rook/src/doctor.rs
  • clients/rook/src/admin/mod.rs
**/*

⚙️ CodeRabbit configuration file

**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.

Files:

  • clients/rook/src/gateway/mod.rs
  • clients/rook/src/admin/types.rs
  • clients/rook/src/admin/handlers.rs
  • clients/rook/README.md
  • clients/rook/src/server/mod.rs
  • clients/rook/src/observability.rs
  • clients/rook/src/config/mod.rs
  • clients/rook/src/main.rs
  • clients/rook/src/gateway/handlers.rs
  • clients/rook/src/doctor.rs
  • clients/rook/src/admin/mod.rs
**/*.{md,mdx}

⚙️ CodeRabbit configuration file

**/*.{md,mdx}: Verify technical accuracy and that docs stay aligned with code changes.
For user-facing docs, check EN/ES parity or explicitly note pending translation gaps.

Files:

  • clients/rook/README.md
🔇 Additional comments (5)
clients/rook/src/gateway/mod.rs (1)

32-32: Debug derive on UpstreamResiliencePolicy looks good.

This improves operational introspection without changing behavior.

clients/rook/src/admin/types.rs (1)

148-160: Operator status DTO additions are clean and consistent.

The new response model matches handler/test expectations and keeps sensitive fields out of the payload.

clients/rook/src/observability.rs (2)

71-76: New upstream retry outcome metric plumbing looks solid.

Registration, handle exposure, and label-set modeling are consistent with the existing observability pattern.

Also applies to: 129-133, 429-446, 472-494


512-574: Test coverage for rook_upstream_retry_outcomes_total is strong.

The added assertions verify both registration and concrete sample emission.

clients/rook/src/admin/mod.rs (1)

32-32: /status route integration and coverage look good.

The tests validate the status contract across readiness states and include redaction checks.

Also applies to: 265-331

Comment thread clients/rook/README.md Outdated
Comment thread clients/rook/src/admin/handlers.rs Outdated
Comment thread clients/rook/src/config/mod.rs
Comment thread clients/rook/src/doctor.rs
Comment thread clients/rook/src/gateway/handlers.rs
Comment thread clients/rook/src/server/mod.rs
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented May 4, 2026

Deploying corvus with Cloudflare Pages

Latest commit: b152a04
Status: ✅  Deploy successful!
Preview URL: https://c82e6e71.corvus-42x.pages.dev
Branch Preview URL: https://feat-rook-production-readine.corvus-42x.pages.dev



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
clients/rook/src/config/mod.rs (1)

252-259: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce validation in from_toml_str() to prevent boundary-check bypass.

UpstreamResilienceConfig::validate() is wired into validate_non_auth(), but from_toml_str() returns without calling validate(). Since load_from_file() uses from_toml_str(), direct file-load callers can still construct invalid configs (e.g., zero cooldown/backoff/concurrency) without error.

Suggested fix
 pub fn from_toml_str(input: &str) -> Result<Self, RookError> {
     let partial: PartialRookConfig = toml::from_str(input)
         .map_err(|error| RookError::Config(format!("invalid rook config TOML: {error}")))?;

     let mut config = Self::default();
     partial.apply_to(&mut config);
-    Ok(config)
+    config.validate()?;
+    Ok(config)
 }

As per coding guidelines, "Security first, performance second. Validate input boundaries, auth/authz implications, and secret management."

Also applies to: 294-311

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clients/rook/src/config/mod.rs` around lines 252 - 259, from_toml_str
currently parses TOML into PartialRookConfig, applies it to a default
RookConfig, and returns without running validation, allowing invalid
UpstreamResilienceConfig values to slip through; update from_toml_str to call
the appropriate validate() method on the built RookConfig (or
RookConfig::validate_non_auth()/UpstreamResilienceConfig::validate as
appropriate) after partial.apply_to(&mut config) and return an Err(RookError) if
validation fails so callers like load_from_file cannot construct configs with
out-of-bounds cooldown/backoff/concurrency; apply the same fix to the other
similar loader (the block referenced around lines 294-311).
♻️ Duplicate comments (1)
clients/rook/src/gateway/handlers.rs (1)

74-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use opaque account labels for retry-outcome metrics.

The new rook_upstream_retry_outcomes_total emission uses context.account, which is currently derived from decision.account.display_name (via UpstreamMetricContext::from_decision). This exposes tenant/customer identifiers in /api/metrics.

Suggested fix
 fn from_decision(decision: &crate::routing::RoutingDecision) -> Self {
     Self {
         vendor: Cow::Borrowed(normalize_vendor_label(&decision.account.vendor)),
-        account: normalize_account_label(Some(decision.account.display_name.as_str())),
+        account: normalize_account_label(Some(&format!("acct_{}", decision.account.id))),
         model: normalize_model_label(Some(decision.logical_model.as_str())),
     }
 }

As per coding guidelines, "Security first, performance second. Validate input boundaries, auth/authz implications, and secret management."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clients/rook/src/gateway/handlers.rs` around lines 74 - 85, The metric
currently emits raw tenant identifiers via context.account in
record_upstream_retry_outcome; change this to an opaque account label (e.g., a
hashed or internal id) by producing that opaque value in
UpstreamMetricContext::from_decision (or add a new field like account_opaque)
and use that field in record_upstream_retry_outcome instead of context.account;
ensure the opaque value is deterministic but non-identifying
(hash/salted-id/bucket) and is computed from decision.account so the metric no
longer exposes display_name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@clients/rook/src/server/mod.rs`:
- Around line 175-178: The code wires config.upstream_resilience directly into
runtime state without validating its fields; add explicit validation of
config.upstream_resilience (e.g. check max_concurrent_upstream_requests is >= 1
and within allowed upper bound and any other numeric/semantic constraints)
before cloning into resilience_policy and before calling
UpstreamConcurrency::new; if validation fails, return an initialization error
(or log and exit) so invalid values are rejected instead of being silently
coerced downstream. Ensure you reference and validate upstream_resilience,
resilience_policy, and the field max_concurrent_upstream_requests prior to
calling crate::gateway::UpstreamConcurrency::new.

---

Outside diff comments:
In `@clients/rook/src/config/mod.rs`:
- Around line 252-259: from_toml_str currently parses TOML into
PartialRookConfig, applies it to a default RookConfig, and returns without
running validation, allowing invalid UpstreamResilienceConfig values to slip
through; update from_toml_str to call the appropriate validate() method on the
built RookConfig (or
RookConfig::validate_non_auth()/UpstreamResilienceConfig::validate as
appropriate) after partial.apply_to(&mut config) and return an Err(RookError) if
validation fails so callers like load_from_file cannot construct configs with
out-of-bounds cooldown/backoff/concurrency; apply the same fix to the other
similar loader (the block referenced around lines 294-311).

---

Duplicate comments:
In `@clients/rook/src/gateway/handlers.rs`:
- Around line 74-85: The metric currently emits raw tenant identifiers via
context.account in record_upstream_retry_outcome; change this to an opaque
account label (e.g., a hashed or internal id) by producing that opaque value in
UpstreamMetricContext::from_decision (or add a new field like account_opaque)
and use that field in record_upstream_retry_outcome instead of context.account;
ensure the opaque value is deterministic but non-identifying
(hash/salted-id/bucket) and is computed from decision.account so the metric no
longer exposes display_name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6d87198e-9c96-4946-988b-27699341141e

📥 Commits

Reviewing files that changed from the base of the PR and between 4ea27bd and b152a04.

📒 Files selected for processing (6)
  • clients/rook/README.md
  • clients/rook/src/admin/handlers.rs
  • clients/rook/src/config/mod.rs
  • clients/rook/src/doctor.rs
  • clients/rook/src/gateway/handlers.rs
  • clients/rook/src/server/mod.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: pr-checks
  • GitHub Check: sonar
  • GitHub Check: Cloudflare Pages
  • GitHub Check: submit-gradle
  • GitHub Check: semgrep-cloud-platform/scan
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs

⚙️ CodeRabbit configuration file

**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.

Files:

  • clients/rook/src/admin/handlers.rs
  • clients/rook/src/doctor.rs
  • clients/rook/src/server/mod.rs
  • clients/rook/src/config/mod.rs
  • clients/rook/src/gateway/handlers.rs
**/*

⚙️ CodeRabbit configuration file

**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.

Files:

  • clients/rook/src/admin/handlers.rs
  • clients/rook/src/doctor.rs
  • clients/rook/src/server/mod.rs
  • clients/rook/README.md
  • clients/rook/src/config/mod.rs
  • clients/rook/src/gateway/handlers.rs
**/*.{md,mdx}

⚙️ CodeRabbit configuration file

**/*.{md,mdx}: Verify technical accuracy and that docs stay aligned with code changes.
For user-facing docs, check EN/ES parity or explicitly note pending translation gaps.

Files:

  • clients/rook/README.md
🔇 Additional comments (4)
clients/rook/src/config/mod.rs (1)

169-193: Good hardening on upstream resilience bounds.

The new non-zero checks for attempts, cooldown, backoff, and concurrency close a real misconfiguration risk and align with production-safety intent.
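
A minimal sketch of what such non-zero bounds checks could look like. The four field names and the "must be at least 1" message wording come from the first-round review comment on config/mod.rs; the field types and the single-variant `RookError` are assumptions made so the example stands alone, and the real validate() may be structured differently.

```rust
/// Stand-in for the crate's error type; only the Config variant is sketched.
#[derive(Debug, PartialEq, Eq)]
enum RookError {
    Config(String),
}

/// Field names follow the review thread; the integer types are assumptions.
#[derive(Debug, Clone)]
struct UpstreamResilienceConfig {
    max_buffered_attempts: u32,
    failure_cooldown_seconds: u64,
    retry_backoff_milliseconds: u64,
    max_concurrent_upstream_requests: usize,
}

impl UpstreamResilienceConfig {
    /// Rejects zero for every bound instead of letting it reach runtime.
    fn validate(&self) -> Result<(), RookError> {
        let checks = [
            ("max_buffered_attempts", self.max_buffered_attempts == 0),
            ("failure_cooldown_seconds", self.failure_cooldown_seconds == 0),
            ("retry_backoff_milliseconds", self.retry_backoff_milliseconds == 0),
            (
                "max_concurrent_upstream_requests",
                self.max_concurrent_upstream_requests == 0,
            ),
        ];
        for (field, is_zero) in checks {
            if is_zero {
                return Err(RookError::Config(format!(
                    "upstream_resilience.{field} must be at least 1"
                )));
            }
        }
        Ok(())
    }
}
```

Listing the checks in one table keeps the error wording uniform, at the cost of reporting only the first violated bound.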

clients/rook/src/admin/handlers.rs (1)

380-410: Good hardening on provider health metric labels.

Using provider_health_account_label(account.id) keeps /api/metrics labels opaque and avoids exposing human-readable account names in this new metric surface.
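
To illustrate the idea, here is a hedged sketch of an opaque label feeding a gauge line. Only the function name `provider_health_account_label`, the metric name, and the `acct_` prefix (from the suggested fix in this review) appear in the thread; the `u64` id type, the body of the function, the exact label set, and `render_health_gauge` are assumptions.

```rust
/// Builds a label from the numeric account id only, so no display name
/// can reach the metrics output. The id type is an assumption.
fn provider_health_account_label(account_id: u64) -> String {
    format!("acct_{account_id}")
}

/// Hypothetical helper that renders one gauge sample in Prometheus text
/// format. It assumes `vendor` and `status` are already normalized and
/// contain no characters that need escaping.
fn render_health_gauge(vendor: &str, account_id: u64, status: &str) -> String {
    format!(
        "rook_provider_account_health{{vendor=\"{vendor}\",account=\"{}\",status=\"{status}\"}} 1\n",
        provider_health_account_label(account_id)
    )
}
```

A plain id prefix is stable across restarts, which keeps time series continuous. If ids themselves are considered sensitive, the review's alternative of a salted hash would fit the same function signature.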

clients/rook/src/doctor.rs (1)

232-245: External bind posture detection is now much safer.

This correctly flags non-loopback IP/host binds as externally reachable while preserving loopback/localhost as local-only.
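
The rule described in the first-round comment on doctor.rs can be sketched as follows. This is an approximation under stated assumptions: the function receives a bare host with no port, and the real `is_externally_reachable_bind` may take different input or handle more cases.

```rust
use std::net::IpAddr;

/// Returns true when a bind host should be treated as reachable from outside
/// the machine. Assumes `host` carries no port; brackets around IPv6
/// literals are tolerated.
fn is_externally_reachable_bind(host: &str) -> bool {
    let host = host
        .trim()
        .trim_start_matches('[')
        .trim_end_matches(']');

    // "*" is not a valid IP literal, so it is handled before parsing.
    if host == "*" {
        return true;
    }

    match host.parse::<IpAddr>() {
        // Covers 0.0.0.0 and :: (unspecified) as well as concrete public or
        // private addresses; only loopback counts as local-only.
        Ok(ip) => !ip.is_loopback(),
        // Hostnames: anything other than localhost is assumed external.
        Err(_) => !host.eq_ignore_ascii_case("localhost"),
    }
}
```

Treating unknown hostnames as external errs on the side of showing the advisory, which matches the fail-safe intent of a posture check.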

clients/rook/README.md (1)

331-332: Nice correction on rook_provider_account_health semantics.

The updated wording avoids implying one-hot status per vendor/account in a scrape and better matches the emitted metric behavior.

Comment on lines +175 to 178
let resilience_policy = config.upstream_resilience.clone();
let upstream_concurrency = crate::gateway::UpstreamConcurrency::new(
resilience_policy.max_concurrent_upstream_requests,
);

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate upstream_resilience before wiring it into runtime state.

config.upstream_resilience is now authoritative for gateway behavior, but startup validation does not check it. Invalid values can be silently coerced downstream (for example via .max(1)), which breaks config intent and fail-closed guarantees.

As per coding guidelines, "Security first, performance second. Validate input boundaries, auth/authz implications, and secret management."
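
The difference between silent coercion and fail-closed construction can be shown with a toy type. `ConcurrencyLimit` is hypothetical and is not the real `UpstreamConcurrency`; it only contrasts the `.max(1)` behavior mentioned above with a constructor that rejects the invalid value.

```rust
/// Toy stand-in for a concurrency limiter's configured size.
#[derive(Debug, PartialEq, Eq)]
struct ConcurrencyLimit(usize);

impl ConcurrencyLimit {
    /// Coercing constructor: a configured 0 quietly becomes 1, so the
    /// operator never learns the config was invalid.
    fn coerced(requested: usize) -> Self {
        Self(requested.max(1))
    }

    /// Fail-closed constructor: a configured 0 is surfaced as an error
    /// that startup code can propagate.
    fn try_new(requested: usize) -> Result<Self, String> {
        if requested == 0 {
            Err("max_concurrent_upstream_requests must be at least 1".to_string())
        } else {
            Ok(Self(requested))
        }
    }
}
```

With the fallible form, server startup can return the error before any runtime state is built, which is the behavior the comment asks for.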

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clients/rook/src/server/mod.rs` around lines 175 - 178, The code wires
config.upstream_resilience directly into runtime state without validating its
fields; add explicit validation of config.upstream_resilience (e.g. check
max_concurrent_upstream_requests is >= 1 and within allowed upper bound and any
other numeric/semantic constraints) before cloning into resilience_policy and
before calling UpstreamConcurrency::new; if validation fails, return an
initialization error (or log and exit) so invalid values are rejected instead of
being silently coerced downstream. Ensure you reference and validate
upstream_resilience, resilience_policy, and the field
max_concurrent_upstream_requests prior to calling
crate::gateway::UpstreamConcurrency::new.

@sonarqubecloud

sonarqubecloud Bot commented May 4, 2026

@yacosta738 yacosta738 merged commit 2ae73ba into main May 4, 2026
17 checks passed
@yacosta738 yacosta738 deleted the feat/rook-production-readiness-677 branch May 4, 2026 15:03
@yacosta738 yacosta738 linked an issue May 4, 2026 that may be closed by this pull request
9 tasks

Labels

size/xl Denotes an extra large change size

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rook production readiness umbrella

1 participant