feat(rook): add production readiness operations#768
Conversation
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds a configurable upstream resilience feature (buffered retries, cooldown, backoff, concurrency) plus observability, operator status endpoint, CLI usage/reporting and doctor JSON output; wires config → server → gateway retry behavior, emits Prometheus metrics for retry outcomes and per-account health/cooldown, and documents operational runbook changes. ChangesUpstream resilience, observability, admin & CLI
Sequence Diagram(s)sequenceDiagram
participant Client as Client
participant Gateway as Gateway (Buffered Proxy)
participant Provider as Upstream Provider
participant Registry as Registry/HealthService
participant Observ as Observability
Client->>Gateway: POST /v1/chat/completions (request)
Gateway->>Provider: Forward attempt (account A)
alt Provider responds error (retryable)
Provider-->>Gateway: Error
Gateway->>Observ: inc(upstream_retry_outcomes_total{outcome="retry_scheduled", vendor..., account..., model...})
Gateway->>Registry: record tentative account health/cooldown?
Gateway->>Gateway: schedule retry (backoff)
Gateway->>Provider: Forward retry attempt (account B or A)
alt Exhausted attempts
Gateway->>Observ: inc(...{outcome="retry_exhausted"})
Gateway-->>Client: Fail response
else Non-retryable error
Gateway->>Observ: inc(...{outcome="not_retryable"})
Gateway-->>Client: Fail response
end
else Provider responds success
Provider-->>Gateway: Stream/response
Gateway-->>Client: Stream/response (no further retries)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 60 minutes.Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@clients/rook/README.md`:
- Around line 331-332: Update the README description for
rook_provider_account_health to remove the “one-hot” claim and instead state
that the status label (values: healthy, degraded, unhealthy, unknown) indicates
persisted provider health but is not guaranteed to be unique per vendor/account
in a single scrape (multiple status values may appear, as asserted in server
logic such as the checks around rook_provider_account_health in the server
module). Adjust the wording to avoid implying exclusivity and make it clear that
consumers should not assume only one status label value per vendor/account
during a scrape.
In `@clients/rook/src/admin/handlers.rs`:
- Around line 389-395: The metric currently uses account.display_name via
normalize_account_label to populate the account label (see account.display_name,
normalize_account_label, account_label, and the rook_provider_account_health
string in body.push_str), which can leak tenant identifiers; change this to use
an opaque identifier instead (prefer account.id if available) or replace the
display name with a stable irreversible hash of the ID (or ID+salt) before
passing to normalize_account_label/provider_health_status_label so the metric
exposes no raw customer/displayable identifiers. Ensure you update the label
value construction and any tests to expect the opaque value rather than
display_name.
In `@clients/rook/src/config/mod.rs`:
- Around line 170-183: The validate method currently checks
max_buffered_attempts and max_concurrent_upstream_requests but misses validating
failure_cooldown_seconds and retry_backoff_milliseconds; update the
validate(&self) -> Result<(), RookError> function to also return
Err(RookError::Config(...)) when self.failure_cooldown_seconds == 0 or
self.retry_backoff_milliseconds == 0, using the same error pattern and
descriptive messages (e.g., "upstream_resilience.failure_cooldown_seconds must
be at least 1" and "upstream_resilience.retry_backoff_milliseconds must be at
least 1") so that zero values are rejected.
In `@clients/rook/src/doctor.rs`:
- Around line 232-235: The current is_externally_reachable_bind only treats
wildcards as external and misses concrete non-loopback IPs/hostnames; update
is_externally_reachable_bind to parse the host (trim brackets) and treat it as
external unless it is an explicit loopback (e.g., "127.0.0.1", "::1", or
IpAddr::is_loopback()), or an unspecified address should be considered external
(or explicitly treat "0.0.0.0"/"::" and "*" as external). Concretely: in
is_externally_reachable_bind, try parsing host into std::net::IpAddr and return
true if parsing succeeds and !ip.is_loopback(); if parsing fails (hostname
case), treat anything other than "localhost" as external; keep existing wildcard
checks for "*", "0.0.0.0", "::". This ensures non-loopback IPs/hostnames like
"203.0.113.10" trigger the advisory.
In `@clients/rook/src/gateway/handlers.rs`:
- Around line 336-346: When the route resolution path (resolve()) fails and the
loop exits, terminal retry outcomes aren't recorded because last_error is
drained and returns early; update the error-handling after the resolve() failure
path to check last_error and call record_upstream_retry_outcome(state,
&metric_context, ...) using the same logic as the retry branch (choose
"retry_exhausted" if retryable else "not_retryable") before returning the
upstream error. Refer to last_error, resolve(), metric_context,
record_upstream_retry_outcome, and state to locate where to insert this check
and ensure the terminal outcome is emitted for the resolve-failure path.
In `@clients/rook/src/server/mod.rs`:
- Around line 667-700: The test currently triggers a missing-model 503 and never
exercises the upstream resilience policy; change it to register a model/upstream
that resolves to a routed upstream which fails so the resilience logic runs:
using RookRegistry::open_in_memory(), add a fake model/route that points to an
upstream handler which returns 500 (or sleeps/fails) and then call
build_app_with_registry(config, registry) with ServerConfig.upstream_resilience
set as in the diff; send requests to /v1/chat/completions using that registered
model so retries/concurrency/cooldown are exercised, and assert a
policy-dependent observable (e.g., number of attempts, that a second concurrent
request is rejected due to max_concurrent_upstream_requests, or that a
subsequent request within failure_cooldown is short-circuited) rather than
relying on missing-model 503.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 97fe643e-cd74-48b7-8c84-de016d352137
📒 Files selected for processing (11)
clients/rook/README.mdclients/rook/src/admin/handlers.rsclients/rook/src/admin/mod.rsclients/rook/src/admin/types.rsclients/rook/src/config/mod.rsclients/rook/src/doctor.rsclients/rook/src/gateway/handlers.rsclients/rook/src/gateway/mod.rsclients/rook/src/main.rsclients/rook/src/observability.rsclients/rook/src/server/mod.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: sonar
- GitHub Check: pr-checks
- GitHub Check: semgrep-cloud-platform/scan
- GitHub Check: submit-gradle
- GitHub Check: Cloudflare Pages
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs
⚙️ CodeRabbit configuration file
**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.
Files:
clients/rook/src/gateway/mod.rsclients/rook/src/admin/types.rsclients/rook/src/admin/handlers.rsclients/rook/src/server/mod.rsclients/rook/src/observability.rsclients/rook/src/config/mod.rsclients/rook/src/main.rsclients/rook/src/gateway/handlers.rsclients/rook/src/doctor.rsclients/rook/src/admin/mod.rs
**/*
⚙️ CodeRabbit configuration file
**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.
Files:
clients/rook/src/gateway/mod.rsclients/rook/src/admin/types.rsclients/rook/src/admin/handlers.rsclients/rook/README.mdclients/rook/src/server/mod.rsclients/rook/src/observability.rsclients/rook/src/config/mod.rsclients/rook/src/main.rsclients/rook/src/gateway/handlers.rsclients/rook/src/doctor.rsclients/rook/src/admin/mod.rs
**/*.{md,mdx}
⚙️ CodeRabbit configuration file
**/*.{md,mdx}: Verify technical accuracy and that docs stay aligned with code changes.
For user-facing docs, check EN/ES parity or explicitly note pending translation gaps.
Files:
clients/rook/README.md
🔇 Additional comments (5)
clients/rook/src/gateway/mod.rs (1)
32-32:Debugderive onUpstreamResiliencePolicylooks good.This improves operational introspection without changing behavior.
clients/rook/src/admin/types.rs (1)
148-160: Operator status DTO additions are clean and consistent.The new response model matches handler/test expectations and keeps sensitive fields out of the payload.
clients/rook/src/observability.rs (2)
71-76: New upstream retry outcome metric plumbing looks solid.Registration, handle exposure, and label-set modeling are consistent with the existing observability pattern.
Also applies to: 129-133, 429-446, 472-494
512-574: Test coverage forrook_upstream_retry_outcomes_totalis strong.The added assertions verify both registration and concrete sample emission.
clients/rook/src/admin/mod.rs (1)
32-32:/statusroute integration and coverage look good.The tests validate the status contract across readiness states and include redaction checks.
Also applies to: 265-331
# Conflicts: # clients/rook/src/config/mod.rs
Deploying corvus with
|
| Latest commit: |
b152a04
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://c82e6e71.corvus-42x.pages.dev |
| Branch Preview URL: | https://feat-rook-production-readine.corvus-42x.pages.dev |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
clients/rook/src/config/mod.rs (1)
252-259:⚠️ Potential issue | 🟠 Major | ⚡ Quick winEnforce validation in
from_toml_str()to prevent boundary-check bypass.
UpstreamResilienceConfig::validate()is wired intovalidate_non_auth(), butfrom_toml_str()returns without callingvalidate(). Sinceload_from_file()usesfrom_toml_str(), direct file-load callers can still construct invalid configs (e.g., zero cooldown/backoff/concurrency) without error.Suggested fix
pub fn from_toml_str(input: &str) -> Result<Self, RookError> { let partial: PartialRookConfig = toml::from_str(input) .map_err(|error| RookError::Config(format!("invalid rook config TOML: {error}")))?; let mut config = Self::default(); partial.apply_to(&mut config); - Ok(config) + config.validate()?; + Ok(config) }As per coding guidelines, "Security first, performance second. Validate input boundaries, auth/authz implications, and secret management."
Also applies to: 294-311
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clients/rook/src/config/mod.rs` around lines 252 - 259, from_toml_str currently parses TOML into PartialRookConfig, applies it to a default RookConfig, and returns without running validation, allowing invalid UpstreamResilienceConfig values to slip through; update from_toml_str to call the appropriate validate() method on the built RookConfig (or RookConfig::validate_non_auth()/UpstreamResilienceConfig::validate as appropriate) after partial.apply_to(&mut config) and return an Err(RookError) if validation fails so callers like load_from_file cannot construct configs with out-of-bounds cooldown/backoff/concurrency; apply the same fix to the other similar loader (the block referenced around lines 294-311).
♻️ Duplicate comments (1)
clients/rook/src/gateway/handlers.rs (1)
74-85:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse opaque account labels for retry-outcome metrics.
The new
rook_upstream_retry_outcomes_totalemission usescontext.account, which is currently derived fromdecision.account.display_name(viaUpstreamMetricContext::from_decision). This exposes tenant/customer identifiers in/api/metrics.Suggested fix
fn from_decision(decision: &crate::routing::RoutingDecision) -> Self { Self { vendor: Cow::Borrowed(normalize_vendor_label(&decision.account.vendor)), - account: normalize_account_label(Some(decision.account.display_name.as_str())), + account: normalize_account_label(Some(&format!("acct_{}", decision.account.id))), model: normalize_model_label(Some(decision.logical_model.as_str())), } }As per coding guidelines, "Security first, performance second. Validate input boundaries, auth/authz implications, and secret management."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clients/rook/src/gateway/handlers.rs` around lines 74 - 85, The metric currently emits raw tenant identifiers via context.account in record_upstream_retry_outcome; change this to an opaque account label (e.g., a hashed or internal id) by producing that opaque value in UpstreamMetricContext::from_decision (or add a new field like account_opaque) and use that field in record_upstream_retry_outcome instead of context.account; ensure the opaque value is deterministic but non-identifying (hash/salted-id/bucket) and is computed from decision.account so the metric no longer exposes display_name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@clients/rook/src/server/mod.rs`:
- Around line 175-178: The code wires config.upstream_resilience directly into
runtime state without validating its fields; add explicit validation of
config.upstream_resilience (e.g. check max_concurrent_upstream_requests is >= 1
and within allowed upper bound and any other numeric/semantic constraints)
before cloning into resilience_policy and before calling
UpstreamConcurrency::new; if validation fails, return an initialization error
(or log and exit) so invalid values are rejected instead of being silently
coerced downstream. Ensure you reference and validate upstream_resilience,
resilience_policy, and the field max_concurrent_upstream_requests prior to
calling crate::gateway::UpstreamConcurrency::new.
---
Outside diff comments:
In `@clients/rook/src/config/mod.rs`:
- Around line 252-259: from_toml_str currently parses TOML into
PartialRookConfig, applies it to a default RookConfig, and returns without
running validation, allowing invalid UpstreamResilienceConfig values to slip
through; update from_toml_str to call the appropriate validate() method on the
built RookConfig (or
RookConfig::validate_non_auth()/UpstreamResilienceConfig::validate as
appropriate) after partial.apply_to(&mut config) and return an Err(RookError) if
validation fails so callers like load_from_file cannot construct configs with
out-of-bounds cooldown/backoff/concurrency; apply the same fix to the other
similar loader (the block referenced around lines 294-311).
---
Duplicate comments:
In `@clients/rook/src/gateway/handlers.rs`:
- Around line 74-85: The metric currently emits raw tenant identifiers via
context.account in record_upstream_retry_outcome; change this to an opaque
account label (e.g., a hashed or internal id) by producing that opaque value in
UpstreamMetricContext::from_decision (or add a new field like account_opaque)
and use that field in record_upstream_retry_outcome instead of context.account;
ensure the opaque value is deterministic but non-identifying
(hash/salted-id/bucket) and is computed from decision.account so the metric no
longer exposes display_name.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6d87198e-9c96-4946-988b-27699341141e
📒 Files selected for processing (6)
clients/rook/README.mdclients/rook/src/admin/handlers.rsclients/rook/src/config/mod.rsclients/rook/src/doctor.rsclients/rook/src/gateway/handlers.rsclients/rook/src/server/mod.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: pr-checks
- GitHub Check: sonar
- GitHub Check: Cloudflare Pages
- GitHub Check: submit-gradle
- GitHub Check: semgrep-cloud-platform/scan
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs
⚙️ CodeRabbit configuration file
**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.
Files:
clients/rook/src/admin/handlers.rsclients/rook/src/doctor.rsclients/rook/src/server/mod.rsclients/rook/src/config/mod.rsclients/rook/src/gateway/handlers.rs
**/*
⚙️ CodeRabbit configuration file
**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.
Files:
clients/rook/src/admin/handlers.rsclients/rook/src/doctor.rsclients/rook/src/server/mod.rsclients/rook/README.mdclients/rook/src/config/mod.rsclients/rook/src/gateway/handlers.rs
**/*.{md,mdx}
⚙️ CodeRabbit configuration file
**/*.{md,mdx}: Verify technical accuracy and that docs stay aligned with code changes.
For user-facing docs, check EN/ES parity or explicitly note pending translation gaps.
Files:
clients/rook/README.md
🔇 Additional comments (4)
clients/rook/src/config/mod.rs (1)
169-193: Good hardening on upstream resilience bounds.The new non-zero checks for attempts, cooldown, backoff, and concurrency close a real misconfiguration risk and align with production-safety intent.
clients/rook/src/admin/handlers.rs (1)
380-410: Good hardening on provider health metric labels.Using
provider_health_account_label(account.id)keeps/api/metricslabels opaque and avoids exposing human-readable account names in this new metric surface.clients/rook/src/doctor.rs (1)
232-245: External bind posture detection is now much safer.This correctly flags non-loopback IP/host binds as externally reachable while preserving loopback/localhost as local-only.
clients/rook/README.md (1)
331-332: Nice correction onrook_provider_account_healthsemantics.The updated wording avoids implying one-hot status per
vendor/accountin a scrape and better matches the emitted metric behavior.
| let resilience_policy = config.upstream_resilience.clone(); | ||
| let upstream_concurrency = crate::gateway::UpstreamConcurrency::new( | ||
| resilience_policy.max_concurrent_upstream_requests, | ||
| ); |
There was a problem hiding this comment.
Validate upstream_resilience before wiring it into runtime state.
config.upstream_resilience is now authoritative for gateway behavior, but startup validation does not check it. Invalid values can be silently coerced downstream (for example via .max(1)), which breaks config intent and fail-closed guarantees.
As per coding guidelines, "Security first, performance second. Validate input boundaries, auth/authz implications, and secret management."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@clients/rook/src/server/mod.rs` around lines 175 - 178, The code wires
config.upstream_resilience directly into runtime state without validating its
fields; add explicit validation of config.upstream_resilience (e.g. check
max_concurrent_upstream_requests is >= 1 and within allowed upper bound and any
other numeric/semantic constraints) before cloning into resilience_policy and
before calling UpstreamConcurrency::new; if validation fails, return an
initialization error (or log and exit) so invalid values are rejected instead of
being silently coerced downstream. Ensure you reference and validate
upstream_resilience, resilience_policy, and the field
max_concurrent_upstream_requests prior to calling
crate::gateway::UpstreamConcurrency::new.
|



Related Issues
Related to #677
Summary
rook doctor --json,/api/status, provider health metrics, retry outcome metrics, and a local-firstrook usage reportCLI.Tested Information
cargo fmt --manifest-path clients/rook/Cargo.tomlcargo test --manifest-path clients/rook/Cargo.tomlcargo check --manifest-path clients/rook/Cargo.tomlDocumentation Impact
clients/rook/README.mdBreaking Changes
None.
Checklist