feat: Persist Rook provider account health state #759

Merged
yacosta738 merged 3 commits into main from feat/rook-health-persistence
May 2, 2026

@yacosta738
Contributor

Summary

  • add SQLite persistence for provider account health/cooldown state
  • wire RookRegistry to use SqliteHealthService for runtime health checks
  • preserve missing-row Unknown semantics and recover availability after expired cooldowns

Tests

  • cargo test --manifest-path clients/rook/Cargo.toml

Closes #683

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented May 2, 2026

Deploying corvus with Cloudflare Pages

Latest commit: f31a3b8
Status: ✅  Deploy successful!
Preview URL: https://1bc71370.corvus-42x.pages.dev
Branch Preview URL: https://feat-rook-health-persistence.corvus-42x.pages.dev

@coderabbitai
Contributor

coderabbitai Bot commented May 2, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Provider account health status and cooldown state now persist across application restarts for improved account management.
  • Security

    • Enhanced path validation with stricter safety checks to prevent malicious inputs and path traversal attempts.

Walkthrough

Two independent changes: (1) security policy hardens path validation to check raw inputs before URL-decoding/dequoting, shifting decoded-path checks to command argument normalization; (2) Rook gains persistent health state via SQLite storage, replacing in-memory tracking with a new database schema, persistence layer, and updated service wiring.

Changes

Security Policy Raw-Path Validation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Core Validation Logic | clients/agent-runtime/src/security/policy.rs (lines 595–620) | is_path_allowed now validates the raw path string directly, rejecting inputs containing \0, backslashes, %, or .. components before any URL-decoding or dequoting; only expand_tilde operates on the unmodified input. |
| Command Argument Normalization | clients/agent-runtime/src/security/policy.rs (lines 822–828) | normalize_arg_for_path_checks now URL-decodes the token first, rejects if the decoded result still contains %, then dequotes the decoded value; shifts handling of encoded/quoted paths to the command-parsing layer. |
| Tests & Validation | clients/agent-runtime/src/security/policy.rs (lines 2210–2231) | Removed test asserting quoted paths are blocked by is_path_allowed; added tests confirming raw-path validation (quoted strings are allowed) and command-layer decoding/dequoting (encoded/quoted absolute/traversal paths are blocked during command parsing). |
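
The command-layer ordering summarized above (decode, reject leftover %, then dequote) can be sketched roughly as follows. This is an illustration only, not the code in policy.rs: the helper names percent_decode_once, strip_matching_quotes, and normalize_arg are hypothetical, and rejecting malformed escapes is an assumption the PR does not confirm.

```rust
/// Decode `%XX` escapes exactly once. Returns `None` on a malformed escape.
/// (Rejecting malformed escapes is an assumption of this sketch.)
fn percent_decode_once(input: &str) -> Option<String> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            // Need exactly two hex digits after the percent sign.
            let hex = bytes.get(i + 1..i + 3)?;
            if !hex.iter().all(u8::is_ascii_hexdigit) {
                return None;
            }
            let hex = std::str::from_utf8(hex).ok()?;
            out.push(u8::from_str_radix(hex, 16).ok()?);
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    String::from_utf8(out).ok()
}

/// Strip one pair of matching single or double quotes, if present.
fn strip_matching_quotes(s: &str) -> &str {
    for q in ['"', '\''] {
        if s.len() >= 2 && s.starts_with(q) && s.ends_with(q) {
            return &s[1..s.len() - 1];
        }
    }
    s
}

/// Hypothetical stand-in for `normalize_arg_for_path_checks`:
/// decode first, reject anything still carrying `%`, then dequote.
fn normalize_arg(token: &str) -> Option<String> {
    let decoded = percent_decode_once(token)?;
    if decoded.contains('%') {
        // Double encoding such as `%252e` leaves a `%` behind after one pass.
        return None;
    }
    Some(strip_matching_quotes(&decoded).to_string())
}
```

Under these assumptions, an encoded traversal token such as %2e%2e/secret normalizes to ../secret, so a later traversal check sees the real component, while a double-encoded token is rejected outright.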

Rook Health Persistence

| Layer | File(s) | Summary |
| --- | --- | --- |
| Database Schema | clients/rook/migrations/0006_health_persistence.sql | New provider_account_health table persists account status, cooldown windows, consecutive failure counts, and timestamps; includes cascade-delete foreign key to provider_accounts. |
| Migration Runner | clients/rook/src/db/mod.rs (lines 49–52, 251–265) | Embeds the 0006_health_persistence migration and integrates it into SqliteDb::run_migrations, checking schema_migrations and applying the migration if missing. |
| Persistence Layer | clients/rook/src/db/health.rs | New module implements helpers (status_to_db_str, parse_optional_rfc3339, row_to_health) and three public async methods on SqliteDb: get_account_health (retrieves or returns None), upsert_account_health_success (clears failures/cooldown), and upsert_account_health_failure (increments failures, sets cooldown); includes atomicity via ON CONFLICT upserts and comprehensive error handling. |
| Service Implementation | clients/rook/src/services/health.rs (lines 192–267) | New SqliteHealthService implements HealthService by reading/writing to SqliteDb, defaulting to healthy on missing rows or read errors, logging warnings on persistence failures; all health methods (get, mark_success, mark_failure, is_available, list_healthy) now durable. |
| Service Wiring | clients/rook/src/registry/mod.rs (lines 24–83, 124–129) | RookRegistry::from_db constructs SqliteHealthService instead of InMemoryHealthService, and the health() accessor now returns &SqliteHealthService; updated trait imports accordingly. |
| Dependency Update | clients/rook/Cargo.toml (line 71) | corvus-traits version updated from 0.1.0 to 0.2.2 (same path dependency). |
| Integration Tests | clients/rook/src/db/mod.rs (lines 509–546), clients/rook/src/services/health.rs (lines 379–435), clients/rook/src/registry/mod.rs (lines 219–258) | In-memory DB tests verify schema columns and migration recording; SQLite service tests confirm missing-row defaulting, cooldown/failure persistence across service instances, cooldown expiry, and concurrent failure increment correctness; registry test verifies health state survives database reopen. |
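
The upsert semantics described for the persistence layer (missing row returns None, a failure increments the counter and sets a cooldown, a success clears both) can be modeled without a database. The sketch below is an in-memory stand-in, not the SqliteDb code: HealthTable, HealthRow, Status, the u64 account id, and the assumption that a failure marks the row Unhealthy are all hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for the persisted status column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Healthy,
    Unhealthy,
}

/// Hypothetical shape of one persisted health row.
#[derive(Debug, Clone, PartialEq)]
struct HealthRow {
    status: Status,
    consecutive_failures: u32,
    cooldown_until: Option<SystemTime>,
}

/// In-memory model of the table, keyed by an assumed numeric account id.
#[derive(Default)]
struct HealthTable {
    rows: HashMap<u64, HealthRow>,
}

impl HealthTable {
    /// A missing row is reported as `None`, not as a default record.
    fn get(&self, account_id: u64) -> Option<&HealthRow> {
        self.rows.get(&account_id)
    }

    /// Success resets the failure counter and clears any cooldown.
    fn upsert_success(&mut self, account_id: u64) {
        self.rows.insert(
            account_id,
            HealthRow {
                status: Status::Healthy,
                consecutive_failures: 0,
                cooldown_until: None,
            },
        );
    }

    /// Failure inserts the row if needed, bumps the counter, and sets a cooldown.
    fn upsert_failure(&mut self, account_id: u64, now: SystemTime, cooldown: Duration) {
        let row = self.rows.entry(account_id).or_insert(HealthRow {
            status: Status::Unhealthy,
            consecutive_failures: 0,
            cooldown_until: None,
        });
        row.status = Status::Unhealthy;
        row.consecutive_failures = row.consecutive_failures.saturating_add(1);
        row.cooldown_until = Some(now + cooldown);
    }
}
```

In the real module the insert-or-increment step is a single ON CONFLICT upsert, so the database performs the increment; the entry-based update here only imitates that behavior for a single thread.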

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

area:rust, risk:high

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.10% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The PR description is brief and covers key objectives but lacks structured detail in several template sections; 'Tested Information' and 'Documentation Impact' are minimal. | Expand 'Tested Information' with specifics (test output, coverage), clarify 'Documentation Impact' explicitly, and add detailed context on API compatibility or breaking changes if any exist. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title follows Conventional Commit style with 'feat:' prefix, is descriptive of the main change (persisting Rook provider account health), and is 48 characters (well under the 72-character limit). |
| Linked Issues check | ✅ Passed | Code changes comprehensively address issue #683: SQLite persistence for health/cooldown state, durable storage model, startup recovery, expiry semantics, and API compatibility are all implemented and tested. |
| Out of Scope Changes check | ✅ Passed | All code changes directly support the linked issue: new migration, health DB module, SQLite service implementation, registry wiring, and security policy updates for path validation are all in scope. |

@github-actions github-actions Bot added the size/l label (denotes a large change size) May 2, 2026
@yacosta738 yacosta738 changed the title from "Persist Rook provider account health state" to "feat: Persist Rook provider account health state" May 2, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@clients/agent-runtime/src/security/policy.rs`:
- Around line 607-620: is_path_allowed currently lets quoted paths (e.g.,
"/etc/passwd" or '../secret') bypass the absolute/traversal checks because
quotes hide components; update is_path_allowed to detect and reject quoted
direct-path inputs by checking for a matching leading and trailing single or
double quote and returning false (or alternatively unquote first and then re-run
the existing validations) before the percent-sign and
Path::new(path).components() checks; modify the logic around the existing
percent check and the call to expand_tilde so quoted strings cannot bypass
absolute-path or ParentDir detection in is_path_allowed.

In `@clients/rook/src/services/health.rs`:
- Around line 247-255: InMemoryHealthService::is_available currently treats
Unhealthy as always unavailable, which diverges from
SqliteHealthService::is_available that only respects cooldowns; change
InMemoryHealthService::is_available to mirror SqliteHealthService by looking up
health.cooldown_until and returning false only while Utc::now() <
cooldown_until, otherwise return true (i.e., do not special-case Unhealthy
status), so tests reflect production recovery semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fa7f1d76-c672-4270-8e34-cb98627fbd30

📥 Commits

Reviewing files that changed from the base of the PR and between a70ec1e and f31a3b8.

⛔ Files ignored due to path filters (1)
  • clients/rook/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • clients/agent-runtime/src/security/policy.rs
  • clients/rook/Cargo.toml
  • clients/rook/migrations/0006_health_persistence.sql
  • clients/rook/src/db/health.rs
  • clients/rook/src/db/mod.rs
  • clients/rook/src/registry/mod.rs
  • clients/rook/src/services/health.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: pr-checks
  • GitHub Check: submit-gradle
  • GitHub Check: sonar
  • GitHub Check: semgrep-cloud-platform/scan
  • GitHub Check: Cloudflare Pages
🧰 Additional context used
📓 Path-based instructions (6)
**/*.rs

⚙️ CodeRabbit configuration file

**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.

Files:

  • clients/rook/src/services/health.rs
  • clients/rook/src/db/mod.rs
  • clients/rook/src/registry/mod.rs
  • clients/rook/src/db/health.rs
  • clients/agent-runtime/src/security/policy.rs
**/*

⚙️ CodeRabbit configuration file

**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.

Files:

  • clients/rook/src/services/health.rs
  • clients/rook/migrations/0006_health_persistence.sql
  • clients/rook/src/db/mod.rs
  • clients/rook/src/registry/mod.rs
  • clients/rook/src/db/health.rs
  • clients/agent-runtime/src/security/policy.rs
  • clients/rook/Cargo.toml
clients/agent-runtime/src/{security,gateway,tools}/**/*.rs

📄 CodeRabbit inference engine (clients/agent-runtime/AGENTS.md)

Treat src/security/, src/gateway/, src/tools/ as high-risk surfaces and never broaden filesystem/network execution scope without explicit policy checks

Files:

  • clients/agent-runtime/src/security/policy.rs
clients/agent-runtime/src/**/*.rs

📄 CodeRabbit inference engine (clients/agent-runtime/AGENTS.md)

clients/agent-runtime/src/**/*.rs: Never log secrets, tokens, raw credentials, or sensitive payloads in any logging statements
Avoid unnecessary allocations, clones, and blocking operations to maintain performance and efficiency

Files:

  • clients/agent-runtime/src/security/policy.rs
clients/agent-runtime/**/*.rs

📄 CodeRabbit inference engine (clients/agent-runtime/AGENTS.md)

Run cargo fmt --all -- --check, cargo clippy --all-targets -- -D warnings, and cargo test for code validation, or document which checks were skipped and why

Files:

  • clients/agent-runtime/src/security/policy.rs
clients/agent-runtime/src/{security,gateway,tools,config}/**/*.rs

📄 CodeRabbit inference engine (clients/agent-runtime/AGENTS.md)

Do not silently weaken security policy or access constraints; keep default behavior secure-by-default with deny-by-default where applicable

Files:

  • clients/agent-runtime/src/security/policy.rs
🔇 Additional comments (15)
clients/rook/Cargo.toml (1)

71-71: LGTM!

Internal dependency version bump with path override intact.

clients/rook/migrations/0006_health_persistence.sql (1)

1-11: LGTM!

Schema aligns with the AccountHealth struct, FK cascade correctly cleans up orphaned health rows, and CREATE TABLE IF NOT EXISTS ensures idempotency.

clients/rook/src/db/mod.rs (3)

10-10: LGTM!

New health module export correctly placed.


49-52: LGTM!

Migration 0006 embedding and conditional application follows the established pattern consistently.

Also applies to: 251-264


509-546: LGTM!

Tests verify both schema columns and migration version recording.

clients/rook/src/registry/mod.rs (2)

27-27: LGTM!

Registry correctly wired to use SqliteHealthService, maintaining API compatibility via the HealthService trait.

Also applies to: 43-43, 82-82, 127-128


219-258: LGTM!

Solid persistence test - creates account, marks failure, closes registry, reopens from same file, and verifies health state (status, failures, cooldown, availability) survived the reopen.

clients/rook/src/services/health.rs (3)

192-206: LGTM!

Clean struct definition and constructor.


207-245: LGTM!

Graceful degradation on read/write errors - logs warnings and falls back to defaults rather than propagating failures that would break routing.


378-435: LGTM!

Good coverage: missing-row semantics, cross-instance persistence, cooldown expiry, and concurrent failure increments. SQLite's write serialization ensures the concurrent test is deterministic.

clients/rook/src/db/health.rs (5)

10-29: LGTM!

Bidirectional status conversion is exhaustive and handles unknown values with a clear error.


44-85: LGTM!

Row parsing with proper error context. The u32::try_from guard handles theoretical overflow gracefully.


87-103: LGTM!

Parameterized query prevents injection. Clean optional row handling.


105-167: LGTM!

Atomic upserts via ON CONFLICT DO UPDATE handle both insert and increment cases correctly. The failure upsert's consecutive_failures + 1 is evaluated atomically by SQLite, avoiding read-modify-write races.


170-237: LGTM!

Tests cover the key scenarios: missing rows, failure round-trips with increment verification, and success clearing state.

Comment on lines +607 to +620
     // Block percent signs rather than decoding direct path inputs here.
     if path.contains('%') {
         return false;
     }

     // Block path traversal: check for ".." as a path component
-    if Path::new(&dequoted)
+    if Path::new(path)
         .components()
         .any(|c| matches!(c, std::path::Component::ParentDir))
     {
         return false;
     }

-    let expanded = expand_tilde(&dequoted);
+    let expanded = expand_tilde(path);
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject quoted direct-path inputs here.

is_path_allowed is still the guard used for raw tool parameters in clients/agent-runtime/src/tools/file_read.rs and clients/agent-runtime/src/tools/glob.rs. With dequoting removed, inputs like "/etc/passwd" or "../secret" now pass this check because the quotes hide the absolute/traversal component from Path::components() and the absolute-path check. That weakens the direct-path policy surface, and Lines 2216-2217 lock the regression in.

Either reject quotes here for non-shell path APIs, or normalize every path-taking tool before calling this helper.
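
To illustrate why quotes hide these components, here is a small sketch using only std::path on a Unix-like target. The function raw_checks_pass is a simplified, hypothetical stand-in for is_path_allowed (the real function presumably also consults workspace and policy settings), and raw_checks_with_quote_guard shows the quote rejection proposed in this comment.

```rust
use std::path::{Component, Path};

/// Simplified stand-in for the raw-path checks: no NUL, backslash, or
/// percent sign, no `..` component, and no absolute path.
fn raw_checks_pass(path: &str) -> bool {
    if path.contains('\0') || path.contains('\\') || path.contains('%') {
        return false;
    }
    if Path::new(path)
        .components()
        .any(|c| matches!(c, Component::ParentDir))
    {
        return false;
    }
    !Path::new(path).is_absolute()
}

/// Same checks, with quoted inputs rejected up front.
fn raw_checks_with_quote_guard(path: &str) -> bool {
    if path.contains('"') || path.contains('\'') {
        return false;
    }
    raw_checks_pass(path)
}
```

With a leading quote, the first component of "../secret" is the ordinary name `"..` and the string `"/etc/passwd"` no longer starts with a separator, so both slip past raw_checks_pass, while the guarded variant rejects them.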

Suggested fix
     // Block backslashes (Windows-style separators or escaping)
     if path.contains('\\') {
         return false;
     }

+    // Direct path parameters should never arrive shell-quoted.
+    if path.contains('"') || path.contains('\'') {
+        return false;
+    }
+
     // Block percent signs rather than decoding direct path inputs here.
     if path.contains('%') {
         return false;
     }

As per coding guidelines, "Treat src/security/, src/gateway/, src/tools/ as high-risk surfaces and never broaden filesystem/network execution scope without explicit policy checks" and "Do not silently weaken security policy or access constraints; keep default behavior secure-by-default with deny-by-default where applicable."

Comment on lines +247 to +255
async fn is_available(&self, account_id: AccountId) -> bool {
    let health = self.get(account_id).await;
    if let Some(until) = health.cooldown_until {
        if Utc::now() < until {
            return false;
        }
    }
    true
}
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Behavioral inconsistency: is_available differs from InMemoryHealthService.

InMemoryHealthService.is_available() (lines 168-179) returns false for Unhealthy status regardless of cooldown. SqliteHealthService only checks cooldown expiry.

Per PR objectives ("availability is recovered after cooldowns expire"), this appears intentional. However, tests using InMemoryHealthService won't reflect production recovery semantics.

Align InMemoryHealthService.is_available() to match, or document the divergence explicitly.

Proposed fix to align InMemoryHealthService
 impl HealthService for InMemoryHealthService {
     async fn is_available(&self, account_id: AccountId) -> bool {
         let health = self.get(account_id).await;
-        if health.status == HealthStatus::Unhealthy {
-            return false;
-        }
         if let Some(until) = health.cooldown_until {
             if Utc::now() < until {
                 return false;
             }
         }
         true
     }
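
The divergence is easiest to see side by side. The sketch below is synchronous and uses std::time instead of chrono so that it stays self-contained; AccountHealth and HealthStatus here are minimal hypothetical versions of the real types, and the two free functions only mirror the two is_available variants discussed above.

```rust
use std::time::SystemTime;

/// Minimal hypothetical versions of the real types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HealthStatus {
    Healthy,
    Unhealthy,
}

#[derive(Debug, Clone)]
struct AccountHealth {
    status: HealthStatus,
    cooldown_until: Option<SystemTime>,
}

/// Cooldown-only rule: unavailable only while `now` is before `cooldown_until`.
fn available_cooldown_only(health: &AccountHealth, now: SystemTime) -> bool {
    match health.cooldown_until {
        Some(until) => now >= until,
        None => true,
    }
}

/// Status-plus-cooldown rule: an Unhealthy status blocks the account
/// even after its cooldown has expired.
fn available_status_and_cooldown(health: &AccountHealth, now: SystemTime) -> bool {
    if health.status == HealthStatus::Unhealthy {
        return false;
    }
    available_cooldown_only(health, now)
}
```

For an Unhealthy account whose cooldown has already expired, the cooldown-only rule reports the account as available again while the status-plus-cooldown rule does not, which is the gap between the SQLite-backed and in-memory services described in this comment.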

@sonarqubecloud

sonarqubecloud Bot commented May 2, 2026

@yacosta738 yacosta738 merged commit 930700f into main May 2, 2026
17 checks passed
@yacosta738 yacosta738 deleted the feat/rook-health-persistence branch May 2, 2026 19:35
Labels

area:rust, risk:high, size/l (denotes a large change size)

Development

Successfully merging this pull request may close these issues.

Persist provider health state across restarts
