feat: rpc fallback & metrics validation#2324

Open
sergeytimoshin wants to merge 6 commits into main from sergey/forester-epoch-registration

Conversation

@sergeytimoshin
Contributor

@sergeytimoshin commented Mar 3, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added fallback RPC and indexer endpoints for improved resilience during primary service outages.
    • Implemented RPC pool health monitoring with configurable failure thresholds and recovery probes.
    • Added Prometheus metrics scrape endpoint for enhanced monitoring capabilities.
    • Dashboard now displays registration status and per-tree queue capacity metrics.
    • Enhanced epoch detection and registration tracking with new metrics.
    • Added transaction failure tracking and categorization by failure reason.
  • Documentation

    • Significantly expanded README with comprehensive configuration sections, command references, and operational guidance.
  • Tests

    • Added metrics contract validation test to ensure configuration consistency.
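The contract validation test mentioned above could look roughly like the sketch below. The metric names and the inline "contract" list here are illustrative only; the actual test reads forester/metrics-contract.json and compares it against the in-code METRIC_DESCRIPTORS.

```rust
/// Illustrative metric names — the real set lives in forester/src/metrics.rs.
const METRIC_DESCRIPTORS: &[&str] = &[
    "forester_epoch_detected",
    "forester_epoch_registered",
    "forester_transactions_failed_total",
];

/// Return a human-readable list of mismatches between the committed
/// contract and the descriptors registered in code.
fn contract_mismatches(contract: &[&str], descriptors: &[&str]) -> Vec<String> {
    let mut diffs = Vec::new();
    for name in contract {
        if !descriptors.contains(name) {
            diffs.push(format!("in contract but not in code: {name}"));
        }
    }
    for name in descriptors {
        if !contract.contains(name) {
            diffs.push(format!("in code but not in contract: {name}"));
        }
    }
    diffs
}

fn main() {
    // In the real test the left-hand side is parsed from metrics-contract.json.
    let contract = [
        "forester_epoch_detected",
        "forester_epoch_registered",
        "forester_transactions_failed_total",
    ];
    let diffs = contract_mismatches(&contract, METRIC_DESCRIPTORS);
    assert!(diffs.is_empty(), "metrics contract out of sync: {diffs:?}");
    println!("contract and descriptors are in sync");
}
```

The two-directional comparison matters: a metric added in code but missing from the contract file fails the test just as a stale contract entry does.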

@coderabbitai
Contributor

coderabbitai bot commented Mar 3, 2026

📝 Walkthrough

Walkthrough

This PR extends Forester's RPC resilience with fallback configuration, comprehensively refactors the metrics system using macros with centralized definitions, enriches queue data structures with batch and capacity tracking, and updates the dashboard and status reporting to expose registration and queue state information.

Changes

Cohort / File(s) Summary
Environment & Configuration
forester/.env.example, forester/src/cli.rs, forester/src/config.rs
Added fallback RPC/indexer URLs and RPC pool health-check configuration (failure_threshold, primary_probe_interval_secs) to StartArgs and ExternalServicesConfig. Updated PROCESSOR_MODE default from v1 to "all".
Metrics Infrastructure
forester/src/metrics.rs, forester/metrics-contract.json, forester/tests/metrics_contract_test.rs
Introduced macro-based metric definitions (define_metrics) with centralized METRIC_DESCRIPTORS, added metric queuing/batching with queue_metric_update and process_queued_metrics, and created metrics-contract.json as a single source of truth. New metrics: epoch_detected, epoch_registered, transactions_failed_total, indexer response times.
Queue & Status Data Models
forester/src/queue_helpers.rs, forester/src/forester_status.rs, forester/dashboard/src/types/forester.ts
Refactored fetch_queue_item_data to return QueueFetchResult (items, capacity, total_pending); extended V2QueueInfo with batch_size and total ZKP batch counts; added TreeStatus.queue_capacity and ForesterStatus.registration_is_open fields.
Dashboard Components
forester/dashboard/src/app/page.tsx, forester/dashboard/src/components/EpochCard.tsx, forester/dashboard/src/components/TreeTable.tsx
Enhanced registration status gating to check registration_is_open, updated status indicators and countdown display logic, and introduced V2QueueCell component to render input/output queue counts with optional ZKP batch totals.
Transaction Processing & Metrics
forester/src/processor/v1/send_transaction.rs, forester/src/processor/v1/helpers.rs, forester/src/processor/v2/common.rs
Added blockhash refresh logic with periodic RPC updates, incremented transaction failure metrics on send failures and timeouts, and threaded tree_pubkey into indexer proof count tracking.
Core Pipeline & Epoch Tracking
forester/src/lib.rs, forester/src/api_server.rs, forester/src/epoch_manager.rs
Wired RPC fallback configuration into SolanaRpcPoolBuilder, added Prometheus metrics scrape endpoint, and introduced update_epoch_detected/update_epoch_registered metric hooks at lifecycle transition points.
Documentation & Configuration
forester/README.md, forester/.gitignore, forester/dashboard/.gitignore
Significantly expanded README with table-driven configuration sections, updated command syntax, removed deprecated Prover V2 section, and clarified fallback RPC behavior. Unignored metrics-contract.json.
Tests
forester/tests/e2e_test.rs, forester/tests/priority_fee_test.rs, forester/tests/test_utils.rs, forester/tests/legacy/*
Updated test configurations to include fallback URLs and RPC pool fields; added metrics_contract_test.rs to validate metrics-contract.json sync with METRIC_DESCRIPTORS.
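As a rough illustration of the failover behavior these changes describe — switch to the fallback endpoint after a configurable number of consecutive health-check failures, and return to the primary once a probe succeeds — the selection logic could be sketched as below. Type and method names here are hypothetical, not the crate's actual API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical sketch of fallback endpoint selection driven by
/// `failure_threshold`. A background probe (not shown) would call
/// `record_health_check` periodically for the primary endpoint.
struct EndpointSelector {
    primary_url: String,
    fallback_url: Option<String>,
    failure_threshold: u64,
    consecutive_failures: AtomicU64,
}

impl EndpointSelector {
    fn new(primary: &str, fallback: Option<&str>, failure_threshold: u64) -> Self {
        Self {
            primary_url: primary.to_string(),
            fallback_url: fallback.map(str::to_string),
            failure_threshold,
            consecutive_failures: AtomicU64::new(0),
        }
    }

    /// Record one health-check result for the primary endpoint.
    /// A success resets the failure counter, restoring the primary.
    fn record_health_check(&self, healthy: bool) {
        if healthy {
            self.consecutive_failures.store(0, Ordering::SeqCst);
        } else {
            self.consecutive_failures.fetch_add(1, Ordering::SeqCst);
        }
    }

    /// Pick the URL to use: fall back only once the threshold is reached
    /// and a fallback endpoint is actually configured.
    fn current_url(&self) -> &str {
        let failures = self.consecutive_failures.load(Ordering::SeqCst);
        match &self.fallback_url {
            Some(fb) if failures >= self.failure_threshold => fb,
            _ => &self.primary_url,
        }
    }
}

fn main() {
    let selector = EndpointSelector::new("https://primary.example", Some("https://fallback.example"), 3);
    for _ in 0..3 {
        selector.record_health_check(false);
    }
    // After 3 consecutive failures the fallback is selected.
    println!("active endpoint: {}", selector.current_url());
}
```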

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

ai-review

Suggested reviewers

  • ananas-block

Poem

🔗 Fallback paths guide the way,
Metrics whisper, queues display,
Epochs detected, batches bright—
Registration dances in the light. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 52.27%, below the required threshold of 70.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'feat: rpc fallback & metrics validation' accurately reflects the main changes; RPC fallback configuration and metrics contract validation are central to this PR.




- Simplified the handling of epoch registration and processing by removing redundant checks and consolidating error handling.
- Enhanced logging for better traceability of epoch processing events.
- Updated the logic to always process the current epoch regardless of registration status.
- Improved the registration phase checks to determine if registration is open or closed.
- Adjusted the handling of target epochs to streamline the flow of sending epochs for processing.

Enhance ForesterStatus to include registration state

- Added `registration_is_open` field to `ForesterStatus` to indicate if registration is currently open.
- Updated logic in `get_forester_status_with_options` to determine the current registration state and adjust the next registration epoch accordingly.
- Modified calculations for slots until next registration based on the registration state.

Update queue helpers to track total ZKP batches

- Extended `V2QueueInfo` and related structures to include `input_total_zkp_batches` and `output_total_zkp_batches`.
- Adjusted parsing functions to calculate total ZKP batches based on batch size and ZKP batch size.
- Ensured compatibility with existing queue processing logic.

Deploy script adjustments

- Updated the deploy script to focus on the `light_registry` library, removing references to other libraries.
@sergeytimoshin force-pushed the sergey/forester-epoch-registration branch from 4b96f12 to f0b8e74 on March 3, 2026 at 13:04
- Introduced fallback RPC and indexer URLs in the environment configuration and CLI arguments to enhance resilience.
- Updated README.md to document new fallback options and their usage.
- Modified the API server and RPC pool to support automatic failover to fallback URLs on health check failures.
- Enhanced the configuration structs and tests to include fallback settings.
- Refactored related code for improved readability and maintainability.
@sergeytimoshin changed the title from "Sergey/forester epoch registration" to "feat: rpc fallback & metrics validation" on Mar 4, 2026
@sergeytimoshin marked this pull request as ready for review on March 4, 2026 at 00:19

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
forester/src/epoch_manager.rs (1)

950-1011: ⚠️ Potential issue | 🟠 Major

Prevent forester_epoch_* gauges from moving backwards.

At Line [950] and Line [1011], metrics are set per processed epoch. Because older/newer epochs are queued and processed concurrently (e.g., Line [886]-[917]), an older epoch can overwrite a newer gauge value. For “latest epoch” metrics, this produces stale/regressing values.

Use monotonic update semantics (max(existing, incoming)) in metric helpers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@forester/src/epoch_manager.rs` around lines 950 - 1011, The epoch metrics can
regress because update_epoch_detected and update_epoch_registered set gauges
directly and concurrent processing can overwrite newer values with older ones;
change their implementations (the metric helper(s) used by
update_epoch_detected/update_epoch_registered) to perform monotonic updates by
reading the current gauge value and setting it to max(current, incoming) (i.e.,
only update when incoming > current) so gauges never move backwards; apply this
max(existing, incoming) logic in the metric helper(s) so any callers like
update_epoch_detected/update_epoch_registered benefit without changing call
sites.
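The monotonic-max helper suggested in this finding can be sketched with std's atomics; `AtomicU64::fetch_max` already gives the compare-and-set-if-greater semantics in one call. The type name below is illustrative, not the crate's actual metrics API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of a monotonic gauge: the stored value only moves forward,
/// so a late-finishing task for an older epoch cannot regress it.
struct MonotonicGauge {
    value: AtomicU64,
}

impl MonotonicGauge {
    fn new() -> Self {
        Self { value: AtomicU64::new(0) }
    }

    /// Atomically set the gauge to max(current, incoming).
    /// `fetch_max` is lock-free and returns the previous value.
    fn set_monotonic(&self, incoming: u64) {
        self.value.fetch_max(incoming, Ordering::SeqCst);
    }

    fn get(&self) -> u64 {
        self.value.load(Ordering::SeqCst)
    }
}

fn main() {
    let gauge = MonotonicGauge::new();
    gauge.set_monotonic(5); // newer epoch's task finishes first
    gauge.set_monotonic(3); // older epoch's task finishes late: ignored
    println!("latest epoch = {}", gauge.get()); // prints 5
}
```

Applying this inside the metric helpers means callers like update_epoch_detected/update_epoch_registered need no changes at their call sites, as the finding suggests.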
forester/src/config.rs (1)

150-160: ⚠️ Potential issue | 🟠 Major

RpcPoolConfig::default() is unsafe with derived zero-values.

At Line 150, #[derive(Default)] now makes failure_threshold and primary_probe_interval_secs default to 0, which is an unsafe runtime baseline for pool health behavior.

🔧 Proposed fix
-#[derive(Debug, Clone, Copy, Default)]
+#[derive(Debug, Clone, Copy)]
 pub struct RpcPoolConfig {
     pub max_size: u32,
     pub connection_timeout_secs: u64,
     pub idle_timeout_secs: u64,
     pub max_retries: u32,
     pub initial_retry_delay_ms: u64,
     pub max_retry_delay_ms: u64,
     pub failure_threshold: u64,
     pub primary_probe_interval_secs: u64,
 }
+
+impl Default for RpcPoolConfig {
+    fn default() -> Self {
+        Self {
+            max_size: 10,
+            connection_timeout_secs: 15,
+            idle_timeout_secs: 300,
+            max_retries: 10,
+            initial_retry_delay_ms: 1000,
+            max_retry_delay_ms: 16000,
+            failure_threshold: 3,
+            primary_probe_interval_secs: 30,
+        }
+    }
+}

Based on learnings: RPC pool configuration should use --rpc-pool-size (default 10), --rpc-pool-connection-timeout-secs (default 15), and --rpc-pool-idle-timeout-secs (default 300) for optimizing connection pooling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@forester/src/config.rs` around lines 150 - 160, RpcPoolConfig's derived
Default yields unsafe zero-values for fields like failure_threshold and
primary_probe_interval_secs; replace the derived Default with an explicit impl
Default for RpcPoolConfig that returns safe, non-zero defaults (e.g., max_size =
10, connection_timeout_secs = 15, idle_timeout_secs = 300) and set sensible
values for max_retries, initial_retry_delay_ms, max_retry_delay_ms,
failure_threshold, and primary_probe_interval_secs so pool health behavior is
initialized correctly; update or remove #[derive(Default)] and implement
RpcPoolConfig::default() to return these recommended defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@forester/.env.example`:
- Line 61: The dotenv-linter warning is caused by an unquoted environment value;
update the PROCESSOR_MODE declaration (the `PROCESSOR_MODE` variable in
.env.example) to use quotes around the value (e.g., "all") so dotenv-linter no
longer flags ValueWithoutQuotes, ensuring the variable remains functionally
equivalent but lint-clean.
- Around line 25-30: The .env.example file is missing the top-level bootstrap
instruction required by our setup guidelines; add a clear one-line note at the
top of forester/.env.example instructing first-time users to copy the example
into .env (for example: "cp .env.example .env") so the Forester service is
initialized correctly before editing environment values; update the file near
the header (the commented RPC fallback block such as
FALLBACK_RPC_URL/FALLBACK_INDEXER_URL) to include this instruction as the first
visible line.

In `@forester/dashboard/tsconfig.tsbuildinfo`:
- Line 1: The PR accidentally adds the generated TypeScript build cache file
tsconfig.tsbuildinfo; remove this file from the commit/PR and update Git ignore
rules by adding *.tsbuildinfo (or forester/dashboard/*.tsbuildinfo) to
.gitignore so tsbuildinfo files are not tracked going forward, then commit the
.gitignore change and use git rm --cached tsconfig.tsbuildinfo (or equivalent)
before pushing the updated branch.

In `@forester/src/api_server.rs`:
- Around line 1177-1181: The prometheus_route currently never sees real errors
because metrics::metrics_handler swallows encoding failures and returns
Ok(empty_string); change metrics::metrics_handler to propagate failures (return
Result<String, E> or Result<impl Reply, E> instead of Ok("")) and update
prometheus_route to translate an Err into an explicit 5xx warp response (e.g.,
return a warp rejection or a warp reply with StatusCode::INTERNAL_SERVER_ERROR)
instead of treating everything as success; reference the prometheus_route and
metrics::metrics_handler symbols when making these changes so encoding errors
surface as HTTP 5xx responses to Prometheus scrapes.

In `@forester/src/cli.rs`:
- Around line 174-180: The CLI accepts zero for failover timing which causes
pathological behavior; update the argument definitions for
rpc_pool_failure_threshold and rpc_pool_primary_probe_interval_secs to validate
non-zero values by adding a value_parser using
clap::value_parser!(u64).range(1..) so parsing rejects zero; modify the
attribute on the rpc_pool_failure_threshold field and the
rpc_pool_primary_probe_interval_secs field (the struct-level arg annotations) to
include value_parser = clap::value_parser!(u64).range(1..) so users cannot
supply 0 at parse time.

In `@forester/src/config.rs`:
- Around line 351-352: The config currently assigns
args.rpc_pool_failure_threshold and args.rpc_pool_primary_probe_interval_secs
directly (failure_threshold and primary_probe_interval_secs), allowing zero
values; add a lower-bound validation where configuration is built (the code that
sets failure_threshold and primary_probe_interval_secs) to ensure both values
are >= 1 (or fail startup with a clear error), e.g., check
args.rpc_pool_failure_threshold >= 1 and
args.rpc_pool_primary_probe_interval_secs >= 1 and return/propagate a
configuration error if not, so invalid zero inputs cannot be used.
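The lower-bound validation described above could be factored as a small helper along these lines. Field names mirror the PR's config fields; the String error type is a placeholder for whatever error type the config builder actually uses.

```rust
/// Illustrative startup validation for the RPC pool health settings:
/// both values must be at least 1, otherwise configuration fails fast
/// with a clear message instead of producing pathological runtime behavior.
fn validate_rpc_pool_health(
    failure_threshold: u64,
    primary_probe_interval_secs: u64,
) -> Result<(), String> {
    if failure_threshold == 0 {
        return Err("rpc_pool_failure_threshold must be >= 1".to_string());
    }
    if primary_probe_interval_secs == 0 {
        return Err("rpc_pool_primary_probe_interval_secs must be >= 1".to_string());
    }
    Ok(())
}

fn main() {
    match validate_rpc_pool_health(0, 30) {
        Ok(()) => println!("config accepted"),
        Err(e) => println!("config rejected: {e}"),
    }
}
```

Pairing this with the clap-side `value_parser!(u64).range(1..)` check from the CLI finding gives defense in depth: invalid values are rejected at parse time, and any other construction path is still validated here.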

In `@forester/src/epoch_manager.rs`:
- Around line 818-827: The send failure for target epochs should be treated as
terminal: in the epoch monitor where tx.send(target_epoch).await is handled (the
block that currently logs error with event =
"epoch_monitor_send_target_epoch_failed" and then continues), change the control
flow to stop the monitor instead of continuing—mirror the fatal handling used
for current-epoch send failures by returning/propagating an error or breaking
out of the task after logging the error so the loop does not spin; update the
code surrounding tx.send(target_epoch).await and its error branch to terminate
the monitor (using the same pattern used for current-epoch send failures) while
keeping the existing log message and fields (self.run_id, target_epoch, ?e).

In `@forester/src/forester_status.rs`:
- Around line 792-822: The AddressV2 branch computes queue_len in "items"
(input_pending_batches * zkp_batch_size) but queue_cap is derived as
batches.len() * (batch_size / zkp_batch_size), causing a unit mismatch; update
TreeType::AddressV2 (in the code that calls parse_address_v2_queue_info and
computes queue_len/queue_cap) to use consistent units: either both represent
total items or both represent batch slots — e.g., to keep items, set queue_cap =
Some(v2_info.batches.len() as u64 * v2_info.zkp_batch_size as u64) (guard
zkp_batch_size>0) and ensure all casts to u64, or if you prefer batches, compute
queue_len = Some(v2_info.input_pending_batches) and queue_cap =
Some(v2_info.batches.len() as u64); make the corresponding change where
queue_len and queue_cap are returned so callers get consistent units.
- Around line 773-779: The current queue_cap computation uses ZKP batch slots;
update it to represent total item slots by multiplying batches.len() by
batch_size instead of (batch_size / zkp_batch_size). Locate where queue_cap is
computed from v2_info (variable name queue_cap inside StateV2 and the analogous
block in AddressV2) and replace the inner expression to compute i.batches.len()
as u64 * i.batch_size (preserve the existing guard for zero-sized or missing
v2_info); ensure both StateV2 and AddressV2 use the same change so
queue_capacity and queue_length are in the same item units.
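The unit mismatch described in these two findings can be made concrete with a small sketch where both queue length and capacity are expressed in items. The struct here is a simplified stand-in for the real V2QueueInfo; field names are illustrative.

```rust
/// Simplified stand-in for the queue info parsed from on-chain accounts.
struct V2QueueInfo {
    batches_len: u64,           // number of batches in the queue account
    batch_size: u64,            // items per batch
    zkp_batch_size: u64,        // items per ZKP batch
    input_pending_batches: u64, // pending input ZKP batches
}

/// Queue length in items: pending ZKP batches * items per ZKP batch.
fn queue_len_items(info: &V2QueueInfo) -> u64 {
    info.input_pending_batches * info.zkp_batch_size
}

/// Queue capacity in the SAME unit (items): total batches * items per
/// batch — not batches.len() * (batch_size / zkp_batch_size), which
/// would be in ZKP-batch slots and not comparable to the length.
fn queue_cap_items(info: &V2QueueInfo) -> u64 {
    info.batches_len * info.batch_size
}

fn main() {
    let info = V2QueueInfo {
        batches_len: 2,
        batch_size: 100,
        zkp_batch_size: 10,
        input_pending_batches: 3,
    };
    println!(
        "len = {} items, cap = {} items",
        queue_len_items(&info),
        queue_cap_items(&info)
    );
}
```

With mismatched units a dashboard fill-ratio (len / cap) can silently exceed 100% or look near-empty; keeping both sides in items makes the ratio meaningful for StateV2 and AddressV2 alike.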

In `@forester/src/lib.rs`:
- Line 430: A background recovery probe is started via
arc_pool.spawn_primary_recovery_probe() and assigned to recovery_probe_handle
before a later fallible operation returns with ?, which can exit early and never
reach the existing abort block — leaking the spawned task. Fix by delaying
creation of recovery_probe_handle until after the fallible call succeeds or by
installing a guard that will abort recovery_probe_handle on
unwinding/early-return (e.g., wrap recovery_probe_handle in a Drop/ScopeGuard or
use a try { ... } pattern that ensures the existing abort block runs); reference
arc_pool.spawn_primary_recovery_probe, recovery_probe_handle, and the abort
block that currently runs at the end so the spawned task is always aborted on
every error path.

---

Outside diff comments:
In `@forester/src/config.rs`:
- Around line 150-160: RpcPoolConfig's derived Default yields unsafe zero-values
for fields like failure_threshold and primary_probe_interval_secs; replace the
derived Default with an explicit impl Default for RpcPoolConfig that returns
safe, non-zero defaults (e.g., max_size = 10, connection_timeout_secs = 15,
idle_timeout_secs = 300) and set sensible values for max_retries,
initial_retry_delay_ms, max_retry_delay_ms, failure_threshold, and
primary_probe_interval_secs so pool health behavior is initialized correctly;
update or remove #[derive(Default)] and implement RpcPoolConfig::default() to
return these recommended defaults.

In `@forester/src/epoch_manager.rs`:
- Around line 950-1011: The epoch metrics can regress because
update_epoch_detected and update_epoch_registered set gauges directly and
concurrent processing can overwrite newer values with older ones; change their
implementations (the metric helper(s) used by
update_epoch_detected/update_epoch_registered) to perform monotonic updates by
reading the current gauge value and setting it to max(current, incoming) (i.e.,
only update when incoming > current) so gauges never move backwards; apply this
max(existing, incoming) logic in the metric helper(s) so any callers like
update_epoch_detected/update_epoch_registered benefit without changing call
sites.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14adf27 and 18c1ba8.

⛔ Files ignored due to path filters (4)
  • .github/workflows/metrics-contract.yml is excluded by none and included by none
  • external/photon is excluded by none and included by none
  • forester-utils/src/rpc_pool.rs is excluded by none and included by none
  • scripts/deploy/deploy.sh is excluded by none and included by none
📒 Files selected for processing (26)
  • forester/.env.example
  • forester/.gitignore
  • forester/README.md
  • forester/dashboard/src/app/page.tsx
  • forester/dashboard/src/components/EpochCard.tsx
  • forester/dashboard/src/components/TreeTable.tsx
  • forester/dashboard/src/types/forester.ts
  • forester/dashboard/tsconfig.tsbuildinfo
  • forester/metrics-contract.json
  • forester/src/api_server.rs
  • forester/src/cli.rs
  • forester/src/config.rs
  • forester/src/epoch_manager.rs
  • forester/src/forester_status.rs
  • forester/src/lib.rs
  • forester/src/metrics.rs
  • forester/src/processor/v1/helpers.rs
  • forester/src/processor/v1/send_transaction.rs
  • forester/src/processor/v2/common.rs
  • forester/src/queue_helpers.rs
  • forester/tests/e2e_test.rs
  • forester/tests/legacy/priority_fee_test.rs
  • forester/tests/legacy/test_utils.rs
  • forester/tests/metrics_contract_test.rs
  • forester/tests/priority_fee_test.rs
  • forester/tests/test_utils.rs

- Added value parser with range validation for `rpc_pool_failure_threshold` and `rpc_pool_primary_probe_interval_secs` in CLI arguments to ensure they are greater than 0.
- Updated error handling in `EpochManager` to return an error when the epoch channel is closed instead of continuing the loop.
- Simplified queue capacity calculation in `parse_tree_status` by removing unnecessary checks for `zkp_batch_size`.
- Moved the spawning of the primary recovery probe to after the pool is built in `run_pipeline_with_run_id` for better clarity and structure.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
forester/src/lib.rs (1)

469-474: ⚠️ Potential issue | 🟠 Major

Abort slot_tracker_handle on tracker-init error path.

slot_tracker_handle starts at Line [444], but Line [472]-[473] can early-return with ?. That leaves the SlotTracker task running after pipeline init failed.

🛠️ Proposed fix
     let tracker_handles = if let Some(trackers) = preconfigured_trackers {
         trackers
     } else {
-        initialize_compressible_trackers(config.clone(), shutdown_compressible, shutdown_bootstrap)
-            .await?
+        match initialize_compressible_trackers(
+            config.clone(),
+            shutdown_compressible,
+            shutdown_bootstrap,
+        )
+        .await
+        {
+            Ok(handles) => handles,
+            Err(e) => {
+                slot_tracker_handle.abort();
+                return Err(e);
+            }
+        }
     };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@forester/src/lib.rs` around lines 469 - 474, The SlotTracker task started in
slot_tracker_handle must be aborted if tracker initialization fails; modify the
branch around initialize_compressible_trackers so that on the error path you
call slot_tracker_handle.abort() (or otherwise stop/join it) before returning
the propagated error from initialize_compressible_trackers; locate the variables
slot_tracker_handle, preconfigured_trackers, and
initialize_compressible_trackers to add the abort call just prior to the early
return (the assignment to tracker_handles) so the SlotTracker does not remain
running after pipeline init fails.
forester/.env.example (1)

25-30: ⚠️ Potential issue | 🟡 Minor

Add the required .env bootstrap instruction at file top.

Please add cp .env.example .env as the first visible line/comment in this file so first-time setup matches the Forester setup flow.

Proposed fix
+## First-time setup: cp .env.example .env
+
 # Core RPC Configuration
 export RPC_URL="https://api.devnet.solana.com"

As per coding guidelines, forester/**/.env.example: Copy the example environment file using cp .env.example .env before configuring the Forester service.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@forester/.env.example` around lines 25 - 30, The .env.example file is missing
the bootstrap instruction to copy the example into an active .env; add a visible
top-line comment containing the command cp .env.example .env as the first
visible line in the file (before any other commented variables) so first-time
setup follows the Forester setup flow and users know to create their .env from
.env.example.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@forester/src/epoch_manager.rs`:
- Around line 950-951: The spawned process_epoch tasks call
update_epoch_detected and update_epoch_registered which currently call
forester_epoch_detected.set(...) / forester_epoch_registered.set(...) and allow
older tasks to regress the metric; change the gauge backing to a monotonic
atomic (e.g., store an AtomicI64 or AtomicU64 per gauge) and update the setter
functions (the gauge setter helpers used by
update_epoch_detected/update_epoch_registered defined alongside
forester_epoch_detected/forester_epoch_registered) to perform a compare-and-swap
loop that only replaces the stored value when the new epoch is greater
(monotonic-max), then update the exported gauge from that atomic, ensuring
process_epoch concurrent completions cannot move the reported latest epoch
backward.

In `@forester/src/forester_status.rs`:
- Around line 172-200: The code incorrectly uses
protocol_config_pda.config.is_registration_phase(slot).is_ok() (which always
returns Ok due to disabled validation) to set registration_is_open; instead call
the epoch-phase inspector to get concrete phase boundaries and compare the slot
against the registration phase range. Replace the is_ok() check with a direct
boundary test using protocol_config_pda.config.get_epoch_phases(slot) (or
get_epoch_phases for the current/latest epoch), extract the registration phase
start/end (or registration_phase.range), set registration_is_open = slot >=
registration_start && slot < registration_end, and then compute
current_registration_epoch and slots_until_next_registration by subtracting slot
from the appropriate phase end/start using latest_register_epoch,
registration_phase_length, active_phase_length and
protocol_config_pda.config.get_latest_register_epoch_progress(slot) as before.

---

Duplicate comments:
In `@forester/.env.example`:
- Around line 25-30: The .env.example file is missing the bootstrap instruction
to copy the example into an active .env; add a visible top-line comment
containing the command cp .env.example .env as the first visible line in the
file (before any other commented variables) so first-time setup follows the
Forester setup flow and users know to create their .env from .env.example.

In `@forester/src/lib.rs`:
- Around line 469-474: The SlotTracker task started in slot_tracker_handle must
be aborted if tracker initialization fails; modify the branch around
initialize_compressible_trackers so that on the error path you call
slot_tracker_handle.abort() (or otherwise stop/join it) before returning the
propagated error from initialize_compressible_trackers; locate the variables
slot_tracker_handle, preconfigured_trackers, and
initialize_compressible_trackers to add the abort call just prior to the early
return (the assignment to tracker_handles) so the SlotTracker does not remain
running after pipeline init fails.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 18c1ba8 and 723905d.

📒 Files selected for processing (6)
  • forester/.env.example
  • forester/dashboard/.gitignore
  • forester/src/cli.rs
  • forester/src/epoch_manager.rs
  • forester/src/forester_status.rs
  • forester/src/lib.rs

Comment on lines +950 to 951
update_epoch_detected(epoch);



⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Script 1: Show per-epoch async task spawning
echo "=== PER-EPOCH ASYNC TASK SPAWNING ==="
rg -n "epoch_queued_for_processing|tokio::spawn|process_epoch" forester/src/epoch_manager.rs -A8 -B8

echo ""
echo "=== GAUGE UPDATES IN PROCESS_EPOCH ==="
# Script 2: Show where epoch gauges are updated
rg -n "update_epoch_detected|update_epoch_registered" forester/src/epoch_manager.rs -A4 -B4

echo ""
echo "=== GAUGE SETTER IMPLEMENTATION ==="
# Script 3: Show gauge setters
rg -n "pub fn update_epoch_detected|pub fn update_epoch_registered|set" forester/src/metrics.rs -A8 -B3 | head -100

Repository: Lightprotocol/light-protocol

Length of output: 10425


Fix metric regressions from concurrent epoch task updates.

Lines 950 and 1011 update forester_epoch_detected and forester_epoch_registered gauges from spawned process_epoch tasks that run concurrently per epoch. The gauge setters (lines 236–244 in metrics.rs) use simple .set() semantics with no ordering protection—an older epoch's task finishing after a newer one will overwrite the metric with a lower value, causing gauges to move backward.

Replace the gauge updates with monotonic-max semantics using atomic compare-and-swap on an AtomicI64/AtomicU64, ensuring concurrent older epochs cannot regress the reported "latest" epoch value.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@forester/src/epoch_manager.rs` around lines 950 - 951, The spawned
process_epoch tasks call update_epoch_detected and update_epoch_registered which
currently call forester_epoch_detected.set(...) /
forester_epoch_registered.set(...) and allow older tasks to regress the metric;
change the gauge backing to a monotonic atomic (e.g., store an AtomicI64 or
AtomicU64 per gauge) and update the setter functions (the gauge setter helpers
used by update_epoch_detected/update_epoch_registered defined alongside
forester_epoch_detected/forester_epoch_registered) to perform a compare-and-swap
loop that only replaces the stored value when the new epoch is greater
(monotonic-max), then update the exported gauge from that atomic, ensuring
process_epoch concurrent completions cannot move the reported latest epoch
backward.

Comment on lines +172 to +200
// Determine if registration is currently open
let registration_is_open = protocol_config_pda
    .config
    .is_registration_phase(slot)
    .is_ok();

// If registration is closed, show the next epoch as the registration target
let current_registration_epoch = if registration_is_open {
    latest_register_epoch
} else {
    latest_register_epoch + 1
};

let slots_until_next_registration = if registration_is_open {
    // Slots until current registration closes
    let epoch_progress = protocol_config_pda
        .config
        .get_latest_register_epoch_progress(slot)
        .unwrap_or(0);
    registration_phase_length.saturating_sub(epoch_progress)
} else {
    // Slots until next epoch's registration opens
    active_phase_length.saturating_sub(
        protocol_config_pda
            .config
            .get_latest_register_epoch_progress(slot)
            .unwrap_or(0),
    )
};


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, locate the registry module files
fd -t f "state.rs" programs/registry/src/protocol_config/

Repository: Lightprotocol/light-protocol

Length of output: 118


🏁 Script executed:

# Inspect the is_registration_phase implementation
rg -n "pub fn is_registration_phase" --type rust -A 30



🏁 Script executed:

# Look at the protocol_config module structure
fd -t f "\.rs$" programs/registry/src/protocol_config/



🏁 Script executed:

# Check for phase boundary definitions
rg -n "registration_phase_length|registration.*start|registration.*end" --type rust -B 2 -A 2



🏁 Script executed:

# Look for any CLAUDE.md or guidance docs
find . -name "CLAUDE.md" -o -name "claude.md" 2>/dev/null



🏁 Script executed:

# Check if get_epoch_phases is imported in forester_status.rs
head -100 forester/src/forester_status.rs | grep -E "^use|^mod"



🏁 Script executed:

# Check the full import section and verify get_epoch_phases availability
rg -n "get_epoch_phases|use.*epoch" forester/src/forester_status.rs -B 2 -A 2 | head -40



🏁 Script executed:

# Check how forester_epoch.rs is used in the forester module
rg -n "mod forester_epoch|use.*forester_epoch" forester/src/



🏁 Script executed:

# Get the exact code at lines 170-210 in forester_status.rs to verify the diff
sed -n '170,210p' forester/src/forester_status.rs



🏁 Script executed:

# Also check what get_latest_register_epoch_progress returns and how it's used
rg -n "get_latest_register_epoch_progress" --type rust -B 2 -A 5 | head -50



🏁 Script executed:

# Verify the exact structure of EpochPhases and get_epoch_phases signature
rg -n "pub struct EpochPhases|pub fn get_epoch_phases" forester-utils/src/forester_epoch.rs -A 15 | head -60



🏁 Script executed:

# Check the latest_register_epoch to see if it's the current epoch to register for
rg -n "get_latest_register_epoch" programs/registry/src/protocol_config/state.rs -A 10 -B 2



Fix registration phase detection to use explicit slot boundaries instead of disabled API check.

Lines 175–176 call .is_ok() on is_registration_phase(slot), which always returns Ok(...) because the phase boundary validation is commented out (disabled intentionally per the code comment). This causes registration_is_open to be stuck true, which distorts both current_registration_epoch and slots_until_next_registration calculations.

The codebase already has the correct infrastructure (get_epoch_phases() imported at line 15) and uses explicit slot boundary checks throughout (e.g., epoch_manager.rs, forester_epoch.rs). Replace the .is_ok() check with direct boundary comparison:

Suggested fix
-    // Determine if registration is currently open
-    let registration_is_open = protocol_config_pda
-        .config
-        .is_registration_phase(slot)
-        .is_ok();
+    let current_reg_phases = get_epoch_phases(&protocol_config_pda.config, latest_register_epoch);
+    let registration_is_open = slot >= current_reg_phases.registration.start
+        && slot < current_reg_phases.registration.end;

     // If registration is closed, show the next epoch as the registration target
     let current_registration_epoch = if registration_is_open {
         latest_register_epoch
     } else {
         latest_register_epoch + 1
     };

-    let slots_until_next_registration = if registration_is_open {
-        // Slots until current registration closes
-        let epoch_progress = protocol_config_pda
-            .config
-            .get_latest_register_epoch_progress(slot)
-            .unwrap_or(0);
-        registration_phase_length.saturating_sub(epoch_progress)
-    } else {
-        // Slots until next epoch's registration opens
-        active_phase_length.saturating_sub(
-            protocol_config_pda
-                .config
-                .get_latest_register_epoch_progress(slot)
-                .unwrap_or(0),
-        )
-    };
+    let slots_until_next_registration = if registration_is_open {
+        current_reg_phases.registration.end.saturating_sub(slot)
+    } else {
+        let next_reg_phases =
+            get_epoch_phases(&protocol_config_pda.config, latest_register_epoch + 1);
+        next_reg_phases.registration.start.saturating_sub(slot)
+    };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@forester/src/forester_status.rs` around lines 172-200: the code incorrectly
uses protocol_config_pda.config.is_registration_phase(slot).is_ok() (which
always returns Ok due to disabled validation) to set registration_is_open;
instead call the epoch-phase inspector to get concrete phase boundaries and
compare the slot against the registration phase range. Replace the is_ok() check
with a direct boundary test using
protocol_config_pda.config.get_epoch_phases(slot) (or get_epoch_phases for the
current/latest epoch), extract the registration phase start/end (or
registration_phase.range), set registration_is_open = slot >= registration_start
&& slot < registration_end, and then compute current_registration_epoch and
slots_until_next_registration by subtracting slot from the appropriate phase
end/start using latest_register_epoch, registration_phase_length,
active_phase_length and
protocol_config_pda.config.get_latest_register_epoch_progress(slot) as before.
