- add dashboard app (Next.js + Tailwind), components, hooks, and API client
- include dashboard Dockerfile + env example; update gitignore and main Dockerfile
- remove legacy static dashboard.html
- extend Rust backend/CLI to expose status, metrics, epochs, and compressible-related data
- refactor compressible/config/processor helpers, queue pressure + validation, and related telemetry/metrics
Caution: review failed (failed to post review comments).

📝 Walkthrough

This PR introduces a Next.js-based dashboard frontend for the Forester service with corresponding backend API enhancements. The dashboard provides real-time monitoring of epochs, trees, metrics, and compressible accounts. The API server is refactored to stream metrics and compressible data via watch channels rather than generating them on-demand, improving scalability. ForesterStatus is expanded with aggregate statistics. A new CLI

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Dashboard as Next.js Dashboard
    participant APIServer as Forester API Server
    participant MetricsProvider as Background Metrics Provider
    participant PrometheusDB as Prometheus
    participant RPC as Solana RPC
    User->>Dashboard: Loads /metrics page
    Dashboard->>APIServer: GET /metrics/json
    par Parallel Background Tasks
        MetricsProvider->>MetricsProvider: Periodically runs (10s interval)
        MetricsProvider->>PrometheusDB: Query transaction_rate,<br/>transactions_processed,<br/>forester_balances
        MetricsProvider->>MetricsProvider: Parse & aggregate results
        MetricsProvider->>MetricsProvider: Store in watch channel
    and
        MetricsProvider->>RPC: (fallback) Get local metrics<br/>if Prometheus unavailable
    end
    APIServer->>APIServer: Read latest snapshot<br/>from watch channel
    APIServer-->>Dashboard: Return MetricsSnapshot JSON
    Dashboard->>Dashboard: Render transaction rates,<br/>balances, queue depths
    Dashboard-->>User: Display metrics dashboard
```
```mermaid
sequenceDiagram
    actor User
    participant Dashboard as Next.js Dashboard
    participant APIServer as Forester API Server
    participant StatusHandler as Status Route Handler
    participant CompressibleProvider as Background<br/>Compressible Provider
    participant RPC as Solana RPC
    participant Trackers as In-Memory Trackers
    User->>Dashboard: Loads /compressible page
    Dashboard->>APIServer: GET /compressible
    par Concurrent Sources
        CompressibleProvider->>CompressibleProvider: Runs periodically
        alt Trackers Available
            CompressibleProvider->>Trackers: Read cached counts<br/>(CTokens, Mints, PDAs)
        else Trackers Unavailable
            CompressibleProvider->>RPC: Query getProgramAccounts<br/>with pagination & filters
            CompressibleProvider->>CompressibleProvider: Count results
        end
        CompressibleProvider->>CompressibleProvider: Store in watch channel
    and
        APIServer->>APIServer: Read latest snapshot<br/>from channel
    end
    APIServer-->>Dashboard: Return CompressibleSnapshot
    Dashboard->>Dashboard: Render account counts
    Dashboard-->>User: Display compressible panel
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~90 minutes

Rationale: This diff introduces substantial, heterogeneous changes across multiple domains: (1) a complete Next.js dashboard frontend with configuration, 11 components, 3 pages, hooks, utilities, and TypeScript types; (2) a significant API server refactoring changing core async patterns and request flow; (3) extended ForesterStatus and related structs affecting multiple query paths; (4) supporting changes to compressible, processors, metrics, and CLI. While individual file changes are sometimes localized, the cumulative scope, API-surface modifications (e.g., spawn_api_server signature, struct field additions), and interconnected logic across frontend/backend layers demand careful review of interaction boundaries, error handling consistency, and type correctness throughout.
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 33
Caution: some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
forester/src/processor/v1/tx_builder.rs (1)
131-141: ⚠️ Potential issue | 🟠 Major

Overly broad "not found" error swallowing — this will silently discard real failures.
The check `err_str.to_lowercase().contains("not found")` matches far more than indexer record-not-found errors. Any error message containing "not found" — connection not found, method not found, key not found, socket not found, etc. — will be silently treated as an empty batch and the work items will be skipped.

Additionally, note that the first condition (`"record not found"`) is logically redundant since the second condition (`"not found"`) already subsumes it, and `.to_lowercase()` is called twice on the same string.

I'd recommend tightening this to match only the specific indexer error you expect:
Proposed fix
```diff
-        let err_str = e.to_string();
-        return if err_str.to_lowercase().contains("record not found")
-            || err_str.to_lowercase().contains("not found")
-        {
+        let err_str = e.to_string().to_lowercase();
+        return if err_str.contains("record not found") {
```

If you genuinely need to catch a broader class of indexer "not found" responses, consider matching on a structured error variant rather than substring matching on the message. Error messages are not a stable API contract — an upstream library changing its wording will silently alter your control flow.
forester/src/processor/v2/root_guard.rs (1)
96-101: 🧹 Nitpick | 🔵 Trivial

Consider adding a test for the non-warning uninitialized path.

This test exercises the case where `indexer_root != onchain_root` (triggers the warning log). There's no test for `expected_root == 0` with `indexer_root == onchain_root`, which is the "clean" uninitialized startup. It's a one-liner to add and would round out coverage of both branches inside the new block:

Suggested additional test
```rust
#[test]
fn proceeds_when_expected_is_zero_and_roots_agree() {
    assert_eq!(
        reconcile_roots(root(0), root(5), root(5)),
        RootReconcileDecision::Proceed
    );
}
```

forester/src/rollover/operations.rs (2)
64-72: ⚠️ Potential issue | 🟡 Minor

Const generic `26` on Line 65 can silently diverge from `STATE_MERKLE_TREE_HEIGHT`.

Line 72 now correctly uses `STATE_MERKLE_TREE_HEIGHT`, but the const generic parameter on Line 65 is still a hardcoded `26`. If the constant ever changes, this will compile fine but produce incorrect tree reads at runtime — a subtle and dangerous mismatch.

I understand that Rust's stable const generics don't allow `FOO as usize` in generic position, so you may not be able to use the constant directly. At minimum, add a compile-time or debug assertion to keep these in sync:

Proposed safeguard
```diff
+    // Safety: const generic must match the tree height constant.
+    assert_eq!(STATE_MERKLE_TREE_HEIGHT as usize, 26, "STATE_MERKLE_TREE_HEIGHT changed — update the const generic on get_concurrent_merkle_tree");
     let merkle_tree = get_concurrent_merkle_tree::<StateMerkleTreeAccount, R, Poseidon, 26>(
```
sdk-libs/client/src/indexer/tree_info.rs(theheight()method also hardcodes26forStateV1/AddressV1), so this is a codebase-wide consistency concern worth tracking.
104-111: ⚠️ Potential issue | 🟡 Minor

Same const-generic / named-constant mismatch for AddressV1.

Line 105 hardcodes `26` as a const generic for `get_indexed_merkle_tree`, while Line 111 now uses `ADDRESS_MERKLE_TREE_HEIGHT`. Same divergence risk as the `StateV1` path above — please add an equivalent assertion here too.

Proposed safeguard
```diff
+    assert_eq!(ADDRESS_MERKLE_TREE_HEIGHT as usize, 26, "ADDRESS_MERKLE_TREE_HEIGHT changed — update the const generic on get_indexed_merkle_tree");
     let merkle_tree = get_indexed_merkle_tree::<AddressMerkleTreeAccount, R, Poseidon, usize, 26, 16>(
```

forester/src/compressible/subscriber.rs (2)
159-168: ⚠️ Potential issue | 🟠 Major

Backoff reset on `StreamClosed` defeats its own purpose and produces misleading logs.

Two problems here:

1. Misleading log: You increment `attempt` on line 160 and log it on line 162, then immediately reset it to 0 on line 168. Every `StreamClosed` will always log "attempt 1", making it impossible to tell if you're in a reconnect storm.

2. No backoff for rapid stream closures: If the server accepts the WebSocket but immediately drops it (e.g., overloaded, auth issue, rate-limited), `run_connection` returns `StreamClosed` every time. Because you unconditionally reset `current_delay` and `attempt`, you'll reconnect every `initial_delay` (1 s) forever — effectively a tight loop that hammers the server.

The comment says "Reset backoff after a successful connection," which is the right idea — but `StreamClosed` doesn't mean the connection was healthy. A more robust approach: track how long the connection was alive, and only reset backoff if it survived past some threshold (e.g., > 30 s). Otherwise, keep the exponential backoff.

Proposed fix — reset only for long-lived connections
```diff
 Ok(ConnectionResult::StreamClosed) => {
     attempt += 1;
     warn!(
         "{} connection lost (attempt {}), reconnecting in {:?}...",
         self.config.name, attempt, current_delay
     );
-
-    // Reset backoff after a successful connection
-    current_delay = self.reconnect_config.initial_delay;
-    attempt = 0;
+    // Note: backoff is NOT reset here. If you want to reset
+    // after a long-lived connection, track connection duration
+    // in run_connection() and return it with StreamClosed.
 }
```
run_connectionand return it with theStreamClosedvariant:enum ConnectionResult { Shutdown, StreamClosed { connected_duration: Duration }, }Then in the caller:
```rust
Ok(ConnectionResult::StreamClosed { connected_duration }) => {
    if connected_duration > Duration::from_secs(30) {
        // Connection was stable — reset backoff
        current_delay = self.reconnect_config.initial_delay;
        attempt = 0;
    } else {
        attempt += 1;
    }
    // ...
}
```
144-193: ⚠️ Potential issue | 🟠 Major

The `Err` path correctly applies exponential backoff — nice contrast to the `StreamClosed` path.

Worth noting: the backoff computation on lines 189-192 runs for both `StreamClosed` and `Err` arms, but since `StreamClosed` resets `current_delay` on line 167 before reaching it, the exponential increase after `StreamClosed` just computes `initial_delay * multiplier` — so on a second consecutive `StreamClosed` you'd sleep `initial_delay` (reset) then set delay to `initial_delay * 2`, but then the next `StreamClosed` resets it again. The backoff never actually grows. This reinforces the concern above.

forester/src/processor/v2/helpers.rs (1)
435-473: 🧹 Nitpick | 🔵 Trivial

Good direction — condvar with timeout is the right move here. Replacing the old `thread::sleep` polling with `Condvar::wait_timeout` is a meaningful improvement: lower CPU overhead and faster wake-up when data actually arrives. The 120s overall cap is a sensible safety net. A couple of things to tighten up, though:

1. Double lock acquisition creates a TOCTOU gap and unnecessary contention.
Each loop iteration acquires `available_elements` at line 439 (read + drop), then re-acquires the same lock at line 467 for the condvar wait. Between those two acquisitions, the background producer could update `available_elements` and call `ready.notify_all()` (line 702). Since we're not waiting on the condvar at that moment, the notification is lost, and we sleep for the full 50ms poll interval before rechecking.

You can eliminate this gap by holding the guard across the check and the wait:
♻️ Suggested restructure to single lock acquisition per iteration
```diff
 loop {
-    let available = *lock_recover(
-        &self.available_elements,
-        "streaming_address_queue.available_elements",
-    );
-    if available >= batch_end {
-        return available;
-    }
-
     let complete = *lock_recover(
         &self.fetch_complete,
         "streaming_address_queue.fetch_complete",
     );
-    if complete {
-        return available;
-    }
-    if start.elapsed() > POLL_TIMEOUT {
-        tracing::warn!(
-            "wait_for_batch timed out after {:?} waiting for {} elements (available: {})",
-            POLL_TIMEOUT,
-            batch_end,
-            available
-        );
-        return available;
-    }
-
-    // Use condvar wait with timeout instead of thread::sleep to avoid
-    // blocking the thread and to wake up promptly when data arrives.
-    let guard = lock_recover(
+    let guard = lock_recover(
         &self.available_elements,
         "streaming_address_queue.available_elements",
     );
-    let _ = self
-        .data_ready
-        .wait_timeout(guard, std::time::Duration::from_millis(50));
+    let available = *guard;
+
+    if available >= batch_end || complete {
+        return available;
+    }
+
+    if start.elapsed() > POLL_TIMEOUT {
+        tracing::warn!(
+            "wait_for_batch timed out after {:?} waiting for {} elements (available: {})",
+            POLL_TIMEOUT,
+            batch_end,
+            available
+        );
+        return available;
+    }
+
+    // Hold the guard into wait_timeout so the notify between
+    // "check" and "wait" cannot be lost.
+    let (_guard, _timeout_result) = self
+        .data_ready
+        .wait_timeout(guard, std::time::Duration::from_millis(50))
+        .unwrap_or_else(|e| {
+            tracing::warn!("Poisoned condvar (recovering): streaming_address_queue.data_ready");
+            e.into_inner()
+        });
 }
```
fetch_completeuses a different mutex, so it still needs its own lock — but moving it before theavailable_elementslock keeps the critical window tight. The condvar wait atomically releases theavailable_elementsguard, so no notification from the producer (which locksavailable_elements→ updates → drops → notifies) can slip through unobserved.2. Stale
availableon timeout return (minor).When the timeout fires at line 455, the
availablevalue was read at the top of the iteration (line 439) and could be out of date. Inget_batch_data, this stale value feeds into thestart >= availableearly-return check (line 479), which could produce a spuriousNoneeven though data exists. If you adopt the restructured loop above, this is resolved naturally sinceavailableis read from the held guard right before the timeout check.
🤖 Fix all issues with AI agents
In `@forester/dashboard/Dockerfile`:
- Around line 17-18: Combine the two consecutive Docker RUN layers into a single
RUN to reduce image layers and satisfy DL3059: replace the separate RUN addgroup
--system --gid 1001 nodejs and RUN adduser --system --uid 1001 nextjs with one
RUN that executes both commands (using shell chaining such as &&) so both
addgroup and adduser run in the same layer and preserve exit-on-failure
behavior.
- Around line 23-26: Add a Docker HEALTHCHECK instruction to the Dockerfile to
probe the running Next.js server (after the existing USER nextjs / EXPOSE 3000 /
ENV PORT=3000 / CMD ["node","server.js"] lines): implement a lightweight HTTP
probe (e.g., wget or curl against http://localhost:3000/ or a dedicated /health
route if available) with sensible parameters (interval, timeout, start-period
and retries) so the container runtime can mark the container unhealthy when the
server is unresponsive.
In `@forester/dashboard/src/app/metrics/page.tsx`:
- Around line 22-24: The empty-state check for isEmpty currently only inspects
metrics.transactions_processed_total and metrics.forester_balances; update it to
also consider metrics.queue_lengths (and optionally metrics.transaction_rate) so
that isEmpty truly reflects no data across all metric groups. Locate the isEmpty
declaration and include Object.keys(metrics.queue_lengths).length === 0 (and
Object.keys(metrics.transaction_rate).length === 0 if you want to treat
transaction_rate as contributing to non-empty) in the combined boolean
expression. Ensure you use the same metrics object keys (metrics.queue_lengths,
metrics.transaction_rate) and keep the logical AND semantics so the page only
treats it as empty when all relevant metric maps are empty.
In `@forester/dashboard/src/app/page.tsx`:
- Around line 45-52: Replace the array index key in the warnings rendering (the
warnings.map callback) with the warning string itself: use the iteration value
(the warning string variable passed as w) as the React key in the div (ensuring
it's unique/deterministic), or fall back to a stable derived id if you have any
risk of duplicates; update the key prop in the map inside the component that
renders warnings (the warnings.map callback) accordingly.
- Around line 29-34: The literal 1000 should be replaced with a named constant
to clarify intent and make tuning easier: add a top-level constant like
REGISTRATION_WARNING_SLOT_THRESHOLD (or similar) in page.tsx and use it in the
check inside the block that currently references
status.slots_until_next_registration (the if that also checks
status.registration_epoch_foresters.length). Replace the magic number with that
constant, choose a descriptive name and add a brief comment describing what the
threshold represents.
In `@forester/dashboard/src/components/ErrorState.tsx`:
- Around line 6-18: The current ErrorState component renders an "Error" UI even
when the error prop is falsy; change the logic in ErrorState so that when error
is undefined/null it returns null (not the error block) to avoid showing a false
error to users, and add a one-line comment above that early return explaining
it’s intentionally returning null because callers should guard rendering (or, if
you prefer to keep a defensive fallback, keep the UI but add a comment
clarifying it’s for the rare case where isError is true but error is undefined);
reference the ErrorState function and its props error and fallbackMessage when
making this change.
- Around line 20-23: The current fragile detection in ErrorState uses isApiError
computed from error.name === "ApiError" || error.message.startsWith("Forester
API returned"); remove the fallback string-prefix check and rely on the
structured discriminator (error.name === "ApiError") or add a typed property
(e.g., error.type === "ApiError" or a custom ApiError class) when creating
errors; update the isApiError logic (and any code that constructs errors) to use
that structured field instead of error.message.startsWith to avoid brittle
message parsing.
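A sketch of the structured-discriminator approach suggested above, assuming a custom `ApiError` class is introduced wherever the client currently builds "Forester API returned ..." messages (the `status` field is illustrative):

```typescript
// Hypothetical ApiError class; construct this at error-creation time
// instead of encoding the error kind in the message string.
class ApiError extends Error {
  readonly status: number;
  constructor(status: number, message: string) {
    super(message);
    this.name = "ApiError";
    this.status = status;
  }
}

// Structured check — no brittle error.message.startsWith(...) parsing.
function isApiError(error: unknown): error is ApiError {
  return error instanceof ApiError;
}
```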
In `@forester/dashboard/src/components/ForesterList.tsx`:
- Around line 20-24: Replace the array index key with the unique forester
identifier: in the foresters.map callback (where the code currently uses
key={i}), use f.authority as the React key instead; update the JSX return in the
map (the element with key={...}) to key={f.authority} so React can correctly
reconcile items when the list changes.
In `@forester/dashboard/src/components/MetricsPanel.tsx`:
- Around line 66-73: The parameter name in the formatSol utility is misleading:
rename the function parameter in utils.ts from lamports to a clearer name like
sol or amount (e.g., change function formatSol(lamports: number) to
formatSol(sol: number)) and update any internal variable/JSdoc references inside
formatSol to use the new name; keep the function behavior unchanged and ensure
all call sites (e.g., the MetricsPanel usage formatSol(balance)) continue to
work without changes.
In `@forester/dashboard/src/components/ProgressBar.tsx`:
- Line 14: The computed percent "pct" can go negative when "value" < 0; update
the calculation in ProgressBar.tsx (the pct = ... expression using value and
max) to clamp to the [0,100] range by wrapping the existing min(...) with
Math.max(0, ...), e.g. compute the ratio as before and then apply Math.max(0,
Math.min(...)) so negative widths become 0 and values above 100 remain capped.
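The clamp can be sketched as a small helper. The function name and the `max <= 0` guard are illustrative additions, not code from the PR:

```typescript
// Clamp to [0, 100] so a negative value cannot yield a negative CSS
// width, and overshoot past max stays capped at 100.
function progressPercent(value: number, max: number): number {
  if (max <= 0) return 0; // avoid division by zero
  return Math.max(0, Math.min(100, (value / max) * 100));
}
```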
In `@forester/dashboard/src/components/Sidebar.tsx`:
- Around line 23-37: The nav highlight logic uses exact match (pathname ===
item.href) so parent items don't highlight for sub-routes; update the
computation of active (used when mapping navItems and passed to the Link
rendering) to treat the root "/" specially and use startsWith for other items —
e.g., compute active as item.href === "/" ? pathname === "/" :
pathname.startsWith(item.href) (replace the current pathname === item.href check
in the Sidebar mapping where item.href, pathname, and navItems are referenced).
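The suggested computation, extracted as a small predicate for clarity (the helper name is hypothetical; in the component this would be inline in the `navItems` map):

```typescript
// "/" matches only exactly; any other nav item also highlights for its
// sub-routes (e.g., /trees/abc keeps /trees active).
function isActiveNavItem(pathname: string, href: string): boolean {
  return href === "/" ? pathname === "/" : pathname.startsWith(href);
}
```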
In `@forester/dashboard/src/components/StatusBadge.tsx`:
- Around line 6-16: Add a short inline comment next to the StatusBadge component
(near the color prop and the className interpolation in StatusBadge) warning
that Tailwind/JIT only picks up literal class names and that constructing
classes dynamically (e.g., `bg-${severity}-500`) will be purged in production;
reference callers like batchStateColor in utils.ts as examples of where full
literals must exist and instruct future authors to pass predefined literal class
strings into the color prop or update the Tailwind safelist if dynamic
generation is required.
In `@forester/dashboard/src/components/TreeBatchDetail.tsx`:
- Around line 17-30: The three batch fields are inconsistently formatted:
formatNumber is used for info.zkp_batch_size but not for
info.input_pending_batches or info.output_pending_batches; update the JSX in
TreeBatchDetail (the <p> elements that render info.input_pending_batches and
info.output_pending_batches) to wrap their values with the formatNumber helper
so all three fields use formatNumber(info.<field>) for consistent numeric
formatting.
In `@forester/dashboard/src/components/TreeTable.tsx`:
- Around line 68-78: The filter buttons in TreeTable are missing an explicit
type and will default to "submit" inside forms; update the button element
rendered in the map (the one using key={f}, onClick={() => setFilter(f)} and
rendering {f === "all" ? "All" : f}) to include type="button" so clicking it
won't submit any surrounding form.
- Around line 216-242: The "+N" badge uses a fixed 8 instead of the actual
number of items shown, causing wrong counts near the end; compute the
remaining/hidden count using the computed start and visible slice (e.g., const
hidden = Math.max(0, schedule.length - (start + visible.length))) and render
+{hidden} instead of +{schedule.length - 8}; update references to schedule,
start, visible, and the span that outputs "+{...}" accordingly.
- Line 207: Move the stray "import { Fragment } from 'react'" into the top
import block alongside the existing React imports in the TreeTable component so
all React imports are colocated; then remove the duplicate import on line 207.
Locate the import usage by checking the TreeTable.tsx file and adjust the top
import statements (where React is imported) to include Fragment, ensuring there
are no duplicate React imports remaining.
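The "+N" badge fix from the second item above can be sketched as a pure helper (name and signature are illustrative; in the component the inputs would be `schedule.length`, `start`, and `visible.length`):

```typescript
// Items hidden past the visible window; clamped so the badge never shows
// a negative count near the end of the schedule.
function hiddenCount(scheduleLength: number, start: number, visibleLength: number): number {
  return Math.max(0, scheduleLength - (start + visibleLength));
}
```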
In `@forester/dashboard/src/hooks/useForesterStatus.ts`:
- Around line 7-13: The magic number 0.46 in the refreshInterval calculation
should be replaced with a named constant (e.g., SLOT_DURATION_SECONDS or
FORESTER_SLOT_DURATION_SEC) declared at the top of useForesterStatus.ts; update
the expression in refreshInterval (the function using
data.slots_until_next_light_slot) to multiply by that constant instead of 0.46
and adjust any +500 or ms conversion logic if needed; ensure the constant name
documents the unit (seconds or ms) and consider aligning its value with the
backend's SLOT_DURATION_MS for consistency.
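A sketch of the named-constant version. Treating 0.46 as seconds-per-slot, the `+500` ms slack, and the constant/function names are all assumptions based on the comment above:

```typescript
// Assumed seconds per slot (the hook's current 0.46 magic number);
// ideally kept in sync with the backend's SLOT_DURATION_MS.
const SLOT_DURATION_SECONDS = 0.46;

// Refresh shortly after the next light slot boundary, with 500 ms slack.
function refreshIntervalMs(slotsUntilNextLightSlot: number): number {
  return slotsUntilNextLightSlot * SLOT_DURATION_SECONDS * 1000 + 500;
}
```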
In `@forester/dashboard/src/lib/api.ts`:
- Around line 38-39: The current code returns res.json() directly so a 200 OK
with non-JSON body causes an opaque SyntaxError; wrap the JSON parse in a
try/catch (or first check Content-Type) around the res.json() call, attempt
res.text() on failure, and throw a new Error that includes the request context
(HTTP status and URL) and the response body/snippet to aid debugging; update the
function that calls res.json() (refer to res and res.json() in the file) to
produce this enriched error instead of propagating the raw SyntaxError.
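One way to sketch the enriched error: a parse helper that takes the status, URL, and body text (the real code would call `res.text()` first and pass the body in; the helper name and message format are hypothetical):

```typescript
// Parse a response body that should be JSON; on failure, throw an error
// carrying the HTTP status, URL, and a body snippet instead of letting
// a context-free SyntaxError escape.
function parseJsonBody<T>(status: number, url: string, bodyText: string): T {
  try {
    return JSON.parse(bodyText) as T;
  } catch {
    const snippet = bodyText.slice(0, 200);
    throw new Error(`Non-JSON response (HTTP ${status}) from ${url}: ${snippet}`);
  }
}
```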
In `@forester/dashboard/src/lib/utils.ts`:
- Around line 6-9: The parameter name in formatSol is misleading: it is named
lamports but actually receives SOL (e.g., ForesterInfo.balance_sol), which can
cause incorrect future fixes; rename the parameter in the function signature and
all callers from lamports to sol (keep the type number | null | undefined) and
preserve the current logic (no /1e9 conversion) so formatSol(sol) returns "-"
for null/undefined or `${sol.toFixed(4)} SOL`; update references to formatSol to
use the new parameter name where applicable.
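The renamed helper, preserving the behavior the comment describes (value is already SOL, so no lamports-to-SOL division):

```typescript
// Value is already denominated in SOL (e.g., ForesterInfo.balance_sol),
// so no /1e9 conversion is applied.
function formatSol(sol: number | null | undefined): string {
  if (sol === null || sol === undefined) return "-";
  return `${sol.toFixed(4)} SOL`;
}
```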
In `@forester/src/api_server.rs`:
- Around line 381-384: The CORS setup currently uses
warp::cors().allow_any_origin() which is unsafe for production; update the code
that constructs the cors variable to read a new configuration option (e.g.,
cors_origin: Option<String> or Vec<String>) or reuse the existing
allow_public_bind flag to choose between allow_any_origin() for local/public
testing and a restricted .allow_origin(...) call for production. Locate the cors
builder in api_server.rs where cors is defined and replace the unconditional
allow_any_origin() with a conditional: if a cors_origin config is present use
warp::cors().allow_origin(...) (or map multiple origins) and permit only
necessary methods/headers, otherwise fall back to allow_any_origin() only when
allow_public_bind is true.
- Around line 85-97: CompressibleTrackers and CompressibleDashboardState are
duplicates; remove CompressibleDashboardState and replace its uses with
CompressibleTrackers (which already derives Clone), update ApiServerConfig to
hold CompressibleTrackers instead of CompressibleDashboardState, and change the
place that converted CompressibleDashboardState into tracker fields to simply
clone the CompressibleTrackers (e.g., replace the manual ctoken/pda/mint mapping
with a .clone() of CompressibleTrackers); ensure all references and imports for
CompressibleDashboardState are removed and compile.
In `@forester/src/compressible/bootstrap_helpers.rs`:
- Around line 493-496: The reqwest Client is given a 30s per-request timeout
which races with the module-level tokio timeout used by send_rpc_request
(RPC_REQUEST_TIMEOUT); remove the client-level timeout to avoid double-timing
and rely on send_rpc_request's timeout instead—i.e., delete the
.timeout(Duration::from_secs(30)) call in the Client::builder() code that
constructs the client so the client is built without that per-request timeout
while leaving send_rpc_request and RPC_REQUEST_TIMEOUT unchanged.
- Around line 488-559: The non-localhost pagination in count_program_accounts
duplicates logic found in bootstrap_v2_api (payload construction, paginationKey
handling, sleep, break conditions); refactor by extracting a shared paginator
function (e.g., paginate_rpc_pages) that accepts parameters like rpc client,
rpc_url, method name ("getProgramAccountsV2"), program_id, optional dataSlice,
filters, PAGE_SIZE, DEFAULT_PAGINATION_DELAY_MS and a per-page callback or
collector; have paginate_rpc_pages use extract_accounts_array and
extract_pagination_cursor for page parsing and return collected results or
invoke the callback for each page, then update count_program_accounts to call
this paginator (and similarly update bootstrap_v2_api to reuse it) to remove
duplication and centralize retry/rate-limit behavior.
In `@forester/src/epoch_manager.rs`:
- Around line 926-927: The code casts slots_to_wait (u64) to u32 when computing
wait_duration (slot_duration() * slots_to_wait as u32), which can silently
truncate large values; change this to clamp/saturate the value before casting
(e.g., let slots_to_wait_clamped = slots_to_wait.min(u32::MAX as u64) and use
slot_duration() * (slots_to_wait_clamped as u32)) or otherwise compute the
multiplied Duration via checked arithmetic so it cannot wrap; update the
multiplication at the sleep call in epoch_manager.rs (the slot_duration() * ...
expression) and mirror the same safe clamping approach used in
calculate_remaining_time_or_default / the prewarm timeout sites.
In `@forester/src/forester_status.rs`:
- Around line 588-589: The match on tree.tree_type returns a 7-tuple
(fullness_percentage, next_index, capacity, height, threshold, queue_length,
v2_queue_info) which hurts readability; create a small struct (e.g.,
ParsedTreeMetrics) with named fields matching those seven items, update each
match arm to return ParsedTreeMetrics instead of a tuple, and then destructure
or access fields from that struct where fullness_percentage, next_index,
capacity, height, threshold, queue_length, and v2_queue_info are currently used;
update any function signatures or variable bindings that expect the tuple to use
ParsedTreeMetrics (or its fields) to preserve type correctness.
- Around line 688-691: The threshold computation in the StateV2 and AddressV2
arms recomputes 1u64 << height and uses plain multiplication, causing
inconsistency with V1; change both to compute threshold_val from the
already-calculated capacity variable using
capacity.saturating_mul(merkle_tree.metadata.rollover_metadata.rollover_threshold)
/ 100 so the multiplication is overflow-safe and you don't recalculate 1u64 <<
height (refer to capacity, threshold_val,
merkle_tree.metadata.rollover_metadata.rollover_threshold, StateV2, AddressV2,
height, saturating_mul).
In `@forester/src/main.rs`:
- Around line 147-166: The Dashboard branch (Commands::Dashboard) currently
awaits ctrl_c().await then calls api_server_handle.shutdown() with no
forced-exit fallback; update it to reuse the existing spawn_shutdown_handler (or
implement equivalent) so that after the first Ctrl+C you trigger
api_server_handle.shutdown() and also spawn a watcher that on a second Ctrl+C or
after a timeout calls std::process::exit(1); specifically replace the direct
ctrl_c().await + api_server_handle.shutdown() sequence with a call into
spawn_shutdown_handler (or wrap api_server_handle.shutdown() in a
shutdown-with-timeout helper) to ensure a forced exit if graceful shutdown
hangs.
In `@forester/src/metrics.rs`:
- Around line 343-382: The code swallows Prometheus query errors by only
handling Ok(...) for tx_total, tx_rate, last_run, balances, and queues; update
each branch that currently uses if let Ok(ref v) = ... to also handle the Err(e)
case and log the error (e.g., warn! or debug!) with context (which query failed
and error details) before falling back to the zero/default values; make these
changes around the blocks that populate transactions_processed_total,
transaction_rate, last_run_timestamp, forester_balances, and queue_lengths and
reuse extract_label_values where applicable so failures are logged but existing
fallback behavior remains.
In `@forester/src/processor/v2/helpers.rs`:
- Around line 471-473: The call to self.data_ready.wait_timeout(guard,
Duration::from_millis(50)) currently discards a potential PoisonError; update
this to handle the Result consistently with the file's lock_recover pattern by
matching the return value, calling lock_recover (or equivalent recovery) if
Err(PoisonError) is returned, and logging or propagating the recovered guard;
specifically modify the wait_timeout invocation tied to the guard produced by
lock_recover so that Err(poison) is handled (recover the mutex and continue
using the recovered guard) rather than silently ignored.
In `@forester/src/processor/v2/proof_worker.rs`:
- Line 3: Replace the hard-coded MAX_CONCURRENT_PROOFS constant with a
configurable field on ProverConfig (e.g., max_concurrent_proofs: usize)
defaulting to 64, then use that field inside spawn_proof_workers and any
functions in proof_worker.rs that currently reference MAX_CONCURRENT_PROOFS;
update spawn_proof_workers' parameter usage to read config.max_concurrent_proofs
(or fall back to 64 if None) and remove or deprecate the constant so deployments
can tune concurrency via ProverConfig.
- Around line 182-186: The code stores the Result from
semaphore.clone().acquire_owned().await into permit and then moves that Result
into the spawned task as _permit, which means the actual OwnedSemaphorePermit
isn’t held and errors are silently ignored; change the await to unwrap/expect
(or match and early-return/log) so you obtain an OwnedSemaphorePermit before
spawning. Concretely, call semaphore.clone().acquire_owned().await.expect("...")
(or handle Err explicitly) so permit is an OwnedSemaphorePermit, then move that
permit into the tokio::spawn closure and bind it to _permit to keep the permit
alive for the task.
In `@forester/src/slot_tracker.rs`:
- Around line 13-16: The slot duration is hardcoded to 400ms in SLOT_DURATION_MS
and exposed via slot_duration(), but the frontend uses a different value
(460ms); to avoid divergence either extract a shared canonical constant (e.g.,
move SLOT_DURATION_MS/slot_duration to a shared config crate or an exported API
consumed by the dashboard) or, at minimum, add a clear comment above
SLOT_DURATION_MS stating it is the canonical slot duration (400ms) and call out
the frontend's 460ms polling approximation and where to change it; update
references to SLOT_DURATION_MS/slot_duration() accordingly so all code points
use the single source of truth.
In `@forester/src/telemetry.rs`:
- Line 42: The file_layer created with
fmt::Layer::new().with_writer(non_blocking) should explicitly disable ANSI
escape codes; update the file_layer construction (symbol: file_layer, function:
fmt::Layer::new()) to call .with_ansi(false) so logs written via the
non_blocking writer do not contain terminal ANSI sequences.
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs
🧹 Nitpick | 🔵 Trivial
Consolidate consecutive RUN instructions.
These two RUN commands can be merged into one layer, reducing image size slightly and addressing the Hadolint DL3059 hint.
♻️ Proposed fix
-RUN addgroup --system --gid 1001 nodejs
-RUN adduser --system --uid 1001 nextjs
+RUN addgroup --system --gid 1001 nodejs && \
+    adduser --system --uid 1001 nextjs
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🧰 Tools
🪛 Hadolint (2.14.0)
[info] 18-18: Multiple consecutive RUN instructions. Consider consolidation.
(DL3059)
🤖 Prompt for AI Agents
In `@forester/dashboard/Dockerfile` around lines 17 - 18, Combine the two
consecutive Docker RUN layers into a single RUN to reduce image layers and
satisfy DL3059: replace the separate RUN addgroup --system --gid 1001 nodejs and
RUN adduser --system --uid 1001 nextjs with one RUN that executes both commands
(using shell chaining such as &&) so both addgroup and adduser run in the same
layer and preserve exit-on-failure behavior.
USER nextjs
EXPOSE 3000
ENV PORT=3000
CMD ["node", "server.js"]
🧹 Nitpick | 🔵 Trivial
Consider adding a HEALTHCHECK instruction.
Both Trivy and Checkov flag the missing HEALTHCHECK. For a production dashboard container, a simple health check helps orchestrators (Docker Compose, ECS, K8s with Docker health probes) detect when the Next.js server is unresponsive.
🩺 Example HEALTHCHECK
 USER nextjs
 EXPOSE 3000
 ENV PORT=3000
+HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/ || exit 1
 CMD ["node", "server.js"]
Note: node:20-alpine includes wget (via BusyBox). The health command uses the shell form because `||` is not interpreted inside the JSON exec form — it would be passed to wget as a literal argument. Adjust the endpoint to a dedicated health route if one exists.
🧰 Tools
🪛 Checkov (3.2.334)
[low] 1-26: Ensure that HEALTHCHECK instructions have been added to container images
(CKV_DOCKER_2)
🤖 Prompt for AI Agents
In `@forester/dashboard/Dockerfile` around lines 23 - 26, Add a Docker HEALTHCHECK
instruction to the Dockerfile to probe the running Next.js server (after the
existing USER nextjs / EXPOSE 3000 / ENV PORT=3000 / CMD ["node","server.js"]
lines): implement a lightweight HTTP probe (e.g., wget or curl against
http://localhost:3000/ or a dedicated /health route if available) with sensible
parameters (interval, timeout, start-period and retries) so the container
runtime can mark the container unhealthy when the server is unresponsive.
const isEmpty =
  Object.keys(metrics.transactions_processed_total).length === 0 &&
  Object.keys(metrics.forester_balances).length === 0;
🧹 Nitpick | 🔵 Trivial
Consider including queue_lengths in the empty-state check.
The emptiness check only looks at transactions_processed_total and forester_balances. If the API returns data solely in queue_lengths or transaction_rate, the page will still show the MetricsPanel — which is probably fine since MetricsPanel conditionally renders those sections. Just flagging in case the intent was to detect "truly no data at all."
🤖 Prompt for AI Agents
In `@forester/dashboard/src/app/metrics/page.tsx` around lines 22 - 24, The
empty-state check for isEmpty currently only inspects
metrics.transactions_processed_total and metrics.forester_balances; update it to
also consider metrics.queue_lengths (and optionally metrics.transaction_rate) so
that isEmpty truly reflects no data across all metric groups. Locate the isEmpty
declaration and include Object.keys(metrics.queue_lengths).length === 0 (and
Object.keys(metrics.transaction_rate).length === 0 if you want to treat
transaction_rate as contributing to non-empty) in the combined boolean
expression. Ensure you use the same metrics object keys (metrics.queue_lengths,
metrics.transaction_rate) and keep the logical AND semantics so the page only
treats it as empty when all relevant metric maps are empty.
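A minimal sketch of the broader check suggested above. The MetricsLike shape is illustrative — the real MetricsSnapshot type lives in the dashboard's API client — but the combined logical-AND semantics are what the comment describes:

```typescript
// Hypothetical shape mirroring the metric groups referenced in the review.
type MetricsLike = {
  transactions_processed_total: Record<string, number>;
  transaction_rate: Record<string, number>;
  forester_balances: Record<string, number>;
  queue_lengths: Record<string, number>;
};

// True only when every metric group is empty, so the page shows the
// empty state only when there is truly no data at all.
function isMetricsEmpty(m: MetricsLike): boolean {
  return (
    Object.keys(m.transactions_processed_total).length === 0 &&
    Object.keys(m.transaction_rate).length === 0 &&
    Object.keys(m.forester_balances).length === 0 &&
    Object.keys(m.queue_lengths).length === 0
  );
}
```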
if (
  status.slots_until_next_registration < 1000 &&
  status.registration_epoch_foresters.length === 0
) {
  warnings.push("Registration closing soon with no foresters registered");
}
🧹 Nitpick | 🔵 Trivial
Magic number 1000 for registration slot threshold.
This threshold controls when the "Registration closing soon" warning appears. Consider extracting it as a named constant to make the intent clearer and easier to tune:
+const REGISTRATION_WARNING_SLOT_THRESHOLD = 1000;
+
if (
- status.slots_until_next_registration < 1000 &&
+ status.slots_until_next_registration < REGISTRATION_WARNING_SLOT_THRESHOLD &&
status.registration_epoch_foresters.length === 0
 )
🤖 Prompt for AI Agents
In `@forester/dashboard/src/app/page.tsx` around lines 29 - 34, The literal 1000
should be replaced with a named constant to clarify intent and make tuning
easier: add a top-level constant like REGISTRATION_WARNING_SLOT_THRESHOLD (or
similar) in page.tsx and use it in the check inside the block that currently
references status.slots_until_next_registration (the if that also checks
status.registration_epoch_foresters.length). Replace the magic number with that
constant, choose a descriptive name and add a brief comment describing what the
threshold represents.
{warnings.map((w, i) => (
  <div
    key={i}
    className="bg-amber-50 border border-amber-200 rounded-lg px-4 py-3 text-sm text-amber-800"
  >
    {w}
  </div>
))}
🧹 Nitpick | 🔵 Trivial
Consider using the warning string as key instead of array index.
Since warning messages are unique strings derived deterministically from the status, they make better React keys than array indices. This is a minor nit — with a small, non-reorderable list it won't cause bugs, but it's slightly more idiomatic:
- {warnings.map((w, i) => (
- <div key={i}
+ {warnings.map((w) => (
+ <div key={w}
🤖 Prompt for AI Agents
In `@forester/dashboard/src/app/page.tsx` around lines 45 - 52, Replace the array
index key in the warnings rendering (the warnings.map callback) with the warning
string itself: use the iteration value (the warning string variable passed as w)
as the React key in the div (ensuring it’s unique/deterministic), or fall back
to a stable derived id if you have any risk of duplicates; update the key prop
in the map inside the component that renders warnings (the warnings.map
callback) accordingly.
let _ = self
    .data_ready
    .wait_timeout(guard, std::time::Duration::from_millis(50));
🛠️ Refactor suggestion | 🟠 Major
Handle poisoned condvar consistently with lock_recover pattern.
Throughout this file, mutex poisoning is handled gracefully via lock_recover. But wait_timeout on a poisoned mutex will return Err(PoisonError), and the current let _ = ... silently discards it. If the mutex was poisoned between the lock_recover at line 467 and the wait_timeout at line 471, the error is swallowed without logging.
For consistency with the rest of the file's poison-recovery strategy:
🛡️ Proposed fix
- let _ = self
+ let (_guard, _timeout_result) = self
.data_ready
- .wait_timeout(guard, std::time::Duration::from_millis(50));
+ .wait_timeout(guard, std::time::Duration::from_millis(50))
+ .unwrap_or_else(|poisoned| {
+ tracing::warn!("Poisoned condvar (recovering): streaming_address_queue.data_ready");
+ poisoned.into_inner()
+    });
🤖 Prompt for AI Agents
In `@forester/src/processor/v2/helpers.rs` around lines 471 - 473, The call to
self.data_ready.wait_timeout(guard, Duration::from_millis(50)) currently
discards a potential PoisonError; update this to handle the Result consistently
with the file's lock_recover pattern by matching the return value, calling
lock_recover (or equivalent recovery) if Err(PoisonError) is returned, and
logging or propagating the recovered guard; specifically modify the wait_timeout
invocation tied to the guard produced by lock_recover so that Err(poison) is
handled (recover the mutex and continue using the recovered guard) rather than
silently ignored.
@@ -1,5 +1,7 @@
 use std::{sync::Arc, time::Duration};

 const MAX_CONCURRENT_PROOFS: usize = 64;
🧹 Nitpick | 🔵 Trivial
Consider making the concurrency limit configurable via ProverConfig.
MAX_CONCURRENT_PROOFS = 64 is a reasonable default, but the optimal value depends on the prover server's capacity, which can vary across deployments. ProverConfig is already threaded into spawn_proof_workers — adding a field with a default of 64 would let operators tune this without a recompile.
🤖 Prompt for AI Agents
In `@forester/src/processor/v2/proof_worker.rs` at line 3, Replace the hard-coded
MAX_CONCURRENT_PROOFS constant with a configurable field on ProverConfig (e.g.,
max_concurrent_proofs: usize) defaulting to 64, then use that field inside
spawn_proof_workers and any functions in proof_worker.rs that currently
reference MAX_CONCURRENT_PROOFS; update spawn_proof_workers' parameter usage to
read config.max_concurrent_proofs (or fall back to 64 if None) and remove or
deprecate the constant so deployments can tune concurrency via ProverConfig.
  let permit = semaphore.clone().acquire_owned().await;
  // Spawn immediately so we don't block receiving the next job
- // while waiting for HTTP submission
+ // while waiting for HTTP submission. Semaphore bounds concurrency.
  tokio::spawn(async move {
      let _permit = permit;
🧹 Nitpick | 🔵 Trivial
Unwrap the semaphore permit to avoid silently ignoring errors.
acquire_owned() returns Result<OwnedSemaphorePermit, AcquireError>. Right now, permit is a Result and on line 186 _permit holds that Result, not the actual permit. If acquire_owned ever errors (semaphore closed), the task would run without holding a permit—bypassing the concurrency bound silently. In practice AcquireError can't fire here because the Arc<Semaphore> is never dropped, but making the intent explicit is much cleaner and will surface bugs if the code evolves.
Proposed fix
- let permit = semaphore.clone().acquire_owned().await;
+ let permit = semaphore
+ .clone()
+ .acquire_owned()
+ .await
+ .expect("semaphore closed unexpectedly");
// Spawn immediately so we don't block receiving the next job
// while waiting for HTTP submission. Semaphore bounds concurrency.
tokio::spawn(async move {
        let _permit = permit;
🤖 Prompt for AI Agents
In `@forester/src/processor/v2/proof_worker.rs` around lines 182 - 186, The code
stores the Result from semaphore.clone().acquire_owned().await into permit and
then moves that Result into the spawned task as _permit, which means the actual
OwnedSemaphorePermit isn’t held and errors are silently ignored; change the
await to unwrap/expect (or match and early-return/log) so you obtain an
OwnedSemaphorePermit before spawning. Concretely, call
semaphore.clone().acquire_owned().await.expect("...") (or handle Err explicitly)
so permit is an OwnedSemaphorePermit, then move that permit into the
tokio::spawn closure and bind it to _permit to keep the permit alive for the
task.
  const SLOT_DURATION_MS: u64 = 400;

  pub fn slot_duration() -> Duration {
-     Duration::from_nanos(solana_sdk::genesis_config::GenesisConfig::default().ns_per_slot() as u64)
+     Duration::from_millis(SLOT_DURATION_MS)
Hardcoded 400ms slot duration — reasonable, but note the frontend inconsistency.
400ms is Solana's target slot time, so this is a sensible default. However, there's a discrepancy with the dashboard frontend: in forester/dashboard/src/hooks/useForesterStatus.ts (line 9), the refresh interval calculation uses 0.46 * 1000 (460ms) per slot. While both are approximations and the frontend usage is just for polling cadence, having two different "slot duration" constants across the codebase is a maintenance hazard.
Consider extracting a shared constant or at least adding a comment here noting the canonical value, so future readers know which to reference.
🤖 Prompt for AI Agents
In `@forester/src/slot_tracker.rs` around lines 13 - 16, The slot duration is
hardcoded to 400ms in SLOT_DURATION_MS and exposed via slot_duration(), but the
frontend uses a different value (460ms); to avoid divergence either extract a
shared canonical constant (e.g., move SLOT_DURATION_MS/slot_duration to a shared
config crate or an exported API consumed by the dashboard) or, at minimum, add a
clear comment above SLOT_DURATION_MS stating it is the canonical slot duration
(400ms) and call out the frontend's 460ms polling approximation and where to
change it; update references to SLOT_DURATION_MS/slot_duration() accordingly so
all code points use the single source of truth.
let (non_blocking, guard) = tracing_appender::non_blocking(file_appender);
let _ = LOG_GUARD.set(guard);

let file_layer = fmt::Layer::new().with_writer(non_blocking);
File layer should disable ANSI escape codes.
The stdout_layer correctly sets .with_ansi(true) for terminal coloring, but the file_layer doesn't set .with_ansi(false). By default, fmt::Layer may include ANSI codes depending on the writer, and writing them to log files produces garbled output when viewed in editors or ingested by log aggregation tools.
Proposed fix
- let file_layer = fmt::Layer::new().with_writer(non_blocking);
+ let file_layer = fmt::Layer::new().with_writer(non_blocking).with_ansi(false);
🤖 Prompt for AI Agents
In `@forester/src/telemetry.rs` at line 42, The file_layer created with
fmt::Layer::new().with_writer(non_blocking) should explicitly disable ANSI
escape codes; update the file_layer construction (symbol: file_layer, function:
fmt::Layer::new()) to call .with_ansi(false) so logs written via the
non_blocking writer do not contain terminal ANSI sequences.