
Added Grafana dashboard and a couple of fixes #638

Merged
tcsenpai merged 15 commits into testnet from grafana on Jan 11, 2026
Conversation

tcsenpai (Contributor) commented Jan 11, 2026

User description

Besides adding Prometheus + Grafana for observability, this PR fixes a TUI unresponsiveness bug and updates README.md and INSTALL.md to be more user friendly.


PR Type

Enhancement, Documentation


Description

  • Added comprehensive Prometheus metrics collection and Grafana monitoring stack for observability

    • New MetricsCollector service actively gathers blockchain, system, network, and service health metrics
    • New MetricsService implements Prometheus registry with support for counters, gauges, histograms, and summaries
    • New MetricsServer exposes /metrics endpoint on dedicated port (default 9090) for Prometheus scraping
  • Created four production-ready Grafana dashboards for monitoring

    • DEMOS Network Node Overview dashboard with real-time blockchain and resource metrics
    • Network Peers and Connectivity dashboard for peer health monitoring
    • Consensus and Blockchain dashboard with block production analysis
    • System Health dashboard with CPU, memory, Docker container, and port health checks
  • Integrated monitoring stack into main application with Docker Compose configuration

    • Added Prometheus v2.48.0 and Grafana v10.2.2 services with persistent volumes
    • Implemented monitoring startup/shutdown in main run script with health checks
    • Added METRICS_ENABLED and METRICS_PORT configuration options
  • Fixed TUI unresponsiveness issue by preventing stdin manipulation when TUI is enabled

  • Replaced terminal-kit dependency with structured logging across 11 modules for consistency

  • Added support for new peer data format with backward compatibility

  • Optimized TUI performance with log caching and debounced updates

  • Updated documentation (README.md, INSTALL.md, monitoring/README.md) with comprehensive setup and configuration guides

  • Added prom-client v15.1.3 dependency and bumped version to 0.47.0


Diagram Walkthrough

flowchart LR
  A["Node Subsystems<br/>Blockchain, System,<br/>Network, Health"]
  B["MetricsCollector<br/>Active Collection<br/>2.5s Interval"]
  C["MetricsService<br/>Prometheus Registry<br/>prom-client"]
  D["MetricsServer<br/>HTTP Endpoint<br/>Port 9090"]
  E["Prometheus<br/>Scraping<br/>15s Interval"]
  F["Grafana<br/>4 Dashboards<br/>Visualization"]
  
  A -- "collects" --> B
  B -- "registers" --> C
  C -- "exposes" --> D
  D -- "scrapes" --> E
  E -- "queries" --> F

File Walkthrough

Relevant files
Enhancement
12 files
MetricsCollector.ts
Active metrics collection from node subsystems                     

src/features/metrics/MetricsCollector.ts

  • New metrics collector service that actively gathers metrics from
    blockchain, system, network, and service health subsystems
  • Implements singleton pattern with configurable collection intervals
    (default 2.5 seconds)
  • Collects blockchain metrics (block height, transaction counts,
    timestamps), system metrics (CPU, memory, load average), network I/O
    rates, peer information, and service health checks
  • Includes Docker container health monitoring and port accessibility
    checks for critical services
+722/-0 
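The collector's core pattern (a singleton running a periodic collection pass) can be sketched as follows. This is an illustrative shape only, not the PR's actual implementation; the real MetricsCollector also gathers blockchain, network, and Docker metrics, and the field and method names here are assumptions.

```typescript
import os from "node:os"

// Sketch of a singleton collector with a configurable interval
// (2.5 s default, matching the PR description).
export class MetricsCollector {
    private static instance: MetricsCollector | null = null
    private timer: ReturnType<typeof setInterval> | null = null

    static getInstance(): MetricsCollector {
        if (!this.instance) this.instance = new MetricsCollector()
        return this.instance
    }

    // One collection pass; returns the gathered values for illustration
    collectSystemMetrics(): { loadAvg1m: number; heapUsedBytes: number } {
        return {
            loadAvg1m: os.loadavg()[0],
            heapUsedBytes: process.memoryUsage().heapUsed,
        }
    }

    start(intervalMs = 2500): void {
        this.timer = setInterval(() => this.collectSystemMetrics(), intervalMs)
    }

    stop(): void {
        if (this.timer) clearInterval(this.timer)
        this.timer = null
    }
}
```

In the real service each pass would push these values into the metrics registry rather than returning them.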
MetricsService.ts
Prometheus metrics registry and management service             

src/features/metrics/MetricsService.ts

  • Core Prometheus metrics registry and management service using
    prom-client library
  • Implements singleton pattern for global metrics access with support
    for counters, gauges, histograms, and summaries
  • Registers core Demos metrics for consensus, network, transactions,
    API, IPFS, and GCR subsystems
  • Provides methods to create, update, and retrieve metrics in Prometheus
    format
+530/-0 
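To make the registry idea concrete, here is a minimal Prometheus-style gauge registry that renders the text exposition format. It is a self-contained sketch in the spirit of what MetricsService does; the real service wraps the prom-client library rather than implementing this by hand, and all names here are illustrative.

```typescript
type Labels = Record<string, string>

// Tiny gauge-only registry sketch (illustrative; not prom-client's API)
class MiniRegistry {
    private gauges = new Map<string, { help: string; values: Map<string, number> }>()

    // Canonical label serialization, e.g. {container="ipfs"}
    private labelKey(labels: Labels): string {
        const parts = Object.entries(labels)
            .sort(([a], [b]) => a.localeCompare(b))
            .map(([k, v]) => `${k}="${v}"`)
        return parts.length ? `{${parts.join(",")}}` : ""
    }

    createGauge(name: string, help: string): void {
        if (!this.gauges.has(name)) this.gauges.set(name, { help, values: new Map() })
    }

    setGauge(name: string, value: number, labels: Labels = {}): void {
        const g = this.gauges.get(name)
        if (!g) throw new Error(`unknown gauge: ${name}`)
        g.values.set(this.labelKey(labels), value)
    }

    // Render in the Prometheus text exposition format
    metrics(): string {
        const lines: string[] = []
        for (const [name, g] of this.gauges) {
            lines.push(`# HELP ${name} ${g.help}`)
            lines.push(`# TYPE ${name} gauge`)
            for (const [key, v] of g.values) lines.push(`${name}${key} ${v}`)
        }
        return lines.join("\n") + "\n"
    }
}
```

prom-client provides the same pieces (a `Registry`, `Gauge`, `Counter`, `Histogram`, `Summary`) with production-grade label handling, which is why the PR depends on it instead.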
MetricsServer.ts
HTTP server for Prometheus metrics endpoint                           

src/features/metrics/MetricsServer.ts

  • HTTP server for exposing Prometheus metrics endpoint on dedicated port
    (default 9090)
  • Provides /metrics endpoint for Prometheus scraping, /health for health
    checks, and / for service info
  • Runs on separate port from main RPC server to avoid interference
  • Handles request routing and error management for metrics exposure
+168/-0 
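The three endpoints described above can be captured as a pure routing function. This is only a sketch of the routing logic; the actual MetricsServer binds an HTTP listener on the metrics port, and the response bodies shown here are assumptions.

```typescript
interface RouteResult {
    status: number
    contentType: string
    body: string
}

// Routes the metrics server's three documented paths; anything else is 404
function routeMetricsRequest(path: string, renderMetrics: () => string): RouteResult {
    switch (path) {
        case "/metrics":
            // The endpoint Prometheus scrapes
            return {
                status: 200,
                contentType: "text/plain; version=0.0.4",
                body: renderMetrics(),
            }
        case "/health":
            return { status: 200, contentType: "application/json", body: '{"status":"ok"}' }
        case "/":
            return {
                status: 200,
                contentType: "application/json",
                body: '{"service":"demos-metrics"}',
            }
        default:
            return { status: 404, contentType: "text/plain", body: "Not Found" }
    }
}
```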
index.ts
Metrics module public API exports                                               

src/features/metrics/index.ts

  • Module barrel export file for metrics feature
  • Exports MetricsService, MetricsServer, and MetricsCollector classes
    with their configurations
  • Provides clean public API for metrics functionality
+26/-0   
index.ts
Integrate Prometheus metrics and fix TUI stdin conflict   

src/index.ts

  • Integrated metrics server startup in main application initialization
  • Added metrics configuration to indexState with METRICS_ENABLED and
    METRICS_PORT settings
  • Starts metrics server and collector during node startup with automatic
    port allocation
  • Added metrics server shutdown handling in graceful shutdown routine
  • Fixed TUI unresponsiveness issue by preventing stdin manipulation when
    TUI is enabled
+113/-36
PeerManager.ts
Support new peer data format with backward compatibility 

src/libs/peer/PeerManager.ts

  • Added support for both old (string) and new (object with url property)
    peer data formats
  • Includes validation and error handling for invalid peer data formats
  • Maintains backward compatibility with existing peer list
    configurations
+12/-1   
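The dual-format handling can be sketched as a small normalizer: accept the old plain-string entry or the new object-with-`url` entry, and reject anything else. The helper name is hypothetical; in the PR this logic lives inside `PeerManager.loadPeerList()`.

```typescript
// Old format: "pubkey" -> "http://url"
// New format: "pubkey" -> { url: "http://...", capabilities: {...} }
type PeerEntry = string | { url: string; capabilities?: Record<string, unknown> }

function resolvePeerUrl(entry: PeerEntry): string {
    if (typeof entry === "string") return entry // old string format
    if (entry && typeof entry.url === "string") return entry.url // new object format
    // Anything else would previously surface as the
    // '"[object Object]" cannot be parsed as a URL' TypeError
    throw new TypeError(`Invalid peer entry: ${JSON.stringify(entry)}`)
}
```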
CategorizedLogger.ts
Cache getAllEntries to improve log retrieval performance 

src/utilities/tui/CategorizedLogger.ts

  • Added caching mechanism for getAllEntries() to avoid repeated sorting
    on every call
  • Cache is invalidated when entry counter changes or logs are cleared
  • Improves performance by reducing computational overhead in log
    retrieval
+16/-1   
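The counter-based invalidation described above can be sketched like this. Field and class names are illustrative, not the actual CategorizedLogger internals.

```typescript
interface LogEntry { seq: number; message: string }

class CachedLog {
    private entries: LogEntry[] = []
    private counter = 0
    private cache: LogEntry[] | null = null
    private cachedAtCounter = -1

    add(message: string): void {
        this.entries.push({ seq: this.counter++, message })
    }

    clear(): void {
        this.entries = []
        this.counter = 0
        this.cache = null // invalidate on clear
    }

    // Sorted view is recomputed only when the entry counter has moved
    getAllEntries(): LogEntry[] {
        if (this.cache && this.cachedAtCounter === this.counter) return this.cache
        this.cache = [...this.entries].sort((a, b) => a.seq - b.seq)
        this.cachedAtCounter = this.counter
        return this.cache
    }
}
```

Repeated calls between writes return the cached array, skipping the sort that previously ran on every call.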
TUIManager.ts
Debounce log updates for improved TUI performance               

src/utilities/tui/TUIManager.ts

  • Optimized log rendering by deferring filtered log updates until render
    cycle
  • Added logsNeedUpdate flag to mark when logs need refreshing instead of
    updating on every entry
  • Reduces unnecessary computations during high-frequency log events
+18/-9   
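The deferred-update pattern can be sketched as follows: log events only set a flag, and the expensive filtering runs at most once per render cycle. Names here are hypothetical; only the `logsNeedUpdate` flag comes from the PR.

```typescript
class DebouncedView {
    private logsNeedUpdate = false
    private filtered: string[] = []
    recomputes = 0 // exposed only to illustrate the saving

    constructor(private source: () => string[]) {}

    // Called on every log entry: cheap, no filtering here
    onLogEntry(): void {
        this.logsNeedUpdate = true
    }

    // Filtering happens lazily, once per render that follows new entries
    render(): string[] {
        if (this.logsNeedUpdate) {
            this.filtered = this.source().filter((l) => l.length > 0)
            this.recomputes++
            this.logsNeedUpdate = false
        }
        return this.filtered
    }
}
```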
demos-overview.json
DEMOS Network Node Overview Grafana Dashboard                       

monitoring/grafana/provisioning/dashboards/json/demos-overview.json

  • Added comprehensive Grafana dashboard for DEMOS Network node
    monitoring with 1187 lines of JSON configuration
  • Includes real-time blockchain status panels (node RPC health, block
    height, block lag, peer count, transaction count, RPC latency)
  • Displays node resource utilization metrics (CPU, memory usage, load
    average)
  • Shows infrastructure status (Docker services, network ports, network
    I/O)
  • Configured with Prometheus datasource and 5-second refresh interval
+1187/-0
network-peers.json
Network Peers and Connectivity Grafana Dashboard                 

monitoring/grafana/provisioning/dashboards/json/network-peers.json

  • Added Grafana dashboard for peer connectivity monitoring with 916
    lines of JSON configuration
  • Includes peer connectivity metrics (online/offline peers, total known
    peers, peer health percentage)
  • Displays network I/O analysis with RX/TX rates and total bytes
    transmitted/received
  • Contains peer details table with status information
  • Configured with 5-second refresh interval and multi-metric time series
    visualizations
+916/-0 
consensus-blockchain.json
Consensus and Blockchain Monitoring Grafana Dashboard       

monitoring/grafana/provisioning/dashboards/json/consensus-blockchain.json

  • Added Grafana dashboard for blockchain consensus monitoring with 680
    lines of JSON configuration
  • Includes block production metrics (current block height, seconds since
    last block, transactions per block, last block timestamp)
  • Displays block timing analysis with production rate and timing
    thresholds
  • Configured with color-coded alerts for block production delays (green
    <30s, yellow 30-60s, red >60s)
  • Provides time series visualizations for block height trends and
    transaction patterns
+680/-0 
run
Monitoring Stack Integration in Main Run Script                   

run

  • Added MONITORING_DISABLED flag to control Prometheus/Grafana stack
    startup (default: enabled)
  • Added -m and --no-monitoring command-line options to disable
    monitoring
  • Implemented monitoring stack startup logic with health checks for
    Grafana readiness
  • Added monitoring stack shutdown in cleanup function (ctrl_c)
  • Displays monitoring URLs (Prometheus and Grafana) on successful
    startup
+73/-1   
Refactoring
9 files
endpointHandlers.ts
Replace terminal-kit with structured logging                         

src/libs/network/endpointHandlers.ts

  • Removed terminal-kit dependency and replaced with logger calls
  • Converted colored terminal output (term.yellow, term.red) to
    structured log calls with categories
  • Improved logging consistency across transaction handling and
    validation routines
+6/-12   
validateTransaction.ts
Replace terminal-kit with structured logging                         

src/libs/blockchain/routines/validateTransaction.ts

  • Removed terminal-kit dependency and replaced with logger calls
  • Converted terminal output to structured logging with "TX" category
  • Improved error message formatting in transaction validation
+5/-19   
DAHRFactory.ts
Replace terminal-kit with structured logging                         

src/features/web2/dahr/DAHRFactory.ts

  • Removed terminal-kit dependency and replaced with logger calls
  • Converted colored terminal output to structured logging with "DAHR"
    category
  • Improved consistency in DAHR instance lifecycle logging
+5/-10   
manageExecution.ts
Replace terminal-kit with structured logging                         

src/libs/network/manageExecution.ts

  • Removed terminal-kit dependency and replaced with logger calls
  • Converted terminal output to structured logging with "SERVER" category
  • Improved error and info message formatting
+3/-8     
keyMaker.ts
Replace terminal-kit with structured logging                         

src/libs/utils/keyMaker.ts

  • Removed terminal-kit dependency and replaced with logger calls
  • Converted terminal output to structured logging with "KEYMAKER"
    category
  • Improved consistency in identity generation and loading messages
+8/-10   
manageAuth.ts
Replace terminal-kit with structured logging                         

src/libs/network/manageAuth.ts

  • Removed terminal-kit dependency and replaced with logger calls
  • Consolidated duplicate logging statements into single structured log
    calls
  • Improved authentication message logging consistency
+3/-9     
cryptography.ts
Replace terminal-kit with structured logging                         

src/libs/crypto/cryptography.ts

  • Removed terminal-kit dependency and replaced with logger calls
  • Converted terminal output to structured logging with "CRYPTO" category
  • Changed term.yellow to log.debug and term.red to log.error
    appropriately
+8/-9     
identity.ts
Replace terminal-kit with structured logging                         

src/libs/identity/identity.ts

  • Removed terminal-kit dependency and replaced with logger calls
  • Converted terminal output to structured logging with "IDENTITY"
    category
  • Improved consistency in identity lifecycle logging
+2/-5     
gcr.ts
Replace terminal-kit with structured logging                         

src/libs/blockchain/gcr/gcr.ts

  • Removed terminal-kit dependency and replaced with logger calls
  • Changed term.yellow to log.debug for balance lookup messages
+1/-4     
Formatting
1 file
PointSystem.ts
Use const for immutable variable declaration                         

src/features/incentive/PointSystem.ts

  • Changed let linkedNomis declaration to const for immutability
+1/-1     
Documentation
4 files
system-health.json
Grafana system health dashboard with resource and service monitoring

monitoring/grafana/provisioning/dashboards/json/system-health.json

  • New Grafana dashboard for system health monitoring with 1321 lines of
    configuration
  • Includes panels for CPU usage, memory usage, load average, Docker
    container status, and port health checks
  • Displays system resource metrics with color-coded thresholds and
    time-series visualizations
  • Covers service health section with status indicators for PostgreSQL,
    TLSNotary, IPFS, and critical ports
+1321/-0
README.md
Monitoring stack documentation and setup guide                     

monitoring/README.md

  • Comprehensive monitoring stack documentation for Prometheus and
    Grafana setup
  • Includes quick start guide, architecture overview, configuration
    options, and troubleshooting
  • Documents available metrics, dashboard descriptions, and security
    considerations
  • Provides Docker Compose commands and directory structure reference
+269/-0 
README.md
Add monitoring and network ports documentation                     

README.md

  • Added monitoring section documenting Prometheus and Grafana
    integration
  • Included metrics configuration options and available metrics reference
    table
  • Added network ports section documenting required and optional ports
    with firewall examples
  • Provided links to detailed monitoring documentation
+87/-0   
INSTALL.md
Updated Installation Guide with Dependencies and Port Documentation

INSTALL.md

  • Updated installation instructions to reference ./install-deps.sh
    script instead of direct bun install
  • Added note about Rust/Cargo requirement for wstcp tool installation
  • Included Rust installation instructions with curl command
  • Added comprehensive network information table documenting required and
    optional ports
  • Clarified port usage for Node RPC, OmniProtocol, TLSNotary,
    Prometheus, Grafana, and PostgreSQL
+30/-6   
Dependencies
1 file
package.json
Add prom-client dependency for metrics                                     

package.json

  • Added prom-client v15.1.3 dependency for Prometheus metrics collection
+1/-0     
Configuration changes
7 files
.local_version
Version bump for release                                                                 

.beads/.local_version

  • Version bump from 0.46.0 to 0.47.0
+1/-1     
docker-compose.yml
Docker Compose Configuration for Monitoring Stack               

monitoring/docker-compose.yml

  • Added Docker Compose configuration for Prometheus v2.48.0 and Grafana
    v10.2.2 services
  • Configured Prometheus with 15-day retention, lifecycle API enabled,
    and volume persistence
  • Configured Grafana with admin credentials, dark theme, DEMOS branding,
    and provisioning volumes
  • Included optional Node Exporter service (profile-based) for host-level
    metrics
  • Set up demos-monitoring bridge network for service communication
+130/-0 
grafana.ini
Grafana Configuration with DEMOS Branding                               

monitoring/grafana/grafana.ini

  • Added custom Grafana configuration file with 103 lines of settings
  • Configured server, dashboards, users, theme, authentication, and
    security settings
  • Enabled unified alerting, custom HTML panels, and modern feature
    toggles
  • Set dark theme as default for glass-morphism aesthetic
  • Disabled analytics, news feed, and gravatar for cleaner interface
+103/-0 
prometheus.yml
Prometheus Configuration for Node Metrics Collection         

monitoring/prometheus/prometheus.yml

  • Added Prometheus configuration file with 67 lines for DEMOS Network
    node monitoring
  • Configured global scrape interval (15s) and evaluation interval
  • Set up three scrape jobs: Prometheus self-monitoring, DEMOS node
    metrics (5s interval), and optional Node Exporter
  • Configured host.docker.internal for accessing host node metrics from
    Docker container
  • Included commented examples for multi-node production setups
+67/-0   
prometheus.yml
Grafana Prometheus Datasource Provisioning                             

monitoring/grafana/provisioning/datasources/prometheus.yml

  • Added Grafana datasource provisioning file for automatic Prometheus
    configuration
  • Configured Prometheus as default datasource with proxy access
  • Set up datasource with caching, alert management, and incremental
    query overlap settings
  • Configured for Prometheus v2.48.0 with HTTP POST method
+23/-0   
dashboard.yml
Grafana Dashboard Provisioning Configuration                         

monitoring/grafana/provisioning/dashboards/dashboard.yml

  • Added Grafana dashboard provisioning configuration file
  • Configured dashboard provider to load JSON files from
    /etc/grafana/provisioning/dashboards/json directory
  • Set up folder organization under "Demos Network" with 30-second update
    interval
  • Enabled UI updates and disabled deletion protection for dashboard
    management
+19/-0   
.env.example
Environment Variables for Prometheus Metrics Configuration

.env.example

  • Added new environment variables section for Prometheus metrics
    configuration
  • Added METRICS_ENABLED flag (default: true) to control metrics exposure
  • Added METRICS_PORT (default: 9090) and METRICS_HOST (default: 0.0.0.0)
    configuration options
  • Documented metrics endpoint availability at
    http://localhost:9090/metrics
+6/-0     

Summary by CodeRabbit

  • New Features

    • Added full Prometheus + Grafana monitoring stack and local metrics server; metrics collection for blockchain, system, network, peers, containers, and ports.
    • CLI option to disable the monitoring stack.
  • Documentation

    • New monitoring guides, dashboards, compose config, and expanded network ports & install instructions (including Rust/wstcp notes).
  • Chores

    • Version bumps and added runtime metrics dependency.
  • Style

    • Replaced terminal-based prints with centralized logging across the app.


tcsenpai and others added 9 commits January 10, 2026 14:56
Implements Prometheus metrics collection for Demos Network node monitoring.

Components:
- MetricsService: Singleton service for metric registration and collection
- MetricsServer: HTTP server exposing /metrics and /health endpoints

Metrics Categories:
- System: node_uptime_seconds, node_info
- Consensus: rounds_total, round_duration, block_height, mempool_size
- Network: peers_connected, peers_total, messages_sent/received, peer_latency
- Transactions: transactions_total/failed, tps, processing_seconds
- API: requests_total, request_duration, errors_total
- IPFS: pins_total, storage_bytes, peers, operations_total
- GCR: accounts_total, total_supply

Configuration:
- METRICS_ENABLED=true (default)
- METRICS_PORT=9090 (default)
- METRICS_HOST=0.0.0.0 (default)

Dependency: prom-client@15.1.3

Part of Grafana Dashboard epic (DEM-540)
Closes: DEM-541

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The peerlist format changed from:
  { "pubkey": "http://url" }
to:
  { "pubkey": { "url": "http://...", "capabilities": {...} } }

PeerManager.loadPeerList() now handles both formats.

Fixes: TypeError "[object Object]" cannot be parsed as a URL

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add Prometheus + Grafana monitoring section to README.md
- Add Network Ports section with required/optional ports to both files
- Include TCP/UDP protocol requirements for OmniProtocol and WS proxy
- Add default ports note for users with custom configurations
- Add ufw firewall examples for quick setup

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace manual bun install with ./install-deps.sh
- Add note about Rust/Cargo requirement for wstcp
- Include Rust installation instructions in full guide

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai bot commented Jan 11, 2026

Caution

Review failed

The pull request is closed.

Walkthrough

Adds a Prometheus/Grafana monitoring stack and a new metrics subsystem (MetricsService, MetricsCollector, MetricsServer), integrates metrics startup/shutdown into the main app and run script, provisions multiple Grafana dashboards, updates docs/compose config, and replaces terminal-kit outputs with centralized logging.

Changes

Cohort / File(s) Summary
Monitoring stack & docs
monitoring/*, monitoring/docker-compose.yml, monitoring/prometheus/prometheus.yml, monitoring/grafana/*, monitoring/grafana/provisioning/*, monitoring/grafana/provisioning/dashboards/json/*
New full monitoring stack: Docker Compose for Prometheus/Grafana (optional node-exporter), Prometheus scrape config, Grafana provisioning (datasource, dashboards), many dashboard JSONs, and extensive monitoring docs.
Metrics subsystem (code)
src/features/metrics/MetricsService.ts, src/features/metrics/MetricsCollector.ts, src/features/metrics/MetricsServer.ts, src/features/metrics/index.ts
New metrics implementation: Prometheus registry service, HTTP metrics server, periodic collector (blockchain, system, network, peers, HTTP, Docker, ports), and barrel exports. Public APIs and types added.
App integration & CLI
src/index.ts, run, .env.example
Integrates metrics startup/shutdown into app lifecycle, adds METRICS_* env/state, CLI/run flag to disable monitoring, and environment examples for metrics.
Docs & install
README.md, INSTALL.md, .beads/.local_version
Adds monitoring sections, ports documentation, install-script notes (Rust/wstcp), version bump (.beads) and README updates.
Dependency bump
package.json
Version bump and addition of prom-client dependency.
Logging refactor (terminal-kit → logger)
src/features/web2/dahr/DAHRFactory.ts, src/libs/* (cryptography, gcr, identity, network/, routines/), src/libs/utils/keyMaker.ts
Removed terminal-kit usage; replaced term.* calls with centralized logger (log.info/debug/error/warn). No functional changes beyond logging.
Peer & minor logic tweaks
src/libs/peer/PeerManager.ts, src/features/incentive/PointSystem.ts, src/utilities/tui/{CategorizedLogger,TUIManager}.ts, src/utilities/sharedState.ts
Peer data parsing tolerant to object-with-url, small var declaration change (let→const), logger-related performance cache and render deferral changes, and version/name bumps in shared state.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Collector as MetricsCollector
    participant Service as MetricsService
    participant Server as MetricsServer
    participant Prom as Prometheus
    participant Graf as Grafana

    App->>Collector: start()
    Collector->>Service: register metrics
    Collector->>Collector: schedule collectAll()
    Collector->>Service: update metrics (gauges/counters/histograms)
    App->>Server: start() (expose /metrics)
    Prom->>Server: GET /metrics (scrape)
    Server->>Service: getMetrics()
    Service-->>Server: metrics payload
    Server-->>Prom: 200 + metrics
    Graf->>Prom: query data
    Prom-->>Graf: timeseries
    Graf->>Graf: render dashboards
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Review effort 5/5

Suggested reviewers

  • cwilvx

Poem

🐰 I hopped in with carrots and keys,

Metrics now hum like springtime bees,
Grafana paints the network bright,
Prometheus counts through day and night,
Hooray — the node is watching the seas! 📈✨

🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive)

❌ Failed checks

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 55.56%, below the required 80.00% threshold. Resolution: add docstrings to the functions missing them.
  • Title Check ❓ Inconclusive — The title "Added Grafana dashboard and a couple of fixes" is vague and understates the scope of the changes, which include extensive Prometheus metrics infrastructure, multiple dashboard configurations, and significant documentation updates. Resolution: consider a more descriptive title, such as "Add Prometheus and Grafana monitoring stack with metrics instrumentation".

✅ Passed checks

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.


📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 56b0df9 and 060b086.

📒 Files selected for processing (15)
  • .env.example
  • INSTALL.md
  • monitoring/README.md
  • monitoring/grafana/provisioning/dashboards/json/consensus-blockchain.json
  • monitoring/grafana/provisioning/dashboards/json/demos-overview.json
  • monitoring/grafana/provisioning/dashboards/json/network-peers.json
  • monitoring/grafana/provisioning/dashboards/json/system-health.json
  • monitoring/prometheus/prometheus.yml
  • package.json
  • run
  • src/features/metrics/MetricsCollector.ts
  • src/features/metrics/MetricsService.ts
  • src/index.ts
  • src/libs/peer/PeerManager.ts
  • src/utilities/sharedState.ts



qodo-code-review bot commented Jan 11, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🔴
Private key logging

Description: The new logging prints the generated/loaded private key (log.info("KEYMAKER", "Private
Key: " + privateKey)), which can leak secret key material into logs/CI output and enable
full node impersonation if accessed.
keyMaker.ts [33-43]

Referred Code
    const publicKey = identity.publicKey.toString("hex")
    const privateKey = identity.privateKey.toString("hex")
    log.info("KEYMAKER", "\n\n====\nPublic Key: " + publicKey)
    log.info("KEYMAKER", "Private Key: " + privateKey)
    log.info("KEYMAKER", "====\n\n")
    // Save to file
    await fs.promises.writeFile("public.key", publicKey)
    await fs.promises.writeFile(".demos_identity", "0x" + privateKey)
    // Logging
    log.info("KEYMAKER", "Identity saved (or kept) to .demos_identity and public.key")
}
Unauthenticated metrics exposure

Description: The metrics server is enabled by default and binds to 0.0.0.0 (METRICS_HOST default),
exposing an unauthenticated /metrics endpoint to the network, which can leak
operational/internal data (including node_info identity label) and aid attackers in
reconnaissance.
MetricsServer.ts [21-73]

Referred Code
const DEFAULT_CONFIG: MetricsServerConfig = {
    port: parseInt(process.env.METRICS_PORT ?? "9090", 10),
    hostname: process.env.METRICS_HOST ?? "0.0.0.0",
    enabled: process.env.METRICS_ENABLED?.toLowerCase() !== "false",
}

/**
 * MetricsServer - Dedicated HTTP server for Prometheus metrics
 *
 * Usage:
 * ```typescript
 * const server = new MetricsServer()
 * await server.start()
 * // Prometheus scrapes http://localhost:9090/metrics
 * ```
 */
export class MetricsServer {
    private server: Server | null = null
    private config: MetricsServerConfig
    private metricsService: MetricsService



 ... (clipped 32 lines)
Command injection via exec

Description: Shell commands are constructed using environment-derived values (e.g., PG_PORT,
TLSNOTARY_PORT, OMNI_PORT) and executed via execAsync(...), which can allow command
injection if an attacker can influence these environment variables (e.g.,
OMNI_PORT='53551; rm -rf /').
MetricsCollector.ts [636-706]

Referred Code
private async collectDockerHealth(): Promise<void> {
    // Get ports from env to construct exact container names (matching run script)
    const pgPort = process.env.PG_PORT || "5332"
    const tlsnPort = process.env.TLSNOTARY_PORT || "7047"

    // Container names match exactly what the run script creates
    const containers = [
        { name: `postgres_${pgPort}`, displayName: "postgres" },
        { name: `tlsn-notary-${tlsnPort}`, displayName: "tlsn" },
        { name: "ipfs", displayName: "ipfs" }, // IPFS uses simple name
    ]

    for (const { name, displayName } of containers) {
        try {
            const { stdout } = await execAsync(
                `docker ps --filter "name=${name}" --format "{{.Status}}" 2>/dev/null || echo ""`,
            )
            const isUp = stdout.trim().toLowerCase().includes("up") ? 1 : 0
            this.metricsService.setGauge("service_docker_container_up", isUp, {
                container: displayName,
            })


 ... (clipped 50 lines)
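One way to close the injection gap flagged above, sketched here under assumptions (this is not the PR's code, and the helper names are hypothetical): validate that port-like environment values are purely numeric, and pass command arguments as an array via `execFile` so no shell ever interprets them.

```typescript
import { execFile } from "node:child_process"
import { promisify } from "node:util"

const execFileAsync = promisify(execFile)

// Reject anything that is not a plain port number before it nears a command
function assertValidPort(value: string): string {
    if (!/^\d{1,5}$/.test(value)) throw new Error(`Invalid port: ${value}`)
    return value
}

// execFile takes arguments as an array, so values like
// '53551; rm -rf /' are passed literally, never shell-interpreted
async function dockerContainerStatus(containerName: string): Promise<string> {
    const { stdout } = await execFileAsync("docker", [
        "ps",
        "--filter", `name=${containerName}`,
        "--format", "{{.Status}}",
    ])
    return stdout.trim()
}
```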
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed


Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed


🔴
Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Secret logged: The PR logs the generated private key (and public key) to application logs, which exposes
sensitive cryptographic material.

Referred Code
log.info("KEYMAKER", "\n\n====\nPublic Key: " + publicKey)
log.info("KEYMAKER", "Private Key: " + privateKey)
log.info("KEYMAKER", "====\n\n")


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Insecure key handling: The PR outputs private key material to logs, violating secure sensitive-data handling
requirements and increasing risk of credential compromise.

Referred Code
log.info("KEYMAKER", "\n\n====\nPublic Key: " + publicKey)
log.info("KEYMAKER", "Private Key: " + privateKey)
log.info("KEYMAKER", "====\n\n")


Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Silent exception handling: Multiple catch blocks swallow exceptions without logging any context, which can hinder
production debugging and incident triage.

Referred Code
    } catch {
        this.metricsService.setGauge("node_http_health", 0, {
            endpoint: name,
        })
        this.metricsService.setGauge("node_http_response_time_ms", 0, {
            endpoint: name,
        })
        return false
    }
}

/**
 * Check /info endpoint and extract node metadata for node_info metric
 */
private async checkInfoEndpoint(baseUrl: string): Promise<void> {
    const startTime = Date.now()
    try {
        const controller = new AbortController()
        const timeout = setTimeout(() => controller.abort(), 5000)

        const response = await fetch(`${baseUrl}/info`, {


 ... (clipped 43 lines)


Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review
Contributor

qodo-code-review bot commented Jan 11, 2026

PR Code Suggestions ✨

Latest suggestions up to 52a2537

Category | Suggestion | Impact
Incremental [*]
Fix broken peer health query

Correct the Prometheus query for the "Peer Health %" panel by changing the
metric name from demos_peers_total to demos_peer_total_count to match the
documented metric and fix a "No data" error.

monitoring/grafana/provisioning/dashboards/json/network-peers.json [284]

-"expr": "(demos_peer_online_count / (demos_peers_total + (demos_peers_total == 0))) * 100",
+"expr": "(demos_peer_online_count / (demos_peer_total_count + (demos_peer_total_count == 0))) * 100",
Suggestion importance[1-10]: 9


Why: This is a critical bug fix, as the query uses a metric name demos_peers_total that is inconsistent with the project's documentation (demos_peer_total_count), which would cause the panel to show "No data".

High
Redact identity label by default

Redact the node identity in Prometheus metrics by default for security. Make its
inclusion conditional on a new METRICS_INCLUDE_IDENTITY environment variable.

src/features/metrics/MetricsCollector.ts [621-628]

 // Set node_metadata metric with labels (value is always 1)
+const includeIdentity =
+    process.env.METRICS_INCLUDE_IDENTITY?.toLowerCase() === "true"
+
 this.metricsService.setGauge("node_metadata", 1, {
     version: info.version || "unknown",
     version_name: info.version_name || "unknown",
-    identity: info.identity
-        ? `${info.identity.slice(0, 10)}...${info.identity.slice(-6)}`
-        : "unknown",
+    identity: includeIdentity
+        ? info.identity
+            ? `${info.identity.slice(0, 10)}...${info.identity.slice(-6)}`
+            : "unknown"
+        : "redacted",
 })
Suggestion importance[1-10]: 8


Why: This is a valid security hardening suggestion that prevents leaking node identity in metrics by default, making the system more secure out-of-the-box.

Medium
Correct misleading port panel titles

Correct misleading hardcoded port numbers in the Grafana dashboard panel titles.
The titles should be port-agnostic to avoid confusion, as the actual ports are
configurable via environment variables.

monitoring/grafana/provisioning/dashboards/json/system-health.json [820-898]

-"title": "Port 5432 (PostgreSQL)",
+"title": "PostgreSQL Port (PG_PORT)",
 "type": "stat"
 ...
-"title": "Port 9000 (OmniProtocol)",
+"title": "OmniProtocol Port (OMNI_PORT)",
 "type": "stat"
 ...
-"title": "Port 7047 (TLSNotary)",
+"title": "TLSNotary Port (TLSNOTARY_PORT)",
 "type": "stat"

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that the Grafana dashboard has hardcoded, incorrect port numbers, which could mislead operators; updating the titles to be port-agnostic improves clarity and correctness.

Medium
Use rate for block throughput

Replace deriv() with rate() in the "Block Production Rate" queries to accurately
calculate blocks per minute from the demos_block_height counter and avoid
misleading negative spikes.

monitoring/grafana/provisioning/dashboards/json/consensus-blockchain.json [646-648]

-"expr": "deriv(demos_block_height[5m]) * 60",
+"expr": "rate(demos_block_height[5m]) * 60",
 "legendFormat": "Blocks/min (5m avg)",
 "refId": "A"
Suggestion importance[1-10]: 7


Why: The suggestion correctly points out that rate() is more suitable than deriv() for a monotonically increasing counter like demos_block_height, preventing negative spikes and improving the accuracy of the "Block Production Rate" panel.

Medium
Ensure single-series metadata display

Aggregate the demos_node_metadata metric using max by (version, version_name) to
ensure the Grafana stat panel reliably displays a single value, even with
multiple scrape targets.

monitoring/grafana/provisioning/dashboards/json/demos-overview.json [104]

-"expr": "demos_node_metadata",
+"expr": "max by (version, version_name) (demos_node_metadata)",
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies a potential "Multiple series" error in the Grafana stat panel and proposes a robust aggregation using max() to prevent it, improving dashboard reliability.

Low
Possible issue
Use consistent metric names

Standardize metric names in the network-peers.json dashboard by adding the
demos_ prefix to queries like peer_online_count to ensure data is displayed
correctly.

monitoring/grafana/provisioning/dashboards/json/network-peers.json [89-98]

 "targets": [
   {
     "datasource": {
       "type": "prometheus",
       "uid": "prometheus"
     },
-    "expr": "peer_online_count",
+    "expr": "demos_peer_online_count",
     "refId": "A"
   }
 ]

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9


Why: This is a critical bug fix, as the dashboard uses inconsistent metric names (peer_online_count) compared to other dashboards and documentation (demos_peer_online_count), which would cause panels to fail to load data.

High
Align dashboard to exported metrics

Standardize metric names in the consensus-blockchain.json dashboard by adding
the demos_ prefix to all queries (e.g., changing block_height to
demos_block_height) to ensure data is displayed correctly.

monitoring/grafana/provisioning/dashboards/json/consensus-blockchain.json [81-90]

 "targets": [
   {
     "datasource": {
       "type": "prometheus",
       "uid": "prometheus"
     },
-    "expr": "block_height",
+    "expr": "demos_block_height",
     "refId": "A"
   }
 ]

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9


Why: This is a critical bug fix, as the dashboard uses inconsistent metric names (block_height) compared to other dashboards and documentation (demos_block_height), which would cause panels to fail to load data.

High
Avoid overlapping metric collections

Prevent overlapping metric collections by adding a guard flag to the setInterval
callback in the start method, ensuring collectAll is not executed again if a
previous run is still in progress.

src/features/metrics/MetricsCollector.ts [95-100]

 private collectionInterval: Timer | null = null
 private running = false
+private collecting = false
 ...
 public async start(): Promise<void> {
     if (this.running) {
         log.warning("[METRICS COLLECTOR] Already running")
         return
     }
     ...
-    this.collectionInterval = setInterval(
-        async () => {
+    this.collectionInterval = setInterval(async () => {
+        if (this.collecting) return
+        this.collecting = true
+        try {
             await this.collectAll()
-        },
-        this.config.collectionIntervalMs,
-    )
+        } finally {
+            this.collecting = false
+        }
+    }, this.config.collectionIntervalMs)
 
     this.running = true
     log.info("[METRICS COLLECTOR] Started")
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a critical race condition where setInterval could cause overlapping executions of an async function, leading to excessive resource usage and unpredictable behavior.

Medium
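The guard pattern above can be factored into a reusable helper. A minimal standalone sketch (the helper name is an assumption, not project code) that skips interval ticks while a previous async run is still in flight:

```typescript
// Hypothetical helper: like setInterval, but a tick is silently skipped
// whenever the previous invocation of the async task has not finished yet.
function setNonOverlappingInterval(
    task: () => Promise<void>,
    ms: number,
): ReturnType<typeof setInterval> {
    let busy = false
    return setInterval(async () => {
        if (busy) return // previous run still in flight: skip this tick
        busy = true
        try {
            await task()
        } finally {
            busy = false
        }
    }, ms)
}
```

Dropping ticks is preferable to queueing them here: metric collection is idempotent, and the next tick simply picks up fresh values.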
Always clear fetch timeouts

Move the clearTimeout call into a finally block within checkEndpoint and
checkInfoEndpoint to ensure the timeout is always cleared, even if the fetch
call fails.

src/features/metrics/MetricsCollector.ts [545-638]

 private async checkEndpoint(
     baseUrl: string,
     path: string,
     name: string,
 ): Promise<boolean> {
     const startTime = Date.now()
+    const controller = new AbortController()
+    const timeout = setTimeout(() => controller.abort(), 5000)
+
     try {
-        const controller = new AbortController()
-        const timeout = setTimeout(() => controller.abort(), 5000)
-
         const response = await fetch(`${baseUrl}${path}`, {
             method: "GET",
             signal: controller.signal,
         })
-
-        clearTimeout(timeout)
 
         const responseTime = Date.now() - startTime
         const isHealthy = response.ok ? 1 : 0
 
         this.metricsService.setGauge("node_http_health", isHealthy, {
             endpoint: name,
         })
         this.metricsService.setGauge(
             "node_http_response_time_ms",
             responseTime,
             { endpoint: name },
         )
         return response.ok
     } catch {
         this.metricsService.setGauge("node_http_health", 0, {
             endpoint: name,
         })
         this.metricsService.setGauge("node_http_response_time_ms", 0, {
             endpoint: name,
         })
         return false
+    } finally {
+        clearTimeout(timeout)
     }
 }
 
 private async checkInfoEndpoint(baseUrl: string): Promise<void> {
     const startTime = Date.now()
+    const controller = new AbortController()
+    const timeout = setTimeout(() => controller.abort(), 5000)
+
     try {
-        const controller = new AbortController()
-        const timeout = setTimeout(() => controller.abort(), 5000)
-
         const response = await fetch(`${baseUrl}/info`, {
             method: "GET",
             signal: controller.signal,
         })
-
-        clearTimeout(timeout)
 
         const responseTime = Date.now() - startTime
         const isHealthy = response.ok ? 1 : 0
 
         this.metricsService.setGauge("node_http_health", isHealthy, {
             endpoint: "info",
         })
         this.metricsService.setGauge(
             "node_http_response_time_ms",
             responseTime,
             { endpoint: "info" },
         )
 
-        // Extract node info from response if successful
         if (response.ok) {
             const info = (await response.json()) as {
                 version?: string
                 version_name?: string
                 identity?: string
             }
 
-            // Set node_metadata metric with labels (value is always 1)
             this.metricsService.setGauge("node_metadata", 1, {
                 version: info.version || "unknown",
                 version_name: info.version_name || "unknown",
                 identity: info.identity
                     ? `${info.identity.slice(0, 10)}...${info.identity.slice(-6)}`
                     : "unknown",
             })
         }
     } catch {
         this.metricsService.setGauge("node_http_health", 0, {
             endpoint: "info",
         })
         this.metricsService.setGauge("node_http_response_time_ms", 0, {
             endpoint: "info",
         })
+    } finally {
+        clearTimeout(timeout)
     }
 }
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies a potential resource leak where timeouts are not cleared on fetch errors and proposes the correct fix using a finally block, improving code robustness.

Low
Security
Sanitize env values used in shell

Sanitize environment variables in collectDockerHealth and collectPortHealth
before using them to construct shell commands to prevent potential command
injection vulnerabilities.

src/features/metrics/MetricsCollector.ts [646-717]

 private async collectDockerHealth(): Promise<void> {
-    // Get ports from env to construct exact container names (matching run script)
-    const pgPort = process.env.PG_PORT || "5332"
-    const tlsnPort = process.env.TLSNOTARY_PORT || "7047"
-    ...
+    const pgPort = String(parseInt(process.env.PG_PORT ?? "5332", 10))
+    const tlsnPort = String(parseInt(process.env.TLSNOTARY_PORT ?? "7047", 10))
+
+    const containers = [
+        { name: `postgres_${pgPort}`, displayName: "postgres" },
+        { name: `tlsn-notary-${tlsnPort}`, displayName: "tlsn" },
+        { name: "ipfs", displayName: "ipfs" },
+    ]
+
     for (const { name, displayName } of containers) {
         try {
             const { stdout } = await execAsync(
                 `docker ps --filter "name=${name}" --format "{{.Status}}" 2>/dev/null || echo ""`,
             )
             const isUp = stdout.trim().toLowerCase().includes("up") ? 1 : 0
             this.metricsService.setGauge("service_docker_container_up", isUp, {
                 container: displayName,
             })
         } catch {
-            // Docker not available or container not found
             this.metricsService.setGauge("service_docker_container_up", 0, {
                 container: displayName,
             })
         }
     }
 }
 
 private async collectPortHealth(): Promise<void> {
-    // Read ports from environment variables (matching run script naming)
-    const postgresPort = process.env.PG_PORT || "5332"
-    const tlsnPort = process.env.TLSNOTARY_PORT || "7047"
-    const omniPort = process.env.OMNI_PORT || "53551"
-    const ipfsSwarmPort = process.env.IPFS_SWARM_PORT || "4001"
-    const ipfsApiPort = process.env.IPFS_API_PORT || "5001"
-    ...
+    const sanitizePort = (raw: string, fallback: string): string => {
+        const p = String(parseInt(raw, 10))
+        return /^\d+$/.test(p) ? p : fallback
+    }
+
+    const postgresPort = sanitizePort(process.env.PG_PORT ?? "5332", "5332")
+    const tlsnPort = sanitizePort(process.env.TLSNOTARY_PORT ?? "7047", "7047")
+    const omniPort = sanitizePort(process.env.OMNI_PORT ?? "53551", "53551")
+    const ipfsSwarmPort = sanitizePort(process.env.IPFS_SWARM_PORT ?? "4001", "4001")
+    const ipfsApiPort = sanitizePort(process.env.IPFS_API_PORT ?? "5001", "5001")
+
+    const ports = [
+        { port: postgresPort, service: "postgres" },
+        { port: tlsnPort, service: "tlsn" },
+        { port: omniPort, service: "omniprotocol" },
+        { port: ipfsSwarmPort, service: "ipfs_swarm" },
+        { port: ipfsApiPort, service: "ipfs_api" },
+    ]
+
     for (const { port, service } of ports) {
         try {
-            // Use netstat or ss to check if port is listening
             const { stdout } = await execAsync(
                 `ss -tlnp 2>/dev/null | grep ":${port} " || netstat -tlnp 2>/dev/null | grep ":${port} " || echo ""`,
             )
             const isOpen = stdout.trim().length > 0 ? 1 : 0
             this.metricsService.setGauge("service_port_open", isOpen, {
                 port,
                 service,
             })
         } catch {
             this.metricsService.setGauge("service_port_open", 0, {
                 port,
                 service,
             })
         }
     }
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a command injection vulnerability where unsanitized environment variables are interpolated into shell commands, which is a critical security risk.

High
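The sanitization idea can be distilled into one small helper. A sketch (the name and the explicit 1–65535 range check are assumptions beyond the suggestion above):

```typescript
// Hypothetical helper: coerce an env-derived value to a safe TCP port string,
// so nothing but digits can ever reach a shell interpolation.
function sanitizePort(raw: string | undefined, fallback: string): string {
    const n = Number.parseInt(raw ?? "", 10)
    return Number.isInteger(n) && n >= 1 && n <= 65535 ? String(n) : fallback
}
```

Even with sanitized inputs, using `execFile` (argument array, no shell) instead of `exec` removes the injection surface entirely and is the more robust fix.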
Restrict Grafana Live origins

To prevent potential cross-site WebSocket abuse, restrict Grafana Live's
allowed_origins from * to a specific URL, such as http://localhost:3000.

monitoring/grafana/grafana.ini [62-64]

 [live]
 # Enable Grafana Live for real-time updates
-allowed_origins = *
+# Restrict to your Grafana URL(s) to avoid cross-site websocket abuse
+allowed_origins = http://localhost:3000
Suggestion importance[1-10]: 8


Why: This is a valid security hardening suggestion that mitigates the risk of cross-site WebSocket hijacking by restricting allowed_origins instead of allowing any origin.

Medium

Previous suggestions

✅ Suggestions up to commit 56b0df9
Category | Suggestion | Impact
High-level
Simplify metrics by using standard exporters

Refactor the MetricsCollector to only export application-specific metrics.
Replace the custom collection of system metrics (CPU, memory, Docker, ports),
which currently uses shell commands, with standard Prometheus exporters like
node-exporter and blackbox-exporter for improved security, portability, and
adherence to best practices.

Examples:

src/features/metrics/MetricsCollector.ts [636-664]
    private async collectDockerHealth(): Promise<void> {
        // Get ports from env to construct exact container names (matching run script)
        const pgPort = process.env.PG_PORT || "5332"
        const tlsnPort = process.env.TLSNOTARY_PORT || "7047"

        // Container names match exactly what the run script creates
        const containers = [
            { name: `postgres_${pgPort}`, displayName: "postgres" },
            { name: `tlsn-notary-${tlsnPort}`, displayName: "tlsn" },
            { name: "ipfs", displayName: "ipfs" }, // IPFS uses simple name

 ... (clipped 19 lines)
src/features/metrics/MetricsCollector.ts [673-707]
    private async collectPortHealth(): Promise<void> {
        // Read ports from environment variables (matching run script naming)
        const postgresPort = process.env.PG_PORT || "5332"
        const tlsnPort = process.env.TLSNOTARY_PORT || "7047"
        const omniPort = process.env.OMNI_PORT || "53551"
        const ipfsSwarmPort = process.env.IPFS_SWARM_PORT || "4001"
        const ipfsApiPort = process.env.IPFS_API_PORT || "5001"

        const ports = [
            { port: postgresPort, service: "postgres" },

 ... (clipped 25 lines)

Solution Walkthrough:

Before:

// src/features/metrics/MetricsCollector.ts

class MetricsCollector {
  private async collectAll(): Promise<void> {
    await Promise.all([
      this.collectBlockchainMetrics(), // App-specific, OK
      this.collectPeerMetrics(),       // App-specific, OK

      // System metrics collected via Node.js APIs or shell commands
      this.collectSystemMetrics(),
      this.collectNetworkIOMetrics(),
      this.collectDockerHealth(),      // Uses `exec('docker ps ...')`
      this.collectPortHealth(),        // Uses `exec('ss ... || netstat ...')`
    ]);
  }

  private async collectDockerHealth(): Promise<void> {
    const { stdout } = await execAsync(`docker ps --filter "name=..." ...`);
    // ...
  }

  private async collectPortHealth(): Promise<void> {
    const { stdout } = await execAsync(`ss ... || netstat ...`);
    // ...
  }
}

After:

// src/features/metrics/MetricsCollector.ts (Simplified)

class MetricsCollector {
  // Only collects application-internal metrics
  private async collectAll(): Promise<void> {
    await Promise.all([
      this.collectBlockchainMetrics(),
      this.collectPeerMetrics(),
      this.collectNodeHttpHealth(), // Internal health check is fine
    ]);
  }

  // Methods like collectSystemMetrics, collectDockerHealth, 
  // collectPortHealth, and collectNetworkIOMetrics are removed.
}

// monitoring/prometheus/prometheus.yml (Conceptual)
scrape_configs:
  - job_name: 'demos-node'
    static_configs:
      - targets: ['host.docker.internal:9090'] // Scrapes app metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100'] // Scrapes host metrics (CPU, RAM, etc.)
  - job_name: 'blackbox-exporter'
    static_configs:
      - targets: ['postgres:5432', 'tlsn:7047'] // Probes service ports
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a major architectural issue in the new MetricsCollector, where system-level metrics are gathered using non-portable and potentially insecure shell commands, and proposes a more robust, standard, and secure solution using dedicated Prometheus exporters.

High
Possible issue
Invoke function to get state

Invoke the getSharedState() function to retrieve the shared state object.
Assigning the function reference directly to sharedState will cause a TypeError
on property access.

src/features/metrics/MetricsCollector.ts [261]

-const sharedState = getSharedState
+const sharedState = getSharedState()
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a critical bug where the getSharedState function is assigned instead of called, which would cause a TypeError at runtime.

High
Replace undefined instanceof checks

Replace instanceof TimeoutError and instanceof AbortError checks with string
comparisons against error.name to prevent potential ReferenceError exceptions in
the catch block.

src/index.ts [776-782]

 } catch (error) {
-    if (error instanceof TimeoutError) {
+    const name = error instanceof Error ? error.name : String(error)
+    if (name === "TimeoutError") {
         log.info("[MAIN] No wild peers found, starting sync loop")
-    } else if (error instanceof AbortError) {
+    } else if (name === "AbortError") {
         log.info("[MAIN] Wait aborted, starting sync loop")
     }
 }
Suggestion importance[1-10]: 8


Why: The suggestion fixes a likely ReferenceError in an error handling block by replacing instanceof checks for potentially out-of-scope error classes with a safer check on the error's name property.

Medium
Use deriv() for gauge rate calculation
Suggestion Impact: The PromQL expressions were updated from rate(...) to deriv(...) for block height rate calculations. Additionally, the metric name was changed from block_height to demos_block_height, but the core suggestion (using deriv for a gauge) was implemented for both 5m and 1m queries.

code diff:

@@ -643,7 +643,7 @@
             "type": "prometheus",
             "uid": "prometheus"
           },
-          "expr": "rate(block_height[5m]) * 60",
+          "expr": "deriv(demos_block_height[5m]) * 60",
           "legendFormat": "Blocks/min (5m avg)",
           "refId": "A"
         },
@@ -652,7 +652,7 @@
             "type": "prometheus",
             "uid": "prometheus"
           },
-          "expr": "rate(block_height[1m]) * 60",
+          "expr": "deriv(demos_block_height[1m]) * 60",
           "legendFormat": "Blocks/min (1m avg)",

Replace the rate() function with deriv() in the PromQL query for the
block_height gauge to correctly calculate its rate of change.

monitoring/grafana/provisioning/dashboards/json/consensus-blockchain.json [646-648]

-"expr": "rate(block_height[5m]) * 60",
+"expr": "deriv(block_height[5m]) * 60",
 "legendFormat": "Blocks/min (5m avg)",
 "refId": "A"

[Suggestion processed]

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies that rate() is inappropriate for a gauge metric like block_height and proposes using the correct deriv() function, which fixes a bug in the dashboard's rate calculation.

Medium
Stop metrics collector on shutdown
Suggestion Impact: The shutdown code was updated to import getMetricsCollector() and call .stop() before stopping the metrics server, and related log/error messages were adjusted accordingly.

code diff:

+        // REVIEW: Stop Metrics collector and server if running
         if (indexState.metricsServer) {
-            console.log("[SHUTDOWN] Stopping Metrics server...")
+            console.log("[SHUTDOWN] Stopping Metrics collector and server...")
             try {
+                // Stop the collector first to clear interval timer and prevent collection during shutdown
+                const { getMetricsCollector } = await import("./features/metrics")
+                getMetricsCollector().stop()
                 indexState.metricsServer.stop()
             } catch (error) {
-                console.error("[SHUTDOWN] Error stopping Metrics server:", error)
+                console.error("[SHUTDOWN] Error stopping Metrics:", error)
             }

Stop the metricsCollector during graceful shutdown to clear its interval timer
and ensure a clean process exit.

src/index.ts [901-909]

 // REVIEW: Stop Metrics server if running
 if (indexState.metricsServer) {
     console.log("[SHUTDOWN] Stopping Metrics server...")
     try {
+        // Stop the collector first to prevent it from running during shutdown
+        const { getMetricsCollector } = await import("./features/metrics")
+        getMetricsCollector().stop()
         indexState.metricsServer.stop()
     } catch (error) {
         console.error("[SHUTDOWN] Error stopping Metrics server:", error)
     }
 }

[Suggestion processed]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that the metricsCollector is not stopped during shutdown, which could prevent the process from exiting cleanly or cause errors.

Medium
Avoid duplicate metric registration
Suggestion Impact: The commit renamed the MetricsCollector gauge and corresponding setGauge usage from "node_info" to "node_metadata", resolving the metric name collision (though it used a different new name than the suggestion's "node_metadata_info").

code diff:

-        // === Node Info Metric (static labels with node metadata) ===
-        ms.createGauge(
-            "node_info",
-            "Node information with version and identity labels",
+        // === Node Metadata Metric (static labels with node metadata) ===
+        ms.createGauge(
+            "node_metadata",
+            "Node metadata with version and identity labels",
             ["version", "version_name", "identity"],
         )
 
@@ -381,6 +381,25 @@
     }
 
     /**
+     * Report basic network interface metrics with zero values
+     * Used as fallback when /proc/net/dev is unavailable (non-Linux or read error)
+     */
+    private reportBasicNetworkInterfaces(
+        interfaces: NodeJS.Dict<os.NetworkInterfaceInfo[]>,
+    ): void {
+        for (const [name] of Object.entries(interfaces)) {
+            if (name !== "lo") {
+                this.metricsService.setGauge("system_network_rx_bytes_total", 0, {
+                    interface: name,
+                })
+                this.metricsService.setGauge("system_network_tx_bytes_total", 0, {
+                    interface: name,
+                })
+            }
+        }
+    }
+
+    /**
      * Collect network I/O metrics
      */
     private async collectNetworkIOMetrics(): Promise<void> {
@@ -440,22 +459,13 @@
                         })
                     }
                 } catch {
-                    // Fall back to basic interface listing
-                    for (const [name] of Object.entries(interfaces)) {
-                        if (name !== "lo") {
-                            this.metricsService.setGauge(
-                                "system_network_rx_bytes_total",
-                                0,
-                                { interface: name },
-                            )
-                            this.metricsService.setGauge(
-                                "system_network_tx_bytes_total",
-                                0,
-                                { interface: name },
-                            )
-                        }
-                    }
+                    // Fallback for Linux if /proc/net/dev fails
+                    this.reportBasicNetworkInterfaces(interfaces)
                 }
+            } else {
+                // Fallback for non-Linux platforms (macOS, Windows)
+                // Report interface names with zero values to maintain metric consistency
+                this.reportBasicNetworkInterfaces(interfaces)
             }
 
             this.lastNetworkTime = now
@@ -608,8 +618,8 @@
                     identity?: string
                 }
 
-                // Set node_info metric with labels (value is always 1)
-                this.metricsService.setGauge("node_info", 1, {
+                // Set node_metadata metric with labels (value is always 1)
+                this.metricsService.setGauge("node_metadata", 1, {
                     version: info.version || "unknown",
                     version_name: info.version_name || "unknown",
                     identity: info.identity

Rename the "node_info" metric in MetricsCollector to avoid a name collision with
the same metric defined in MetricsService.

src/features/metrics/MetricsCollector.ts [217-222]

-// === Node Info Metric (static labels with node metadata) ===
+// === Node Metadata Metric (static labels with node metadata) ===
 ms.createGauge(
-    "node_info",
-    "Node information with version and identity labels",
+    "node_metadata_info",
+    "Node metadata with version and identity labels",
     ["version", "version_name", "identity"],
 )

[Suggestion processed]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a metric name collision between MetricsCollector and MetricsService for "node_info", preventing one from overwriting the other.

Medium
Prevent division-by-zero in PromQL query
Suggestion Impact: The PromQL expression was updated to include a zero-division guard using +(total == 0) in the denominator, preventing division-by-zero. The metrics were also renamed to demos_peer_online_count/demos_peers_total, so it is not an exact match to the suggested snippet but implements the same safeguard.

code diff:

-          "expr": "(peer_online_count / peers_total) * 100",
+          "expr": "(demos_peer_online_count / (demos_peers_total + (demos_peers_total == 0))) * 100",
           "refId": "A"

Modify the PromQL query for peer health percentage to handle cases where
peers_total is zero, preventing division-by-zero errors.

monitoring/grafana/provisioning/dashboards/json/network-peers.json [284-285]

-"expr": "(peer_online_count / peers_total) * 100",
+"expr": "(peer_online_count / (peers_total + (peers_total == 0))) * 100",
 "refId": "A"

[Suggestion processed]

Suggestion importance[1-10]: 7


Why: The suggestion provides a robust solution to prevent a potential division-by-zero error in the PromQL query, improving the reliability of the dashboard panel.

Medium
Add network metrics for non-Linux
Suggestion Impact: The commit added a helper method (reportBasicNetworkInterfaces) to emit zero-valued RX/TX metrics per interface, replaced the prior inline catch fallback with this helper, and added an explicit non-Linux else branch that invokes the same fallback to ensure metrics exist on macOS/Windows.

code diff:

@@ -381,6 +381,25 @@
     }
 
     /**
+     * Report basic network interface metrics with zero values
+     * Used as fallback when /proc/net/dev is unavailable (non-Linux or read error)
+     */
+    private reportBasicNetworkInterfaces(
+        interfaces: NodeJS.Dict<os.NetworkInterfaceInfo[]>,
+    ): void {
+        for (const [name] of Object.entries(interfaces)) {
+            if (name !== "lo") {
+                this.metricsService.setGauge("system_network_rx_bytes_total", 0, {
+                    interface: name,
+                })
+                this.metricsService.setGauge("system_network_tx_bytes_total", 0, {
+                    interface: name,
+                })
+            }
+        }
+    }
+
+    /**
      * Collect network I/O metrics
      */
     private async collectNetworkIOMetrics(): Promise<void> {
@@ -440,22 +459,13 @@
                         })
                     }
                 } catch {
-                    // Fall back to basic interface listing
-                    for (const [name] of Object.entries(interfaces)) {
-                        if (name !== "lo") {
-                            this.metricsService.setGauge(
-                                "system_network_rx_bytes_total",
-                                0,
-                                { interface: name },
-                            )
-                            this.metricsService.setGauge(
-                                "system_network_tx_bytes_total",
-                                0,
-                                { interface: name },
-                            )
-                        }
-                    }
+                    // Fallback for Linux if /proc/net/dev fails
+                    this.reportBasicNetworkInterfaces(interfaces)
                 }
+            } else {
+                // Fallback for non-Linux platforms (macOS, Windows)
+                // Report interface names with zero values to maintain metric consistency
+                this.reportBasicNetworkInterfaces(interfaces)
             }

Add an else block to the Linux-specific network I/O metrics collection to
provide a fallback for non-Linux platforms, ensuring basic metrics are reported.

src/features/metrics/MetricsCollector.ts [392-459]

 // On Linux, read from /proc/net/dev for accurate stats
 if (process.platform === "linux") {
     try {
         const fs = await import("node:fs/promises")
         const data = await fs.readFile("/proc/net/dev", "utf-8")
         const lines = data.split("\n").slice(2) // Skip header lines
 
         for (const line of lines) {
             const parts = line.trim().split(/\s+/)
             if (parts.length < 10) continue
 
             const iface = parts[0].replace(":", "")
             if (iface === "lo") continue // Skip loopback
 
             const rxBytes = parseInt(parts[1], 10)
             const txBytes = parseInt(parts[9], 10)
 
             this.metricsService.setGauge(
                 "system_network_rx_bytes_total",
                 rxBytes,
                 { interface: iface },
             )
             this.metricsService.setGauge(
                 "system_network_tx_bytes_total",
                 txBytes,
                 { interface: iface },
             )
 
             // Calculate rates
             const last = this.lastNetworkStats.get(iface)
             if (last) {
                 const rxRate = (rxBytes - last.rx) / timeDeltaSec
                 const txRate = (txBytes - last.tx) / timeDeltaSec
                 this.metricsService.setGauge(
                     "system_network_rx_rate_bytes",
                     Math.max(0, rxRate),
                     { interface: iface },
                 )
                 this.metricsService.setGauge(
                     "system_network_tx_rate_bytes",
                     Math.max(0, txRate),
                     { interface: iface },
                 )
             }
 
             this.lastNetworkStats.set(iface, {
                 rx: rxBytes,
                 tx: txBytes,
             })
         }
     } catch {
-        // Fall back to basic interface listing
-        for (const [name] of Object.entries(interfaces)) {
-            if (name !== "lo") {
-                this.metricsService.setGauge(
-                    "system_network_rx_bytes_total",
-                    0,
-                    { interface: name },
-                )
-                this.metricsService.setGauge(
-                    "system_network_tx_bytes_total",
-                    0,
-                    { interface: name },
-                )
-            }
-        }
+        // Fallback for Linux if /proc/net/dev fails
+        this.reportZeroNetworkMetrics(interfaces)
     }
+} else {
+    // Fallback for non-Linux platforms
+    this.reportZeroNetworkMetrics(interfaces)
 }

[Suggestion processed]

Suggestion importance[1-10]: 6


Why: The suggestion improves cross-platform compatibility by ensuring basic network metrics are reported on non-Linux systems, which prevents missing metrics.

Low
General
Ensure robust monitoring stack shutdown
Suggestion Impact: The monitoring shutdown logic was changed from manually cd-ing into and out of the monitoring directory to a subshell that performs `cd monitoring && docker compose down`, preserving the script's working directory and adding a warning if shutdown fails.

code diff:

@@ -944,10 +944,10 @@
 if [ "$MONITORING_DISABLED" != "true" ] && [ -d "monitoring" ]; then
     echo "🛑 Stopping monitoring stack..."
 
-    # Try graceful shutdown first with short timeout
-    cd monitoring
-    docker compose down --timeout 5 2>/dev/null || true
-    cd ..
+    # Try graceful shutdown first with short timeout (subshell to preserve working directory)
+    (
+        cd monitoring && docker compose down --timeout 5 2>/dev/null
+    ) || echo "⚠️  Warning: Failed to stop monitoring stack cleanly."

Encapsulate the monitoring stack shutdown commands in a subshell to ensure they
execute in the correct directory without affecting the main script's working
directory.

run [943-953]

 # Stop monitoring stack if it was started (enabled by default)
 if [ "$MONITORING_DISABLED" != "true" ] && [ -d "monitoring" ]; then
     echo "🛑 Stopping monitoring stack..."
 
     # Try graceful shutdown first with short timeout
-    cd monitoring
-    docker compose down --timeout 5 2>/dev/null || true
-    cd ..
+    (
+        cd monitoring && docker compose down --timeout 5 2>/dev/null
+    ) || echo "⚠️  Warning: Failed to stop monitoring stack cleanly."
 
     echo "✅ Monitoring stack stopped"
 fi

[Suggestion processed]

Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies a bug where the cleanup logic fails if the script is not run from the root directory and provides a robust fix using a subshell.

Low
Enable strict error handling

Add set -euo pipefail to the top of the script to enable strict error handling,
making it more robust and preventing it from continuing on errors.

run [9]

+set -euo pipefail
 MONITORING_DISABLED=false
Suggestion importance[1-10]: 5


Why: The suggestion improves script robustness by enabling strict error handling, which is a good practice for shell scripts, but it's a general improvement rather than a fix for a specific bug in the PR.

Low

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
src/libs/blockchain/gcr/gcr.ts (1)

176-192: Don’t swallow the exception; the current message likely mislabels DB failures as “no balance”.
findOne(...) returning “no row” shouldn’t throw; the catch is more likely real DB/query/connection errors. Log the error object (at least at debug/warn) so incidents are diagnosable.

Proposed change
         try {
             const response = await gcrRepository.findOne({
                 select: ["details"],
                 where: { publicKey: address },
             })
             return response ? response.details.content.balance : 0
         } catch (e) {
-            log.debug("[GET BALANCE] No balance for: " + address)
+            log.warn(
+                "[GCR] getGCRNativeBalance failed for " + address + ": " + String(e),
+            )
             return 0
         }
package.json (1)

55-116: Add lockfile to ensure reproducible installs for prom-client@^15.1.3.

prom-client 15.1.3 is compatible with Bun and ESM contexts (it supports CommonJS-to-ESM interop, with explicit Bun fixes in v15.1.2+), but without a committed lockfile the caret version ^15.1.3 can drift to newer minor/patch releases unexpectedly. Generate and commit a lockfile (package-lock.json, bun.lockb, or equivalent) to pin exact versions, and verify that the project's Node.js runtime satisfies the engines range declared by the installed prom-client release.

src/features/web2/dahr/DAHRFactory.ts (1)

48-55: Fix log prefix + consider whether sessionId should be logged.

The message says [DAHRManager] in DAHRFactory.createDAHR (Line 52); also verify sessionId isn’t a sensitive/token-like identifier before logging.

Proposed fix
-        log.info("DAHR", `[DAHRManager] Creating new DAHR instance with sessionId: ${sessionId}`)
+        log.info("DAHR", `[DAHRFactory] Creating new DAHR instance with sessionId: ${sessionId}`)
run (1)

350-367: Ctrl+C cleanup likely stops the wrong Postgres compose directory (leaves DB running).

This script now uses postgres_${PG_PORT} elsewhere, but ctrl_c() still does cd postgres (Line 353). On interrupt, that can fail to stop the running DB (and the cd is also fragile if the current directory isn’t repo root).

Proposed fix
 function ctrl_c() {
     HAS_BEEN_INTERRUPTED=true
     if [ "$EXTERNAL_DB" = false ]; then
-        cd postgres
-        docker compose down
-        cd ..
+        PG_FOLDER="postgres_${PG_PORT}"
+        if [ -d "$PG_FOLDER" ]; then
+            (cd "$PG_FOLDER" && docker compose down) || true
+        elif [ -d "postgres" ]; then
+            (cd "postgres" && docker compose down) || true
+        fi
     fi
     # Stop TLSNotary container if running (enabled by default)
     if [ "$TLSNOTARY_DISABLED" != "true" ] && [ -d "tlsnotary" ]; then
         (cd tlsnotary && docker compose down --timeout 5 2>/dev/null) || true
         # Force kill if still running
         docker rm -f "tlsn-notary-${TLSNOTARY_PORT:-7047}" 2>/dev/null || true
     fi
     # Stop monitoring stack if running (enabled by default)
     if [ "$MONITORING_DISABLED" != "true" ] && [ -d "monitoring" ]; then
         (cd monitoring && docker compose down --timeout 5 2>/dev/null) || true
     fi
 }
src/libs/blockchain/routines/validateTransaction.ts (1)

162-183: BALANCE ERROR catch continues execution; likely returns success after a balance lookup failure.

After the catch, defineGas() can still proceed to compute gas and return [true, gasOperation] even though balance retrieval failed (and validityData.data.valid isn’t set to false here). If this is intended only for non-PROD, it should be explicit.

Proposed fix (fail fast)
     try {
         fromBalance = await GCR.getGCRNativeBalance(from)
     } catch (e) {
         log.error("TX", "[Native Tx Validation] [BALANCE ERROR] No balance found for this address: " + from)
         validityData.data.message =
             "[Native Tx Validation] [BALANCE ERROR] No balance found for this address: " +
             from +
             "\n"
-        // Hash the validation data
-        const hash = Hashing.sha256(JSON.stringify(validityData.data))
-        // Sign the hash
-        const signature = await ucrypto.sign(
-            getSharedState.signingAlgorithm,
-            new TextEncoder().encode(hash),
-        )
-        validityData.signature = {
-            type: getSharedState.signingAlgorithm,
-            data: uint8ArrayToHex(signature.signature),
-        }
+        validityData.data.valid = false
+        validityData = await signValidityData(validityData)
+        return [false, validityData]
     }
🤖 Fix all issues with AI agents
In @.env.example:
- Around line 35-39: The .env.example comment for Prometheus metrics can cause a
port conflict because METRICS_PORT defaults to 9090 (the Prometheus server
default); update the comment above METRICS_ENABLED / METRICS_PORT in
.env.example to warn users about the potential conflict and suggest using a
different default port (e.g., 9091 or 9092) or mapping Prometheus externally,
and optionally change the METRICS_PORT default value to a non-conflicting port to
match the monitoring README.

In @INSTALL.md:
- Around line 427-444: The two Markdown tables under the "Required Ports" and
"Optional Ports" headings violate MD058 by lacking blank lines around them;
update INSTALL.md to add a blank line before each table and a blank line after
each table (i.e., ensure there is an empty line above the first "| Port |
Service | Description |" row and an empty line after the final table row for
both the Required Ports and Optional Ports sections) so markdownlint MD058
passes.

In @monitoring/docker-compose.yml:
- Around line 41-42: The docker-compose uses a weak default admin password via
GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-demos}; remove the insecure
fallback and require explicit configuration or implement secure generation:
update the compose entry to reference
GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD} (no default) or replace
with a startup mechanism that generates a random password, sets
GF_SECURITY_ADMIN_PASSWORD for the Grafana container and prints it to logs on
first boot; also add a prominent warning in the documentation instructing users
to set GRAFANA_ADMIN_PASSWORD before deployment and reference the
GF_SECURITY_ADMIN_USER/GF_SECURITY_ADMIN_PASSWORD variables in the docs.

In @monitoring/grafana/grafana.ini:
- Around line 52-55: The setting disable_sanitize_html under the [panels]
section currently disables HTML sanitization and permits XSS; change
disable_sanitize_html back to false to re-enable sanitization, and add a clear
inline comment near the [panels] section explaining the branding trade-off and
that any intentional override requires explicit justification and restricted
dashboard-edit permissions; alternatively, if branding HTML is required, keep
disable_sanitize_html = true only after adding documentation of the risk and
ensuring dashboard creation/edit rights are limited to a trusted admin role
(update any related dashboard-permission config and docs accordingly).
- Around line 61-64: The Grafana Live configuration currently uses a permissive
wildcard for WebSocket origins; update the [live] section's allowed_origins
setting (the allowed_origins key) to list only specific trusted origins (e.g.,
https://your-app.example.com, https://admin.example.com) instead of "*", or wire
it to a secure environment/config variable so production deployments do not
accept all origins.

In @monitoring/grafana/provisioning/dashboards/json/system-health.json:
- Around line 579-590: Update the Grafana Prometheus query for the TLSNotary
container to use the collector's label value by changing the metric selector in
any occurrences of service_docker_container_up{container="tlsn-server"} to
service_docker_container_up{container="tlsn"} (leave postgres and ipfs selectors
as-is); ensure the stat panel using the expr
"service_docker_container_up{container=\"tlsn\"}" is updated so it matches the
MetricsCollector label names.
- Around line 38-103: Update all Prometheus query expressions in this dashboard
to use the MetricsCollector prefix: replace bare metric names like
system_cpu_usage_percent, system_memory_usage_percent, system_load_average_*,
service_docker_container_up, and service_port_open with their prefixed forms
(prepend "demos_") so the targets' expr fields use
demos_system_cpu_usage_percent, demos_system_memory_usage_percent,
demos_system_load_average_*, demos_service_docker_container_up,
demos_service_port_open; reference the MetricsService.ts prefix configuration
(line 33) and mirror the approach used in demos-overview.json when editing the
target objects' "expr" values.

In @monitoring/prometheus/prometheus.yml:
- Around line 26-42: The README incorrectly instructs users to set
PROMETHEUS_PORT=3333 when it should instruct setting METRICS_PORT=3333 for the
node metrics exporter; update the README text in the "Enabling Metrics on Your
Node" section to reference METRICS_PORT (not PROMETHEUS_PORT) and clarify that
PROMETHEUS_PORT controls the Prometheus service external port (default 9091). In
prometheus.yml update the 'demos-node' job (job_name: 'demos-node') to either
document that the target host.docker.internal:9090 must match the node's
METRICS_PORT or make the target configurable via METRICS_PORT so Prometheus
scrapes the actual node metrics port; also update docs/diagram to explicitly
distinguish the Prometheus service port (PROMETHEUS_PORT, default 9091) from the
node metrics port (METRICS_PORT, default 9090).

In @monitoring/README.md:
- Around line 186-198: The troubleshooting step incorrectly instructs users to
curl http://localhost:3333/metrics; update the README's "Grafana shows 'No
Data'" section to use the actual default Prometheus metrics port by replacing
that URL with http://localhost:9090/metrics so the curl command checks the
correct endpoint.
- Around line 33-40: The README currently uses ENABLE_PROMETHEUS and
PROMETHEUS_PORT:3333 which conflicts with the actual config in .env.example and
MetricsServer.ts (which expect METRICS_ENABLED and METRICS_PORT with default
9090); update the README text and example env block to use METRICS_ENABLED=true
and METRICS_PORT=9090, change the example URL from http://localhost:3333/metrics
to http://localhost:9090/metrics, and also update the architecture diagram lines
that reference port 3333 to 9090 so documentation aligns with the
METRICS_ENABLED/METRICS_PORT variables used by MetricsServer.ts.

In @run:
- Around line 813-853: The Grafana health-check loop uses curl -sf which can
hang; update the curl invocation in the while condition (while ! curl -sf
"http://localhost:$GRAFANA_PORT/api/health" > /dev/null 2>&1;) to include
timeouts (e.g. --connect-timeout 1 and --max-time 2) so each attempt cannot
stall; also apply the same timeout flags to the TLSNotary health check curl call
elsewhere in the script to ensure those startup loops respect the intended time
bounds.

In @src/features/metrics/MetricsCollector.ts:
- Around line 636-664: The collectDockerHealth method interpolates PG_PORT and
TLSNOTARY_PORT directly into a shell command, risking command injection; add a
helper like sanitizePort(value, defaultValue) that validates the port with a
/^\d+$/ check (log a warning and return the default on invalid input), call it
to produce sanitized pgPort and tlsnPort values, and use those sanitized values
in the execAsync docker command and container name templates inside
collectDockerHealth (retain existing metric setting and error handling).
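A minimal sketch of the requested helper (the name sanitizePort comes from the comment above; the warning format is illustrative):

```typescript
// Accept only pure digit-strings; anything else (including shell
// metacharacters like "; rm -rf /") falls back to a known-good default,
// so the value is safe to interpolate into the docker command.
function sanitizePort(value: string | undefined, defaultValue: string): string {
    if (value !== undefined && /^\d+$/.test(value)) {
        return value
    }
    console.warn(`[METRICS COLLECTOR] Invalid port "${value}", using default ${defaultValue}`)
    return defaultValue
}

// Hypothetical usage before building the container-name templates:
const pgPort = sanitizePort(process.env.PG_PORT, "5432")
const tlsnPort = sanitizePort(process.env.TLSNOTARY_PORT, "7047")
```

The regex whitelist is stricter than escaping and sidesteps injection entirely, since a valid port can only ever be digits.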

In @src/features/metrics/MetricsServer.ts:
- Line 10: Change the runtime import of Server to a type-only import: replace
the existing `import { Server } from "bun"` with `import type { Server } from
"bun"` because `Server` is only used for type annotations (see `private server:
Server | null = null` in the `MetricsServer` class); update the import so the
type is erased at runtime and follows Bun conventions.

In @src/features/metrics/MetricsService.ts:
- Around line 153-158: The histogram registration in MetricsService using
createHistogram("peer_latency_seconds", ..., ["peer_id"], ...) risks cardinality
explosion because each unique peer_id creates a new time series; remove the raw
peer_id label and replace it with a lower-cardinality dimension (e.g.,
peer_group, region, or a fixed-size bucket), or restrict labels to only top-N
peers and aggregate the rest into an "other" bucket before calling
createHistogram; alternatively, stop using peer-specific labels and record
per-peer latency only via aggregated metrics or log-based sampling to avoid
unbounded cardinality.
- Around line 109-111: The log message in MetricsService ("[METRICS]
MetricsService initialized on port ...") is misleading because MetricsService
does not bind an HTTP port; update the log in the MetricsService initialization
(the log.info call inside the MetricsService constructor/initialize method) to
remove the implication that it is listening — either log that the service was
initialized with a configured port value (e.g. "[METRICS] MetricsService
initialized (configured port: ... )") or remove the port entirely and move/emit
the actual "listening on port" message from MetricsServer where the server
binds. Ensure you change only the string in the log.info call associated with
MetricsService and add the explicit listening log in MetricsServer's start/bind
function if not already present.
- Around line 519-522: The shutdown() method only clears this.initialized but
leaves the static MetricsService.instance intact, causing getInstance() to
return a stale instance that cannot reinitialize; update shutdown() to also
clear the singleton (reset MetricsService.instance to undefined/null) or
implement logic in getInstance()/initialize to recreate the instance when
initialized is false (refer to MetricsService.instance, shutdown(),
getInstance(), and the initialized guard) so subsequent initialization attempts
produce a fresh, usable MetricsService.
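The shape of the fix can be sketched as follows (the class here is a stand-in, not the real MetricsService):

```typescript
// Singleton whose shutdown() also clears the static instance, so the next
// getInstance() builds a fresh, initializable object instead of returning
// a stale one with initialized === false.
class MetricsServiceSketch {
    private static instance: MetricsServiceSketch | null = null
    public initialized = true

    static getInstance(): MetricsServiceSketch {
        if (!MetricsServiceSketch.instance) {
            MetricsServiceSketch.instance = new MetricsServiceSketch()
        }
        return MetricsServiceSketch.instance
    }

    shutdown(): void {
        this.initialized = false
        // The key line: drop the singleton reference as well.
        MetricsServiceSketch.instance = null
    }
}
```

Without the reset line, every post-shutdown getInstance() returns the dead object and metrics silently stop registering.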

In @src/index.ts:
- Around line 582-590: The metricsCollector created via getMetricsCollector is
started but not saved to indexState and not stopped in gracefulShutdown; update
the startup sequence to store the instance (e.g., assign metricsCollector to
indexState.metricsCollector or a similar field) and modify gracefulShutdown to
check for indexState.metricsCollector and call its stop() (await if async)
before completing shutdown, ensuring any errors are caught/logged; reference the
metricsCollector variable, getMetricsCollector(), start(), stop(), indexState,
and gracefulShutdown when applying the changes.

In @src/libs/peer/PeerManager.ts:
- Around line 81-83: The assignment to peerObject.connection.string uses
peerData.url without validating its type or content; update the branch in
PeerManager (the code handling peerData with "url") to first check that
peerData.url is a non-empty string (e.g., typeof peerData.url === "string" and
peerData.url.trim().length > 0) before assigning to
peerObject.connection.string, and if the check fails either skip the assignment
and log an error/warning or throw a clear validation error so downstream URL
parsing does not receive undefined/null/non-string values.
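The validation asked for above can be sketched as a small type guard (isValidPeerUrl is a hypothetical helper, not code from the PR):

```typescript
// Narrow an unknown peer-data field to a usable URL string: must be a
// string and non-empty after trimming.
function isValidPeerUrl(url: unknown): url is string {
    return typeof url === "string" && url.trim().length > 0
}

// The PeerManager branch would then gate the assignment, e.g.:
// if (isValidPeerUrl(peerData.url)) {
//     peerObject.connection.string = peerData.url
// } else {
//     log.warning("[PEER MANAGER] Skipping peer with missing or invalid url")
// }
```

Because the guard narrows the type, downstream URL parsing never sees undefined, null, or non-string values.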
🧹 Nitpick comments (14)
src/utilities/tui/CategorizedLogger.ts (1)

838-855: Approve with optional safety improvement.

The caching logic is correct and provides a solid performance optimization. Cache invalidation via entryCounter is sound, and the implementation handles all edge cases properly (buffer overflow, clearing, empty state).

🛡️ Optional: Prevent cache corruption from external mutations

The method returns the cached array reference directly. While internal usages are safe (they use slice()/filter()), external callers could mutate the returned array and corrupt the cache.

Option 1: Return a readonly type (zero-cost)

-getAllEntries(): LogEntry[] {
+getAllEntries(): readonly LogEntry[] {

Option 2: Defensive copy (small performance cost)

     // Return cached result if entry counter hasn't changed
     if (this.allEntriesCache !== null && this.allEntriesCacheLastCounter === this.entryCounter) {
-        return this.allEntriesCache
+        return [...this.allEntriesCache]
     }
     
     // ...existing rebuild logic...
-    return this.allEntriesCache
+    return [...this.allEntriesCache]

Option 1 is preferred as it provides type-level safety without runtime overhead.

src/features/web2/dahr/DAHRFactory.ts (2)

16-29: Cleanup logging looks good; consider logging failures from dahr.stopProxy().

If stopProxy() can throw, one failing DAHR may abort cleanup and leak the rest; consider try/catch per instance (and still delete).


63-73: “No DAHR found” at info level may be too chatty.

If this can happen on normal flows, consider debug or rate-limiting to avoid log spam.

src/libs/blockchain/routines/validateTransaction.ts (1)

145-161: FROM ERROR path: consider reusing signValidityData() to avoid duplicated signing code.

You already have signValidityData(); using it here reduces duplication and keeps signing behavior consistent.

src/features/metrics/MetricsCollector.ts (4)

61-68: Singleton ignores config updates after first instantiation.

Once getInstance() is called, subsequent calls with a different config parameter are silently ignored. If callers expect to update the configuration, this could lead to unexpected behavior.

Consider either:

  1. Documenting that config is only honored on first call
  2. Throwing if a different config is passed to an already-initialized instance
  3. Allowing config updates via a separate method
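Option 2 above can be sketched like this (class and field names are illustrative, not the actual MetricsCollector):

```typescript
interface CollectorConfig {
    collectionIntervalMs: number
}

// Singleton factory that throws when a conflicting config reaches an
// already-built instance, instead of silently ignoring it.
class CollectorSketch {
    private static instance: CollectorSketch | null = null
    private constructor(public readonly config: CollectorConfig) {}

    static getInstance(config?: CollectorConfig): CollectorSketch {
        if (!CollectorSketch.instance) {
            CollectorSketch.instance = new CollectorSketch(
                config ?? { collectionIntervalMs: 15000 },
            )
        } else if (
            config &&
            config.collectionIntervalMs !==
                CollectorSketch.instance.config.collectionIntervalMs
        ) {
            throw new Error("Collector already initialized with a different config")
        }
        return CollectorSketch.instance
    }
}
```

Failing loudly turns a silent misconfiguration into an immediate, debuggable error at the call site.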

95-103: Set running = true before starting the interval to avoid race window.

Currently running is set after setInterval, which creates a brief window where stop() would see running=false but the interval is active, potentially leaving the interval running.

Proposed fix
         // Start periodic collection
+        this.running = true
         this.collectionInterval = setInterval(
             async () => {
                 await this.collectAll()
             },
             this.config.collectionIntervalMs,
         )

-        this.running = true
         log.info("[METRICS COLLECTOR] Started")

442-458: Fallback sets metrics to zero, which may be misleading on non-Linux platforms.

When /proc/net/dev isn't available (macOS, Windows), the fallback sets network metrics to 0. This could be misinterpreted as "no network activity" rather than "metrics unavailable."

Consider logging a debug message indicating that detailed network I/O is unavailable on the current platform, or omitting the metrics entirely rather than reporting zeros.


541-572: Clear the timeout in the catch block to avoid leaving dangling timer references.

When an error occurs (e.g., network failure before timeout), the timeout continues running until it fires. While not a significant leak due to the short duration, using try/finally would be cleaner.

Proposed fix
     private async checkEndpoint(
         baseUrl: string,
         path: string,
         name: string,
     ): Promise<boolean> {
         const startTime = Date.now()
+        const controller = new AbortController()
+        const timeout = setTimeout(() => controller.abort(), 5000)
         try {
-            const controller = new AbortController()
-            const timeout = setTimeout(() => controller.abort(), 5000)
-
             const response = await fetch(`${baseUrl}${path}`, {
                 method: "GET",
                 signal: controller.signal,
             })

-            clearTimeout(timeout)
-
             const responseTime = Date.now() - startTime
             const isHealthy = response.ok ? 1 : 0

             this.metricsService.setGauge("node_http_health", isHealthy, {
                 endpoint: name,
             })
             this.metricsService.setGauge(
                 "node_http_response_time_ms",
                 responseTime,
                 { endpoint: name },
             )
             return response.ok
         } catch {
             this.metricsService.setGauge("node_http_health", 0, {
                 endpoint: name,
             })
             this.metricsService.setGauge("node_http_response_time_ms", 0, {
                 endpoint: name,
             })
             return false
+        } finally {
+            clearTimeout(timeout)
         }
     }
monitoring/grafana/provisioning/dashboards/json/network-peers.json (1)

278-290: Potential division by zero in Peer Health % calculation.

The expression (peer_online_count / peers_total) * 100 will produce NaN or an error if peers_total is 0 (e.g., when the node first starts or has no known peers).

Consider using a safer PromQL expression:

Suggested safer expression
-          "expr": "(peer_online_count / peers_total) * 100",
+          "expr": "(peer_online_count / (peers_total > 0 or vector(1))) * 100",

Or alternatively use clamp_min:

(peer_online_count / clamp_min(peers_total, 1)) * 100
src/features/metrics/MetricsServer.ts (2)

159-166: Singleton factory ignores config on subsequent calls.

After the first instantiation, config passed to getMetricsServer() is silently ignored. If different configs are passed on subsequent calls, the caller may expect them to take effect. Consider logging a warning or documenting this behavior.

♻️ Optional: Add warning when config is ignored
 export const getMetricsServer = (
     config?: Partial<MetricsServerConfig>,
 ): MetricsServer => {
     if (!metricsServerInstance) {
         metricsServerInstance = new MetricsServer(config)
+    } else if (config) {
+        log.warning("[METRICS SERVER] Config ignored - server already instantiated")
     }
     return metricsServerInstance
 }

64-72: Consider adding error handling for Bun.serve().

If the port is already in use or binding fails, Bun.serve() will throw. Wrapping this in try/catch would provide clearer error messaging and graceful failure.

♻️ Proposed fix
-        this.server = Bun.serve({
-            port: this.config.port,
-            hostname: this.config.hostname,
-            fetch: async (req) => this.handleRequest(req),
-        })
-
-        log.info(
-            `[METRICS SERVER] Started on http://${this.config.hostname}:${this.config.port}/metrics`,
-        )
+        try {
+            this.server = Bun.serve({
+                port: this.config.port,
+                hostname: this.config.hostname,
+                fetch: async (req) => this.handleRequest(req),
+            })
+
+            log.info(
+                `[METRICS SERVER] Started on http://${this.config.hostname}:${this.config.port}/metrics`,
+            )
+        } catch (error) {
+            log.error(
+                `[METRICS SERVER] Failed to start on port ${this.config.port}: ${error}`,
+            )
+            throw error
+        }
monitoring/grafana/provisioning/dashboards/json/system-health.json (1)

1305-1306: Consider increasing refresh interval for production.

A 5-second refresh interval is quite aggressive and may increase load on Prometheus, especially if multiple users view the dashboard. Consider 10s or 30s for production use.

src/features/metrics/MetricsService.ts (2)

73-78: Config is ignored on subsequent getInstance() calls.

Same pattern issue as MetricsServer: if config differs on subsequent calls, it's silently ignored. Consider logging a warning or throwing if configs conflict.


325-340: Silent no-op when metric not found.

If a metric name is misspelled or not registered, these methods silently do nothing. This could make debugging difficult. Consider adding debug-level logging when a metric lookup fails.

♻️ Optional: Add debug logging for missing metrics
     public incrementCounter(
         name: string,
         labels?: Record<string, string>,
         value = 1,
     ): void {
         if (!this.config.enabled) return
         const fullName = this.config.prefix + name
         const counter = this.counters.get(fullName)
         if (counter) {
             if (labels) {
                 counter.inc(labels, value)
             } else {
                 counter.inc(value)
             }
+        } else {
+            log.debug(`[METRICS] Counter not found: ${fullName}`)
         }
     }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 525b2fb and 56b0df9.

⛔ Files ignored due to path filters (5)
  • monitoring/grafana/branding/demos-icon.svg is excluded by !**/*.svg
  • monitoring/grafana/branding/demos-logo-morph.svg is excluded by !**/*.svg
  • monitoring/grafana/branding/demos-logo-white.svg is excluded by !**/*.svg
  • monitoring/grafana/branding/favicon.png is excluded by !**/*.png
  • monitoring/grafana/branding/logo.jpg is excluded by !**/*.jpg
📒 Files selected for processing (34)
  • .beads/.local_version
  • .env.example
  • INSTALL.md
  • README.md
  • monitoring/README.md
  • monitoring/docker-compose.yml
  • monitoring/grafana/grafana.ini
  • monitoring/grafana/provisioning/dashboards/dashboard.yml
  • monitoring/grafana/provisioning/dashboards/json/consensus-blockchain.json
  • monitoring/grafana/provisioning/dashboards/json/demos-overview.json
  • monitoring/grafana/provisioning/dashboards/json/network-peers.json
  • monitoring/grafana/provisioning/dashboards/json/system-health.json
  • monitoring/grafana/provisioning/datasources/prometheus.yml
  • monitoring/prometheus/prometheus.yml
  • package.json
  • run
  • src/features/incentive/PointSystem.ts
  • src/features/metrics/MetricsCollector.ts
  • src/features/metrics/MetricsServer.ts
  • src/features/metrics/MetricsService.ts
  • src/features/metrics/index.ts
  • src/features/web2/dahr/DAHRFactory.ts
  • src/index.ts
  • src/libs/blockchain/gcr/gcr.ts
  • src/libs/blockchain/routines/validateTransaction.ts
  • src/libs/crypto/cryptography.ts
  • src/libs/identity/identity.ts
  • src/libs/network/endpointHandlers.ts
  • src/libs/network/manageAuth.ts
  • src/libs/network/manageExecution.ts
  • src/libs/peer/PeerManager.ts
  • src/libs/utils/keyMaker.ts
  • src/utilities/tui/CategorizedLogger.ts
  • src/utilities/tui/TUIManager.ts
🧰 Additional context used
🧬 Code graph analysis (11)
src/libs/blockchain/gcr/gcr.ts (1)
src/utilities/tui/CategorizedLogger.ts (1)
  • log (349-380)
src/features/web2/dahr/DAHRFactory.ts (2)
src/utilities/tui/CategorizedLogger.ts (1)
  • log (349-380)
src/features/web2/dahr/DAHR.ts (1)
  • sessionId (57-59)
src/libs/network/manageAuth.ts (1)
src/utilities/tui/CategorizedLogger.ts (1)
  • log (349-380)
src/libs/network/endpointHandlers.ts (1)
src/utilities/tui/CategorizedLogger.ts (1)
  • log (349-380)
src/libs/utils/keyMaker.ts (2)
src/libs/identity/identity.ts (1)
  • ensureIdentity (62-83)
src/utilities/tui/CategorizedLogger.ts (1)
  • log (349-380)
src/libs/crypto/cryptography.ts (2)
src/utilities/tui/CategorizedLogger.ts (1)
  • log (349-380)
src/utilities/sharedState.ts (1)
  • getSharedState (349-351)
src/libs/network/manageExecution.ts (1)
src/utilities/tui/CategorizedLogger.ts (1)
  • log (349-380)
src/features/metrics/MetricsServer.ts (1)
src/features/metrics/MetricsService.ts (1)
  • MetricsService (46-523)
src/features/metrics/MetricsService.ts (1)
src/features/metrics/index.ts (3)
  • MetricsConfig (13-13)
  • MetricsService (11-11)
  • getMetricsService (12-12)
src/libs/blockchain/routines/validateTransaction.ts (1)
src/utilities/tui/CategorizedLogger.ts (1)
  • log (349-380)
src/index.ts (5)
src/features/metrics/index.ts (2)
  • getMetricsServer (18-18)
  • getMetricsCollector (24-24)
src/features/metrics/MetricsServer.ts (1)
  • getMetricsServer (159-166)
src/features/metrics/MetricsCollector.ts (1)
  • getMetricsCollector (718-720)
src/utilities/waiter.ts (1)
  • Waiter (25-150)
src/exceptions/index.ts (2)
  • TimeoutError (4-9)
  • AbortError (14-19)
🪛 dotenv-linter (4.0.0)
.env.example

[warning] 39-39: [UnorderedKey] The METRICS_HOST key should go before the METRICS_PORT key

(UnorderedKey)

🪛 LanguageTool
monitoring/README.md

[style] ~82-~82: Consider a different adjective to strengthen your wording.
Context: ...orter (optional) Host-level metrics for deeper system insights: ```bash docker compose...

(DEEP_PROFOUND)

🪛 markdownlint-cli2 (0.18.1)
INSTALL.md

430-430: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)


438-438: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)

README.md

111-111: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)


120-120: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)


132-132: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)

🔇 Additional comments (45)
src/libs/peer/PeerManager.ts (1)

76-87: Code is defensively sound, but the new format is not yet deployed; capabilities field is undocumented.

The dual-format handling is well-implemented with appropriate fallback logic. However, verification shows that the new object format is not currently in use—peer list files only contain the old string format. The capabilities field mentioned in the comment is not implemented anywhere in the codebase; clarify whether this is reserved for future use or should be removed from the comment.

Consider:

  • Removing or updating the capabilities reference in the comment if it's not intended for near-term implementation
  • Documenting when the new format migration is planned, if applicable
  • Ensuring peer list generation tools produce the new format before it's adopted

The warning log and continue statement appropriately handle malformed entries.
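The dual-format fallback described above can be reduced to a small sketch. The entry shape and function name are assumptions based on this review, not the actual `PeerManager` code:

```typescript
// Sketch: accept either the old string format ("host:port") or the new
// object format ({ url, capabilities? }); warn and skip malformed entries.
type PeerEntry = string | { url?: unknown; capabilities?: string[] };

function parsePeerEntries(entries: PeerEntry[]): string[] {
    const urls: string[] = [];
    for (const entry of entries) {
        if (typeof entry === "string" && entry.length > 0) {
            urls.push(entry); // old string format
            continue;
        }
        if (
            entry &&
            typeof entry === "object" &&
            typeof entry.url === "string" &&
            entry.url.length > 0
        ) {
            urls.push(entry.url); // new object format
            continue;
        }
        // Malformed entry: log and keep processing the rest of the list.
        console.warn("[PEERS] Skipping malformed peer entry:", entry);
    }
    return urls;
}
```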

src/utilities/tui/CategorizedLogger.ts (2)

230-232: LGTM! Well-designed cache fields.

The cache fields are properly typed and initialized to ensure the cache is rebuilt on first access.


900-901: LGTM! Correct cache invalidation.

Setting the cache to null after clearing buffers ensures the cache is rebuilt on the next access. Not resetting entryCounter is the correct design, as it maintains the cache key across buffer clears.

.beads/.local_version (1)

1-1: Version bump looks fine.
No concerns in this file.

src/features/incentive/PointSystem.ts (1)

92-92: Good: const linkedNomis is the right semantic here.
You’re mutating the array contents (via push), not reassigning the binding.

src/libs/identity/identity.ts (1)

67-67: LGTM! Clean logging migration.

The replacement of terminal-kit with centralized logging maintains the same informational intent while aligning with the new observability infrastructure.

Also applies to: 72-72

src/libs/crypto/cryptography.ts (1)

228-244: LGTM! Appropriate log levels for cryptography diagnostics.

The migration from terminal-kit to centralized logging uses appropriate severity levels: log.debug for normalization attempts, log.warning for fallback scenarios, and log.error for exceptions.

src/libs/network/endpointHandlers.ts (1)

85-85: LGTM! Consistent logging migration across transaction handlers.

All terminal-kit calls have been properly replaced with centralized logging at appropriate severity levels, maintaining the same diagnostic information flow.

Also applies to: 87-87, 141-141, 170-170, 224-224, 288-288

src/utilities/tui/TUIManager.ts (1)

928-939: Excellent performance optimization for TUI responsiveness!

The introduction of the logsNeedUpdate flag effectively addresses TUI unresponsiveness by deferring expensive log filtering and scroll updates to the render cycle (every 100ms) instead of executing them on every log entry. This prevents UI blocking when logs arrive rapidly.

The implementation correctly:

  • Sets the flag in handleLogEntry without immediate processing
  • Performs deferred updates in render() only when needed and not in CMD mode
  • Resets the flag after updating to prevent redundant work
  • Maintains auto-scroll behavior within the batched update

Also applies to: 1012-1020
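The deferred-update pattern the review describes can be sketched as follows. The names mirror the review's description; the real `TUIManager` does more work per update and renders on a 100ms timer:

```typescript
// Sketch: instead of filtering logs on every entry, mark the log pane
// dirty and let the periodic render tick do the expensive work at most
// once per render cycle.
class LogPane {
    private logsNeedUpdate = false;
    public updates = 0; // stands in for "expensive filter + scroll" runs

    handleLogEntry(_entry: string): void {
        this.logsNeedUpdate = true; // cheap: just set a flag
    }

    render(): void {
        if (this.logsNeedUpdate) {
            this.updates++; // expensive work happens here, batched
            this.logsNeedUpdate = false; // reset to avoid redundant work
        }
    }
}
```

A burst of log entries between two render ticks therefore costs one update instead of one per entry.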

src/libs/utils/keyMaker.ts (1)

4-4: LGTM! Clean logging migration for key generation utility.

All console.log statements have been replaced with centralized logging. The key material logging at lines 35-36 is intentional for this development utility.

Also applies to: 12-12, 17-17, 28-28, 35-37, 42-42

src/features/web2/dahr/DAHRFactory.ts (2)

3-3: Confirm src/utilities/logger supports default import + log.info(category, message) shape.

This file uses import log from "src/utilities/logger" and calls log.info("DAHR", "..."); please ensure this matches the logger’s exported type (default vs named export, and arg order).


35-41: Singleton creation log is fine; just make sure it doesn't become noisy if the instance is accessed frequently.

src/libs/network/manageAuth.ts (2)

47-49: Readonly branch log message is clear and consistent.


11-18: Likely an incorrect or invalid review comment.


run (2)

9-10: Monitoring flag plumbing (-m/--no-monitoring) looks consistent.

Also applies to: 61-65, 446-450, 455-472


943-953: Shutdown path for monitoring is fine (nice symmetry with startup).

INSTALL.md (1)

64-70: install-deps.sh + Rust/Cargo note is clear and actionable.

Also applies to: 201-211

src/libs/blockchain/routines/validateTransaction.ts (2)

29-39: Logging change is fine; ensure log.info("TX", ...) matches logger API.


186-208: Insufficient gas log/message alignment looks good.

src/features/metrics/MetricsCollector.ts (1)

492-505: LGTM - Good cardinality limiting for peer metrics.

Capping to 20 peers prevents metric label explosion. The truncated peer_id also helps keep cardinality manageable while still being identifiable.
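The cap-and-truncate step can be sketched as a pure function. The 20-peer cap and short-ID truncation mirror the review; the exact lengths used by `MetricsCollector` are assumptions here:

```typescript
// Sketch: bound per-peer label cardinality by keeping at most maxPeers
// entries and truncating peer IDs so label values stay short and bounded.
function limitPeerLabels(
    peerIds: string[],
    maxPeers = 20,
    idLength = 8,
): string[] {
    return peerIds.slice(0, maxPeers).map((id) => id.slice(0, idLength));
}
```

Bounding both the number of label values and their length keeps the time-series count predictable even as the peer set grows.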

monitoring/grafana/provisioning/datasources/prometheus.yml (1)

1-23: LGTM - Datasource provisioning correctly configured.

The configuration properly uses the Docker service name prometheus for inter-container communication, sets POST method for better query handling, and marks it as the default non-editable datasource.

README.md (2)

154-183: LGTM - Comprehensive network ports documentation.

The ports table and firewall examples are well-documented. Good distinction between required and optional ports, with the security note about PostgreSQL being local-only.


109-117: The demos_ metric prefix documentation is accurate. MetricsService properly applies the demos_ prefix (configured at line 33) to all registered metrics via the fullName = this.config.prefix + name pattern. The documentation correctly reflects the final metric names exposed to Prometheus.

monitoring/grafana/provisioning/dashboards/dashboard.yml (1)

1-19: LGTM - Dashboard provisioning correctly configured.

The configuration properly sets up the dashboard provider with reasonable defaults. Note that allowUiUpdates: true allows dashboard editing in the UI, but changes are ephemeral and will be lost on container restart unless exported back to the JSON files.

monitoring/grafana/grafana.ini (1)

66-69: Feature toggles are valid for Grafana 10.2.2 in use.

All three toggles (publicDashboards, topnav, newPanelChromeUI) are documented and available in Grafana 10.2.2. Note: publicDashboards is expected to graduate from feature toggle to GA (renamed as "Shared dashboards") in Grafana 11.x; plan to update this configuration if upgrading.

monitoring/docker-compose.yml (3)

13-34: Prometheus service configuration looks good.

The service is well-configured with appropriate retention settings, lifecycle API enabled for reloads, and proper volume mounts. The host.docker.internal:host-gateway extra host enables scraping the node running on the host machine.


35-100: Grafana configuration is comprehensive and well-structured.

Good use of environment variables for customization, appropriate security settings (disabled analytics, gravatar, sign-up), and proper provisioning volume mounts. The feature toggles and branding setup align with the PR's objectives.


101-130: Node Exporter and infrastructure configuration look correct.

Using Docker Compose profiles for the optional node-exporter is a good pattern. The volume mounts for host metrics access are correct, and the named volumes with explicit names aid in management.

src/index.ts (3)

768-828: TUI-aware stdin handling fix is well-implemented.

The separation of TUI vs non-TUI modes correctly addresses the terminal-kit stdin conflict. In TUI mode, stdin manipulation is avoided since terminal-kit controls it via grabInput(). In non-TUI mode, the Enter-key skip behavior is preserved with proper cleanup in the finally block.


559-599: Metrics startup flow is well-structured.

Good use of dynamic import for lazy loading, proper error handling with failsafe continuation, and consistent port allocation pattern matching other services (MCP, OmniProtocol).


901-909: Metrics server shutdown is properly handled.

The graceful shutdown correctly stops the metrics server with error handling to prevent shutdown failures from propagating.

src/features/metrics/index.ts (1)

1-26: Clean barrel export module.

The module correctly aggregates and re-exports the metrics API surface. The JSDoc module documentation is helpful.

monitoring/prometheus/prometheus.yml (1)

44-52: Node Exporter job configuration is correct.

The job correctly targets the Docker service name node-exporter:9100 which resolves within the Docker network when the --profile full option is used.

monitoring/grafana/provisioning/dashboards/json/consensus-blockchain.json (2)

640-661: Block production rate calculation looks correct.

The rate calculations using rate(block_height[5m]) * 60 and rate(block_height[1m]) * 60 correctly compute blocks per minute from the counter increase rate.


81-92: The metric names in this dashboard are correct and consistent with the MetricsCollector implementation. The MetricsCollector creates metrics without the demos_ prefix (block_height, seconds_since_last_block, last_block_tx_count), which matches what this dashboard queries. The dashboard is properly configured.

Likely an incorrect or invalid review comment.

monitoring/grafana/provisioning/dashboards/json/network-peers.json (1)

89-101: Same metric naming inconsistency as consensus-blockchain dashboard.

This dashboard uses peer_online_count, peer_offline_count, peers_total without the demos_ prefix, while demos-overview.json uses demos_peer_online_count. Ensure consistency with the actual metric names exposed by the MetricsCollector.

monitoring/grafana/provisioning/dashboards/json/demos-overview.json (4)

1-112: Well-structured overview dashboard with clear branding.

The dashboard header with inline SVG logo and node version info provides good visual identity. The use of liveNow: true enables real-time streaming updates.


126-204: Blockchain status panels are well-configured.

Good use of value mappings for ONLINE/OFFLINE status, appropriate thresholds for block lag (30s yellow, 60s red), and consistent styling across panels.


906-938: Docker and port monitoring panels provide useful infrastructure visibility.

The container health (demos_service_docker_container_up) and port status (demos_service_port_open) metrics with UP/DOWN and OPEN/CLOSED mappings give operators quick visibility into service health.


1167-1187: Dashboard global settings are appropriate.

5-second refresh interval is suitable for real-time monitoring, and browser timezone with 1-hour default time range are sensible defaults.

src/features/metrics/MetricsServer.ts (2)

78-128: LGTM!

The request handling logic is well-structured with proper error handling on the /metrics endpoint and appropriate HTTP status codes for each route.


133-153: LGTM!

Lifecycle methods are clean and provide proper state management.

src/features/metrics/MetricsService.ts (3)

10-17: LGTM!

Imports from prom-client are correct and include all necessary metric types and utilities.


223-318: LGTM!

Metric creation methods are well-implemented with proper deduplication logic preventing duplicate registrations.


526-530: LGTM!

Clean export pattern providing both the class and a convenience factory function.

tcsenpai and others added 5 commits January 11, 2026 16:12
- Fix PromQL: use deriv() instead of rate() for gauge metrics
- Add MetricsCollector.stop() to graceful shutdown sequence
- Rename node_info to node_metadata to avoid metric collision
- Handle division by zero in peer health percentage query
- Add non-Linux fallback for network I/O metrics collection
- Use subshell pattern for monitoring stack shutdown in run script
- Clarify METRICS_PORT comment (node endpoint vs Prometheus server)
- Fix monitoring/README.md env var names and example ports
- Fix MD058 lint: add blank lines around tables in INSTALL.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
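The division-by-zero guard mentioned above is typically written in PromQL along these lines. The metric names are taken from the dashboards in this PR; the exact expression the fix landed on may differ:

```promql
# Guard against division by zero: clamp the denominator to at least 1,
# so an empty peer table yields 0% instead of NaN.
100 * peer_online_count / clamp_min(peers_total, 1)
```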
Update all PromQL expressions in system-health.json to use the demos_
prefix that MetricsService automatically applies to all metric names:
- demos_system_cpu_usage_percent
- demos_system_memory_usage_percent
- demos_system_memory_used_bytes
- demos_system_memory_total_bytes
- demos_system_load_average_1m/5m/15m
- demos_service_docker_container_up
- demos_service_port_open

Also fix TLSNotary container label from "tlsn-server" to "tlsn" to match
the displayName used in MetricsCollector.collectDockerHealth().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add header comment to prometheus.yml explaining port distinction
- Document that node metrics target must match METRICS_PORT from main .env
- Add "Important Port Distinction" section to README Configuration
- Fix troubleshooting curl example port from 3333 to 9090
- Clarify PROMETHEUS_PORT table entry (server port, not node metrics)

METRICS_PORT (9090) = Demos node metrics endpoint (main .env)
PROMETHEUS_PORT (9091) = Prometheus server external port (monitoring/.env)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prevent cardinality explosion by removing the peer_id label from the
peer_latency_seconds histogram. Each unique peer would create new time
series, causing unbounded growth.

Aggregated latency across all peers is sufficient for monitoring;
individual peer debugging should use structured logging instead.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…tion

- Add curl timeout flags (--connect-timeout 1 --max-time 2) to health check
  loops for TLSNotary and Grafana to prevent hanging when services are slow
- Fix MetricsService log message to say "configured port" instead of
  "initialized on port" since the service doesn't bind to the port
- Add URL validation in PeerManager to ensure peerData.url is a non-empty
  string before assignment, logging a warning and skipping invalid entries

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@sonarqubecloud

@tcsenpai tcsenpai merged commit 3b411c8 into testnet Jan 11, 2026
7 checks passed