[grafana] No alerting rules in dashboard — bot down, queue depth spike, and simulation failure rate unmonitored #284

@obchain

Refs #54

Location

deploy/grafana/charon.json (no alerting section present)

Problem

The dashboard has 9 visualization panels but zero alerting rules. For a production liquidation bot, silent failure is the primary operational risk. The following conditions should generate alerts but currently have no coverage:

  1. Bot down: charon_scanner_blocks_total does not increase over 60s. The bot has stopped scanning.
  2. Queue depth spike: charon_executor_queue_depth > threshold. Possible stall in the executor.
  3. High simulation failure rate: sum(rate(charon_executor_simulations_total{result="failure"}[5m])) / sum(rate(charon_executor_simulations_total[5m])) > 0.5. Contract or RPC issue. Both sides are wrapped in sum() because an unaggregated division matches series on identical labels, which would divide the failure series only by itself.
  4. Zero opportunities queued in 1h: increase(charon_executor_opportunities_queued_total[1h]) == 0. Scanner or health check broken.
  5. High drop rate: rate(charon_executor_opportunities_dropped_total[5m]) / rate(charon_executor_opportunities_queued_total[5m]) > 0.9. Upstream pipeline issue.

Impact

Without alerts, the operator must watch the dashboard continuously to detect bot failure. A stopped bot silently misses liquidation opportunities. Given the financial stakes (Venus liquidations), undetected downtime has a direct cost.

Suggested Fix

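Add Grafana alerting rules for at minimum conditions 1-3. Grafana unified alerting stores rules separately from dashboard JSON, so they would likely need to be provisioned alongside deploy/grafana/charon.json rather than embedded in it. Alternatively, ship a Prometheus alerting rules YAML file at deploy/grafana/alerts.yaml alongside the dashboard JSON.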
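As a starting point for the Prometheus option, conditions 1-3 could be sketched as a rules file like the one below. This is a rough sketch, not a tested configuration: the alert names, `for` durations, severity labels, and the queue-depth threshold of 100 are illustrative assumptions, and the 1m window for the bot-down rule assumes a scrape interval short enough to give several samples per minute.

```yaml
# deploy/grafana/alerts.yaml (path proposed in this issue)
# Sketch only: alert names, `for` durations, severity labels, and the
# queue-depth threshold are placeholders, not values from the project.
groups:
  - name: charon
    rules:
      # Condition 1: the scanner stopped advancing, or the metric disappeared
      # entirely (absent() covers the case where the process is gone).
      - alert: CharonBotDown
        expr: (increase(charon_scanner_blocks_total[1m]) == 0) or absent(charon_scanner_blocks_total)
        labels:
          severity: critical
        annotations:
          summary: "Charon scanner has not processed a block in the last 60s"

      # Condition 2: executor queue is backing up. 100 is a placeholder.
      - alert: CharonQueueDepthHigh
        expr: charon_executor_queue_depth > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Charon executor queue depth is above the configured threshold"

      # Condition 3: more than half of simulations are failing.
      # sum() on both sides so failures are divided by the total across
      # all `result` values rather than by the failure series alone.
      - alert: CharonSimulationFailureRateHigh
        expr: |
          sum(rate(charon_executor_simulations_total{result="failure"}[5m]))
            / sum(rate(charon_executor_simulations_total[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Over 50% of Charon simulations failed in the last 5m"
```

A file like this can be validated with `promtool check rules deploy/grafana/alerts.yaml` and loaded through the `rule_files` entry in the Prometheus configuration.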

Metadata

Labels

enhancement (New feature or request), layer:devops (CI / deploy / infra / telemetry), priority:p2-polish (Nice-to-have / polish), status:ready (Scoped and ready to pick up), type:feature (New capability or deliverable)
