feat(observability): Complete monitoring stack — Tempo, Uptime Kuma, dashboards, alerts (#10) #456

Open
zhaog100 wants to merge 1 commit into illbnm:master from zhaog100:fix/issue-10-observability

zhaog100 commented Apr 9, 2026

Summary

Complete observability implementation addressing all acceptance criteria from Issue #10.

Changes

New Services

  • Tempo (grafana/tempo:2.6.0): Distributed tracing via OTLP (gRPC :4317 + HTTP :4318)
  • Uptime Kuma (louislam/uptime-kuma:1.23.15): Service availability monitoring at status.DOMAIN
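
A minimal docker-compose sketch of the two new services. The image tags match the PR description; service names, volumes, config paths, and the Traefik labels for `status.DOMAIN` are illustrative assumptions, not the PR's actual file:

```yaml
services:
  tempo:
    image: grafana/tempo:2.6.0
    command: ["-config.file=/etc/tempo/tempo.yml"]   # assumed config location
    volumes:
      - ./tempo/tempo.yml:/etc/tempo/tempo.yml:ro
      - tempo-data:/var/tempo
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP

  uptime-kuma:
    image: louislam/uptime-kuma:1.23.15
    volumes:
      - uptime-kuma-data:/app/data
    labels:
      # Assumed Traefik routing to expose the status page at status.DOMAIN
      - "traefik.enable=true"
      - "traefik.http.routers.uptime-kuma.rule=Host(`status.${DOMAIN}`)"

volumes:
  tempo-data:
  uptime-kuma-data:
```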

Pre-provisioned Grafana Dashboards (5 JSON files)

| Dashboard | Grafana.com ID | Monitors |
|---|---|---|
| Node Exporter Full | 1860 | Host CPU/Memory/Disk/Network |
| Docker Container Stats | 179 | Container resource usage |
| Traefik Official | 17346 | Reverse proxy metrics |
| Loki Operational | 13639 | Log pipeline health |
| Uptime Kuma | 18278 | Service uptime tracking |
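
Grafana loads such dashboards through a file provider. A sketch of the provisioning entry, assuming the five JSON files are mounted into the container (path and folder name are assumptions):

```yaml
# e.g. grafana/provisioning/dashboards/dashboards.yml (assumed filename)
apiVersion: 1
providers:
  - name: homelab
    folder: Homelab          # folder shown in the Grafana UI
    type: file
    options:
      path: /var/lib/grafana/dashboards   # where the 5 JSON files are mounted
```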

Auto-provisioned Datasources

  • Prometheus (default)
  • Loki (with traceID → Tempo derived field linking)
  • Tempo (with service map + trace-to-metrics)
  • Alertmanager
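
A hedged sketch of the datasource provisioning file. URLs assume the usual compose service names, and the Loki derived field shows one common way to link `traceID=` tokens in log lines to Tempo (the `$$` escapes Grafana's env-var interpolation):

```yaml
# e.g. grafana/provisioning/datasources/datasources.yml (assumed filename)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Extract a trace ID from log lines and link it to the Tempo datasource
        - name: TraceID
          matcherRegex: "traceID=(\\w+)"
          url: "$${__value.raw}"
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      serviceMap:
        datasourceUid: prometheus   # service map is built from span metrics
  - name: Alertmanager
    type: alertmanager
    url: http://alertmanager:9093
    jsonData:
      implementation: prometheus
```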

Enhanced Alert Rules (3 files replacing the single homelab.yml)

| File | Monitors |
|---|---|
| host.yml | CPU (80%/95%), Memory (90%), Disk (15%/5%), Disk IO |
| containers.yml | Restarts, OOMKills, ContainerDown, HighCPU/Memory |
| services.yml | Traefik 5xx rate, P99 latency, Prometheus targets, Loki latency |
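
An illustrative excerpt in the style of host.yml, using the CPU thresholds from the table above (alert names, `for` durations, and the exact PromQL are assumptions):

```yaml
groups:
  - name: host
    rules:
      - alert: HostHighCPU
        # CPU busy % = 100 - idle %; warn above 80%
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% on {{ $labels.instance }}"
      - alert: HostCriticalCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPU above 95% on {{ $labels.instance }}"
```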

Alert Notification via ntfy

  • Critical alerts → ntfy push (urgent priority, 1h repeat)
  • Warning alerts → ntfy push (high priority, 3h repeat)
  • Inhibit rules: critical suppresses matching warnings
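
The routing and inhibition above map to an Alertmanager config along these lines. Alertmanager's webhook payload is not ntfy's native format, so a bridge (or ntfy's alertmanager-compatible endpoint) is assumed; receiver names and the bridge URL are placeholders:

```yaml
route:
  receiver: ntfy-warning
  routes:
    - matchers: [severity="critical"]
      receiver: ntfy-critical
      repeat_interval: 1h
    - matchers: [severity="warning"]
      receiver: ntfy-warning
      repeat_interval: 3h

receivers:
  - name: ntfy-critical
    webhook_configs:
      - url: http://ntfy-bridge:8080/alerts   # assumed webhook→ntfy bridge
  - name: ntfy-warning
    webhook_configs:
      - url: http://ntfy-bridge:8080/alerts

inhibit_rules:
  # A firing critical alert suppresses the matching warning for the same alert/instance
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [alertname, instance]
```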

Expanded Prometheus Scrape Targets

  • Prometheus, Node Exporter, cAdvisor, Traefik, Loki
  • New: Authentik (:9300), Uptime Kuma (:3001)
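
The two new targets could be added to `prometheus.yml` roughly as follows (job names and service hostnames are assumed; the ports come from the list above):

```yaml
scrape_configs:
  - job_name: authentik
    static_configs:
      - targets: ["authentik:9300"]
  - job_name: uptime-kuma
    static_configs:
      - targets: ["uptime-kuma:3001"]
```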

Acceptance Criteria

  • Grafana UI accessible, 5 pre-built dashboards auto-loaded
  • All datasources (Prometheus/Loki/Tempo/Alertmanager) auto-configured
  • Alert rules with 3 severity levels (info/warning/critical)
  • Alertmanager pushes to ntfy with proper routing
  • Tempo tracing via OTLP, linked from Loki logs
  • Uptime Kuma for external service monitoring
  • Comprehensive README with architecture, setup, troubleshooting

Fixes #10

…a, dashboards, and alerts (illbnm#10)

- Added Tempo for distributed tracing (OTLP gRPC+HTTP)
- Added Uptime Kuma for service availability monitoring
- Pre-provisioned 5 Grafana dashboards (Node Exporter, Docker, Traefik, Loki, Uptime Kuma)
- Auto-provisioned datasources: Prometheus, Loki, Tempo, Alertmanager
- Expanded alert rules into host.yml, containers.yml, services.yml
- Added ntfy push notification integration for alerts (critical + warning routing)
- Added Authentik and Uptime Kuma Prometheus scrape targets
- Updated Alertmanager with proper ntfy webhook routing
- Updated .env.example with PROMETHEUS_RETENTION, NTFY_TOPIC
- Comprehensive monitoring README
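
The two new variables mentioned for `.env.example` might look like this (values are placeholders, not the PR's defaults):

```shell
# Prometheus TSDB retention window
PROMETHEUS_RETENTION=15d
# ntfy topic that Alertmanager notifications are published to
NTFY_TOPIC=homelab-alerts
```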

Fixes illbnm#10

Successfully merging this pull request may close these issues.

[BOUNTY $280] Observability — Prometheus + Grafana + Loki + Alerting