feat(observability): Complete monitoring stack — Tempo, Uptime Kuma, dashboards, alerts (#10) #456

Open
zhaog100 wants to merge 1 commit into illbnm:master from zhaog100:fix/issue-10-observability

zhaog100 commented Apr 9, 2026

Summary

Complete observability implementation addressing all acceptance criteria from Issue #10.

Changes

New Services

  • Tempo (grafana/tempo:2.6.0): Distributed tracing via OTLP (gRPC :4317 + HTTP :4318)
  • Uptime Kuma (louislam/uptime-kuma:1.23.15): Service availability monitoring at status.DOMAIN
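
A minimal docker-compose sketch of the two new services. The image tags match the PR description; service names, volumes, config paths, and the Traefik labels for `status.DOMAIN` are illustrative assumptions, not the PR's actual file:

```yaml
services:
  tempo:
    image: grafana/tempo:2.6.0
    command: ["-config.file=/etc/tempo/tempo.yml"]   # assumed config location
    volumes:
      - ./tempo/tempo.yml:/etc/tempo/tempo.yml:ro
      - tempo-data:/var/tempo
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP

  uptime-kuma:
    image: louislam/uptime-kuma:1.23.15
    volumes:
      - uptime-kuma-data:/app/data
    labels:
      # Assumed Traefik routing to expose the status page at status.DOMAIN
      - "traefik.enable=true"
      - "traefik.http.routers.uptime-kuma.rule=Host(`status.${DOMAIN}`)"

volumes:
  tempo-data:
  uptime-kuma-data:
```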

Pre-provisioned Grafana Dashboards (5 JSON files)

| Dashboard | Grafana.com ID | Monitors |
|---|---|---|
| Node Exporter Full | 1860 | Host CPU/Memory/Disk/Network |
| Docker Container Stats | 179 | Container resource usage |
| Traefik Official | 17346 | Reverse proxy metrics |
| Loki Operational | 13639 | Log pipeline health |
| Uptime Kuma | 18278 | Service uptime tracking |
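
Grafana loads such dashboards through a file provider. A sketch of the provisioning entry, assuming the five JSON files are mounted into the container (path and folder name are assumptions):

```yaml
# e.g. grafana/provisioning/dashboards/dashboards.yml (assumed filename)
apiVersion: 1
providers:
  - name: homelab
    folder: Homelab          # folder shown in the Grafana UI
    type: file
    options:
      path: /var/lib/grafana/dashboards   # where the 5 JSON files are mounted
```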

Auto-provisioned Datasources

  • Prometheus (default)
  • Loki (with traceID → Tempo derived field linking)
  • Tempo (with service map + trace-to-metrics)
  • Alertmanager
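
A hedged sketch of the datasource provisioning file. URLs assume the usual compose service names, and the Loki derived field shows one common way to link `traceID=` tokens in log lines to Tempo (the `$$` escapes Grafana's env-var interpolation):

```yaml
# e.g. grafana/provisioning/datasources/datasources.yml (assumed filename)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Extract a trace ID from log lines and link it to the Tempo datasource
        - name: TraceID
          matcherRegex: "traceID=(\\w+)"
          url: "$${__value.raw}"
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      serviceMap:
        datasourceUid: prometheus   # service map is built from span metrics
  - name: Alertmanager
    type: alertmanager
    url: http://alertmanager:9093
    jsonData:
      implementation: prometheus
```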

Enhanced Alert Rules (3 files replacing the single homelab.yml)

| File | Monitors |
|---|---|
| host.yml | CPU (80%/95%), Memory (90%), Disk (15%/5%), Disk IO |
| containers.yml | Restarts, OOMKills, ContainerDown, HighCPU/Memory |
| services.yml | Traefik 5xx rate, P99 latency, Prometheus targets, Loki latency |
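
An illustrative excerpt in the style of host.yml, using the CPU thresholds from the table above (alert names, `for` durations, and the exact PromQL are assumptions):

```yaml
groups:
  - name: host
    rules:
      - alert: HostHighCPU
        # CPU busy % = 100 - idle %; warn above 80%
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% on {{ $labels.instance }}"
      - alert: HostCriticalCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPU above 95% on {{ $labels.instance }}"
```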

Alert Notification via ntfy

  • Critical alerts → ntfy push (urgent priority, 1h repeat)
  • Warning alerts → ntfy push (high priority, 3h repeat)
  • Inhibit rules: critical suppresses matching warnings
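
The routing and inhibition above map to an Alertmanager config along these lines. Alertmanager's webhook payload is not ntfy's native format, so a bridge (or ntfy's alertmanager-compatible endpoint) is assumed; receiver names and the bridge URL are placeholders:

```yaml
route:
  receiver: ntfy-warning
  routes:
    - matchers: [severity="critical"]
      receiver: ntfy-critical
      repeat_interval: 1h
    - matchers: [severity="warning"]
      receiver: ntfy-warning
      repeat_interval: 3h

receivers:
  - name: ntfy-critical
    webhook_configs:
      - url: http://ntfy-bridge:8080/alerts   # assumed webhook→ntfy bridge
  - name: ntfy-warning
    webhook_configs:
      - url: http://ntfy-bridge:8080/alerts

inhibit_rules:
  # A firing critical alert suppresses the matching warning for the same alert/instance
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [alertname, instance]
```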

Expanded Prometheus Scrape Targets

  • Prometheus, Node Exporter, cAdvisor, Traefik, Loki
  • New: Authentik (:9300), Uptime Kuma (:3001)
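
The two new targets could be added to `prometheus.yml` roughly as follows (job names and service hostnames are assumed; the ports come from the list above):

```yaml
scrape_configs:
  - job_name: authentik
    static_configs:
      - targets: ["authentik:9300"]
  - job_name: uptime-kuma
    static_configs:
      - targets: ["uptime-kuma:3001"]
```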

Acceptance Criteria

  • Grafana UI accessible, 5 pre-built dashboards auto-loaded
  • All datasources (Prometheus/Loki/Tempo/Alertmanager) auto-configured
  • Alert rules with 3 severity levels (info/warning/critical)
  • Alertmanager pushes to ntfy with proper routing
  • Tempo tracing via OTLP, linked from Loki logs
  • Uptime Kuma for external service monitoring
  • Comprehensive README with architecture, setup, troubleshooting

Fixes #10

…a, dashboards, and alerts (illbnm#10)

- Added Tempo for distributed tracing (OTLP gRPC+HTTP)
- Added Uptime Kuma for service availability monitoring
- Pre-provisioned 5 Grafana dashboards (Node Exporter, Docker, Traefik, Loki, Uptime Kuma)
- Auto-provisioned datasources: Prometheus, Loki, Tempo, Alertmanager
- Expanded alert rules into host.yml, containers.yml, services.yml
- Added ntfy push notification integration for alerts (critical + warning routing)
- Added Authentik and Uptime Kuma Prometheus scrape targets
- Updated Alertmanager with proper ntfy webhook routing
- Updated .env.example with PROMETHEUS_RETENTION, NTFY_TOPIC
- Comprehensive monitoring README
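
The two new variables mentioned for `.env.example` might look like this (values are placeholders, not the PR's defaults):

```shell
# Prometheus TSDB retention window
PROMETHEUS_RETENTION=15d
# ntfy topic that Alertmanager notifications are published to
NTFY_TOPIC=homelab-alerts
```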

Fixes illbnm#10

Successfully merging this pull request may close these issues.

[BOUNTY $280] Observability — Prometheus + Grafana + Loki + Alerting