Centralized observability stack for all services. Owned by the Platform team.
| Component | Role | Port |
|---|---|---|
| OTEL Collector | Telemetry gateway — receives traces/metrics/logs from all services | 4317 (gRPC), 4318 (HTTP) |
| Prometheus | Metrics storage & alerting | 9090 |
| Grafana | Dashboards & visualization | 3000 |
| Grafana Tempo | Distributed trace storage | 3200 |
| Node Exporter | Host metrics (CPU, memory, disk) | 9100 |
Your Services
     │
     │  OTLP (gRPC :4317 or HTTP :4318)
     ▼
OTEL Collector (gateway)
     ├── Traces ──────────────► Tempo ──────────► Grafana
     ├── Metrics ──► Prometheus exporter (:8889)
     │                    ▲
     │            Prometheus scrapes
     │                    │
     └──────────────────────────────────────────► Grafana
Services never talk directly to Prometheus or Tempo. They only talk to the OTEL Collector endpoint. This decouples your services from backend changes.
# 1. Copy environment file
cp .env.example .env
# 2. Start the stack
docker compose up -d
# 3. Check all services are healthy
docker compose ps
# 4. Open Grafana
open http://localhost:3000 # admin / admin (change in .env)

Set these environment variables in your service:
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_SERVICE_NAME=your-service-name
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=local,service.version=1.0.0
OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp

Never hardcode the collector address. Always use OTEL_EXPORTER_OTLP_ENDPOINT.
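
OpenTelemetry SDKs generally read these variables themselves, so most services need no extra code. Where a service does have to resolve the collector address by hand, a minimal standard-library sketch might look like the following. The helper names `parse_resource_attributes` and `load_otel_settings` and the shape of the returned dictionary are illustrative assumptions, not part of this repository.

```python
import os


def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Parse 'key1=value1,key2=value2' into a dict, skipping malformed pairs."""
    attributes = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attributes[key.strip()] = value.strip()
    return attributes


def load_otel_settings(env=None) -> dict:
    """Resolve collector settings from the environment instead of hardcoding them."""
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if not endpoint:
        # Fail fast rather than silently falling back to a hardcoded address.
        raise RuntimeError("OTEL_EXPORTER_OTLP_ENDPOINT is not set")
    return {
        "endpoint": endpoint,
        # Hypothetical fallback name for services that forgot to set one.
        "service_name": env.get("OTEL_SERVICE_NAME", "unknown_service"),
        "resource_attributes": parse_resource_attributes(
            env.get("OTEL_RESOURCE_ATTRIBUTES", "")
        ),
    }
```

Calling `load_otel_settings()` with no arguments reads `os.environ`; passing a plain dictionary instead makes the helper easy to exercise in unit tests.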
Dashboards live in grafana/dashboards/ as JSON files and are auto-provisioned.
To add a new dashboard:
- Create it in Grafana UI
- Export as JSON (Share → Export → Save to file)
- Place in grafana/dashboards/
- Commit the JSON; it auto-loads on next restart
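
Because exported dashboards are only loaded on restart, a broken file may go unnoticed until then. As a hedged sketch, the check below could run before committing. The function name `validate_dashboards` is hypothetical, and the requirement that each dashboard carries a `title` and a `uid` is an assumption about team convention rather than something this repository enforces.

```python
import json
from pathlib import Path


def validate_dashboards(directory: str) -> list[str]:
    """Return a list of problems found in the dashboard JSON files under `directory`."""
    problems = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            dashboard = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(dashboard, dict):
            problems.append(f"{path.name}: top level is not a JSON object")
            continue
        # Assumption: every dashboard should carry a title and a stable uid.
        for field in ("title", "uid"):
            if not dashboard.get(field):
                problems.append(f"{path.name}: missing '{field}'")
    return problems
```

Running `validate_dashboards("grafana/dashboards")` from the repository root returns an empty list when every file parses and has both fields.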
Alert rules live in prometheus/rules/alerts.yml. After editing, hot-reload without restart:
curl -X POST http://localhost:9090/-/reload

Add a scrape config to prometheus/prometheus.yml:
- job_name: "your-service"
  static_configs:
    - targets: ["your-service:8080"]

Then reload: curl -X POST http://localhost:9090/-/reload
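
For scripts or CI jobs that edit rules or scrape configs, the same reload call can be made from Python. This is a sketch under two assumptions: the helper names are hypothetical, and Prometheus is running with `--web.enable-lifecycle`, since it only serves `/-/reload` when that flag is set (whether docker-compose.yml here sets it is not confirmed by this README).

```python
import urllib.error
import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build the POST request for Prometheus's /-/reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str = "http://localhost:9090", timeout: float = 5.0) -> bool:
    """Trigger a config reload; return True if Prometheus answered with HTTP 200."""
    try:
        request = build_reload_request(base_url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection failures, timeouts, and HTTP error statuses
        # (HTTPError is a subclass of URLError), e.g. a rejected config
        # or a disabled lifecycle API.
        return False
```

A `False` result only says the reload did not succeed; the Prometheus logs are the place to look for the reason.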
infra-observability/
├── otel-collector/
│   └── config/
│       └── otel-collector-config.yaml   # receiver/processor/exporter pipelines
├── prometheus/
│   ├── prometheus.yml                   # scrape configs
│   └── rules/
│       ├── alerts.yml                   # alerting rules
│       └── recording-rules.yml          # pre-computed expensive queries
├── grafana/
│   ├── provisioning/
│   │   ├── datasources/                 # auto-provisioned datasources
│   │   └── dashboards/                  # dashboard loader config
│   └── dashboards/                      # dashboard JSON files
├── tempo/
│   └── tempo.yaml                       # trace storage config
├── docker-compose.yml                   # local dev stack
├── .env.example                         # environment variable template
└── README.md
Owned by the Platform / SRE team. For questions, open an issue or ping #platform in Slack.