@adriansalamon
Member

@adriansalamon adriansalamon commented Sep 18, 2025

Adds:

  • Loki, a log aggregation service/database. Stores logs in S3.
  • Prometheus, a metrics collection service. Uses service discovery to find metrics endpoints for services running in Nomad. Also collects node metrics for all nodes and Nomad metrics for all clients.
  • Grafana, a nice frontend for visualizing and building dashboards for both.
  • Vector, a log scraping/processing service. Deployed on each Nomad client.

Some implementation details:

Vector runs as a Nomad job that mounts the local Docker socket to read Docker logs and ships them to Loki.
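A minimal sketch of what such a job could look like, not the actual job file from this PR. The job name, image tag, Loki endpoint, and label are all assumptions. Mounting a host path this way also assumes the Docker driver on the clients has volumes enabled.

```hcl
# Hypothetical sketch: names, image tag, and the Loki URL are placeholders.
job "vector" {
  type = "system" # one allocation on every eligible Nomad client

  group "vector" {
    task "vector" {
      driver = "docker"

      config {
        image = "timberio/vector:latest-alpine"
        args  = ["--config", "/local/vector.yaml"]
        # Read-only access to the host's Docker socket so Vector can read container logs.
        volumes = ["/var/run/docker.sock:/var/run/docker.sock:ro"]
      }

      # Vector config rendered into the task's local/ directory.
      template {
        destination = "local/vector.yaml"
        data        = <<-EOF
          sources:
            docker:
              type: docker_logs
          sinks:
            loki:
              type: loki
              inputs: ["docker"]
              endpoint: "http://loki.example.internal:3100"
              encoding:
                codec: json
              labels:
                source: docker
        EOF
      }
    }
  }
}
```

The `system` job type is what puts one Vector instance on each client, which matches the "deployed on each Nomad client" description.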

Prometheus can use Nomad service discovery to find scrape targets. For example, you could add a new endpoint to your service that exposes metrics (on a different port than the public one, of course) and register it like this:

service {
  port     = "metrics_http"
  provider = "nomad"
  tags = [
    "prometheus.scrape=true",
    "traefik.enable=true",
    "traefik.http.routers.tiki-metrics.rule=Host(`tiki-metrics.nomad.dsekt.internal`)",
    "traefik.http.routers.tiki-metrics.entrypoints=web-internal",
  ]
}
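On the Prometheus side, a scrape job that picks up services tagged this way might look roughly like the fragment below. It is a sketch under assumptions: the Nomad server address, the job name, and the file destination are placeholders, and the `keep` regex assumes Prometheus wraps the joined tag list in the `,` separator.

```hcl
# Hypothetical fragment of the Prometheus task; address and names are placeholders.
template {
  destination = "local/prometheus.yml"
  data        = <<-EOF
    scrape_configs:
      - job_name: nomad-services
        nomad_sd_configs:
          - server: "http://127.0.0.1:4646"
        relabel_configs:
          # Keep only services that carry the prometheus.scrape=true tag.
          - source_labels: [__meta_nomad_tags]
            regex: ".*,prometheus.scrape=true,.*"
            action: keep
          # Expose the Nomad service name as a regular label.
          - source_labels: [__meta_nomad_service]
            target_label: service
  EOF
}
```

Filtering on a tag keeps scraping opt-in, so services without a metrics endpoint are not scraped by accident.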

For service discovery of node metrics and similar, we can use DNS-based service discovery, i.e. Prometheus queries _node._tcp.monitoring.dsekt.internal and gets SRV records with the hosts and ports to scrape.
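As a rough illustration, the DNS-based lookup could be expressed with a `dns_sd_configs` entry like the one below. The job name, refresh interval, and file destination are assumptions. In practice this scrape job would live in the same prometheus.yml as any other scrape jobs.

```hcl
# Hypothetical fragment of the Prometheus task; job name and interval are placeholders.
template {
  destination = "local/prometheus.yml"
  data        = <<-EOF
    scrape_configs:
      - job_name: node
        dns_sd_configs:
          # SRV lookup: each record yields one host:port scrape target.
          - names: ["_node._tcp.monitoring.dsekt.internal"]
            type: SRV
            refresh_interval: 30s
  EOF
}
```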

Note: Grafana, Loki, and Prometheus jobs are already deployed and working.

PS: I have no idea how we can effectively deploy/test this.

@datasektionen datasektionen deleted a comment from github-actions bot Sep 18, 2025
@datasektionen datasektionen deleted a comment from github-actions bot Sep 22, 2025
@datasektionen datasektionen deleted a comment from github-actions bot Sep 25, 2025
@adriansalamon adriansalamon marked this pull request as ready for review September 25, 2025 17:10
@adriansalamon
Member Author

On another note, we could consider using Consul for service discovery and as a distributed KV store, because it is very nice ☺️

@foodelevator
Member

On the same other note: the current solution, i.e. using Traefik and manual DNS config for service "discovery", was picked because Nomad's built-in service discovery doesn't support discovering services in other namespaces. When I tried Consul, I tried to use the service mesh, which I did not get working, so I threw out Consul entirely. But using it for service discovery without the service mesh might be nice.

@datasektionen datasektionen deleted a comment from github-actions bot Nov 1, 2025
@datasektionen datasektionen deleted a comment from github-actions bot Nov 1, 2025
@adriansalamon
Member Author

This is all deployed and working. Still todo:

Contributor

@Poizon7 Poizon7 left a comment

lgtm

@Poizon7 Poizon7 merged commit cfd81f8 into main Nov 3, 2025
@foodelevator foodelevator deleted the feat/log-stack branch November 3, 2025 13:06
4 participants