feat(observability): add runner VM hostmetrics Grafana dashboard #187

Merged: cbartz merged 12 commits into main from feat/hostmetrics-grafana-dashboard on Apr 28, 2026

@cbartz cbartz commented Apr 23, 2026

Summary

Adds a read-only Grafana dashboard for runner VM host-level metrics, served via cos-configuration-k8s using the grafana-dashboard relation. Provisioned dashboards are filesystem-managed in Grafana, so they cannot be edited through the UI regardless of user role.

Changes

  • runner_grafana_dashboards/runner_vm_hostmetrics.json: new dashboard covering CPU, memory, disk I/O, filesystem, network traffic and load averages
    • Layout mirrors the upstream OpenTelemetry hostmetrics dashboard (Grafana gnetId 24638): Overview row of gauges/stats, then CPU, Memory, Disk I/O, Filesystem and Network sections
    • Template variables: github_repository, github_workflow, github_job, github_runner — all single-select with includeAll: true and allValue: ".*". Picking a value scopes panels to one host (the upstream design); picking "All" widens the scope as a regex shortcut.
    • Label matchers use =~ so the regex from "All" interpolates correctly
    • Metric names follow the OpenTelemetry hostmetrics receiver Prometheus convention (system_cpu_time_seconds_total, system_memory_usage_bytes, …); the four GitHub-context labels are added by the OTel collector pipeline in github-runner-operator#781
    • __inputs declares the Prometheus / Mimir datasource and editable: false is set for clarity (provisioned dashboards are read-only either way)
  • README.md: documents the repo layout, the observability dashboard delivery mechanism, and the conventions for both per-charm and runner-VM dashboards
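
The variable and matcher conventions listed above could look roughly like the following. This is a minimal sketch, not the contents of runner_vm_hostmetrics.json: the helper names `template_variable` and `scoped_selector`, the `label_values(...)` query, and the runner name `runner-1` are assumptions made for illustration. Only the four variable names, `includeAll`, `allValue: ".*"`, single-select, and the use of `=~` come from the description.

```python
import json

# The four GitHub-context labels named in the PR description.
GITHUB_LABELS = ["github_repository", "github_workflow", "github_job", "github_runner"]


def template_variable(name: str) -> dict:
    """Build a single-select Grafana query variable that also offers "All"."""
    return {
        "name": name,
        "type": "query",
        # Assumed query; the real dashboard may populate values differently.
        "query": f"label_values(system_cpu_time_seconds_total, {name})",
        "multi": False,      # single-select
        "includeAll": True,  # expose the "All" option
        "allValue": ".*",    # "All" interpolates as a match-everything regex
    }


def scoped_selector(metric: str, selected: dict) -> str:
    """Render a PromQL selector as it would look after variable interpolation."""
    matchers = ",".join(f'{label}=~"{selected[label]}"' for label in GITHUB_LABELS)
    return f"{metric}{{{matchers}}}"


variables = [template_variable(name) for name in GITHUB_LABELS]
all_selected = {label: ".*" for label in GITHUB_LABELS}
one_runner = dict(all_selected, github_runner="runner-1")  # hypothetical runner

print(json.dumps(variables[3], indent=2))
print(scoped_selector("system_memory_usage_bytes", one_runner))
```

With an equality matcher (`=`) instead of `=~`, the interpolated `.*` would be compared as a literal string and match no series, which is why the regex matcher is needed for "All" to work.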

Notes

Closes / relates to: ISD-5152

cbartz added 4 commits April 23, 2026 14:21
Adds a read-only Grafana dashboard (editable: false) for runner VM
host-level metrics to be served via cos-configuration-k8s using the
grafana-dashboard relation, which provisions it as an immutable
filesystem dashboard in Grafana.

The dashboard covers:
- CPU utilisation by state and load averages
- Memory usage by state
- Disk I/O throughput and operations
- Filesystem usage % by mount point
- Network traffic, errors and drops

Template variables:
- github_job_id: filter by GitHub Actions workflow run job ID
- instance: filter by runner hostname

Metric names follow the OpenTelemetry hostmetrics receiver prometheus
convention (e.g. system_cpu_time_seconds_total). The github_job_id
label is expected to be set as a resource attribute by the otelcol
pipeline collecting metrics from the runner VMs.

Related: ISD-5152

Rename grafana_dashboards/ to runner_grafana_dashboards/ to make the
purpose explicit at the repo root level (runner VM host metrics, not
charm workload metrics).

Update README with:
- Repository layout overview
- Observability section explaining the cos-configuration-k8s delivery
  mechanism and the immutability guarantee
- Table of conventions for where dashboards live and what
  grafana_dashboards_path value to use in Terraform

Replace github_job_id with github_job and instance with github_runner
to match the actual attribute labels set by the pre-job OTel config
(see canonical/github-runner-operator#781).

Add github_repository and github_workflow template variables so the
dashboard can be filtered the same way as the existing PS6 hostmetrics
dashboard.

…ayout

Restructure the runner VM hostmetrics dashboard to follow the upstream
OpenTelemetry hostmetrics dashboard (Grafana gnetId 24638): Overview row
of CPU/Memory/Root FS gauges plus Load/Cores/Total Memory stats, then
CPU, Memory, Disk I/O, Filesystem and Network sections with read/write
and rx/tx split axes.

Make every templating variable support "All" via includeAll, multi-select
and allValue ".*", and switch all label matchers to =~ so regex
interpolation works.
Copilot AI left a comment

Pull request overview

Adds a new provisioned (read-only) Grafana dashboard for GitHub Actions runner VM host-level metrics and documents how dashboards are laid out and delivered via cos-configuration-k8s.

Changes:

  • Add runner_grafana_dashboards/runner_vm_hostmetrics.json with CPU/memory/disk/filesystem/network panels based on OTel hostmetrics Prometheus metrics.
  • Update README.md to document repository layout and the Grafana dashboard delivery mechanism/conventions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| runner_grafana_dashboards/runner_vm_hostmetrics.json | New Grafana dashboard JSON for runner VM host metrics with templating variables and PromQL queries. |
| README.md | Documents dashboard locations and provisioning conventions via cos-configuration-k8s / grafana-dashboard. |

Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json
Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json Outdated
Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json Outdated
Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json
Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json
Comment thread README.md Outdated
…dashboard

When the runner variable resolves to multiple series (multi-select or
"All"), several panels previously produced misleading values:

- CPU Cores stat / System Load "cores" reference: count(count by (cpu) ...)
  collapses cpu indexes across runners, returning the max-cores-on-any-host
  rather than fleet total. Group by github_runner so cpu indexes stay
  distinct, then expose total cores in the stat panel and per-runner cores
  on the load panel (so the reference aligns with the averaged load lines).
- System Load 1m/5m/15m: bare metric returns one series per runner with
  identical legends ("1m"/"5m"/"15m"), making the chart unreadable. Wrap
  in avg() to get one fleet-average line per period.
- Disk Busy %: sum by (device) of fractional busy time can exceed 1 with
  multiple runners and gets silently clamped by max:1. Switch to
  avg by (device) so the value stays a meaningful 0-1 fleet average.
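
The two aggregation pitfalls above can be reproduced with plain Python sets and dictionaries. This is an illustrative sketch only: the runner names, core counts, and busy fractions are made up, and the set and dict operations merely stand in for the PromQL grouping behaviour described in this commit.

```python
from collections import defaultdict

# Hypothetical fleet: runner-a has 4 cores, runner-b has 2 cores.
cpu_series = [("runner-a", f"cpu{i}") for i in range(4)] + \
             [("runner-b", f"cpu{i}") for i in range(2)]

# count(count by (cpu) (...)): runners share cpu indexes, so groups merge
# and the result is the core count of the largest host.
collapsed = len({cpu for _, cpu in cpu_series})

# count(count by (github_runner, cpu) (...)): indexes stay distinct per runner,
# so the result is the fleet total.
fleet_total = len({(runner, cpu) for runner, cpu in cpu_series})

# Hypothetical busy fractions (0-1) for the same device name on each runner.
disk_busy = [("runner-a", "vda", 0.75), ("runner-b", "vda", 0.5)]

by_device = defaultdict(list)
for _, device, busy in disk_busy:
    by_device[device].append(busy)

# sum by (device) can exceed 1; avg by (device) stays within 0-1.
sum_by_device = {d: sum(v) for d, v in by_device.items()}
avg_by_device = {d: sum(v) / len(v) for d, v in by_device.items()}

print(collapsed, fleet_total)        # 4 6
print(sum_by_device, avg_by_device)  # {'vda': 1.25} {'vda': 0.625}
```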

Also soften the README guidance on editable: false. cos-configuration-k8s
provisions dashboards from the filesystem, which makes them read-only in
Grafana regardless of the flag, so the explicit "must" requirement was
contradicted by existing dashboards in charms/planner-operator/.
Copilot AI left a comment

Pull request overview

Adds a new filesystem-provisioned Grafana dashboard for GitHub Actions runner VM host-level metrics and documents how dashboards are organized and delivered to Grafana via cos-configuration-k8s.

Changes:

  • Add runner_vm_hostmetrics.json dashboard covering CPU, memory, disk I/O, filesystem, network, load.
  • Document repository layout and Grafana dashboard delivery/conventions in README.md.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| runner_grafana_dashboards/runner_vm_hostmetrics.json | New hostmetrics dashboard with templated filters and multiple panels for VM resource metrics |
| README.md | Adds repo layout + explains dashboard provisioning via cos-configuration-k8s and dashboard authoring conventions |

Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json Outdated
Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json
Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json Outdated
Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json Outdated
Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json Outdated
Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json Outdated
Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json Outdated
Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json Outdated
cbartz added 2 commits April 27, 2026 13:58
Expand the dashboard description to spell out the expected usage of the
github_runner variable: scope it to a flavor regex (e.g. flavor-x-.*)
when comparing fleets, or pick a single runner for per-host inspection.

Aggregating by device/mountpoint without grouping by github_runner is
intentional — it produces meaningful fleet totals/averages when the
matched runners share device semantics — but assumes operators don't
mix heterogeneous flavors under "All".
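
As a rough illustration of how a flavor regex scopes the github_runner variable, the sketch below mimics a Prometheus `=~` matcher, which is anchored at both ends, using `re.fullmatch`. The runner names are hypothetical, and Python's `re` is only a stand-in for the RE2 syntax Prometheus uses; for simple patterns like these the two behave the same.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Mimic a Prometheus =~ matcher, which anchors the regex at both ends."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical runner names spanning two flavors.
runners = ["flavor-x-001", "flavor-x-002", "flavor-y-001"]

fleet = [r for r in runners if matches("flavor-x-.*", r)]    # one flavor
single = [r for r in runners if matches("flavor-x-001", r)]  # one host
everything = [r for r in runners if matches(".*", r)]        # "All"

print(fleet)
print(single)
print(len(everything))
```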
Copilot AI left a comment

Pull request overview

Adds a filesystem-provisioned (read-only) Grafana dashboard for GitHub Actions runner VM host-level metrics, and documents how dashboards are delivered via cos-configuration-k8s.

Changes:

  • Add runner_vm_hostmetrics.json dashboard covering CPU, memory, disk I/O, filesystem, and network metrics with GitHub-context templating variables.
  • Document dashboard delivery mechanism and repository/dashboard layout conventions in README.md.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| runner_grafana_dashboards/runner_vm_hostmetrics.json | New hostmetrics dashboard for runner VMs with PromQL queries and templating variables. |
| README.md | Documents dashboard provisioning via cos-configuration-k8s and expected dashboard directory conventions. |

Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json Outdated
Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json
- Root FS gauge: restrict the denominator to state=~"used|free" to match
  the Filesystem Utilization bargauge and df semantics. Without this,
  reserved blocks (e.g. ext4's 5% root reservation) inflate the
  denominator and the gauge reads artificially low.
- System Load cores override: the field matcher still pointed at the
  old "cores" legend after the per-runner rename, so the red dashed
  styling never applied. Update the matcher to "cores (per runner)".
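
A small worked example of the Root FS denominator change, under assumed numbers: a 100 GiB root filesystem with 5 GiB reserved. The byte values and the `utilization` helper are hypothetical; only the `used`, `free`, and reserved states come from the commit message.

```python
GIB = 1024 ** 3

# Hypothetical per-state filesystem usage samples for "/".
usage = {"used": 60 * GIB, "free": 35 * GIB, "reserved": 5 * GIB}


def utilization(samples: dict, states: tuple) -> float:
    """Return used bytes divided by the sum of the selected states."""
    return samples["used"] / sum(samples[s] for s in states)


# Denominator over every state: reserved blocks dilute the ratio.
all_states = utilization(usage, ("used", "free", "reserved"))

# Denominator restricted to used|free, matching df's Use% calculation.
df_style = utilization(usage, ("used", "free"))

print(f"{all_states:.1%} vs {df_style:.1%}")
```

With these numbers the unrestricted ratio is 60/100 while the restricted one is 60/95, so the gauge reads about three percentage points higher after the fix.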
Copilot AI left a comment

Pull request overview

Adds a filesystem-provisioned, read-only Grafana dashboard to visualize GitHub Actions runner VM host-level metrics, and documents the dashboard delivery/layout conventions for this monorepo.

Changes:

  • Add a new runner VM hostmetrics dashboard JSON with CPU, memory, disk I/O, filesystem, and network panels plus GitHub-context template variables.
  • Document the repository’s Grafana dashboard layout and cos-configuration-k8s provisioning conventions in the README.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| runner_grafana_dashboards/runner_vm_hostmetrics.json | New Grafana dashboard for runner VM hostmetrics (Prometheus/OTel hostmetrics naming + GitHub-context filters). |
| README.md | Documents dashboard directory conventions and delivery mechanism via cos-configuration-k8s and the grafana-dashboard relation. |

cbartz added 2 commits April 27, 2026 15:12
Drop multi-select on the GitHub-context variables (kept includeAll +
allValue: ".*" so picking "All" still widens the scope as a regex).
Single-select matches the upstream OpenTelemetry hostmetrics design and
makes per-host attribution work — multi-runner aggregations under
sum by (device) collapsed identically-named devices across hosts and
hid which runner was responsible for any given spike.
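
The attribution problem described above can be sketched as follows. The sample values and the `sum_by` helper are hypothetical; the helper only imitates how a `sum by (...)` aggregation drops every label not listed in the grouping.

```python
from collections import defaultdict

# Hypothetical disk write rates in bytes/s; both hosts expose a device named "vda".
samples = [
    {"github_runner": "runner-a", "device": "vda", "value": 5_000_000},
    {"github_runner": "runner-b", "device": "vda", "value": 95_000_000},
]


def sum_by(series: list, *labels: str) -> dict:
    """Aggregate like a `sum by (labels)`: every other label is dropped."""
    out = defaultdict(int)
    for s in series:
        out[tuple(s[label] for label in labels)] += s["value"]
    return dict(out)


# Grouping by device alone merges both hosts into one series,
# so the spike can no longer be attributed to runner-b.
fleet = sum_by(samples, "device")

# Selecting a single runner keeps the bare series and its attribution.
one_host = [s for s in samples if s["github_runner"] == "runner-b"]

print(fleet)
print(one_host)
```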

With single-select assured, simplify the dense per-device/per-mountpoint
panels back to bare metrics (drop sum by device on disk I/O throughput,
disk IOPS, disk busy %, memory usage, filesystem usage, network
throughput/packets/errors). Revert the multi-runner-defensive variants
of CPU Cores, System Load 1m/5m/15m and the cores reference series.

Aggregations are kept where they are inherent to the metric: overview
gauges (CPU/Memory/Root FS), Memory Utilization (sum/sum ratio),
Filesystem Utilization (sum by mountpoint ratio) and TCP Connections
(sum by state).

Drop the cross-host-aggregation note from the dashboard description
since the design no longer relies on it.
Copilot AI left a comment

Pull request overview

Adds a new Grafana dashboard for GitHub Actions runner VM host-level metrics and documents how dashboards are organized and delivered via cos-configuration-k8s.

Changes:

  • Add a filesystem-provisioned Grafana dashboard JSON for runner VM hostmetrics (CPU, memory, disk, filesystem, network, load).
  • Document repository layout and the Grafana dashboard delivery mechanism/conventions in README.md.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| runner_grafana_dashboards/runner_vm_hostmetrics.json | New hostmetrics dashboard JSON with templated GitHub-context filters and PromQL queries. |
| README.md | Documents dashboard locations and delivery conventions via cos-configuration-k8s. |

Comment thread runner_grafana_dashboards/runner_vm_hostmetrics.json
@cbartz cbartz marked this pull request as ready for review April 27, 2026 13:56
@cbartz cbartz merged commit eb043c6 into main Apr 28, 2026
36 checks passed
@cbartz cbartz deleted the feat/hostmetrics-grafana-dashboard branch April 28, 2026 05:20

3 participants