Core components install random failure during create-cluster #7780

@DavidePrincipi

Description

The installation of core components may fail intermittently during create-cluster when variables such as LOKI_ADDR cannot be found in Redis. This happens when another component reads them before the Loki installation has completed.

Steps to reproduce

  • Run the Create New Cluster procedure

Expected behavior

All components and services are correctly installed and started.

Actual behavior

The node_exporter.service unit is neither enabled nor started.

The leader node itself has an offline alert.

Log evidence: LOKI_ADDR is read by a module that is being installed concurrently (Metrics) just before the Loki installation completes:

Dec 05 07:04:48 rl1 runagent[35965]: Traceback (most recent call last):
Dec 05 07:04:48 rl1 runagent[35965]:   File "/home/metrics1/.config/bin/provision-prometheus", line 222, in <module>
Dec 05 07:04:48 rl1 runagent[35965]:     generate_prometheus_config(redis_client)
Dec 05 07:04:48 rl1 runagent[35965]:   File "/home/metrics1/.config/bin/provision-prometheus", line 55, in generate_prometheus_config
Dec 05 07:04:48 rl1 runagent[35965]:     logcli["LOKI_ADDR"] = logcli["LOKI_ADDR"] + ':' + logcli["LOKI_HTTP_PORT"]
Dec 05 07:04:48 rl1 runagent[35965]:                           ~~~~~~^^^^^^^^^^^^^
Dec 05 07:04:48 rl1 runagent[35965]: KeyError: 'LOKI_ADDR'
Dec 05 07:04:48 rl1 systemd[34508]: Started libcrun container.
Dec 05 07:04:48 rl1 systemd[34571]: prometheus.service: Control process exited, code=exited, status=1/FAILURE
Dec 05 07:04:48 rl1 podman[36012]: loki
Dec 05 07:04:48 rl1 systemd[34508]: Started Loki pod service.
Dec 05 07:04:48 rl1 agent@loki1[34538]: task/module/loki1/55290a26-9233-43e0-bc40-8414f72b1029: action "create-module" status is "completed" (0) at step 20systemd

The Metrics installation failure aborts the create-cluster action, leaving node_exporter.service unconfigured and stopped:

Dec 05 07:04:46 rl1 agent@metrics1[34607]: task/module/metrics1/2b0e231e-7c05-4dad-a5bd-e6ce6af4b1da: action "create-module" status is "aborted" (1) at step 80start_services
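
One possible mitigation, sketched here only as an illustration and not as the actual fix, is to have the consumer poll for the Loki variables for a bounded time instead of indexing them immediately. The helper names `wait_for_keys` and `loki_endpoint`, the timeout values, and the `fetch` callable are all hypothetical; `fetch` stands for whatever Redis lookup `provision-prometheus` already performs to build `logcli`.

```python
import time


def wait_for_keys(fetch, required, timeout=30.0, interval=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll fetch() until every key in `required` has a non-empty value.

    `fetch` is a hypothetical zero-argument callable returning a dict,
    for example a wrapper around the existing Redis lookup.
    Raises TimeoutError naming the missing keys once `timeout` seconds elapse.
    """
    deadline = clock() + timeout
    while True:
        env = fetch() or {}
        missing = [key for key in required if not env.get(key)]
        if not missing:
            return env
        if clock() >= deadline:
            raise TimeoutError(
                "still missing after %ss: %s" % (timeout, ", ".join(missing))
            )
        sleep(interval)


def loki_endpoint(env):
    """Join address and port the same way the failing line 55 does."""
    return env["LOKI_ADDR"] + ":" + env["LOKI_HTTP_PORT"]


# Hypothetical usage inside generate_prometheus_config():
#   logcli = wait_for_keys(read_loki_env, ["LOKI_ADDR", "LOKI_HTTP_PORT"])
#   logcli["LOKI_ADDR"] = loki_endpoint(logcli)
# where read_loki_env is an assumed wrapper around the current Redis read.
```

Polling only narrows the race window. If Loki never finishes installing, the timeout still fails the step, but with a message that names the missing variables rather than a bare KeyError. Ordering the Metrics installation after Loki in create-cluster would remove the race altogether.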

Components

  • Core 3.15
  • Metrics 1.2.0
  • Loki 1.4.0

Status

In Progress
