Update alertmanager to upstream v0.15.1 with memberlist #929
Conversation
Force-pushed 249336a to d805c18
Force-pushed b811561 to a38a9c2
csmarchbanks left a comment:
This seems fine code-wise. Have you tried deploying it and having it discover other members based on the alertmanager service URL? I am hoping not to use a StatefulSet if possible.
No, I have not tried via discovery; my vague memory is that it won't work as-is. I also have not done the config for a StatefulSet, but in my mind a StatefulSet seems better in every way for a component like this. Interested to hear what makes you avoid them.
The only issue I have with StatefulSets is that I think you have to create an entirely new set (or delete the old one) to change values other than the image or resource constraints. Not a big deal, but a bit of a hassle.

This is really the only pain point we have with StatefulSets in our environment. To make any change outside the few fields that Kubernetes allows you to modify, you must remove the entire StatefulSet first, incurring some downtime for the component.
If the peer list could be rendered into a config file and loaded via a ConfigMap, that might avoid the issue of updating fields in a StatefulSet spec.
This would also close #1205.
Force-pushed 2810427 to 26d9030
Rebased to latest master. I now think a StatefulSet is not required, because each peer just needs to find any existing peer, and that can be done via regular service discovery. More notes on the commands used in testing:
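As an illustration only (these are not the author's original commands), a local two-node gossip test with the upstream Alertmanager image might look like this; the image tag, config file path, and published ports are assumptions:

```sh
# Illustrative sketch, not the original test commands.
# Uses upstream Alertmanager v0.15.1; adjust image, config path and ports as needed.
docker network create am-test

docker run -d --name am1 --network am-test -p 9093:9093 \
  prom/alertmanager:v0.15.1 \
  --config.file=/etc/alertmanager/config.yml \
  --cluster.listen-address=0.0.0.0:9094

# The second instance joins the first via memberlist gossip.
docker run -d --name am2 --network am-test -p 19093:9093 \
  prom/alertmanager:v0.15.1 \
  --config.file=/etc/alertmanager/config.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=am1:9094
```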
Force-pushed 26d9030 to d0c0c9b
I just tested this as a Kubernetes Deployment. May need to check what exactly memberlist does when you give it a Kubernetes Service address. UPDATE: Prometheus Alertmanager takes the name you give it and does a DNS lookup, so a headless service is perfect. It does this once at startup, so each new alertmanager will connect to all existing alertmanagers; dead ones are removed from the list.
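For illustration, a minimal sketch of the kind of headless Service this implies; the Service name, namespace, label selector, and gossip port are assumptions, not taken from this PR:

```sh
# Sketch only: a headless Service (clusterIP: None) whose DNS name resolves to
# one A record per ready alertmanager pod, so memberlist can see every existing
# member at startup. Names, namespace and port are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-gossip
  namespace: monitoring
spec:
  clusterIP: None
  selector:
    app: alertmanager
  ports:
    - name: gossip
      port: 9094
EOF

# Each replica could then be pointed at the service DNS name as its initial peer,
# e.g. via the cluster.peer flag registered in pkg/alertmanager/multitenant.go:
#   -cluster.peer=alertmanager-gossip.monitoring.svc.cluster.local:9094
```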
I'm now running this in a staging cluster. All seems fine, although there is a bit of log noise at startup. I think this is because gossip starts before the configs are all loaded, and we receive updates from already-running alertmanagers about instances we don't know about yet. Still, the log noise is down about 100x from the current version.
pkg/alertmanager/multitenant.go
Outdated
flag.StringVar(&cfg.clusterAdvertiseAddr, "cluster.advertise-address", "", "Explicit address to advertise in cluster.")
flag.Var(&cfg.peers, "cluster.peer", "Initial peers (may be repeated).")
flag.DurationVar(&cfg.peerTimeout, "cluster.peer-timeout", time.Second*15, "Time to wait between peers to send notifications.")
flag.DurationVar(&cfg.gossipInterval, "cluster.gossip-interval", cluster.DefaultGossipInterval, "Interval between sending gossip messages. By lowering this value (more frequent) gossip messages are propagated across the cluster more quickly at the expense of increased bandwidth.")
I wonder if we really need to expose all of these "cluster" parameters.
We could probably get away with defaults for most of these. Do you plan to actually change any of them?
Force-pushed 3332bf8 to 9b72c2e
Have rebased against master; this PR now undoes some of the vendor hacks introduced by #1510, putting Alertmanager back on the mainline version.
csmarchbanks left a comment:
This seems sane to me, but I don't actually run alertmanager. Perhaps @khaines could take a look as well since I believe he heavily uses alertmanager?
Force-pushed d1643c1 to 1e39f02
I removed most of the new configuration parameters (left the commit in, so they can be retrieved if we do need them).
Don't expect any of these will need to be configured. Signed-off-by: Bryan Boreham <bryan@weave.works>
Force-pushed 1e39f02 to f547d62
khaines left a comment:
LGTM. Glad to see this getting updated!
Fixes #793
Fixes #1205
Fixes #343 because that code is removed
Fixes #899 because that code is removed
Fixes #900 because the message is now at debug level upstream
Options like `-alertmanager.mesh.peer.service` from the previous implementation are removed.

In a Kubernetes deployment, it can be run as a StatefulSet: suppose the members of the set are named a1, a2 and a3, then all can be run as `alertmanager -peer a1 -peer a2 -peer a3`.

I have tested as individual Docker containers, then tried various `curl` commands against the API.
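The exact `curl` invocations are not the author's; purely as a hedged sketch, checks along these lines would confirm the cluster formed and that state gossips between members (ports follow the two-container example earlier; endpoints assume upstream Alertmanager's v1 API):

```sh
# Illustrative checks only; ports assume the two containers started above
# (9093 for am1, 19093 for am2) and upstream Alertmanager's v1 API.

# Cluster status: each instance should report both peers.
curl -s http://localhost:9093/api/v1/status
curl -s http://localhost:19093/api/v1/status

# Silences (and the notification log) are gossiped between members, so creating
# a silence on one instance and reading it back from the other exercises the
# memberlist path end to end. (Alerts themselves are not gossiped.)
curl -s -XPOST http://localhost:9093/api/v1/silences \
  -H 'Content-Type: application/json' \
  -d '{"matchers":[{"name":"alertname","value":"GossipTest","isRegex":false}],
       "startsAt":"2019-01-01T00:00:00Z","endsAt":"2030-01-01T00:00:00Z",
       "createdBy":"test","comment":"gossip test"}'

curl -s http://localhost:19093/api/v1/silences
```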