Update alertmanager to upstream v0.15.1 with memberlist #929
Conversation
Force-pushed 249336a to d805c18
Force-pushed b811561 to a38a9c2
csmarchbanks left a comment:
This seems fine code-wise. Have you tried deploying it and having it discover other members based on the alertmanager service URL? I am hoping not to use a StatefulSet if possible.
No, I have not tried via discovery; my vague memory is that it won't work as-is. I also have not done the config for a StatefulSet, but in my mind a StatefulSet seems better in every way for a component like this. Interested to hear what makes you avoid them.
The only issue I have with StatefulSets is that I think you have to create an entirely new set (or delete the old one) to change values other than the image or resource constraints. Not a big deal, but a bit of a hassle.

This is really the only pain point we have with StatefulSets in our environment. To make any change outside the few fields that Kubernetes allows you to modify, you must remove the entire StatefulSet first, incurring some downtime for the component.
If the peer list could be rendered into a config file and loaded via a ConfigMap, that might avoid the issue of updating fields in a StatefulSet spec.
This would also close #1205.
Force-pushed 2810427 to 26d9030
Rebased to latest master. I now think a StatefulSet is not required, because each peer just needs to find any existing peer, and that can be done via regular service discovery. More notes on the commands used in testing:
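As an illustration only (these are not the author's original commands), a local two-node gossip test with the upstream Alertmanager image might look like this; the image tag, config file path, and published ports are assumptions:

```sh
# Illustrative sketch, not the original test commands.
# Uses upstream Alertmanager v0.15.1; adjust image, config path and ports as needed.
docker network create am-test

docker run -d --name am1 --network am-test -p 9093:9093 \
  prom/alertmanager:v0.15.1 \
  --config.file=/etc/alertmanager/config.yml \
  --cluster.listen-address=0.0.0.0:9094

# The second instance joins the first via memberlist gossip.
docker run -d --name am2 --network am-test -p 19093:9093 \
  prom/alertmanager:v0.15.1 \
  --config.file=/etc/alertmanager/config.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=am1:9094
```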
Force-pushed 26d9030 to d0c0c9b
I just tested this as a Kubernetes Deployment. May need to check what exactly memberlist does when you give it a Kubernetes Service address. UPDATE: Prometheus Alertmanager takes the name you give it and does a DNS lookup, so a headless service is perfect. It does this once at startup, so each new alertmanager will connect to all existing alertmanagers; dead ones are removed from the list.
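For illustration, a minimal sketch of the kind of headless Service this implies; the Service name, namespace, label selector, and gossip port are assumptions, not taken from this PR:

```sh
# Sketch only: a headless Service (clusterIP: None) whose DNS name resolves to
# one A record per ready alertmanager pod, so memberlist can see every existing
# member at startup. Names, namespace and port are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-gossip
  namespace: monitoring
spec:
  clusterIP: None
  selector:
    app: alertmanager
  ports:
    - name: gossip
      port: 9094
EOF

# Each replica could then be pointed at the service DNS name as its initial peer,
# e.g. via the cluster.peer flag registered in pkg/alertmanager/multitenant.go:
#   -cluster.peer=alertmanager-gossip.monitoring.svc.cluster.local:9094
```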
I'm now running this in a staging cluster. All seems fine, although there is a bit of log noise at startup. I think this is because gossip starts before the configs are all loaded, and we receive updates from already-running alertmanagers about instances we don't know about yet. Still, the log noise is down about 100x from the current version.
pkg/alertmanager/multitenant.go
Outdated
flag.StringVar(&cfg.clusterAdvertiseAddr, "cluster.advertise-address", "", "Explicit address to advertise in cluster.")
flag.Var(&cfg.peers, "cluster.peer", "Initial peers (may be repeated).")
flag.DurationVar(&cfg.peerTimeout, "cluster.peer-timeout", time.Second*15, "Time to wait between peers to send notifications.")
flag.DurationVar(&cfg.gossipInterval, "cluster.gossip-interval", cluster.DefaultGossipInterval, "Interval between sending gossip messages. By lowering this value (more frequent) gossip messages are propagated across the cluster more quickly at the expense of increased bandwidth.")
I wonder if we really need to expose all of these "cluster" parameters.
We could probably get away with defaults for most of these. Do you plan to actually change any of them?
Force-pushed 3332bf8 to 9b72c2e
Have rebased against master; this PR now undoes some of the vendor hacks introduced by #1510, putting Alertmanager back on the mainline version.
csmarchbanks left a comment:
This seems sane to me, but I don't actually run alertmanager. Perhaps @khaines could take a look as well since I believe he heavily uses alertmanager?
Force-pushed d1643c1 to 1e39f02
I removed most of the new configuration parameters (left the commit in, so they can be retrieved if we do need them).
Don't expect any of these will need to be configured. Signed-off-by: Bryan Boreham <bryan@weave.works>
Force-pushed 1e39f02 to f547d62
khaines left a comment:
LGTM. Glad to see this getting updated!
Fixes #793
Fixes #1205
Fixes #343 because that code is removed
Fixes #899 because that code is removed
Fixes #900 because the message is now at debug level upstream
Options like `-alertmanager.mesh.peer.service` from the previous implementation are removed.

In a Kubernetes deployment, it can be run as a StatefulSet: suppose the members of the set are named a1, a2 and a3, then all can be run as `alertmanager -peer a1 -peer a2 -peer a3`.

I have tested as individual Docker containers, then tried various `curl` commands against the API.
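The exact `curl` invocations are not the author's; purely as a hedged sketch, checks along these lines would confirm the cluster formed and that state gossips between members (ports follow the two-container example earlier; endpoints assume upstream Alertmanager's v1 API):

```sh
# Illustrative checks only; ports assume the two containers started above
# (9093 for am1, 19093 for am2) and upstream Alertmanager's v1 API.

# Cluster status: each instance should report both peers.
curl -s http://localhost:9093/api/v1/status
curl -s http://localhost:19093/api/v1/status

# Silences (and the notification log) are gossiped between members, so creating
# a silence on one instance and reading it back from the other exercises the
# memberlist path end to end. (Alerts themselves are not gossiped.)
curl -s -XPOST http://localhost:9093/api/v1/silences \
  -H 'Content-Type: application/json' \
  -d '{"matchers":[{"name":"alertname","value":"GossipTest","isRegex":false}],
       "startsAt":"2019-01-01T00:00:00Z","endsAt":"2030-01-01T00:00:00Z",
       "createdBy":"test","comment":"gossip test"}'

curl -s http://localhost:19093/api/v1/silences
```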