Conversation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
```go
DefaultProbeTimeout      = 500 * time.Millisecond
DefaultProbeInterval     = 1 * time.Second
DefaultReconnectInterval = 10 * time.Second
DefaultReconnectTimeout  = 6 * time.Hour
```
what should these values be?
Reconnection should probably be indefinite, for as long as SD returns the AM.
Peers only come from the `--cluster.peer` flags at startup, or from instances that connect to a running AM later. AM doesn't do any form of SD lookup to find its peers, so I think there needs to be some form of timeout, since we have no way of knowing whether a former peer is unreachable or has ceased to exist.
force-pushed from eb955b2 to 80d831f
```go
func (p *Peer) reconnect() {
	p.peerLock.RLock()
	failedPeers := p.failedPeers
	p.peerLock.RUnlock()
```
The reconnect test was locking up when this used the usual `Lock(); defer Unlock()`. I don't know *why*, though, and couldn't see any obvious reason.
Thanks, I'll have a look!
simonpasquier left a comment:
Looks good overall.
AFAICT there are still a few situations that aren't handled properly:
- when a peer restarts, the other peer still tries to reconnect even after the successful rejoin of the first one, because the name (ULID) of the first peer has changed. One solution would be to use the peer's address instead of its name as the key.
- with asymmetric configurations (e.g. peer A is started without `--cluster.peer`, peer B with `--cluster.peer=<peer A>`), peer B will never try to reconnect if A is down when B starts.
cluster/cluster.go (Outdated)
```
@@ -143,9 +198,188 @@ func Join(
	if n > 0 {
		go p.warnIfAlone(l, 10*time.Second)
```
IMO we can get rid of this goroutine.
```go
	return float64(len(p.failedPeers))
})
p.failedReconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
```
Instead of failed/successful reconnections, it could be failed/total reconnections. Also, a `_total` suffix for counters?
sounds good to me.
Ah, so checking the result of the initial
If a peer is restarted, it will rejoin with the same IP but different ULID. So the node will rejoin the cluster, but its peers will never remove it from their internal list of failed nodes because its ULID has changed.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
updated. feel free to suggest a better way to grab the nodes that we failed to connect to.
cluster/cluster.go (Outdated)
```go
p.reconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "alertmanager_cluster_reconnections_total",
	Help: "A counter of the number of successful cluster peer reconnections.",
```
"successful" to be removed.
cluster/cluster.go (Outdated)
```go
	return float64(len(p.failedPeers))
})
p.failedReconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "alertmanager_cluster_failed_reconnections_total",
```
`alertmanager_cluster_reconnections_failed_total` might be more idiomatic?
cluster/cluster.go (Outdated)
```go
	peers: map[string]peer{},
}

if reconnectInterval != 0 {
```
It could be done at the very end of the function once it is certain that it will return without error.
cluster/cluster.go (Outdated)
```go
if n > 0 {
	go p.warnIfAlone(l, 10*time.Second)
}
p.setInitialFailed(resolvedPeers)
```
IMO it would be simpler to initialize p.failedPeers with all known peers before calling ml.Join(...).
that's WAAAY better :)
```go
pr, ok := p.peers[n.Address()]
if !ok {
	// Why are we receiving an update from a node that never
	// joined?
```
Could it be that the process has restarted and receives an out-of-band notification?
force-pushed from 7657bc7 to 36d80ab
@simonpasquier I think I've addressed all your comments, let me know what you think

Going to merge this so we can get another rc out

@stuartnelson3 Let me know if you want me to follow up with #1363 and #1364.

As soon as I feel confident about #1389 let's push this out. That's the only thing (in my mind) blocking 0.15.0. There were issues with message queueing (we were generating more messages than could be gossiped) that I need to resolve; I'll be working on it tomorrow.

EDIT: Thanks for your patience, I know this has been a really long release cycle.
add reconnection support for dead peers

todo:
- decide values for `DefaultReconnectInterval` and `DefaultReconnectTimeout`

edit:
I also included the `logWriter{}` wrapper I was using to expose memberlist logging. It's very verbose, and doesn't really conform to how we've been logging, so I'm not sure how best to expose it (or if I should just remove it from this PR).