Conversation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
```go
DefaultProbeTimeout      = 500 * time.Millisecond
DefaultProbeInterval     = 1 * time.Second
DefaultReconnectInterval = 10 * time.Second
DefaultReconnectTimeout  = 6 * time.Hour
```
what should these values be?
Reconnection should probably be indefinite, for as long as SD returns the AM.
Peers only come from the `--cluster.peer` flags at startup, or from instances that connect to a running AM later. AM doesn't do any form of SD lookup to find its peers, so I think there needs to be some form of timeout, since we have no way of knowing whether a former peer is unreachable or has ceased to exist.
force-pushed from eb955b2 to 80d831f
```go
func (p *Peer) reconnect() {
	p.peerLock.RLock()
	failedPeers := p.failedPeers
	p.peerLock.RUnlock()
```
The reconnect test was locking up when this used the usual `Lock(); defer Unlock()`. I don't know *why*, though, and couldn't see any obvious reason.
Thanks, I'll have a look!
simonpasquier left a comment:
Looks good overall.
AFAICT there are still a few situations that aren't handled properly:
- when a peer restarts, the other peer still tries to reconnect even after the successful rejoin of the first one, because the name (ULID) of the first peer has changed. One solution would be to use the peer's address instead of its name as the key.
- with asymmetric configurations (e.g. peer A is started without `--cluster.peer`, peer B with `--cluster.peer=<peer A>`), peer B will never try to reconnect if A is down when B starts.
cluster/cluster.go (Outdated)
```
@@ -143,9 +198,188 @@ func Join(
	if n > 0 {
		go p.warnIfAlone(l, 10*time.Second)
```
IMO we can get rid of this goroutine.
```go
	return float64(len(p.failedPeers))
})
p.failedReconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
```
Instead of failed/successful reconnections, it could be failed/total reconnections. Also, a `_total` suffix for counters?
sounds good to me.
Ah, so checking the result of the initial
If a peer is restarted, it will rejoin with the same IP but different ULID. So the node will rejoin the cluster, but its peers will never remove it from their internal list of failed nodes because its ULID has changed.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
updated. feel free to suggest a better way to grab the nodes that we failed to connect to.
cluster/cluster.go (Outdated)
```go
p.reconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "alertmanager_cluster_reconnections_total",
	Help: "A counter of the number of successful cluster peer reconnections.",
```
"successful" to be removed.
cluster/cluster.go (Outdated)
```go
	return float64(len(p.failedPeers))
})
p.failedReconnectionsCounter = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "alertmanager_cluster_failed_reconnections_total",
```
`alertmanager_cluster_reconnections_failed_total` might be more idiomatic?
cluster/cluster.go (Outdated)
```go
	peers: map[string]peer{},
}

if reconnectInterval != 0 {
```
It could be done at the very end of the function once it is certain that it will return without error.
cluster/cluster.go (Outdated)
```go
if n > 0 {
	go p.warnIfAlone(l, 10*time.Second)
}
p.setInitialFailed(resolvedPeers)
```
IMO it would be simpler to initialize p.failedPeers with all known peers before calling ml.Join(...).
that's WAAAY better :)
```go
pr, ok := p.peers[n.Address()]
if !ok {
	// Why are we receiving an update from a node that never
	// joined?
```
Could it be that the process has restarted and receives an out-of-band notification?
force-pushed from 7657bc7 to 36d80ab
@simonpasquier I think I've addressed all your comments, let me know what you think

Going to merge this so we can get another rc out

@stuartnelson3 Let me know if you want me to follow up with #1363 and #1364.

As soon as I feel confident about #1389 let's push this out. That's the only thing (in my mind) blocking 0.15.0. There were issues with message queueing (we were generating more messages than could be gossiped) that I need to resolve; I'll be working on it tomorrow.

EDIT: Thanks for your patience, I know this has been a really long release cycle.
add reconnection support for dead peers

todo:
- decide values for `DefaultReconnectInterval` and `DefaultReconnectTimeout`

edit:
I also included the `logWriter{}` wrapper I was using to expose memberlist logging. It's very verbose, and doesn't really conform to how we've been logging, so I'm not sure how best to expose it (or if I should just remove it from this PR).