Increase raft notify buffer. #6863
Codecov Report
@@ Coverage Diff @@
## master #6863 +/- ##
==========================================
+ Coverage 65.77% 65.81% +0.04%
==========================================
Files 435 435
Lines 52405 52405
==========================================
+ Hits 34470 34492 +22
+ Misses 13798 13779 -19
+ Partials 4137 4134 -3
  // Set up a channel for reliable leader notifications.
- raftNotifyCh := make(chan bool, 1)
+ raftNotifyCh := make(chan bool, 1000)
🤔 I wonder how to reason about how much buffering "is enough". This is certainly better than just 1 element, but what is the bound that avoids deadlock entirely? I.e. if you have 1000 raft transactions per second and very fast leader flapping, can it still deadlock?
Based on the analysis in the ticket, this limits how many times we can lose leadership and gain it again before we are at risk of deadlock. Both events send to this chan, although the send on gaining leadership doesn't block raft.
So to avoid the same deadlock we would need to allow as many leadership changes as can take place in the time raft is unable to service status requests from autopilot. In theory that is unbounded: if the server is under heavy CPU load and can't schedule the autopilot or raft goroutines often enough, that causes flappy leadership and also makes it hard to reason about how long such a situation could continue.
That said, I think even increasing this a little bit is probably enough to significantly reduce the chance of this deadlock, while the real fix would be to allow raft status reading to time out and/or not block on the raft loop at all, as in hashicorp/raft#356.
So how about making this 10 for now and updating the comment with a link to this PR?
Yes, I think it still can, because the buffer could still fill up and block. I think we could merge this PR and then add another one for aggressively reading that chan. Or I can add it here.
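One possible shape for "aggressively reading that chan" is a small forwarding goroutine that always receives immediately and keeps only the most recent value for the slower consumer. This is a sketch under assumptions, not the change proposed in this PR: `forwardLatest` and `latestAfter` are hypothetical names, and collapsing notifications assumes the consumer only needs the latest leadership state rather than every transition.

```go
package main

import "fmt"

// forwardLatest drains in as soon as values arrive, so the sender never waits
// on a slow consumer, and exposes only the most recent value on the returned
// channel. Hypothetical helper for illustration only.
func forwardLatest(in <-chan bool) <-chan bool {
	out := make(chan bool, 1)
	go func() {
		defer close(out)
		for v := range in {
			// Discard a stale value the consumer has not read yet.
			select {
			case <-out:
			default:
			}
			// out is empty here and this goroutine is the only writer,
			// so this send cannot block.
			out <- v
		}
	}()
	return out
}

// latestAfter pushes all values through forwardLatest and returns the last
// value a consumer observes once the input is closed.
func latestAfter(values []bool) bool {
	in := make(chan bool)
	out := forwardLatest(in)
	for _, v := range values {
		in <- v
	}
	close(in)
	var last bool
	for v := range out {
		last = v
	}
	return last
}

func main() {
	fmt.Println(latestAfter([]bool{true, false, true, false}))
}
```

The trade-off is that intermediate gain/lose pairs can be dropped, so this only fits if the consumer reacts to the current state and does not need to see each flap.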
banks left a comment
Sorry I proposed this last year and never submitted :(
Sounds good @banks! I will make the changes and will also head over to the stats issues and propose a solution.
Fixes #6852.
Increasing the buffer helps recover from leader flapping. It lowers
the chance that a flapping leader gets into a deadlock like the one
described in #6852.