More logging for raft/processInternalRaftRequest #2389
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2389      +/-   ##
==========================================
- Coverage   60.37%   60.36%   -0.02%
==========================================
  Files         128      128
  Lines       26260    26275      +15
==========================================
+ Hits        15855    15860       +5
- Misses       9010     9019       +9
- Partials     1395     1396       +1
```
aaronlehmann
left a comment
What is swarmkit#9393? Is there a problem you're trying to debug?
manager/state/raft/raft.go (outdated)

```go
case x, ok := <-ch:
	if !ok {
		return nil, ErrLostLeadership
	if err, ok := x.(error); ok {
```
If ok is false it means the channel was closed and nothing was received, so trying to use x is wrong. Also, errors are never sent over this channel.
> If ok is false it means the channel was closed and nothing was received, so trying to use x is wrong. Also, errors are never sent over this channel.
Thanks for pointing this out!
I guess I need to read the code again. The reason I'm doing this is that I think assuming ErrLostLeadership here may not be accurate (please correct me if I'm wrong). So, I'm trying to see if a more meaningful error can be returned.
I believe it is accurate. That's the only thing that causes proposals to get canceled.
This is for investigating a case where a proposal fails with ErrLostLeadership but the leader does not actually lose leadership. Unfortunately, it's not reproducible any more.
Updated description.
manager/state/raft/raft.go (outdated)

```go
	// cancelAll, or by its own check of signalledLeadership.
	n.wait.cancelAll()
} else if !wasLeader && rd.SoftState.RaftState == raft.StateLeader {
	log.G(ctx).Infof("Manager is now a leader.", n.opts.ID)
```
This message doesn't contain a format specifier for the n.opts.ID argument.
manager/state/raft/raft.go (outdated)

```go
// Wait notification channel was closed. This should only happen if the wait was cancelled.
log.G(ctx).Errorf("Wait cancelled, likely because node %x lost leader position. Wait channel closed with nothing to read.", n.opts.ID)
if atomic.LoadUint32(&n.signalledLeadership) == 1 {
	log.G(ctx).Errorf("Wait cancelled but node %x is still a leader.", n.opts.ID)
```
@anshulpundir based on our discussion, let's update these messages to remove "wait cancelled". This will allow us to distinguish this case from the one below.
Changed to lowercase. The log above (line 1713) can help us differentiate this from the case below.
```go
// ensures that if a new request is registered during
// this transition, it will either be cancelled by
// cancelAll, or by its own check of signalledLeadership.
n.wait.cancelAll()
```
@anshulpundir do we need to put a log message just before this call to cancelAll?
See log on line 615. Lemme know if you think we need another log.
My bad, we don't. Please ignore this comment.
Please sign your commits following these rules:

```shell
$ git clone -b "log" git@github.com:anshulpundir/swarmkit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842354263464
# editor opens
# change each 'pick' to 'edit'
# save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue   # and repeat the amend for each commit
$ git push -f
```

Amending updates the existing PR. You DO NOT need to open a new one.
manager/state/raft/raft.go (outdated)

```go
// Wait notification channel was closed. This should only happen if the wait was cancelled.
log.G(ctx).Errorf("wait cancelled, likely because node %x lost leader position. Wait channel closed with nothing to read.", n.opts.ID)
if atomic.LoadUint32(&n.signalledLeadership) == 1 {
	log.G(ctx).Errorf("Wait cancelled but node %x is still a leader.", n.opts.ID)
```
I don't mean to pontificate on this too much, but can we make this message textually different from the one below, instead of just using lower and upper case as the difference? By convention, all messages are lower case.
cc @aaronlehmann if you have ideas.
The sadness of not including file names/line numbers in log messages :(
BTW, if your concern is to differentiate the two cases, it is still possible because there's another log before this one. @nishanttotla
```go
// ensures that if a new request is registered during
// this transition, it will either be cancelled by
// cancelAll, or by its own check of signalledLeadership.
n.wait.cancelAll()
```
See log on line 615. Lemme know if you think we need another log.
manager/state/raft/raft.go (outdated)

```go
if rd.SoftState != nil {
	if wasLeader && rd.SoftState.RaftState != raft.StateLeader {
		wasLeader = false
		log.G(ctx).Infof("soft state changed for node %x. Manager no longer a leader. Cancelling all waits.", n.opts.ID)
```
```go
// position and cancelling the transaction. This entry still needs
// to be commited since other nodes have already created a new
// transaction to commit the data.
```
Removing this since the only way we can get here is if the wait item was removed by calling cancelAll(). Please let me know if you think otherwise @aaronlehmann thx!
nishanttotla
left a comment
LGTM
I'm not sure this is the right code path to instrument. If the problem occurs on joining a manager node and promoting a worker, I think you are seeing the

To the question of whether

We only care about instances outside

There is one call to

There are some calls to

So I really think that the logic in
@aaronlehmann on a related note, in
I'm pretty sure it's not the path you just pointed out, because the error code is different.
Agreed
I also think the logic is correct, there are maybe some redundancies, e.g. the call to
I'll try to address the redundancy. The reason I put the bit about the channel being closed was to differentiate that select case (since there are no file names/line numbers in the logs, which is a pain btw). Since the logs are primarily used by engineers for debugging, I think it should be OK to expose whatever needs to be exposed for debugging.
It's doing that to remove the wait entry, now that the function will no longer be waiting for the entry. It's probably not necessary in the
That's true for debug-level logs, not for other log levels though.
I'll see if we can run with debug on in test runs and change ones that expose internal details to debug level.
ddbda82 to 1602550
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
```go
// If we can read from the channel, wait item was triggered. Otherwise it was cancelled.
x, ok := <-ch
if !ok {
	log.G(ctx).WithError(waitCtx.Err()).Errorf("wait context cancelled, likeyly because node %x lost leader position", n.opts.ID)
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"likeyly" is a misspelling.
```go
if !ok {
	log.G(ctx).WithError(waitCtx.Err()).Errorf("wait context cancelled, likeyly because node %x lost leader position", n.opts.ID)
	if atomic.LoadUint32(&n.signalledLeadership) == 1 {
		log.G(ctx).Errorf("wait context cancelled but node %x is still a leader", n.opts.ID)
```
This message may appear at shutdown, because that's when the context gets cancelled.
Thanks! Will adjust the comment.
On a related note, we don't wait for all transactions to complete during shutdown?
No, transactions can take an arbitrarily long time to reach consensus.
```go
	}

	if !n.wait.trigger(r.ID, r) {
		log.G(ctx).Errorf("wait not found for raft id %x", r.ID)
```
"proposal id"?
My bad, I'll fix this.
anshulpundir
left a comment
Thanks again for reviewing! @aaronlehmann
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
Adding a new manager node or promoting a worker to a manager fails with "XXX: node lost leader status". After the failure, leadership does not actually change. My hypothesis is that raft proposals may fail for reasons other than a genuine loss of leadership, yet still surface as ErrLostLeadership.