
Conversation

@anshulpundir
Contributor

@anshulpundir anshulpundir commented Sep 26, 2017

Adding a new manager node or promoting a worker to a manager fails with "XXX: node lost leader status". After the failure, the leadership does not actually change. My hypothesis here is that the reason for raft proposals failing may not just be ErrLostLeadership.

@codecov

codecov bot commented Sep 26, 2017

Codecov Report

Merging #2389 into master will decrease coverage by 0.01%.
The diff coverage is 70%.

@@            Coverage Diff             @@
##           master    #2389      +/-   ##
==========================================
- Coverage   60.37%   60.36%   -0.02%     
==========================================
  Files         128      128              
  Lines       26260    26275      +15     
==========================================
+ Hits        15855    15860       +5     
- Misses       9010     9019       +9     
- Partials     1395     1396       +1

Collaborator

@aaronlehmann aaronlehmann left a comment

What is swarmkit#9393? Is there a problem you're trying to debug?

case x, ok := <-ch:
if !ok {
return nil, ErrLostLeadership
if err, ok := x.(error); ok {
Collaborator

If ok is false it means the channel was closed and nothing was received, so trying to use x is wrong. Also, errors are never sent over this channel.
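For reference, a minimal standalone Go sketch of the closed-channel behavior described here (toy channel, not the swarmkit code):

package main

import "fmt"

func main() {
    ch := make(chan interface{})
    close(ch)
    // Receiving from a closed, drained channel yields the zero value and ok == false,
    // so x must not be treated as a real value (e.g. via x.(error)) in that branch.
    x, ok := <-ch
    if !ok {
        fmt.Println("channel closed, x is the zero value:", x)
    }
}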

Contributor Author

@anshulpundir anshulpundir Sep 26, 2017

If ok is false it means the channel was closed and nothing was received, so trying to use x is wrong. Also, errors are never sent over this channel.

Thanks for pointing this out!
I guess I need to read the code again. The reason I'm doing this is that I think assuming ErrLostLeadership here may not be accurate (please correct me if I'm wrong). So, I'm trying to see if a more meaningful error can be returned.

Collaborator

I believe it is accurate. That's the only thing that causes proposals to get canceled.

Contributor Author

This is for investigating a case where a proposal fails with ErrLostLeadership but the leader does not actually lose leadership. Unfortunately, it's not reproducible anymore.

@anshulpundir anshulpundir changed the title Cleaning out processInternalRaftRequest() error logic. Added logs and errors for swarmkit#9393. Cleaning out processInternalRaftRequest() error logic. Added logs and errors for orca#9393. Sep 26, 2017
@anshulpundir
Contributor Author

anshulpundir commented Sep 26, 2017

What is swarmkit#9393? Is there a problem you're trying to debug?

Updated description.

@moby moby deleted a comment from GordonTheTurtle Sep 26, 2017
@anshulpundir anshulpundir changed the title Cleaning out processInternalRaftRequest() error logic. Added logs and errors for orca#9393. Cleaning out processInternalRaftRequest() error logic. Added logs and errors. Sep 26, 2017
@anshulpundir anshulpundir changed the title Cleaning out processInternalRaftRequest() error logic. Added logs and errors. More logging for raft/processInternalRaftRequest Sep 26, 2017
    // cancelAll, or by its own check of signalledLeadership.
    n.wait.cancelAll()
} else if !wasLeader && rd.SoftState.RaftState == raft.StateLeader {
    log.G(ctx).Infof("Manager is now a leader.", n.opts.ID)
Contributor

This message doesn't contain a format specifier for the n.opts.ID argument.
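A plausible fix, sketched on the assumption that the intent is to log the node ID (final wording up to the author):

// Use the %x verb so the n.opts.ID argument is actually formatted into the message.
log.G(ctx).Infof("node %x is now a leader", n.opts.ID)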

// Wait notification channel was closed. This should only happen if the wait was cancelled.
log.G(ctx).Errorf("Wait cancelled, likely because node %x lost leader position. Wait channel closed with nothing to read.", n.opts.ID)
if atomic.LoadUint32(&n.signalledLeadership) == 1 {
    log.G(ctx).Errorf("Wait cancelled but node %x is still a leader.", n.opts.ID)
Contributor

@nishanttotla nishanttotla Sep 26, 2017

@anshulpundir based on our discussion, let's update these messages to remove "wait cancelled". This will allow us to distinguish this case from the one below.

Contributor Author

Changed to lowercase. The log above (line 1713) can help us differentiate this from the case below.

// ensures that if a new request is registered during
// this transition, it will either be cancelled by
// cancelAll, or by its own check of signalledLeadership.
n.wait.cancelAll()
Contributor

@anshulpundir do we need to put a log message just before this call to cancelAll?

Contributor Author

See log on line 615. Lemme know if you think we need another log.

Contributor

My bad, we don't. Please ignore this comment.

@GordonTheTurtle

Please sign your commits following these rules:
https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "log" git@github.com:anshulpundir/swarmkit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842354263464
editor opens
change each 'pick' to 'edit'
save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

// Wait notification channel was closed. This should only happen if the wait was cancelled.
log.G(ctx).Errorf("wait cancelled, likely because node %x lost leader position. Wait channel closed with nothing to read.", n.opts.ID)
if atomic.LoadUint32(&n.signalledLeadership) == 1 {
    log.G(ctx).Errorf("Wait cancelled but node %x is still a leader.", n.opts.ID)
Contributor

I don't mean to pontificate on this too much, but can we make this message textually different from the one below, instead of just using lower and upper case as the difference? By convention, all messages are lower case.

cc @aaronlehmann if you have ideas.
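One illustrative direction, assuming the two cases stay in this order (the wording below is only a sketch, not a proposed final message):

// First case: distinct wording, all lowercase.
log.G(ctx).Errorf("wait channel closed with nothing to read, node %x may have lost leader position", n.opts.ID)
// Second case: textually different from the first, also lowercase.
log.G(ctx).Errorf("wait cancelled while node %x still signals leadership", n.opts.ID)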

Contributor Author

The sadness of not including file names/line numbers in log messages :(

Contributor Author

BTW, if your concern is to differentiate the two cases, it is still possible because there's another log before this one. @nishanttotla


if rd.SoftState != nil {
    if wasLeader && rd.SoftState.RaftState != raft.StateLeader {
        wasLeader = false
        log.G(ctx).Infof("soft state changed for node %x. Manager no longer a leader. Cancelling all waits.", n.opts.ID)
Contributor Author

// position and cancelling the transaction. This entry still needs
// to be commited since other nodes have already created a new
// transaction to commit the data.

Contributor Author

Removing this since the only way we can get here is if the wait item was removed by calling cancelAll(). Please let me know if you think otherwise @aaronlehmann thx!

Contributor

@nishanttotla nishanttotla left a comment

LGTM

@aaronlehmann
Collaborator

I'm not sure this is the right code path to instrument. If the problem occurs on joining a manager node and promoting a worker, I think you are seeing the ErrLostLeadership returned here: https://github.com/anshulpundir/swarmkit/blob/2780e221f84d2b9a8b10c4c8040bfd94dd276995/manager/state/raft/raft.go#L922-L924

To the question of whether ErrLostLeadership is the right one to return from processInternalRaftRequest, it's fairly straightforward to look at all the code paths where these waits are cancelled. Just search for n.wait.cancel and n.wait.cancelAll.

We only care about instances outside processInternalRaftRequest, because ones inside the function can't execute concurrently with that select statement.

There is one call to cancelAll in Run, in the situation where the node does lose leadership.

There are some calls to cancel in configure, but these are cancelling a special wait registered there, and it will never be one that processInternalRaftRequest is waiting for.

So I really think that the logic in processInternalRaftRequest is correct and we don't need this extra logging. Also, some of the new logs are redundant with leadership-related logs produced by etcd/raft, and some of them expose internal details that I don't think should go in user-facing logs (for example, stuff about a channel being closed).
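For readers following along, a simplified, self-contained sketch of the register/trigger/cancelAll pattern described above (waitMap, register, trigger, and cancelAll are toy names for illustration, not the actual swarmkit wait implementation):

package main

import (
    "fmt"
    "sync"
)

// waitMap is a toy stand-in for the wait list: each pending proposal registers a
// channel under its ID; trigger delivers the result, cancelAll closes every channel.
type waitMap struct {
    mu    sync.Mutex
    waits map[uint64]chan interface{}
}

func newWaitMap() *waitMap {
    return &waitMap{waits: make(map[uint64]chan interface{})}
}

func (w *waitMap) register(id uint64) chan interface{} {
    w.mu.Lock()
    defer w.mu.Unlock()
    ch := make(chan interface{}, 1)
    w.waits[id] = ch
    return ch
}

// trigger delivers a result and reports whether a wait was registered for id.
func (w *waitMap) trigger(id uint64, x interface{}) bool {
    w.mu.Lock()
    defer w.mu.Unlock()
    ch, ok := w.waits[id]
    if !ok {
        return false
    }
    delete(w.waits, id)
    ch <- x
    return true
}

// cancelAll closes every registered channel; waiters observe ok == false on receive.
func (w *waitMap) cancelAll() {
    w.mu.Lock()
    defer w.mu.Unlock()
    for id, ch := range w.waits {
        delete(w.waits, id)
        close(ch)
    }
}

func main() {
    w := newWaitMap()
    ch := w.register(1)
    w.cancelAll() // e.g. leadership was lost
    if _, ok := <-ch; !ok {
        fmt.Println("wait cancelled: the caller would return ErrLostLeadership")
    }
}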

@nishanttotla
Contributor

@aaronlehmann on a related note, in processInternalRaftRequest, in the select loop for the case of waitCtx.Done(), why are we calling wait.cancel again?

@anshulpundir
Contributor Author

anshulpundir commented Sep 28, 2017

I'm not sure this is the right code path to instrument. If the problem occurs on joining a manager node and promoting a worker, I think you are seeing the ErrLostLeadership returned here: https://github.com/anshulpundir/swarmkit/blob/2780e221f84d2b9a8b10c4c8040bfd94dd276995/manager/state/raft/raft.go#L922-L924

I'm pretty sure it's not the path you just pointed out, because the error code is different.

To the question of whether ErrLostLeadership is the right one to return from processInternalRaftRequest, it's fairly straightforward to look at all the code paths where these waits are cancelled. Just search for n.wait.cancel and n.wait.cancelAll.
We only care about instances outside processInternalRaftRequest, because ones inside the function can't execute concurrently with that select statement.
There is one call to cancelAll in Run, in the situation where the node does lose leadership.
There are some calls to cancel in configure, but these are cancelling a special wait registered there, and it will never be one that processInternalRaftRequest is waiting for.

Agreed

So I really think that the logic in processInternalRaftRequest is correct and we don't need this extra logging.

I also think the logic is correct; there are maybe some redundancies, e.g. the call to
n.wait.cancel(r.ID) when the context might already be cancelled. I still suspect it is this code path that is returning ErrLostLeadership.
I don't entirely agree about not needing extra logging. I think it's better to be verbose here rather than not, since this code path is not commonly executed anyway, and more logs give us better debuggability.

Also, some of the new logs are redundant with leadership-related logs produced by etcd/raft, and some of them expose internal details that I don't think should go in user-facing logs (for example, stuff about a channel being closed).

I'll try to address the redundancy. The reason I put the bit about the channel being closed was to differentiate that select case (since there are no file names/line numbers in the logs, which is a pain btw). Since the logs are primarily used by engineers for debugging, I think it should be ok to expose whatever needs to be exposed for debugging.

@aaronlehmann
Collaborator

on a related note, in processInternalRaftRequest, in the select loop for the case of waitCtx.Done(), why are we calling wait.cancel again?

It's doing that to remove the wait entry, now that the function will no longer be waiting for the entry. It's probably not necessary in the waitCtx.Done case, but it doesn't hurt either, and having it there avoids making assumptions about what other code will do.
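A simplified sketch of that select, under the assumption that waits are plain result channels (not the exact swarmkit code):

select {
case x, ok := <-ch:
    if !ok {
        // The wait entry was cancelled (e.g. by cancelAll on leadership loss).
        return nil, ErrLostLeadership
    }
    return x, nil
case <-waitCtx.Done():
    // Defensive cleanup: remove our own wait entry now that we stop waiting for it.
    // If cancelAll already removed it, this is a no-op.
    n.wait.cancel(r.ID)
    return nil, ErrLostLeadership
}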

Since the logs are primarily used by engineers for debugging, I think it should be ok to expose whatever needs to be exposed for debugging.

That's true for debug-level logs, not for other log levels though.

@anshulpundir
Contributor Author

That's true for debug-level logs, not for other log levels though.

I'll see if we can run with debug logging on in test runs, and change the logs that expose internal details to debug level.

@anshulpundir anshulpundir force-pushed the log branch 2 times, most recently from ddbda82 to 1602550 Compare September 29, 2017 01:08
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
// If we can read from the channel, wait item was triggered. Otherwise it was cancelled.
x, ok := <-ch
if !ok {
    log.G(ctx).WithError(waitCtx.Err()).Errorf("wait context cancelled, likeyly because node %x lost leader position", n.opts.ID)
Collaborator

"likeyly" is a misspelling.

if !ok {
    log.G(ctx).WithError(waitCtx.Err()).Errorf("wait context cancelled, likeyly because node %x lost leader position", n.opts.ID)
    if atomic.LoadUint32(&n.signalledLeadership) == 1 {
        log.G(ctx).Errorf("wait context cancelled but node %x is still a leader", n.opts.ID)
Collaborator

This message may appear at shutdown, because that's when the context gets cancelled.
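One hypothetical way to keep this log quiet on shutdown, assuming waitCtx is derived from the node's run context (sketch only, not what the PR does):

// If the parent context is already done, the cancellation came from shutdown,
// not from a leadership change, so skip the leadership-related error log.
if ctx.Err() != nil {
    return nil, ctx.Err()
}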

Contributor Author

Thanks! Will adjust the comment.

On a related note, we don't wait for all transactions to complete during shutdown?

Collaborator

No, transactions can take an arbitrarily long time to reach consensus.

}

if !n.wait.trigger(r.ID, r) {
    log.G(ctx).Errorf("wait not found for raft id %x", r.ID)
Collaborator

"proposal id"?

Contributor Author

My bad, I'll fix this.

Contributor Author

@anshulpundir anshulpundir left a comment

Thanks again for reviewing! @aaronlehmann


denverdino pushed a commit to AliyunContainerService/swarmkit that referenced this pull request Nov 5, 2017
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>