election: make sure old elec will resign #1214
/run-all-tests MySQL for the unit test didn't come up
    e.l.Info("will try resign leader")
    timeoutCtx, cancel := context.WithTimeout(context.Background(), time.Duration(e.sessionTTL)*time.Second)
    if err := elec.Resign(timeoutCtx); err != nil {
        e.l.Warn("fail to resign leader", zap.Stringer("current member", e.info), zap.Error(err))
    } else {
        e.l.Info("finish resign leader")
How about using the debug level? Otherwise the master will log this every TTL.
    campaignWg.Wait()
    cancel2()
Can we move it in front of L309 now?
I guess we can't. In a multi-leader election, a non-leader's elec.Campaign blocks, so it can't call campaignWg.Done(). If we move campaignWg.Wait() before the two if blocks above, the behaviour changes to "a failed campaign won't restart a new one". I'm not sure that is expected.
So the issue is that ctx is done, and then resign fails to remove the key?
I'm not sure what happened in the original issue; I only found that there's an orphan election key and
Should we merge this PR before 2.0 GA, or wait until we have located the problem more precisely, to avoid introducing more bugs?
According to the logs in https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/dm_ghpr_test/detail/dm_ghpr_test/8507/pipeline/77/, there should be a txn that deletes the old election KV beforehand, but for the buggy master that can't resign, there's no deleteOp.
can't reproduce 🤔
What problem does this PR solve?
close #1212
What is changed and how it works?
If the old election didn't resign, a new election created with the same session and key prefix will always read that orphan "election key" via Observe, and can't clean it up using Resign until the TTL times out (currently 60s).
Check List
Tests
Code changes
Side effects
Related changes
Fix a bug where the DM-master leader sometimes can't be evicted for a short time
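The orphan-key situation described under "What is changed and how it works?" can be modelled with a toy in-memory store. This is only a sketch under stated assumptions: the `store` and `election` types, the `/demo/leader` prefix, and the `lease-1` name are all hypothetical, and the model is not etcd. It assumes that an election handle can only resign a key it wrote itself, while Observe sees any key under the prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// store is a toy in-memory key-value store; it is NOT etcd.
type store map[string]string

// election is a simplified, hypothetical model of an election handle.
type election struct {
	kv        store
	prefix    string
	lease     string
	leaderKey string // set only when this handle itself campaigned
}

func newElection(kv store, prefix, lease string) *election {
	return &election{kv: kv, prefix: prefix, lease: lease}
}

// Campaign writes this handle's key and remembers it.
func (e *election) Campaign(val string) {
	e.leaderKey = e.prefix + "/" + e.lease
	e.kv[e.leaderKey] = val
}

// Observe returns the value of a key under the prefix, regardless of which
// handle wrote it.
func (e *election) Observe() (string, bool) {
	for k, v := range e.kv {
		if strings.HasPrefix(k, e.prefix+"/") {
			return v, true
		}
	}
	return "", false
}

// Resign deletes only the key this handle remembers; it is a no-op for a
// handle that never campaigned (an assumption of this model).
func (e *election) Resign() {
	if e.leaderKey == "" {
		return
	}
	delete(e.kv, e.leaderKey)
	e.leaderKey = ""
}

func main() {
	kv := store{}
	old := newElection(kv, "/demo/leader", "lease-1")
	old.Campaign("master-1") // the old election wins and is then abandoned

	fresh := newElection(kv, "/demo/leader", "lease-1")
	fresh.Resign() // no-op: the fresh handle does not own any key
	_, orphan := fresh.Observe()
	fmt.Println("orphan key still visible:", orphan)

	old.Resign() // resigning through the old handle removes the key
	_, orphan = fresh.Observe()
	fmt.Println("orphan key visible after old resigns:", orphan)
}
```

In this model the only ways to remove the key are resigning through the old handle or waiting for the lease to expire, which matches the motivation for making sure the old election resigns.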