Skip to content
This repository was archived by the owner on Nov 24, 2023. It is now read-only.

election: make sure old elec will resign#1214

Closed
lance6716 wants to merge 11 commits into
pingcap:masterfrom
lance6716:explict-resign
Closed

election: make sure old elec will resign#1214
lance6716 wants to merge 11 commits into
pingcap:masterfrom
lance6716:explict-resign

Conversation

@lance6716
Copy link
Copy Markdown
Collaborator

@lance6716 lance6716 commented Oct 23, 2020

What problem does this PR solve?

close #1212

What is changed and how it works?

if old election didn't resign, new election created with same session and key-prefix will always read that orphan "election key" by Observe, and can't clean it using Resign, unless TTL timeout (60s currently)

        elec := concurrency.NewElection(session, e.key)
	go func() {
		fmt.Println(elec.Campaign(ctx, e.key))
	}()
	ch := elec.Observe(ctx)
	info := <-ch
	fmt.Println("1", info)

	elec2 := concurrency.NewElection(session, e.key)
        // observe old election key
	ch2 := elec2.Observe(ctx)
	info3 := <-ch2
	fmt.Println("3", info3)
	fmt.Println(elec2.Resign(ctx))

	elec2 = concurrency.NewElection(session, e.key)
        // observe old election key
	ch2 = elec2.Observe(ctx)
	info3 = <-ch2
	fmt.Println("4", info3)
	fmt.Println(elec2.Resign(ctx))

Check List

Tests

  • Unit test

Code changes

Side effects

Related changes

  • Need to cherry-pick to the release branch
  • Need to update release note
    fix a bug that sometimes can't evict leader of DM-master for a short time

@lance6716 lance6716 added needs-cherry-pick-release-2.0 This PR should be cherry-picked to release-2.0. Remove this label after cherry-picked to release-2.0 needs-update-release-note This PR should be added into release notes. Remove this label once the release notes are updated status/WIP This PR is still work in progress type/bug-fix Bug fix labels Oct 23, 2020
@lance6716 lance6716 added status/PTAL This PR is ready for review. Add this label back after committing new changes and removed status/WIP This PR is still work in progress labels Oct 24, 2020
@lance6716
Copy link
Copy Markdown
Collaborator Author

/run-all-tests

MySQL for unit test doesn't up

@lance6716 lance6716 added status/WIP This PR is still work in progress and removed status/PTAL This PR is ready for review. Add this label back after committing new changes labels Oct 25, 2020
@lance6716 lance6716 added status/PTAL This PR is ready for review. Add this label back after committing new changes and removed status/WIP This PR is still work in progress labels Oct 25, 2020
Comment thread pkg/election/election.go Outdated
Comment on lines +364 to +369
e.l.Info("will try resign leader")
timeoutCtx, cancel := context.WithTimeout(context.Background(), time.Duration(e.sessionTTL)*time.Second)
if err := elec.Resign(timeoutCtx); err != nil {
e.l.Warn("fail to resign leader", zap.Stringer("current member", e.info), zap.Error(err))
} else {
e.l.Info("finish resign leader")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about use debug, otherwise master will log it every TTL

Comment thread pkg/election/election.go
Comment on lines +324 to +325
campaignWg.Wait()
cancel2()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move it in front of L309 now?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess can't. For multi-leader election, non-leader's elec.Campaign will block thus can't campaignWg.Done(). If we move campaignWg.Wait() before above two if, behaviour will change to "failed campaign won't restart a new one". Not sure this is expected

@GMHDBJD
Copy link
Copy Markdown
Collaborator

GMHDBJD commented Oct 26, 2020

So the issue is because ctx done, then resign fail to remove the key?

@lance6716
Copy link
Copy Markdown
Collaborator Author

So the issue is because ctx done, then resign fail to remove the key?

I'm not sure what happened in origin issue, only found there's an orphan election key and Resign didn't cover all branch. Maybe there're other cases cause this issue 😢

@lance6716
Copy link
Copy Markdown
Collaborator Author

Should we prepare to merge this PR before 2.0 GA, or wait better locating problem to prevent bring more bug?

@lance6716 lance6716 added the status/DNM Do not merge, test is failing or blocked by another PR label Oct 27, 2020
@lance6716
Copy link
Copy Markdown
Collaborator Author

according to logs in https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/dm_ghpr_test/detail/dm_ghpr_test/8507/pipeline/77/

I found there should be a txn deletes the old election KV

[2021-03-22T05:28:38.305Z] [2021/03/22 13:28:08.914 +08:00] [DEBUG] [interceptor.go:181] ["request stats"] [component="embed etcd"] ["start time"=2021/03/22 13:28:08.908 +08:00] ["time spent"=5.561244ms] [remote=127.0.0.1:58236] ["response type"=/etcdserverpb.KV/Txn] ["request count"=1] ["request size"=38] ["response count"=0] ["response size"=40] ["request content"="compare:<target:CREATE key:\"/dm-master/leader/12e07858679d7603\" create_revision:7 > success:<request_delete_range:<key:\"/dm-master/leader/12e07858679d7603\" > > failure:<>"]

before fail to campaign.

but for the buggy master that can't resign, there's no deleteOp

@lance6716
Copy link
Copy Markdown
Collaborator Author

can't reproduce 🤔

@lance6716 lance6716 closed this Apr 12, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

needs-cherry-pick-release-2.0 This PR should be cherry-picked to release-2.0. Remove this label after cherry-picked to release-2.0 needs-update-release-note This PR should be added into release notes. Remove this label once the release notes are updated status/DNM Do not merge, test is failing or blocked by another PR status/PTAL This PR is ready for review. Add this label back after committing new changes type/bug-fix Bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

evict leader can't take effect

2 participants