server: use leader lease to determine tso service validity #1676
Merged
Signed-off-by: disksing <i@disksing.com>
Signed-off-by: disksing <i@disksing.com>
Codecov Report
| Coverage Diff | master | #1676 | +/- |
|---|---|---|---|
| Coverage | 76.3% | 76.26% | -0.04% |
| Files | 157 | 158 | +1 |
| Lines | 15347 | 15402 | +55 |
| Hits | 11710 | 11747 | +37 |
| Misses | 2644 | 2654 | +10 |
| Partials | 993 | 1001 | +8 |
shafreeck
reviewed
Aug 15, 2019
rleungx
reviewed
Aug 15, 2019
    if cost := time.Since(start); cost > slowRequestTime {
        log.Warn("lease grants too slow", zap.Duration("cost", cost))
    }
    if err != nil {
Member
How about returning this error before line 51?
    func (l *LeaderLease) KeepAlive(ctx context.Context) {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel()
        timeCh := l.keepAliveWorker(ctx, l.leaseTimeout/3)
Contributor
Author
Arbitrary value borrowed from etcd's keep alive code.
nolouch
reviewed
Aug 15, 2019
    // Close releases the lease.
    func (l *LeaderLease) Close() error {
        return l.lease.Close()
Contributor
Maybe we need to Revoke the lease before Close? Actually, Close does not try to release the lease.
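A rough sketch of the revoke-then-close ordering suggested here. `leaseAPI` is a hypothetical, trimmed-down interface (the real etcd client's signatures differ), and `fakeLease` exists only to make the example runnable. This is not the code that was merged.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// leaseAPI is a hypothetical, simplified view of a lease client.
type leaseAPI interface {
	Revoke(ctx context.Context, id int64) error
	Close() error
}

// leaderLease is a stand-in for the PR's LeaderLease type.
type leaderLease struct {
	id    int64
	lease leaseAPI
}

// Close revokes the lease explicitly and then releases client resources.
// Close is always attempted, even if Revoke fails.
func (l *leaderLease) Close() error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	revokeErr := l.lease.Revoke(ctx, l.id)
	closeErr := l.lease.Close()
	if revokeErr != nil {
		return revokeErr
	}
	return closeErr
}

// fakeLease records the order of calls so the example runs without etcd.
type fakeLease struct {
	calls []string
}

func (f *fakeLease) Revoke(ctx context.Context, id int64) error {
	f.calls = append(f.calls, "revoke")
	return nil
}

func (f *fakeLease) Close() error {
	f.calls = append(f.calls, "close")
	return nil
}

func main() {
	fake := &fakeLease{}
	l := &leaderLease{id: 1, lease: fake}
	if err := l.Close(); err != nil {
		fmt.Println("close failed:", err)
	}
	fmt.Println(fake.calls)
}
```

Revoking explicitly means the lease does not have to run out its remaining TTL before it is gone on the server side.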
Signed-off-by: disksing <i@disksing.com>
Signed-off-by: disksing <i@disksing.com>
Signed-off-by: disksing <i@disksing.com>
Contributor
Author
/test
rleungx
approved these changes
Aug 19, 2019
Contributor
Author
/build
Contributor
Author
Created an issue for the test failure: #1693
Luffbee
pushed a commit
that referenced
this pull request
Aug 27, 2019
* *: unify get store function everywhere (#1671) Signed-off-by: Ryan Leung <rleungx@gmail.com> * server: use leader lease to determine tso service validity (#1676) Signed-off-by: disksing <i@disksing.com> * test: fix tests (#1696) * test: fix region syncer test Signed-off-by: disksing <i@disksing.com> * add config-check flag for pd-server (#1695) Signed-off-by: cwen0 <cwenyin0@gmail.com> * operator: rewrite move region related functions (#1667) * *: support setting endKey for ScanRange (#1700) Signed-off-by: disksing <i@disksing.com> * *: reduce some unnecessary parameters (#1698) Signed-off-by: Ryan Leung <rleungx@gmail.com> * schedule: Do not send an operator of a region wth a stale epoch (#1659) * schedule: Do not send an operator of a region wth a stale epoch Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * schedule: check the version changed by the operator self Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * schedule: fix unit test Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * schedule: fix to avoid dispatching a stale opstep Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * dispatch: refactor "ConsumeConfVer() int" to "ExpectConfVerChange() bool" Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * dispatch: fix typo in comment Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * fix typo Co-Authored-By: Ryan Leung <rleungx@gmail.com> * dispatch: fix unittest Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * dispatch: refine format Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * server: fix the dead lock in scatter region (#1706) Signed-off-by: Ryan Leung <rleungx@gmail.com>
Luffbee
added a commit
that referenced
this pull request
Sep 9, 2019
* *: unify get store function everywhere (#1671) Signed-off-by: Ryan Leung <rleungx@gmail.com> * remove unnecessary parentheses * server: use leader lease to determine tso service validity (#1676) Signed-off-by: disksing <i@disksing.com> * change internal stat values to float64 * add pending operator influence * add metrics of pending influence * fix metrics * fix panic * adjust pending influence of balanceHotWrite * change weight of pending influence * test: fix tests (#1696) * test: fix region syncer test Signed-off-by: disksing <i@disksing.com> * decrease region rolling window; store pending influence in scheduler * add config-check flag for pd-server (#1695) Signed-off-by: cwen0 <cwenyin0@gmail.com> * decrease possiblility transfer hot write leader * change pending influence weight * add unstarted op metrics * add logs for debug * add log for debug * add logs for debug * add logs for debug * add logs for debug * add logs for debug * add logs for debug * add logs for debug * Revert "add logs for debug" This reverts commit e74c7a9. 
* add metrics for hotspot operators * operator: rewrite move region related functions (#1667) * add metrics for pending operators * *: support setting endKey for ScanRange (#1700) Signed-off-by: disksing <i@disksing.com> * fix bug * fix bug * fix bug * fix metrics thread-safe bug * fix logic bug * *: reduce some unnecessary parameters (#1698) Signed-off-by: Ryan Leung <rleungx@gmail.com> * schedule: Do not send an operator of a region wth a stale epoch (#1659) * schedule: Do not send an operator of a region wth a stale epoch Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * schedule: check the version changed by the operator self Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * schedule: fix unit test Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * schedule: fix to avoid dispatching a stale opstep Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * dispatch: refactor "ConsumeConfVer() int" to "ExpectConfVerChange() bool" Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * dispatch: fix typo in comment Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * fix typo Co-Authored-By: Ryan Leung <rleungx@gmail.com> * dispatch: fix unittest Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * dispatch: refine format Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * server: fix the dead lock in scatter region (#1706) Signed-off-by: Ryan Leung <rleungx@gmail.com> * add drop time for operator * use IsDropped to recognize canceled ops * try to fix trans leader burst * try to fix trans leader burst * add zombie influence * change select src dst strategy; improve op_controller * change select src strategy * fix bug * tools: fix set namespace in pd-ctl (#1701) Signed-off-by: Ryan Leung <rleungx@gmail.com> * tools: fix parse url without http prefix (#1703) Signed-off-by: Ryan Leung <rleungx@gmail.com> * tests: support deadlock detection in make test (#1704) Signed-off-by: Ryan Leung <rleungx@gmail.com> * Makefile: fix failpoint enable (#1722) Signed-off-by: 
nolouch <nolouch@gmail.com> * checker: fix the issue that a region does not merge to the sibling with smaller size (#1723) Signed-off-by: disksing <i@disksing.com> * tools: balance region simulator (#1708) * scheduler: do not remove the operator when the step does not finish (#1715) Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * operator: fix the AddLearner config version judgment (#1732) Signed-off-by: nolouch <nolouch@gmail.com> * tools: fix TLS in pd control (#1729) Signed-off-by: Ryan Leung <rleungx@gmail.com> * syncer: support TLS for region syncer (#1728) Signed-off-by: Ryan Leung <rleungx@gmail.com> * schedule: fix a thread-safe bug and improve code (#1719)
Contributor
@disksing should we cherry-pick this to 3.0 and 3.1?
Luffbee
added a commit
that referenced
this pull request
Sep 11, 2019
* *: unify get store function everywhere (#1671) Signed-off-by: Ryan Leung <rleungx@gmail.com> * server: use leader lease to determine tso service validity (#1676) Signed-off-by: disksing <i@disksing.com> * test: fix tests (#1696) * test: fix region syncer test Signed-off-by: disksing <i@disksing.com> * add config-check flag for pd-server (#1695) Signed-off-by: cwen0 <cwenyin0@gmail.com> * operator: rewrite move region related functions (#1667) * *: support setting endKey for ScanRange (#1700) Signed-off-by: disksing <i@disksing.com> * *: reduce some unnecessary parameters (#1698) Signed-off-by: Ryan Leung <rleungx@gmail.com> * schedule: Do not send an operator of a region wth a stale epoch (#1659) * schedule: Do not send an operator of a region wth a stale epoch Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * schedule: check the version changed by the operator self Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * schedule: fix unit test Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * schedule: fix to avoid dispatching a stale opstep Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * dispatch: refactor "ConsumeConfVer() int" to "ExpectConfVerChange() bool" Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * dispatch: fix typo in comment Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * fix typo Co-Authored-By: Ryan Leung <rleungx@gmail.com> * dispatch: fix unittest Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * dispatch: refine format Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * server: fix the dead lock in scatter region (#1706) Signed-off-by: Ryan Leung <rleungx@gmail.com> * tools: fix set namespace in pd-ctl (#1701) Signed-off-by: Ryan Leung <rleungx@gmail.com> * tools: fix parse url without http prefix (#1703) Signed-off-by: Ryan Leung <rleungx@gmail.com> * tests: support deadlock detection in make test (#1704) Signed-off-by: Ryan Leung <rleungx@gmail.com> * Makefile: fix failpoint enable (#1722) Signed-off-by: nolouch 
<nolouch@gmail.com> * checker: fix the issue that a region does not merge to the sibling with smaller size (#1723) Signed-off-by: disksing <i@disksing.com> * tools: balance region simulator (#1708) * scheduler: do not remove the operator when the step does not finish (#1715) Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> * operator: fix the AddLearner config version judgment (#1732) Signed-off-by: nolouch <nolouch@gmail.com> * tools: fix TLS in pd control (#1729) Signed-off-by: Ryan Leung <rleungx@gmail.com> * syncer: support TLS for region syncer (#1728) Signed-off-by: Ryan Leung <rleungx@gmail.com> * schedule: fix a thread-safe bug and improve code (#1719) * statistics: fix region flow calculation (#1688) Signed-off-by: jiyingtk <jiyingtk@mail.ustc.edu.cn> * makefile: improve deadlock-enable/disable (#1736) * api: fix missing keys statistic in region information (#1741) Signed-off-by: nolouch <nolouch@gmail.com> * *: update go version to 1.13 (#1742) Signed-off-by: disksing <i@disksing.com> * coordinator: add the operator cost time in log field (#1748) Signed-off-by: nolouch <nolouch@gmail.com>
Contributor
Author
@nolouch I don't think so.
nolouch
added a commit
that referenced
this pull request
Feb 14, 2020
STATUS: It seems better to merge this after #1668. You can review it now and leave comments.
What problem does this PR solve?
Currently, the safety of timestamp allocation relies on the leader flag being updated in time when the lease times out.
There is a small possibility (for instance, if the process is paused or the runtime schedules slowly) that the watch channel does not notify in time, which may cause two PD servers to serve timestamps at the same time. Transactions may then become corrupted.
Another issue is that when a PD server becomes leader, it won't serve timestamps until the leader flag is set. But the leader flag is set after loading all regions from storage, which may take a considerably long time for a large cluster.
This PR fixes #1661 and part of #1658
What is changed and how it works?
Introduce `LeaderLease` to periodically renew lease and update expire time. Timestamp service only relies on the lease expire time.
Note that if 2 `time.Time`s both have monotonic part, comparison of them will be evaluated by monotonic time, which is not affected by wall clock changes.
Check List
Tests
Side effects
Related changes