Simplify PD node status by baurine · Pull Request #474 · pingcap/tiup

baurine · 2020-06-09T10:11:59Z

What problem does this PR solve?

Resolve a part of #472

Currently, the "Unhealthy" PD node is recognized as "Down" status, and the "Unhealthy" status never displays.

What is changed and how it works?

Differ the "Down" and "Unhealthy" status for PD node, when all nodes don't work, recognize them as "Down", else recognize the not work node as "Unhealthy".

Check List

Tests

Manual test (add detailed scripts or steps below)

Deploy a cluster with 3 PD nodes, pause 1 node.

Before:

After:

Pause 2 PD nodes:

baurine · 2020-06-09T10:14:36Z

@lucklove @HunDunDM please help take a look as well, thanks!

codecov-commenter · 2020-06-09T10:28:04Z

Codecov Report

Merging #474 into master will decrease coverage by 0.88%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #474      +/-   ##
==========================================
- Coverage   39.84%   38.96%   -0.89%     
==========================================
  Files         201      205       +4     
  Lines       14847    15386     +539     
==========================================
+ Hits         5916     5995      +79     
- Misses       8072     8516     +444     
- Partials      859      875      +16

Flag	Coverage Δ
#coverage	`38.96% <ø> (-0.89%)`	⬇️

Impacted Files	Coverage Δ
...m/pingcap/tiup/components/playground/playground.go	`19.67% <0.00%> (-8.94%)`	⬇️
go/src/github.com/pingcap/tiup/cmd/mirror.go	`41.56% <0.00%> (-5.27%)`	⬇️
...c/github.com/pingcap/tiup/pkg/utils/http_client.go	`71.42% <0.00%> (-4.77%)`	⬇️
go/src/github.com/pingcap/tiup/cmd/root.go	`60.43% <0.00%> (-3.85%)`	⬇️
...ingcap/tiup/components/playground/instance/tidb.go	`86.95% <0.00%> (-3.75%)`	⬇️
...ingcap/tiup/components/playground/instance/tikv.go	`77.04% <0.00%> (-2.27%)`	⬇️
...om/pingcap/tiup/components/cluster/command/root.go	`43.03% <0.00%> (-2.24%)`	⬇️
...cap/tiup/components/playground/instance/tiflash.go	`62.79% <0.00%> (-0.99%)`	⬇️
...ingcap/tiup/components/cluster/command/scale_in.go	`10.34% <0.00%> (-0.77%)`	⬇️
...om/pingcap/tiup/components/cluster/command/stop.go	`21.21% <0.00%> (-0.67%)`	⬇️
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1cd64e4...3e9d3c2. Read the comment docs.

breezewish · 2020-06-09T15:29:08Z

From the user perspective do we have differences about unhealthy and down? @HunDunDM PTAL

AstroProfundis · 2020-06-09T18:24:07Z


-	healths, err := curPdAPI.GetHealth()
+	// "Down" means all PD nodes don't work,
+	// while "Unhealthy" means current PD node doesn't work


I'm confused by this, I thought "Down" means the specific PD node is not working, and "Unhealthy" means the PD cluster is not in proper state.

But according to the old logic, they will be all recognized as "Down" status, we never can go to the return "Unhealthy" + suffix line logic.

disksing · 2020-06-11T07:47:59Z

I don't think we have made a clear distinction between these two different terms.
IMO it is a good idea to distinguish between two states: single node down / majority nodes down. However, the state names would be better if they are more self-explanatory.

baurine · 2020-06-11T09:42:20Z

In fact, my key point is that, according to the code, we use the "Healthy" status to represent the work PD node, so it is natural to think that we should use "Unhealthy" to represent the not work PD node. It is true the "Unhealthy" is used in the code, but unfortunately, its logic never reach, it is always recognized as "Down".

(I guess the "Healthy" and "Unhealthy" come from the /pd/health API response. In the response we use health: true/false to represent whether a node is working.)

So, the PD node has 3 main kinds of status: "Healthy", "Unhealthy" and "Down", but only "Healthy" and "Down" happen.

Actually, for other components (TiDB / TiKV / TiFlash), we use "Up" to represent work node, and use "Down" to represent not work node.

So it seems here exists some inconsistency. Now I don't think this PR is a good solution. Or maybe we can use "Up" to replace "Healthy", remove the unreached "Unhealthy", and keep "Down" for PD node to keep consistency with other components, how about it?

disksing · 2020-06-11T10:08:18Z

Sounds reasonable to me. Any thought? /cc @nolouch @HunDunDM

AstroProfundis · 2020-06-11T10:22:40Z

It was because there is a state in PD's API response, it was never clearly documented what those states represent. It's ok to just remove the health/unhealthy term in response body and use up/down for only check response http code.

But I would wonder: what on earth does the health state in PD's response mean?

AstroProfundis · 2020-06-11T10:25:01Z

What state would a PD node report, if another node in the cluster is down, while itself is still working, health = true or false ?

HunDunDM · 2020-06-11T12:01:19Z

We determine that the health of an instance is based on that the http server can respond normally.

HunDunDM · 2020-06-11T12:03:31Z

In fact, my key point is that, according to the code, we use the "Healthy" status to represent the work PD node, so it is natural to think that we should use "Unhealthy" to represent the not work PD node. It is true the "Unhealthy" is used in the code, but unfortunately, its logic never reach, it is always recognized as "Down".

(I guess the "Healthy" and "Unhealthy" come from the /pd/health API response. In the response we use health: true/false to represent whether a node is working.)

So, the PD node has 3 main kinds of status: "Healthy", "Unhealthy" and "Down", but only "Healthy" and "Down" happen.

Actually, for other components (TiDB / TiKV / TiFlash), we use "Up" to represent work node, and use "Down" to represent not work node.

So it seems here exists some inconsistency. Now I don't think this PR is a good solution. Or maybe we can use "Up" to replace "Healthy", remove the unreached "Unhealthy", and keep "Down" for PD node to keep consistency with other components, how about it?

Sounds reasonable to me too.

AstroProfundis · 2020-06-11T12:11:16Z

We determine that the health of an instance is based on that the http server can respond normally.

Thanks for the clarification, my curiosity is that when would a PD node be responding to API calls and return an object with health: false; and what would other PD node report in this filed, when a node is down in the cluster.

HunDunDM · 2020-06-11T13:21:00Z

When you request /health for any pd:

If it is a follower node, this node will internally proxy the request to the leader node.
If it is a leader node, this node will try to request /ping of all pds (including itself), and mark the node as healthy if it can respond normally. Then return the result.

AstroProfundis · 2020-06-12T03:05:44Z

* If it is a leader node, this node will try to request `/ping` of all pds (including itself), and mark the node as healthy if it can respond normally. Then return the result.

What if a follower is down, will the leader respond unhealthy state? And if I understand it correctly, in this circumstance, all alive nodes will report the same unhealthy state?

HunDunDM · 2020-06-12T03:10:38Z

The leader will only mark the node as unhealthy.

all alive nodes will report the same unhealthy state

yes.

baurine · 2020-06-12T05:10:51Z

@AstroProfundis , this is an example response for /pd/health API which requests from http://10.0.1.11:2379 or http://10.0.1.14:2379:

[
  {
    "name": "pd-10.0.1.14-2379",
    "member_id": 4436600309396281139,
    "client_urls": [
      "http://10.0.1.14:2379"
    ],
    "health": true
  },
  {
    "name": "pd-10.0.1.11-2379",
    "member_id": 16310129336451408704,
    "client_urls": [
      "http://10.0.1.11:2379"
    ],
    "health": true
  },
  {
    "name": "pd-10.0.1.15-2379",
    "member_id": 16570216285418057773,
    "client_urls": [
      "http://10.0.1.15:2379"
    ],
    "health": false
  }
]

AstroProfundis · 2020-06-12T08:25:49Z

OK I understand. So there are 2 possible solutions:

Keep the current healthy/unhealthy state in displaying, but request that from any alive PD instead of only the node itself (we have similar approach when finding PD leader in the code), if the node with problem is not found in the first success response, it's marked as down
Request the PD API of the specific node, ignore the state of PD node in response, use an up/down state based on the http response code of /pd/health, not parsing the content at all

I'd prefer 2.

baurine · 2020-06-12T08:57:12Z

Prefer 2 as well.

This reverts commit 32bb5b0.

baurine · 2020-06-12T12:17:16Z

Hi @AstroProfundis , I have updated it, PTAL, thanks!

btw, I used the /pd/ping API replace /pd/health to check the single node health status according to @HunDunDM recommendation.

besides, I found there are some code in /components/dms/command/display.go similar to the code in /components/cluster/command/display.go, so I changed it as well.

lonng · 2020-06-13T11:53:56Z

ping @AstroProfundis

AstroProfundis

LGTM

differ the Down and Unhealty status for PD node

32bb5b0

AstroProfundis reviewed Jun 9, 2020

View reviewed changes

baurine added 5 commits June 12, 2020 19:15

Revert "differ the Down and Unhealty status for PD node"

2b8cd5e

This reverts commit 32bb5b0.

make PD node status consistency with other components

f6a5068

Merge branch 'master' into refine-pd-status

410e1e6

update display output

a276ba1

fix CI

3486880

refine

cb6a42f

baurine changed the title ~~differ the Down and Unhealty status for PD node~~ Simplify PD node status Jun 12, 2020

AstroProfundis approved these changes Jun 15, 2020

View reviewed changes

Merge branch 'master' into refine-pd-status

3e9d3c2

AstroProfundis merged commit 8f3fbc0 into pingcap:master Jun 15, 2020

baurine deleted the refine-pd-status branch June 15, 2020 07:54

baurine mentioned this pull request Jun 16, 2020

tiup: update pd status description pingcap/docs-cn#3685

Merged

5 tasks

ti-srebot mentioned this pull request Jun 22, 2020

tiup: update pd status description (#3685) pingcap/docs-cn#3759

Merged

5 tasks

Conversation

baurine commented Jun 9, 2020

What problem does this PR solve?

What is changed and how it works?

Check List

Uh oh!

baurine commented Jun 9, 2020

Uh oh!

codecov-commenter commented Jun 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

breezewish commented Jun 9, 2020

Uh oh!

AstroProfundis Jun 9, 2020

Choose a reason for hiding this comment

Uh oh!

baurine Jun 11, 2020

Choose a reason for hiding this comment

Uh oh!

disksing commented Jun 11, 2020

Uh oh!

baurine commented Jun 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

disksing commented Jun 11, 2020

Uh oh!

AstroProfundis commented Jun 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AstroProfundis commented Jun 11, 2020

Uh oh!

HunDunDM commented Jun 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HunDunDM commented Jun 11, 2020

Uh oh!

AstroProfundis commented Jun 11, 2020

Uh oh!

HunDunDM commented Jun 11, 2020

Uh oh!

AstroProfundis commented Jun 12, 2020

Uh oh!

HunDunDM commented Jun 12, 2020

Uh oh!

baurine commented Jun 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AstroProfundis commented Jun 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

baurine commented Jun 12, 2020

Uh oh!

baurine commented Jun 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lonng commented Jun 13, 2020

Uh oh!

AstroProfundis left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

codecov-commenter commented Jun 9, 2020 •

edited

Loading

baurine commented Jun 11, 2020 •

edited

Loading

AstroProfundis commented Jun 11, 2020 •

edited

Loading

HunDunDM commented Jun 11, 2020 •

edited

Loading

baurine commented Jun 12, 2020 •

edited

Loading

AstroProfundis commented Jun 12, 2020 •

edited

Loading

baurine commented Jun 12, 2020 •

edited

Loading