Skip to content

Simplify PD node status#474

Merged
AstroProfundis merged 8 commits into
pingcap:masterfrom
baurine:refine-pd-status
Jun 15, 2020
Merged

Simplify PD node status#474
AstroProfundis merged 8 commits into
pingcap:masterfrom
baurine:refine-pd-status

Conversation

@baurine
Copy link
Copy Markdown
Contributor

@baurine baurine commented Jun 9, 2020

What problem does this PR solve?

Resolve a part of #472

Currently, the "Unhealthy" PD node is recognized as "Down" status, and the "Unhealthy" status never displays.

What is changed and how it works?

Differ the "Down" and "Unhealthy" status for PD node, when all nodes don't work, recognize them as "Down", else recognize the not work node as "Unhealthy".

Check List

Tests

  • Manual test (add detailed scripts or steps below)

Deploy a cluster with 3 PD nodes, pause 1 node.

企业微信20200609114349

Before:

企业微信20200609114229

After:

image

Pause 2 PD nodes:

image

@baurine
Copy link
Copy Markdown
Contributor Author

baurine commented Jun 9, 2020

@lucklove @HunDunDM please help take a look as well, thanks!

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 9, 2020

Codecov Report

Merging #474 into master will decrease coverage by 0.88%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #474      +/-   ##
==========================================
- Coverage   39.84%   38.96%   -0.89%     
==========================================
  Files         201      205       +4     
  Lines       14847    15386     +539     
==========================================
+ Hits         5916     5995      +79     
- Misses       8072     8516     +444     
- Partials      859      875      +16     
Flag Coverage Δ
#coverage 38.96% <ø> (-0.89%) ⬇️
Impacted Files Coverage Δ
...m/pingcap/tiup/components/playground/playground.go 19.67% <0.00%> (-8.94%) ⬇️
go/src/github.com/pingcap/tiup/cmd/mirror.go 41.56% <0.00%> (-5.27%) ⬇️
...c/github.com/pingcap/tiup/pkg/utils/http_client.go 71.42% <0.00%> (-4.77%) ⬇️
go/src/github.com/pingcap/tiup/cmd/root.go 60.43% <0.00%> (-3.85%) ⬇️
...ingcap/tiup/components/playground/instance/tidb.go 86.95% <0.00%> (-3.75%) ⬇️
...ingcap/tiup/components/playground/instance/tikv.go 77.04% <0.00%> (-2.27%) ⬇️
...om/pingcap/tiup/components/cluster/command/root.go 43.03% <0.00%> (-2.24%) ⬇️
...cap/tiup/components/playground/instance/tiflash.go 62.79% <0.00%> (-0.99%) ⬇️
...ingcap/tiup/components/cluster/command/scale_in.go 10.34% <0.00%> (-0.77%) ⬇️
...om/pingcap/tiup/components/cluster/command/stop.go 21.21% <0.00%> (-0.67%) ⬇️
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1cd64e4...3e9d3c2. Read the comment docs.

@breezewish
Copy link
Copy Markdown
Member

From the user perspective do we have differences about unhealthy and down? @HunDunDM PTAL

Comment thread pkg/cluster/meta/topology.go Outdated

healths, err := curPdAPI.GetHealth()
// "Down" means all PD nodes don't work,
// while "Unhealthy" means current PD node doesn't work
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by this, I thought "Down" means the specific PD node is not working, and "Unhealthy" means the PD cluster is not in proper state.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But according to the old logic, they will be all recognized as "Down" status, we never can go to the return "Unhealthy" + suffix line logic.

@disksing
Copy link
Copy Markdown

I don't think we have made a clear distinction between these two different terms.
IMO it is a good idea to distinguish between two states: single node down / majority nodes down. However, the state names would be better if they are more self-explanatory.

@baurine
Copy link
Copy Markdown
Contributor Author

baurine commented Jun 11, 2020

In fact, my key point is that, according to the code, we use the "Healthy" status to represent the work PD node, so it is natural to think that we should use "Unhealthy" to represent the not work PD node. It is true the "Unhealthy" is used in the code, but unfortunately, its logic never reach, it is always recognized as "Down".

(I guess the "Healthy" and "Unhealthy" come from the /pd/health API response. In the response we use health: true/false to represent whether a node is working.)

So, the PD node has 3 main kinds of status: "Healthy", "Unhealthy" and "Down", but only "Healthy" and "Down" happen.

Actually, for other components (TiDB / TiKV / TiFlash), we use "Up" to represent work node, and use "Down" to represent not work node.

So it seems here exists some inconsistency. Now I don't think this PR is a good solution. Or maybe we can use "Up" to replace "Healthy", remove the unreached "Unhealthy", and keep "Down" for PD node to keep consistency with other components, how about it?

@disksing
Copy link
Copy Markdown

Sounds reasonable to me. Any thought? /cc @nolouch @HunDunDM

@AstroProfundis
Copy link
Copy Markdown
Contributor

AstroProfundis commented Jun 11, 2020

It was because there is a state in PD's API response, it was never clearly documented what those states represent. It's ok to just remove the health/unhealthy term in response body and use up/down for only check response http code.

But I would wonder: what on earth does the health state in PD's response mean?

@AstroProfundis
Copy link
Copy Markdown
Contributor

What state would a PD node report, if another node in the cluster is down, while itself is still working, health = true or false ?

@HunDunDM
Copy link
Copy Markdown
Contributor

HunDunDM commented Jun 11, 2020

We determine that the health of an instance is based on that the http server can respond normally.

@HunDunDM
Copy link
Copy Markdown
Contributor

In fact, my key point is that, according to the code, we use the "Healthy" status to represent the work PD node, so it is natural to think that we should use "Unhealthy" to represent the not work PD node. It is true the "Unhealthy" is used in the code, but unfortunately, its logic never reach, it is always recognized as "Down".

(I guess the "Healthy" and "Unhealthy" come from the /pd/health API response. In the response we use health: true/false to represent whether a node is working.)

So, the PD node has 3 main kinds of status: "Healthy", "Unhealthy" and "Down", but only "Healthy" and "Down" happen.

Actually, for other components (TiDB / TiKV / TiFlash), we use "Up" to represent work node, and use "Down" to represent not work node.

So it seems here exists some inconsistency. Now I don't think this PR is a good solution. Or maybe we can use "Up" to replace "Healthy", remove the unreached "Unhealthy", and keep "Down" for PD node to keep consistency with other components, how about it?

Sounds reasonable to me too.

@AstroProfundis
Copy link
Copy Markdown
Contributor

We determine that the health of an instance is based on that the http server can respond normally.

Thanks for the clarification, my curiosity is that when would a PD node be responding to API calls and return an object with health: false; and what would other PD node report in this filed, when a node is down in the cluster.

@HunDunDM
Copy link
Copy Markdown
Contributor

When you request /health for any pd:

  • If it is a follower node, this node will internally proxy the request to the leader node.
  • If it is a leader node, this node will try to request /ping of all pds (including itself), and mark the node as healthy if it can respond normally. Then return the result.

@AstroProfundis
Copy link
Copy Markdown
Contributor

* If it is a leader node, this node will try to request `/ping` of all pds (including itself), and mark the node as healthy if it can respond normally. Then return the result.

What if a follower is down, will the leader respond unhealthy state? And if I understand it correctly, in this circumstance, all alive nodes will report the same unhealthy state?

@HunDunDM
Copy link
Copy Markdown
Contributor

The leader will only mark the node as unhealthy.

all alive nodes will report the same unhealthy state

yes.

@baurine
Copy link
Copy Markdown
Contributor Author

baurine commented Jun 12, 2020

@AstroProfundis , this is an example response for /pd/health API which requests from http://10.0.1.11:2379 or http://10.0.1.14:2379:

[
  {
    "name": "pd-10.0.1.14-2379",
    "member_id": 4436600309396281139,
    "client_urls": [
      "http://10.0.1.14:2379"
    ],
    "health": true
  },
  {
    "name": "pd-10.0.1.11-2379",
    "member_id": 16310129336451408704,
    "client_urls": [
      "http://10.0.1.11:2379"
    ],
    "health": true
  },
  {
    "name": "pd-10.0.1.15-2379",
    "member_id": 16570216285418057773,
    "client_urls": [
      "http://10.0.1.15:2379"
    ],
    "health": false
  }
]

@AstroProfundis
Copy link
Copy Markdown
Contributor

AstroProfundis commented Jun 12, 2020

OK I understand. So there are 2 possible solutions:

  1. Keep the current healthy/unhealthy state in displaying, but request that from any alive PD instead of only the node itself (we have similar approach when finding PD leader in the code), if the node with problem is not found in the first success response, it's marked as down
  2. Request the PD API of the specific node, ignore the state of PD node in response, use an up/down state based on the http response code of /pd/health, not parsing the content at all

I'd prefer 2.

@baurine
Copy link
Copy Markdown
Contributor Author

baurine commented Jun 12, 2020

Prefer 2 as well.

@baurine
Copy link
Copy Markdown
Contributor Author

baurine commented Jun 12, 2020

Hi @AstroProfundis , I have updated it, PTAL, thanks!

btw, I used the /pd/ping API replace /pd/health to check the single node health status according to @HunDunDM recommendation.

besides, I found there are some code in /components/dms/command/display.go similar to the code in /components/cluster/command/display.go, so I changed it as well.

企业微信20200612074128

@baurine baurine changed the title differ the Down and Unhealty status for PD node Simplify PD node status Jun 12, 2020
@lonng
Copy link
Copy Markdown
Contributor

lonng commented Jun 13, 2020

ping @AstroProfundis

Copy link
Copy Markdown
Contributor

@AstroProfundis AstroProfundis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AstroProfundis AstroProfundis merged commit 8f3fbc0 into pingcap:master Jun 15, 2020
@baurine baurine deleted the refine-pd-status branch June 15, 2020 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants