dist/ddl: add subtask metrics by okJiang · Pull Request #47175 · pingcap/tidb

okJiang · 2023-09-21T16:14:20Z

What problem does this PR solve?

Issue Number: close #47017

Problem Summary:

What is changed and how it works?

Runtime - Scheduler SubTask

A line chart showing the change in the number of waiting subTasks over time
A line chart showing the waiting time of waiting subTasks
A line chart showing the run time of running subTasks

TiDB perspective

A pie chart showing the distribution of all current subTasks on various TiDB nodes

Task perspective

A line chart showing the change in the number of each Task's (uncompleted/completed) subTasks over time
A line chart showing the average rate of each Task (subTask count/hour, which can later be improved to rows/s or bytes/s)

SubTask perspective

A line chart showing the average running speed of subTasks on different TiDB nodes (subTask count/hour)

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)

More test results are coming soon.

No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

ti-chi-bot · 2023-09-21T16:14:23Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

tiprow · 2023-09-21T16:14:34Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ywqzzy · 2023-09-22T01:49:08Z

/cc @ywqzzy

ti-chi-bot · 2023-09-22T01:49:10Z

@ywqzzy: GitHub didn't allow me to request PR reviews from the following users: ywqzzy.

Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @ywqzzy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

codecov · 2023-09-22T03:32:36Z

Codecov Report

Merging #47175 (4171ebc) into master (34438f8) will decrease coverage by 0.2724%.
Report is 9 commits behind head on master.
The diff coverage is 80.8988%.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #47175        +/-   ##
================================================
- Coverage   72.9862%   72.7138%   -0.2724%     
================================================
  Files          1340       1361        +21     
  Lines        400251     406749      +6498     
================================================
+ Hits         292128     295763      +3635     
- Misses        89191      92200      +3009     
+ Partials      18932      18786       -146

Flag	Coverage Δ
integration	`33.3929% <0.0000%> (?)`
unit	`73.0026% <84.7058%> (+0.0164%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`53.9913% <ø> (ø)`
parser	`84.9376% <ø> (-0.0108%)`	⬇️
br	`48.8526% <ø> (-4.2222%)`	⬇️

ywqzzy · 2023-09-22T05:56:55Z

-		subtask, err := s.taskTable.GetSubtaskInStates(s.id, task.ID, task.Step, proto.TaskStatePending)
+		subtask, err := s.taskTable.GetFirstSubtaskInStates(s.id, task.ID, task.Step, proto.TaskStatePending)
 		if err != nil {
 			logutil.Logger(s.logCtx).Warn("GetSubtaskInStates meets error", zap.Error(err))


Update the log message as well; it still refers to GetSubtaskInStates.

ywqzzy · 2023-09-22T05:57:23Z

+	for _, subtask := range subtasks {
+		metrics.IncDistDDLSubTaskCnt(subtask)
+		metrics.StartDistDDLSubTask(subtask)
+	}


Wrap all metric-related code in a function?

How about wrapping them in func (s *BaseScheduler) initMetrics(task)?
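
As a rough illustration of the suggestion above, the two metric calls can be gathered behind a single helper. This is only a sketch: `Subtask`, `recorder`, and `memRecorder` are hypothetical stand-ins invented for the example (they are not the types or functions in the PR), and the helper is a plain function rather than a `BaseScheduler` method so that it runs on its own.

```go
package main

// Subtask is a hypothetical, trimmed-down stand-in for proto.Subtask.
type Subtask struct {
	ID    int64
	State string
}

// recorder is a hypothetical interface covering the two kinds of metric
// calls that appear in the diff: a per-state counter and a start-time marker.
type recorder interface {
	IncSubtaskCnt(st *Subtask)
	StartSubtask(st *Subtask)
}

// initMetrics seeds both metrics for every subtask in one place, so callers
// do not have to repeat the loop.
func initMetrics(r recorder, subtasks []*Subtask) {
	for _, st := range subtasks {
		r.IncSubtaskCnt(st)
		r.StartSubtask(st)
	}
}

// memRecorder is an in-memory recorder used only to exercise initMetrics.
type memRecorder struct {
	counts  map[string]int // subtask count per state
	started map[int64]bool // subtask IDs that have a start marker
}

func newMemRecorder() *memRecorder {
	return &memRecorder{counts: map[string]int{}, started: map[int64]bool{}}
}

func (m *memRecorder) IncSubtaskCnt(st *Subtask) { m.counts[st.State]++ }
func (m *memRecorder) StartSubtask(st *Subtask)  { m.started[st.ID] = true }
```

Keeping the recorder behind an interface is a choice made for this sketch; it lets the helper be checked without a metrics backend.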

ywqzzy · 2023-09-22T05:58:11Z

-func (s *BaseScheduler) startSubtask(id int64) {
-	err := s.taskTable.StartSubtask(id)
+func (s *BaseScheduler) startSubtask(subtask *proto.Subtask) {
+	metrics.DecDistDDLSubTaskCnt(subtask)


Suggested change

-	metrics.DecDistDDLSubTaskCnt(subtask)
+	metrics.DecDistTaskSubTaskCnt(subtask)

Why decrease here?

Because we initialize it at L175.

ywqzzy · 2023-09-22T06:00:55Z

 	GetGlobalTaskByID(taskID int64) (task *proto.Task, err error)

-	GetSubtaskInStates(instanceID string, taskID int64, step int64, states ...interface{}) (*proto.Subtask, error)
+	GetSubtasksInStates(tidbID string, taskID int64, step int64, states ...interface{}) ([]*proto.Subtask, error)


We can remove it now

ywqzzy · 2023-09-22T06:03:21Z

+func (s *BaseScheduler) updateSubtaskStateAndError(subtask *proto.Subtask, state string, subTaskErr error) {
+	metrics.DecDistDDLSubTaskCnt(subtask)
+	metrics.EndDistDDLSubTask(subtask)
+	err := s.taskTable.UpdateSubtaskStateAndError(subtask.ID, state, subTaskErr)
 	if err != nil {
 		s.onError(err)
 	}
+	subtask.State = state
+	metrics.IncDistDDLSubTaskCnt(subtask)
+	metrics.StartDistDDLSubTask(subtask)
+}


I don't understand the metric update logic.

Update the metric as soon as the subtask status is changed.

ywqzzy · 2023-09-22T06:03:37Z

+	metrics.StartDistDDLSubTask(subtask)
+}
+
+func (s *BaseScheduler) finishSubtask(subtask *proto.Subtask, subtaskMeta []byte) {


Where is the finishSubtask method called?

fixed in c230f37

okJiang · 2023-09-22T07:32:18Z

/ok-to-test

D3Hunter · 2023-09-22T07:32:32Z

+          "targets": [
+            {
+              "exemplar": true,
+              "expr": "sum(tidb_disttask_ddl_subtask_cnt{status=~\"pending|running|revert_pending|reverting|paused\"}) by (task_id)",


All expressions must have the extra labels; see other existing metrics.

see #47175 (comment)

D3Hunter · 2023-09-22T07:33:08Z

+        }
+      ],
+      "repeat": null,
+      "title": "Dist DDL",


Suggested change

-      "title": "Dist DDL",
+      "title": "Dist Execute Framework",

D3Hunter · 2023-09-22T07:34:47Z

+          "targets": [
+            {
+              "exemplar": true,
+              "expr": "time()-tidb_disttask_ddl_subtask_start_time{k8s_cluster=\"$k8s_cluster\",tidb_cluster=\"$tidb_cluster\", instance=~\"$instance\", status=\"pending\"}",


Why not put this in the previous Dist Execute Framework?

We can see this detail for each TiDB node. If we put it in Dist Execute Framework, the details of different TiDB nodes get mixed together.

You need the k8s_cluster="$k8s_cluster", tidb_cluster="$tidb_cluster", instance=~"$instance" labels.

~~I didn't find other metrics containing these labels 🤔~~

In this way, we can see the details of each TiDB, rather than being mixed in one panel.

D3Hunter · 2023-09-22T07:37:57Z

/label ok-to-test
/remove-label needs-ok-to-test

ti-chi-bot · 2023-09-22T07:37:59Z

@D3Hunter: These labels are not set on the issue: needs-ok-to-test.

Details

In response to this:

/label ok-to-test
/remove-label needs-ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

okJiang · 2023-09-23T07:24:24Z

/retest

ywqzzy · 2023-09-25T02:41:17Z

+          "timeFrom": null,
+          "timeRegions": [],
+          "timeShift": null,
+          "title": "Distributed DDL SubTask Pending Duration",


Change every "dist ddl" to "dist task".

fixed in c4eef6d

D3Hunter · 2023-09-25T03:55:24Z

-func (s *BaseScheduler) startSubtask(id int64) {
-	err := s.taskTable.StartSubtask(id)
+func (s *BaseScheduler) startSubtask(subtask *proto.Subtask) {
+	metrics.DecDistTaskSubTaskCnt(subtask)


Why dec first, then inc?

Dec the subtask under its previous state, then inc it under its new state.

IMHO, the method name is confusing.

Indeed, that is a fair point. I think the confusion comes from this function implicitly changing the state of the subtask halfway through. How about this?

func (s *BaseScheduler) startSubtaskAndUpdateState(subtask *proto.Subtask) { .... }
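
The dec-then-inc order discussed in this thread can be sketched as a per-state gauge in which a subtask is always counted under exactly one state. `subtask` and `stateGauge` below are hypothetical types made up for the illustration, and persisting the new state to storage is deliberately left out.

```go
package main

import "sync"

// subtask is a hypothetical, trimmed-down stand-in for proto.Subtask.
type subtask struct {
	ID    int64
	State string
}

// stateGauge is a hypothetical in-memory stand-in for a gauge labelled by
// subtask state.
type stateGauge struct {
	mu     sync.Mutex
	counts map[string]int
}

func newStateGauge() *stateGauge {
	return &stateGauge{counts: make(map[string]int)}
}

// add adjusts the count for one state by delta.
func (g *stateGauge) add(state string, delta int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.counts[state] += delta
}

// get returns the current count for one state.
func (g *stateGauge) get(state string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.counts[state]
}

// moveTo decrements the gauge for the subtask's previous state, updates the
// in-memory state, and then increments the gauge for the new state.
func (g *stateGauge) moveTo(st *subtask, newState string) {
	g.add(st.State, -1)
	st.State = newState
	g.add(st.State, 1)
}
```

Because the decrement reads `st.State` before it is overwritten, the sum across all states stays equal to the number of tracked subtasks.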

ywqzzy · 2023-09-25T05:30:22Z

+	subtasks, err := s.taskTable.GetSubtasksInStates(s.id, task.ID, task.Step, proto.TaskStatePending)
+	if err != nil {
+		s.onError(err)
+		return s.getError()
+	}
+	for _, subtask := range subtasks {
+		metrics.IncDistTaskSubTaskCnt(subtask)
+		metrics.StartDistTaskSubTask(subtask)
+	}


We can move this code into dispatcher.go.
When subtasks are dispatched successfully, update the metric.
Then we don't need to query the taskTable.

That was my previous implementation, but it caused the metrics to be collected on a different instance, which made the Grafana display confusing.

ti-chi-bot · 2023-09-25T07:38:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tangenta, ywqzzy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [tangenta]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2023-09-25T07:38:53Z

[LGTM Timeline notifier]

Timeline:

2023-09-25 06:59:14.350170718 +0000 UTC m=+258144.068512920: ☑️ agreed by tangenta.
2023-09-25 07:38:52.578940656 +0000 UTC m=+260522.297282857: ☑️ agreed by ywqzzy.

okJiang · 2023-09-25T07:49:09Z

/retest

D3Hunter

I would now prefer to update the metrics by querying the subtask count at a fixed interval, which is much cleaner than the current inc/dec...

But this is OK for now.

okJiang · 2023-09-25T08:13:38Z

I would now prefer to update the metrics by querying the subtask count at a fixed interval, which is much cleaner than the current inc/dec...

But this is OK for now.

That could lead to some issues. For instance, we might miss a few state changes if a subtask's state is updated twice within one interval.
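
The concern about fixed-interval polling can be made concrete with a small simulation. The sketch assumes logical integer ticks and a hypothetical `event` type; it models only what a poller would observe and is not code from the PR.

```go
package main

// event is a hypothetical record of one subtask state change at a logical tick.
type event struct {
	tick  int
	state string
}

// sampledStates returns the states a poller would observe if it read the
// latest state every `interval` ticks, up to and including tick `end`.
// The events must be sorted by tick in ascending order.
func sampledStates(events []event, interval, end int) []string {
	var seen []string
	next := 0     // index of the first event not yet applied
	current := "" // latest state as of the current sample time
	for t := interval; t <= end; t += interval {
		// Apply every change that happened up to this sample time.
		for next < len(events) && events[next].tick <= t {
			current = events[next].state
			next++
		}
		seen = append(seen, current)
	}
	return seen
}
```

With changes at ticks 1 (pending), 12 (running), and 15 (succeed) and a polling interval of 10 ticks, the samples at ticks 10 and 20 see only pending and succeed, so the running state never shows up in the polled metric.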

okJiang added 3 commits September 21, 2023 23:36

complete subtask metrics

576b37f

optimize

e02a947

fix cp conflict

09f9525

okJiang changed the title ~~[wip]dist/ddl: add subtask metrics~~ dist/ddl: add subtask metrics Sep 22, 2023

okJiang marked this pull request as ready for review September 22, 2023 01:38

ti-chi-bot Bot removed the do-not-merge/work-in-progress and do-not-merge/needs-tests-checked labels Sep 22, 2023

fix ci

a12007a

ywqzzy reviewed Sep 22, 2023

okJiang added 2 commits September 22, 2023 14:09

fix bug

c230f37

add json

9ab4de5

ti-chi-bot Bot added the size/XXL label and removed the size/XL label Sep 22, 2023

fix comment

12c2e27

ti-chi-bot Bot added the ok-to-test label Sep 22, 2023

D3Hunter reviewed Sep 22, 2023

okJiang added 2 commits September 22, 2023 15:41

fix ci

54b386a

fix ci

40622b1

okJiang added 2 commits September 23, 2023 14:23

Merge branch 'master' of https://github.com/pingcap/tidb into ddl-dis…

909b8e6

…t-metrics-2

fix comment

36cb414

fix comment: rename

0d92eae

ywqzzy reviewed Sep 25, 2023

fix comment: change dist ddl to dist task

c4eef6d

D3Hunter reviewed Sep 25, 2023

Comment thread metrics/disttask.go Outdated

fix comment

da5ffe9

D3Hunter reviewed Sep 25, 2023

okJiang added 2 commits September 25, 2023 12:33

fix comment

b4289c7

fix ci

b94008f

ywqzzy reviewed Sep 25, 2023

Comment thread disttask/framework/storage/task_table.go Outdated

Update disttask/framework/storage/task_table.go

61632c8

Co-authored-by: EasonBall <592838129@qq.com>

tangenta approved these changes Sep 25, 2023

ti-chi-bot Bot added the needs-1-more-lgtm and approved labels Sep 25, 2023

okJiang added 2 commits September 25, 2023 15:04

fix comment: update func name

626e06b

Merge branch 'ddl-dist-metrics-2' of https://github.com/okJiang/tidb …

4171ebc

…into ddl-dist-metrics-2

ywqzzy approved these changes Sep 25, 2023

ti-chi-bot Bot added the lgtm label and removed the needs-1-more-lgtm label Sep 25, 2023

D3Hunter reviewed Sep 25, 2023

ti-chi-bot Bot merged commit 516542b into pingcap:master Sep 25, 2023

okJiang deleted the ddl-dist-metrics-2 branch September 25, 2023 08:51

D3Hunter mentioned this pull request Jan 18, 2024

Inconsistent DFE subtask count in two panels #49615

Closed
