dist/ddl: add subtask metrics #47175
Skipping CI for Draft Pull Request. |
|
/cc @ywqzzy |
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #47175 +/- ##
================================================
- Coverage 72.9862% 72.7138% -0.2724%
================================================
Files 1340 1361 +21
Lines 400251 406749 +6498
================================================
+ Hits 292128 295763 +3635
- Misses 89191 92200 +3009
+ Partials 18932 18786 -146
|
| subtask, err := s.taskTable.GetSubtaskInStates(s.id, task.ID, task.Step, proto.TaskStatePending) | ||
| subtask, err := s.taskTable.GetFirstSubtaskInStates(s.id, task.ID, task.Step, proto.TaskStatePending) | ||
| if err != nil { | ||
| logutil.Logger(s.logCtx).Warn("GetSubtaskInStates meets error", zap.Error(err)) |
| for _, subtask := range subtasks { | ||
| metrics.IncDistDDLSubTaskCnt(subtask) | ||
| metrics.StartDistDDLSubTask(subtask) | ||
| } |
There was a problem hiding this comment.
Could we wrap all the metric-related code in one function?
There was a problem hiding this comment.
How about wrapping them in func (s *BaseScheduler) initMetrics(task)?
| func (s *BaseScheduler) startSubtask(id int64) { | ||
| err := s.taskTable.StartSubtask(id) | ||
| func (s *BaseScheduler) startSubtask(subtask *proto.Subtask) { | ||
| metrics.DecDistDDLSubTaskCnt(subtask) |
There was a problem hiding this comment.
| metrics.DecDistDDLSubTaskCnt(subtask) | |
| metrics.DecDistTaskSubTaskCnt(subtask) |
There was a problem hiding this comment.
Because we initialize it at L175.
| GetGlobalTaskByID(taskID int64) (task *proto.Task, err error) | ||
|
|
||
| GetSubtaskInStates(instanceID string, taskID int64, step int64, states ...interface{}) (*proto.Subtask, error) | ||
| GetSubtasksInStates(tidbID string, taskID int64, step int64, states ...interface{}) ([]*proto.Subtask, error) |
| func (s *BaseScheduler) updateSubtaskStateAndError(subtask *proto.Subtask, state string, subTaskErr error) { | ||
| metrics.DecDistDDLSubTaskCnt(subtask) | ||
| metrics.EndDistDDLSubTask(subtask) | ||
| err := s.taskTable.UpdateSubtaskStateAndError(subtask.ID, state, subTaskErr) | ||
| if err != nil { | ||
| s.onError(err) | ||
| } | ||
| subtask.State = state | ||
| metrics.IncDistDDLSubTaskCnt(subtask) | ||
| metrics.StartDistDDLSubTask(subtask) | ||
| } |
There was a problem hiding this comment.
I don't understand the metric update logic.
There was a problem hiding this comment.
We update the metrics as soon as the subtask state changes.
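The update logic in question can be sketched as a "decrement the old state, then increment the new state" transition. This is an illustration only: `stateGauge` and `transition` are hypothetical names, and `Subtask` is a simplified stand-in for `proto.Subtask`.

```go
package main

import "fmt"

// Subtask is a simplified stand-in for proto.Subtask (assumption).
type Subtask struct {
	ID    int64
	State string
}

// stateGauge is a hypothetical per-state subtask counter, standing in
// for the subtask count gauge added by the PR.
type stateGauge map[string]int

// transition moves a subtask to the next state and keeps the gauge
// consistent: the old state loses one subtask, the new state gains one.
func (g stateGauge) transition(st *Subtask, next string) {
	g[st.State]-- // leave the previous state
	st.State = next
	g[st.State]++ // enter the new state
}

func main() {
	st := &Subtask{ID: 1, State: "pending"}
	g := stateGauge{"pending": 1}

	g.transition(st, "running")
	fmt.Println(g["pending"], g["running"])

	g.transition(st, "succeed")
	fmt.Println(g["running"], g["succeed"])
}
```

With this shape the sum over all states stays equal to the number of tracked subtasks, which is what the stacked Grafana panel relies on.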
| metrics.StartDistDDLSubTask(subtask) | ||
| } | ||
|
|
||
| func (s *BaseScheduler) finishSubtask(subtask *proto.Subtask, subtaskMeta []byte) { |
There was a problem hiding this comment.
Where is the finishSubtask method called?
|
/ok-to-test |
| "targets": [ | ||
| { | ||
| "exemplar": true, | ||
| "expr": "sum(tidb_disttask_ddl_subtask_cnt{status=~\"pending|running|revert_pending|reverting|paused\"}) by (task_id)", |
There was a problem hiding this comment.
All expressions must include the extra labels; see the other existing metrics.
| } | ||
| ], | ||
| "repeat": null, | ||
| "title": "Dist DDL", |
There was a problem hiding this comment.
| "title": "Dist DDL", | |
| "title": "Dist Execute Framework", |
| "targets": [ | ||
| { | ||
| "exemplar": true, | ||
| "expr": "time()-tidb_disttask_ddl_subtask_start_time{k8s_cluster=\"$k8s_cluster\",tidb_cluster=\"$tidb_cluster\", instance=~\"$instance\", status=\"pending\"}", |
There was a problem hiding this comment.
Why not put this in the previous "Dist Execute Framework" row?
There was a problem hiding this comment.
This way we can see the details for each TiDB node. If we put it in "Dist Execute Framework", the details of different TiDB nodes would be mixed together.
There was a problem hiding this comment.
You need the k8s_cluster="$k8s_cluster", tidb_cluster="$tidb_cluster", instance=~"$instance" labels.
There was a problem hiding this comment.
I didn't find other metrics containing these labels 🤔
|
/label ok-to-test |
|
/retest |
| "timeFrom": null, | ||
| "timeRegions": [], | ||
| "timeShift": null, | ||
| "title": "Distributed DDL SubTask Pending Duration", |
There was a problem hiding this comment.
Change all "dist ddl" to "dist task".
| func (s *BaseScheduler) startSubtask(id int64) { | ||
| err := s.taskTable.StartSubtask(id) | ||
| func (s *BaseScheduler) startSubtask(subtask *proto.Subtask) { | ||
| metrics.DecDistTaskSubTaskCnt(subtask) |
There was a problem hiding this comment.
dec pre-state subtask, then inc new-state subtask
There was a problem hiding this comment.
IMHO, the method name is confusing.
There was a problem hiding this comment.
Indeed, that is a fair point. I think the confusion arises because this function implicitly changes the subtask's state halfway through. How about this?
func (s *BaseScheduler) startSubtaskAndUpdateState(subtask *proto.Subtask) {
....
}
| subtasks, err := s.taskTable.GetSubtasksInStates(s.id, task.ID, task.Step, proto.TaskStatePending) | ||
| if err != nil { | ||
| s.onError(err) | ||
| return s.getError() | ||
| } | ||
| for _, subtask := range subtasks { | ||
| metrics.IncDistTaskSubTaskCnt(subtask) | ||
| metrics.StartDistTaskSubTask(subtask) | ||
| } |
There was a problem hiding this comment.
We can move this code into dispatcher.go.
When the subtasks are dispatched successfully, we update the metric there.
Then we don't need to query the taskTable.
There was a problem hiding this comment.
That was my previous implementation. With it, the metrics were collected on a different instance from the one running the subtasks, which made the Grafana display confusing.
Co-authored-by: EasonBall <592838129@qq.com>
…into ddl-dist-metrics-2
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tangenta, ywqzzy The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
|
/retest |
D3Hunter
left a comment
There was a problem hiding this comment.
I would now prefer to update the metrics by querying the subtask count at a fixed interval; that is much cleaner than the current inc/dec approach.
But this is OK for now.
That could lead to certain issues. For instance, we might miss some state changes if a subtask's state is updated twice within one interval.
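The trade-off in this exchange can be sketched as follows. Polling recomputes the per-state counts from a snapshot, which is simple, but a state that begins and ends between two polls never shows up. `countByState` and the state history are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// countByState recomputes per-state subtask counts from a snapshot of
// subtask states, as a fixed-interval poller would (hypothetical helper).
func countByState(states []string) map[string]int {
	counts := make(map[string]int)
	for _, s := range states {
		counts[s]++
	}
	return counts
}

func main() {
	// States one subtask passes through between two consecutive polls.
	history := []string{"pending", "running", "succeed"}

	// The poller only observes the state at each poll instant.
	firstPoll := countByState(history[:1])
	secondPoll := countByState(history[len(history)-1:])

	// "running" is absent from both snapshots because the subtask changed
	// state twice within one interval.
	fmt.Println(firstPoll["running"], secondPoll["running"])
	fmt.Println(firstPoll, secondPoll)
}
```

The inc/dec approach records every transition at the cost of more bookkeeping in the scheduler; polling trades that completeness for simpler code.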

What problem does this PR solve?
Issue Number: close #47017
Problem Summary:
What is changed and how it works?
Runtime - Scheduler SubTask
TiDB perspective
Task perspective
SubTask perspective
Check List
Tests
more test results are coming soon
Side effects
Documentation
Release note