
metrics: enhance diagnostic capabilities for gRPC network issues (#67811) #68299

Open
ti-chi-bot wants to merge 5 commits into pingcap:release-nextgen-202603 from ti-chi-bot:cherry-pick-67811-to-release-nextgen-202603

Conversation

@ti-chi-bot
Member

@ti-chi-bot ti-chi-bot commented May 11, 2026

This is an automated cherry-pick of #67811

What problem does this PR solve?

Issue Number: close #67810

Problem Summary: ref #67810

What changed and how does it work?

Bump client-go and register channelz collector.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • New Features

    • Integrated gRPC Channelz metrics to surface connection and channel health.
  • Tests

    • Added tests for Channelz metrics and singleton lifecycle.
    • Expanded leak-ignore rules to suppress additional gRPC-related goroutines.
    • Added conditional test skips when running on the next-gen kernel.
  • Chores

    • Updated dependency pins and test shard counts.
    • Adjusted build dependency declarations.

Review Change Stack

@ti-chi-bot ti-chi-bot added ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. type/cherry-pick-for-release-nextgen-202603 labels May 11, 2026
@coderabbitai

coderabbitai Bot commented May 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 40a42c5d-bba6-40d5-9023-24f3ef56a6a5

📥 Commits

Reviewing files that changed from the base of the PR and between acda469 and 7df5685.

📒 Files selected for processing (2)
  • pkg/sessiontxn/staleread/BUILD.bazel
  • pkg/sessiontxn/staleread/provider_test.go

📝 Walkthrough

Walkthrough

Adds an internal gRPC Channelz Prometheus collector in pkg/metrics with singleton initialization, filtering, teardown, and tests; updates BUILD/dependency pins and extends goleak ignore lists across test mains.

Changes

gRPC Channelz Collector Integration

| Layer | File(s) | Summary |
|---|---|---|
| Build Dependencies | pkg/metrics/BUILD.bazel | go_library and go_test targets gain //pkg/util/intest and gRPC channelz/bufconn/credentials deps; metrics_test shard_count increases from 5 to 8. |
| Imports and Singleton State | pkg/metrics/metrics.go | Imports extend to include bufconn, insecure, Channelz protos, and tikvcollectors; adds mutex-protected grpcChannelzCollector singleton fields. |
| Metrics Registration Hook | pkg/metrics/metrics.go | Calls setupChannelzCollector() from RegisterMetrics to wire collector setup into metrics initialization. |
| Core Collector Implementation | pkg/metrics/metrics.go | Implements setupChannelzCollector(), initGrpcChannelzCollectorLocked() (bufconn listener + grpc.Server + passthrough dialer), channelzCollectorOpts() with a filter to exclude internal scraper targets/sockets, and helper predicates isInternalChannelzTarget() / isInternalChannelzSocket(). |
| Cleanup and Teardown | pkg/metrics/metrics.go, pkg/metrics/main_test.go | Adds cleanupGrpcChannelzCollectorForTest() to unregister/reset singleton and registers it via goleak.Cleanup(...) in tests. |
| Collector Tests & Helpers | pkg/metrics/metrics_internal_test.go | Adds tests: TestGrpcChannelzCollectorSingleton, TestSetupChannelzCollectorSkippedInTest, TestGrpcChannelzCollectorGather; adds helpers findMetricFamily() and metricHasLabelValue(). |
| Goleak Configuration | br/cmd/br/main_test.go, pkg/server/*/main_test.go, pkg/server/tests/*/main_test.go | Multiple test mains add goleak.IgnoreTopFunction() entries for grpcsync.(*CallbackSerializer).run and bufconn.(*Listener).Accept. |
| Dependency & BUILD Updates | DEPS.bzl, go.mod, pkg/util/signal/BUILD.bazel, cmd/tidb-server/BUILD.bazel | Update client-go v2 pin in DEPS.bzl and go.mod; mark lumberjack as indirect; refactor pkg/util/signal BUILD deps and adjust tidb-server_test shard_count (6→5). |
| Kernel-type Test Guards | tests/realtikvtest/txntest/..., pkg/sessiontxn/staleread/... | Add kerneltype.IsNextGen() guards and BUILD deps to skip tests not supported on next-gen kernel. |
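
The filter mentioned under Core Collector Implementation presumably exists because the collector scrapes Channelz over its own in-process gRPC connection, and that connection would otherwise show up in the exported metrics. The following is only a minimal sketch of that idea using the standard library. The constant `internalTarget`, the target string it holds, and the helper `keepTargets` are hypothetical names for illustration; they are not taken from pkg/metrics/metrics.go or from tikvcollectors, whose real filter signature is not shown in this PR page.

```go
package main

import "fmt"

// internalTarget is a hypothetical dial target for the collector's own
// in-process scraper connection. The real value used by the PR is not
// shown in this conversation.
const internalTarget = "passthrough:///internal-channelz-scraper"

// isInternalTarget reports whether a channel target belongs to the
// collector's own scraper connection (illustrative stand-in for a
// predicate such as isInternalChannelzTarget).
func isInternalTarget(target string) bool {
	return target == internalTarget
}

// keepTargets returns only the targets that should be exported,
// dropping the scraper's own connection.
func keepTargets(targets []string) []string {
	kept := make([]string, 0, len(targets))
	for _, t := range targets {
		if !isInternalTarget(t) {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	all := []string{"tikv-0:20160", internalTarget, "pd-0:2379"}
	fmt.Println(keepTargets(all)) // prints [tikv-0:20160 pd-0:2379]
}
```

An exact-match predicate keeps the exclusion narrow, so a real remote target that merely resembles the internal name would still be exported.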

Sequence Diagram(s)

sequenceDiagram
  participant RegisterMetrics
  participant BufconnListener
  participant GRPCServer
  participant GRPCClientConn
  participant ChannelzCollector
  participant Prometheus

  RegisterMetrics->>GRPCServer: setupChannelzCollector()
  setupChannelzCollector->>BufconnListener: create bufconn listener
  setupChannelzCollector->>GRPCServer: register Channelz service and Start()
  setupChannelzCollector->>GRPCClientConn: dial via passthrough with custom dialer
  setupChannelzCollector->>ChannelzCollector: tikvcollectors.NewChannelzCollector(opts)
  ChannelzCollector->>Prometheus: prometheus.Register(collector)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • pingcap/tidb#67811: Similar gRPC Channelz collector additions, BUILD/deps, and goleak updates.
  • pingcap/tidb#67352: Related changes touching metrics and runtime wiring in overlapping areas.

Suggested labels

lgtm

Suggested reviewers

  • lcwangchao
  • 3pointer
  • nolouch
  • cfzjywxk

Poem

"A rabbit tunes the Channelz thread,
bufconn hums where metrics tread,
a collector wakes, then softly sleeps,
tests skip, cleanup gently keeps,
Prometheus counts carrots ahead. 🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The pull request title clearly describes the main change: enhancing diagnostic capabilities for gRPC network issues through the metrics package. |
| Description check | ✅ Passed | The PR description follows the template structure with Issue Number, Problem Summary, What Changed, Checklist items completed, and Release Notes sections, though the description is minimal. |
| Linked Issues check | ✅ Passed | The code changes successfully implement the objectives from issue #67810: registering a gRPC Channelz collector to export internal gRPC status/metrics to Prometheus, improving observability of gRPC network connections. |
| Out of Scope Changes check | ✅ Passed | All changes align with the PR objectives. The modifications include: registering Channelz collector for gRPC metrics export, updating dependencies (client-go, bufconn), adjusting test configurations to suppress expected gRPC goroutine patterns, and skipping replica/stale-read tests on next-gen kernels—all scoped to improving gRPC diagnostics. |


Signed-off-by: zyguan <zhongyangguan@gmail.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/metrics/metrics.go`:
- Around line 485-502: The setupChannelzCollector function can leave the
in-process gRPC server/listener/client running if prometheus.MustRegister
panics; after calling initGrpcChannelzCollectorLocked(), add a deferred rollback
that runs unless registration completes: create a local success flag (or
similar) immediately after initGrpcChannelzCollectorLocked() and defer a
function that, when the flag is false, calls the existing cleanup routine used
in tests (e.g., cleanupGrpcChannelzCollectorForTest or a new
cleanupGrpcChannelzCollectorLocked helper) to stop the server, close the
listener/client and reset grpcChannelzCollector state; only set the success flag
(and grpcChannelzCollector.registered = true) after
prometheus.MustRegister(grpcChannelzCollector.collector) returns without panic
so the deferred rollback runs on panic/early return.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 1eb0a809-40c4-4ecf-92fb-783e4d779731

📥 Commits

Reviewing files that changed from the base of the PR and between 8ce1db3 and 4a14a91.

📒 Files selected for processing (14)
  • br/cmd/br/main_test.go
  • pkg/metrics/BUILD.bazel
  • pkg/metrics/main_test.go
  • pkg/metrics/metrics.go
  • pkg/metrics/metrics_internal_test.go
  • pkg/server/handler/extractorhandler/main_test.go
  • pkg/server/handler/optimizor/main_test.go
  • pkg/server/handler/tests/main_test.go
  • pkg/server/main_test.go
  • pkg/server/tests/commontest/main_test.go
  • pkg/server/tests/cursor/main_test.go
  • pkg/server/tests/main_test.go
  • pkg/server/tests/standby/main_test.go
  • pkg/server/tests/tls/main_test.go

Comment thread pkg/metrics/metrics.go
Comment on lines +485 to +502
func setupChannelzCollector() {
	if intest.InTest {
		return
	}

	grpcChannelzCollector.mu.Lock()
	defer grpcChannelzCollector.mu.Unlock()

	if err := initGrpcChannelzCollectorLocked(); err != nil {
		logutil.BgLogger().Warn("setup internal channelz collector failed", zap.Error(err))
		return
	}
	if grpcChannelzCollector.registered {
		return
	}
	prometheus.MustRegister(grpcChannelzCollector.collector)
	grpcChannelzCollector.registered = true
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Singleton state can drift if prometheus.MustRegister panics.

initGrpcChannelzCollectorLocked() brings the in-process gRPC server, listener, client, and collector to life, but grpcChannelzCollector.registered is only set to true after prometheus.MustRegister(grpcChannelzCollector.collector) succeeds. MustRegister panics on registration errors (e.g., duplicate descriptor / AlreadyRegisteredError), and the panic will propagate out of RegisterMetrics while the gRPC server goroutine, listener, and client connection are left running with registered == false. There is no defer here to roll them back, and subsequent recovery paths (e.g., re-invoking RegisterMetrics or cleanupGrpcChannelzCollectorForTest) would re-enter while the prior server is still alive.

Consider deferring a rollback if registration doesn't complete, e.g.:

🛡️ Suggested guard
-	if err := initGrpcChannelzCollectorLocked(); err != nil {
-		logutil.BgLogger().Warn("setup internal channelz collector failed", zap.Error(err))
-		return
-	}
-	if grpcChannelzCollector.registered {
-		return
-	}
-	prometheus.MustRegister(grpcChannelzCollector.collector)
-	grpcChannelzCollector.registered = true
+	if err := initGrpcChannelzCollectorLocked(); err != nil {
+		logutil.BgLogger().Warn("setup internal channelz collector failed", zap.Error(err))
+		return
+	}
+	if grpcChannelzCollector.registered {
+		return
+	}
+	if err := prometheus.Register(grpcChannelzCollector.collector); err != nil {
+		logutil.BgLogger().Warn("register internal channelz collector failed", zap.Error(err))
+		stopGrpcChannelzCollectorLocked()
+		return
+	}
+	grpcChannelzCollector.registered = true
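
For comparison, the deferred-rollback variant suggested in this comment (keep the panicking registration call, undo the setup if it does not complete) could look roughly like the sketch below. It is a standalone illustration using only the standard library, not the code in pkg/metrics/metrics.go: `collectorState`, `startLocked`, `stopLocked`, `mustRegister`, and `setup` are hypothetical stand-ins for the singleton, initGrpcChannelzCollectorLocked(), a cleanup helper, prometheus.MustRegister, and setupChannelzCollector() respectively.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// collectorState is a hypothetical stand-in for the grpcChannelzCollector singleton.
type collectorState struct {
	mu         sync.Mutex
	running    bool // in-process server, listener, and client are alive
	registered bool // collector is registered with the metrics registry
}

var state collectorState

// startLocked stands in for initGrpcChannelzCollectorLocked (assumed idempotent).
func startLocked() error {
	state.running = true
	return nil
}

// stopLocked stands in for a cleanup helper that tears everything down.
func stopLocked() {
	state.running = false
	state.registered = false
}

// mustRegister mimics a registration call that panics on error.
func mustRegister(fail bool) {
	if fail {
		panic(errors.New("duplicate collector registration"))
	}
}

// setup starts the collector and rolls the start back if registration
// panics or otherwise does not complete.
func setup(failRegister bool) {
	state.mu.Lock()
	defer state.mu.Unlock()

	if err := startLocked(); err != nil {
		return
	}
	if state.registered {
		return // already fully set up, nothing to roll back
	}

	ok := false
	// Deferred after the Unlock above, so it runs first, with the lock still held.
	defer func() {
		if !ok {
			stopLocked()
		}
	}()

	mustRegister(failRegister)
	state.registered = true
	ok = true
}

func main() {
	func() {
		defer func() { _ = recover() }() // swallow the simulated panic
		setup(true)
	}()
	fmt.Println(state.running, state.registered) // prints false false

	setup(false)
	fmt.Println(state.running, state.registered) // prints true true
}
```

In this sketch the rollback is deferred only after the `registered` check, so the early return for an already-registered collector does not tear down a healthy instance.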

@zyguan
Contributor

zyguan commented May 11, 2026

/retest

@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label May 11, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 11, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-11 12:41:03.285921302 +0000 UTC m=+96631.818700621: ☑️ agreed by cfzjywxk.

zyguan added 3 commits May 13, 2026 07:32
…rry-pick-67811-to-release-nextgen-202603

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot

ti-chi-bot Bot commented May 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cfzjywxk
Once this PR has been reviewed and has the lgtm label, please assign bornchanger, zimulala for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov

codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release-nextgen-202603@1cfd1ba). Learn more about missing BASE report.

Additional details and impacted files
@@                     Coverage Diff                     @@
##             release-nextgen-202603     #68299   +/-   ##
===========================================================
  Coverage                          ?   77.5775%           
===========================================================
  Files                             ?       1962           
  Lines                             ?     544231           
  Branches                          ?          0           
===========================================================
  Hits                              ?     422201           
  Misses                            ?     121177           
  Partials                          ?        853           
Flag Coverage Δ
unit 76.1791% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 61.5065% <0.0000%> (?)
parser ∅ <0.0000%> (?)
br 61.0443% <0.0000%> (?)

Labels

needs-1-more-lgtm Indicates a PR needs 1 more LGTM. ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. type/cherry-pick-for-release-nextgen-202603


3 participants