metrics: enhance diagnostic capabilities for gRPC network issues (#67811) #68462

Merged
ti-chi-bot[bot] merged 2 commits into pingcap:release-nextgen-20251011 from ti-chi-bot:cherry-pick-67811-to-release-nextgen-20251011
May 19, 2026

Conversation

@ti-chi-bot
Member

@ti-chi-bot ti-chi-bot commented May 18, 2026

This is an automated cherry-pick of #67811

What problem does this PR solve?

Issue Number: close #67810

Problem Summary: ref #67810

What changed and how does it work?

Bump client-go and register a gRPC channelz Prometheus collector.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • New Features

    • Added a Prometheus collector to expose gRPC channel diagnostics for improved network observability.
  • Tests

    • Added/expanded tests validating the new channel diagnostics and metrics.
    • Updated test harness to suppress known test-only goroutine noise, reducing false-positive leak reports.

Review Change Stack

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added the following labels May 18, 2026:
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • ok-to-test: Indicates a PR is ready to be tested.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.
  • type/cherry-pick-for-release-nextgen-20251011
@ti-chi-bot
Member Author

@zyguan This PR has conflicts, so I have put it on hold.
Please resolve them or ask others to resolve them, then comment /unhold to remove the hold label.

@ti-chi-bot

ti-chi-bot Bot commented May 18, 2026

@ti-chi-bot: If you want to know how to resolve it, please read the guide in the TiDB Dev Guide.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@coderabbitai

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 196a3d2c-921f-4473-b4b9-d58114cd4aa4

📥 Commits

Reviewing files that changed from the base of the PR and between a28d41a and db08548.

📒 Files selected for processing (4)
  • pkg/metrics/BUILD.bazel
  • pkg/metrics/metrics.go
  • pkg/metrics/metrics_internal_test.go
  • pkg/server/main_test.go

📝 Walkthrough

Adds an internal singleton gRPC Channelz Prometheus collector wired into RegisterMetrics, test infrastructure and tests for the collector, BUILD/test dependency updates, and goleak IgnoreTopFunction entries across multiple TestMain files to suppress known gRPC background goroutines.
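
The "mutex-guarded singleton that is a no-op in tests" pattern named in the walkthrough can be sketched with the standard library alone. This is a minimal illustration, not the code from pkg/metrics/metrics.go: the `collector` type, `setupCollector`, `resetCollectorForTest`, and the `inTest` flag are hypothetical stand-ins for the real collector, `setupChannelzCollector()`, the test cleanup helper, and the `intest` check.

```go
package main

import (
	"fmt"
	"sync"
)

// collector is a hypothetical stand-in for the Channelz Prometheus collector.
type collector struct{ id int }

var (
	mu       sync.Mutex
	instance *collector // guarded by mu
	created  int        // number of constructions, kept only for illustration
)

// setupCollector returns the shared collector, building it on first use.
// When inTest is true it does nothing, mirroring the "no-op in tests"
// behavior described in the walkthrough.
func setupCollector(inTest bool) *collector {
	if inTest {
		return nil
	}
	mu.Lock()
	defer mu.Unlock()
	if instance == nil {
		created++
		instance = &collector{id: created}
	}
	return instance
}

// resetCollectorForTest drops the singleton so a test can start from scratch.
func resetCollectorForTest() {
	mu.Lock()
	defer mu.Unlock()
	instance = nil
}

func main() {
	fmt.Println(setupCollector(true) == nil) // test mode: setup is skipped
	a, b := setupCollector(false), setupCollector(false)
	fmt.Println(a == b) // repeated setup yields the same instance
}
```

Holding the lock for the whole check-and-create step keeps concurrent callers from building two collectors, which matters when the constructor also starts a server and a client.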

Changes

gRPC Channelz Metrics Collection

Layer: Core Channelz collector implementation
File(s): pkg/metrics/metrics.go
Summary: Adds Channelz/bufconn imports; implements mutex-guarded grpcChannelzCollector singleton, setupChannelzCollector() (no-op in tests), initGrpcChannelzCollectorLocked() to start bufconn server and client and construct Prometheus collector, channelzCollectorOpts() filter, and test cleanup/unregister (cleanupGrpcChannelzCollectorForTest, stopGrpcChannelzCollectorLocked). Also calls setupChannelzCollector() from RegisterMetrics.

Layer: BUILD and tests for collector
File(s): pkg/metrics/BUILD.bazel, pkg/metrics/main_test.go, pkg/metrics/metrics_internal_test.go
Summary: Updates metrics deps to include Channelz/bufconn and //pkg/util/intest, increases metrics_test shard_count, adds goleak.Cleanup(...) hook in main_test.go, and introduces tests: TestGrpcChannelzCollectorSingleton, TestSetupChannelzCollectorSkippedInTest, TestGrpcChannelzCollectorGather plus helpers findMetricFamily and metricHasLabelValue.

Layer: Test goroutine leak suppressions
File(s): br/cmd/br/main_test.go, pkg/server/handler/extractorhandler/main_test.go, pkg/server/handler/optimizor/main_test.go, pkg/server/handler/tests/main_test.go, pkg/server/main_test.go, pkg/server/tests/commontest/main_test.go, pkg/server/tests/cursor/main_test.go, pkg/server/tests/main_test.go, pkg/server/tests/standby/main_test.go, pkg/server/tests/tls/main_test.go
Summary: Appends goleak.IgnoreTopFunction(...) entries across multiple TestMain files to ignore gRPC internal goroutines such as grpcsync.(*CallbackSerializer).run and test/bufconn.(*Listener).Accept during leak verification.
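
To make the idea behind a "top function" ignore rule concrete, here is a standard-library sketch of matching a goroutine dump against an ignore list. It is not goleak's implementation, and it assumes the text layout produced by `runtime.Stack`; `topFunction`, `isIgnored`, and the sample dump are hypothetical. The two function names in the list are the ones this PR ignores.

```go
package main

import (
	"fmt"
	"strings"
)

// topFunction extracts the function name of the first frame from a single
// goroutine dump in the text layout produced by runtime.Stack.
func topFunction(stack string) string {
	lines := strings.Split(stack, "\n")
	if len(lines) < 2 {
		return ""
	}
	// Line 0 is the "goroutine N [state]:" header; line 1 is the top frame.
	frame := strings.TrimSpace(lines[1])
	// Drop the trailing argument list, e.g. "(0xc000010000)".
	if i := strings.LastIndex(frame, "("); i >= 0 {
		frame = frame[:i]
	}
	return frame
}

// isIgnored reports whether the goroutine's top function is in the ignore list.
func isIgnored(stack string, ignore []string) bool {
	top := topFunction(stack)
	for _, name := range ignore {
		if top == name {
			return true
		}
	}
	return false
}

func main() {
	ignore := []string{
		"google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run",
		"google.golang.org/grpc/test/bufconn.(*Listener).Accept",
	}
	// A made-up dump for illustration; the address and path are not real.
	stack := "goroutine 42 [select]:\n" +
		"google.golang.org/grpc/test/bufconn.(*Listener).Accept(0xc000010000)\n" +
		"\t/hypothetical/path/bufconn.go:1 +0x10\n"
	fmt.Println(isIgnored(stack, ignore)) // prints: true
}
```

Because the match is on the fully qualified name of the top frame only, a goroutine that is merely passing through one of these functions deeper in its stack would still be reported as a leak.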

Sequence Diagram

sequenceDiagram
  participant RegisterMetrics
  participant setupChannelzCollector
  participant initGrpcChannelzCollectorLocked
  participant bufconnListener
  participant grpcServer
  participant grpcClient
  participant PrometheusRegistry
  RegisterMetrics->>setupChannelzCollector: invoke
  setupChannelzCollector->>initGrpcChannelzCollectorLocked: initialize once (non-test)
  initGrpcChannelzCollectorLocked->>bufconnListener: create listener
  initGrpcChannelzCollectorLocked->>grpcServer: register channelz service & serve
  initGrpcChannelzCollectorLocked->>grpcClient: dial via bufconn
  initGrpcChannelzCollectorLocked->>PrometheusRegistry: create/register collector with filter

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

approved, lgtm

Suggested reviewers

  • lcwangchao
  • MyonKeminta
  • cfzjywxk

Poem

🐰 I spun a bufconn loop and hummed,

Channelz metrics now softly thrummed.
Goroutines ignored where they play,
Prometheus gathers what we say.
Hop, hop — diagnostics brighten today!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning)
    Explanation: Docstring coverage is 23.08% which is insufficient. The required threshold is 80.00%.
    Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): The title clearly summarizes the main change: adding gRPC diagnostic capabilities (channelz collector) to the metrics package for enhanced observability of network issues.
  • Description check (✅ Passed): The PR description follows the template with required sections: issue number, problem summary, what changed, unit test confirmation, and release note. All critical sections are present and filled out.
  • Linked Issues check (✅ Passed): The PR implements the core objective from #67810 by registering a gRPC channelz Prometheus collector to export internal gRPC metrics, directly addressing the goal of enhancing diagnostic capabilities for gRPC network issues.
  • Out of Scope Changes check (✅ Passed): All code changes are focused on implementing the channelz collector and updating test goleak ignore rules for gRPC internals, which are directly related to the linked issue's objectives.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/metrics/BUILD.bazel`:
- Around line 67-71: The BUILD file contains unresolved git merge markers around
the shard_count setting which breaks Bazel parsing; open the BUILD file and
remove the conflict markers and choose the correct value for shard_count (either
keep shard_count = 4 or shard_count = 8) so the stanza is a single valid
attribute assignment, ensuring the final line uses the chosen shard_count and no
"<<<<<<<", "=======", or ">>>>>>>" tokens remain.

In `@pkg/metrics/metrics_internal_test.go`:
- Around line 22-27: The test file contains unresolved git conflict markers
(<<<<<<<, =======, >>>>>>>) in the import block and later code (around the
import lines that mention "github.com/pingcap/tidb/pkg/util/intest",
"github.com/prometheus/client_golang/prometheus", dto alias) and between lines
~35-173; remove all conflict markers and reconcile the two versions so the
imports and test code form a valid Go file — ensure you keep the correct set of
imports (combine needed ones like intest, prometheus, dto) and the intended test
code paths, delete the marker lines, and run `go test` to verify compilation.

In `@pkg/metrics/metrics.go`:
- Around line 347-375: The file contains leftover git merge markers (<<<<<<<,
=======, >>>>>>>) inside the RegisterMetrics block which prevents compilation;
remove the conflict markers and keep the intended registrations and calls
(ensure prometheus.MustRegister calls for GlobalMemArbitrationDuration,
GlobalMemArbitratorWorkMode, GlobalMemArbitratorQuota,
GlobalMemArbitratorWaitingTask, GlobalMemArbitratorRuntimeMemMagnifi,
GlobalMemArbitratorRootPool, GlobalMemArbitratorEventCounter,
GlobalMemArbitratorTaskExecCounter, TLSVersion, TLSCipher,
IndexLookUpExecutorDuration, IndexLookRowsCounter, IndexLookUpExecutorRowNumber,
IndexLookUpCopTaskCount, StmtSummaryWindowRecordCount,
StmtSummaryWindowEvictedCount are present) and the setupChannelzCollector()
invocation inside RegisterMetrics; remove the merge markers and leave the final
intended code so RegisterMetrics compiles cleanly.

In `@pkg/server/main_test.go`:
- Around line 75-84: The file contains unresolved Git merge conflict markers
(<<<<<<<, =======, >>>>>>>) around the goleak.IgnoreTopFunction entries in
pkg/server/main_test.go; remove the conflict markers and keep the intended set
of ignore entries (the combined/desired goleak.IgnoreTopFunction lines such as
google.golang.org/grpc.(*addrConn).resetTransport,
google.golang.org/grpc.(*ccBalancerWrapper).watcher,
google.golang.org/grpc/internal/transport.(*controlBuffer).get,
google.golang.org/grpc/internal/transport.(*http2Client).keepalive,
google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run,
google.golang.org/grpc/test/bufconn.(*Listener).Accept,
github.com/tikv/client-go/v2/txnkv/transaction.keepAlive) by deleting the
"<<<<<<<", "=======", and ">>>>>>>" markers and leaving the correct
IgnoreTopFunction lines; then run gofmt/goimports or go vet to ensure the test
file is valid Go.
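
All four findings above come down to leftover conflict markers, so a quick scan before pushing would catch them. The sketch below is one way to do that with the standard library; `conflictMarkerLines` is a hypothetical helper, and the sample input is made up (the source does not say which side of the conflict carried which `shard_count` value).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// conflictMarkerLines returns the 1-based numbers of the lines in src that
// start with a git conflict marker ("<<<<<<<", "=======", or ">>>>>>>").
func conflictMarkerLines(src string) []int {
	var hits []int
	sc := bufio.NewScanner(strings.NewReader(src))
	for n := 1; sc.Scan(); n++ {
		line := sc.Text()
		for _, marker := range []string{"<<<<<<<", "=======", ">>>>>>>"} {
			if strings.HasPrefix(line, marker) {
				hits = append(hits, n)
				break
			}
		}
	}
	return hits
}

func main() {
	// Illustrative fragment only; branch labels and values are assumptions.
	src := "go_test(\n" +
		"<<<<<<< HEAD\n" +
		"    shard_count = 4,\n" +
		"=======\n" +
		"    shard_count = 8,\n" +
		">>>>>>> incoming\n" +
		")\n"
	fmt.Println(conflictMarkerLines(src)) // prints: [2 4 6]
}
```

A prefix check like this can flag a legitimate line that begins with `=======`, so treat hits as candidates to inspect rather than proof of a conflict.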

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f0adcc76-18bd-483b-969b-32d039572acb

📥 Commits

Reviewing files that changed from the base of the PR and between bd3a668 and a28d41a.

📒 Files selected for processing (14)
  • br/cmd/br/main_test.go
  • pkg/metrics/BUILD.bazel
  • pkg/metrics/main_test.go
  • pkg/metrics/metrics.go
  • pkg/metrics/metrics_internal_test.go
  • pkg/server/handler/extractorhandler/main_test.go
  • pkg/server/handler/optimizor/main_test.go
  • pkg/server/handler/tests/main_test.go
  • pkg/server/main_test.go
  • pkg/server/tests/commontest/main_test.go
  • pkg/server/tests/cursor/main_test.go
  • pkg/server/tests/main_test.go
  • pkg/server/tests/standby/main_test.go
  • pkg/server/tests/tls/main_test.go

Comment thread pkg/metrics/BUILD.bazel Outdated
Comment thread pkg/metrics/metrics_internal_test.go Outdated
Comment thread pkg/metrics/metrics.go Outdated
Comment thread pkg/server/main_test.go Outdated
Signed-off-by: zyguan <zhongyangguan@gmail.com>
@zyguan
Contributor

zyguan commented May 18, 2026

/unhold

@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) May 18, 2026
@zyguan
Contributor

zyguan commented May 18, 2026

/retest

@codecov

codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release-nextgen-20251011@bd3a668). Learn more about missing BASE report.

Additional details and impacted files
@@                      Coverage Diff                      @@
##             release-nextgen-20251011     #68462   +/-   ##
=============================================================
  Coverage                            ?   71.8902%           
=============================================================
  Files                               ?       1835           
  Lines                               ?     493715           
  Branches                            ?          0           
=============================================================
  Hits                                ?     354933           
  Misses                              ?     115435           
  Partials                            ?      23347           
Flag Coverage Δ
unit 71.8902% <ø> (?)
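
The headline percentage can be checked from the counts in the report. The sketch below assumes coverage is hits divided by hits plus misses plus partials; that formula is an assumption, not something the report states, though the three counts do sum to the reported 493715 lines. `coveragePercent` is a hypothetical helper.

```go
package main

import "fmt"

// coveragePercent returns hits as a percentage of all tracked lines,
// where tracked lines = hits + misses + partials (assumed formula).
func coveragePercent(hits, misses, partials int) float64 {
	total := hits + misses + partials
	if total == 0 {
		return 0
	}
	return 100 * float64(hits) / float64(total)
}

func main() {
	// Counts copied from the report above.
	hits, misses, partials := 354933, 115435, 23347
	fmt.Printf("lines=%d coverage=%.2f%%\n",
		hits+misses+partials, coveragePercent(hits, misses, partials))
	// prints: lines=493715 coverage=71.89%
}
```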

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 56.3493% <0.0000%> (?)
parser ∅ <0.0000%> (?)
br 46.5353% <0.0000%> (?)

@zyguan zyguan requested review from 3pointer, cfzjywxk and nolouch May 18, 2026 10:02
@ti-chi-bot ti-chi-bot Bot added the cherry-pick-approved (Cherry pick PR approved by release team.) and needs-1-more-lgtm (Indicates a PR needs 1 more LGTM.) labels and removed the do-not-merge/cherry-pick-not-approved label May 18, 2026
@ti-chi-bot ti-chi-bot Bot added the lgtm label and removed the needs-1-more-lgtm label (Indicates a PR needs 1 more LGTM.) May 19, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 19, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-18 10:42:10.934601907 +0000 UTC m=+174460.438732582: ☑️ agreed by cfzjywxk.
  • 2026-05-19 01:54:43.389101721 +0000 UTC m=+229212.893232387: ☑️ agreed by 3pointer.

@ti-chi-bot

ti-chi-bot Bot commented May 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3pointer, cfzjywxk, nolouch

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the approved label May 19, 2026
@ti-chi-bot ti-chi-bot Bot merged commit 56385fd into pingcap:release-nextgen-20251011 May 19, 2026
18 checks passed
@ti-chi-bot ti-chi-bot Bot deleted the cherry-pick-67811-to-release-nextgen-20251011 branch May 19, 2026 02:17

Labels

  • approved
  • cherry-pick-approved: Cherry pick PR approved by release team.
  • lgtm
  • ok-to-test: Indicates a PR is ready to be tested.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.
  • type/cherry-pick-for-release-nextgen-20251011

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants