metrics: enhance diagnostic capabilities for gRPC network issues (#67811)#68299
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds an internal gRPC Channelz Prometheus collector in pkg/metrics with singleton initialization, filtering, teardown, and tests; updates BUILD/dependency pins and extends goleak ignore lists across test mains. ChangesgRPC Channelz Collector Integration
Sequence Diagram(s)sequenceDiagram
participant RegisterMetrics
participant BufconnListener
participant GRPCServer
participant GRPCClientConn
participant ChannelzCollector
participant Prometheus
RegisterMetrics->>GRPCServer: setupChannelzCollector()
setupChannelzCollector->>BufconnListener: create bufconn listener
setupChannelzCollector->>GRPCServer: register Channelz service and Start()
setupChannelzCollector->>GRPCClientConn: dial via passthrough with custom dialer
setupChannelzCollector->>ChannelzCollector: tikvcollectors.NewChannelzCollector(opts)
ChannelzCollector->>Prometheus: prometheus.Register(collector)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: zyguan <zhongyangguan@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/metrics/metrics.go`:
- Around line 485-502: The setupChannelzCollector function can leave the
in-process gRPC server/listener/client running if prometheus.MustRegister
panics; after calling initGrpcChannelzCollectorLocked(), add a deferred rollback
that runs unless registration completes: create a local success flag (or
similar) immediately after initGrpcChannelzCollectorLocked() and defer a
function that, when the flag is false, calls the existing cleanup routine used
in tests (e.g., cleanupGrpcChannelzCollectorForTest or a new
cleanupGrpcChannelzCollectorLocked helper) to stop the server, close the
listener/client and reset grpcChannelzCollector state; only set the success flag
(and grpcChannelzCollector.registered = true) after
prometheus.MustRegister(grpcChannelzCollector.collector) returns without panic
so the deferred rollback runs on panic/early return.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 1eb0a809-40c4-4ecf-92fb-783e4d779731
📒 Files selected for processing (14)
br/cmd/br/main_test.gopkg/metrics/BUILD.bazelpkg/metrics/main_test.gopkg/metrics/metrics.gopkg/metrics/metrics_internal_test.gopkg/server/handler/extractorhandler/main_test.gopkg/server/handler/optimizor/main_test.gopkg/server/handler/tests/main_test.gopkg/server/main_test.gopkg/server/tests/commontest/main_test.gopkg/server/tests/cursor/main_test.gopkg/server/tests/main_test.gopkg/server/tests/standby/main_test.gopkg/server/tests/tls/main_test.go
| func setupChannelzCollector() { | ||
| if intest.InTest { | ||
| return | ||
| } | ||
|
|
||
| grpcChannelzCollector.mu.Lock() | ||
| defer grpcChannelzCollector.mu.Unlock() | ||
|
|
||
| if err := initGrpcChannelzCollectorLocked(); err != nil { | ||
| logutil.BgLogger().Warn("setup internal channelz collector failed", zap.Error(err)) | ||
| return | ||
| } | ||
| if grpcChannelzCollector.registered { | ||
| return | ||
| } | ||
| prometheus.MustRegister(grpcChannelzCollector.collector) | ||
| grpcChannelzCollector.registered = true | ||
| } |
There was a problem hiding this comment.
Singleton state can drift if prometheus.MustRegister panics.
initGrpcChannelzCollectorLocked() brings the in-process gRPC server, listener, client, and collector to life, but grpcChannelzCollector.registered is only set to true after prometheus.MustRegister(grpcChannelzCollector.collector) succeeds. MustRegister panics on registration errors (e.g., duplicate descriptor / AlreadyRegisteredError), and the panic will propagate out of RegisterMetrics while the gRPC server goroutine, listener, and client connection are left running with registered == false. There is no defer here to roll them back, and subsequent recovery paths (e.g., re-invoking RegisterMetrics or cleanupGrpcChannelzCollectorForTest) would re-enter while the prior server is still alive.
Consider deferring a rollback if registration doesn't complete, e.g.:
🛡️ Suggested guard
- if err := initGrpcChannelzCollectorLocked(); err != nil {
- logutil.BgLogger().Warn("setup internal channelz collector failed", zap.Error(err))
- return
- }
- if grpcChannelzCollector.registered {
- return
- }
- prometheus.MustRegister(grpcChannelzCollector.collector)
- grpcChannelzCollector.registered = true
+ if err := initGrpcChannelzCollectorLocked(); err != nil {
+ logutil.BgLogger().Warn("setup internal channelz collector failed", zap.Error(err))
+ return
+ }
+ if grpcChannelzCollector.registered {
+ return
+ }
+ if err := prometheus.Register(grpcChannelzCollector.collector); err != nil {
+ logutil.BgLogger().Warn("register internal channelz collector failed", zap.Error(err))
+ stopGrpcChannelzCollectorLocked()
+ return
+ }
+ grpcChannelzCollector.registered = true🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/metrics/metrics.go` around lines 485 - 502, The setupChannelzCollector
function can leave the in-process gRPC server/listener/client running if
prometheus.MustRegister panics; after calling initGrpcChannelzCollectorLocked(),
add a deferred rollback that runs unless registration completes: create a local
success flag (or similar) immediately after initGrpcChannelzCollectorLocked()
and defer a function that, when the flag is false, calls the existing cleanup
routine used in tests (e.g., cleanupGrpcChannelzCollectorForTest or a new
cleanupGrpcChannelzCollectorLocked helper) to stop the server, close the
listener/client and reset grpcChannelzCollector state; only set the success flag
(and grpcChannelzCollector.registered = true) after
prometheus.MustRegister(grpcChannelzCollector.collector) returns without panic
so the deferred rollback runs on panic/early return.
|
/retest |
[LGTM Timeline notifier]Timeline:
|
…rry-pick-67811-to-release-nextgen-202603 Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: cfzjywxk The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## release-nextgen-202603 #68299 +/- ##
===========================================================
Coverage ? 77.5775%
===========================================================
Files ? 1962
Lines ? 544231
Branches ? 0
===========================================================
Hits ? 422201
Misses ? 121177
Partials ? 853
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
This is an automated cherry-pick of #67811
What problem does this PR solve?
Issue Number: close #67810
Problem Summary: ref #67810
What changed and how does it work?
Bump client-go and register channelz collector.
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
New Features
Tests
Chores