Skip to content

PC-822 Implement Heartbeat sender to Dakr#282

Merged
mutantkeyboard merged 6 commits into
mainfrom
heartbeat_dakr
Feb 19, 2026
Merged

PC-822 Implement Heartbeat sender to Dakr#282
mutantkeyboard merged 6 commits into
mainfrom
heartbeat_dakr

Conversation

@mutantkeyboard
Copy link
Copy Markdown
Contributor

@mutantkeyboard mutantkeyboard commented Feb 18, 2026

[Title]

📚 Description of Changes

Provide an overview of your changes and why they're needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.

  • What Changed:
    (Describe the modifications, additions, or removals.)

  • Why This Change:
    (Explain the problem this PR addresses or the improvement it provides.)

  • Affected Components:
    (Which component does this change affect? - put x for all components)

  • Compose

  • K8s

  • Other (please specify)

❓ Motivation and Context

Why is this change required? What problem does it solve?

  • Context:
    (Provide background information or link to related discussions/issues.)

  • Relevant Tasks/Issues:
    (e.g., Fixes: #GitHub Issue)

🔍 Types of Changes

Indicate which type of changes your code introduces (check all that apply):

  • BUGFIX: Non-breaking fix for an issue.
  • NEW FEATURE: Non-breaking addition of functionality.
  • BREAKING CHANGE: Fix or feature that causes existing functionality to not work as expected.
  • ENHANCEMENT: Improvement to existing functionality.
  • CHORE: Changes that do not affect production (e.g., documentation, build tooling, CI).

🔬 QA / Verification Steps

Describe the steps a reviewer should take to verify your changes:

  1. (Step one: e.g., "Run make test to verify all tests pass.")
  2. (Step two: e.g., "Deploy to a Kind cluster with make create-kind && make deploy.")
  3. (Additional steps as needed.)

✅ Global Checklist

Please check all boxes that apply:

  • I have read and followed the CONTRIBUTING guidelines.
  • My code follows the code style of this project.
  • I have updated the documentation as needed.
  • I have added tests that cover my changes.
  • All new and existing tests have passed locally.
  • I have run this code in a local environment to verify functionality.
  • I have considered the security implications of this change.


Summary by Gitar

  • Heartbeat service: New OperatorHealthService RPC for periodic health reporting to Dakr, enabling real-time operator visibility
    • Sends heartbeat every 60 seconds with operator version, uptime, and component health status
  • Health reporting logic: Added heartbeat builder that aggregates component health into overall status (worst-case severity ordering: UNSPECIFIED < HEALTHY < DEGRADED < UNHEALTHY)
  • Dakr client integration: Extended RealDakrClient with operatorHealthClient to call ReportHealth RPC
  • Controller integration: Modified runHealthReporting to send initial heartbeat on startup and every 60s thereafter; RPC failures logged but don't crash operator
  • Environment configuration: Reads cluster ID from CLUSTER_ID env var; defaults to "unknown" if not set
  • CPU throttle monitoring: Added CpuThrottledFraction metric to track CFS bandwidth throttling (merged from main)
    • Queries Prometheus for container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total with 5m rate

This will update automatically on new commits.

Comment thread internal/controller/custom.go Outdated
Comment thread internal/controller/custom.go Outdated
Comment thread internal/health/heartbeat_test.go Outdated
Comment thread internal/health/heartbeat.go
@mutantkeyboard mutantkeyboard enabled auto-merge (squash) February 19, 2026 14:11
Comment thread internal/controller/custom.go
Comment thread internal/transport/interface.go
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Feb 19, 2026

Code Review 👍 Approved with suggestions 4 resolved / 6 findings

Clean heartbeat implementation with good separation of concerns. Two minor suggestions: ensure cluster ID is reliably resolved (especially in PAT-exchange deployments) and complete the interface doc comment.

💡 Edge Case: Heartbeat cluster ID may always be "unknown" in PAT-exchange flow

📄 internal/controller/custom.go:242

getClusterID() only reads from the CLUSTER_ID environment variable (falling back to "unknown"). However, when the operator bootstraps via PAT token exchange in initializeTelemetryComponents, the real cluster ID is returned by ExchangePATForClusterToken at line 316 but is never stored anywhere accessible to getClusterID().

If deployments that use the PAT exchange flow don't also set CLUSTER_ID as an env var, every heartbeat will report cluster_id: "unknown", making it impossible for Dakr to correlate heartbeats with the correct cluster.

Consider caching the cluster ID on the controller (e.g., from the PAT exchange result or from the context cluster_id value) and using it as the primary source in getClusterID().

Suggested fix
// getClusterID returns the cluster ID from environment configuration.
func (c *EnvBasedController) getClusterID() string {
	if c.clusterID != "" {
		return c.clusterID
	}
	if id := os.Getenv("CLUSTER_ID"); id != "" {
		return id
	}
	return "unknown"
}
💡 Quality: Incomplete doc comment on ReportHealth interface method

📄 internal/transport/interface.go:32

The ReportHealth method in the DakrClient interface has a bare // ReportHealth comment without any description. All other methods in the interface have descriptive comments explaining what they do. For consistency and to help future implementors, add a brief description.

Suggested fix
	// ReportHealth sends the operator health heartbeat to the Dakr service.
	ReportHealth(ctx context.Context, req *gen.ReportHealthRequest) error
✅ 4 resolved
Edge Case: No initial heartbeat sent until first 60s tick fires

📄 internal/controller/custom.go:202
The runHealthReporting goroutine uses time.NewTicker(60 * time.Second) which means the first heartbeat is delayed by a full 60 seconds after the operator starts. If Dakr uses heartbeats to detect operator availability, there's a 60-second blind spot after startup where the operator appears absent.

Consider sending an initial heartbeat immediately before entering the ticker loop, similar to how doReconcile is called immediately before runPeriodicReconciliation.

Quality: Double BuildReport call creates inconsistent log vs. heartbeat

📄 internal/controller/custom.go:208 📄 internal/controller/custom.go:216
runHealthReporting calls BuildReport() on line 208 for local logging, then BuildHeartbeatRequest on line 216 calls BuildReport() again internally. These are two separate lock acquisitions, so the health state logged locally could differ from the state sent to Dakr if a status update occurs between the two calls.

Consider building the report once and passing it to both consumers, or having BuildHeartbeatRequest accept the report map directly instead of the HealthManager.

Quality: Unused mockDakrClient in test file

📄 internal/health/heartbeat_test.go:13
The mockDakrClient struct (lines 14-17) and its ReportHealth method (lines 19-22) are defined but never used in any test. Both tests only call BuildHeartbeatRequest and verify the returned protobuf request. This dead code adds confusion about test coverage — it suggests there should be tests for the RPC call path but they weren't written.

Either remove the mock or add a test that exercises the ReportHealth call path through the mock.

Quality: worstStatus relies on implicit proto enum ordering

📄 internal/health/heartbeat.go:25
The worstStatus function uses numeric comparison (a > b) to determine which health status is more severe. This works because the proto enum values happen to be ordered by severity (UNSPECIFIED=0, HEALTHY=1, DEGRADED=2, UNHEALTHY=3), but this coupling is implicit and fragile.

If someone adds a new health status enum value (e.g., HEALTH_STATUS_UNKNOWN = 4), the comparison would silently treat it as the worst status. Consider adding a comment documenting this invariant, or using an explicit severity map for robustness.

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@mutantkeyboard mutantkeyboard merged commit 4d621c0 into main Feb 19, 2026
26 checks passed
@mutantkeyboard mutantkeyboard deleted the heartbeat_dakr branch February 19, 2026 14:25
Parthiba-Hazra pushed a commit that referenced this pull request May 5, 2026
* RPC for Dakr

* Gitar fixes

* ci: retrigger pipeline

* Retrigger vercel
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants