Skip to content

PC-820 Add component-level health reporting#274

Merged
mutantkeyboard merged 5 commits into
mainfrom
health_reporting
Feb 12, 2026
Merged

PC-820 Add component-level health reporting#274
mutantkeyboard merged 5 commits into
mainfrom
health_reporting

Conversation

@mutantkeyboard
Copy link
Copy Markdown
Contributor

@mutantkeyboard mutantkeyboard commented Feb 12, 2026

[Title]

📚 Description of Changes

Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.

  • What Changed:
    (Describe the modifications, additions, or removals.)

  • Why This Change:
    (Explain the problem this PR addresses or the improvement it provides.)

  • Affected Components:
    (Which component does this change affect? - put x for all components)

  • Compose

  • K8s

  • Other (please specify)

❓ Motivation and Context

Why is this change required? What problem does it solve?

🔍 Types of Changes

Indicate which type of changes your code introduces (check all that apply):

  • BUGFIX: Non-breaking fix for an issue.
  • NEW FEATURE: Non-breaking addition of functionality.
  • BREAKING CHANGE: Fix or feature that causes existing functionality to not work as expected.
  • ENHANCEMENT: Improvement to existing functionality.
  • CHORE: Changes that do not affect production (e.g., documentation, build tooling, CI).

🔬 QA / Verification Steps

Describe the steps a reviewer should take to verify your changes:

  1. (Step one: e.g., "Run make test to verify all tests pass.")
  2. (Step two: e.g., "Deploy to a Kind cluster with make create-kind && make deploy.")
  3. (Additional steps as needed.)

✅ Global Checklist

Please check all boxes that apply:

  • I have read and followed the CONTRIBUTING guidelines.
  • My code follows the code style of this project.
  • I have updated the documentation as needed.
  • I have added tests that cover my changes.
  • All new and existing tests have passed locally.
  • I have run this code in a local environment to verify functionality.
  • I have considered the security implications of this change.

Comment thread internal/collector/historical_metrics_collector.go Outdated
@mutantkeyboard mutantkeyboard marked this pull request as ready for review February 12, 2026 17:38
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Feb 12, 2026

Code Review 👍 Approved with suggestions 2 resolved / 4 findings

Clean health reporting integration with consistent nil-safe patterns. Two minor previous findings remain unresolved: the unusual "workload ->" metadata key and FetchPercentiles unconditionally reporting healthy even when inner queries fail (masking degraded status).

💡 Quality: Unusual metadata key "workload ->" with arrow symbol

📄 internal/collector/historical_metrics_collector.go:67

The metadata key "workload ->" uses an arrow symbol which is inconsistent with all other metadata keys in this PR (e.g., "collector_count", "error", "capacity", "circuit_breaker"). This looks like a leftover from debugging. Using a non-standard key makes it harder to parse metadata programmatically and is inconsistent with the snake_case convention used everywhere else.

Suggested fix
	c.updateHealthStatus(health.HealthStatusHealthy, "Prometheus queries succeeding", map[string]string{"workload": workload.WorkloadName})
💡 Edge Case: FetchPercentiles reports healthy even when queries fail

📄 internal/collector/historical_metrics_collector.go:67

FetchPercentiles always reports HealthStatusHealthy on line 67 after the container loop, even when inner queryScalar calls failed and reported HealthStatusDegraded. The degraded status from queryScalar is immediately overwritten by the healthy status in FetchPercentiles, so partial query failures are masked. Since fetchContainerPercentiles intentionally swallows individual query errors (logs them at V(1) level), the health update creates an inconsistent picture — the last status written wins, and it's always "healthy" if the function returns without error.

Consider only reporting healthy if no inner failures occurred, or removing the healthy update from FetchPercentiles since queryScalar already handles degraded status reporting.

✅ 2 resolved
Bug: Typo in field name: healthMmanager (double 'm')

📄 internal/collector/historical_metrics_collector.go:38
The HistoricalMetricsCollector struct field is named healthMmanager with a double 'm', while the constructor parameter is correctly named healthManager. This is inconsistent naming that will confuse future developers and makes the code harder to maintain. While the code currently compiles and works (Go doesn't require field/parameter names to match), this typo should be fixed now before it propagates to other code that references this field.

  • ✅ Health updates called while holding TelemetrySender's mutex
Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@mutantkeyboard mutantkeyboard merged commit 5ea528c into main Feb 12, 2026
25 checks passed
@mutantkeyboard mutantkeyboard deleted the health_reporting branch February 12, 2026 18:21
Parthiba-Hazra pushed a commit that referenced this pull request May 5, 2026
* Work in progress - need to fix and add tests

* Add tests

* Fix typo

* Add stress tests for local kind cluster

* Add doc comments to unexported health status helper functions

---------

Co-authored-by: Antonio Nesic <antonio.nesic@devzero.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants