-
Notifications
You must be signed in to change notification settings - Fork 9
CP-32193: Add Istio integration with runtime validation #605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b064029 to
953272b
Compare
dmepham
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor questions, but lgtm!
| } | ||
| } | ||
|
|
||
| // If we have more than one distinct region, cross-cluster LB is active |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this miss cross-clusters in the same region?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! I don't think so; it's just appending to an array not setting values in a map, but I'll spin up a new cluster to check...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. You're right that the region-based approach would miss same-region clusters.
Important context: This detection is purely diagnostic/informational - the actual traffic fencing protection comes from the DestinationRule that pins traffic using the topology.istio.io/cluster label. Even if detection failed, traffic would still be correctly fenced. That said, misleading log messages aren't great.
Upon further examination, I found a more reliable detection method: comparing the local-cluster subset endpoints against all endpoints in the Envoy output:
# Local-cluster subset (our DestinationRule):
outbound|80|local-cluster|aggregator.cza.svc.cluster.local::10.84.0.10:8080
# All endpoints (includes remote clusters):
outbound|80||aggregator.cza.svc.cluster.local::10.84.0.10:8080
outbound|80||aggregator.cza.svc.cluster.local::10.36.6.3:8080 <- remote
If there are IPs in the all-endpoints list that aren't in the local-cluster subset, those are remote endpoints from other clusters - regardless of their region. This approach works because it uses the same mechanism as the DestinationRule itself.
I've updated the code to use this subset-based detection with comprehensive tests, including TestParseEnvoyResponse_SameRegion_DifferentClusters that specifically validates this scenario.
This is now tested with a second GKE cluster in the same region, and everything seems to work as expected.
29bd116 to
eb6f3d3
Compare
aa907a7 to
1c85086
Compare
Istio service mesh deployments create two challenges for the CloudZero Agent:
(1) strict mTLS can conflict with the webhook's self-signed TLS certificates,
and (2) cross-cluster load balancing can route metrics to the wrong aggregator,
corrupting cost attribution data. This change adds automatic Istio detection,
runtime validation, and comprehensive documentation.
Functional Change:
Before: Istio support required manual configuration via suppressIstioAnnotations.
No runtime validation existed for cross-cluster load balancing risks. Documentation
was incomplete and didn't explain the underlying problems.
After: The chart auto-detects Istio via CRD presence, applies port exclusion
annotations automatically, and the validator detects both sidecar and ambient
modes at runtime, warning when cluster ID configuration is incorrect or missing.
Solution:
1. Added `integrations.istio.enabled` (null/true/false) for auto-detect or explicit
control, replacing the old `suppressIstioAnnotations` approach
2. Added `integrations.istio.clusterID` for traffic fencing in multicluster meshes,
with fallback to `clusterName` for automatic configuration in sidecar mode
3. Created new Istio diagnostic provider (`app/domain/diagnostic/istio/`) that:
- Detects sidecar mode by querying localhost:15000 (Envoy admin API)
- Detects ambient mode via Downward API (ambient.istio.io/redirection annotation)
- In sidecar mode, validates effective cluster ID (clusterID || clusterName) matches
what Istio reports, ensuring the DestinationRule will work correctly
- Requires explicit clusterID in ambient mode (cannot validate without sidecar)
4. Added Helm helpers for Istio detection and cluster ID resolution with detailed
documentation explaining the port number differences (443 vs 8443)
5. Completely rewrote helm/docs/istio.md to explain the two problems (mTLS conflict
and cross-cluster LB) with Mermaid diagrams, solutions, and configuration guide
Validation:
- Added comprehensive unit tests for Istio diagnostic provider (istio_test.go)
covering sidecar mode, ambient mode, cluster ID validation, and edge cases
- Added Helm unittest suite (istio_integration_test.yaml) with 33 tests covering
DestinationRule/VirtualService generation, port exclusion annotations, API
version selection, and cluster ID configuration
- Added schema validation tests for integrations.istio.enabled and clusterID
- Added Istio template test to verify full manifest generation with Istio enabled
- Manual testing pending on Istio-enabled clusters (sidecar and ambient modes)
Istio service mesh deployments create two challenges for the CloudZero Agent: (1) strict mTLS can conflict with the webhook's self-signed TLS certificates, and (2) cross-cluster load balancing can route metrics to the wrong aggregator, corrupting cost attribution data. This change adds automatic Istio detection, runtime validation, and comprehensive documentation.
Functional Change:
Before: Istio support required manual configuration via suppressIstioAnnotations. No runtime validation existed for cross-cluster load balancing risks. Documentation was incomplete and didn't explain the underlying problems.
After: The chart auto-detects Istio via CRD presence, applies port exclusion annotations automatically, and the validator detects both sidecar and ambient modes at runtime, warning when cluster ID configuration is incorrect or missing.
Solution:
Added
integrations.istio.enabled(null/true/false) for auto-detect or explicit control, replacing the oldsuppressIstioAnnotationsapproachAdded
integrations.istio.clusterIDfor traffic fencing in multicluster meshes, with fallback toclusterNamefor automatic configuration in sidecar modeCreated new Istio diagnostic provider (
app/domain/diagnostic/istio/) that:Added Helm helpers for Istio detection and cluster ID resolution with detailed documentation explaining the port number differences (443 vs 8443)
Completely rewrote helm/docs/istio.md to explain the two problems (mTLS conflict and cross-cluster LB) with Mermaid diagrams, solutions, and configuration guide
Validation: