SUSM-94: Handle Same-Node K8s Service #35644
Conversation
Uncompressed package size comparison
Comparison with ancestor · Diff per package
Decision: ✅ Passed

Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM: dda inv aws.create-vm --pipeline-id=62203283 --os-family=ubuntu
Note: This applies to commit ec797ea

Static quality checks ✅
Please find below the results from static quality gates.
Successful checks · Info

Regression Detector
Regression Detector Results · Metrics dashboard
Baseline: bd316e1
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_idle | memory utilization | +0.58 | [+0.53, +0.64] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.38 | [-0.38, +1.14] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.31 | [-2.51, +3.13] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_logs | memory utilization | +0.20 | [+0.04, +0.36] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | +0.07 | [-0.40, +0.54] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.01 | [-0.08, +0.09] | 1 | Logs bounds checks dashboard |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.29, +0.30] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.00 | [-0.84, +0.85] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.01, +0.02] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.01 | [-0.71, +0.70] | 1 | Logs |
| ➖ | file_to_blackhole_300ms_latency | egress throughput | -0.01 | [-0.64, +0.63] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | -0.02 | [-0.83, +0.79] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | -0.02 | [-0.83, +0.79] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.04 | [-0.18, +0.11] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.07 | [-0.86, +0.72] | 1 | Logs |
| ➖ | otlp_ingest_traces | memory utilization | -0.12 | [-0.50, +0.27] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.24 | [-0.39, -0.09] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.38 | [-0.45, -0.31] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.60 | [-0.64, -0.55] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -1.09 | [-1.93, -0.24] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
Static quality checks ❌
Please find below the results from static quality gates.
Error: Gate failure full details
Static quality gates prevent the PR from merging! You can check the static quality gates confluence page for guidance. We also have a toolbox page available that lists tools useful for debugging the size increase.
Successful checks · Info

Static quality checks ✅
Please find below the results from static quality gates.
Successful checks · Info
    // traceLogConnections logs detailed connection information to help investigate NPM/USM
    // customer issues by providing visibility into network connections and their metadata.
    func traceLogConnections(id string, cs *network.Connections) {
I think this can result in an enormous number of logs. I have also not really had a customer send us trace level logs; usually we ask for debug level logs.
There are already endpoints like /connections and others for USM that we can use to get similar info. We can add them to the agent flare if we need them from customers; this seems like a better alternative to me than logs.
- "I think this can result in an enormous number of logs" – This should only be printed when trace logs are enabled, meaning we’re actively investigating a customer issue and need detailed context. I'm also seeing this, which suggests we may already be emitting a high volume of trace logs.
- "I have also not really had a customer send us trace level logs; usually we ask for debug level logs" – Why is that? It makes sense to begin with debug logs, but in cases like this one, where more granularity is required, escalating to trace level should be an option.
- "There are already endpoints like /connections and others for USM that we can use to get similar info" – That’s valid, but logs offer full historical context, while /connections only reflects the most recent 30 seconds or so. Logs also simplify correlation: they allow us to directly compare the NPM connection tuple with the USM connection key and quickly identify mismatches, without needing to parse aggregated HTTP data.
I am not sure how useful this will be even if we somehow manage to get these logs from the customer. This could be a pretty large number of logs and parsing through them would be challenging to say the least. The endpoints will give you JSON, which is parse-able and query-able with tools like jq. Another problem would be that this obscures other trace logs since the context from this one log is going to be much larger than other trace logs.
Content like this is more appropriate for the flare. We were already considering adding this, but de-prioritized it; maybe this can be incentive enough to resurrect it.
The issue with the /connections endpoint is that it doesn’t include all the USM HTTP connection key details—it relies on writeConnections, which is where the underlying problem was hidden. As a result, we wouldn’t observe the issue when using this endpoint.
We could try using the HTTP debug endpoint instead, but it only returns the USM portion of the HTTP data, without NPM context, and it consumes all connections in the process - making it unusable for our needs.
So at this point, I’m not exactly sure how the endpoints could help us retrieve the necessary information.
A combination of the two doesn't give the same results? You could also modify or add another endpoint. Another option could be to do more limited trace logs at the point where you think the problem can happen.
@DanielLavie, eventually, will we need a snapshot of the logs (say, the last iteration) or a massive dump (i.e. an hour)?
    }

    // Log all connections for debugging
    traceLog("Found %d total connections", len(cs.Conns))
What is the difference between this log and line 394?
| traceLog("Connection %d:", i) | ||
| traceLog(" Source: %s:%d", conn.Source.String(), conn.SPort) | ||
| traceLog(" Destination: %s:%d", conn.Dest.String(), conn.DPort) | ||
| traceLog(" Protocol Stack: %+v", conn.ProtocolStack) |
Any reason not to log all the data in a single traceLog call?
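For what it's worth, a minimal sketch of that suggestion, reusing the conn and i variables from the diff above (one formatted line per connection instead of one call per field):

```go
// One traceLog call per connection keeps related fields on a single
// line, which also makes the output easier to grep.
traceLog("Connection %d: source=%s:%d dest=%s:%d protocols=%+v direction=%s",
	i,
	conn.Source.String(), conn.SPort,
	conn.Dest.String(), conn.DPort,
	conn.ProtocolStack,
	conn.Direction)
```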
    } else {
        traceLog(" No IP Translation")
    }
    traceLog(" Direction: %s", conn.Direction)
    servicePort := uint16(7778)
    serverPort := uint16(7777)

    // Create both connections exactly as seen from SUSM-94 reproduction environment
Not sure the comment referencing SUSM-94 is relevant; let's explain the purpose without referencing the ticket number.
    httpStats := http.NewRequestStats()
    httpStats.AddRequest(200, 15.0, 0, nil)
    httpStats.AddRequest(200, 15.0, 0, nil)
    httpStats.AddRequest(200, 15.0, 0, nil)
    httpStats.AddRequest(200, 15.0, 0, nil)
Maybe create a loop for that, with a const for the request count.
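Something like this, as a sketch (the constant name is illustrative):

```go
const numRequests = 4 // illustrative constant name

httpStats := http.NewRequestStats()
for i := 0; i < numRequests; i++ {
	httpStats.AddRequest(200, 15.0, 0, nil)
}
```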
    postNATKey := http.NewKey(
        clientIP,   // Client is still the source in the HTTP key (172.29.161.37)
        serverIP,   // Server is the destination (172.29.191.94)
        clientPort, // Client port (53792)
        serverPort, // Server port (7777)
        []byte("/delay/5"),
        true,
        http.MethodGet,
    )
If the only difference between the preNATKey and the postNATKey is the serverIP, let's structure this more cleanly: build the http.NewKey arguments once and vary just that field.
Plus, IMO the comments are not really relevant.
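One possible shape for that, as a sketch: serviceIP is a hypothetical name for the pre-NAT destination, and this assumes NewKey returns an http.Key and takes util.Address IP arguments as in the diff above; the closure simply centralizes the arguments both keys share.

```go
// newTestKey centralizes the shared arguments; only the destination IP
// differs between the pre- and post-NAT keys.
newTestKey := func(dstIP util.Address) http.Key {
	return http.NewKey(clientIP, dstIP, clientPort, serverPort,
		[]byte("/delay/5"), true, http.MethodGet)
}

preNATKey := newTestKey(serviceIP) // serviceIP: hypothetical pre-NAT destination
postNATKey := newTestKey(serverIP)
```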
    httpEncoder := newHTTPEncoder(payload.HTTP)

    // Test post-NAT connection (server → client) - should have HTTP data
    aggregations, _, _ := getHTTPAggregations(t, httpEncoder, payload.Conns[1])
Shouldn't you validate that len(payload.Conns) > 1 before indexing payload.Conns[1]?
    // Test post-NAT connection (server → client) - should have HTTP data
    aggregations, _, _ := getHTTPAggregations(t, httpEncoder, payload.Conns[1])
    assert.NotNil(aggregations)
    assert.Equal("/delay/5", aggregations.EndpointAggregations[0].Path)
Shouldn't you validate that len(aggregations.EndpointAggregations) > 0 before indexing it?
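Both suggested guards could look roughly like this, assuming testify's require package is available in the test and using the package-level helpers (the original uses the assert.New(t) style; require aborts the test before the out-of-range index is reached):

```go
// Ensure the payload actually contains the post-NAT connection before
// indexing payload.Conns[1].
require.Greater(t, len(payload.Conns), 1)

aggregations, _, _ := getHTTPAggregations(t, httpEncoder, payload.Conns[1])
require.NotNil(t, aggregations)

// Ensure there is at least one endpoint aggregation before reading [0].
require.NotEmpty(t, aggregations.EndpointAggregations)
assert.Equal(t, "/delay/5", aggregations.EndpointAggregations[0].Path)
```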
@DanielLavie, I understand the problem, but I'm not sure how the proposed change solves it. The description isn't fully clear to me on how the fix addresses the issue. I think the description needs to be updated.
I'm surprised (maybe disappointed) that not a single existing test was broken by your change.
How can we guarantee any other change/regression will be detected?
Can you try to "tweak" the code to ensure your new test covers these cases?
TestKubernetesLocalNATScenario initially failed; it passed only once I made the modification in this PR.
    func TestKubernetesLocalNATScenario(t *testing.T) {
        if runtime.GOOS == "windows" || os.Getenv("CI") == "true" {
            t.Skip("Skipping test on Windows or CI")
        }
Consider moving the test to a new file with a linux go:build constraint on it (as this test might otherwise run on macOS). See the sketch below.
Also, why do we skip CI?
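A minimal sketch of the build-constraint suggestion (file and package names are illustrative):

```go
//go:build linux

package usm_test // illustrative; match the existing test's package

import "testing"

// With the build constraint above, this file is only compiled on Linux,
// so the runtime.GOOS == "windows" guard inside the test can be dropped.
func TestKubernetesLocalNATScenario(t *testing.T) {
	t.Log("compiled and run only on Linux")
}
```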
    // Callback 1: (client, server)
    if f(types.NewConnectionKey(clientIP, serverIP, clientPort, serverPort)) {
I suggest adding a comment here explaining the reason for the change, to help prevent future breaking changes.
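For example, something along these lines (the wording is a reviewer-side suggestion, not the author's):

```go
// Callback 1: (client, server). Try the direct, non-translated tuple
// first: with same-node K8s services, the pre-NAT and post-NAT
// connections can otherwise both match the same USM key, and the PID
// filtering would then drop one of them entirely.
if f(types.NewConnectionKey(clientIP, serverIP, clientPort, serverPort)) {
```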
What does this PR do?
- Added TestKubernetesLocalNATScenario, which reproduces and validates the fix for SUSM-94

Motivation
In SUSM-94, a customer reported an issue where in a K8s environment, when a client communicates with a K8s service that NATs traffic to a server on the same node, USM suffered from accuracy loss. We reproduced the issue locally on a K8s cluster and noticed the following issue:
NPM captures 2 connections:
- Connection 1 (pre-NAT, client's perspective)
- Connection 2 (post-NAT, server's perspective)

USM captures 2 connection keys:
Before this change, the same USM aggregation (client -> server) would match both NPM connections because:
Because we have a PID filtering mechanism that prevents matching the same USM connection to different PIDs, we would drop the second matching entirely, resulting in accuracy loss.
The change introduced in this PR fixes this issue by prioritizing direct connection matches (Source and Dest fields) before NAT translations. This means:
This ensures both connections retain their appropriate USM data.
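As a rough illustration of that ordering (names like forEachConnectionKey and the IPTranslation fields are assumptions based on the snippets quoted in the review, not the actual implementation):

```go
// forEachConnectionKey invokes f with candidate USM keys for a
// connection, stopping at the first key for which USM data matched.
func forEachConnectionKey(conn network.ConnectionStats, f func(types.ConnectionKey) bool) {
	// 1. Direct match on the connection's own Source/Dest fields first.
	if f(types.NewConnectionKey(conn.Source, conn.Dest, conn.SPort, conn.DPort)) {
		return
	}
	// 2. Only if the direct tuple matched nothing, fall back to the NAT
	// translation, so the pre- and post-NAT connections each claim their
	// own USM aggregation instead of competing for the same one.
	if tr := conn.IPTranslation; tr != nil {
		f(types.NewConnectionKey(tr.ReplDstIP, tr.ReplSrcIP, tr.ReplDstPort, tr.ReplSrcPort))
	}
}
```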
Describe how you validated your changes
- Added TestKubernetesLocalNATScenario, which exactly reproduces the customer scenario from SUSM-94

Possible Drawbacks / Trade-offs
Additional Notes