SUSM-94: Handle Same-Node K8s Service #35644
Conversation
Uncompressed package size comparison
Comparison with ancestor · Diff per package
Decision: ✅ Passed

Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM: dda inv aws.create-vm --pipeline-id=62203283 --os-family=ubuntu
Note: This applies to commit ec797ea

Static quality checks ✅
Please find below the results from static quality gates.
Successful checks · Info

Regression Detector
Regression Detector Results · Metrics dashboard
Baseline: bd316e1
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_idle | memory utilization | +0.58 | [+0.53, +0.64] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.38 | [-0.38, +1.14] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.31 | [-2.51, +3.13] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_logs | memory utilization | +0.20 | [+0.04, +0.36] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | +0.07 | [-0.40, +0.54] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.01 | [-0.08, +0.09] | 1 | Logs bounds checks dashboard |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.29, +0.30] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.00 | [-0.84, +0.85] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.01, +0.02] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.01 | [-0.71, +0.70] | 1 | Logs |
| ➖ | file_to_blackhole_300ms_latency | egress throughput | -0.01 | [-0.64, +0.63] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | -0.02 | [-0.83, +0.79] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | -0.02 | [-0.83, +0.79] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.04 | [-0.18, +0.11] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.07 | [-0.86, +0.72] | 1 | Logs |
| ➖ | otlp_ingest_traces | memory utilization | -0.12 | [-0.50, +0.27] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.24 | [-0.39, -0.09] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.38 | [-0.45, -0.31] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.60 | [-0.64, -0.55] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -1.09 | [-1.93, -0.24] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
Static quality checks ❌
Please find below the results from static quality gates.
Error: Gate failure full details
Static quality gates prevent the PR from merging! You can check the static quality gates confluence page for guidance. We also have a toolbox page available that lists tools useful for debugging the size increase.
Successful checks · Info

Static quality checks ✅
Please find below the results from static quality gates.
Successful checks · Info
    // traceLogConnections logs detailed connection information to help investigate NPM/USM
    // customer issues by providing visibility into network connections and their metadata.
    func traceLogConnections(id string, cs *network.Connections) {
I think this can result in an enormous number of logs. I have also not really had a customer send us trace level logs; usually we ask for debug level logs.
There are already endpoints like /connections and others for USM that we can use to get similar info. We can add them to the agent flare if we need them from customers; this seems like a better alternative to me than logs.
- "I think this can result in an enormous number of logs" – This should only be printed when trace logs are enabled, meaning we’re actively investigating a customer issue and need detailed context. I'm also seeing this, which suggests we may already be emitting a high volume of trace logs.
- "I have also not really had a customer send us trace level logs; usually we ask for debug level logs" – Why is that? It makes sense to begin with debug logs, but in cases like this one, where more granularity is required, escalating to trace level should be an option.
- "There are already endpoints like /connections and others for USM that we can use to get similar info" – That’s valid, but logs offer full historical context, while /connections only reflects the most recent 30 seconds or so. Logs also simplify correlation: they allow us to directly compare the NPM connection tuple with the USM connection key and quickly identify mismatches, without needing to parse aggregated HTTP data.
I am not sure how useful this will be even if we somehow manage to get these logs from the customer. This could be a pretty large number of logs and parsing through them would be challenging to say the least. The endpoints will give you JSON, which is parse-able and query-able with tools like jq. Another problem would be that this obscures other trace logs since the context from this one log is going to be much larger than other trace logs.
Content like this is more appropriate for the flare. We were already considering adding this, but de-prioritized it; maybe this can be incentive enough to resurrect it.
The issue with the /connections endpoint is that it doesn’t include all the USM HTTP connection key details—it relies on writeConnections, which is where the underlying problem was hidden. As a result, we wouldn’t observe the issue when using this endpoint.
We could try using the HTTP debug endpoint instead, but it only returns the USM portion of the HTTP data, without NPM context, and it consumes all connections in the process - making it unusable for our needs.
So at this point, I’m not exactly sure how the endpoints could help us retrieve the necessary information.
A combination of the two doesn't give the same results? You could also modify or add another endpoint. Another option could be to do more limited trace logs at the point where you think the problem can happen.
@DanielLavie, eventually, will we need a snapshot of the logs (say, the last iteration) or a massive dump (i.e. an hour)?
    }

    // Log all connections for debugging
    traceLog("Found %d total connections", len(cs.Conns))
What is the difference between this log and line 394?
| traceLog("Connection %d:", i) | ||
| traceLog(" Source: %s:%d", conn.Source.String(), conn.SPort) | ||
| traceLog(" Destination: %s:%d", conn.Dest.String(), conn.DPort) | ||
| traceLog(" Protocol Stack: %+v", conn.ProtocolStack) |
Any reason not to log all the data in a single traceLog call?
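For what it's worth, a minimal sketch of that suggestion, reusing the conn and i variables from the diff above (one formatted line per connection instead of one call per field):

```go
// One traceLog call per connection keeps related fields on a single
// line, which also makes the output easier to grep.
traceLog("Connection %d: source=%s:%d dest=%s:%d protocols=%+v direction=%s",
	i,
	conn.Source.String(), conn.SPort,
	conn.Dest.String(), conn.DPort,
	conn.ProtocolStack,
	conn.Direction)
```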
    } else {
        traceLog(" No IP Translation")
    }
    traceLog(" Direction: %s", conn.Direction)
    servicePort := uint16(7778)
    serverPort := uint16(7777)

    // Create both connections exactly as seen from SUSM-94 reproduction environment
Not sure the comment referencing SUSM-94 is relevant; let's explain the purpose without referencing the ticket number.
    httpStats := http.NewRequestStats()
    httpStats.AddRequest(200, 15.0, 0, nil)
    httpStats.AddRequest(200, 15.0, 0, nil)
    httpStats.AddRequest(200, 15.0, 0, nil)
    httpStats.AddRequest(200, 15.0, 0, nil)
Maybe create a loop for that, with a const for the request count.
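Something like this, as a sketch (the constant name is illustrative):

```go
const numRequests = 4 // illustrative constant name

httpStats := http.NewRequestStats()
for i := 0; i < numRequests; i++ {
	httpStats.AddRequest(200, 15.0, 0, nil)
}
```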
    postNATKey := http.NewKey(
        clientIP,   // Client is still the source in the HTTP key (172.29.161.37)
        serverIP,   // Server is the destination (172.29.191.94)
        clientPort, // Client port (53792)
        serverPort, // Server port (7777)
        []byte("/delay/5"),
        true,
        http.MethodGet,
    )
If the only difference between the preNATKey and the postNATKey is the serverIP, let's structure this more cleanly: build the http.NewKey arguments once and vary just that field.
Plus, IMO the comments are not really relevant.
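One possible shape for that, as a sketch: serviceIP is a hypothetical name for the pre-NAT destination, and this assumes NewKey returns an http.Key and takes util.Address IP arguments as in the diff above; the closure simply centralizes the arguments both keys share.

```go
// newTestKey centralizes the shared arguments; only the destination IP
// differs between the pre- and post-NAT keys.
newTestKey := func(dstIP util.Address) http.Key {
	return http.NewKey(clientIP, dstIP, clientPort, serverPort,
		[]byte("/delay/5"), true, http.MethodGet)
}

preNATKey := newTestKey(serviceIP) // serviceIP: hypothetical pre-NAT destination
postNATKey := newTestKey(serverIP)
```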
    httpEncoder := newHTTPEncoder(payload.HTTP)

    // Test post-NAT connection (server → client) - should have HTTP data
    aggregations, _, _ := getHTTPAggregations(t, httpEncoder, payload.Conns[1])
Shouldn't you validate that len(payload.Conns) > 1 before indexing payload.Conns[1]?
    // Test post-NAT connection (server → client) - should have HTTP data
    aggregations, _, _ := getHTTPAggregations(t, httpEncoder, payload.Conns[1])
    assert.NotNil(aggregations)
    assert.Equal("/delay/5", aggregations.EndpointAggregations[0].Path)
Shouldn't you validate that len(aggregations.EndpointAggregations) > 0 before indexing it?
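Both suggested guards could look roughly like this, assuming testify's require package is available in the test and using the package-level helpers (the original uses the assert.New(t) style; require aborts the test before the out-of-range index is reached):

```go
// Ensure the payload actually contains the post-NAT connection before
// indexing payload.Conns[1].
require.Greater(t, len(payload.Conns), 1)

aggregations, _, _ := getHTTPAggregations(t, httpEncoder, payload.Conns[1])
require.NotNil(t, aggregations)

// Ensure there is at least one endpoint aggregation before reading [0].
require.NotEmpty(t, aggregations.EndpointAggregations)
assert.Equal(t, "/delay/5", aggregations.EndpointAggregations[0].Path)
```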
@DanielLavie, I understand the problem, but I'm not sure how the proposed change solves it. The description isn't fully clear to me on how the fix addresses the issue. I think the description needs to be updated.
I'm surprised (maybe disappointed) that not a single existing test was broken by your change.
How can we guarantee any other change/regression will be detected?
Can you try to "tweak" the code to ensure your new test covers these cases?
TestKubernetesLocalNATScenario initially failed; it passed only once I made the modification in this PR.
    func TestKubernetesLocalNATScenario(t *testing.T) {
        if runtime.GOOS == "windows" || os.Getenv("CI") == "true" {
            t.Skip("Skipping test on Windows or CI")
        }
Consider moving the test to a new file with a linux go:build constraint on it (as this test might otherwise run on macOS). See the sketch below.
Also, why do we skip CI?
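A minimal sketch of the build-constraint suggestion (file and package names are illustrative):

```go
//go:build linux

package usm_test // illustrative; match the existing test's package

import "testing"

// With the build constraint above, this file is only compiled on Linux,
// so the runtime.GOOS == "windows" guard inside the test can be dropped.
func TestKubernetesLocalNATScenario(t *testing.T) {
	t.Log("compiled and run only on Linux")
}
```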
    // Callback 1: (client, server)
    if f(types.NewConnectionKey(clientIP, serverIP, clientPort, serverPort)) {
I suggest adding a comment here explaining the reason for the change, to help prevent future breaking changes.
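For example, something along these lines (the wording is a reviewer-side suggestion, not the author's):

```go
// Callback 1: (client, server). Try the direct, non-translated tuple
// first: with same-node K8s services, the pre-NAT and post-NAT
// connections can otherwise both match the same USM key, and the PID
// filtering would then drop one of them entirely.
if f(types.NewConnectionKey(clientIP, serverIP, clientPort, serverPort)) {
```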
What does this PR do?
- Added TestKubernetesLocalNATScenario, which reproduces and validates the fix for SUSM-94

Motivation
In SUSM-94, a customer reported an issue where in a K8s environment, when a client communicates with a K8s service that NATs traffic to a server on the same node, USM suffered from accuracy loss. We reproduced the issue locally on a K8s cluster and noticed the following issue:
NPM captures 2 connections:
- Connection 1 (pre-NAT, client's perspective)
- Connection 2 (post-NAT, server's perspective)

USM captures 2 connection keys:
Before this change, the same USM aggregation (client -> server) would match both NPM connections because:
Because we have a PID filtering mechanism that prevents matching the same USM connection to different PIDs, we would drop the second matching entirely, resulting in accuracy loss.
The change introduced in this PR fixes this issue by prioritizing direct connection matches (Source and Dest fields) before NAT translations. This means:
This ensures both connections retain their appropriate USM data.
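As a rough illustration of that ordering (names like forEachConnectionKey and the IPTranslation fields are assumptions based on the snippets quoted in the review, not the actual implementation):

```go
// forEachConnectionKey invokes f with candidate USM keys for a
// connection, stopping at the first key for which USM data matched.
func forEachConnectionKey(conn network.ConnectionStats, f func(types.ConnectionKey) bool) {
	// 1. Direct match on the connection's own Source/Dest fields first.
	if f(types.NewConnectionKey(conn.Source, conn.Dest, conn.SPort, conn.DPort)) {
		return
	}
	// 2. Only if the direct tuple matched nothing, fall back to the NAT
	// translation, so the pre- and post-NAT connections each claim their
	// own USM aggregation instead of competing for the same one.
	if tr := conn.IPTranslation; tr != nil {
		f(types.NewConnectionKey(tr.ReplDstIP, tr.ReplSrcIP, tr.ReplDstPort, tr.ReplSrcPort))
	}
}
```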
Describe how you validated your changes
- Added TestKubernetesLocalNATScenario, which exactly reproduces the customer scenario from SUSM-94

Possible Drawbacks / Trade-offs
Additional Notes