
SUSM-94: Handle Same-Node K8s Service #35644

Closed
DanielLavie wants to merge 27 commits into main from daniel.lavie/fix-unclaimed-tracing

Conversation

@DanielLavie
Contributor

@DanielLavie DanielLavie commented Mar 31, 2025

What does this PR do?

  • Reorders connection key callback sequence to prioritize direct connection matches before NAT translations
  • Adds test case TestKubernetesLocalNATScenario that reproduces and validates the fix for SUSM-94
  • Adds detailed debug tracing capabilities for connection handling when trace logging is enabled, so we can use it for future customer investigations

Motivation

In SUSM-94, a customer reported that USM lost accuracy in a K8s environment when a client communicates with a K8s service that NATs traffic to a server on the same node. We reproduced the issue locally on a K8s cluster and observed the following:

image

NPM captures 2 connections:

  1. Connection 1 (Pre-NAT, client's perspective):

    • Source: 172.29.161.37:53792 (Client)
    • Dest: 10.100.103.122:7778 (K8s Service)
    • Direction: OUTGOING
    • PID: 1784258 (client process)
    • IPTranslation: Contains post-NAT info mapping to Client->Server
  2. Connection 2 (Post-NAT, server's perspective):

    • Source: 172.29.191.94:7777 (Server)
    • Dest: 172.29.161.37:53792 (Client)
    • Direction: INCOMING
    • PID: 334392 (server process)
    • No IPTranslation

USM captures 2 connection keys:

  • Key 1: Client->K8s Service (pre-NAT)
  • Key 2: Client->Server (post-NAT, normalized)

Before this change, the same USM aggregation (client -> server) would match both NPM connections because:

  • For Connection 1, it would check IPTranslation first and match with Key 2
  • For Connection 2, it would also match with Key 2

Because we have a PID filtering mechanism that prevents matching the same USM connection to different PIDs, we would drop the second match entirely, resulting in accuracy loss.

The change introduced in this PR fixes this issue by prioritizing direct connection matches (Source and Dest fields) before NAT translations. This means:

  • Connection 1 now correctly matches with Key 1 (Client->K8s Service)
  • Connection 2 correctly matches with Key 2 (Client->Server)

This ensures both connections retain their appropriate USM data.
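
The ordering change described above can be sketched with a small, self-contained model. This is only an illustration under stated assumptions: `connKey`, `conn`, `candidateKeys`, `match`, and `scenario` are hypothetical names, not the agent's real types or functions, and the PID filter is reduced to "the first process to claim a key owns it". The IPs, ports, and PIDs are the ones listed in the description.

```go
package main

import "fmt"

// connKey is a hypothetical stand-in for a USM connection key.
type connKey struct {
	srcIP   string
	srcPort uint16
	dstIP   string
	dstPort uint16
}

// conn is a simplified, hypothetical NPM connection. direct is the key built
// from the (normalized) Source/Dest fields; translated is the key built from
// IPTranslation, or nil when the connection has none.
type conn struct {
	pid        int
	direct     connKey
	translated *connKey
}

// candidateKeys returns the keys to try for a connection, in priority order.
func candidateKeys(c conn, directFirst bool) []connKey {
	if c.translated == nil {
		return []connKey{c.direct}
	}
	if directFirst {
		return []connKey{c.direct, *c.translated}
	}
	return []connKey{*c.translated, c.direct}
}

// match gives each connection the first candidate key that has USM data.
// If that key was already claimed by a different PID, the match is dropped.
func match(conns []conn, usm map[connKey]string, directFirst bool) map[int]string {
	claimedBy := map[connKey]int{}
	out := map[int]string{}
	for _, c := range conns {
		for _, k := range candidateKeys(c, directFirst) {
			data, ok := usm[k]
			if !ok {
				continue // no USM data under this key, try the next one
			}
			if owner, claimed := claimedBy[k]; claimed && owner != c.pid {
				break // simplified PID filter: key belongs to another process
			}
			claimedBy[k] = c.pid
			out[c.pid] = data
			break
		}
	}
	return out
}

// scenario builds the two NPM connections and two USM keys from the description.
func scenario() ([]conn, map[connKey]string) {
	preNAT := connKey{"172.29.161.37", 53792, "10.100.103.122", 7778}
	postNAT := connKey{"172.29.161.37", 53792, "172.29.191.94", 7777}
	conns := []conn{
		{pid: 1784258, direct: preNAT, translated: &postNAT}, // Connection 1
		{pid: 334392, direct: postNAT},                       // Connection 2, normalized
	}
	usm := map[connKey]string{preNAT: "pre-NAT stats", postNAT: "post-NAT stats"}
	return conns, usm
}

func main() {
	conns, usm := scenario()
	fmt.Println("NAT first:   ", len(match(conns, usm, false)), "connection(s) with USM data")
	fmt.Println("direct first:", len(match(conns, usm, true)), "connection(s) with USM data")
}
```

In this model, trying the translated key first lets Connection 1 claim Key 2, so Connection 2 is filtered out; trying the direct key first leaves each connection with its own key.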

Describe how you validated your changes

  • Added TestKubernetesLocalNATScenario that exactly reproduces the customer scenario from SUSM-94:
    • Client (172.29.161.37:53792) -> K8s Service (10.100.103.122:7778) -> Server (172.29.191.94:7777)
    • The test validates both pre-NAT and post-NAT connection handling
  • Manually tested in a Kubernetes environment that matches the customer's setup
  • All existing tests continue to pass
  • Tested the change in the entei and parent12 staging clusters and observed no degradation in the orphan connections and aggregations metrics.
  • Tested the HAProxy scenario using the load test (results)

Possible Drawbacks / Trade-offs

Additional Notes

@github-actions github-actions Bot added component/system-probe medium review PR review might take time labels Mar 31, 2025
@DanielLavie DanielLavie added team/universal-service-monitoring The USM team qa/done QA done before merge and regressions are covered by tests labels Mar 31, 2025
@agent-platform-auto-pr
Contributor

agent-platform-auto-pr Bot commented Mar 31, 2025

Uncompressed package size comparison

Comparison with ancestor bd316e1e79549862f1870a636352534f2f858512

Diff per package
package diff status size ancestor threshold
datadog-agent-amd64-deb 0.00MB 802.98MB 802.98MB 0.50MB
datadog-agent-x86_64-rpm 0.00MB 812.86MB 812.85MB 0.50MB
datadog-agent-x86_64-suse 0.00MB 812.86MB 812.85MB 0.50MB
datadog-agent-arm64-deb 0.00MB 792.73MB 792.73MB 0.50MB
datadog-agent-aarch64-rpm 0.00MB 802.59MB 802.58MB 0.50MB
datadog-dogstatsd-amd64-deb 0.00MB 39.92MB 39.92MB 0.50MB
datadog-dogstatsd-x86_64-rpm 0.00MB 40.00MB 40.00MB 0.50MB
datadog-dogstatsd-x86_64-suse 0.00MB 40.00MB 40.00MB 0.50MB
datadog-dogstatsd-arm64-deb 0.00MB 38.43MB 38.43MB 0.50MB
datadog-heroku-agent-amd64-deb 0.00MB 446.87MB 446.87MB 0.50MB
datadog-iot-agent-amd64-deb 0.00MB 60.81MB 60.81MB 0.50MB
datadog-iot-agent-x86_64-rpm 0.00MB 60.88MB 60.88MB 0.50MB
datadog-iot-agent-x86_64-suse 0.00MB 60.88MB 60.88MB 0.50MB
datadog-iot-agent-arm64-deb 0.00MB 58.12MB 58.12MB 0.50MB
datadog-iot-agent-aarch64-rpm 0.00MB 58.20MB 58.20MB 0.50MB

Decision

✅ Passed

@agent-platform-auto-pr
Contributor

agent-platform-auto-pr Bot commented Mar 31, 2025

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

dda inv aws.create-vm --pipeline-id=62203283 --os-family=ubuntu

Note: This applies to commit ec797ea

@DanielLavie DanielLavie added the changelog/no-changelog No changelog entry needed label Mar 31, 2025
@agent-platform-auto-pr
Contributor

agent-platform-auto-pr Bot commented Mar 31, 2025

Static quality checks ✅

Please find below the results from static quality gates

Successful checks

Info

Result Quality gate On disk size On disk size limit On wire size On wire size limit
static_quality_gate_agent_deb_amd64 771.86 MiB 780.74 MiB 189.38 MiB 190.85 MiB
static_quality_gate_agent_deb_amd64_fips 769.78 MiB 778.65 MiB 188.63 MiB 190.66 MiB
static_quality_gate_agent_heroku_amd64 428.57 MiB 428.97 MiB 112.78 MiB 113.11 MiB
static_quality_gate_agent_msi 976.91 MiB 977.0 MiB 149.4 MiB 149.85 MiB
static_quality_gate_agent_rpm_amd64 771.79 MiB 780.59 MiB 191.0 MiB 193.37 MiB
static_quality_gate_agent_rpm_amd64_fips 769.74 MiB 778.69 MiB 190.61 MiB 192.47 MiB
static_quality_gate_agent_rpm_arm64 763.02 MiB 772.07 MiB 173.63 MiB 174.47 MiB
static_quality_gate_agent_rpm_arm64_fips 761.14 MiB 770.27 MiB 172.35 MiB 173.63 MiB
static_quality_gate_agent_suse_amd64 771.86 MiB 780.7 MiB 191.0 MiB 193.37 MiB
static_quality_gate_agent_suse_amd64_fips 769.8 MiB 778.59 MiB 190.61 MiB 192.47 MiB
static_quality_gate_agent_suse_arm64 762.94 MiB 772.02 MiB 173.63 MiB 174.47 MiB
static_quality_gate_agent_suse_arm64_fips 761.19 MiB 770.2 MiB 172.35 MiB 173.63 MiB
static_quality_gate_docker_agent_amd64 856.39 MiB 865.3 MiB 289.31 MiB 291.35 MiB
static_quality_gate_docker_agent_arm64 870.85 MiB 880.0 MiB 275.82 MiB 277.86 MiB
static_quality_gate_docker_agent_jmx_amd64 856.39 MiB 865.3 MiB 289.31 MiB 291.35 MiB
static_quality_gate_docker_agent_jmx_arm64 870.85 MiB 880.0 MiB 275.82 MiB 277.86 MiB
static_quality_gate_docker_agent_windows1809 856.39 MiB 865.3 MiB 289.31 MiB 291.35 MiB
static_quality_gate_docker_agent_windows1809_core 856.39 MiB 865.3 MiB 289.31 MiB 291.35 MiB
static_quality_gate_docker_agent_windows1809_core_jmx 856.39 MiB 865.3 MiB 289.31 MiB 291.35 MiB
static_quality_gate_docker_agent_windows1809_jmx 856.39 MiB 865.3 MiB 289.31 MiB 291.35 MiB
static_quality_gate_docker_agent_windows2022 856.39 MiB 865.3 MiB 289.31 MiB 291.35 MiB
static_quality_gate_docker_agent_windows2022_core 856.39 MiB 865.3 MiB 289.31 MiB 291.35 MiB
static_quality_gate_docker_agent_windows2022_core_jmx 856.39 MiB 865.3 MiB 289.31 MiB 291.35 MiB
static_quality_gate_docker_agent_windows2022_jmx 856.39 MiB 865.3 MiB 289.31 MiB 291.35 MiB
static_quality_gate_docker_cluster_agent_amd64 263.13 MiB 263.29 MiB 105.67 MiB 106.03 MiB
static_quality_gate_docker_cluster_agent_arm64 279.07 MiB 279.25 MiB 100.53 MiB 100.87 MiB
static_quality_gate_docker_cws_instrumentation_amd64 6.65 MiB 7.12 MiB 2.82 MiB 3.29 MiB
static_quality_gate_docker_cws_instrumentation_arm64 6.44 MiB 6.92 MiB 2.6 MiB 3.07 MiB
static_quality_gate_docker_dogstatsd_amd64 46.09 MiB 46.37 MiB 17.38 MiB 17.78 MiB
static_quality_gate_docker_dogstatsd_arm64 44.71 MiB 44.99 MiB 16.25 MiB 16.65 MiB
static_quality_gate_dogstatsd_deb_amd64 37.95 MiB 38.23 MiB 9.84 MiB 10.26 MiB
static_quality_gate_dogstatsd_deb_arm64 36.53 MiB 36.82 MiB 8.54 MiB 8.96 MiB
static_quality_gate_dogstatsd_rpm_amd64 37.95 MiB 38.23 MiB 9.85 MiB 10.27 MiB
static_quality_gate_dogstatsd_suse_amd64 37.95 MiB 38.23 MiB 9.85 MiB 10.27 MiB
static_quality_gate_iot_agent_deb_amd64 57.75 MiB 58.19 MiB 14.56 MiB 15.02 MiB
static_quality_gate_iot_agent_deb_arm64 55.2 MiB 55.63 MiB 12.59 MiB 13.05 MiB
static_quality_gate_iot_agent_deb_armhf 53.89 MiB 54.32 MiB 12.58 MiB 13.05 MiB
static_quality_gate_iot_agent_rpm_amd64 57.75 MiB 58.19 MiB 14.58 MiB 15.04 MiB
static_quality_gate_iot_agent_rpm_arm64 55.2 MiB 55.63 MiB 12.61 MiB 13.07 MiB
static_quality_gate_iot_agent_suse_amd64 57.75 MiB 58.19 MiB 14.58 MiB 15.04 MiB

@cit-pr-commenter

cit-pr-commenter Bot commented Mar 31, 2025

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: e0b80e60-eef8-4845-9231-5d371a328741

Baseline: bd316e1
Comparison: ec797ea
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
quality_gate_idle memory utilization +0.58 [+0.53, +0.64] 1 Logs bounds checks dashboard
file_to_blackhole_1000ms_latency egress throughput +0.38 [-0.38, +1.14] 1 Logs
quality_gate_logs % cpu utilization +0.31 [-2.51, +3.13] 1 Logs bounds checks dashboard
otlp_ingest_logs memory utilization +0.20 [+0.04, +0.36] 1 Logs
file_to_blackhole_1000ms_latency_linear_load egress throughput +0.07 [-0.40, +0.54] 1 Logs
quality_gate_idle_all_features memory utilization +0.01 [-0.08, +0.09] 1 Logs bounds checks dashboard
uds_dogstatsd_to_api ingress throughput +0.01 [-0.29, +0.30] 1 Logs
file_to_blackhole_0ms_latency egress throughput +0.00 [-0.84, +0.85] 1 Logs
tcp_dd_logs_filter_exclude ingress throughput +0.00 [-0.01, +0.02] 1 Logs
file_to_blackhole_100ms_latency egress throughput -0.01 [-0.71, +0.70] 1 Logs
file_to_blackhole_300ms_latency egress throughput -0.01 [-0.64, +0.63] 1 Logs
file_to_blackhole_0ms_latency_http1 egress throughput -0.02 [-0.83, +0.79] 1 Logs
file_to_blackhole_0ms_latency_http2 egress throughput -0.02 [-0.83, +0.79] 1 Logs
file_tree memory utilization -0.04 [-0.18, +0.11] 1 Logs
file_to_blackhole_500ms_latency egress throughput -0.07 [-0.86, +0.72] 1 Logs
otlp_ingest_traces memory utilization -0.12 [-0.50, +0.27] 1 Logs
otlp_ingest_metrics memory utilization -0.24 [-0.39, -0.09] 1 Logs
tcp_syslog_to_blackhole ingress throughput -0.38 [-0.45, -0.31] 1 Logs
uds_dogstatsd_20mb_12k_contexts_20_senders memory utilization -0.60 [-0.64, -0.55] 1 Logs
uds_dogstatsd_to_api_cpu % cpu utilization -1.09 [-1.93, -0.24] 1 Logs

Bounds Checks: ✅ Passed

perf experiment bounds_check_name replicates_passed links
file_to_blackhole_0ms_latency lost_bytes 10/10
file_to_blackhole_0ms_latency memory_usage 10/10
file_to_blackhole_0ms_latency_http1 lost_bytes 10/10
file_to_blackhole_0ms_latency_http1 memory_usage 10/10
file_to_blackhole_0ms_latency_http2 lost_bytes 10/10
file_to_blackhole_0ms_latency_http2 memory_usage 10/10
file_to_blackhole_1000ms_latency memory_usage 10/10
file_to_blackhole_1000ms_latency_linear_load memory_usage 10/10
file_to_blackhole_100ms_latency lost_bytes 10/10
file_to_blackhole_100ms_latency memory_usage 10/10
file_to_blackhole_300ms_latency lost_bytes 10/10
file_to_blackhole_300ms_latency memory_usage 10/10
file_to_blackhole_500ms_latency lost_bytes 10/10
file_to_blackhole_500ms_latency memory_usage 10/10
quality_gate_idle intake_connections 10/10 bounds checks dashboard
quality_gate_idle memory_usage 10/10 bounds checks dashboard
quality_gate_idle_all_features intake_connections 10/10 bounds checks dashboard
quality_gate_idle_all_features memory_usage 10/10 bounds checks dashboard
quality_gate_logs intake_connections 10/10 bounds checks dashboard
quality_gate_logs lost_bytes 10/10 bounds checks dashboard
quality_gate_logs memory_usage 10/10 bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
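
The three criteria above can be written as a single predicate. The sketch below is a hypothetical illustration, not the Regression Detector's actual code: the `experiment` type, its field names, and `isRegression` are assumptions, and only the 5.00% tolerance and the sample row values come from this report.

```go
package main

import (
	"fmt"
	"math"
)

// experiment holds one row of the change-detection table (hypothetical type).
type experiment struct {
	name         string
	deltaMeanPct float64 // estimated Δ mean %
	ciLow        float64 // lower bound of the 90% confidence interval
	ciHigh       float64 // upper bound of the 90% confidence interval
	erratic      bool    // true when the experiment's configuration marks it erratic
}

// effectSizeTolerance is the |Δ mean %| threshold quoted in the report.
const effectSizeTolerance = 5.0

// isRegression reports whether all three criteria hold for an experiment.
func isRegression(e experiment) bool {
	bigEnough := math.Abs(e.deltaMeanPct) >= effectSizeTolerance
	ciExcludesZero := e.ciLow > 0 || e.ciHigh < 0
	return bigEnough && ciExcludesZero && !e.erratic
}

func main() {
	// Values taken from the uds_dogstatsd_to_api_cpu row of the table above.
	row := experiment{name: "uds_dogstatsd_to_api_cpu", deltaMeanPct: -1.09, ciLow: -1.93, ciHigh: -0.24}
	fmt.Println(row.name, "flagged:", isRegression(row))
}
```

The sample row has a confidence interval that excludes zero, but its effect size is below the tolerance, so it fails the first criterion and is not flagged.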

CI Pass/Fail Decision

Passed. All Quality Gates passed.

  • quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.

@agent-platform-auto-pr
Contributor

agent-platform-auto-pr Bot commented Apr 1, 2025

Static quality checks ❌

Please find below the results from static quality gates

Error

Result Quality gate On disk size On disk size limit On wire size On wire size limit
static_quality_gate_agent_rpm_amd64 780.63 MiB 780.59 MiB 192.92 MiB 193.37 MiB
static_quality_gate_agent_suse_amd64_fips 778.62 MiB 778.59 MiB 192.16 MiB 192.47 MiB
static_quality_gate_agent_suse_arm64 772.06 MiB 772.02 MiB 173.98 MiB 174.47 MiB
static_quality_gate_agent_suse_arm64_fips 770.23 MiB 770.2 MiB 173.17 MiB 173.63 MiB
Gate failure full details
Quality gate Error type Error message
static_quality_gate_agent_rpm_amd64 AssertionError Package size on disk (uncompressed package size) 818550920 is higher than the maximum allowed 818507939 by the gate !
static_quality_gate_agent_suse_amd64_fips AssertionError Package size on disk (uncompressed package size) 816444089 is higher than the maximum allowed 816410787 by the gate !
static_quality_gate_agent_suse_arm64 AssertionError Package size on disk (uncompressed package size) 809559348 is higher than the maximum allowed 809521643 by the gate !
static_quality_gate_agent_suse_arm64_fips AssertionError Package size on disk (uncompressed package size) 807645585 is higher than the maximum allowed 807613235 by the gate !

Static quality gates prevent the PR from merging! You can check the static quality gates Confluence page for guidance. We also have a toolbox page that lists tools useful for debugging the size increase.

Successful checks

Info

Result Quality gate On disk size On disk size limit On wire size On wire size limit
static_quality_gate_agent_deb_amd64 780.66 MiB 780.74 MiB 190.34 MiB 190.85 MiB
static_quality_gate_agent_deb_amd64_fips 778.63 MiB 778.65 MiB 190.08 MiB 190.66 MiB
static_quality_gate_agent_heroku_amd64 428.56 MiB 428.97 MiB 112.78 MiB 113.11 MiB
static_quality_gate_agent_msi 976.69 MiB 977.0 MiB 149.38 MiB 149.85 MiB
static_quality_gate_agent_rpm_amd64_fips 778.62 MiB 778.69 MiB 192.16 MiB 192.47 MiB
static_quality_gate_agent_rpm_arm64 771.99 MiB 772.07 MiB 173.98 MiB 174.47 MiB
static_quality_gate_agent_rpm_arm64_fips 770.19 MiB 770.27 MiB 173.17 MiB 173.63 MiB
static_quality_gate_agent_suse_amd64 780.63 MiB 780.7 MiB 192.92 MiB 193.37 MiB
static_quality_gate_docker_agent_amd64 865.19 MiB 865.3 MiB 290.92 MiB 291.35 MiB
static_quality_gate_docker_agent_arm64 879.88 MiB 880.0 MiB 277.42 MiB 277.86 MiB
static_quality_gate_docker_agent_jmx_amd64 865.19 MiB 865.3 MiB 290.92 MiB 291.35 MiB
static_quality_gate_docker_agent_jmx_arm64 879.88 MiB 880.0 MiB 277.42 MiB 277.86 MiB
static_quality_gate_docker_agent_windows1809 865.19 MiB 865.3 MiB 290.92 MiB 291.35 MiB
static_quality_gate_docker_agent_windows1809_core 865.19 MiB 865.3 MiB 290.92 MiB 291.35 MiB
static_quality_gate_docker_agent_windows1809_core_jmx 865.19 MiB 865.3 MiB 290.92 MiB 291.35 MiB
static_quality_gate_docker_agent_windows1809_jmx 865.19 MiB 865.3 MiB 290.92 MiB 291.35 MiB
static_quality_gate_docker_agent_windows2022 865.19 MiB 865.3 MiB 290.92 MiB 291.35 MiB
static_quality_gate_docker_agent_windows2022_core 865.19 MiB 865.3 MiB 290.92 MiB 291.35 MiB
static_quality_gate_docker_agent_windows2022_core_jmx 865.19 MiB 865.3 MiB 290.92 MiB 291.35 MiB
static_quality_gate_docker_agent_windows2022_jmx 865.19 MiB 865.3 MiB 290.92 MiB 291.35 MiB
static_quality_gate_docker_cluster_agent_amd64 263.12 MiB 263.29 MiB 105.69 MiB 106.03 MiB
static_quality_gate_docker_cluster_agent_arm64 279.06 MiB 279.25 MiB 100.52 MiB 100.87 MiB
static_quality_gate_docker_cws_instrumentation_amd64 6.65 MiB 7.12 MiB 2.82 MiB 3.29 MiB
static_quality_gate_docker_cws_instrumentation_arm64 6.44 MiB 6.92 MiB 2.6 MiB 3.07 MiB
static_quality_gate_docker_dogstatsd_amd64 46.09 MiB 46.37 MiB 17.38 MiB 17.78 MiB
static_quality_gate_docker_dogstatsd_arm64 44.7 MiB 44.99 MiB 16.24 MiB 16.65 MiB
static_quality_gate_dogstatsd_deb_amd64 37.95 MiB 38.23 MiB 9.84 MiB 10.26 MiB
static_quality_gate_dogstatsd_deb_arm64 36.53 MiB 36.82 MiB 8.54 MiB 8.96 MiB
static_quality_gate_dogstatsd_rpm_amd64 37.95 MiB 38.23 MiB 9.85 MiB 10.27 MiB
static_quality_gate_dogstatsd_suse_amd64 37.95 MiB 38.23 MiB 9.85 MiB 10.27 MiB
static_quality_gate_iot_agent_deb_amd64 57.71 MiB 58.19 MiB 14.55 MiB 15.02 MiB
static_quality_gate_iot_agent_deb_arm64 55.16 MiB 55.63 MiB 12.57 MiB 13.05 MiB
static_quality_gate_iot_agent_deb_armhf 53.85 MiB 54.32 MiB 12.57 MiB 13.05 MiB
static_quality_gate_iot_agent_rpm_amd64 57.71 MiB 58.19 MiB 14.57 MiB 15.04 MiB
static_quality_gate_iot_agent_rpm_arm64 55.16 MiB 55.63 MiB 12.59 MiB 13.07 MiB
static_quality_gate_iot_agent_suse_amd64 57.71 MiB 58.19 MiB 14.57 MiB 15.04 MiB

@agent-platform-auto-pr
Contributor

agent-platform-auto-pr Bot commented Apr 10, 2025

Static quality checks ✅

Please find below the results from static quality gates

Successful checks

Info

Result Quality gate On disk size On disk size limit On wire size On wire size limit
static_quality_gate_agent_deb_amd64 776.91 MiB 778.06 MiB 190.44 MiB 191.06 MiB
static_quality_gate_agent_deb_amd64_fips 774.87 MiB 776.09 MiB 189.85 MiB 190.72 MiB
static_quality_gate_agent_heroku_amd64 433.71 MiB 434.99 MiB 113.4 MiB 114.34 MiB
static_quality_gate_agent_msi 977.08 MiB 978.45 MiB 150.72 MiB 151.65 MiB
static_quality_gate_agent_rpm_amd64 776.89 MiB 778.06 MiB 192.54 MiB 193.42 MiB
static_quality_gate_agent_rpm_amd64_fips 774.86 MiB 776.06 MiB 191.97 MiB 192.61 MiB
static_quality_gate_agent_rpm_arm64 767.1 MiB 768.33 MiB 173.84 MiB 174.71 MiB
static_quality_gate_agent_rpm_arm64_fips 765.28 MiB 766.55 MiB 173.04 MiB 173.92 MiB
static_quality_gate_agent_suse_amd64 776.89 MiB 778.08 MiB 192.54 MiB 193.42 MiB
static_quality_gate_agent_suse_amd64_fips 774.86 MiB 776.11 MiB 191.97 MiB 192.78 MiB
static_quality_gate_agent_suse_arm64 767.1 MiB 768.31 MiB 173.84 MiB 174.71 MiB
static_quality_gate_agent_suse_arm64_fips 765.28 MiB 766.5 MiB 173.04 MiB 173.92 MiB
static_quality_gate_docker_agent_amd64 861.99 MiB 862.0 MiB 290.58 MiB 291.48 MiB
static_quality_gate_docker_agent_arm64 875.48 MiB 875.62 MiB 277.05 MiB 277.94 MiB
static_quality_gate_docker_agent_jmx_amd64 861.99 MiB 862.0 MiB 290.58 MiB 291.48 MiB
static_quality_gate_docker_agent_jmx_arm64 875.48 MiB 876.23 MiB 277.05 MiB 277.94 MiB
static_quality_gate_docker_agent_windows1809 861.99 MiB 862.0 MiB 290.58 MiB 291.48 MiB
static_quality_gate_docker_agent_windows1809_core 861.99 MiB 862.2 MiB 290.58 MiB 291.48 MiB
static_quality_gate_docker_agent_windows1809_core_jmx 861.99 MiB 862.0 MiB 290.58 MiB 291.48 MiB
static_quality_gate_docker_agent_windows1809_jmx 861.99 MiB 862.0 MiB 290.58 MiB 291.48 MiB
static_quality_gate_docker_agent_windows2022 861.99 MiB 862.0 MiB 290.58 MiB 291.48 MiB
static_quality_gate_docker_agent_windows2022_core 861.99 MiB 862.0 MiB 290.58 MiB 291.48 MiB
static_quality_gate_docker_agent_windows2022_core_jmx 861.99 MiB 862.0 MiB 290.58 MiB 291.48 MiB
static_quality_gate_docker_agent_windows2022_jmx 861.99 MiB 862.0 MiB 290.58 MiB 291.48 MiB
static_quality_gate_docker_cluster_agent_amd64 262.46 MiB 263.4 MiB 103.12 MiB 104.07 MiB
static_quality_gate_docker_cluster_agent_arm64 278.43 MiB 279.38 MiB 97.99 MiB 98.95 MiB
static_quality_gate_docker_cws_instrumentation_amd64 6.65 MiB 7.12 MiB 2.82 MiB 3.29 MiB
static_quality_gate_docker_cws_instrumentation_arm64 6.44 MiB 6.92 MiB 2.6 MiB 3.07 MiB
static_quality_gate_docker_dogstatsd_amd64 46.29 MiB 46.39 MiB 17.45 MiB 17.78 MiB
static_quality_gate_docker_dogstatsd_arm64 44.9 MiB 45.05 MiB 16.3 MiB 16.65 MiB
static_quality_gate_dogstatsd_deb_amd64 38.15 MiB 38.4 MiB 9.88 MiB 10.26 MiB
static_quality_gate_dogstatsd_deb_arm64 36.72 MiB 36.98 MiB 8.57 MiB 8.96 MiB
static_quality_gate_dogstatsd_rpm_amd64 38.15 MiB 38.4 MiB 9.88 MiB 10.27 MiB
static_quality_gate_dogstatsd_suse_amd64 38.15 MiB 38.4 MiB 9.88 MiB 10.27 MiB
static_quality_gate_iot_agent_deb_amd64 58.07 MiB 58.51 MiB 14.61 MiB 15.02 MiB
static_quality_gate_iot_agent_deb_arm64 55.51 MiB 55.94 MiB 12.64 MiB 13.05 MiB
static_quality_gate_iot_agent_deb_armhf 54.16 MiB 54.32 MiB 12.62 MiB 13.05 MiB
static_quality_gate_iot_agent_rpm_amd64 58.07 MiB 58.51 MiB 14.63 MiB 15.04 MiB
static_quality_gate_iot_agent_rpm_arm64 55.51 MiB 55.94 MiB 12.65 MiB 13.07 MiB
static_quality_gate_iot_agent_suse_amd64 58.07 MiB 58.51 MiB 14.63 MiB 15.04 MiB

@DanielLavie DanielLavie changed the title from "Daniel.lavie/fix unclaimed tracing" to "SUSM-94: Handle K8s Service (NAT) Traffic on Same Node Correctly" Apr 15, 2025
@DanielLavie DanielLavie changed the title from "SUSM-94: Handle K8s Service (NAT) Traffic on Same Node Correctly" to "SUSM-94: Handle Same-Node K8s Service" Apr 15, 2025
@DanielLavie DanielLavie marked this pull request as ready for review April 15, 2025 12:12
@DanielLavie DanielLavie requested review from a team as code owners April 15, 2025 12:12
@DanielLavie DanielLavie requested a review from mbakht April 15, 2025 12:12

// traceLogConnections logs detailed connection information to help investigate NPM/USM
// customer issues by providing visibility into network connections and their metadata.
func traceLogConnections(id string, cs *network.Connections) {
Contributor

I think this can result in an enormous number of logs. I have also not really had a customer send us trace level logs; usually we ask for debug level logs.

There are already endpoints like /connections and others for USM that we can use to get similar info. We can add them to the agent flare if we need them from customers; this seems like a better alternative to me than logs.

Contributor Author

  1. "I think this can result in an enormous number of logs" – This should only be printed when trace logs are enabled, meaning we’re actively investigating a customer issue and need detailed context.
    I'm also seeing this, which suggests we may already be emitting a high volume of trace logs.

  2. "I have also not really had a customer send us trace level logs; usually we ask for debug level logs" – Why is that? It makes sense to begin with debug logs, but in cases like this one where more granularity is required, escalating to trace level should be an option.

  3. "There are already endpoints like /connections and others for USM that we can use to get similar info" – That’s valid, but logs offer full historical context, while /connections only reflects the most recent 30 seconds or so. Logs also simplify correlation: they allow us to directly compare the NPM connection tuple with the USM connection key and quickly identify mismatches, without needing to parse aggregated HTTP data.

Contributor

I am not sure how useful this will be even if we somehow manage to get these logs from the customer. This could be a pretty large number of logs and parsing through them would be challenging to say the least. The endpoints will give you JSON, which is parse-able and query-able with tools like jq. Another problem would be that this obscures other trace logs since the context from this one log is going to be much larger than other trace logs.

Content like this is more appropriate for the flare. We were already considering adding this, but de-prioritized it; maybe this can be incentive enough to resurrect it.

Contributor Author

@DanielLavie DanielLavie Apr 15, 2025

The issue with the /connections endpoint is that it doesn’t include all the USM HTTP connection key details—it relies on writeConnections, which is where the underlying problem was hidden. As a result, we wouldn’t observe the issue when using this endpoint.

We could try using the HTTP debug endpoint instead, but it only returns the USM portion of the HTTP data, without NPM context, and it consumes all connections in the process - making it unusable for our needs.

So at this point, I’m not exactly sure how the endpoints could help us retrieve the necessary information.

Contributor

A combination of the two doesn't give the same results? You could also modify or add another endpoint. Another option could be to do more limited trace logs at the point where you think the problem can happen.

Contributor

@DanielLavie, eventually, will we need a snapshot of the logs (say, the last iteration) or a massive dump (i.e. an hour)?

Contributor Author

@guyarb a snapshot should suffice

}

// Log all connections for debugging
traceLog("Found %d total connections", len(cs.Conns))
Contributor

@amitslavin amitslavin Apr 15, 2025

what is the difference between this log and line 394?

Comment on lines +419 to +422
traceLog("Connection %d:", i)
traceLog(" Source: %s:%d", conn.Source.String(), conn.SPort)
traceLog(" Destination: %s:%d", conn.Dest.String(), conn.DPort)
traceLog(" Protocol Stack: %+v", conn.ProtocolStack)
Contributor

Any reason not to log all the data in a single traceLog call?
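
One way to read this suggestion is to build the whole entry first and emit it with a single call. The sketch below is a rough illustration under assumptions: `connInfo` and `formatConn` are hypothetical names, the fields mirror only what the quoted lines print, and the resulting string would be passed to whatever logging helper the PR uses.

```go
package main

import "fmt"

// connInfo is a hypothetical, trimmed-down view of the fields being logged.
type connInfo struct {
	srcIP         string
	srcPort       uint16
	dstIP         string
	dstPort       uint16
	protocolStack string
}

// formatConn renders one connection as a single line, so one log call
// carries the full context instead of four separate calls.
func formatConn(i int, c connInfo) string {
	return fmt.Sprintf(
		"Connection %d: source=%s:%d destination=%s:%d protocol_stack=%s",
		i, c.srcIP, c.srcPort, c.dstIP, c.dstPort, c.protocolStack,
	)
}

func main() {
	c := connInfo{
		srcIP: "172.29.161.37", srcPort: 53792,
		dstIP: "10.100.103.122", dstPort: 7778,
		protocolStack: "HTTP",
	}
	fmt.Println(formatConn(0, c))
}
```

A single line per connection also keeps entries from interleaving with other trace output, which is relevant to the log-volume concern raised earlier in this thread.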

} else {
traceLog(" No IP Translation")
}
traceLog(" Direction: %s", conn.Direction)
Contributor

same as above

servicePort := uint16(7778)
serverPort := uint16(7777)

// Create both connections exactly as seen from SUSM-94 reproduction environment
Contributor

Not sure the comment with SUSM-94 is relevant. Let's explain the purpose without referencing the ticket number.

Comment on lines +368 to +372
httpStats := http.NewRequestStats()
httpStats.AddRequest(200, 15.0, 0, nil)
httpStats.AddRequest(200, 15.0, 0, nil)
httpStats.AddRequest(200, 15.0, 0, nil)
httpStats.AddRequest(200, 15.0, 0, nil)
Contributor

maybe create a loop for that with a const
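
A minimal sketch of that suggestion follows. To keep it self-contained, `requestStats` is a hypothetical stand-in for the stats object in the quoted test; its `AddRequest` parameter names and types are assumptions that only mirror the call shape shown above, and `requestCount` and `newStatsWithRequests` are made-up names.

```go
package main

import "fmt"

// requestStats is a hypothetical stand-in that only counts recorded requests.
type requestStats struct {
	count int
}

// AddRequest mirrors the four-argument call shape from the quoted test.
func (s *requestStats) AddRequest(statusCode uint16, latency float64, staticTags uint64, dynamicTags []string) {
	s.count++
}

// requestCount replaces the four repeated calls with one named constant.
const requestCount = 4

// newStatsWithRequests records n identical 200 responses with 15.0 latency.
func newStatsWithRequests(n int) *requestStats {
	stats := &requestStats{}
	for i := 0; i < n; i++ {
		stats.AddRequest(200, 15.0, 0, nil)
	}
	return stats
}

func main() {
	stats := newStatsWithRequests(requestCount)
	fmt.Println("requests recorded:", stats.count)
}
```

Naming the count also lets later assertions in the test refer to `requestCount` instead of a bare literal.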

Comment on lines +388 to +396
postNATKey := http.NewKey(
	clientIP, // Client is still the source in the HTTP key (172.29.161.37)
	serverIP, // Server is the destination (172.29.191.94)
	clientPort, // Client port (53792)
	serverPort, // Server port (7777)
	[]byte("/delay/5"),
	true,
	http.MethodGet,
)
Contributor

If the only difference between the preNATKey and the postNATKey is the serverIP, let's structure it more cleanly by creating one key with http.NewKey and updating just that field.

Plus IMO the comments are not really relevant
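
A sketch of that restructuring, using a hypothetical `httpKey` struct rather than the agent's real key type: the struct, its field names, and `withDestination` are illustrative assumptions. Note that in the scenario from the PR description the destination port differs as well (7778 for the service, 7777 for the server), so the helper overrides both the IP and the port.

```go
package main

import "fmt"

// httpKey is a hypothetical, simplified stand-in for an HTTP connection key.
type httpKey struct {
	srcIP   string
	srcPort uint16
	dstIP   string
	dstPort uint16
	path    string
	method  string
}

// withDestination returns a copy of k that points at a different destination.
// k is passed by value, so the original key is left unchanged.
func withDestination(k httpKey, ip string, port uint16) httpKey {
	k.dstIP = ip
	k.dstPort = port
	return k
}

func main() {
	preNATKey := httpKey{
		srcIP: "172.29.161.37", srcPort: 53792,
		dstIP: "10.100.103.122", dstPort: 7778,
		path: "/delay/5", method: "GET",
	}
	postNATKey := withDestination(preNATKey, "172.29.191.94", 7777)
	fmt.Printf("pre-NAT:  %+v\n", preNATKey)
	fmt.Printf("post-NAT: %+v\n", postNATKey)
}
```

Deriving one key from the other makes the single point of difference explicit, which would also make the per-argument comments unnecessary.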

httpEncoder := newHTTPEncoder(payload.HTTP)

// Test post-NAT connection (server → client) - should have HTTP data
aggregations, _, _ := getHTTPAggregations(t, httpEncoder, payload.Conns[1])
Contributor

shouldn't you validate that payload.Conns > 0?

// Test post-NAT connection (server → client) - should have HTTP data
aggregations, _, _ := getHTTPAggregations(t, httpEncoder, payload.Conns[1])
assert.NotNil(aggregations)
assert.Equal("/delay/5", aggregations.EndpointAggregations[0].Path)
Contributor

shouldn't you validate that aggregations.EndpointAggregations > 0?
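
The guard being asked for here (and for payload.Conns in the previous comment) amounts to checking the length before indexing. The sketch below shows the idea with plain Go and hypothetical stand-in types; `httpAggregations`, `endpointAggregation`, and `firstPath` are assumptions, not the agent's real types. In a testify-based test, a `require.NotEmpty` call before the index expression would serve the same purpose.

```go
package main

import (
	"errors"
	"fmt"
)

// endpointAggregation and httpAggregations are hypothetical stand-ins for the
// decoded payload types used in the quoted test.
type endpointAggregation struct {
	Path string
}

type httpAggregations struct {
	EndpointAggregations []endpointAggregation
}

// firstPath returns the path of the first endpoint aggregation, or an error
// instead of panicking when the value is nil or the slice is empty.
func firstPath(aggs *httpAggregations) (string, error) {
	if aggs == nil {
		return "", errors.New("aggregations is nil")
	}
	if len(aggs.EndpointAggregations) == 0 {
		return "", errors.New("no endpoint aggregations")
	}
	return aggs.EndpointAggregations[0].Path, nil
}

func main() {
	aggs := &httpAggregations{EndpointAggregations: []endpointAggregation{{Path: "/delay/5"}}}
	path, err := firstPath(aggs)
	fmt.Println(path, err)

	_, err = firstPath(&httpAggregations{})
	fmt.Println(err)
}
```

With the check in place, an empty result fails the test with a readable message rather than an index-out-of-range panic.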

@amitslavin
Contributor

@DanielLavie, I understand the problem, but I'm not sure how the proposed change solves it. The description isn't fully clear to me on how the fix addresses the issue. I think that the description needs to be updated.

@github-actions github-actions Bot added long review PR is complex, plan time to review it and removed medium review PR review might take time labels Apr 15, 2025
Contributor

I'm surprised (maybe disappointed) that not a single existing test was broken due to your change
How can we guarantee any other change/regression will be detected?
Can you try and "tweak" the code to ensure your new test is covering the cases?

Contributor Author

TestKubernetesLocalNATScenario initially failed. It passed only after I made the modification in this PR.


Comment on lines +315 to +318
func TestKubernetesLocalNATScenario(t *testing.T) {
	if runtime.GOOS == "windows" || os.Getenv("CI") == "true" {
		t.Skip("Skipping test on Windows or CI")
	}
Contributor

Consider moving the test to a new file with a Linux Go build constraint on it (as written, this test might run on macOS).
Also, why do we skip CI?

Comment on lines +78 to +79
// Callback 1: (client, server)
if f(types.NewConnectionKey(clientIP, serverIP, clientPort, serverPort)) {
Contributor

I suggest adding a comment here explaining the reason for the change, to help prevent future breaking changes

@mbakht mbakht removed their request for review May 8, 2025 20:39
@dd-devtools-worker dd-devtools-worker Bot deleted the daniel.lavie/fix-unclaimed-tracing branch October 15, 2025 03:01

Labels

changelog/no-changelog No changelog entry needed component/system-probe long review PR is complex, plan time to review it qa/done QA done before merge and regressions are covered by tests team/universal-service-monitoring The USM team


4 participants