fix(fleet/installer): fix ~90s daemon stop delay during installer setup #46757
BaptisteFoy wants to merge 10 commits into main
Static quality checks
✅ Please find below the results from static quality gates
Successful checks
24 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
Regression Detector Results
Metrics dashboard
Baseline: 4623533
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -0.92 | [-3.98, +2.14] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.55 | [+0.40, +0.71] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.43 | [+0.35, +0.52] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.21 | [+0.16, +0.26] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.05 | [-0.16, +0.26] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.05 | [-0.34, +0.43] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.01 | [-0.03, +0.06] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.12, +0.14] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | +0.00 | [-0.04, +0.05] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.10, +0.09] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.00 | [-0.07, +0.06] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.02 | [-0.08, +0.03] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.02 | [-0.14, +0.09] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.04 | [-0.47, +0.38] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.05 | [-0.54, +0.43] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.10 | [-0.33, +0.14] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.18 | [-0.21, -0.15] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_logs | memory utilization | -0.20 | [-0.29, -0.11] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.27 | [-0.42, -0.11] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.32 | [-0.51, -0.12] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -0.92 | [-3.98, +2.14] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -1.15 | [-1.36, -0.95] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -1.63 | [-1.72, -1.54] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -2.83 | [-4.32, -1.35] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we classify a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
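The three criteria above can be read as a single boolean test. The following is a minimal Go sketch of that reading, not the detector's actual implementation: the `result` type, the `isRegression` function, and the `effectSizeTolerance` constant are hypothetical names, and the `Erratic` values for the sample rows are assumed to be false because the report does not state them.

```go
package main

import (
	"fmt"
	"math"
)

// result is a hypothetical record for one row of the report tables.
type result struct {
	Experiment    string
	DeltaMeanPct  float64 // "Δ mean %"
	CILow, CIHigh float64 // bounds of "Δ mean % CI"
	Erratic       bool    // assumed flag from the experiment configuration
}

// effectSizeTolerance mirrors the 5.00% threshold quoted in the explanation.
const effectSizeTolerance = 5.0

// isRegression applies the three criteria: the effect is large enough,
// the confidence interval excludes zero, and the experiment is not erratic.
func isRegression(r result) bool {
	bigEnough := math.Abs(r.DeltaMeanPct) >= effectSizeTolerance
	excludesZero := r.CILow > 0 || r.CIHigh < 0
	return bigEnough && excludesZero && !r.Erratic
}

func main() {
	// Values copied from the tables above; Erratic=false is an assumption.
	rows := []result{
		{"quality_gate_logs", -2.83, -4.32, -1.35, false},
		{"docker_containers_cpu", -0.92, -3.98, 2.14, false},
	}
	for _, r := range rows {
		fmt.Printf("%s: regression=%t\n", r.Experiment, isRegression(r))
	}
}
```

Both sample rows evaluate to false. `quality_gate_logs` has an interval that excludes zero, but its |Δ mean %| of 2.83 is below the 5.00% tolerance, while the interval for `docker_containers_cpu` contains zero.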
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
What does this PR do?

Fixes the ~90s stop delay of `datadog-agent-installer.service` during `installer setup` (e.g. `--flavor databricks`), introduced when `remote_updates: true` became the default.

Root cause

`datadog-agent-installer.service` has `BindsTo=datadog-agent.service`. Every `systemctl restart datadog-agent` during setup also stops the installer daemon.

The daemon stop hangs because installer subprocesses (`get-states`, `garbage-collect`) open `packages.db` via bbolt's exclusive flock, which is held by the setup process for the entire setup duration. These subprocesses block in `bbolt.Open` indefinitely (the call is not interruptible by context cancellation or SIGINT) and are orphaned in the systemd cgroup when the daemon process exits, causing systemd to wait the full `TimeoutStopSec` (90s) before sending SIGKILL.

Two related issues compound this:

- `newDaemon()` called `refreshState()` synchronously, spawning a `get-states` subprocess before FX init completed. If the daemon was stopped before init finished, signal handlers were never registered, so SIGTERM used Go's default handler (immediate exit) instead of calling `daemon.Stop()`, orphaning the subprocess without any cleanup.
- `Stop()` returned before the background goroutine exited. Even when `daemon.Stop()` was called correctly, it returned as soon as `stopChan` was closed. The daemon process exited, orphaning any subprocess still blocked on `bbolt.Open`. `WaitDelay` never fired because it requires the parent to stay alive.

Fixes

`pkg/fleet/daemon/daemon.go`

- Move `refreshState` to background goroutine: Removed `refreshState()` from `newDaemon()` and moved it to the start of the `Start()` goroutine (without holding `d.m`; this is safe because `refreshState` only reads external state). FX init now completes immediately, signal handlers are registered, and SIGTERM is handled via `daemon.Stop()` before any blocking subprocess is spawned.
- `goroutineWG.Wait()` in `Stop()`: Added a `goroutineWG` tracking the background goroutine. `Stop()` waits for the goroutine to exit before returning, keeping the daemon process alive until all child subprocesses have been waited on. This is the core fix: the parent process stays alive long enough for `WaitDelay` (15s) to fire and SIGKILL any subprocess blocked on `bbolt.Open`.
- Cancellable daemon context: Added `context.WithCancel` to the daemon. `d.cancel()` is called at the start of `Stop()` (before acquiring the mutex) so in-flight subprocesses receive SIGINT immediately, without waiting for the mutex.
- Release mutex before waiting: Replaced `defer d.m.Unlock()` with explicit unlocks before `d.goroutineWG.Wait()`, so the background goroutine can still acquire `d.m` if needed while draining.
- `scheduleRemoteAPIRequest` stop-awareness: Added a `select` on `d.stopChan` to avoid `requestsWG.Add(1)` without a corresponding `Done()` after the goroutine has exited.
- Start RC after initial `refreshState`: Moved `rc.Start()` from the end of `Start()` into the background goroutine, after the initial `refreshState()` completes. This ensures the first RC payload sent to the backend contains actual package state instead of an empty state. The call is guarded by the mutex and a context check to prevent a race with `rc.Close()` in `Stop()`.

`pkg/fleet/installer/exec/installer_exec_nix.go`

- `WaitDelay = 15s`: Ensures SIGKILL fires 15s after SIGINT for subprocesses blocked in bbolt's exclusive flock, bounding the daemon stop time to ~15s in the worst case.

`pkg/fleet/installer/setup/common/services_nix.go`

- Added `defer func() { span.Finish(err) }()` to `restartServices`.

Validate

- `go test ./pkg/fleet/daemon/...` passes
- `installer setup --flavor databricks` with `remote_updates: true` completes without the ~90s hang (previously observed on every `systemctl restart datadog-agent` during setup)