[TEEP-3981] feat(agentprofiling check): add agent termination option to agentprofiling check #44631
Conversation
Static quality checks ✅: please find below the results from the static quality gates. All checks successful.

Regression Detector Results (metrics dashboard). Baseline: 3ed63b5. Optimization Goals: ✅ No significant changes detected.
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +4.06 | [+1.01, +7.11] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +4.06 | [+1.01, +7.11] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +1.51 | [+1.42, +1.60] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.59 | [+0.54, +0.65] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.39 | [+0.24, +0.55] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.36 | [+0.33, +0.40] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics | memory utilization | +0.27 | [+0.03, +0.51] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.16 | [+0.10, +0.21] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.11 | [-1.36, +1.58] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.05 | [-0.36, +0.46] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.12, +0.14] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.00 | [-0.37, +0.37] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.00 | [-0.05, +0.05] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.00 | [-0.13, +0.13] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.08, +0.07] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.05 | [-0.47, +0.37] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.05 | [-0.13, +0.02] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.09 | [-0.30, +0.13] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.10 | [-0.25, +0.06] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.26 | [-0.30, -0.22] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_logs | memory utilization | -0.40 | [-0.47, -0.34] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.41 | [-0.64, -0.18] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.60 | [-0.69, -0.50] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -2.51 | [-2.71, -2.30] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide that a change in performance is a "regression" (a change worth investigating further) only if all of the following criteria are true; a worked example follows the list:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that, if our statistical model is accurate, there is at least a 90.00% chance of a difference in performance between the baseline and comparison variants.
- Its configuration does not mark it "erratic".
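As a worked example using the table above: docker_containers_cpu has an estimated Δ mean % of +4.06 with CI [+1.01, +7.11]. The interval excludes zero, but |+4.06| < 5.00%, so the change fails the effect-size criterion and is not flagged as a regression.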
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
Force-pushed from bc4aa62 to 33318e4.
The release note under review:

```yaml
features:
  - |
    The Agent Profiling check now supports automatic agent termination after flare generation when memory or CPU thresholds are exceeded. This feature is useful in resource-constrained environments where the agent needs to be restarted after generating diagnostic information.
    Enable this feature by setting `terminate_agent_on_threshold: true` in the Agent Profiling check configuration. When enabled, the agent will attempt a graceful shutdown via SIGINT after successfully generating a flare, allowing cleanup before exit. If signal delivery fails, it will fall back to immediate termination.
    **Warning**: This feature will cause the agent to exit. This feature is disabled by default and should be used with caution.
```

Reviewer suggestions on the wording:
- Capitalize "agent": "...environments where the Agent needs to be restarted after generating diagnostic information."
- Capitalize "agent" and use present tense: "When enabled, the Agent attempts a graceful shutdown via SIGINT after successfully generating a flare, allowing cleanup before exit. If signal delivery fails, it falls back to immediate termination."
- Capitalize "agent": "**Warning**: This feature will cause the Agent to exit. This feature is disabled by default and should be used with caution."
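For context, here is a minimal sketch of what enabling the option might look like in the check's configuration file (conventionally `agentprofiling.d/conf.yaml`). Only `terminate_agent_on_threshold` comes from this PR; the threshold option names and values below are illustrative assumptions, not the check's documented schema:

```yaml
init_config:

instances:
    # Hypothetical threshold options, for illustration only; consult the
    # check's example config for the real option names.
  - memory_threshold: 500MB
    cpu_threshold: 50
    # New in this PR: after a threshold-triggered flare is generated,
    # shut the Agent down (opt-in; defaults to false).
    terminate_agent_on_threshold: true
```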
nathan-b left a comment:
Who is asking for this? I'm not sure the agent should be in the business of killing itself based on resource usage; that's the job of the system.
Hey @nathan-b, the main use case for this would really be troubleshooting scenarios where the Agent is using far more memory or CPU than it should. I made this check in the first place because customers would sometimes report that the Agent was using far more memory/CPU than expected, but getting a flare with profiles while the overutilization was happening proved difficult for them. Cases of this nature then take a long time to resolve and make for a generally poor customer experience. Since developing it, there was a case where we recommended using this check, but the customer was afraid to reproduce the issue because they did not want the Agent to end up OOMing their node or container. This option should help prevent that kind of fear in future cases. This has also come up as an idea before, in a conversation with @julien-lebot about a customer case where it could have been helpful.
Hmm, interesting. So the reason for the agent to kill itself is specifically to increase customer confidence that they can reproduce scenarios that lead to excessive resource consumption because we can tell them that the agent will self-terminate when this happens? Based on the above I understand the rationale better, though I'm a little skeptical that customers will actually accept the proposal, for two reasons:
Thanks for the response!
nathan-b left a comment:
I'm generally OK with this change; however, I have two comments/suggestions which I would like you to seriously consider before merging. Happy to have further conversation about this if you would like.
Code under review (excerpt):

```go
// via stopAgent(). Termination is skipped when running in test mode to avoid killing the test process.
func (m *Check) terminateAgent() {
	// Skip termination when running in test mode
	// Check if we're running under go test by looking for test-related arguments
```
Recommend using testing.Testing() or build tagging (//go:build !test) instead of this somewhat fragile logic.
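A minimal sketch of the first suggestion, assuming Go 1.21+: `testing.Testing()` reports whether the binary was built by `go test`, which avoids scanning `os.Args` for test-related flags. The function names here are illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"
	"testing"
)

// skipTermination reports whether agent termination should be skipped
// because the current binary is a test binary. testing.Testing (Go 1.21+)
// returns true only for binaries created by "go test".
func skipTermination() bool {
	return testing.Testing()
}

func main() {
	if skipTermination() {
		fmt.Println("test binary: skipping termination")
		return
	}
	fmt.Println("not a test binary: termination would proceed")
}
```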
Code under review (excerpt):

```go
	log.Info("Flare generation complete. No more flares will be generated until the Agent is restarted.")

	// Terminate agent if configured to do so
	if m.instance.TerminateAgentOnThreshold {
```
This function can terminate early if generating a flare causes an error. Under this scenario, the agent has exceeded the resource threshold but has not terminated itself. Is this behavior what you intend?
Yes. In the case that `terminate_agent_on_threshold: true` is set, I think customers will find it more important that the Agent is shut down, because they do not want the Agent to continue hogging resources.
Additionally, most flare generation errors happen because the flare could not be sent to our intake; it will still be available locally in a temp folder. I almost never see the flare fail to generate completely.
Adding a response to @nathan-b's thoughts here for posterity.
Yes. Specifically, it should make the customer feel safer about reproducing a high memory or CPU utilization issue so that we can collect more diagnostic info. Customers often hesitate to reproduce issues of this kind.
Yes. In all situations where I would recommend using this check, I also recommend that the customer set the thresholds well below the point where the system starts having trouble completing commands. In general, I haven't seen customers use this check without our explicit guidance.
Force-pushed from 33318e4 to 93a8bf7.
Force-pushed from 93a8bf7 to a91bfbd.
…to agentprofiling check

Add a new configuration option to the agentprofiling check that allows the agent process to be automatically terminated after generating a flare when memory or CPU thresholds are exceeded. This enables process managers (systemd, Kubernetes, Docker, etc.) to automatically restart the agent when resource usage becomes excessive, preventing the agent from consuming unbounded resources. The termination is opt-in (defaults to false) and attempts graceful shutdown via SIGINT, falling back to os.Exit(1) if signal delivery fails. Termination is automatically skipped when running in test mode to prevent test failures.

Changes:
- Add boolean config field
- Implement cross-platform termination logic using os.Interrupt
- Update example config with documentation and warnings
- Add unit tests for config parsing and termination logic
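A hedged sketch of the termination flow this commit describes (later replaced by `signals.Stopper`; see below): signal the current process with `os.Interrupt` and fall back to `os.Exit(1)` if delivery fails. Names here are illustrative, not the PR's actual code; on Windows, sending `os.Interrupt` returns an error, so the fallback path runs:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
)

// terminateSelf asks the current process to shut down gracefully by
// delivering os.Interrupt to itself, falling back to an immediate
// os.Exit(1) if signal delivery fails (e.g. on Windows, where sending
// Interrupt is not supported).
func terminateSelf() {
	p, err := os.FindProcess(os.Getpid())
	if err == nil {
		err = p.Signal(os.Interrupt)
	}
	if err != nil {
		fmt.Fprintf(os.Stderr, "signal delivery failed (%v); exiting immediately\n", err)
		os.Exit(1)
	}
}

func main() {
	// Handle the interrupt so the "graceful shutdown" is observable.
	c := make(chan os.Signal, 1)
	signal.Notify(c, os.Interrupt)

	terminateSelf()

	<-c // in the Agent, this is where cleanup would run
	fmt.Println("received interrupt; cleaning up and exiting")
}
```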
…t and Windows .test.exe binaries.
…nstead of SIGINT/os.Exit Replace direct SIGINT signal handling with the agent's established shutdown mechanism (signals.Stopper) to ensure proper cleanup via stopAgent(). This provides better integration with the agent's shutdown flow and ensures all components are properly cleaned up during termination.
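A rough sketch of the final approach, under the assumption that `signals.Stopper` is a channel the Agent's run loop selects on to begin graceful shutdown; the stand-in channel below is illustrative, not the Agent's real package:

```go
package main

import "fmt"

// stopper stands in for the Agent's signals.Stopper channel. In the real
// Agent, the main run loop receives from it and calls stopAgent(), which
// performs component cleanup before the process exits.
var stopper = make(chan bool, 1)

// terminateAgent requests shutdown through the established stop channel
// instead of raising SIGINT directly.
func terminateAgent() {
	select {
	case stopper <- true: // hand control to the normal shutdown path
	default: // a stop request is already pending; nothing to do
	}
}

func main() {
	terminateAgent()
	fmt.Println("stop requested:", <-stopper)
}
```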
in terminateAgent() method. Remove unused imports. Update release note.
Gitlab CI Configuration Changes
Changes Summary
ℹ️ Diff available in the job log.
Force-pushed from a91bfbd to fd037d0.
Force-pushed from fd037d0 to db37b1b.
/merge
View all feedback in the Devflow UI.
This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
What does this PR do?
This PR adds a new configuration option `terminate_agent_on_threshold` to the agentprofiling check that allows the agent process to be automatically terminated after generating a flare when memory or CPU thresholds are exceeded. The termination is opt-in (defaults to false) and uses the agent's established graceful shutdown mechanism (`signals.Stopper`) to ensure proper cleanup via `stopAgent()`. Termination is automatically skipped when running in test mode to prevent test failures.

Changes:
- Add `terminate_agent_on_threshold` boolean config field
- Implement termination via the agent's established shutdown mechanism (`signals.Stopper`)
- Use `signals.Stopper` instead of direct signal handling for better integration with the agent shutdown flow

Motivation
This feature helps prevent OOM errors and CPU throttling by automatically terminating the agent when resource thresholds are exceeded. This allows process managers to restart the agent before it consumes excessive system resources, making it useful for troubleshooting memory and CPU issues in production environments.
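As an illustration of the process-manager angle, here is a hypothetical Kubernetes snippet: with the default restart policy, the kubelet restarts the Agent container after it exits, so a self-termination triggered by this option yields a fresh Agent automatically. The pod spec below is a minimal, assumption-laden example, not the official manifest:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: datadog-agent
spec:
  # "Always" is the default for Pods: the kubelet restarts exited
  # containers (with exponential backoff), giving the Agent a clean slate.
  restartPolicy: Always
  containers:
    - name: agent
      image: gcr.io/datadoghq/agent:latest
```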
Describe how you validated your changes
Ran the changes locally in a dev environment and tested with the agentprofiling check, then verified that the Agent was shut down successfully.
Tested on a Linux container and a local macOS machine. The Agent was successfully terminated with proper cleanup in both cases.