[TEEP-3981] feat(agentprofiling check): add agent termination option to agentprofiling check #44631
Conversation
Static quality checks ✅: please find below the results from the static quality gates. All checks successful.

Regression Detector Results (metrics dashboard). Baseline: 3ed63b5. Optimization Goals: ✅ No significant changes detected.
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +4.06 | [+1.01, +7.11] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +4.06 | [+1.01, +7.11] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +1.51 | [+1.42, +1.60] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.59 | [+0.54, +0.65] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.39 | [+0.24, +0.55] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.36 | [+0.33, +0.40] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics | memory utilization | +0.27 | [+0.03, +0.51] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.16 | [+0.10, +0.21] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.11 | [-1.36, +1.58] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.05 | [-0.36, +0.46] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.12, +0.14] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.00 | [-0.37, +0.37] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.00 | [-0.05, +0.05] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.00 | [-0.13, +0.13] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.08, +0.07] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.05 | [-0.47, +0.37] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.05 | [-0.13, +0.02] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.09 | [-0.30, +0.13] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.10 | [-0.25, +0.06] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.26 | [-0.30, -0.22] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_logs | memory utilization | -0.40 | [-0.47, -0.34] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.41 | [-0.64, -0.18] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.60 | [-0.69, -0.50] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -2.51 | [-2.71, -2.30] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide that a change in performance is a "regression" (a change worth investigating further) only if all of the following criteria are true; a worked example follows the list:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that, if our statistical model is accurate, there is at least a 90.00% chance of a difference in performance between the baseline and comparison variants.
- Its configuration does not mark it "erratic".
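As a worked example using the table above: docker_containers_cpu has an estimated Δ mean % of +4.06 with CI [+1.01, +7.11]. The interval excludes zero, but |+4.06| < 5.00%, so the change fails the effect-size criterion and is not flagged as a regression.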
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
Force-pushed from bc4aa62 to 33318e4.
The release note under review:

```yaml
features:
  - |
    The Agent Profiling check now supports automatic agent termination after flare generation when memory or CPU thresholds are exceeded. This feature is useful in resource-constrained environments where the agent needs to be restarted after generating diagnostic information.
    Enable this feature by setting `terminate_agent_on_threshold: true` in the Agent Profiling check configuration. When enabled, the agent will attempt a graceful shutdown via SIGINT after successfully generating a flare, allowing cleanup before exit. If signal delivery fails, it will fall back to immediate termination.
    **Warning**: This feature will cause the agent to exit. This feature is disabled by default and should be used with caution.
```

Reviewer suggestions on the wording:
- Capitalize "agent": "...environments where the Agent needs to be restarted after generating diagnostic information."
- Capitalize "agent" and use present tense: "When enabled, the Agent attempts a graceful shutdown via SIGINT after successfully generating a flare, allowing cleanup before exit. If signal delivery fails, it falls back to immediate termination."
- Capitalize "agent": "**Warning**: This feature will cause the Agent to exit. This feature is disabled by default and should be used with caution."
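For context, here is a minimal sketch of what enabling the option might look like in the check's configuration file (conventionally `agentprofiling.d/conf.yaml`). Only `terminate_agent_on_threshold` comes from this PR; the threshold option names and values below are illustrative assumptions, not the check's documented schema:

```yaml
init_config:

instances:
    # Hypothetical threshold options, for illustration only; consult the
    # check's example config for the real option names.
  - memory_threshold: 500MB
    cpu_threshold: 50
    # New in this PR: after a threshold-triggered flare is generated,
    # shut the Agent down (opt-in; defaults to false).
    terminate_agent_on_threshold: true
```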
nathan-b left a comment:
Who is asking for this? I'm not sure the agent should be in the business of killing itself based on resource usage; that's the job of the system.
Hey @nathan-b, the main use case for this would really be troubleshooting scenarios where the Agent is using far more memory or CPU than it should. I made this check in the first place because customers would sometimes report that the Agent was using far more memory/CPU than expected, but getting a flare with profiles while the overutilization was happening proved difficult for them. Cases of this nature then take a long time to resolve and make for a generally poor customer experience. Since developing it, there was a case where we recommended using this check, but the customer was afraid to reproduce the issue because they did not want the Agent to end up OOMing their node or container. This option should help prevent that kind of fear in future cases. This has also come up as an idea before, in a conversation with @julien-lebot about a customer case where it could have been helpful.
Hmm, interesting. So the reason for the agent to kill itself is specifically to increase customer confidence that they can reproduce scenarios that lead to excessive resource consumption because we can tell them that the agent will self-terminate when this happens? Based on the above I understand the rationale better, though I'm a little skeptical that customers will actually accept the proposal, for two reasons:
Thanks for the response!
nathan-b left a comment:
I'm generally OK with this change; however, I have two comments/suggestions which I would like you to seriously consider before merging. Happy to have further conversation about this if you would like.
Code under review (excerpt):

```go
// via stopAgent(). Termination is skipped when running in test mode to avoid killing the test process.
func (m *Check) terminateAgent() {
	// Skip termination when running in test mode
	// Check if we're running under go test by looking for test-related arguments
```
Recommend using testing.Testing() or build tagging (//go:build !test) instead of this somewhat fragile logic.
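A minimal sketch of the first suggestion, assuming Go 1.21+: `testing.Testing()` reports whether the binary was built by `go test`, which avoids scanning `os.Args` for test-related flags. The function names here are illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"
	"testing"
)

// skipTermination reports whether agent termination should be skipped
// because the current binary is a test binary. testing.Testing (Go 1.21+)
// returns true only for binaries created by "go test".
func skipTermination() bool {
	return testing.Testing()
}

func main() {
	if skipTermination() {
		fmt.Println("test binary: skipping termination")
		return
	}
	fmt.Println("not a test binary: termination would proceed")
}
```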
Code under review (excerpt):

```go
	log.Info("Flare generation complete. No more flares will be generated until the Agent is restarted.")

	// Terminate agent if configured to do so
	if m.instance.TerminateAgentOnThreshold {
```
This function can terminate early if generating a flare causes an error. Under this scenario, the agent has exceeded the resource threshold but has not terminated itself. Is this behavior what you intend?
Yes. In the case that `terminate_agent_on_threshold: true` is set, I think customers will find it more important that the Agent is shut down, because they do not want the Agent to continue hogging resources.
Additionally, most flare generation errors happen because the flare could not be sent to our intake; it will still be available locally in a temp folder. I almost never see the flare fail to generate completely.
Adding a response to @nathan-b's thoughts here for posterity.
Yes. Specifically, it should make the customer feel safer about reproducing a high memory or CPU utilization issue so that we can collect more diagnostic info. Customers often hesitate to reproduce issues of this kind.
Yes. In all situations where I would recommend using this check, I also recommend that the customer set the thresholds well below the point where the system starts having trouble completing commands. In general, I haven't seen customers use this check without our explicit guidance.
Force-pushed from 33318e4 to 93a8bf7.
Force-pushed from 93a8bf7 to a91bfbd.
…to agentprofiling check

Add a new configuration option to the agentprofiling check that allows the agent process to be automatically terminated after generating a flare when memory or CPU thresholds are exceeded. This enables process managers (systemd, Kubernetes, Docker, etc.) to automatically restart the agent when resource usage becomes excessive, preventing the agent from consuming unbounded resources. The termination is opt-in (defaults to false) and attempts graceful shutdown via SIGINT, falling back to os.Exit(1) if signal delivery fails. Termination is automatically skipped when running in test mode to prevent test failures.

Changes:
- Add boolean config field
- Implement cross-platform termination logic using os.Interrupt
- Update example config with documentation and warnings
- Add unit tests for config parsing and termination logic
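A hedged sketch of the termination flow this commit describes (later replaced by `signals.Stopper`; see below): signal the current process with `os.Interrupt` and fall back to `os.Exit(1)` if delivery fails. Names here are illustrative, not the PR's actual code; on Windows, sending `os.Interrupt` returns an error, so the fallback path runs:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
)

// terminateSelf asks the current process to shut down gracefully by
// delivering os.Interrupt to itself, falling back to an immediate
// os.Exit(1) if signal delivery fails (e.g. on Windows, where sending
// Interrupt is not supported).
func terminateSelf() {
	p, err := os.FindProcess(os.Getpid())
	if err == nil {
		err = p.Signal(os.Interrupt)
	}
	if err != nil {
		fmt.Fprintf(os.Stderr, "signal delivery failed (%v); exiting immediately\n", err)
		os.Exit(1)
	}
}

func main() {
	// Handle the interrupt so the "graceful shutdown" is observable.
	c := make(chan os.Signal, 1)
	signal.Notify(c, os.Interrupt)

	terminateSelf()

	<-c // in the Agent, this is where cleanup would run
	fmt.Println("received interrupt; cleaning up and exiting")
}
```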
…t and Windows .test.exe binaries.
…nstead of SIGINT/os.Exit Replace direct SIGINT signal handling with the agent's established shutdown mechanism (signals.Stopper) to ensure proper cleanup via stopAgent(). This provides better integration with the agent's shutdown flow and ensures all components are properly cleaned up during termination.
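A rough sketch of the final approach, under the assumption that `signals.Stopper` is a channel the Agent's run loop selects on to begin graceful shutdown; the stand-in channel below is illustrative, not the Agent's real package:

```go
package main

import "fmt"

// stopper stands in for the Agent's signals.Stopper channel. In the real
// Agent, the main run loop receives from it and calls stopAgent(), which
// performs component cleanup before the process exits.
var stopper = make(chan bool, 1)

// terminateAgent requests shutdown through the established stop channel
// instead of raising SIGINT directly.
func terminateAgent() {
	select {
	case stopper <- true: // hand control to the normal shutdown path
	default: // a stop request is already pending; nothing to do
	}
}

func main() {
	terminateAgent()
	fmt.Println("stop requested:", <-stopper)
}
```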
in terminateAgent() method. Remove unused imports. Update release note.
Gitlab CI Configuration Changes
Changes Summary
ℹ️ Diff available in the job log.
Force-pushed from a91bfbd to fd037d0.
Force-pushed from fd037d0 to db37b1b.
/merge
View all feedback in the Devflow UI.
This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
What does this PR do?
This PR adds a new configuration option `terminate_agent_on_threshold` to the agentprofiling check that allows the agent process to be automatically terminated after generating a flare when memory or CPU thresholds are exceeded. The termination is opt-in (defaults to false) and uses the agent's established graceful shutdown mechanism (`signals.Stopper`) to ensure proper cleanup via `stopAgent()`. Termination is automatically skipped when running in test mode to prevent test failures.

Changes:
- Add `terminate_agent_on_threshold` boolean config field
- Implement termination via the agent's established shutdown mechanism (`signals.Stopper`)
- Use `signals.Stopper` instead of direct signal handling for better integration with the agent shutdown flow

Motivation
This feature helps prevent OOM errors and CPU throttling by automatically terminating the agent when resource thresholds are exceeded. This allows process managers to restart the agent before it consumes excessive system resources, making it useful for troubleshooting memory and CPU issues in production environments.
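As an illustration of the process-manager angle, here is a hypothetical Kubernetes snippet: with the default restart policy, the kubelet restarts the Agent container after it exits, so a self-termination triggered by this option yields a fresh Agent automatically. The pod spec below is a minimal, assumption-laden example, not the official manifest:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: datadog-agent
spec:
  # "Always" is the default for Pods: the kubelet restarts exited
  # containers (with exponential backoff), giving the Agent a clean slate.
  restartPolicy: Always
  containers:
    - name: agent
      image: gcr.io/datadoghq/agent:latest
```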
Describe how you validated your changes
Ran the changes locally in a dev environment and tested with the agentprofiling check, then verified that the Agent was shut down successfully.
Tested on a Linux container and a local macOS machine. The Agent was successfully terminated with proper cleanup in both cases.