[core] Check runner by masci · Pull Request #1 · DataDog/datadog-agent

masci · 2016-05-12T15:28:46Z

A fairly large PR that implements a concurrent architecture for executing Agent checks and embeds a CPython interpreter so that Python checks run transparently.

More details here: https://github.com/DataDog/datadog-agent/wiki/Python-Check-Runner

- runner logic
- added yaml dep
- load configuration files for checks
- better code layout
- do not use cgo when possible
- using Check interface, begin split source modules
- moar ignores
- better check and tests
- fixed import paths, added tests
- forgot testcase
- splitted function, better error cleaning, tests
- pass the path to conf dir
- spare some boilerplate
- docs
- test fixtures
- get init_config from kwargs
- minor
- changed Config type
- convert YAML output to python dict
- removed debug prints, fixed tests
- more tests
- added basic benchmark
- fixed tests, temp solution to lock goroutine to OS thread
- run go_expvar check
- temporary use dd api to post metrics from the go binary directly
- post real metrics to staging
- run processes check
- use Dogstatsd instead of DD api
- fixed histogram payload
- extracted common interface for checks
- run the Go checks in the same scheduler as Python checks
- better code layout
- docs

Added code formatting style for clang-format

Use agent logging module

Add benchmarks on open

* Add partial flags when tailing k8s container logs
* parser interface compliance
* Adjust UT
* [logs] Extract parsing from lineHandler logic
* [logs] Adjust UT, minor fixes
* [logs] UT covering previous changes
* [logs] fix async test
* Docker parser handles partial itself
* Rebase
* Address reviews
* Address reviews
* [logs] rebase collateral fixes

Co-authored-by: Jaime Fullaondo <jaime.fullaondo@datadoghq.com>

### What does this PR do?

Fixes several flaky eBPFless tests in `TestTracerSuite`.

1. **`TestTracerSuite/eBPFless/TestShortWrite`: "couldn't find connection used by short write"**
   * (Real bug.) Fixes `updateTCPStats()` so it no longer overcounts `SentBytes` when the kernel retransmits TCP segments that partially overlap with previously counted data. We now only add the new bytes beyond the sequence high-water mark.
2. **`TestTracerSuite/eBPFless/TestShortWrite`: "resource temporarily unavailable"**
   * (Test-only bug.) Fixes the `TestShortWrite` write loop to correctly handle the case where the send buffer is completely full and a write returns zero bytes with `resource temporarily unavailable`.
3. **`TestTracerSuite/eBPFless/TestTCPRTT`: "write: bad file descriptor"**
   * (Test-only bug.) Fixes an fd lifecycle issue in `TestShortWrite`: the original `TestShortWrite` had a triple close on the socket fd (`defer syscall.Close`, an explicit `unix.Close`, and the `os.NewFile` GC finalizer). After the first close, the kernel could reuse the fd number, so later closes would clobber unrelated fds in other tests. Fixed by consolidating ownership into a single `t.Cleanup` that does `f.Close()`.

### Motivation

`pkg/network/tracer.TestTracerSuite` has many flaky test failures in CI.

### Describe how you validated your changes

1. `TestShortWrite` connection-not-found

```
# Reproduced by running the following on a vagrant VM.
# Validated after the fix was applied that this failure no longer occurred.
sudo go test -v -count 1000 -failfast -tags linux_bpf,test ./pkg/network/tracer -run 'TestTracerSuite/eBPFless/TestShortWrite'
=== RUN   TestTracerSuite/eBPFless/TestShortWrite
1771024760834621303 [Warn] not starting resolv.conf container store, because it depends on process event monitoring which is disabled
    tracer_linux_test.go:1979: sent: 5000
    tracer_linux_test.go:1979: sent: 10000
    tracer_linux_test.go:1979: sent: 15000
    tracer_linux_test.go:1979: sent: 20000
    tracer_linux_test.go:1979: sent: 25000
    tracer_linux_test.go:1979: sent: 30000
    tracer_linux_test.go:1979: sent: 35000
    tracer_linux_test.go:1979: sent: 40000
    tracer_linux_test.go:1979: sent: 42741
    tracer_linux_test.go:1997:
        Error Trace: /git/datadog-agent/pkg/network/tracer/tracer_linux_test.go:2001
                     /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.7.linux-arm64/src/runtime/asm_arm64.s:1268
        Error:       Should be true
    tracer_linux_test.go:1997:
        Error Trace: /git/datadog-agent/pkg/network/tracer/tracer_linux_test.go:1997
        Error:       Condition never satisfied
        Test:        TestTracerSuite/eBPFless/TestShortWrite
        Messages:    couldn't find connection used by short write
--- FAIL: TestTracerSuite (10.16s)
    --- FAIL: TestTracerSuite/eBPFless (10.16s)
        --- FAIL: TestTracerSuite/eBPFless/TestShortWrite (10.16s)
```

2. `TestShortWrite` resource temporarily unavailable

```
# Reproduced by running the following on a vagrant VM (with fix for #1)
# Validated after the fix was applied that this failure no longer occurred.
sudo go test -v -count 1000 -failfast -tags linux_bpf,test ./pkg/network/tracer -run 'TestTracerSuite/eBPFless/TestShortWrite'
=== RUN   TestTracerSuite/eBPFless/TestShortWrite
1771025049854166657 [Warn] not starting resolv.conf container store, because it depends on process event monitoring which is disabled
    tracer_linux_test.go:1981: sent: 5000
    tracer_linux_test.go:1981: sent: 10000
    tracer_linux_test.go:1981: sent: 15000
    tracer_linux_test.go:1978:
        Error Trace: /git/datadog-agent/pkg/network/tracer/tracer_linux_test.go:1978
        Error:       Received unexpected error: resource temporarily unavailable
        Test:        TestTracerSuite/eBPFless/TestShortWrite
--- FAIL: TestTracerSuite (1.22s)
    --- FAIL: TestTracerSuite/eBPFless (1.22s)
        --- FAIL: TestTracerSuite/eBPFless/TestShortWrite (1.21s)
```

3. `TestTCPRTT` write: bad file descriptor

```
# Reproduced by running the following on a vagrant VM (with fixes for #1 and #2)
# Validated after the fix was applied that this failure no longer occurred.
sudo go test -v -count 100 -failfast -tags linux_bpf,test ./pkg/network/tracer -run 'TestTracerSuite/eBPFless/(TestShortWrite|TestTCPRTT)'
=== RUN   TestTracerSuite/eBPFless/TestTCPRTT
1771024042054119651 [Warn] not starting resolv.conf container store, because it depends on process event monitoring which is disabled
    tracer_linux_test.go:316:
        Error Trace: /git/datadog-agent/pkg/network/tracer/tracer_linux_test.go:316
        Error:       Received unexpected error: write tcp 127.0.0.1:44304->127.0.0.1:43449: write: bad file descriptor
        Test:        TestTracerSuite/eBPFless/TestTCPRTT
--- FAIL: TestTracerSuite (1.30s)
    --- FAIL: TestTracerSuite/eBPFless (1.30s)
        --- PASS: TestTracerSuite/eBPFless/TestShortWrite (0.15s)
        --- FAIL: TestTracerSuite/eBPFless/TestTCPRTT (1.16s)
```

Co-authored-by: jim.wilson <jim.wilson@datadoghq.com>
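The high-water-mark accounting described in fix 1 can be sketched roughly as follows. This is an illustration only, not the agent's `updateTCPStats()`: the `sentTracker` type, its fields, and the `observe` and `seqDiff` helpers are hypothetical names chosen for the example, and the sketch assumes sequence numbers are compared with 32-bit wraparound in mind.

```go
package main

import "fmt"

// sentTracker is a hypothetical stand-in for per-connection state; the real
// updateTCPStats in the agent may track this differently.
type sentTracker struct {
	initialized bool
	highWater   uint32 // sequence number just past the last byte already counted
	sentBytes   uint64
}

// seqDiff returns a-b as a signed distance, which tolerates 32-bit
// sequence number wraparound.
func seqDiff(a, b uint32) int32 { return int32(a - b) }

// observe accounts for a segment that starts at seq and carries payloadLen
// bytes. Only bytes beyond the high-water mark are added, so full or
// partially overlapping retransmits are not counted twice.
func (t *sentTracker) observe(seq, payloadLen uint32) {
	end := seq + payloadLen
	if !t.initialized {
		t.initialized = true
		t.highWater = end
		t.sentBytes += uint64(payloadLen)
		return
	}
	if d := seqDiff(end, t.highWater); d > 0 {
		newBytes := uint32(d)
		if newBytes > payloadLen {
			newBytes = payloadLen // never count more than the segment carries
		}
		t.sentBytes += uint64(newBytes)
		t.highWater = end
	}
}

func main() {
	var t sentTracker
	t.observe(1000, 500) // new data: 500 bytes
	t.observe(1200, 500) // retransmit overlapping 300 bytes, 200 new
	t.observe(1000, 500) // pure retransmit, nothing new
	fmt.Println(t.sentBytes) // 700
}
```

A naive counter that adds every payload length would report 1500 bytes for the same three segments, which is the kind of overcount the fix is described as removing.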

Summary of Changes

HIGH Priority Issues Fixed:

#1: Write lock held across network I/O (impl/delegatedauth.go:270)
- Refactored refreshAndGetAPIKey to release the lock before making network calls (authenticate)
- The lock is now only held briefly to check/update state, not during network I/O

#2: Context not propagated to signer.SignHTTP (aws.go:195)
- Updated generateAwsAuthData to accept a context parameter
- Changed signer.SignHTTP(context.Background(), ...) to signer.SignHTTP(ctx, ...)

#3: Context not propagated to getCredentials IMDS call (aws.go:119)
- Updated getCredentials to accept a context parameter
- Removed ctx := context.Background() and now uses the passed context for IMDS calls

MEDIUM Priority Issues Fixed:

#4: No response body size limit (api/delegated_auth.go:97)
- Added maxResponseBodySize = 1 * 1024 * 1024 constant (1 MB)
- Wrapped response body with io.LimitReader to prevent memory exhaustion

#5: No overall HTTP client timeout (api/delegated_auth.go:82)
- Added httpClientTimeout = 30 * time.Second constant
- Added Timeout: httpClientTimeout to the HTTP client

#6: config.Set called while holding write lock (impl/delegatedauth.go:341)
- Moved updateConfigWithAPIKey call outside the lock in startBackgroundRefresh
- Captured the API key while holding the lock, then released it before calling config.Set

#7: Blocking IMDS calls while holding write lock (impl/delegatedauth.go:127)
- Refactored initializeIfNeeded to perform cloud detection without holding locks
- IMDS calls now happen outside any lock, then state is updated with a brief write lock

#8: Regex fails silently for non-standard formats (api/delegated_auth.go:36)
- Added debug log when endpoint doesn't match known Datadog domain pattern
- Updated function documentation to clarify behavior

#9: Uncached IMDS credential fetch (aws.go:104)
- Added documentation explaining the trade-off (refresh interval is typically 60 minutes, so caching is not critical)

#10: Auth proof format undocumented (aws.go:98)
- Added detailed comment documenting the auth proof format: <base64-body>|<base64-headers>|<method>|<base64-url>

LOW Priority Issues Fixed:

#11: Unnecessarily exported types (aws.go)
- Changed SigningData to signingData (unexported)
- Changed AWSAuth.AwsRegion to AWSAuth.region (unexported)
- Updated all references in aws.go and aws_test.go

#12: Tests exercise copy of goroutine (impl/delegatedauth_test.go:19)
- Added documentation explaining why tests use a simplified goroutine pattern
- Clarified that integration tests cover the actual startBackgroundRefresh function

#13: Subsequent Config param silently ignored (def/delegatedauth.go:24)
- Updated documentation to clearly state that only the first Config is used
- Added warning log when a different Config is passed on subsequent calls

### What does this PR do?

Adds on-disk persistence for Health Platform issues so they survive agent restarts. The component now writes a JSON state file under `<run_path>/health-platform/issues.json`, restores issues on startup, and tracks per-check issue lifecycle state (`new`, `ongoing`, `resolved`) with timestamps.

### Motivation

Health issues were previously stored only in memory, so restarting the agent cleared the current health view and made troubleshooting harder. Persisting issues improves continuity for diagnostics, local endpoint visibility, and support workflows.

### Describe how you validated your changes

CI + manual QA:

```
cat /opt/datadog-agent/run/health-platform/issues.json
{
  "updated_at": "2026-02-12T13:31:47Z",
  "issues": {
    "docker-socket-permissions": {
      "issue_id": "docker-file-tailing-disabled",
      "state": "new",
      "first_seen": "2026-02-12T13:31:47Z",
      "last_seen": "2026-02-12T13:31:47Z"
    }
  }
}
```

```
cat /opt/datadog-agent/run/health-platform/issues.json
{
  "updated_at": "2026-02-12T14:26:51Z",
  "issues": {
    "docker": {
      "issue_id": "check-execution-failure",
      "state": "ongoing",
      "first_seen": "2026-02-12T13:35:25Z",
      "last_seen": "2026-02-12T14:26:51Z"
    },
    "docker-socket-permissions": {
      "issue_id": "docker-file-tailing-disabled",
      "state": "resolved",
      "first_seen": "2026-02-12T13:31:47Z",
      "last_seen": "2026-02-12T13:31:47Z",
      "resolved_at": "2026-02-12T14:26:33Z"
    },
    "logs-docker-file-permissions": {
      "issue_id": "docker-file-tailing-disabled",
      "state": "ongoing",
      "first_seen": "2026-02-12T13:35:08Z",
      "last_seen": "2026-02-12T13:35:08Z"
    }
  }
}
```

### Additional Notes

Writes are atomic (temp file + rename). Persistence is updated on issue updates and clears. Issues restored from disk are rebuilt from the registry; resolved issues are not rehydrated into the active issues map.

#### Workflow example:

Agent start #1:
- Detect issue 1 → {issue_id: "issue-1", state: "new", first_seen: T1}
- Detect issue 2 → {issue_id: "issue-2", state: "new", first_seen: T1}
- Later, detect issue 3 → {issue_id: "issue-3", state: "new", first_seen: T2}
- Issue 1 detected again → {state: "ongoing", last_seen: T3}
- Issue 2 resolved → {state: "resolved", resolved_at: T4}

Agent stops

Agent start #2:
- Load file, restore active issues (issue 1, issue 3)
- Check runs, issue 1 is now resolved → {state: "resolved", resolved_at: T5}
- Check runs, issue 3 still present → {state: "ongoing", last_seen: T5}
- Final state: issue 1 resolved, issue 2 resolved, issue 3 ongoing ✓

Co-authored-by: louis.coquerelle <louis.coquerelle@datadoghq.com>

- Run benchmarks across comp/dogstatsd/server, pkg/aggregator, and comp/forwarder/defaultforwarder; results saved to plans/bench-baseline-forwarder.txt
- Generated pprof profiles (mem.out, cpu.out) from aggregator flush benchmarks
- Created scripts/profile_pipeline.sh documenting exact reproduction commands
- Documented top 10 allocation sites and CPU hotspots in plans/profiling-baseline.md
- Added benchmark test files for aggregator (time_sampler, context_resolver), forwarder, and dogstatsd/server that will be used in subsequent stories
- Typecheck passes (go build ./comp/... ./pkg/aggregator/... ./comp/forwarder/...)

Key findings:
- pkg/metrics.(*Gauge).flush is #1 allocator (26.97% of objects); target US-004
- contextResolver.trackContext is #4 allocator and 26.58% cumulative CPU; target US-003
- GC overhead accounts for ~22% of CPU; directly reducible via alloc reduction
- Forwarder: 15 allocs/op per transaction creation; target US-007

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

### What does this PR do?

Adds an explicit `--platform=linux/amd64` run argument when the devcontainer image is an amd64 image, so Docker runs it with the correct architecture.

### Motivation

On hosts where Docker may default to a different platform (e.g., Apple Silicon), amd64 devcontainer images can be pulled/run with the wrong architecture/emulation unless the platform is specified.

### Describe how you validated your changes

- Ran the devcontainer setup task and verified the generated `runArgs` includes `--platform=linux/amd64` when using an amd64 devcontainer image.

```
28976e101240:/workspaces/datadog-agent-primary$ uname -a
Linux 28976e101240 6.12.72-linuxkit #1 SMP Mon Feb 16 11:19:07 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
```

### Additional Notes

- Change is limited to `tasks/devcontainer.py` and only applies when `"amd64"` is present in the image name.

Co-authored-by: wassim.dhif <wassim.dhif@datadoghq.com>

…parity

Three fixes that bring optimized output from 18/52 to 49/52 matches (0 mismatches, 3 skipped VLAN):

1. vstore: remove erroneous IsConst check. The C vstore NOPs a statement when the new value equals the existing value, regardless of whether the value is a constant. Our version had an extra os.IsConst(newval) condition that prevented NOP-ing redundant packet loads (e.g., duplicate ldh [12] across blocks).

2. FindInedges: only add edges from reachable blocks. After jump threading, unreachable blocks still have stale JT/JF pointers to reachable blocks. These phantom predecessors cause value merging to produce VAL_UNKNOWN, preventing redundant load elimination. Fix: iterate levels[] lists (populated by FindLevels with reachable blocks only) instead of the full blocks[] array.

3. GenMulticast: use JSET directly for Ethernet multicast check, matching C libpcap's gen_mac_multicast() which generates "ldb [0]; jset #1" instead of "ldb [0]; and #1; jeq #1".

Golden test results (optimized, vs C filtertest default):
- 49/52 exact instruction-for-instruction matches
- 0 mismatches
- 3 skipped (VLAN not implemented)
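For fix 3, the two instruction sequences test the same bit of the first destination address byte, so they accept the same frames; JSET just does it in one fewer instruction. A small Go model of both sequences, with hypothetical function names, makes the equivalence easy to check:

```go
package main

import "fmt"

// isMulticastAndJeq models "ldb [0]; and #1; jeq #1".
func isMulticastAndJeq(frame []byte) bool {
	a := frame[0] // ldb [0]: load the first byte of the destination MAC
	a &= 1        // and #1: keep only the group (multicast) bit
	return a == 1 // jeq #1: branch if the masked value equals 1
}

// isMulticastJset models "ldb [0]; jset #1".
func isMulticastJset(frame []byte) bool {
	a := frame[0]   // ldb [0]
	return a&1 != 0 // jset #1: branch if any bit of the mask is set
}

func main() {
	multicast := []byte{0x01, 0x00, 0x5e, 0x00, 0x00, 0x01}
	unicast := []byte{0x00, 0x1b, 0x21, 0x3c, 0x4d, 0x5e}
	fmt.Println(isMulticastAndJeq(multicast), isMulticastJset(multicast)) // true true
	fmt.Println(isMulticastAndJeq(unicast), isMulticastJset(unicast))     // false false
}
```

Because the mask has a single bit, "masked value equals 1" and "any mask bit set" are the same condition for every possible byte.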

…w options

- Move partitionEnumInFlight.Store(false) before the channel send so sequential calls within the same check run never hit the guard. The previous receiver-side Store introduced a race where call #1's defer could clear the flag while call #2 was active.
- Disable tag_by_physical_storage and collect_physical_metrics at config time on Windows, where gopsutil ignores the all parameter and both syscalls return identical results.
- Add new options to conf.yaml.default with platform support notes.
- Add Windows gate test.

…ctions

Replace the monolithic batcher (5 ring buffers sharing one transport) with a generic pipeline[T] struct. Each pipeline owns its own ring buffers, flush goroutines, and dedicated UDS connection:

    metricsPipeline = pipeline[metricPoint] + unixConn #1
    logsPipeline = pipeline[logEntry] + unixConn #2
    tracePipeline = pipeline[capturedTraceStat] + unixConn #3

Pipelines are fully independent: one slow pipeline (e.g. logs sending large frames) cannot block or starve another.

Key changes:
- pipeline[T]: generic struct with AddEntry(T), AddContextDef, Stop. 1-2 flush goroutines per pipeline (entries + optional contexts). flushChunked reused unchanged.
- unixConn: simple per-connection transport replacing pooledTransport. Lazy dial, mutex held during Send, reconnect once on error.
- activate(): creates 3 pipelines with 3 independent connections. sync.Once coordinates teardown when any transport disconnects.

Testbench results (all 0 drops):
- dogstatsd-p99: 2.8M metrics sent
- logs-high-throughput: 40M logs at 10 MiB/s, 3 GB Parquet
- metrics-logs-combined: 743K metrics + 1M logs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
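A heavily simplified sketch of the per-type pipeline idea follows. It keeps only the names `pipeline[T]`, `AddEntry`, and `Stop` from the commit message; everything else is an assumption for illustration. In particular, a buffered channel stands in for the ring buffers, a `send` callback stands in for the dedicated `unixConn`, and `AddContextDef` and `flushChunked` are left out.

```go
package main

import (
	"fmt"
	"sync"
)

// pipeline is a simplified, hypothetical version of the generic struct: one
// queue, one flush goroutine, and one sender that no other pipeline shares.
type pipeline[T any] struct {
	entries chan T
	send    func(batch []T) // stands in for the pipeline's own connection
	batch   int
	wg      sync.WaitGroup
}

func newPipeline[T any](capacity, batch int, send func([]T)) *pipeline[T] {
	p := &pipeline[T]{entries: make(chan T, capacity), send: send, batch: batch}
	p.wg.Add(1)
	go p.flushLoop()
	return p
}

// AddEntry enqueues without blocking and reports false when the queue is
// full. It must not be called after Stop.
func (p *pipeline[T]) AddEntry(e T) bool {
	select {
	case p.entries <- e:
		return true
	default:
		return false
	}
}

// flushLoop batches entries and hands each full batch to send.
func (p *pipeline[T]) flushLoop() {
	defer p.wg.Done()
	buf := make([]T, 0, p.batch)
	for e := range p.entries {
		buf = append(buf, e)
		if len(buf) == p.batch {
			p.send(buf)
			buf = make([]T, 0, p.batch)
		}
	}
	if len(buf) > 0 {
		p.send(buf) // flush the final partial batch on shutdown
	}
}

// Stop closes the queue and waits for the flush goroutine to drain it.
func (p *pipeline[T]) Stop() {
	close(p.entries)
	p.wg.Wait()
}

func main() {
	sent := 0
	p := newPipeline[int](16, 4, func(b []int) { sent += len(b) })
	for i := 0; i < 10; i++ {
		p.AddEntry(i)
	}
	p.Stop()
	fmt.Println(sent) // 10
}
```

Because each pipeline has its own queue, goroutine, and sender, a slow `send` in one instance only backs up that instance's queue, which is the independence property the commit message describes.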

…ll detectors

PR 49939 had 30 iters, all detector-targeted (bocpd/scanmw/scanwelch tweaks). The proposer's diversity guideline ("at least 3 distinct families") doesn't enforce CATEGORY diversity: three families of detector tweaks still satisfies "diverse families" while staying in the same structural surface.

Adds a runtime check: if every candidate in the last 10 had only detector-style target_components (no name containing 'correlator' or 'extractor'), inject a STRUCTURAL DIVERSITY REQUIRED clause demanding at least one correlator candidate this round. Auto-disables once correlator candidates appear in recent history.

Cooperates with operator steering (#1+#2): explicit user directives still take precedence; this is the autonomous backup when the operator isn't watching.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
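The runtime check described above amounts to a window scan over recent candidates. A minimal sketch, assuming a hypothetical `candidate` type with a `TargetComponents` field and assuming the clause is only injected once a full window of history exists (the commit message does not say how shorter histories are handled):

```go
package main

import (
	"fmt"
	"strings"
)

// candidate is a hypothetical record of one proposed iteration.
type candidate struct {
	TargetComponents []string
}

// isStructural reports whether any target component name contains
// "correlator" or "extractor", i.e. the candidate is not detector-only.
func isStructural(c candidate) bool {
	for _, name := range c.TargetComponents {
		if strings.Contains(name, "correlator") || strings.Contains(name, "extractor") {
			return true
		}
	}
	return false
}

// needsStructuralDiversity returns true when each of the last window
// candidates targeted only detector-style components. Assumption: with
// fewer than window candidates the check stays off.
func needsStructuralDiversity(history []candidate, window int) bool {
	if len(history) < window {
		return false
	}
	for _, c := range history[len(history)-window:] {
		if isStructural(c) {
			return false
		}
	}
	return true
}

func main() {
	var history []candidate
	for i := 0; i < 10; i++ {
		history = append(history, candidate{TargetComponents: []string{"bocpd"}})
	}
	fmt.Println(needsStructuralDiversity(history, 10)) // true

	history = append(history, candidate{TargetComponents: []string{"time_correlator"}})
	fmt.Println(needsStructuralDiversity(history, 10)) // false
}
```

The second call shows the auto-disable behavior: once a correlator candidate enters the recent window, the function stops asking for the extra clause.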

masci force-pushed the massi/check_runner_logic branch from 100aadd to fdbef7a Compare May 23, 2016 13:33

remh merged commit 7123bd7 into master May 23, 2016

masci deleted the massi/check_runner_logic branch May 30, 2016 14:17

masci added a commit that referenced this pull request Feb 12, 2017

experiment #1

01ed162

blakeneyops mentioned this pull request Dec 13, 2017

[logs-agent] Container discovery via label not working #946

Closed

storytime mentioned this pull request Dec 10, 2018

HTTP check encoding bug #2758

Closed

hkaj mentioned this pull request Apr 1, 2019

Explore Agent's rewrite options #3236

Closed

masci added a commit that referenced this pull request Apr 1, 2019

Merge pull request #1 from Alien1993/code-formatting

4f40c65

Added code formatting style for clang-format

hush-hush pushed a commit that referenced this pull request Apr 17, 2019

Merge pull request #1 from Alien1993/code-formatting

96b4c90

Added code formatting style for clang-format

prognant mentioned this pull request Jan 31, 2020

Embed templates inside dd agent binary #4782

Merged

safchain added a commit to safchain/datadog-agent that referenced this pull request May 5, 2020

Merge pull request DataDog#1 from lebauce/use-agent-logging

32d12af

Use agent logging module

safchain added a commit to safchain/datadog-agent that referenced this pull request Jun 4, 2020

Merge pull request DataDog#1 from lebauce/open-benchmarks

8fe7fff

Add benchmarks on open

brycekahle mentioned this pull request Sep 21, 2021

[networks] Improve memory allocation pattern #9206

Merged

5 tasks

xinfenglee mentioned this pull request Mar 19, 2022

invoke system-probe.build fail! #11371

Closed

DharveshAtish mentioned this pull request Jun 30, 2022

polling metrics postgresql.locks through datadog agent #12588

Closed

mnot mentioned this pull request Mar 11, 2023

[BUG] systemd unit checks failing; old godbus dependency? #16058

Closed

AliDatadog added a commit that referenced this pull request Apr 26, 2023

fix integration tests attempt #1

3af4750

iglendd added a commit that referenced this pull request Apr 26, 2023

Fix lint issue #1

32c2313

akarpz mentioned this pull request Oct 26, 2023

do DNS lookups in require.eventually #20444

Closed

10 tasks

leeavital mentioned this pull request Nov 6, 2023

migrate snooper tests to use new local DNS server #20615

Merged

10 tasks

iglendd mentioned this pull request Apr 30, 2024

Improve agent starting and simultaneously restarting (common for Agent install) concurrency handling #25282

Merged

iglendd mentioned this pull request May 8, 2024

WINA-747 Complete solution to solve startup/shutdown race condition #25453

Merged

karlhenselin mentioned this pull request Feb 7, 2025

[BUG] Deadlock / BLOCKED threads starting tomcat with dd-agent #33821

Closed

DanielLavie mentioned this pull request Apr 15, 2025

SUSM-94: Handle Same-Node K8s Service #35644

Closed

mbertrone mentioned this pull request Feb 16, 2026

flatten ActivityTreeNodeStats to reduce heap allocations #46469

Closed

BaptisteFoy mentioned this pull request Feb 24, 2026

fix(fleet/installer): fix ~90s daemon stop delay during installer setup #46757

Closed

matt-dz mentioned this pull request Feb 26, 2026

Implement Agent Safe Shell — POSIX commands as safe builtins #46945

Closed

4 tasks

AliDatadog mentioned this pull request Mar 9, 2026

Add new guidelines to AGENTS.md #47576

Merged

BaptisteFoy mentioned this pull request Mar 16, 2026

fix(fleet): Skip subprocess for getStates on Windows to fix OOM errors #47875

Merged

chatgpt-codex-connector Bot mentioned this pull request Apr 1, 2026

Improving Disk Metrics: Distinguishing Real Disks from Pseudo-Filesystems #48766

Merged

This was referenced Apr 17, 2026

fix(gohai): fall back to numeric UID when username lookup fails #49557

Closed

[PRMS-3140] fall back to numeric UID when username lookup fails in Gohai #49559

Merged

julesmcrt mentioned this pull request Apr 23, 2026

[ACTP] PAR: rshell allow-list redesign + per-env paths #49825

Open

ellataira mentioned this pull request Apr 23, 2026

Coordinator run log — observer AD iteration #49678

Draft

chenww mentioned this pull request Apr 24, 2026

[BUG] #49859

Closed

julesmcrt mentioned this pull request Apr 27, 2026

[ACTP] PAR: fix rshell allow-list intersection bugs #49945

Open

ellataira mentioned this pull request Apr 27, 2026

coord run-log (full) — 2026-04-27 16:44 #49939

Draft

This was referenced Apr 28, 2026

coord run-log (full) — 2026-04-28 14:55 #50011

Draft

coord run-log (full) — 2026-04-28 14:59 #50013

Draft

remh merged 1 commit into master from massi/check_runner_logic