
[core] Check runner #1

Merged
remh merged 1 commit into master from massi/check_runner_logic
May 23, 2016

Contributor

@masci masci commented May 12, 2016

This is a fairly large PR that implements a concurrent architecture for executing Agent checks and embeds a CPython interpreter to run Python checks transparently.

More details here: https://github.com/DataDog/datadog-agent/wiki/Python-Check-Runner

runner logic

added yaml dep

load configuration files for checks

better code layout

do not use cgo when possible

using Check interface, begin split source modules

moar ignores

better check and tests

fixed import paths, added tests

forgot testcase

split function, better error cleaning, tests

pass the path to conf dir

spare some boilerplate

docs

test fixtures

get init_config from kwargs

minor

changed Config type

convert YAML output to python dict

removed debug prints, fixed tests

more tests

added basic benchmark

fixed tests, temp solution to lock goroutine to OS thread

run go_expvar check

temporarily use the DD API to post metrics directly from the Go binary

post real metrics to staging

run processes check

use Dogstatsd instead of DD api

fixed histogram payload

extracted common interface for checks

run the Go checks in the same scheduler as Python checks

better code layout

docs
@masci masci force-pushed the massi/check_runner_logic branch from 100aadd to fdbef7a Compare May 23, 2016 13:33
@remh remh merged commit 7123bd7 into master May 23, 2016
@masci masci deleted the massi/check_runner_logic branch May 30, 2016 14:17
masci added a commit that referenced this pull request Feb 12, 2017
masci added a commit that referenced this pull request Apr 1, 2019
Added code formatting style for clang-format
hush-hush pushed a commit that referenced this pull request Apr 17, 2019
Added code formatting style for clang-format
safchain added a commit to safchain/datadog-agent that referenced this pull request May 5, 2020
safchain added a commit to safchain/datadog-agent that referenced this pull request Jun 4, 2020
truthbk added a commit that referenced this pull request Sep 18, 2020
* Add partial flags when tailing k8s container logs

* parser interface compliance

* Adjust UT

* [logs] Extract parsing from lineHandler logic

* [logs] Adjust UT, minor fixes

* [logs] UT covering previous changes

* [logs] fix async test

* Docker parser handles partial itself

* Rebase

* Address reviews

* Address reviews

* [logs] rebase collateral fixes

Co-authored-by: Jaime Fullaondo <jaime.fullaondo@datadoghq.com>
AliDatadog added a commit that referenced this pull request Apr 26, 2023
iglendd added a commit that referenced this pull request Apr 26, 2023
@akarpz akarpz mentioned this pull request Oct 26, 2023
10 tasks
gh-worker-dd-mergequeue-cf854d Bot pushed a commit that referenced this pull request Feb 18, 2026
### What does this PR do?

Fixes several flaky eBPFless tests in `TestTracerSuite`.

1. **`TestTracerSuite/eBPFless/TestShortWrite` — "couldn't find connection used by short write"**
   * (Real bug.)  Fixes `updateTCPStats()` so it no longer overcounts `SentBytes` when the kernel retransmits TCP segments that partially overlap with previously-counted data. We now only add the new bytes beyond the sequence high‑water mark.

2. **`TestTracerSuite/eBPFless/TestShortWrite` — "resource temporarily unavailable"**
   * (Test-only bug.)  Fixes the `TestShortWrite` write loop to correctly handle the case where the send buffer is completely full and a write returns zero bytes with `resource temporarily unavailable`.

3. **`TestTracerSuite/eBPFless/TestTCPRTT` — "write: bad file descriptor"**
   * (Test-only bug.)  Fixes an fd lifecycle issue in `TestShortWrite`: the original `TestShortWrite` closed the socket fd three times (`defer syscall.Close`, an explicit `unix.Close`, and the `os.NewFile` GC finalizer). After the first close, the kernel could reuse the fd number, so later closes would clobber unrelated fds in other tests. Fixed by consolidating ownership into a single `t.Cleanup` that calls `f.Close()`.
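
The high-water-mark accounting described in item 1 could look roughly like the sketch below. This is not the actual `updateTCPStats()` code: the `sentTracker` type, its fields, and the `record` method are hypothetical names, and the sketch assumes the caller passes the segment's starting sequence number and payload length.

```go
package main

// sentTracker is a hypothetical stand-in for per-connection TCP send state.
type sentTracker struct {
	started   bool
	highWater uint32 // sequence number just past the highest byte counted so far
	SentBytes uint64
}

// record accounts for a segment carrying payloadLen bytes starting at seq.
// Only bytes beyond the high-water mark are added, so full or partially
// overlapping retransmissions are not counted twice.
func (s *sentTracker) record(seq, payloadLen uint32) {
	if payloadLen == 0 {
		return
	}
	end := seq + payloadLen // wraps modulo 2^32, like TCP sequence numbers
	if !s.started {
		s.started = true
		s.highWater = end
		s.SentBytes += uint64(payloadLen)
		return
	}
	// The signed difference keeps the comparison correct across wraparound.
	advance := int32(end - s.highWater)
	if advance <= 0 {
		return // segment lies entirely at or below the high-water mark
	}
	newBytes := uint32(advance)
	if newBytes > payloadLen {
		newBytes = payloadLen // gap before seq: count only what this segment carries
	}
	s.SentBytes += uint64(newBytes)
	s.highWater = end
}
```

With this shape, a 500-byte segment at sequence 1000 followed by a 500-byte retransmission at sequence 1200 contributes 700 bytes in total rather than 1000.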

### Motivation

`pkg/network/tracer.TestTracerSuite` has many flaky test failures in CI. 

### Describe how you validated your changes

1. `TestShortWrite` connection‑not‑found

```
# Reproduced by running the following on a vagrant VM.
# Validated after the fix was applied that this failure no longer occurred.

sudo go test -v -count 1000 -failfast -tags linux_bpf,test ./pkg/network/tracer -run 'TestTracerSuite/eBPFless/TestShortWrite'

=== RUN   TestTracerSuite/eBPFless/TestShortWrite
1771024760834621303 [Warn] not starting resolv.conf container store, because it depends on process event monitoring which is disabled
    tracer_linux_test.go:1979: sent: 5000
    tracer_linux_test.go:1979: sent: 10000
    tracer_linux_test.go:1979: sent: 15000
    tracer_linux_test.go:1979: sent: 20000
    tracer_linux_test.go:1979: sent: 25000
    tracer_linux_test.go:1979: sent: 30000
    tracer_linux_test.go:1979: sent: 35000
    tracer_linux_test.go:1979: sent: 40000
    tracer_linux_test.go:1979: sent: 42741
    tracer_linux_test.go:1997:
        	Error Trace:	/git/datadog-agent/pkg/network/tracer/tracer_linux_test.go:2001
        	            				/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.7.linux-arm64/src/runtime/asm_arm64.s:1268
        	Error:      	Should be true
    tracer_linux_test.go:1997:
        	Error Trace:	/git/datadog-agent/pkg/network/tracer/tracer_linux_test.go:1997
        	Error:      	Condition never satisfied
        	Test:       	TestTracerSuite/eBPFless/TestShortWrite
        	Messages:   	couldn't find connection used by short write
--- FAIL: TestTracerSuite (10.16s)
    --- FAIL: TestTracerSuite/eBPFless (10.16s)
        --- FAIL: TestTracerSuite/eBPFless/TestShortWrite (10.16s)
```

2. `TestShortWrite` resource temporarily unavailable

```
# Reproduced by running the following on a vagrant VM (with fix for #1)
# Validated after the fix was applied that this failure no longer occurred.

sudo go test -v -count 1000 -failfast -tags linux_bpf,test ./pkg/network/tracer -run 'TestTracerSuite/eBPFless/TestShortWrite'

=== RUN   TestTracerSuite/eBPFless/TestShortWrite
1771025049854166657 [Warn] not starting resolv.conf container store, because it depends on process event monitoring which is disabled
    tracer_linux_test.go:1981: sent: 5000
    tracer_linux_test.go:1981: sent: 10000
    tracer_linux_test.go:1981: sent: 15000
    tracer_linux_test.go:1978:
        	Error Trace:	/git/datadog-agent/pkg/network/tracer/tracer_linux_test.go:1978
        	Error:      	Received unexpected error:
        	            	resource temporarily unavailable
        	Test:       	TestTracerSuite/eBPFless/TestShortWrite
--- FAIL: TestTracerSuite (1.22s)
    --- FAIL: TestTracerSuite/eBPFless (1.22s)
        --- FAIL: TestTracerSuite/eBPFless/TestShortWrite (1.21s)

```
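
A minimal sketch of a write loop that tolerates a completely full send buffer, in the spirit of fix 2 above. The `writeAll` helper and its `write` and `wait` callbacks are assumptions for illustration, not the test's real code; `write` stands in for a non-blocking socket write and `wait` for whatever back-off the caller prefers (polling for writability, a short sleep).

```go
package main

import (
	"errors"
	"syscall"
)

// errWouldBlock is the error a non-blocking write reports when the send
// buffer is full ("resource temporarily unavailable").
var errWouldBlock error = syscall.EAGAIN

// writeAll keeps calling write until all of buf has been accepted.
// write and wait are hypothetical callbacks supplied by the caller.
func writeAll(write func(p []byte) (int, error), wait func(), buf []byte) (int, error) {
	sent := 0
	for sent < len(buf) {
		n, err := write(buf[sent:])
		if n > 0 {
			sent += n // a short write still makes progress
		}
		if err != nil {
			if errors.Is(err, syscall.EAGAIN) || errors.Is(err, syscall.EINTR) {
				wait() // buffer full or interrupted: back off, then retry
				continue
			}
			return sent, err
		}
	}
	return sent, nil
}
```

The loop treats `EAGAIN` as "try again later" instead of a test failure, while any other error is still returned to the caller along with the number of bytes already sent.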

3. `TestTCPRTT` write: bad file descriptor

```
# Reproduced by running the following on a vagrant VM (with fixes for #1 and #2)
# Validated after the fix was applied that this failure no longer occurred.

sudo go test -v -count 100 -failfast -tags linux_bpf,test ./pkg/network/tracer -run 'TestTracerSuite/eBPFless/(TestShortWrite|TestTCPRTT)'

=== RUN   TestTracerSuite/eBPFless/TestTCPRTT
1771024042054119651 [Warn] not starting resolv.conf container store, because it depends on process event monitoring which is disabled
    tracer_linux_test.go:316:
        	Error Trace:	/git/datadog-agent/pkg/network/tracer/tracer_linux_test.go:316
        	Error:      	Received unexpected error:
        	            	write tcp 127.0.0.1:44304->127.0.0.1:43449: write: bad file descriptor
        	Test:       	TestTracerSuite/eBPFless/TestTCPRTT
--- FAIL: TestTracerSuite (1.30s)
    --- FAIL: TestTracerSuite/eBPFless (1.30s)
        --- PASS: TestTracerSuite/eBPFless/TestShortWrite (0.15s)
        --- FAIL: TestTracerSuite/eBPFless/TestTCPRTT (1.16s)
```



Co-authored-by: jim.wilson <jim.wilson@datadoghq.com>
wynbennett added a commit that referenced this pull request Feb 23, 2026
Summary of Changes

  HIGH Priority Issues Fixed:

  #1: Write lock held across network I/O (impl/delegatedauth.go:270)
  - Refactored refreshAndGetAPIKey to release the lock before making network calls (authenticate)
  - The lock is now only held briefly to check/update state, not during network I/O

  #2: Context not propagated to signer.SignHTTP (aws.go:195)
  - Updated generateAwsAuthData to accept a context parameter
  - Changed signer.SignHTTP(context.Background(), ...) to signer.SignHTTP(ctx, ...)

  #3: Context not propagated to getCredentials IMDS call (aws.go:119)
  - Updated getCredentials to accept a context parameter
  - Removed ctx := context.Background() and now uses the passed context for IMDS calls

  MEDIUM Priority Issues Fixed:

  #4: No response body size limit (api/delegated_auth.go:97)
  - Added maxResponseBodySize = 1 * 1024 * 1024 constant (1 MB)
  - Wrapped response body with io.LimitReader to prevent memory exhaustion

  #5: No overall HTTP client timeout (api/delegated_auth.go:82)
  - Added httpClientTimeout = 30 * time.Second constant
  - Added Timeout: httpClientTimeout to the HTTP client

  #6: config.Set called while holding write lock (impl/delegatedauth.go:341)
  - Moved updateConfigWithAPIKey call outside the lock in startBackgroundRefresh
  - Captured the API key while holding the lock, then released it before calling config.Set

  #7: Blocking IMDS calls while holding write lock (impl/delegatedauth.go:127)
  - Refactored initializeIfNeeded to perform cloud detection without holding locks
  - IMDS calls now happen outside any lock, then state is updated with a brief write lock

  #8: Regex fails silently for non-standard formats (api/delegated_auth.go:36)
  - Added debug log when endpoint doesn't match known Datadog domain pattern
  - Updated function documentation to clarify behavior

  #9: Uncached IMDS credential fetch (aws.go:104)
  - Added documentation explaining the trade-off (refresh interval is typically 60 minutes, so caching is not critical)

  #10: Auth proof format undocumented (aws.go:98)
  - Added detailed comment documenting the auth proof format: <base64-body>|<base64-headers>|<method>|<base64-url>

  LOW Priority Issues Fixed:

  #11: Unnecessarily exported types (aws.go)
  - Changed SigningData to signingData (unexported)
  - Changed AWSAuth.AwsRegion to AWSAuth.region (unexported)
  - Updated all references in aws.go and aws_test.go

  #12: Tests exercise copy of goroutine (impl/delegatedauth_test.go:19)
  - Added documentation explaining why tests use a simplified goroutine pattern
  - Clarified that integration tests cover the actual startBackgroundRefresh function

  #13: Subsequent Config param silently ignored (def/delegatedauth.go:24)
  - Updated documentation to clearly state that only the first Config is used
  - Added warning log when a different Config is passed on subsequent calls
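
The locking change in items #1, #6, and #7 follows a common pattern: do the slow call with no lock held, then take the write lock only long enough to store the result. The sketch below illustrates that pattern under assumed names (`keyCache`, `refresh`, `get`); it is not the code in `impl/delegatedauth.go`, and it omits the context plumbing and error handling the real component has.

```go
package main

import "sync"

// keyCache is a hypothetical holder for a periodically refreshed API key.
type keyCache struct {
	mu     sync.RWMutex
	apiKey string
}

// refresh calls authenticate with no lock held, then takes the write lock
// only for the brief state update.
func (c *keyCache) refresh(authenticate func() (string, error)) (string, error) {
	key, err := authenticate() // network I/O happens outside any lock
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.apiKey = key
	c.mu.Unlock()
	return key, nil
}

// get returns the most recently stored key.
func (c *keyCache) get() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.apiKey
}
```

Readers calling `get` are blocked only for the duration of the assignment, not for the duration of the network round trip.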
gh-worker-dd-mergequeue-cf854d Bot pushed a commit that referenced this pull request Feb 24, 2026
### What does this PR do?
Adds on-disk persistence for Health Platform issues so they survive agent restarts. The component now writes a JSON state file under `<run_path>/health-platform/issues.json`, restores issues on startup, and tracks per-check issue lifecycle state (`new`, `ongoing`, `resolved`) with timestamps.

### Motivation
Health issues were previously stored only in memory, so restarting the agent cleared the current health view and made troubleshooting harder. Persisting issues improves continuity for diagnostics, local endpoint visibility, and support workflows.

### Describe how you validated your changes

CI + manual QA:
```
cat /opt/datadog-agent/run/health-platform/issues.json
{
  "updated_at": "2026-02-12T13:31:47Z",
  "issues": {
    "docker-socket-permissions": {
      "issue_id": "docker-file-tailing-disabled",
      "state": "new",
      "first_seen": "2026-02-12T13:31:47Z",
      "last_seen": "2026-02-12T13:31:47Z"
    }
  }
}
```

```
cat /opt/datadog-agent/run/health-platform/issues.json
{
  "updated_at": "2026-02-12T14:26:51Z",
  "issues": {
    "docker": {
      "issue_id": "check-execution-failure",
      "state": "ongoing",
      "first_seen": "2026-02-12T13:35:25Z",
      "last_seen": "2026-02-12T14:26:51Z"
    },
    "docker-socket-permissions": {
      "issue_id": "docker-file-tailing-disabled",
      "state": "resolved",
      "first_seen": "2026-02-12T13:31:47Z",
      "last_seen": "2026-02-12T13:31:47Z",
      "resolved_at": "2026-02-12T14:26:33Z"
    },
    "logs-docker-file-permissions": {
      "issue_id": "docker-file-tailing-disabled",
      "state": "ongoing",
      "first_seen": "2026-02-12T13:35:08Z",
      "last_seen": "2026-02-12T13:35:08Z"
    }
  }
}
```

### Additional Notes
Writes are atomic (temp file + rename). Persistence is updated on issue updates and clears. Issues restored from disk are rebuilt from the registry; resolved issues are not rehydrated into the active issues map.
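
For readers unfamiliar with the temp file + rename approach, here is a minimal sketch. The Go type and function names (`persistedState`, `persistedIssue`, `saveState`, `loadState`) are hypothetical; only the JSON field names are taken from the example output above.

```go
package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// persistedIssue and persistedState mirror the JSON shown in the validation
// output; the Go names are assumptions.
type persistedIssue struct {
	IssueID    string `json:"issue_id"`
	State      string `json:"state"`
	FirstSeen  string `json:"first_seen"`
	LastSeen   string `json:"last_seen"`
	ResolvedAt string `json:"resolved_at,omitempty"`
}

type persistedState struct {
	UpdatedAt string                    `json:"updated_at"`
	Issues    map[string]persistedIssue `json:"issues"`
}

// saveState writes to a temp file in the target directory and renames it
// into place, so readers see either the old file or the new one in full.
func saveState(path string, st persistedState) error {
	data, err := json.MarshalIndent(st, "", "  ")
	if err != nil {
		return err
	}
	dir := filepath.Dir(path)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	tmp, err := os.CreateTemp(dir, filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // best-effort cleanup; a no-op after a successful rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

// loadState reads the state file back, as would happen on agent startup.
func loadState(path string) (persistedState, error) {
	var st persistedState
	data, err := os.ReadFile(path)
	if err != nil {
		return st, err
	}
	return st, json.Unmarshal(data, &st)
}
```

Creating the temp file in the same directory as the target keeps the rename on one filesystem, which is what makes the replacement atomic on POSIX systems.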

#### Workflow example:

Agent start #1:

- Detect issue 1 → {issue_id: "issue-1", state: "new", first_seen: T1}
- Detect issue 2 → {issue_id: "issue-2", state: "new", first_seen: T1}
- Later, detect issue 3 → {issue_id: "issue-3", state: "new", first_seen: T2}
- Issue 1 detected again → {state: "ongoing", last_seen: T3}
- Issue 2 resolved → {state: "resolved", resolved_at: T4}

Agent stops

Agent start #2:

- Load file, restore active issues (issue 1, issue 3)
- Check runs, issue 1 is now resolved → {state: "resolved", resolved_at: T5}
- Check runs, issue 3 still present → {state: "ongoing", last_seen: T5}
- Final state: issue 1 resolved, issue 2 resolved, issue 3 ongoing ✓
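
The workflow above can be replayed against a small state machine. The sketch below is an illustration only: the `issueTracker` type and its methods are hypothetical, timestamps are plain integers standing in for T1 to T5, and treating a reappearing resolved issue as `new` is an assumption the description does not confirm.

```go
package main

import "sort"

type issueState string

const (
	stateNew      issueState = "new"
	stateOngoing  issueState = "ongoing"
	stateResolved issueState = "resolved"
)

// issueRecord is a hypothetical in-memory form of one persisted issue.
type issueRecord struct {
	State      issueState
	FirstSeen  int64
	LastSeen   int64
	ResolvedAt int64
}

type issueTracker struct {
	issues map[string]*issueRecord
}

func newIssueTracker() *issueTracker {
	return &issueTracker{issues: make(map[string]*issueRecord)}
}

// observe records a detection: first sighting is "new", later ones "ongoing".
func (t *issueTracker) observe(id string, now int64) {
	rec, ok := t.issues[id]
	if !ok || rec.State == stateResolved {
		// Assumption: a resolved issue that reappears starts over as new.
		t.issues[id] = &issueRecord{State: stateNew, FirstSeen: now, LastSeen: now}
		return
	}
	rec.State = stateOngoing
	rec.LastSeen = now
}

// resolve marks an active issue as resolved and stamps resolved_at.
func (t *issueTracker) resolve(id string, now int64) {
	if rec, ok := t.issues[id]; ok && rec.State != stateResolved {
		rec.State = stateResolved
		rec.ResolvedAt = now
	}
}

// active lists the issues that would be restored after a restart.
func (t *issueTracker) active() []string {
	var ids []string
	for id, rec := range t.issues {
		if rec.State != stateResolved {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}
```

Replaying the two agent runs with this tracker ends with issue 1 and issue 2 resolved and issue 3 ongoing, matching the final state listed above.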

Co-authored-by: louis.coquerelle <louis.coquerelle@datadoghq.com>
StephenWakely added a commit that referenced this pull request Mar 10, 2026
- Ran benchmarks across comp/dogstatsd/server, pkg/aggregator, and
  comp/forwarder/defaultforwarder; results saved to plans/bench-baseline-forwarder.txt
- Generated pprof profiles (mem.out, cpu.out) from aggregator flush benchmarks
- Created scripts/profile_pipeline.sh documenting exact reproduction commands
- Documented top 10 allocation sites and CPU hotspots in plans/profiling-baseline.md
- Added benchmark test files for aggregator (time_sampler, context_resolver),
  forwarder, and dogstatsd/server that will be used in subsequent stories
- Typecheck passes (go build ./comp/... ./pkg/aggregator/... ./comp/forwarder/...)

Key findings:
  - pkg/metrics.(*Gauge).flush is #1 allocator (26.97% of objects) — target US-004
  - contextResolver.trackContext is #4 allocator and 26.58% cumulative CPU — target US-003
  - GC overhead accounts for ~22% of CPU — directly reducible via alloc reduction
  - Forwarder: 15 allocs/op per transaction creation — target US-007

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gh-worker-dd-mergequeue-cf854d Bot pushed a commit that referenced this pull request Mar 20, 2026
### What does this PR do?
Adds an explicit `--platform=linux/amd64` run argument when the devcontainer image is an amd64 image, so Docker runs it with the correct architecture.

### Motivation
On hosts where Docker may default to a different platform (e.g., Apple Silicon), amd64 devcontainer images can be pulled/run with the wrong architecture/emulation unless the platform is specified.

### Describe how you validated your changes
- Ran the devcontainer setup task and verified the generated `runArgs` includes `--platform=linux/amd64` when using an amd64 devcontainer image.
```
28976e101240:/workspaces/datadog-agent-primary$ uname -a
Linux 28976e101240 6.12.72-linuxkit #1 SMP Mon Feb 16 11:19:07 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
```

### Additional Notes
- Change is limited to `tasks/devcontainer.py` and only applies when `"amd64"` is present in the image name.

Co-authored-by: wassim.dhif <wassim.dhif@datadoghq.com>
YoannGh added a commit that referenced this pull request Mar 31, 2026
…parity

Three fixes that bring optimized output from 18/52 to 49/52 matches
(0 mismatches, 3 skipped VLAN):

1. vstore: remove erroneous IsConst check. The C vstore NOPs a
   statement when the new value equals the existing value, regardless
   of whether the value is a constant. Our version had an extra
   os.IsConst(newval) condition that prevented NOP-ing redundant
   packet loads (e.g., duplicate ldh [12] across blocks).

2. FindInedges: only add edges from reachable blocks. After jump
   threading, unreachable blocks still have stale JT/JF pointers
   to reachable blocks. These phantom predecessors cause value
   merging to produce VAL_UNKNOWN, preventing redundant load
   elimination. Fix: iterate levels[] lists (populated by
   FindLevels with reachable blocks only) instead of the full
   blocks[] array.

3. GenMulticast: use JSET directly for Ethernet multicast check,
   matching C libpcap's gen_mac_multicast() which generates
   "ldb [0]; jset #1" instead of "ldb [0]; and #1; jeq #1".

Golden test results (optimized, vs C filtertest default):
- 49/52 exact instruction-for-instruction matches
- 0 mismatches
- 3 skipped (VLAN not implemented)
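
To make fix 2 concrete, the sketch below computes predecessor lists while ignoring unreachable blocks. It is a simplified illustration, not the port's `FindInedges`: the `block` type is hypothetical, and reachability is derived with a plain traversal from the root rather than the `levels[]` lists the commit describes.

```go
package main

// block is a hypothetical, simplified CFG node with true/false jump targets.
type block struct {
	id int
	jt *block
	jf *block
}

// reachableBlocks returns the blocks reachable from root in visit order.
func reachableBlocks(root *block) []*block {
	var order []*block
	seen := make(map[*block]bool)
	var visit func(b *block)
	visit = func(b *block) {
		if b == nil || seen[b] {
			return
		}
		seen[b] = true
		order = append(order, b)
		visit(b.jt)
		visit(b.jf)
	}
	visit(root)
	return order
}

// inEdges maps each block to its predecessors, counting only predecessors
// that are themselves reachable. Stale jt/jf pointers held by unreachable
// blocks therefore never show up as phantom predecessors.
func inEdges(root *block) map[*block][]*block {
	preds := make(map[*block][]*block)
	for _, b := range reachableBlocks(root) {
		if b.jt != nil {
			preds[b.jt] = append(preds[b.jt], b)
		}
		if b.jf != nil && b.jf != b.jt {
			preds[b.jf] = append(preds[b.jf], b)
		}
	}
	return preds
}
```

A block left behind by jump threading can still point at live blocks, but because it is never visited it contributes no in-edges.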
YoannGh added a commit that referenced this pull request Apr 1, 2026
jose-manuel-almaza added a commit that referenced this pull request Apr 1, 2026
…w options

- Move partitionEnumInFlight.Store(false) before the channel send so
  sequential calls within the same check run never hit the guard. The
  previous receiver-side Store introduced a race where call #1's defer
  could clear the flag while call #2 was active.
- Disable tag_by_physical_storage and collect_physical_metrics at config
  time on Windows, where gopsutil ignores the all parameter and both
  syscalls return identical results.
- Add new options to conf.yaml.default with platform support notes.
- Add Windows gate test.
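
The ordering issue in the first bullet can be illustrated with a small guard built on `atomic.Bool`. This is a sketch under assumptions: the `partitionEnumerator` type, the `enumerate` method, and the timeout constant are hypothetical, and the real check's structure may differ.

```go
package main

import (
	"sync/atomic"
	"time"
)

// defaultEnumTimeout is an arbitrary value chosen for this sketch.
const defaultEnumTimeout = 2 * time.Second

// partitionEnumerator is a hypothetical wrapper around a slow enumeration call.
type partitionEnumerator struct {
	inFlight atomic.Bool
}

// enumerate runs work in a goroutine, bounded by timeout. It returns false
// if another enumeration is still in flight or if the timeout expires.
func (e *partitionEnumerator) enumerate(timeout time.Duration, work func() []string) ([]string, bool) {
	if !e.inFlight.CompareAndSwap(false, true) {
		return nil, false // guard: a previous call has not finished yet
	}
	ch := make(chan []string, 1) // buffered so the worker never blocks after a timeout
	go func() {
		res := work()
		// Clear the flag before the send: once the caller receives the
		// result, the guard is guaranteed to be open for the next call.
		e.inFlight.Store(false)
		ch <- res
	}()
	select {
	case res := <-ch:
		return res, true
	case <-time.After(timeout):
		return nil, false // flag stays set until the worker finishes
	}
}
```

Because the store is sequenced before the channel send, a caller that has received a result can immediately call `enumerate` again without tripping the guard, and no receiver-side cleanup can clear the flag on behalf of a later call.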
jose-manuel-almaza added a commit that referenced this pull request Apr 1, 2026
YoannGh added a commit that referenced this pull request Apr 2, 2026
jose-manuel-almaza added a commit that referenced this pull request Apr 6, 2026
misteriaud added a commit that referenced this pull request Apr 16, 2026
…ctions

Replace the monolithic batcher (5 ring buffers sharing one transport)
with a generic pipeline[T] struct. Each pipeline owns its own ring
buffers, flush goroutines, and dedicated UDS connection:

  metricsPipeline = pipeline[metricPoint]        + unixConn #1
  logsPipeline    = pipeline[logEntry]            + unixConn #2
  tracePipeline   = pipeline[capturedTraceStat]   + unixConn #3

Pipelines are fully independent — one slow pipeline (e.g. logs sending
large frames) cannot block or starve another.

Key changes:
- pipeline[T]: generic struct with AddEntry(T), AddContextDef, Stop.
  1-2 flush goroutines per pipeline (entries + optional contexts).
  flushChunked reused unchanged.
- unixConn: simple per-connection transport replacing pooledTransport.
  Lazy dial, mutex held during Send, reconnect once on error.
- activate(): creates 3 pipelines with 3 independent connections.
  sync.Once coordinates teardown when any transport disconnects.
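
A rough sketch of what a generic per-signal pipeline can look like. It deliberately simplifies the design described above: a buffered channel stands in for the ring buffers, there is a single flush goroutine, batches are flushed only when full or on `Stop`, and the `send` callback is a hypothetical placeholder for the per-pipeline connection.

```go
package main

import "sync"

// pipeline is a simplified, hypothetical take on a generic pipeline[T].
type pipeline[T any] struct {
	entries   chan T
	send      func(batch []T) error // stand-in for the dedicated transport
	batchSize int
	wg        sync.WaitGroup
}

func newPipeline[T any](capacity, batchSize int, send func([]T) error) *pipeline[T] {
	p := &pipeline[T]{
		entries:   make(chan T, capacity),
		send:      send,
		batchSize: batchSize,
	}
	p.wg.Add(1)
	go p.flushLoop()
	return p
}

// AddEntry enqueues without blocking and reports false if the buffer is
// full, in which case the entry is dropped.
func (p *pipeline[T]) AddEntry(e T) bool {
	select {
	case p.entries <- e:
		return true
	default:
		return false
	}
}

// flushLoop batches entries and hands each full batch to send.
func (p *pipeline[T]) flushLoop() {
	defer p.wg.Done()
	batch := make([]T, 0, p.batchSize)
	for e := range p.entries {
		batch = append(batch, e)
		if len(batch) == p.batchSize {
			_ = p.send(batch) // errors ignored in this sketch
			batch = make([]T, 0, p.batchSize)
		}
	}
	if len(batch) > 0 {
		_ = p.send(batch)
	}
}

// Stop closes the queue and waits for the remaining entries to be flushed.
// AddEntry must not be called after Stop.
func (p *pipeline[T]) Stop() {
	close(p.entries)
	p.wg.Wait()
}
```

Since each pipeline value owns its queue, goroutine, and `send` function, a slow `send` in one instance has no way to stall entries queued in another.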

Testbench results (all 0 drops):
- dogstatsd-p99: 2.8M metrics sent
- logs-high-throughput: 40M logs at 10 MiB/s, 3 GB Parquet
- metrics-logs-combined: 743K metrics + 1M logs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@chenww chenww mentioned this pull request Apr 24, 2026
ellataira added a commit that referenced this pull request Apr 28, 2026
…ll detectors

PR 49939 had 30 iters, all detector-targeted (bocpd/scanmw/scanwelch
tweaks). The proposer's diversity guideline ("at least 3 distinct
families") doesn't enforce CATEGORY diversity — three families of
detector tweaks still satisfies "diverse families" while staying in
the same structural surface.

Adds a runtime check: if every candidate in the last 10 had only
detector-style target_components (no name containing 'correlator' or
'extractor'), inject a STRUCTURAL DIVERSITY REQUIRED clause demanding
at least one correlator candidate this round. Auto-disables once
correlator candidates appear in recent history.
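
The runtime check described above amounts to a window scan over recent candidates. The sketch below is an assumption-laden illustration: the `candidate` type and function names are hypothetical, the match is a case-sensitive substring test, and returning false when fewer than `window` candidates exist is a choice the commit message does not specify.

```go
package main

import "strings"

// candidate is a hypothetical record of one proposed change.
type candidate struct {
	TargetComponents []string
}

// isStructural reports whether a component name refers to a correlator or
// an extractor rather than a detector.
func isStructural(name string) bool {
	return strings.Contains(name, "correlator") || strings.Contains(name, "extractor")
}

// needsStructuralDiversity reports whether every one of the last window
// candidates targeted only detector-style components.
func needsStructuralDiversity(history []candidate, window int) bool {
	if len(history) < window {
		return false // assumption: not enough history to judge
	}
	for _, c := range history[len(history)-window:] {
		for _, name := range c.TargetComponents {
			if isStructural(name) {
				return false
			}
		}
	}
	return true
}
```

The check switches itself off as soon as one correlator or extractor candidate falls inside the window, which matches the auto-disable behavior described above.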

Cooperates with operator steering (#1+#2) — explicit user directives
still take precedence; this is the autonomous backup when the operator
isn't watching.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants