Scheduler [first draft] by masci · Pull Request #19 · DataDog/datadog-agent

masci · 2016-09-05T14:15:00Z

Please refer to the README for the overall design; what follows are some notes on the changes.

TL;DR

The Scheduler is responsible for sending a set of Checks, at certain intervals, to the execution pipeline consumed by the Runner. We keep a set of timers, one for each interval needed, and we append every Check to the relevant timer.
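
As a rough illustration of the "one timer per interval" idea, the Go sketch below groups checks by interval and runs one ticker per group. All names here (`Check`, `namedCheck`, `groupByInterval`, `runGroup`, `sampleChecks`) are hypothetical and not taken from this PR; the actual Scheduler may be organized differently.

```go
package main

import (
	"fmt"
	"time"
)

// Check is a hypothetical, minimal view of a check: a name and an interval.
type Check interface {
	Name() string
	Interval() time.Duration
}

type namedCheck struct {
	name     string
	interval time.Duration
}

func (c namedCheck) Name() string            { return c.name }
func (c namedCheck) Interval() time.Duration { return c.interval }

// groupByInterval buckets checks so that each distinct interval needs only one timer.
func groupByInterval(checks []Check) map[time.Duration][]Check {
	groups := make(map[time.Duration][]Check)
	for _, c := range checks {
		groups[c.Interval()] = append(groups[c.Interval()], c)
	}
	return groups
}

// runGroup sends every check of one group to the pipeline on each tick,
// and returns as soon as stop is closed.
func runGroup(interval time.Duration, checks []Check, pipeline chan<- Check, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, c := range checks {
				select {
				case pipeline <- c:
				case <-stop:
					return
				}
			}
		case <-stop:
			return
		}
	}
}

// sampleChecks returns three checks that share two distinct intervals.
func sampleChecks() []Check {
	return []Check{
		namedCheck{"cpu", 10 * time.Millisecond},
		namedCheck{"memory", 10 * time.Millisecond},
		namedCheck{"disk", 25 * time.Millisecond},
	}
}

func main() {
	pipeline := make(chan Check)
	stop := make(chan struct{})
	for interval, checks := range groupByInterval(sampleChecks()) {
		go runGroup(interval, checks, pipeline, stop)
	}
	// Stand-in for the Runner: consume a few scheduled checks, then stop.
	for i := 0; i < 5; i++ {
		c := <-pipeline
		fmt.Println("running", c.Name())
	}
	close(stop)
}
```

Three checks end up on two timers here, which is the point of grouping: the number of timers grows with the number of distinct intervals, not with the number of checks.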

API

The Scheduler API exposes plain functions but uses channels under the hood to keep things in sync. This should let callers write cleaner code: for example, scheduler.Start() instead of scheduler.start <- true or similar constructs that assume the caller knows the internals.
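
A minimal sketch of that pattern, assuming a single goroutine owns the state and the exported methods only talk to it over channels. The `Scheduler` fields and the `IsRunning` method are hypothetical names chosen for this example, not the PR's actual API.

```go
package main

import "fmt"

// Scheduler is a hypothetical sketch: callers use methods, never the channels.
type Scheduler struct {
	start   chan struct{}
	stop    chan struct{}
	stopped chan struct{}
	running chan chan bool
}

// NewScheduler creates the channels and starts the goroutine that owns the state.
func NewScheduler() *Scheduler {
	s := &Scheduler{
		start:   make(chan struct{}),
		stop:    make(chan struct{}),
		stopped: make(chan struct{}),
		running: make(chan chan bool),
	}
	go s.loop()
	return s
}

// loop is the only place where the running flag is read or written,
// so no mutex is needed.
func (s *Scheduler) loop() {
	running := false
	for {
		select {
		case <-s.start:
			running = true
		case reply := <-s.running:
			reply <- running
		case <-s.stop:
			close(s.stopped)
			return
		}
	}
}

// Start hides the channel send from the caller.
func (s *Scheduler) Start() { s.start <- struct{}{} }

// IsRunning asks the loop goroutine for its current state.
func (s *Scheduler) IsRunning() bool {
	reply := make(chan bool)
	s.running <- reply
	return <-reply
}

// Stop terminates the loop and waits for it to exit.
// In this sketch the Scheduler must not be used after Stop.
func (s *Scheduler) Stop() {
	s.stop <- struct{}{}
	<-s.stopped
}

func main() {
	s := NewScheduler()
	fmt.Println(s.IsRunning()) // prints false
	s.Start()
	fmt.Println(s.IsRunning()) // prints true
	s.Stop()
}
```

Keeping the channels unexported means the synchronization strategy can change later without touching any caller.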

Test

Some of the tests rely on the asynchronous design of the components, so I had to provide a helper function that polls (instead of sleeping for a fixed time) until a certain condition is met (e.g. the scheduler has fully started). These kinds of tests can be very flaky, so we might need to adjust timeouts at some point.
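
The helper itself is not shown in this conversation, so the following is only a sketch of what such a polling function could look like. The names `waitFor`, `startAsync`, `isReady`, `pollTick` and `pollTimeout` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

const (
	pollTick    = 5 * time.Millisecond // how often the condition is re-checked
	pollTimeout = 2 * time.Second      // upper bound before giving up
)

// waitFor re-evaluates cond every tick and returns true as soon as it holds,
// or false once timeout has elapsed.
func waitFor(cond func() bool, tick, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

// startAsync simulates a component that becomes ready some time after it is started.
func startAsync(delay time.Duration) *int32 {
	var ready int32
	go func() {
		time.Sleep(delay)
		atomic.StoreInt32(&ready, 1)
	}()
	return &ready
}

// isReady reads the flag set by startAsync.
func isReady(flag *int32) bool { return atomic.LoadInt32(flag) == 1 }

func main() {
	flag := startAsync(20 * time.Millisecond)
	ok := waitFor(func() bool { return isReady(flag) }, pollTick, pollTimeout)
	fmt.Println("component ready:", ok)
}
```

Compared with a single fixed sleep, the test returns as soon as the condition holds, and the timeout only matters on slow or failing runs, which is where it may need tuning.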

remh · 2016-09-08T21:12:12Z


+// DefaultCheckInterval is the interval in seconds the scheduler should apply
+// when no value was provided in Check configuration.
+const DefaultCheckInterval = 20


should be 15

remh · 2016-09-08T21:18:13Z

That's a really good first draft. It seems like interval customization at the configuration level is not implemented, though?

masci · 2016-09-09T12:19:58Z

Customizing the interval is the responsibility of the Configure method: the configuration payload may or may not contain that information; in the latter case, a default value (hardcoded for now, but it could be provided by datadog.conf) will be used. If an interval was passed, it's up to the Check to store it so it can be returned by the Interval method. As you noticed, this strategy is not implemented yet in this PR; I will update it with a working example for the Python check.
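
The strategy described above could be sketched as follows. This is an illustration only: `PingCheck`, the `map[string]int` payload and its `"interval"` key (in seconds) are hypothetical, and the default of 15 seconds simply follows the value suggested in the review.

```go
package main

import (
	"fmt"
	"time"
)

// DefaultCheckInterval is used when the configuration carries no interval.
// The value is an assumption for this sketch.
const DefaultCheckInterval = 15 * time.Second

// PingCheck is a hypothetical check that remembers its configured interval.
type PingCheck struct {
	interval time.Duration // zero means "not configured"
}

// Configure stores the interval if the payload provides one.
// The payload shape and the "interval" key are assumptions.
func (c *PingCheck) Configure(payload map[string]int) {
	if seconds, ok := payload["interval"]; ok && seconds > 0 {
		c.interval = time.Duration(seconds) * time.Second
	}
}

// Interval returns the configured value, falling back to the default.
func (c *PingCheck) Interval() time.Duration {
	if c.interval > 0 {
		return c.interval
	}
	return DefaultCheckInterval
}

func main() {
	var unconfigured PingCheck
	unconfigured.Configure(map[string]int{})
	fmt.Println(unconfigured.Interval()) // prints 15s

	var configured PingCheck
	configured.Configure(map[string]int{"interval": 30})
	fmt.Println(configured.Interval()) // prints 30s
}
```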

This method should be called GetDefaultInterval, I believe?

If the Check receives the desired interval during Configure, this method returns the configured value, falling back to the default otherwise.

Can we avoid implementing this method if we just want to use the default?

That's the problem with this approach: checks that don't expect an interval from configuration still need to implement this method, like:

func (c *SomeCheck) Interval() int {
  return check.DefaultCheckInterval
}
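
For what it's worth, one way Go can reduce this kind of boilerplate is struct embedding. This is not something proposed in the PR; it is only a sketch, and the `DefaultInterval` helper type and `CustomCheck` are hypothetical. For brevity everything lives in one package and the interval is an `int`, as in the snippet above.

```go
package main

import "fmt"

// DefaultCheckInterval is the fallback interval in seconds (value assumed for this sketch).
const DefaultCheckInterval = 15

// DefaultInterval is a hypothetical helper type whose only job is to
// provide the default Interval implementation.
type DefaultInterval struct{}

// Interval returns the default interval.
func (DefaultInterval) Interval() int { return DefaultCheckInterval }

// SomeCheck embeds DefaultInterval, so Interval() is promoted to it
// and does not have to be written again.
type SomeCheck struct {
	DefaultInterval
}

// CustomCheck also embeds the helper but overrides the promoted method.
type CustomCheck struct {
	DefaultInterval
	seconds int
}

// Interval shadows DefaultInterval.Interval for CustomCheck.
func (c *CustomCheck) Interval() int { return c.seconds }

func main() {
	fmt.Println((&SomeCheck{}).Interval())              // prints 15
	fmt.Println((&CustomCheck{seconds: 60}).Interval()) // prints 60
}
```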

remh

Just a tiny nitpick; otherwise feel free to merge.

remh · 2016-10-05T16:06:48Z

+
+var log = logging.MustGetLogger("datadog-agent")
+
+const defaultTimeout = 5000


Let's define it as a time.Duration from the get-go.
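
To make the suggestion concrete, the sketch below contrasts an untyped integer constant with a typed time.Duration one. It assumes the original 5000 was meant as milliseconds, which the diff does not state; `defaultTimeoutMillis` is a name introduced here only for comparison.

```go
package main

import (
	"fmt"
	"time"
)

// Untyped integer: the unit exists only in the reader's head.
const defaultTimeoutMillis = 5000

// Typed duration: the unit is part of the value.
// Assumes the 5000 above was expressed in milliseconds.
const defaultTimeout = 5 * time.Second

func main() {
	fmt.Println(defaultTimeout) // prints 5s

	// With the integer form, every call site has to remember the conversion.
	converted := time.Duration(defaultTimeoutMillis) * time.Millisecond
	fmt.Println(converted == defaultTimeout) // prints true
}
```

With the typed constant, APIs such as time.After or context.WithTimeout can take the value directly, and a mismatch of units becomes much harder to introduce.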






masci added the component/collector label Sep 5, 2016

masci force-pushed the massi/scheduler branch 6 times, most recently from 8725e56 to e27f2a0 Compare September 7, 2016 09:45

first scheduler draft

18139e8

masci force-pushed the massi/scheduler branch from e27f2a0 to 18139e8 Compare September 7, 2016 10:04

remh reviewed Sep 8, 2016

use time.Duration instead of int

d3be890

Massimiliano Pippi added 2 commits September 9, 2016 14:24

take changes from master

6fe271e

use time.Duration instead of int, implement python check configuration

82542e8

remh reviewed Oct 5, 2016


use time.Duration on constants too

4365032

masci merged commit bba7dac into master Oct 6, 2016

masci deleted the massi/scheduler branch October 6, 2016 21:14

masci added a commit that referenced this pull request Apr 1, 2019

Merge pull request #19 from DataDog/massi/cleanup

6bd9e7a

Removed unneeded code

hush-hush pushed a commit that referenced this pull request Apr 17, 2019

Merge pull request #19 from DataDog/massi/cleanup

85406af

Removed unneeded code

pgimalac pushed a commit that referenced this pull request Apr 5, 2023

Merge pull request #19 from DataDog/olivielpeau/version

914d2e8

Add version to json output, and cli option

pgimalac pushed a commit that referenced this pull request Apr 24, 2023

Merge pull request #19 from DataDog/olivielpeau/version

a5d6a4f

Add version to json output, and cli option

songy23 mentioned this pull request May 21, 2025

[otel] BYOC: dockerfile updates + invoke task fixes #37187

Merged

s-alad pushed a commit that referenced this pull request Nov 21, 2025

Update import paths to point to the new repo (#19)

259ae4b

This change ensures that Go dependencies are using the new org as the canonical source.

chouetz mentioned this pull request Apr 27, 2026

chore: switch from hectane/go-acl to DataDog/go-acl fork #49932

Open

5 tasks

masci merged 5 commits into master from massi/scheduler
