// DefaultCheckInterval is the interval in seconds the scheduler should apply
// when no value was provided in Check configuration.
const DefaultCheckInterval = 20
That's a really good first draft. It seems like interval customization at the configuration level is not implemented though?
Customizing the interval is responsibility of the
If the Check receives the desired interval during Configure, this would return the configured value, falling back to the default.
That's the problem with this approach: checks that don't expect an interval from configuration still need to implement this method, like:
remh left a comment
Just a tiny nitpick; otherwise feel free to merge.
var log = logging.MustGetLogger("datadog-agent")
const defaultTimeout = 5000
Let's define it as a `time.Duration` from the get-go.
Add version to json output, and cli option
This change ensures that Go dependencies are using the new org as the canonical source.
### What does this PR do?
This PR fixes the handling of Subject Alternative Names (SANs) in certificates generated for the Datadog Cluster Agent when using a cluster trust chain to also include K8s DNS records:
- `serviceName.namespace`: short form, namespace-qualified name
- `serviceName.namespace.svc`: includes the subdomain `svc`
- `serviceName.namespace.svc.cluster.local`: complete fully qualified domain name (FQDN)
### Motivation
Enables proper TLS certificate validation when agents connect to DCA using the added DNS names. This is particularly relevant for securing Agent communication on EKS Fargate (serverless) where the sidecar Agent is running in another k8s namespace and queries for the Cluster Agent using the FQDN:
https://github.com/DataDog/datadog-agent/blob/e7f11ebec9b328b2f418adadd8026086db79727c/pkg/clusteragent/admission/mutate/agent_sidecar/agent_sidecar.go#L420-L428
[CONTP-1152]
### Describe how you validated your changes
Deploy the Agent and Cluster Agent using the new CA Cert.
First generate a certificate:
```
openssl req -x509 -new -nodes -days 3650 \
  -newkey ec:<(openssl ecparam -name prime256v1) \
  -keyout tls.key \
  -out tls.crt \
  -subj "/" \
  -addext "basicConstraints=critical,CA:true,pathlen:0" \
  -addext "keyUsage=critical,keyCertSign"
```
Then create a k8s secret from the certificate:
```
kubectl create secret tls my-dd-tls \
  --cert=tls.crt \
  --key=tls.key \
  -n datadog-agent
```
Deploy the Agent with the secret mounted and envvars configured to use the cert:
```
datadog:
  kubelet:
    tlsVerify: false
  clusterName: gabedos-dev
  envDict:
    DD_CLUSTER_TRUST_CHAIN_ENABLE_TLS_VERIFICATION: "true"
    DD_CLUSTER_TRUST_CHAIN_CA_CERT_FILE_PATH: "/etc/datadog-agent/certificates/tls.crt"
    DD_CLUSTER_TRUST_CHAIN_CA_KEY_FILE_PATH: "/etc/datadog-agent/certificates/tls.key"
  volumeMounts:
    - name: tls
      mountPath: /etc/datadog-agent/certificates
      readOnly: true
  volumes:
    - name: tls
      secret:
        secretName: my-dd-tls
clusterAgent:
  envDict:
    DD_CLUSTER_TRUST_CHAIN_ENABLE_TLS_VERIFICATION: "true"
    DD_CLUSTER_TRUST_CHAIN_CA_CERT_FILE_PATH: "/etc/datadog-agent/certificates/tls.crt"
    DD_CLUSTER_TRUST_CHAIN_CA_KEY_FILE_PATH: "/etc/datadog-agent/certificates/tls.key"
  volumeMounts:
    - name: tls
      mountPath: /etc/datadog-agent/certificates
      readOnly: true
  volumes:
    - name: tls
      secret:
        secretName: my-dd-tls
```
Connect to the node agent and try querying using the FQDN:
```
curl --cacert /etc/datadog-agent/certificates/tls.crt https://datadog-agent-linux-cluster-agent.datadog-agent.svc.cluster.local:5005/version -H "Authorization: Bearer $DD_CLUSTER_AGENT_AUTH_TOKEN"
```
### Additional Notes
For whatever reason, the `trace-agent` still has the `kubeapiserver` build tag. When the IPC component uses the namespace util in `pkg/util/kubernetes/apiserver/common`, it ends up pulling in 200+ additional dependencies. Thus, this PR also moves the util method into its own package such that unnecessary dependencies also aren't pulled in.
However, when we start using the namespace util, it complicates things with the OTEL build, as seen in the following CI job logs.
`github.com/DataDog/datadog-agent/pkg/util/kubernetes/apiserver/common/namespace: module github.com/DataDog/datadog-agent@latest found (v0.0.0-20260113211521-d91eb7342549), but does not contain package github.com/DataDog/datadog-agent/pkg/util/kubernetes/apiserver/common/namespace`
<details>
<summary> Full Job Logs when Extracting Namespace Util to Separate Package </summary>

```
#19 48.64 Downloaded to /tmp/tmp8jdve2k8/ocb_0.142.0_linux_amd64
#19 48.64 Running command: /tmp/tmp8jdve2k8/ocb_0.142.0_linux_amd64 --config ./comp/otelcol/collector-contrib/impl/manifest.yaml --skip-compilation
#19 48.64 Binary output:
#19 48.64
#19 48.64 Removing file: ./comp/otelcol/collector-contrib/impl/main_others.go
#19 48.64 Removing file: ./comp/otelcol/collector-contrib/impl/main.go
#19 48.64 Removing file: ./comp/otelcol/collector-contrib/impl/main_windows.go
#19 48.64 ./comp/otelcol/collector-contrib/impl/components.go
#19 87.69 go: downloading k8s.io/api v0.35.0-alpha.0
#19 87.71 go: downloading k8s.io/client-go v0.35.0-alpha.0
#19 88.84 go: downloading github.com/openshift/api v3.9.0+incompatible
#19 89.25 go: downloading github.com/ProtonMail/go-crypto v1.3.0
#19 89.86 go: downloading github.com/google/gopacket v1.1.19
#19 90.42 go: downloading k8s.io/component-base v0.35.0-alpha.0
#19 90.47 go: downloading google.golang.org/genproto v0.0.0-20240311173647-c811ad7063a7
#19 96.26 Encountered a bad command exit code!
#19 96.26
#19 96.26 Command: 'cd /workspace/datadog-agent/comp/core/configsync && go mod tidy '
#19 96.26
#19 96.26 Exit code: 1
#19 96.26
#19 96.26 Stdout:
#19 96.26
#19 96.26
#19 96.26
#19 96.26 Stderr:
#19 96.26
#19 96.26 go: finding module for package github.com/DataDog/datadog-agent/pkg/util/kubernetes/apiserver/common/namespace
#19 96.26 go: downloading github.com/DataDog/datadog-agent v0.0.0-20260113211521-d91eb7342549
#19 96.26 go: github.com/DataDog/datadog-agent/comp/core/configsync/configsyncimpl imports
#19 96.26 github.com/DataDog/datadog-agent/comp/core/ipc/mock imports
#19 96.26 github.com/DataDog/datadog-agent/pkg/api/security/cert imports
#19 96.26 github.com/DataDog/datadog-agent/pkg/util/kubernetes/apiserver/common/namespace: module github.com/DataDog/datadog-agent@latest found (v0.0.0-20260113211521-d91eb7342549), but does not contain package github.com/DataDog/datadog-agent/pkg/util/kubernetes/apiserver/common/namespace
#19 96.26
#19 96.26 Updated package name and ensured license header in: ./comp/otelcol/collector-contrib/impl/components.go
#19 96.26 Updated package name and ensured license header in: ./comp/otelcol/collector-contrib/impl/collectorcontrib.go
#19 ERROR: process "/bin/sh -c dda inv collector.generate" did not complete successfully: exit code: 1
------
 > [builder 11/14] RUN dda inv collector.generate:
96.26
96.26 go: finding module for package github.com/DataDog/datadog-agent/pkg/util/kubernetes/apiserver/common/namespace
96.26 go: downloading github.com/DataDog/datadog-agent v0.0.0-20260113211521-d91eb7342549
96.26 go: github.com/DataDog/datadog-agent/comp/core/configsync/configsyncimpl imports
96.26 github.com/DataDog/datadog-agent/comp/core/ipc/mock imports
96.26 github.com/DataDog/datadog-agent/pkg/api/security/cert imports
96.26 github.com/DataDog/datadog-agent/pkg/util/kubernetes/apiserver/common/namespace: module github.com/DataDog/datadog-agent@latest found (v0.0.0-20260113211521-d91eb7342549), but does not contain package github.com/DataDog/datadog-agent/pkg/util/kubernetes/apiserver/common/namespace
96.26
96.26 Updated package name and ensured license header in: ./comp/otelcol/collector-contrib/impl/components.go
96.26 Updated package name and ensured license header in: ./comp/otelcol/collector-contrib/impl/collectorcontrib.go
------
```

</details>
This then led me to go through the series of `dda inv` commands to generate go modules for the package, as well as updating all of the related OTEL packages.
[CONTP-1152]: https://datadoghq.atlassian.net/browse/CONTP-1152?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Co-authored-by: gabe.dossantos <gabe.dossantos@datadoghq.com>
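The three DNS name forms listed in this PR description can be derived mechanically from the service name and namespace. A minimal sketch, assuming a hypothetical helper `serviceDNSNames` (not the function used in the PR) and the `cluster.local` suffix quoted in the description:

```go
package main

import "fmt"

// serviceDNSNames is a hypothetical helper that returns the three
// Kubernetes DNS forms for a service, from shortest to fully qualified.
func serviceDNSNames(service, namespace string) []string {
	base := service + "." + namespace
	return []string{
		base,                         // serviceName.namespace
		base + ".svc",                // serviceName.namespace.svc
		base + ".svc.cluster.local", // serviceName.namespace.svc.cluster.local
	}
}

func main() {
	// Service and namespace taken from the curl command in the validation steps.
	for _, name := range serviceDNSNames("datadog-agent-linux-cluster-agent", "datadog-agent") {
		fmt.Println(name)
	}
}
```

A certificate whose SAN list includes all three names validates regardless of which form a client uses to reach the service.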
### What does this PR do?
Skip the SSH session patcher and add a test to illustrate the current issue.
In addition, this adds the ability to check specific fields in the JSON returned for ssh_session events.
### Motivation
The retry mechanism could cause the agent to send no more than one event per minute if an SSH session was not properly resolved.
Previously, the event was not sent and the agent would wait one minute before sending it with the `unknown` type. However, this `authtype` would never be resolved because the session was initialized before the agent started processing events. As a result, every subsequent SSH event would wait one minute for nothing, causing a significant delay in agent events, potentially blocking all the other events.
### Describe how you validated your changes
Added a test that illustrates the issue: `TestSSHUserSessionBlocking`
With this change, the ssh_session event is now sent immediately, with `authtype` set to `unknown`.
Error without commenting out the patcher:
```
Error: Received unexpected error:
All attempts fail:
#1: not found
#2: not found
#3: not found
#4: not found
#5: not found
#6: not found
#7: not found
#8: not found
#9: not found
#10: not found
#11: not found
#12: not found
#13: not found
#14: not found
#15: not found
#16: not found
#17: not found
#18: not found
#19: not found
#20: not found
#21: not found
#22: not found
#23: not found
#24: not found
#25: not found
#26: not found
#27: not found
#28: not found
#29: not found
#30: not found
Test: TestSSHUserSessionBlocking/second_ssh_no_auth
```
Co-authored-by: theo.putegnat <theo.putegnat@datadoghq.com>
Backport 40d1f09 from #45437.
___
### What does this PR do?
Skip the SSH session patcher and add a test to illustrate the current issue.
In addition, this adds the ability to check specific fields in the JSON returned for ssh_session events.
### Motivation
The retry mechanism could cause the agent to send no more than one event per minute if an SSH session was not properly resolved.
Previously, the event was not sent and the agent would wait one minute before sending it with the `unknown` type. However, this `authtype` would never be resolved because the session was initialized before the agent started processing events. As a result, every subsequent SSH event would wait one minute for nothing, causing a significant delay in agent events, potentially blocking all the other events.
### Describe how you validated your changes
Added a test that illustrates the issue: `TestSSHUserSessionBlocking`
With this change, the ssh_session event is now sent immediately, with `authtype` set to `unknown`.
Error without commenting out the patcher:
```
Error: Received unexpected error:
All attempts fail:
#1: not found
#2: not found
#3: not found
#4: not found
#5: not found
#6: not found
#7: not found
#8: not found
#9: not found
#10: not found
#11: not found
#12: not found
#13: not found
#14: not found
#15: not found
#16: not found
#17: not found
#18: not found
#19: not found
#20: not found
#21: not found
#22: not found
#23: not found
#24: not found
#25: not found
#26: not found
#27: not found
#28: not found
#29: not found
#30: not found
Test: TestSSHUserSessionBlocking/second_ssh_no_auth
```
Co-authored-by: axel.vonengel <axel.vonengel@datadoghq.com>
Replaces the direct dependency on github.com/hectane/go-acl (v0.0.0-20230225031251-cdfc9e3acf94, the head of upstream's unmerged PR #19) with github.com/DataDog/go-acl v1.0.0, a tagged release of a DataDog-owned fork that contains the same code (upstream master HEAD plus the golang.org/x/sys 0.1.0 bump from PR #19).
Why:
- Upstream hectane/go-acl is inactive, has no semver tags, and the commit we depended on lives on an unmerged PR branch. That is fragile ground for Renovate, which fell back to digest updates that produced malformed go.mod entries and time-regressing "updates" (see #49574).
- Owning a tagged fork lets Renovate resolve real semver versions and guarantees the source we depend on cannot vanish or be force-pushed.
Scope:
- Two Go imports rewritten (pkg/util/filesystem/permission_windows.go and pkg/security/probe/probe_auditing_windows_test.go).
- All affected go.mod/go.sum updated via dda inv tidy.
- Bazel manifest updated (deps/go.MODULE.bazel, pkg/util/filesystem/BUILD.bazel).
- LICENSE-3rdparty.csv regenerated.
The hectane/go-acl // indirect entries that remain come from old datadog-agent submodule versions pinned by opentelemetry-collector-contrib. They will disappear once OTel bumps its datadog-agent pin past this PR.
Please refer to the README for the overall design; what follows are random notes on the changes.
### TL;DR
The `Scheduler` is responsible for sending a set of `Check`s to the execution pipeline consumed by the `Runner` at certain intervals. We keep a set of timers, one for each interval needed, and we append every `Check` to the relevant timer.
### API
The `Scheduler` API exposes plain functions but uses channels under the hood to keep things in sync. This should allow cleaner code in the callers: for example, `scheduler.Start()` instead of `scheduler.start <- true` or similar code that assumes a caller knows the internals.
### Test
Some of the tests rely on the asynchronous design of the components, so I had to provide a helper function that polls (instead of sleeping) until a certain condition is met (e.g. the scheduler has fully started). These kinds of tests can be very flaky, so we might need to adjust timeouts at some point.